Trevor Smith bio photo

Trevor Smith

Yo.

Email Twitter Facebook LinkedIn Instagram Github

Post in progress

I have always wanted to take on a project involving drug usage data because I have always been fascinated by the how many over the counter (OTC) and prescription drugs are available in the U.S. The pharmaceutical industry is massive, and isn’t poised to slow down any time soon because the drugs they formulate impact millions of Americans every single day. Mostly, these drugs serve their users very well and help relieve or cure many different ailments. However, there are many cases out there where drugs actually have a negative impact on the people that take them, but this data is rarely made available to the public. Until now.

FDA OPEN DATA

Recently the FDA decided to start the FDA Open Data project in order to make FDA data easily available to developers across the world. The hope is that by doing this, insights can be had from their rich data sets and people all over the world will benefit from this open data work. I recently came across a competition they started called the OpenFDA Developer Challenge which is “An open call to tap public data and improve public health”. A description of the competition is below:

Overview

Option I’m Choosing

As you can see above, I’ll be taking on option number 2 which is related to looking for patterns in product labels and seeing if I can identify any inefficiencies. To me, this screams both clustering and text analysis. I’m excited to see if I can build a product that helps the FDA better understand their data.

Conclusion

Well… I’m very VERY excited to be working on this project! It is one of the richest data sets I have seen and the potential impact of great analysis can be massive! I encourage all of you reading this blog to participate as well and see what insights you can come up with. To follow my progress on this project, please visit my github FDA OpenData Project.

Misc.

On a more technical note, I’ve been working in MongoDB for this project because the data comes from the API in JSON format and MongoDB is perfect for this sort of data. I had some issues with the API at first with rate limits and pagination, but now seem to have it all sorted out. If you want to get some boiler plate code to get your data into MongoDB, feel free to check out the code snippet below. I’m using PyMongo which is a nice python wrapper for MongoDB.

"""Note: you need to have MongoDB running for any of this to work.
  Also, you need to have your own api key so that you don't exceed
  the rate limit parameters I'm looping with."""

### IMPORTS ###
import requests
import json
import cnfg
import pprint
from pymongo import MongoClient
import time

### EXTRACTING DRUG LABELS DATA ###
# there's around 70k drug labels
client = MongoClient()
labels = client.drugs.drug_labeling


for i in range(0,50):
    j = int(i*100)
    if i % 240 == 0:
        time.sleep(30)
    print "on iteration: " + str(i)
    response = requests.get("https://api.fda.gov/drug/label.json?api_key=h84pEssgNdT4A19c3CiHp1cKK3Gh3Aajc6GNtIq9&search=effective_time:[20150101+TO+20150331]&limit=100&skip=" + str(j) + "\"")
    labels.insert(response.json())


### EXTRACTING ADVERSE EVENTS DATA ###
# there's over 4mm events from 2004

""" I broke it down into two years so that I could utilize two api keys
to download the data at the same time.  Only showing 'year1' script here.

Also, there is a skip parameter maximum, so I had to download data for two
week periods at a time.  Hence all the loops. """

years1 = ['2008', '2009', '2010', '2011']
years2 = ['2012', '2013', '2014']
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

starts = ['01', '15']
ends = ['14', '31', '14', '28','14','31','14','30','14','31','14','30','14','31','14','31','14','30','14','31','14','30','14','31']

loops = 0
for year in years1:
    print "starting year " + str(year)
    count = 0
    for month in months:
        print "starting month " + str(month)
        for ind, start in enumerate(starts):
            for i in range(0,51):
                j = int(i*100)
                if loops % 240 == 0:
                    print "done with: " + str(loops)
                    time.sleep(59)
                try:
                    response = requests.get("https://api.fda.gov/drug/event.json?api_key=h84pEssgNdT4A19c3CiHp1cKK3Gh3Aajc6GNtIq9&search=receivedate:["+year+month+start+'+TO+'+year+month+ends[count]+"]&limit=100&skip=" + str(j) + "\"")
                    events.insert(response.json())
                except:
                    print "json not good on: "+year+month+start+'TO'+year+month+ends[count]
                loops +=1
            count +=1

## EXTRACTING RECALLS DATA ###
# there's around 4,000 events

client = MongoClient()
enforcement = client.drugs.enforcement

print "starting the enforecement extraction..."
for i in range(0,39):
    j = int(i*100)
    response = requests.get("https://api.fda.gov/drug/enforcement.json?api_key=h84pEssgNdT4A19c3CiHp1cKK3Gh3Aajc6GNtIq9&search=&limit=100&skip=" + str(j) + "\"")
    enforcement.insert(response.json())
    print "finished iteration: " + str(i)
print "all done!"

~ Trevor