Using APIs¶
American Community Survey¶
The American Community Survey (ACS) is the largest survey conducted by the U.S. Census Bureau. It collects basic demographic information about people living across the United States.
The survey is conducted every year and, due to processing time, is released with a lag. For instance, the 2018 survey data is released in the latter part of 2019. The ACS data is released not only in one-year estimates (e.g. 2018 survey data) but also in five-year estimates (e.g. 2014-2018 survey data). The one-year estimates are, naturally, more recent. However, five-year estimates may be necessary in some applications. Due to anonymity concerns, the Census Bureau does not release survey information for small populations every year. Rather, some information is available exclusively in the five-year survey data (which averages over a five-year period).
A brief overview of the data collected by the ACS is available here. Note that the ACS data includes things like age, marital status, income, employment status, and educational attainment.
The full list of ACS data is available here.
Register for an API key with the U.S. Census Bureau here. This step is required to continue with the lecture notes!
Enter your API key as a string here:
api_key = ''
To keep my key secret, I’ve pickled the string that stores it, and will re-load it here. Do not attempt to run the following block.
import pickle
with open('census_key.p', 'rb') as f:
api_key = pickle.load(f)
The formula to get data from the ACS is:
'https://api.census.gov/data/'
+ <year>
+ '/acs/acs1?get=NAME,'
+ <variable name>
+ '&for='
+ <geography>
+ ':*&key='
+ <API key>
For instance, to get median household income (which has <variable name> = B19013_001E) for <year> = 2010 over the entire U.S. (<geography> = us), we would use the string
'https://api.census.gov/data/' + '2010' + '/acs/acs1?get=NAME,' + 'B19013_001E' + '&for=' + 'us' + ':*&key=' + api_key
as our URL. An example request is shown below:
import requests
r = requests.get('https://api.census.gov/data/2010/acs/acs1?get=NAME,B19013_001E&for=us:*&key='+api_key).json()
print(r)
[['NAME', 'B19013_001E', 'us'], ['United States', '50046', '1']]
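The string concatenation shown above can be wrapped in a small helper function so that other years, variables, and geographies are easy to request. The name build_acs_url is our own choice for this sketch, not part of any library:

```python
def build_acs_url(year, variable, geography, key):
    """Assemble an ACS 1-year API request URL from its pieces."""
    return ('https://api.census.gov/data/' + year
            + '/acs/acs1?get=NAME,' + variable
            + '&for=' + geography + ':*&key=' + key)

# Reproduces the URL used in the request above (substitute your own key)
print(build_acs_url('2010', 'B19013_001E', 'us', 'YOUR_KEY'))
```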
Likewise, to get data for every county in the country, we would replace us with county following the for= piece of the string. This will return data on many counties. Instead of printing them all, we print just a sample below, along with the total number of counties for which median household income data is available via the ACS.
r = requests.get('https://api.census.gov/data/2010/acs/acs1?get=NAME,B19013_001E&for=county:*&key='+api_key).json()
print(r[0:10])
print(len(r))
[['NAME', 'B19013_001E', 'state', 'county'], ['Baldwin County, Alabama', '47502', '01', '003'], ['Calhoun County, Alabama', '36675', '01', '015'], ['Cullman County, Alabama', '35786', '01', '043'], ['DeKalb County, Alabama', '35967', '01', '049'], ['Elmore County, Alabama', '52222', '01', '051'], ['Etowah County, Alabama', '35897', '01', '055'], ['Houston County, Alabama', '38579', '01', '069'], ['Jefferson County, Alabama', '41583', '01', '073'], ['Lauderdale County, Alabama', '38661', '01', '077']]
819
Note that the first item in this list is:
print(r[0])
['NAME', 'B19013_001E', 'state', 'county']
which is simply a set of headers. That is, for each additional item in the list, the data in position 0 is the county name and the data in position 1 is the value of the variable B19013_001E. The numbers in positions 2 and 3 are the county’s FIPS codes. States have 2-digit FIPS codes, and counties have 3-digit FIPS codes. If we put the state code and county code together into a 5-digit number (state code then county code), we have a unique identifier for the county. This FIPS code is used by many different data providers.
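Since the API returns these codes as strings, we can build the 5-digit identifier by concatenation, using zfill to preserve any leading zeros. A minimal sketch using Baldwin County's codes from the output above:

```python
state, county = '01', '003'              # codes as returned by the ACS API
fips = state.zfill(2) + county.zfill(3)  # zfill pads with leading zeros
print(fips)  # '01003', a unique identifier for Baldwin County, Alabama
```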
Recall that we can remove an element from a list with the .pop() method. For instance, let’s remove the first element of our requested ACS data (the set of headers).
headers = r.pop(0)
Do not run the above line of code multiple times! Python will continue popping out elements of your list.
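If you are worried about re-running a cell by accident, one defensive alternative (a sketch, not the approach used in these notes) is to separate the headers from the data by slicing, which leaves the original list unchanged:

```python
data = [['NAME', 'B19013_001E'], ['United States', '50046']]  # toy response
headers, rows = data[0], data[1:]  # slicing does not mutate the list
print(headers)  # same result no matter how many times this cell runs
```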
The value for headers is thus:
print(headers)
['NAME', 'B19013_001E', 'state', 'county']
as expected. Now, the requested ACS data no longer has that element. Rather, what is left is
print(r[0:10])
[['Baldwin County, Alabama', '47502', '01', '003'], ['Calhoun County, Alabama', '36675', '01', '015'], ['Cullman County, Alabama', '35786', '01', '043'], ['DeKalb County, Alabama', '35967', '01', '049'], ['Elmore County, Alabama', '52222', '01', '051'], ['Etowah County, Alabama', '35897', '01', '055'], ['Houston County, Alabama', '38579', '01', '069'], ['Jefferson County, Alabama', '41583', '01', '073'], ['Lauderdale County, Alabama', '38661', '01', '077'], ['Lee County, Alabama', '39381', '01', '081']]
simply data. This is useful, because we can now store all of the data we have in a DataFrame. The command to do this is pd.DataFrame(). This function takes two arguments. First, Python expects to receive a list of lists, corresponding to the data that we want to store: a list of rows, where each row is itself a list of values (one per column). The second argument is a list of column names.
import pandas as pd
census = pd.DataFrame(r, columns=headers)
census.head()
NAME | B19013_001E | state | county | |
---|---|---|---|---|
0 | Baldwin County, Alabama | 47502 | 01 | 003 |
1 | Calhoun County, Alabama | 36675 | 01 | 015 |
2 | Cullman County, Alabama | 35786 | 01 | 043 |
3 | DeKalb County, Alabama | 35967 | 01 | 049 |
4 | Elmore County, Alabama | 52222 | 01 | 051 |
census['state'] = census['state'].astype(int)
census['county'] = census['county'].astype(int)
census.dtypes
NAME object
B19013_001E object
state int64
county int64
dtype: object
census['B19013_001E'] = census['B19013_001E'].astype(int)
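If a column ever contains entries that cannot be parsed as integers (for example, if a data source marks missing or suppressed values with text), .astype(int) will raise an error. A more forgiving alternative, sketched here on toy data, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN:

```python
import pandas as pd

s = pd.Series(['47502', '36675', 'N/A'])      # toy column with a bad entry
converted = pd.to_numeric(s, errors='coerce')  # 'N/A' becomes NaN
print(converted)
```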
Zillow Home Values¶
County median incomes give a sense of the economic well-being of a geographic area. Suppose that our interest is in how home prices (for most households, the most valuable asset that the household owns) correlate with income.
Housing data is available from Zillow here.
The URL for data on all home prices in a county (single-family residential and condos) is 'http://files.zillowstatic.com/research/public_v2/zhvi/County_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_mon.csv'. Pandas can read online CSV files directly into Python.
zhvi = pd.read_csv('http://files.zillowstatic.com/research/public_v2/zhvi/County_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_mon.csv')
The .shape property of a Pandas DataFrame reports the number of rows and columns.
zhvi.shape
(2850, 306)
There are a lot of columns in this file! Recall that the .columns property of a DataFrame returns a list of columns. Recall too that we can subset a list named x to items m through n via x[m:n]. Thus, we print out only the first 20 columns of the DataFrame below (to get a sense of what’s included in this DataFrame).
zhvi.columns[:20]
Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
'State', 'Metro', 'StateCodeFIPS', 'MunicipalCodeFIPS', '1996-01-31',
'1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31', '1996-06-30',
'1996-07-31', '1996-08-31', '1996-09-30', '1996-10-31', '1996-11-30'],
dtype='object')
Likewise, the last n items of list x are accessible via x[-n:].
zhvi.columns[-20:]
Index(['2019-02-28', '2019-03-31', '2019-04-30', '2019-05-31', '2019-06-30',
'2019-07-31', '2019-08-31', '2019-09-30', '2019-10-31', '2019-11-30',
'2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30'],
dtype='object')
Given what we observe about the data, define a list of column names to keep. All columns outside of this list will be removed.
keep_list = [col for col in zhvi.columns[:9]] + ['2018-12-31']
print(keep_list)
['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName', 'State', 'Metro', 'StateCodeFIPS', 'MunicipalCodeFIPS', '2018-12-31']
zhvi = zhvi[keep_list]
print(zhvi.head())
RegionID SizeRank RegionName RegionType StateName State \
0 3101 0 Los Angeles County County CA CA
1 139 1 Cook County County IL IL
2 1090 2 Harris County County TX TX
3 2402 3 Maricopa County County AZ AZ
4 2841 4 San Diego County County CA CA
Metro StateCodeFIPS MunicipalCodeFIPS \
0 Los Angeles-Long Beach-Anaheim 6 37
1 Chicago-Naperville-Elgin 17 31
2 Houston-The Woodlands-Sugar Land 48 201
3 Phoenix-Mesa-Scottsdale 4 13
4 San Diego-Carlsbad 6 73
2018-12-31
0 640852.0
1 256662.0
2 198043.0
3 282261.0
4 610196.0
We will combine datasets using state and county codes. Check that the ZHVI data imported as numbers.
zhvi.dtypes
RegionID int64
SizeRank int64
RegionName object
RegionType object
StateName object
State object
Metro object
StateCodeFIPS int64
MunicipalCodeFIPS int64
2018-12-31 float64
dtype: object
Now we can merge. The format for merging df1 and df2 is df1.merge(df2, left_on=, right_on=). The left_on and right_on arguments specify the variables on the left (df1) and right (df2) datasets that should link the datasets together. Note that df1 is the left dataset because it appears before .merge, whereas df2 is the right dataset because it appears after .merge. It is also possible to use df2.merge(df1, left_on=, right_on=). In this scenario, the lists passed to left_on and right_on would flip from the earlier usage (since now df2 is the left dataset).
df = zhvi.merge(census, left_on=['StateCodeFIPS','MunicipalCodeFIPS'], right_on=['state','county'])
df.head()
RegionID | SizeRank | RegionName | RegionType | StateName | State | Metro | StateCodeFIPS | MunicipalCodeFIPS | 2018-12-31 | NAME | B19013_001E | state | county | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3101 | 0 | Los Angeles County | County | CA | CA | Los Angeles-Long Beach-Anaheim | 6 | 37 | 640852.0 | Los Angeles County, California | 52684 | 6 | 37 |
1 | 139 | 1 | Cook County | County | IL | IL | Chicago-Naperville-Elgin | 17 | 31 | 256662.0 | Cook County, Illinois | 51466 | 17 | 31 |
2 | 1090 | 2 | Harris County | County | TX | TX | Houston-The Woodlands-Sugar Land | 48 | 201 | 198043.0 | Harris County, Texas | 50422 | 48 | 201 |
3 | 2402 | 3 | Maricopa County | County | AZ | AZ | Phoenix-Mesa-Scottsdale | 4 | 13 | 282261.0 | Maricopa County, Arizona | 50410 | 4 | 13 |
4 | 2841 | 4 | San Diego County | County | CA | CA | San Diego-Carlsbad | 6 | 73 | 610196.0 | San Diego County, California | 59923 | 6 | 73 |
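Note that by default .merge() performs an inner join: rows that fail to find a match in the other dataset are dropped. A toy sketch of this behavior:

```python
import pandas as pd

left = pd.DataFrame({'fips': [1, 2, 3], 'income': [40, 50, 60]})
right = pd.DataFrame({'fips': [2, 3, 4], 'home_value': [200, 300, 400]})

# Only fips values 2 and 3 appear in both datasets, so only they survive
merged = left.merge(right, left_on='fips', right_on='fips')
print(merged)
```

Passing how='outer' (or how='left' / how='right') to .merge() keeps the unmatched rows instead.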
This merged dataset has both county median income and an index of home values in the area. Are income and home prices correlated? We can quickly visualize the question with seaborn.
import seaborn as sns
sns.lmplot(x='B19013_001E', y='2018-12-31', data=df)
<seaborn.axisgrid.FacetGrid at 0x7f898ae77910>
There do appear to be some outliers here. To check, try a boxplot.
sns.boxplot(zhvi['2018-12-31'])
/mnt/software/anaconda3/envs/fds/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
<AxesSubplot:xlabel='2018-12-31'>
Additionally, print out a description of the home value index data.
zhvi['2018-12-31'].describe()
count 2.850000e+03
mean 1.667050e+05
std 1.307382e+05
min 2.544400e+04
25% 9.602900e+04
50% 1.330440e+05
75% 1.979808e+05
max 2.925333e+06
Name: 2018-12-31, dtype: float64
One method of removing outliers is to ignore (i.e. delete) the data that is “far out” on the boxplot (i.e., well beyond the whiskers).
The numpy module has a quantile() function to retrieve the 25th and 75th percentiles, as reported above in the summary statistics.
import numpy as np
quantiles = np.quantile(zhvi['2018-12-31'], q=[.25, .75])
print(quantiles)
[ 96029. 197980.75]
The interquartile range (IQR) is defined as the difference between the 75th and 25th percentiles of data. One reasonable way to remove outliers is to eliminate observations that exceed the 75th percentile by more than 1.5*IQR or are less than the 25th percentile by more than 1.5*IQR.
iqr = quantiles[1]-quantiles[0]
sns.lmplot(x='B19013_001E', y='2018-12-31', data=df[ (df['2018-12-31'] < quantiles[1]+1.5*iqr) &
(df['2018-12-31'] > quantiles[0]-1.5*iqr) ])
<seaborn.axisgrid.FacetGrid at 0x7f89893476d0>
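The filtering logic above can be collected into a small reusable function. The name iqr_filter is our own, not part of pandas or numpy:

```python
import numpy as np
import pandas as pd

def iqr_filter(df, col, k=1.5):
    """Drop rows whose value in col lies more than k*IQR beyond the quartiles."""
    q1, q3 = np.quantile(df[col], [.25, .75])
    iqr = q3 - q1
    return df[(df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)]

toy = pd.DataFrame({'v': [1, 2, 3, 4, 5, 100]})
print(iqr_filter(toy, 'v'))  # the outlier 100 is removed
```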