Metapack Access Example

Examples for opening Metapack packages from a CKAN repository.

Metatab is a system for documenting data set metadata, which the program metapack uses to create data packages. You can also us the metapack python module to access data packages from the web, which provides easy access to documentation and pandas data frames in Jupyter notebooks.

To start, you’ll need to install metatab, which you should be able to do with:

pip install metatab

The system is under active development, so if there are problems, you can install the latest development versions of the important modules with the development requirements.txt file on github:

pip install -r https://raw.githubusercontent.com/CivicKnowledge/metapack/master/dev/requirements.txt

After installing the module, you should be able to run the code in this notebook.

Metapack Packages

Metapack packages are collections of files that contain data and metadata. Metapack has several package types, including Excel, Zip, CSV, File systen and S3. Most of the time with Jupyter notebooks, you will use the CSV packages, but the ZIP and Excel packages will also work.

First, you’ll need to get a reference to a package. Most often, you’ll get these from our (CKAN Data Repository at data.sandiegodata.org)[http://data.sandiegodata.org]. In this example, well use the Community Reinvestment Act Disclosure Files.

First, visit the data package page in the data repository. The files list will have both data package files and data files. The data package files are the ones that start with the name of the package, ffiec.gov-cra_disclosure_smb_orig-2010_2015-2. SO, these are package files:

ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv
ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.zip
ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.xlsx

The last one, the .csv file, is the CSV package. Using CSV packages is usually most efficient because you only need to download the data files that you use. So, the first step is to get the CSV package URL. From the data package page on the CKAN repository you can:

Click on the “Explore” button next to the CSV package file, then right-click on “Go to resource” to copy the URL.
Click on the name of the CSV package, then copy the URL at the top of the following page.

After you have the package URL, pass it into the open_package function, as shown in next cell. The function will return a data package object, which Jupyter will print by showing the package documentation.

import metapack as mp

pkg =  mp.open_package('http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv')

pkg

Community Reinvestment Act Disclosure Files

ffiec.gov-cra_disclosure_smb_orig-2010_2015-2

Multi-year CRA disclosures for small business originations.

metapack+http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv

Contacts

Wrangler: Eric Busboom San Diego Regional Data Library

Resources

sb_loan_orig - http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2/data/sb_loan_orig.csv Table D1-1, small business disclosire records, for years 2010 to 2015 inclusive

References

discl_15 - https://www.ffiec.gov/cra/xls/15exp_discl.zip
discl_14 - https://www.ffiec.gov/cra/xls/14exp_discl.zip
discl_13 - https://www.ffiec.gov/cra/xls/13exp_discl.zip
discl_12 - https://www.ffiec.gov/cra/xls/12exp_discl.zip
discl_11 - https://www.ffiec.gov/cra/xls/11exp_discl.zip
discl_10 - https://www.ffiec.gov/cra/xls/10exp_discl.zip

The Resources section lists the datafiles in the package, while the References section show the links to datafiles that were used to create the resources. You can use the name of a resource in a call to pkg.resource to create a resource object, which like the package object, can be pretty printed in Jupyter.

r = pkg.resource('sb_loan_orig')
r

sb_loan_orig

http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2/data/sb_loan_orig.csv

Header	Type	Description
table_id	text	Value is D1-1
respondent_id	text	Assigned by regulatory agency (same as HMDAID if applicable); Right justified with leading zeros
agency	integer	Values are 1=OCC, 2=FRS, 3=FDIC, or 4=OTS
year	integer	Four digit year (e.g. 2012)
loan_type	integer	Value is 4 (Small Business)
action	integer	Value is 1 (Originations)
state	integer	FIPS code with leading zeros or blank for totals across all states
county	integer	FIPS code with leading zeros or blank for totals across all counties
msa	integer	As defined by OMB; Right justified with leading zeros, NA left justified for areas outside of MSA/MD or blank for totals across all MSA/MDs
assessment_area	text	Values are 0001 through 9999; Right justified with leading zeros, NA left justified for areas outside of an Assessment Area (including predominately military areas) OR blank for totals across all Assessment Areas
partial_county	text	Values are Y = Yes N = No OR blank for totals
split_county	text	Values are Y = Yes N = No OR blank for totals
pop_class	text	Values are S= counties with < 500,000 in population L= counties with >500,000 in population OR blank for totals
income_total	text	Values are 1= < 10% of Median Family Income(MFI) 2= 10% to 20% of MFI 3= 20% to 30% of MFI 4= 30% to 40% of MFI 5= 40% to 50% of MFI 6= 50% to 60% of MFI 7= 60% to 70% of MFI 8= 70% to 80% of MFI 9= 80% to 90% of MFI 10= 90% to 100% of MFI 11= 100% to 110% of MFI 12= 110% to 120% of MFI 13= > 120% of MFI 14= MFI not known (income percentage = 0) 15= Tract not Known (reported as NA) 101= Low Income (< 50% of MFI - excluding 0) 102= Moderate Income (50% to 80% of MFI) 103= Middle Income (80% to 120% of MFI) 104= Upper Income (> 120% of MFI) 105= Income Not Known (0) 106= Tract not Known (NA) Right justified with leading zeros or blank for totals
report_level	text	Values are 4= Total Inside & Outside Assessment Area (AA) (across all states) 6= Total Inside AA (across all states) 8= Total Outside AA (across all states) 10= State Total 20= Total Inside AA in State 30= Total Outside AA in State 40= County Total 50= Total Inside AA in County 60= Total Outside AA in County Right justified with leading zeros or blank if not a total
num_orig_bus_lt100k	integer	Right justified with leading zeros
tot_orig_bus_lt100k	integer	Amount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros
num_orig_bus_lt250k	integer	Right justified with leading zeros
tot_orig_bus_gt100k_lt250k	integer	Amount is in thousands {e.g. 00000125 indicates $125,000); Right justified with leading zeros
num_orig_bus_gt250k_lt1m	integer	Right justified with leading zeros
tot_orig_bus_gt250k_lt1m	integer	Amount is in thousands {e.g. 00000300 indicates $300,000); Right justified with leading zeros
num_orig_bus_lt1m	integer	Right justified with leading zeros
tot_orig_bus_lt1m	integer	Amount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros
num_orig_bus_al	integer	Right justified with leading zeros
tot_orig_bus_al	integer	Amount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros

The final step of access is to create a dataframe from the resource. This is really easy, just use the .dataframe() method. Note, however, for this dataset, it can take almost 10 minutes to create the whole dataframe, as the data file is very large.

%%time
df = r.dataframe()

CPU times: user 9min 7s, sys: 8.66 s, total: 9min 16s
Wall time: 9min 36s

len(df)