San Diego Data Knowledge Base

Metapack Access Example

Examples for opening Metapack packages from a CKAN repository.

View Full Notebook on Github

Metatab is a system for documenting data set metadata, which the program metapack uses to create data packages. You can also us the metapack python module to access data packages from the web, which provides easy access to documentation and pandas data frames in Jupyter notebooks.

To start, you’ll need to install metatab, which you should be able to do with:

pip install metatab

The system is under active development, so if there are problems, you can install the latest development versions of the important modules with the development requirements.txt file on github:

pip install -r https://raw.githubusercontent.com/CivicKnowledge/metapack/master/dev/requirements.txt

After installing the module, you should be able to run the code in this notebook.

Metapack Packages

Metapack packages are collections of files that contain data and metadata. Metapack has several package types, including Excel, Zip, CSV, File systen and S3. Most of the time with Jupyter notebooks, you will use the CSV packages, but the ZIP and Excel packages will also work.

First, you’ll need to get a reference to a package. Most often, you’ll get these from our (CKAN Data Repository at data.sandiegodata.org)[http://data.sandiegodata.org]. In this example, well use the Community Reinvestment Act Disclosure Files.

First, visit the data package page in the data repository. The files list will have both data package files and data files. The data package files are the ones that start with the name of the package, ffiec.gov-cra_disclosure_smb_orig-2010_2015-2. SO, these are package files:

  • ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv
  • ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.zip
  • ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.xlsx

The last one, the .csv file, is the CSV package. Using CSV packages is usually most efficient because you only need to download the data files that you use. So, the first step is to get the CSV package URL. From the data package page on the CKAN repository you can:

  1. Click on the “Explore” button next to the CSV package file, then right-click on “Go to resource” to copy the URL.
  2. Click on the name of the CSV package, then copy the URL at the top of the following page.

After you have the package URL, pass it into the open_package function, as shown in next cell. The function will return a data package object, which Jupyter will print by showing the package documentation.

import metapack as mp

pkg =  mp.open_package('http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv')

pkg

Community Reinvestment Act Disclosure Files

ffiec.gov-cra_disclosure_smb_orig-2010_2015-2

Multi-year CRA disclosures for small business originations.

metapack+http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv

Contacts

Wrangler: Eric Busboom San Diego Regional Data Library

Resources

  1. sb_loan_orig - http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2/data/sb_loan_orig.csv Table D1-1, small business disclosire records, for years 2010 to 2015 inclusive

References

  1. discl_15 - https://www.ffiec.gov/cra/xls/15exp_discl.zip

  2. discl_14 - https://www.ffiec.gov/cra/xls/14exp_discl.zip

  3. discl_13 - https://www.ffiec.gov/cra/xls/13exp_discl.zip

  4. discl_12 - https://www.ffiec.gov/cra/xls/12exp_discl.zip

  5. discl_11 - https://www.ffiec.gov/cra/xls/11exp_discl.zip

  6. discl_10 - https://www.ffiec.gov/cra/xls/10exp_discl.zip

The Resources section lists the datafiles in the package, while the References section show the links to datafiles that were used to create the resources. You can use the name of a resource in a call to pkg.resource to create a resource object, which like the package object, can be pretty printed in Jupyter.

r = pkg.resource('sb_loan_orig')
r

sb_loan_orig

http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2/data/sb_loan_orig.csv

HeaderTypeDescription
table_idtextValue is D1-1
respondent_idtextAssigned by regulatory agency (same as HMDAID if applicable);
Right justified with leading zeros
agencyintegerValues are 1=OCC, 2=FRS, 3=FDIC, or 4=OTS
yearintegerFour digit year (e.g. 2012)
loan_typeintegerValue is 4 (Small Business)
actionintegerValue is 1 (Originations)
stateintegerFIPS code with leading zeros or blank for totals across all states
countyintegerFIPS code with leading zeros or blank for totals across all counties
msaintegerAs defined by OMB; Right justified with leading zeros, NA left justified for areas outside of MSA/MD or blank for totals across all MSA/MDs
assessment_areatextValues are 0001 through 9999; Right justified with leading zeros, NA left justified for areas outside of an Assessment Area (including predominately military areas) OR blank for totals across all Assessment Areas
partial_countytextValues are Y = Yes N = No OR blank for totals
split_countytextValues are Y = Yes N = No OR blank for totals
pop_classtextValues are S= counties with < 500,000 in population L= counties with >500,000 in population OR blank for totals
income_totaltextValues are 1= < 10% of Median Family Income(MFI) 2= 10% to 20% of MFI 3= 20% to 30% of MFI 4= 30% to 40% of MFI 5= 40% to 50% of MFI 6= 50% to 60% of MFI 7= 60% to 70% of MFI 8= 70% to 80% of MFI 9= 80% to 90% of MFI 10= 90% to 100% of MFI 11= 100% to 110% of MFI 12= 110% to 120% of MFI 13= > 120% of MFI 14= MFI not known (income percentage = 0) 15= Tract not Known (reported as NA) 101= Low Income (< 50% of MFI - excluding 0) 102= Moderate Income (50% to 80% of MFI) 103= Middle Income (80% to 120% of MFI) 104= Upper Income (> 120% of MFI) 105= Income Not Known (0) 106= Tract not Known (NA) Right justified with leading zeros or blank for totals
report_leveltextValues are 4= Total Inside & Outside Assessment Area (AA) (across all states) 6= Total Inside AA (across all states) 8= Total Outside AA (across all states) 10= State Total 20= Total Inside AA in State 30= Total Outside AA in State 40= County Total 50= Total Inside AA in County 60= Total Outside AA in County Right justified with leading zeros or blank if not a total
num_orig_bus_lt100kintegerRight justified with leading zeros
tot_orig_bus_lt100kintegerAmount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros
num_orig_bus_lt250kintegerRight justified with leading zeros
tot_orig_bus_gt100k_lt250kintegerAmount is in thousands {e.g. 00000125 indicates $125,000); Right justified with leading zeros
num_orig_bus_gt250k_lt1mintegerRight justified with leading zeros
tot_orig_bus_gt250k_lt1mintegerAmount is in thousands {e.g. 00000300 indicates $300,000); Right justified with leading zeros
num_orig_bus_lt1mintegerRight justified with leading zeros
tot_orig_bus_lt1mintegerAmount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros
num_orig_bus_alintegerRight justified with leading zeros
tot_orig_bus_alintegerAmount is in thousands {e.g. 00000025 indicates $25,000); Right justified with leading zeros

The final step of access is to create a dataframe from the resource. This is really easy, just use the .dataframe() method. Note, however, for this dataset, it can take almost 10 minutes to create the whole dataframe, as the data file is very large.

%%time
df = r.dataframe()
CPU times: user 9min 7s, sys: 8.66 s, total: 9min 16s
Wall time: 9min 36s
len(df)
4391391