Django app to consume and store 990 data and metadata. Depends on IRSx (which is installed as a dependency below).
- git clone this repository (`git clone https://github.com/jsfenfen/990-xml-database.git`) and `cd 990-xml-database`
- install the requirements with `pip install -r requirements.txt`. This is Django 2, so only python3 is supported.
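  If you'd like to keep the install isolated, a standard virtual environment works; this is optional and just one common approach:

  ```
  $ python3 -m venv venv
  $ source venv/bin/activate
  $ pip install -r requirements.txt
  ```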
- copy the `irsdb/local_settings.py-example` file to `irsdb/local_settings.py` and edit it to reflect your database settings; a sketch of what that might look like is below.
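  A minimal example for a local PostgreSQL database (every value here is a placeholder, not a project default, and PostgreSQL itself is an assumption; any database backend Django 2 supports should work):

  ```python
  # irsdb/local_settings.py -- illustrative values only; substitute your own
  DATABASES = {
      'default': {
          'ENGINE': 'django.db.backends.postgresql',
          'NAME': 'irsdb',
          'USER': 'irsdb_user',
          'PASSWORD': 'changeme',
          'HOST': '127.0.0.1',
          'PORT': '5432',
      }
  }
  ```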
- run `python manage.py makemigrations metadata` to generate the metadata migrations, and then run them with `python manage.py migrate metadata`.
- load the metadata from the source csv files in `generated_schemas` with the management command `python manage.py load_metadata`. This command erases the metadata before loading, so it can be rerun if it somehow breaks in the middle.
- if the csv files have changed, you can generate migrations for the db by generating the models with `python manage.py generate_schemas_from_metadata` (which puts the new models file in `generated_schemas/` as `django_models_auto.py`), then moving the generated models file into `return/models.py` and running `python manage.py makemigrations return`; the full sequence is sketched below.
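  Assuming you apply the regenerated models right away, the whole cycle looks something like this (the `cp` step is the "move" described above):

  ```
  $ python manage.py generate_schemas_from_metadata
  $ cp generated_schemas/django_models_auto.py return/models.py
  $ python manage.py makemigrations return
  $ python manage.py migrate return
  ```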
The IRS releases metadata files which include the unique id, EIN and other information about each .xml filing. We need to put this in the database to make sense of the raw filings.
- run `python manage.py makemigrations filing` to generate the filing migrations, and then run them with `python manage.py migrate filing`.
- run `python manage.py enter_yearly_submissions <YYYY>` where YYYY is the year corresponding to a yearly index file that has already been downloaded. (If it hasn't been downloaded you can retrieve it with `irsx_index --year=YYYY`.) This script checks to see if the IRS' index file is any bigger than the one on disk, and only runs if it is. You can force it to try to enter any new filings (regardless of whether the file is updated) with the `--enter` option.
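  For example, to fetch and enter the 2017 index (the last line is only needed if you want to force entry when the index file itself hasn't changed):

  ```
  $ irsx_index --year=2017
  $ python manage.py enter_yearly_submissions 2017
  $ python manage.py enter_yearly_submissions 2017 --enter
  ```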
There's a problem with the 2014 index file: an internal comma has "broken" the .csv format for some time. You can fix it with a Perl one-liner (which first backs the file up to `index_2014.csv.bak` before modifying it):

```
$ perl -i.bak -p -e 's/SILVERCREST ASSET ,AMAGEMENT/SILVERCREST ASSET MANAGEMENT/g' index_2014.csv
```
We can see that it worked by diffing it against the backup:

```
$ diff index_2014.csv index_2014.csv.bak
39569c39569
< 11146506,EFILE,136171217,201212,1/14/2014,MOSTYN FOUNDATION INC CO SILVERCREST ASSET MANAGEMENT,990PF,93491211007003,201302119349100700
---
> 11146506,EFILE,136171217,201212,1/14/2014,MOSTYN FOUNDATION INC CO SILVERCREST ASSET ,AMAGEMENT,990PF,93491211007003,201302119349100700
```
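If you'd rather do the same repair in Python, here's a minimal equivalent of the one-liner above (it assumes `index_2014.csv` is in the current directory; adjust the path to wherever the index file was downloaded):

```python
# Repair the stray comma in index_2014.csv, keeping a backup like perl -i.bak.
import shutil

path = "index_2014.csv"
shutil.copy(path, path + ".bak")  # back up the original first

with open(path) as f:
    text = f.read()

text = text.replace("SILVERCREST ASSET ,AMAGEMENT", "SILVERCREST ASSET MANAGEMENT")

with open(path, "w") as f:
    f.write(text)
```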
For more details see here.
## Part 5: Generate the schema files

Not recommended: this is only used when regenerating models for a new IRSX version.
Run `python manage.py generate_schemas_from_metadata` to generate a Django models file (in the `generated_models` directory). You can modify these and put them into `return/models.py`.

Create the tables in the return model by running the migrations:

```
$ python manage.py makemigrations return
$ python manage.py migrate return
```

The first command makes the migrations and the second runs them.
Actually enter the filings into the database with `python manage.py load_filings <YYYY>`.
This script will take a while to run--probably at least several hours per year. You'll likely want to run it using nohup, so something like this:

```
$ nohup python manage.py load_filings <YYYY> &
```

This detaches the process from the terminal, so the command keeps running if your connection times out.
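By default nohup writes the script's output to `nohup.out` in the directory you launched it from, so you can keep an eye on progress with:

```
$ tail -f nohup.out
```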
You may want to adjust your postgres settings for better loading, but you'll need to pay attention to overall memory and resource use. The loading process uses columns in the filing model to track load progress (and to ensure the same files aren't loaded twice).
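You can check that progress from the Django shell (`python manage.py shell`); a sketch, assuming the filing app's model class is named `Filing` (confirm against `filing/models.py`) and using the `submission_year` and `parse_complete` columns described later in this README:

```python
from filing.models import Filing  # model name is an assumption; check filing/models.py

year = 2017
total = Filing.objects.filter(submission_year=year).count()
complete = Filing.objects.filter(submission_year=year, parse_complete=True).count()
print("{} of {} filings for {} fully loaded".format(complete, total, year))
```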
TK - explanation of keyerrors
There's a sql script that will remove all entered rows from all return tables and reset the fields in filing as if they were new. If you want to live dangerously, you can run it from the console like this:

```
$ python manage.py dbshell < ./return/sql/delete_all_return.sql
```
There are management commands to create or drop indexes on `object_id`, `ein` and (for schedule K) `documentId`. Use `python manage.py make_indexes` or `python manage.py drop_indexes`. These are just conveniences to create indexes named `xx_<tablename>`--they won't remove other indexes.
You can remove all filings from a given index file with the `remove_year` management command. It's likely to run faster if indexes are in place.
If loading gets interrupted, you can remove only the rows where `parse_started` is True and `parse_complete` is not with the `remove_half_loaded` management command. It also requires a year as a command line argument:

```
$ python manage.py remove_half_loaded 2018
```
The full download of uncompressed .xml files is over 74 gigabytes. Processing a complete year of data probably entails moving at least 15 gigs of xml.

You probably want to look into a tool to help you move these files in bulk. AWS' S3 CLI can dramatically reduce download time, but seems unhelpful when trying to pull a subset of files (it seems like `--exclude '*'` hangs when processing so many files). You may want to look into moving all the files to your own S3 bucket as well. There are also alternatives to AWS' CLI tool, like s3cmd.

You'll also want to configure the IRSx file cache directory by setting the WORKING_DIRECTORY variable to the file path of the folder where the xml files are located.

The worst option is to download the uncompressed files one at a time. That sounds really, really slow.
With most hosting providers, you'll need to configure additional storage to support the static files and the database that's ultimately loaded. Make sure that you set the database storage directory to that storage, and get the fastest storage type you can afford.
You may want to look into tuning your database parameters to better support data loading. And you'll get better performance if you only create indexes after loading is complete (and delete them before bulk loads take place).
One random datapoint: on an Amazon t2.medium ec2 server (~$38/month) with 150 gigs of additional storage and postgres running on the default configs and writing to an SSD EBS volume, load time for the complete set of about 490,000 filings from 2017 took about 3 hours.
This assumes no schema changes are required, which is usually the case.
Run an S3 sync to the location of the filings. The whole collection is now over 80 GB, so make sure you have room. You can also retrieve the files some other way (if you don't retrieve them en masse, the load_filings script will attempt to download one filing at a time). It's useful to run this with nohup, i.e.

```
$ nohup aws s3 sync s3://irs-form-990/ ./ &
```
Then update the index file data:

```
$ python manage.py enter_yearly_submissions 2018
index_2018.csv has changed. Downloading updated file...
Done!
Entering xml submissions from /home/webuser/virt/env/lib/python3.5/site-packages/irsx/CSV/index_2018.csv
Committing 10000 total entered=10000
commit complete
Committing 10000 total entered=20000
commit complete
Added 24043 new entries.
```
Then enter the filings into the relational database with:

```
$ python manage.py load_filings 2018
Running filings during year 2018
Processed a total of 100 filings
Processed a total of 200 filings
Processed a total of 300 filings
...
Handled 24000 filings
Processed a total of 24000 filings
Processed a total of 24043 filings
Done
```
This script finds filings where `submission_year` is the entered year and `parse_complete` has not been set to True. It enters them in groups of 100 and sets `parse_complete` to True after each batch has completed. The script is fairly fault tolerant, but if it dies in the middle it's important to remove all the half-entered filings where `parse_started` = True and `parse_complete` is not True. (By default it is null, so don't try to match on `parse_complete` = False.)
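To make the null-vs-False point concrete, the selection that `remove_half_loaded` targets might look like this in the Django ORM (again assuming the model class is named `Filing`; the management command remains the supported way to do the cleanup):

```python
from filing.models import Filing  # model name is an assumption

# Started but never finished. Filtering on parse_complete=False would miss
# these rows, because the column defaults to NULL rather than False.
half_loaded = Filing.objects.filter(
    submission_year=2018,
    parse_started=True,
).exclude(parse_complete=True)
print(half_loaded.count(), "half-loaded filings")
```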