Welcome to the Data Gathering Repository for the Index of NCI Studies (INS)!
This repository uses a curated list of key National Cancer Institute (NCI) programs to gather data for the grants, projects, and publications associated with each program in an automated way. It accesses resources from the NIH RePORTER API, NIH iCite bulk downloads, and the NCBI PubMed API (through BioPython).
To access file outputs of the data gathering process, follow the instructions below.
The data gathered here are compatible with INS Data Model v2.0.0.
The INS Data Gathering workflow consists of the steps below, designed to run together in order. It is supported by additional, independent steps that are run as needed.
This step gathers programs curated by the NCI Office of Data Sharing (ODS).
Programs represent a coherent assembly of plans, project activities, and supporting resources contained within an administrative framework, the purpose of which is to implement an organization's mission or some specific program-related aspect of that mission.
Within INS, programs are the highest-level node. Programs contain projects.
Examples
All program processing is handled in the `gather_program_data.py` module and can be run as an independent process with the command:

```
python modules/gather_program_data.py
```
1. **Process Qualtrics Input**
    - Processes the curated CSV of Programs received from the NCI Office of Data Sharing (ODS). This CSV is an export of survey results from the Qualtrics survey tool. Each Program in this export includes a curation of associated Notices of Funding Opportunities (NOFOs) or Awards.
    - NOTE: Though the terms are similar, 'Award' is used throughout this documentation to refer to the Award values provided within the Qualtrics CSV, while 'Grant' refers to the grants gathered from NIH RePORTER and used in downstream processing.
2. **Validate provided NOFOs and Awards**
    - Compares each NOFO or Award to expected string formatting patterns to check for validity
    - Generates versioned `invalidAwardReport` and `invalidNofoReport` files in the `reports` directory if any unexpected patterns are found
    - Prompts the user to review and correct any issues. There are two ways to make corrections:
        1. Make manual changes within the raw Qualtrics file `qualtrics_output_{version}_raw.csv` and save it as `qualtrics_output_{version}_manual_fix.csv`. Change `QUALTRICS_TYPE` to `manual_fix` within `config.py`.
        2. Add `suggested_fix` and optional `comment` column(s) to the `invalidAwardReport` or `invalidNofoReport` CSV and save it to the `data/reviewed/{version}/` directory with the `_reviewed` suffix. The validation step will automatically check whether this file exists and apply any suggested changes specified within. If any invalid values remain after this fix, an `invalidAwardReport_corrected.csv` or `invalidNofoReport_corrected.csv` is generated in the `reports` directory.
3. **Validate Program names and generate IDs**
    - Checks for duplicates or unexpected combinations in Program Names and Program Acronyms
        - If duplicates are found, a manual fix must be made to the Qualtrics input file to avoid downstream issues. Save the fixed file as `qualtrics_output_{version}_manual_fix.csv` and change `QUALTRICS_TYPE` to `manual_fix` within `config.py`.
    - Generates a `program_id` from the Program Acronym
4. **Generate clean Program file**
    - Saves an intermediate `program.csv` in the versioned `data/01_intermediate/` directory for reference and downstream use
    - The expected fields are defined and can be modified in `config.py`
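The string-format checks described above can be sketched with regular expressions. The patterns below are illustrative assumptions derived from the NOFO and Award examples in this README; the authoritative patterns live in the repository's validation code and may differ.

```python
import re

# Assumed patterns based on examples like RFA-CA-21-038, PAR21-346 (NOFOs)
# and 5U24CA209999-02, 1 U24 CA274274-01, P50CA221745 (Awards).
NOFO_PATTERN = re.compile(r"^(RFA|PAR|PAS|PA|OTA)-?([A-Z]{2}-?)?\d{2}-?\d{3}$")
AWARD_PATTERN = re.compile(
    r"^\d?\s?[A-Z]\d{2}\s?[A-Z]{2}\d{6}(-\d{2}[A-Z]?\d?(S\d+)?)?$"
)


def find_invalid(values, pattern):
    """Return the values that do not match the expected format."""
    return [v for v in values if not pattern.fullmatch(v.strip())]
```

Values flagged by a check like this would end up in the `invalidAwardReport`/`invalidNofoReport` files for manual review.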
This step gathers grants from the NIH RePORTER. Only grants associated with programs (above) are gathered.
Grants are financial assistance mechanisms providing money, property, or both to an eligible entity to carry out an approved project or activity. A grant is used whenever the NIH Institute or Center anticipates no substantial programmatic involvement with the recipient during performance of the financially assisted activities.
Within INS, a grant usually represents one year of support for a multi-year project.
Examples
- 5U24CA209999-02 | Monitoring tumor subclonal heterogeneity over time and space (FY2017)
- 5U24CA209999-03 | Monitoring tumor subclonal heterogeneity over time and space (FY2018)
- 5U10CA031946-23 | Cancer and Leukemia Group B (FY2004)
- 1P50CA217691-01A1 | Emory University Lung Cancer SPORE (FY2019)
All grants processing is handled within the `gather_grant_data.py` module, except for the summary statistics step, which is handled by the `summary_statistics.py` module. They can be run as independent processes with the commands:

```
python modules/gather_grant_data.py
python modules/summary_statistics.py
```
1. **Get grants data from NIH RePORTER API**
    - This process takes approximately 5-10 minutes for ~70 programs
    - For each Key Program, this queries the NIH RePORTER API to gather a list of all associated extramural grants along with descriptive data for each grant
    - The NOFOs (e.g. `RFA-CA-21-038`; `PAR21-346`) and/or Awards (e.g. `1 U24 CA274274-01`; `P50CA221745`; `3U24CA055727-26S1`) provided for each Key Program are used as the query
    - The following exclusions are also applied within the query:
        - Subprojects are excluded
        - Grants prior to fiscal year 2000 are excluded
        - Grants receiving no funding from NCI are excluded
2. **Process grants data**
    - Reformats the data received from the NIH RePORTER API for use within INS:
        - Removes extraneous fields and renames fields to match the existing INS data model
        - Flattens nested JSON structures. In particular, the PI, PO, and agency funding fields have this structure.
        - Formats names to standardize capitalization
    - The expected fields are defined and can be modified in `config.py`
3. **Save grants data for each Program**
    - Adds the associated Program ID to each grant
    - Combines grants data from all programs and stores it as a versioned `grant.csv` within the `data/01_intermediate/` directory
4. **Generate summary statistics**
    - Builds reports useful for testing and validation but not intended for ingestion into the site:
        - `grantsStatsByProgram.csv` groups grants data by Key Program and aggregates counts of grants, projects, searched values, and earliest fiscal year
        - `sharedProjectsByProgramPair.csv` lists pairs of Key Programs and counts of projects associated with both
    - Summary statistics are handled within the `summary_statistics.py` module
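The query and its exclusions can be sketched against the NIH RePORTER API v2. The criteria field names below (`opportunity_numbers`, `exclude_subprojects`, `fiscal_years`) follow the public RePORTER documentation, but the repository's exact query, including the NCI-funding exclusion, may differ.

```python
import json
import urllib.request
from datetime import date

REPORTER_URL = "https://api.reporter.nih.gov/v2/projects/search"


def build_query(nofos, offset=0, limit=500, first_fiscal_year=2000):
    """Build a projects/search payload with the exclusions noted above.
    (The NCI-funding exclusion is applied separately and is omitted here.)"""
    return {
        "criteria": {
            "opportunity_numbers": nofos,    # NOFOs curated for the program
            "exclude_subprojects": True,     # subprojects are excluded
            "fiscal_years": list(range(first_fiscal_year,
                                       date.today().year + 1)),
        },
        "offset": offset,
        "limit": limit,
    }


def fetch_grants(nofos):
    """Page through all RePORTER results for the given NOFOs."""
    results, offset, limit = [], 0, 500
    while True:
        req = urllib.request.Request(
            REPORTER_URL,
            data=json.dumps(build_query(nofos, offset, limit)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=60) as resp:
            batch = json.load(resp).get("results", [])
        results.extend(batch)
        if len(batch) < limit:
            return results
        offset += limit
```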
This step derives projects from the grants data gathered from NIH RePORTER. All projects are associated with at least one program.
Projects are the primary unit of collaborative research effort, sometimes also known as core projects or parent projects. Projects receive funding from awards to conduct research and produce outputs, often across multiple years.
Within INS, projects are organized as a grouping of grants with identical Activity Code, Institute Code, and Serial Number.
Examples
- U24CA209999 | Monitoring tumor subclonal heterogeneity over time and space
- U10CA031946 | Cancer and Leukemia Group B
All projects processing is handled within the `gather_project_data.py` module and can be run as an independent process with the command:

```
python modules/gather_project_data.py
```
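Because a project ID is the grant number stripped of its application-type prefix and support-year suffix (keeping Activity Code + Institute Code + Serial Number, e.g. `U24CA209999` from `5U24CA209999-02`), the grouping rule can be sketched as a small parser. This is an illustrative assumption; the repository's actual implementation may differ.

```python
import re

# Activity Code (letter + 2 digits), Institute Code (2 letters),
# Serial Number (6 digits), with optional type prefix and spaces.
CORE_PROJECT_RE = re.compile(r"^\s*\d?\s*([A-Z]\d{2})\s*([A-Z]{2})\s*(\d{6})")


def core_project_id(grant_number: str) -> str:
    """Derive the core project ID that groups grants into a project."""
    match = CORE_PROJECT_RE.match(grant_number.upper())
    if not match:
        raise ValueError(f"Unrecognized grant number: {grant_number}")
    return "".join(match.groups())
```

Grants whose numbers reduce to the same core ID (e.g. `5U24CA209999-02` and `5U24CA209999-03`) belong to the same project.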
1. **Aggregate grants into projects**
    - Using the intermediate grants file `grant.csv` as input, this process builds a table of all projects and fills it with information pulled from the grants within each project.
    - Each project field is populated using the appropriate aggregation type:
        - Newest/Oldest: Some fields are populated with values from either the newest (e.g. project title, abstract) or oldest (e.g. project start date) grant
        - Identical: Some fields are identical across all grants within a project (e.g. organization and location details). For consistency, the project value is populated with the value from the newest non-supplement grant.
        - List: Some fields are populated by gathering and listing all values from all grants within a project (e.g. opportunity number)
    - Newest/oldest grants are decided using the award notice date if available. If not available, fiscal year is used.
    - Because grant supplements are not always representative of the project, supplements are ignored when pulling newest/oldest values as long as non-supplement grants are available within the project.
2. **Validate values and add Program IDs**
    - Validates that all values that should be identical between grants within a project are indeed identical. Any mismatches are reported in `mismatchedProjectValueReport.csv` in the `reports` directory.
    - Adds Program IDs to projects. To support the data model, any project associated with more than one program will appear multiple times (once for each program).
    - Stores project data in the versioned `project.csv` within the `data/01_intermediate/` directory.
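The Newest/Oldest and List aggregation types above can be sketched with a pandas groupby. The column names (`project_id`, `award_notice_date`, `title`, `project_start_date`, `opportunity_number`) are illustrative assumptions; the real fields are defined in `config.py`.

```python
import pandas as pd


def aggregate_projects(grants: pd.DataFrame) -> pd.DataFrame:
    """Sketch of aggregating grants into one row per project."""
    ordered = grants.sort_values("award_notice_date")
    grouped = ordered.groupby("project_id")
    return pd.DataFrame({
        # Newest: take the value from the most recent grant
        "project_title": grouped["title"].last(),
        # Oldest: take the value from the earliest grant
        "project_start_date": grouped["project_start_date"].first(),
        # List: gather all values across grants, semicolon-separated
        "opportunity_numbers": grouped["opportunity_number"].apply(
            lambda vals: ";".join(sorted(set(vals)))),
    }).reset_index()
```

A fuller version would also drop supplement grants before picking newest/oldest values and fall back to fiscal year when the award notice date is missing.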
This step gathers publication data from NIH RePORTER, NCBI PubMed, and NIH iCite. All publications are associated with at least one project.
Publications gathered for INS only include articles represented in PubMed with a unique PubMed ID. Projects associated with the CCDI program are excluded from the publications workflow.
All publication processing is handled within the `gather_publication_data.py` module and can be run as an independent process with the command:

```
python modules/gather_publication_data.py
```
1. **Get associated PMIDs from the NIH RePORTER API**
    - This process takes approximately 45 minutes for ~2500 project IDs
    - For each project in `project.csv`, this queries the NIH RePORTER API to gather a list of all associated PubMed IDs (PMIDs)
        - For information on how NIH RePORTER links projects to publications, see their FAQ
        - Note that the project-to-publication link is many-to-many. A single project can be associated with multiple publications, and a single publication can be associated with multiple projects.
    - These are stored in the intermediate "checkpoint" file `projectPMIDs.csv`. For subsequent data gathering runs with the same start date (version), this file can be used instead of gathering PMIDs again.
2. **Gather select PubMed information for each PMID**
    - NOTE: This process can take several hours and is highly dependent upon the number of PMIDs gathered. The rate is approximately 15,000-20,000 publications per hour, and it is recommended to run this outside of peak hours (9am - 5pm EST)
    - Uses the BioPython Entrez package to access the PubMed API via Entrez e-utilities and query PMIDs for PubMed information
    - The following fields are pulled from PubMed:
        - Title
        - Authors
        - Publication date
            - NOTE: Publication dates are inconsistent within the PubMed data. When the month and/or day cannot be identified, they default to January and/or the 1st (e.g. `2010 Dec` would be interpreted as `2010-12-01` and `2010` as `2010-01-01`). Publication year is never estimated.
    - Because of the long processing time, checkpoint files are saved periodically in a temporary `temp_pubmed_chunkfiles` directory within the `data/processed/` directory
        - The default length of each checkpoint file is 2000 rows, but this can be changed in `config.py` with `PUB_DATA_CHUNK_SIZE`
        - Whenever this workflow is run for the same start date (version), any existing checkpoint files are loaded together and the unique PMIDs within them are accounted for. Each run checks for any missing PMIDs and restarts the data gathering wherever it left off. This allows the publications workflow to be stopped and restarted without problems.
3. **Gather iCite information for each PMID**
    - Uses the most recent NIH iCite bulk download (zipped CSV) to access iCite data for each PMID
        - This manual download must be completed before starting the automated workflow
        - The downloaded `icite_metadata.zip` should be stored in a versioned (e.g. `2023-11/`) directory within the `data/raw/icite/` directory
            - Note: iCite files are ~10GB and are not stored under git control
        - The download takes 2-4 hours depending upon download speed
        - Note: The iCite API was explored, but it operates much slower (3-4x) than the PubMed API
    - The following fields are pulled from iCite:
        - Title
        - Authors
        - Publication Year
        - Citation Count
        - Relative Citation Ratio (RCR)
    - Saves a checkpoint file `icitePMIDData.csv` containing all iCite data fields for PMIDs of interest
4. **Combine and clean PubMed and iCite data for each PMID**
    - Adds the unique metrics from iCite (Citation Count and RCR) to the PubMed data for each PMID
    - Checks for any missing values in the PubMed information and fills them in with iCite information where available
        - When values conflict, PubMed is used as the default. iCite values are only used when the PubMed value is completely missing.
    - Saves a checkpoint file `mergedPMIDData.csv` containing all merged PubMed and iCite data for PMIDs of interest
    - Cleans the publications data by removing rows with the following issues:
        - Publication year before 2000
        - No publication information for any fields from PubMed or iCite
    - Stores a report of removed publications in a versioned `removedPublicationsReport.csv` within the `reports/` directory
    - Stores the intermediate publication output `publication.csv` in the versioned `data/01_intermediate/` directory
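Two of the behaviors above can be sketched directly: the publication-date defaulting rule, and the PubMed-first merge (shown here with pandas `combine_first`; the repository's actual implementation and column names may differ).

```python
from datetime import date

import pandas as pd

# Month abbreviations as they appear in PubMed date strings
MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_pubmed_date(raw: str) -> date:
    """Default missing month/day to January/1st; never estimate the year."""
    parts = raw.split()
    year = int(parts[0])
    month = MONTHS.get(parts[1], 1) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    return date(year, month, day)


def merge_pubmed_icite(pubmed: pd.DataFrame, icite: pd.DataFrame) -> pd.DataFrame:
    """Keep PubMed values, filling gaps from iCite. Both frames are assumed
    to be indexed by PMID with matching column names (illustrative)."""
    merged = pubmed.combine_first(icite)
    # Citation metrics exist only in iCite, so take them directly
    for col in ("citation_count", "relative_citation_ratio"):
        merged[col] = icite[col]
    return merged
```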
This step standardizes, validates, and exports all data gathered from other steps of the process.
All final data packaging steps are handled within the `package_output_data.py` module and can be run as an independent process with the command:

```
python modules/package_output_data.py
```
1. **Finalize columns in output files**
    - Adds a `type` column filled with the appropriate values required for data loading (e.g. `program` or `project`)
    - Specifies columns and ordering for all files. These can be configured within `config.py`.
2. **Standardize characters**
    - Normalizes any non-standard characters within data gathered from sources
    - Characters are normalized using NFKC and then converted to ASCII before saving as UTF-8
3. **Perform special handling steps**
    - Removes any publications with a publication date more than 365 days before the associated project start date
        - This helps to address the overcitation issue occasionally encountered when authors cite inappropriate grant support. See this 2023 Office of Extramural Research article for more details on the issue.
        - Any publications removed in this process are saved in a `removedEarlyPublications.csv` in the `reports/../packagingReports` directory
        - The 365-day buffer can be modified within `config.py`
4. **Validate and save final outputs**
    - Validates that files do not contain duplicates or other inconsistencies
        - NOTE: Nodes with many-to-many relationships (e.g. `publications`) will have duplicate records in which all values are identical except for the linking column (e.g. `project.project_id`). This is intentional.
    - Validates that all list-like columns are separated by semicolons
    - Generates a list of enumerated values for each column specified in `config.py`. These can be used to update the INS Data Model before loading.
    - Saves all final outputs as TSV files in the `data/02_output` directory. These are ready for INS data loading.
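Two of the packaging steps above can be sketched briefly: the NFKC-to-ASCII character standardization, and the early-publication filter. The column names in the filter are illustrative assumptions; the real names and the buffer value live in `config.py`.

```python
import unicodedata

import pandas as pd

EARLY_PUB_BUFFER_DAYS = 365  # configurable in config.py


def standardize_text(value: str) -> str:
    """NFKC-normalize, then drop any remaining non-ASCII characters
    (the final files are written as UTF-8)."""
    return (unicodedata.normalize("NFKC", value)
            .encode("ascii", errors="ignore")
            .decode("ascii"))


def remove_early_publications(pubs: pd.DataFrame) -> pd.DataFrame:
    """Drop publications dated more than the buffer before the
    associated project start date."""
    pub_date = pd.to_datetime(pubs["publication_date"])
    start_date = pd.to_datetime(pubs["project_start_date"])
    keep = pub_date >= start_date - pd.Timedelta(days=EARLY_PUB_BUFFER_DAYS)
    return pubs[keep]
```

Rows dropped by the filter are what end up in `removedEarlyPublications.csv`.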
This step builds a Data Validation Excel file with expected data values. It can be used to validate that data is loaded and displayed correctly in the INS UI.
All Data Validation Generation steps are handled within the `build_validation_file.py` module and can be run as an independent process with the command:

```
python modules/build_validation_file.py
```
1. **Load data from finalized TSV output files**
    - This reads the final output TSVs to generate the validation file, but does not change or save the output TSVs
2. **Build "Report Info"**
    - Each step builds a dataframe ready for export as a separate tab within the Excel output
    - Each tab within the Excel file represents a different type of information or validation
    - The first tab, "report_info", lists basic versioning and timestamp information
        - This tab can be used to verify that the data version matches the data validation version
3. **Build "Single Program Results"**
    - Creates a table in which each unique program is listed with total counts of associated projects, grants, and publications
    - Counts of projects, grants, and publications should match counts on the INS Program Details Page or on the Explore page when a single program is selected within the filter
4. **Build "Single Project Results"**
    - Creates a table in which each unique project is listed with total counts of associated grants and publications
    - Counts of grants and publications should match counts on the INS Project Details Page
5. **Format and save all tables to Excel as separate tabs**
    - The Excel file is saved in the versioned `reports/` directory with a timestamp noting when it was generated
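The per-program counting behind the "Single Program Results" tab can be sketched with pandas. The linking and ID column names here (`program_id`, `project_id`, `grant_id`, `pmid`) are assumptions for illustration; the actual columns come from the finalized TSVs.

```python
import pandas as pd


def single_program_counts(projects: pd.DataFrame,
                          grants: pd.DataFrame,
                          publications: pd.DataFrame) -> pd.DataFrame:
    """Count unique projects, grants, and publications per program."""
    return pd.DataFrame({
        "project_count": projects.groupby("program_id")["project_id"].nunique(),
        "grant_count": grants.groupby("program_id")["grant_id"].nunique(),
        "publication_count": publications.groupby("program_id")["pmid"].nunique(),
    }).fillna(0).astype(int).reset_index()
```

Each row of the result should match the counts shown on the INS Program Details Page for that program.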
This independent step gathers NCI-supported datasets from NCBI dbGaP. These are not necessarily associated with any of the Programs, Grants, Projects, or Publications gathered in the main workflow.
Datasets within INS represent open-access or controlled data released as an output of an NCI-supported effort. Currently, INS datasets are sourced only from dbGaP study metadata, but future releases will expand to additional resources.
Examples
- phs002790 | Childhood Cancer Data Initiative (CCDI): Molecular Characterization Initiative
- phs002153 | Genomic Characterization CS-MATCH-0007 Arm S1
All dataset gathering steps are handled within the `gather_dbgap_data.py` module. This module is not included in the `main.py` workflow and must be run independently:

```
python modules/gather_dbgap_data.py
```
NOTE: This module is intended to automate the effort to gather a large number of datasets, but the results must undergo manual review and curation. Because of this, the datasets module is not intended to be run as regularly as the main workflow.
1. **Load input CSV of NCI-supported dbGaP studies**
    - The CSV is retrieved from the dbGaP Advanced Search by filtering for IC: NCI and downloading the resulting list
    - This list contains the phs study accessions of interest along with some descriptive metadata
2. **Enrich dataset data with dbGaP API resources**
    - Uses the dbGaP Study Metadata API to gather additional study metadata for each phs accession
    - Uses the dbGaP SSTR API to gather additional study metadata for each phs accession
3. **Enrich dataset data with NCI administrative information**
    - Uses a file received from the NCI Office of Data Sharing (ODS) that monitors dbGaP data submissions as input
    - Associates Grant Program Administrators (GPAs) and NCI Divisions, Offices, and Centers (DOCs) with each phs accession
    - Lookup tables associating GPAs and DOCs were built using public information from the following resources:
4. **Manually review and curate the final datasets file**
    - Combines and cleans all dataset inputs and exports them as a CSV for curation
    - Missing, erroneous, or unstructured source data is refined by expert curation
    - The curated file is validated and ready for INS data loading
1. Go to the `data/02_output` directory.
2. Open the folder named with the most recent date. This notes the date that the Programs file was received from ODS.
3. Open the folder named with the most recent gathered date. This notes the date that the data gathering pipeline was run.
    - For example, the files within `data/02_output/2024-01-04/gathered-2024-01-10/` used a Program file received on January 4th, 2024 as the input to the gathering process run on January 10th, 2024.
4. Download the TSV files within this directory. These represent the latest, finalized data intended for loading into INS.
1. **Clone the repo to your local machine**
    - Option 1: Use the built-in GitHub cloning method or a tool like GitHub Desktop
    - Option 2: Open the command terminal, navigate to the desired destination, and run the following:

        ```
        git clone https://github.com/CBIIT/INS-Data.git
        ```
2. **Set up the environment**
    - Install either Anaconda or Miniconda
        - For compatibility with setups that do not use conda, a `requirements.txt` is also available
    - In a command terminal, run the following command in the INS-Data directory:

        ```
        conda env create -f environment.yaml
        ```

        - This will read `environment.yaml`, set up a matching environment, and install any requirements
    - Activate the environment either through the Anaconda Navigator or with the terminal command:

        ```
        conda activate ins-data-env
        ```

    - If you make changes to the repo that require new or updated packages, update `environment.yaml` and `requirements.txt` with:

        ```
        conda env export | findstr -v "^prefix:" > environment.yaml
        pip list --format=freeze > requirements.txt
        ```

        - NOTE: Replace `findstr` with `grep` if using macOS or Linux. This step removes the local path prefix from `environment.yaml` for privacy.
3. **Add or update the input CSV from ODS**
    - If necessary, update the Qualtrics CSV received from ODS
    - Rename it and place it in the `data/00_input/qualtrics/` folder
        - The name should be in the format `qualtrics_output_{version}_{type}.csv` (e.g. `qualtrics_output_2023-07-19_raw.csv`)
    - If the Qualtrics CSV is updated, also update the values for `QUALTRICS_VERSION` and `QUALTRICS_TYPE` in `config.py` to match the Qualtrics CSV as needed
4. **Download the most recent iCite Database Snapshot**
    - The file is a ~10GB zipped CSV and can take 2-4 hours to download manually
    - Place the file in a versioned raw data directory (i.e. `data/raw/icite/{version}/icite_metadata.zip`)
    - Update the value of `ICITE_VERSION` in `config.py` to match the version directory name (e.g. `2023-11`)
5. **Get an NCBI API key**
    - Follow NCBI instructions to register for an NCBI account and get an API key
    - Create a new file in the INS-Data root directory named `.env` and fill it with the following:

        ```
        NCBI_EMAIL = <your.name@nih.gov>
        NCBI_API_KEY = <API Key from NCBI>
        ```

    - Replace the values above (without `<>`) with your email and key
    - NOTE: Because `.env` is listed in `.gitignore`, this file will not be added to GitHub
        - Never commit API keys to GitHub. Keep them local.
    - Failure to add a valid API key here will triple the time required for the PubMed API data gathering process
6. **Run the pipeline**
    - In the command terminal, run the main workflow from the INS-Data root directory with:

        ```
        python main.py
        ```

    - This will run all steps of the workflow in order and save all output files in the locations defined in `config.py`
    - OPTIONAL: Instead of `main.py`, modules can be run as independent processes with the following commands:

        ```
        python modules/gather_program_data.py
        python modules/gather_grant_data.py
        python modules/summary_statistics.py
        python modules/gather_project_data.py
        python modules/gather_publication_data.py
        python modules/package_output_data.py
        python modules/build_validation_file.py
        ```

    - NOTE: If running modules independently, ensure that the necessary output files from preceding modules already exist for the same start date (version)
7. **Gather/curate datasets (optional)**
    - On the NCBI dbGaP Advanced Search, filter for "NIH Institute: NCI" and download the CSV with the "Save Results" button. Store it as `data/00_input/dbgap/study_{DBGAP_CSV_VERSION}.csv`, where `DBGAP_CSV_VERSION` is the date of download.
    - Update the `DBGAP_CSV_VERSION` in `config.py`
    - Obtain an updated GPA study list from a point of contact at the NCI Office of Data Sharing and store it as `data/00_input/dbgap/gpa_tables/gpa_study_table.csv`
    - If needed, manually update the `gpa_doc_lookup_table.csv` in the same folder
    - Run the datasets gathering module with the following command:

        ```
        python modules/gather_dbgap_data.py
        ```

    - Manually review and curate the results before storing them as `dataset.tsv` for INS data loading
        - It is recommended to continue additive manual curation of this file for each release rather than re-running the datasets module and repeating full curation unless necessary
The entire workflow is captured within `main.py` for simplicity and reproducibility. Functions used within `main.py` are defined in scripts within the `modules/` directory to allow for additional processing steps if needed in the future.
In general, TSV output files are intended for ingestion into INS, while CSV files are used as inputs, intermediates, or for troubleshooting/reporting.
```
INS-Data
├── data/
│   ├── 00_input/
│   │   ├── dbgap/
│   │   │   ├── gpa_tables/
│   │   │   │   ├── gpa_doc_lookup_table.csv
│   │   │   │   └── gpa_study_table.csv
│   │   │   └── study_{dbgap_version}.csv
│   │   ├── icite/
│   │   │   └── {version}/                    # Not git-controlled
│   │   │       └── icite_metadata.zip
│   │   └── qualtrics/
│   │       └── qualtrics_output_{version}_{type}.csv
│   ├── 01_intermediate/
│   │   ├── {qualtrics version}/
│   │   │   ├── {gathered version}/
│   │   │   │   ├── temp_pubmed_chunkfiles/   # Not git-controlled
│   │   │   │   │   └── Partial PubMed files for iterative loading
│   │   │   │   ├── grant.csv
│   │   │   │   ├── icitePMIDData.csv
│   │   │   │   ├── mergedPMIDData.csv
│   │   │   │   ├── project.csv
│   │   │   │   ├── projectPMIDs.csv
│   │   │   │   └── publication.csv
│   │   │   ├── invalidNofoReport_reviewed.csv
│   │   │   └── program.csv
│   │   └── dbgap/
│   │       └── {dbgap_version}/
│   │           ├── dbgap_datasets.csv
│   │           ├── dbgap_sstr.json
│   │           └── dbgap_study_metadata.json
│   └── 02_output/
│       └── {qualtrics version}/
│           └── {gathered version}/
│               ├── grant.tsv
│               ├── program.tsv
│               ├── project.tsv
│               └── publication.tsv
├── images/
│   ├── ins-data-repo-diagram.drawio
│   └── ins-data-repo-diagram.png
├── modules/
│   ├── build_validation_file.py
│   ├── gather_grant_data.py
│   ├── gather_program_data.py
│   ├── gather_project_data.py
│   ├── gather_publication_data.py
│   ├── package_output_data.py
│   └── summary_statistics.py
├── reports/
│   └── {qualtrics version}/
│       ├── {gathered version}/
│       │   ├── packagingReports/
│       │   │   ├── duplicate_{type}.csv
│       │   │   ├── {type}_enums.txt
│       │   │   └── removedEarlyPublications.csv
│       │   ├── INS_DataValidation_Generated_{date}.xlsx
│       │   ├── mismatchedProjectValuesReport.csv
│       │   ├── grantsStatsByProgram.csv
│       │   ├── removedPublicationsReport.csv
│       │   └── sharedProjectsByProgramPair.csv
│       ├── invalidAwardReport_raw.csv
│       ├── invalidNofoReport_corrected.csv
│       ├── invalidNofoReport_manual_fix.csv
│       └── invalidNofoReport_raw.csv
├── .env                                      # Not git-controlled
├── .gitignore
├── config.py
├── environment.yaml
├── main.py
└── README.md
```
During the PMID gathering step, you may see an error of the form:

```
Received a 500 error for <project_id>. Retrying after 2 seconds. Attempt <n>/5
```

This error is caused by intermittent issues with the NIH RePORTER API. When we previously encountered this issue, we sent a message to NIH RePORTER support and they fixed the problem.
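The retry behavior behind this log message can be sketched as a simple backoff loop. This is a hypothetical helper for illustration, not the repository's exact code.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url: str, data: bytes,
                     attempts: int = 5, delay: int = 2) -> bytes:
    """Retry transient 5xx responses, mirroring the log message above."""
    for attempt in range(1, attempts + 1):
        try:
            request = urllib.request.Request(url, data=data)
            with urllib.request.urlopen(request, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise client errors and the final failed attempt
            if err.code < 500 or attempt == attempts:
                raise
            print(f"Received a {err.code} error. "
                  f"Retrying after {delay} seconds. Attempt {attempt}/{attempts}")
            time.sleep(delay)
```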
During the PubMed data gathering step, you may see an error of the form:

```
Error fetching information for PMID <pmid>: list index out of range
```

This error can be safely ignored. The out-of-range list indices come from NIH RePORTER returning old PMIDs. These PMIDs are generally from before the year 2000 and would be discarded later in the script even if they were gathered.
If the NIH RePORTER API is unresponsive, it may be receiving too many requests at a time. Consider running this data gathering script when the API is likely to receive fewer requests, such as overnight.