The RADx Data Hub Content reporter is a Python package used to aggregate counting statistics on the contents of the RADx Data Hub. At the basic level, the reporter tool dumps a report in JSON format that counts the number of studies belonging to controlled terms in several categories, including the RADx Program (DCC) and Study Domain. The reporter effectively performs map-reduce using ontology terms as keys.
- Python>=3.10
- openpyxl>=3.0.10
- numpy>=1.26,<2.0
- pandas>=2.1.4
- python-dateutile>=2.8.2
- xlsxwriter>=3.2.0
Running the reporter requires installing it as a python package. Using conda, run the following in the same directory as this README:
git clone https://github.com/bmir-radx/radx-reporter.git
cd radx-reporter
conda create -n radxreporter python=3.10 -y
conda activate radxreporter
pip install -r requirements.txt
pip install .
Once installed, the program can be executed using the radx-study-metadata-reporter
command. It expects arguments for the input file and sheet name of the XLSX spreadsheet to parse:
radx-study-metadata-reporter -i 2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx -s "RADx Study Metadata Summary"
Alternatively, the content reporter can be used programmatically by importing the module.
from radx_reporter import reporter
import pandas as pd
# load a dataframe from file or provide one from a database query
dataframe = pd.read_excel("2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx", sheet_name="RADx Study Metadata Summary")
# provide the dataframe and an optimal report name
reporter.Reporter.basic_report(dataframe, report_name="report")
Additional fields are also supported for reporting.
from radx_reporter import reporter
import pandas as pd
# load a dataframe from file or provide one from a database query
dataframe = pd.read_excel("2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx", sheet_name="RADx Study Metadata Summary")
# supply any extra columns to report on. the required columns (above)
# are assumed to be present
extra_columns = ["NIH GRANT NUMBER", "FOA NUMBER"]
# provide the dataframe and any extra columns
reporter.Reporter.basic_report(dataframe, extra_columns, report_name="report")
The reporter aggregates statistics for categories Program
, NIH Institute
, Collection Method
, Study Design
, Population Range
, Data Type
, and Study Domain
using controlled terms for each. These statistics are extracted from study metadata available in the RADx Data Hub. The CLI for the reporter takes an Excel spreadsheet as input. It performs the following steps in sequence:
- Conversion of the target Excel spreadsheet to a Pandas Dataframe.
- Mapping of studies to ontology terms that characterize them.
- Aggregation of study counts over the ontology terms.
- Generation of bar charts for visualization.
The reporter can also be used programmatically, in which case, step 1 can be skipped by providing a dataframe of the necessary data directly.
The Dataframe requires the following columns:
- STUDY STATUS
- STUDY PROGRAM
- NIH INSTITUTE OR CENTER
- DATA COLLECTION METHOD
- STUDY DESIGN
- ESTIMATED COHORT SIZE
- DATA TYPES
- STUDY DOMAIN
- STUDY PHS
- STUDY POPULATION FOCUS
- STUDY STATUS
Additional columns are supported, but these contain coded terms that point directly to facets in the RADx Data Hub Study Explorer.
There are two possible output formats for the basic report.
- Excel spreadsheet, in which each tab summarizes aggregation results over the controlled terms for each category, e.g., a tab for
Program
, in which each row has the study count and PHS IDs attributed to each DCC. - JSON serialized (perhaps are more convenient format for use by another Data Hub component).
In the language of the Data Hub's search engine, each of the categories is a "name" and each controlled term for the category is a "facet." The search URL for studies that belong to a controlled term are automatically generated. It would be cool for search links to be published alongside the statistics for each controlled term.
Example: for the Program
category and DCC RADx-UP
, the search URL is https://radxdatahub.nih.gov/studyExplorer?&facets=%5B%7B%22name%22:%22dcc%22,%22facets%22:%5B%22RADx-rad%22%5D%7D%5D