RADx Data Hub Content Reporter

The RADx Data Hub Content reporter is a Python package used to aggregate counting statistics on the contents of the RADx Data Hub. At the basic level, the reporter tool dumps a report in JSON format that counts the number of studies belonging to controlled terms in several categories, including the RADx Program (DCC) and Study Domain. The reporter effectively performs map-reduce using ontology terms as keys.

Requirements

Python>=3.10
openpyxl>=3.0.10
numpy>=1.26,<2.0
pandas>=2.1.4
python-dateutile>=2.8.2
xlsxwriter>=3.2.0

Running the reporter

Running the reporter requires installing it as a python package. Using conda, run the following in the same directory as this README:

git clone https://github.com/bmir-radx/radx-reporter.git
cd radx-reporter

conda create -n radxreporter python=3.10 -y
conda activate radxreporter
pip install -r requirements.txt
pip install .

Command Line

Once installed, the program can be executed using the radx-study-metadata-reporter command. It expects arguments for the input file and sheet name of the XLSX spreadsheet to parse:

radx-study-metadata-reporter -i 2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx -s "RADx Study Metadata Summary"

Library

Alternatively, the content reporter can be used programmatically by importing the module.

from radx_reporter import reporter
import pandas as pd

# load a dataframe from file or provide one from a database query
dataframe = pd.read_excel("2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx", sheet_name="RADx Study Metadata Summary")

# provide the dataframe and an optimal report name
reporter.Reporter.basic_report(dataframe, report_name="report")

Additional fields are also supported for reporting.

from radx_reporter import reporter
import pandas as pd

# load a dataframe from file or provide one from a database query
dataframe = pd.read_excel("2024-07-29_RADx-DataHub-Metadata-Spreadsheet.xlsx", sheet_name="RADx Study Metadata Summary")

# supply any extra columns to report on. the required columns (above)
# are assumed to be present
extra_columns = ["NIH GRANT NUMBER", "FOA NUMBER"]

# provide the dataframe and any extra columns
reporter.Reporter.basic_report(dataframe, extra_columns, report_name="report")

Required Input

The reporter aggregates statistics for categories Program, NIH Institute, Collection Method, Study Design, Population Range, Data Type, and Study Domain using controlled terms for each. These statistics are extracted from study metadata available in the RADx Data Hub. The CLI for the reporter takes an Excel spreadsheet as input. It performs the following steps in sequence:

Conversion of the target Excel spreadsheet to a Pandas Dataframe.
Mapping of studies to ontology terms that characterize them.
Aggregation of study counts over the ontology terms.
Generation of bar charts for visualization.

The reporter can also be used programmatically, in which case, step 1 can be skipped by providing a dataframe of the necessary data directly.

The Dataframe requires the following columns:

STUDY STATUS
STUDY PROGRAM
NIH INSTITUTE OR CENTER
DATA COLLECTION METHOD
STUDY DESIGN
ESTIMATED COHORT SIZE
DATA TYPES
STUDY DOMAIN
STUDY PHS
STUDY POPULATION FOCUS
STUDY STATUS

Additional columns are supported, but these contain coded terms that point directly to facets in the RADx Data Hub Study Explorer.

Output

There are two possible output formats for the basic report.

Excel spreadsheet, in which each tab summarizes aggregation results over the controlled terms for each category, e.g., a tab for Program, in which each row has the study count and PHS IDs attributed to each DCC.
JSON serialized (perhaps are more convenient format for use by another Data Hub component).

In the language of the Data Hub's search engine, each of the categories is a "name" and each controlled term for the category is a "facet." The search URL for studies that belong to a controlled term are automatically generated. It would be cool for search links to be published alongside the statistics for each controlled term.

Example: for the Program category and DCC RADx-UP, the search URL is https://radxdatahub.nih.gov/studyExplorer?&facets=%5B%7B%22name%22:%22dcc%22,%22facets%22:%5B%22RADx-rad%22%5D%7D%5D

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
reports		reports
src/radx_reporter		src/radx_reporter
test		test
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RADx Data Hub Content Reporter

Requirements

Running the reporter

Command Line

Library

Required Input

Output

About

Releases

Packages

Languages

License

bmir-radx/radx-reporter

Folders and files

Latest commit

History

Repository files navigation

RADx Data Hub Content Reporter

Requirements

Running the reporter

Command Line

Library

Required Input

Output

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages