How to Run Synthea Automation

Michael J Burling edited this page Aug 27, 2024 · 6 revisions

  1. Go to the Synthea Automation Jenkins Job
  2. Click "Build with Parameters" on the left side
  3. The parameters here control how the Synthea automation will generate and load data:
    • NUM_BENES: The desired number of beneficiaries in this dataset
      • Jenkins can only generate a limited number of beneficiaries in one batch. A single batch of up to 375,000 beneficiaries is attested to work and can take up to 19 hours to complete
      • If you need more beneficiaries than that, split the generation into multiple batches
    • NUM_FUTURE_MONTHS: If greater than 0, Synthea will generate some beneficiaries with claim creation dates up to that many months into the future. Those future claims are automatically partitioned into weekly loads and placed in the output as separate load folders. When placed in an environment's ETL Synthetic/Incoming folder, each future claim folder is loaded when its load date arrives, one per week for the number of months specified. If this is 0, no claims will have dates beyond the current date.
    • USE_TARGET_CONTRACT and TARGET_CONTRACT: If the dataset's use-case requires a known contract for the Part D Events, users can provide one using the TARGET_CONTRACT input. This value will be attached to all Part D Events when the USE_TARGET_CONTRACT checkbox is checked.
    • UPDATE_END_STATE_S3: In order to avoid collisions in fields which have unique database constraints, we keep a file called "end state" in S3. This file keeps track of the current last generated value of all synthetic constrained fields so we can continue incrementing them in the next load without conflicting with previous values. If this checkbox is checked, it will update the file with the load's new latest state.
      • This should be checked whenever you intend to load the generated data into any environment. Leave it unchecked only if you're reloading data that already exists (using idempotent mode in the pipeline) or creating a "test batch" to verify something without intending to load the data into BFD.
  4. Once the parameters are set as you'd like, click Build to begin the process
  5. Once the data is generated, the output will be automatically uploaded to S3 in the Output Directory
  6. To load a generated dataset into an environment, follow the instructions below that match whether the dataset was generated with a zero or non-zero NUM_FUTURE_MONTHS value:
  • For simple datasets (those generated with NUM_FUTURE_MONTHS set to 0), use the load-dataset.sh script to load the data in a single step. The invocation takes the form ./load-dataset.sh <target-bucket> <dataset>, for example ./load-dataset.sh bfd-3460-prod-etl20240603193725757700000001 generated-2024-08-06_12-07-34/
  • For datasets containing rolling claims or recurring beneficiaries (generated with a NUM_FUTURE_MONTHS value greater than 0), loading is only a little more involved:
    1. Verify (with the --dryrun flag) that the output of the following aws s3 sync between the specific folder for your dataset and the target ETL bucket is as you'd expect it:
    # Replace the <dataset> with the generated-yyyy-MM-dd_HH-mm-ss formatted folder name
    # Replace <target-bucket> with the desired ETL bucket name
    aws s3 sync "s3://bfd-mgmt-synthea/generated/<dataset>/output/" \
      s3://<target-bucket>/Synthetic/Incoming/ \
      --exclude "end_state*" \
      --exclude "missing_codes.csv" \
      --exclude "npi.tsv" \
      --exclude "export_summary.csv" \
      --exclude "bfd*" \
      --exclude "metadata*" \
      --dryrun
    2. Once verified, run the same command as above without the --dryrun flag:
    aws s3 sync "s3://bfd-mgmt-synthea/generated/<dataset>/output/" \
      s3://<target-bucket>/Synthetic/Incoming/ \
      --exclude "end_state*" \
      --exclude "missing_codes.csv" \
      --exclude "npi.tsv" \
      --exclude "export_summary.csv" \
      --exclude "bfd*" \
      --exclude "metadata*"
    3. When this is complete, identify the earliest timestamped folder for this dataset in the target ETL bucket. This folder contains both the claims data and the beneficiary demographics data that the later, separated loads from this dataset depend on.
    4. Rename (move) the manifest.xml object contained within that earliest timestamped folder to 0_manifest.xml. This sets the conditions for the ETL process to proceed in the given environment, e.g.
    # Replace <target-bucket> with the ETL bucket name you've partially loaded the data into from the previous steps
    # Replace <earliest-timestamped-dataset-folder> with the earliest, yyyy-MM-ddTHH:mm:ssZ formatted folder in this dataset
    aws s3 mv s3://<target-bucket>/Synthetic/Incoming/<earliest-timestamped-dataset-folder>/manifest.xml \
      s3://<target-bucket>/Synthetic/Incoming/<earliest-timestamped-dataset-folder>/0_manifest.xml
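
The last two steps above can be sketched in shell as follows. This is a minimal sketch, not part of the BFD tooling: earliest_folder and manifest_rename_command are hypothetical helper names, the bucket and folder names are made up, and it assumes the folder names use the fixed-width yyyy-MM-ddTHH:mm:ssZ format, so that a plain text sort is also a chronological sort. It only prints the aws s3 mv command rather than running it.

```shell
#!/bin/sh
# Pick the earliest load folder from a list of folder names on stdin.
# ASSUMPTION: names are fixed-width yyyy-MM-ddTHH:mm:ssZ timestamps, so a
# byte-wise sort orders them chronologically.
earliest_folder() {
  sed 's:/*$::' | grep -v '^$' | LC_ALL=C sort | head -n 1
}

# Print (not run) the rename command for the manifest in that folder.
# Arguments: <target-bucket> <earliest-timestamped-dataset-folder>
manifest_rename_command() {
  prefix="s3://$1/Synthetic/Incoming/$2"
  printf 'aws s3 mv %s/manifest.xml %s/0_manifest.xml\n' "$prefix" "$prefix"
}

# Example with made-up folder names. In practice the names would have to be
# extracted from a listing of the dataset's folders in the target bucket.
first=$(printf '%s\n' \
  '2024-08-20T12:07:34Z/' \
  '2024-08-06T12:07:34Z/' \
  '2024-08-13T12:07:34Z/' | earliest_folder)
manifest_rename_command 'example-etl-bucket' "$first"
```

Printing the command first leaves room to confirm that the chosen folder really belongs to this dataset, since Synthetic/Incoming may also hold folders from other loads.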

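For the NUM_BENES guidance in step 3, the batch arithmetic might look like the sketch below. split_batches is a hypothetical helper, and its 375,000 default is simply the attested figure quoted above, not a limit enforced by any tool.

```shell
#!/bin/sh
# Print one NUM_BENES value per line so that no batch exceeds the limit.
# Arguments: <total-beneficiaries> [max-per-batch, default 375000]
split_batches() {
  remaining=$1
  limit=${2:-375000}
  while [ "$remaining" -gt 0 ]; do
    if [ "$remaining" -gt "$limit" ]; then
      batch=$limit
    else
      batch=$remaining
    fi
    echo "$batch"
    remaining=$((remaining - batch))
  done
}

# Example: 1,000,000 beneficiaries split into three batches
# (375000, 375000, 250000).
split_batches 1000000
```

Each printed value would be entered as NUM_BENES for one run of the Jenkins job. Assuming every batch is destined for an environment, UPDATE_END_STATE_S3 would stay checked so that the next batch continues from the previous batch's end state.
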
Prod Load Additional Steps

For prod/prod-sbx loads, we need to publish the results to our consumers, so some additional steps are required.

  1. A characteristics file needs to be generated and published so that our partners know which beneficiary IDs and claims are available to use. A script for generating the characteristics file exists at beneficiary-fhir-data/ops/ccs-ops-misc/synthetic-data/scripts/synthea-automation/generate-characteristics-file.py
  2. Ensure you have Python 3 installed, along with the psycopg2 and boto3 Python libraries, as the script needs them
  3. Ensure you're connected to the VPN, as you'll need access to the database to run the script
  4. Run the script locally. It takes three parameters:
    • bene_id_start: this is the bene id the generation started at, which will be printed in the Jenkins log when you run the job
    • bene_id_end: this is the bene id the generation ended at, which will be printed in the Jenkins log when you run the job
    • output_path: the local directory the characteristics file should be written to
  5. Once the script has run, it should have written a file called characteristics.csv to the location you specified in output_path
  6. Create a folder in the bfd project under apps/bfd-model/bfd-model-rif-samples/dev/synthea_releases with the date of generation, and add this file to that location, then open a PR to merge this change so that the file is publicly available to our partners
  7. Next we need to update the github wiki page: Synthetic Data Guide
  8. On this page there are two spots we need to update:
    • The Available Synthetic Beneficiaries table, to which you should add a row with the date, the bene ranges, and a link to the characteristics file described above. Additionally, if this batch includes future updates, an additional column should specify for how many months the batch will update
    • The Release History table, which should describe the purpose of the synthetic batch along with any other relevant information
  9. Lastly, our partners should be made aware of the new data load. Post a message in the bfd-users chat informing them of the newly available data, with a link to the wiki page for additional information. Use the template below, updating the parts in angle brackets as needed. The second bracketed part applies only if the batch includes future data: keep its text and remove just the brackets if it does, and delete it otherwise. Note that the default update time is Wednesday at 7am.

BFD Synthetic data in prod-sbx and prod has been updated with <10,000> new beneficiaries<, which will update every Wednesday at 7am EST>. Information about the beneficiary ranges added can be found in the Synthetic Data Guide: https://github.com/CMSgov/beneficiary-fhir-data/wiki/Synthetic-Data-Guide
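
Filling in the template above can be sketched as a small shell function. The announcement helper and its yes/no argument are hypothetical conveniences; the wording itself is copied from the template.

```shell
#!/bin/sh
# Build the bfd-users announcement from the template.
# Arguments: <number-of-new-beneficiaries> <yes|no: batch includes future data>
announcement() {
  count=$1
  future_clause=''
  if [ "$2" = 'yes' ]; then
    # Only batches with future data get the weekly-update clause.
    future_clause=', which will update every Wednesday at 7am EST'
  fi
  printf 'BFD Synthetic data in prod-sbx and prod has been updated with %s new beneficiaries%s. ' \
    "$count" "$future_clause"
  printf 'Information about the beneficiary ranges added can be found in the Synthetic Data Guide: %s\n' \
    'https://github.com/CMSgov/beneficiary-fhir-data/wiki/Synthetic-Data-Guide'
}

# Example: a 10,000 beneficiary batch generated with future data.
announcement '10,000' yes
```

Passing no as the second argument drops the weekly-update clause, which matches a batch generated with NUM_FUTURE_MONTHS set to 0.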

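Going back to steps 2 through 4 of the prod load, checking the prerequisites and assembling the script invocation might look like the sketch below. Both helper names are hypothetical. It also assumes that the three parameters are passed positionally in the order listed in step 4, which the step 5 reference to "parameter 3" suggests but which this page does not confirm, so check the script itself before running anything.

```shell
#!/bin/sh
# Print the name of each missing prerequisite from step 2, one per line.
# Prints nothing when python3, psycopg2 and boto3 are all available.
missing_prereqs() {
  command -v python3 >/dev/null 2>&1 || { echo 'python3'; return 0; }
  for module in psycopg2 boto3; do
    python3 -c "import $module" >/dev/null 2>&1 || echo "$module"
  done
}

# Print (not run) the script invocation.
# ASSUMPTION: positional arguments in the order bene_id_start, bene_id_end,
# output_path. Verify against the script before use.
characteristics_command() {
  printf 'python3 generate-characteristics-file.py %s %s %s\n' "$1" "$2" "$3"
}

# Placeholders only; take the real bene id range from the Jenkins log.
characteristics_command '<bene_id_start>' '<bene_id_end>' './characteristics-output'
```

Calling missing_prereqs before running the script gives a quick list of anything still to install. It does not check VPN or database connectivity.
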
Troubleshooting (High Level)

  • Many steps of the automation process can fail during data generation or loading, and any such failure will fail the Jenkins job. If this occurs, investigate the failure to determine the next steps. If errors occur during generation, you may need to check the database or the parameters in the end-state.properties file. If errors occur during the pipeline load, you may need to move the newly generated load files from Synthetic/Incoming to Synthetic/Failed (just a holding area), restart the pipeline, and investigate the failure.
  • In the unlikely event of issues in the prod environments after a successful load in Test, keep in mind that you may need to investigate and manually re-run the data load to keep the environments consistent, or clean up/roll back the Test database.
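
Moving a failed load out of Synthetic/Incoming, as described in the first bullet, could be sketched like this. move_to_failed_command is a hypothetical helper and the bucket and folder names are made up. It only prints the command, and the printed command carries --dryrun so that it can be previewed before the flag is removed.

```shell
#!/bin/sh
# Print (not run) a command that moves one load folder from
# Synthetic/Incoming to Synthetic/Failed.
# Arguments: <target-bucket> <load-folder>
move_to_failed_command() {
  printf 'aws s3 mv --recursive s3://%s/Synthetic/Incoming/%s/ s3://%s/Synthetic/Failed/%s/ --dryrun\n' \
    "$1" "$2" "$1" "$2"
}

# Example with a made-up bucket and load folder.
move_to_failed_command 'example-etl-bucket' '2024-08-06T12:07:34Z'
```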