Skip to content

Conversation

@jingjtang
Copy link
Contributor

@jingjtang jingjtang commented Jan 14, 2022

Description

  • Add age group specific signals for Quidel Covid
    • Use the age group breakdown: 0-4, 5-17, 18-49, 50-64, 65+, 0-17(supergroup)
    • Stick with the current 50 availability threshold for 1) consistency with what we have published 2) statistical consideration 3) not big difference if lower the threshold
  • Add megacountyies to Quidel Covid
    • Megacounties are added before the availability threshold checking.
    • For raw signals, counties with counties <= 50 will be merged together as megacounties
    • For smoothed signals, counties with counties <= 25 will be merged together as megacounties
    • After getting the megacounties, if the count for a megacounty does not pass the availability threshold, we will not report anything for this megacounty.
  • Live docs here for availability checking and correlation analysis

Changelog

Itemize code/test/documentation changes and files added/removed.

  • constants.py, pull.py, run.py, geo_map.py, generate_sensor.py
  • tests/test_constants.py, tests/test_pull.py, tests/test_run.py, tests/test_geo_map.py, tests/test_generate_sensor.py, tests/test_data

Fixes

@jingjtang jingjtang requested a review from krivard January 14, 2022 14:42
@jingjtang
Copy link
Contributor Author

We need new signals names for age-group specific siganls.
The current ones:

  • covid_ag_raw_pct_positive_age_0to4
  • covid_ag_raw_pct_positive_age_5to17
  • covid_ag_raw_pct_positive_age_18to49
  • covid_ag_raw_pct_positive_age_50to64
  • covid_ag_raw_pct_positive_age_65toOlder
  • covid_ag_raw_pct_positive
  • covid_ag_smoothed_pct_positive_age_0to4
  • covid_ag_smoothed_pct_positive_age_5to17
  • covid_ag_smoothed_pct_positive_age_18to49
  • covid_ag_smoothed_pct_positive_age_50to64
  • covid_ag_smoothed_pct_positive_age_65toOlder
  • covid_ag_smoothed_pct_positive

Need @RoniRos @ryantibs @krivard 's approval for the signal names.

@krivard
Copy link
Contributor

krivard commented Jan 14, 2022

We need new signals names for age-group specific siganls.

Let me find what we're using in covid_hosp; we should be consistent where possible

@krivard
Copy link
Contributor

krivard commented Jan 14, 2022

Recommended set:

  • covid_ag_raw_pct_positive_age_0to4 -> covid_ag_raw_pct_positive_age_0_4
  • covid_ag_raw_pct_positive_age_5to17 -> covid_ag_raw_pct_positive_age_5_17
  • covid_ag_raw_pct_positive_age_18to49 -> covid_ag_raw_pct_positive_age_18_49
  • covid_ag_raw_pct_positive_age_50to64 -> covid_ag_raw_pct_positive_age_50_64
  • covid_ag_raw_pct_positive_age_65toOlder -> covid_ag_raw_pct_positive_age_65plus
  • covid_ag_raw_pct_positive
  • covid_ag_smoothed_pct_positive_age_0to4 -> covid_ag_smoothed_pct_positive_age_0_4
  • covid_ag_smoothed_pct_positive_age_5to17 -> covid_ag_smoothed_pct_positive_age_5_17
  • covid_ag_smoothed_pct_positive_age_18to49 -> covid_ag_smoothed_pct_positive_age_18_49
  • covid_ag_smoothed_pct_positive_age_50to64 -> covid_ag_smoothed_pct_positive_age_50_64
  • covid_ag_smoothed_pct_positive_age_65toOlder -> covid_ag_smoothed_pct_positive_age_65plus
  • covid_ag_smoothed_pct_positive

@ryantibs
Copy link
Member

Katie's last naming scheme: 👍 from me.

@jingjtang
Copy link
Contributor Author

As suggested by @RoniRos, we can add another two super age groups: 0-17; 18-64, so that we have information for those small age groups like 0-4 added but not bothered too much by the available threshold

@RoniRos
Copy link
Member

RoniRos commented Jan 14, 2022 via email

@RoniRos
Copy link
Member

RoniRos commented Jan 14, 2022 via email

@jingjtang
Copy link
Contributor Author

jingjtang commented Jan 14, 2022

@ryantibs @RoniRos
location_nums_over_time

The new availability is shown here. I think those two super age groups are good to be added.

@RoniRos
Copy link
Member

RoniRos commented Jan 14, 2022 via email

@ryantibs
Copy link
Member

ryantibs commented Jan 14, 2022

It's quite hard for me to tell based on the current plots whether the tradeoff is actually worth it. I suspect it may not be. The easiest & most direct way to tell would be to add up the availability of the subgroups within each combined group

Eg, just compare the availability of 0_4 + 5_17 to 0_17 (so just comparing two lines), and likewise for the other case.

@RoniRos
Copy link
Member

RoniRos commented Jan 14, 2022 via email

@ryantibs
Copy link
Member

Yes sorry, my "+" was supposed to be an "OR". So I think we are saying equivalent things.

@RoniRos
Copy link
Member

RoniRos commented Jan 15, 2022

Sorry to be a pest, but I still don't understand. Why would you want to do an OR, rather than add up the two differences I listed as an example? If there is a county were neither 0-4 nor 5-17 is available, arguably it should count twice as much as a county where only one of them is unavailable. No?

Anyway, what are the use cases you envision for these signals?

@RoniRos
Copy link
Member

RoniRos commented Jan 15, 2022

One could argue that when, say 0-17 and 5-17 are available but 0-4 is not, one could derive an estimate of 0-4 from the other two, based on their proportion in the population (since we don't give them their proportion among the tests). So the case when neither 0-4 nor 5-17 are available is by far the worst.

@ryantibs
Copy link
Member

Not a pest, I just wasn't reading/thinking carefully! Yes, you're right, your message is alluding to something different.

But we wouldn't be able to find out from your counts whether in a particular county we have what you call the "worst case" (neither 0-4 nor 5-17 are available), since it's just marginal counts, right? And yes I buy your argument. So I guess the following four numbers would be useful:

How many counties that do not support a 0-4 signal, support a 0-17 signal.
How many counties that do not support a 5-17 signal, support a 0-17 signal.
How many counties that do not support a 0-4 signal OR do not support a 5-17 signal, support a 0-17 signal.
How many counties that do not support a 0-4 signal AND do not support a 5-17 signal, support a 0-17 signal.

@jingjtang Can you do this, and do the same for the other category 18-64, so that we can get a sense of the value ad?

@jingjtang
Copy link
Contributor Author

jingjtang commented Jan 15, 2022

@RoniRos We initially added the availability checking for all-age groups here, but that curve for county only is removed later ONLY for a more clear looking at the bottom area for the age-specific signals.

In case you are not able to jump to the slack thread, the fig is attached below
image

(The first legend should be 0-4, not 0-5)

@RoniRos
Copy link
Member

RoniRos commented Jan 15, 2022

Ok.
From the curves Jingjing published so far, it's clear there is a non-negligible number of counties "rescued" by the merging. But we still don't have a clear formula to help us decide whether or not to include these merged groups.

I just browsed some of CDC's data and found many places were they present age breakdowns as 0-17,18-49,50-64,65+. E.g. https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/burden.html

So my suggested is that we at least include 0-17.

@jingjtang
Copy link
Contributor Author

@ryantibs
0_17 but not 0_4: Available for 0-17 but not available for 0-4
0_17 but not 5_17: Available for 0-17 but not available for 5-17
0_17 but not OR: Available for 0-17 but not available for either 0-4 or 5-17
0_17 but not AND: Available for 0-17 but not available for both 0-14 and 5-17

Similar ones for 18-64

comparison_with_supergroups

@ryantibs
Copy link
Member

@RoniRos @jingjtang Thanks!

I'm convinced we should add 0-17. We can just make it clear in our documentation that we also wanted to separate this out into two subcategories duet to the fact that 0-4 is unvaccinated population. This would be a clean explanation as to why there is overlap/nesting in the lowest 3 categories (0-4, 5-17, 0-17).

I'm less convinced we should add 18-64. Rationale: if we rescue any counties here with this big bucket, then the signal we publish is probably going to be very similar to the all-ages signal. (Because I'm guessing this big bucket includes the most common ages from a marginal perspective, and if signals at 18-49 and 50-64 are each missing, then it must means we're just right below the threshold we set for availability, and 18-64 puts us right above).

So I suggest we just publish 0-17, along with the existing breakdowns, and finish this off ASAP. However, not strongly opposed to also including 18-64, if Roni or somebody else feels differently. Thanks!

@jingjtang
Copy link
Contributor Author

jingjtang commented Jan 25, 2022

@ryantibs

  1. The plot titles say "< 25 counts for every consecutive 7 days". I just want to check to understand what you're plotting. You're plotting the sum of counts in 7-day rolling windows, and then reporting how many times that's < 25, right?
    Yes, you are correct. But to be more clear, the smoothed signals use the sum of counts in 7-day rolling windows to be sample size in our API, which is different from the raw signals.

And for the time series plot, the left most one, I checked how many counties have < 25 counts for each date (and if we switch to option2, those counties will be unavailable at all).

  1. In the above plots with Counties 21079 and 8107, the straight blue line is just a plotting artifact right? I assume the real signal value is missing there, and the plotting behavior is just to connect it by a straight line?

Yes, it is just a plotting artifact. In fact, we only have values reported at the two endpoints. All the dates in between are unavailable for that county. And remember that, we also have counties with only <10 days available due to geographical pooling (they has < 10 days with counts in [25, 50] but the rest of the days with counts < 25), I didn't show those examples since the figures are just dominated by the line of option1 but you can image how they look like. And we do have those counties, the numbers are shown in the middle one of the time series plot

@ryantibs
Copy link
Member

Thank you. I think option 2 is the winner in my mind, but will have a decision for by 5pm (just to be sure, will check to if Roni agrees with me at team-leads meeting).

@ryantibs
Copy link
Member

Hi @jingjtang, confirming that we should go with option 2. Thank you for working hard to compare the options.

@jingjtang
Copy link
Contributor Author

Thanks @ryantibs. @dshemetov The pipeline is ready for review with the new strategy. Described in details below for your reference:

  • As for smoothed signals:

    • If counts smaller than 50, we do geo shrinkage. We will borrow no more than the counts that we currently have. That is, for example, if county c has 2 counts, we can only borrow 2 counts at most as pseudo counts from its home state.

    • If counts larger than 50, we don't borrow anything.

    • The counts for smoothed signals are the sum of counts in the 7-day moving windows.

    • After the temporal pooling and geographical pooling, we still have an availability checking. If the number of counts is larger than 50, we report it; otherwise, we report nothing.

  • As for raw signals:

    • For counts smaller than 50, we report nothing.
    • For counts larger than 50, we report what we have.
    • The counts for raw signals are just the counts for the corresponding date (different from the smoothed signals)

@jingjtang jingjtang requested a review from dshemetov January 26, 2022 00:47
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a small bug, see the comments. Otherwise looks good!

# For raw signals, the threshold is MIN_OBS
# For smoothed signals, the threshold is MIN_OBS/2
if smooth:
threshold_visits = MIN_OBS/2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change in thresholding here?

@jingjtang jingjtang requested a review from dshemetov January 27, 2022 19:27
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

[(14+parent_pos*22/parent_test+0.5)/(50+1)*100,
(26+parent_pos*1/parent_test+0.5)/(50+1)*100,
(24+0.5)/(60+1)*100,
(32+0.5)/(64+1)*100], equal_nan=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent, thank you!

@jingjtang jingjtang requested a review from krivard January 28, 2022 18:16
@krivard
Copy link
Contributor

krivard commented Jan 28, 2022

Three questions:

  • Will this change require us to reissue any past data?
  • Does this change require any updates to the API documentation?
  • Are there any special instructions for the first run of the pipeline once this change is released?

@jingjtang
Copy link
Contributor Author

@krivard

  • Will this change require us to reissue any past data?
    Yes, but only needs deletion. We have have some counties for smoothed signals deleted. They will not be reported anymore. But for the counties that still pass the availability threshold, the values are unchanged.
  • Does this change require any updates to the API documentation?
    We can make an update to it to explain the current strategy more clearly. But the current one matches to the current code.
  • Are there any special instructions for the first run of the pipeline once this change is released?
    Yes. If we run all the data back to the very first date for Quidel(2020-05-26), the cache should be deleted before the first run.

By the way, since it's a time to re-run the past data, I want to ask whether we are able to extend the export range for Quidel from 45 days to 175 days? (If storage allows) This means for the daily run, we generate reports for the dates from -180 days to -5 days instead of -50 days to -5 days. The backfill related pipeline will benefit from this.

@ryantibs
Copy link
Member

Sorry if this is contributing noise, but I just want to double check --- apart from the new stricter way of pooling, and the new signals --- 1. that we'll get megacounties, and 2. that the national signal (and all signals) will take values for references dates all the way back to the start of the time period. These are things I've asked for on other issues or messages.

@jingjtang
Copy link
Contributor Author

jingjtang commented Jan 29, 2022

Sorry if this is contributing noise, but I just want to double check --- apart from the new stricter way of pooling, and the new signals --- 1. that we'll get megacounties, and 2. that the national signal (and all signals) will take values for references dates all the way back to the start of the time period. These are things I've asked for on other issues or messages.

@ryantibs Yes, we will have

  1. megacounties: same smoothing method for smoothed signals. And for raw signal, if the counts are less than 50, they are merged into a megacounty; for smoothed signal, if the counts are less than 25, they are merged into a megacounty.
  2. age-specific signals: 0-4, 5-17, 0-17, 18-49, 50-64, 65+
  3. If the engineering side allows, we will at least have one time run to generate all the reports back to the start (2020-05-26), and maybe not all the issues(@krivard @korlaxxalrok)
  4. new availability threshold for smoothed signals:
    considering the sample size after temporal pooling
  • <25: report nothing (different from the previous strategy)
  • [25, 50): geographical pooling
  • [50, +infinity): report what they are

for this PR.

@krivard
Copy link
Contributor

krivard commented Jan 31, 2022

Will this change require us to reissue any past data? -> Yes, but only needs deletion.

Cool. To do this we will need the full list of (issue, reference date, geo type, geo value) to delete.

Does this change require any updates to the API documentation? -> No

Acknowledged

Are there any special instructions for the first run of the pipeline once this change is released? -> Yes: delete the cache

Acknowledged

extend the export range for Quidel from 45 days to 175 days?

That's just shy of 4x, on top of the 7x we're adding with this PR -- I don't recommend it. Quidel currently adds ~110k rows to the database per day, out of a total ~2.8M rows added daily across all indicators. Here's some rough figures imagining the impact on the database if we did that (upper bound; i imagine the age group signals have lower coverage than all-ages and so are somewhat less than 7x)

scenario quidel rows added per day total rows added per day quidel pct of total
current 110k 2.8M 4%
add age groups 770k 3.6M 22%
extend export range 440k 3.1M 14%
age groups + export range 3.1M 5.8M 54%

I do notice that quidel isn't currently using the archive-differ, so some of that 110k rows might be duplicates. Do you happen to know the actual rate of new/updated figures vs duplicates in each submission?

@krivard
Copy link
Contributor

krivard commented Jan 31, 2022

Releasing this change has some subtle details -- here's a draft of our options. Pick:

  • (A or B) plus (C or D)
  • E

A: If we think the initial run will be fast (<40 minutes)

  1. Merge this PR
  2. Release covidcast-indicators
  3. On prod, clear the pulled_until cache. Swap out the production params file with one that will export from 2020-02 to present. Run the indicator. Swap the params file back to the production version.

B: If we think the initial run will be slow

  1. Merge this PR
  2. On staging, clear the pulled_until cache. Set the params file to export from 2020-02 to present. Run the indicator. Drop the resulting files off in production receiving.
  3. Release covidcast-indicators
  4. On prod, clear the pulled_until cache.

C: If we think the deletions will be fast (<4 hours)

Delete any time after A2 or B3

D: If we think the deletions will be slow

Do the deletions need to be in place relatively simultaneously with release? Is it bad for the database to spend multiple days with deletions only partially applied?

Di: if we need to be rigorous

Schedule a 3-day patch period the same way we did with the JHU patch last fall. Start A1 or B1 on day one of the patch. Complete A/B before concluding the patch period.

Dii: if we can be more flexible

Delete in batches daily until done. Could take place before, during, or after A/B, depending on if we want deletions finished before the new code runs, or if we want the new code to run before we finalize the set of deletions needed.

E: If we want archive-differ to handle deletions

If we don't care about deleting from past issues, archive-differ will automatically mark as deleted anything that switches from showing up in an output file to not-showing-up in an output file. "Mark as deleted" means the data is still in the database, and will appear in as-of queries for as-ofs before the deletion date, but will not appear in most-recent queries or in as-of queries for as-ofs after the deletion date.

  1. Set up archive-differ bucket in S3
  2. Add archiver section to quidel params
  3. Initialize the archive bucket with the current snapshot: On staging, set the params file to export from 2020-02 to present. Run the indicator using indicator-runner. Drop the resulting files off in production receiving.
  4. Merge this PR
  5. Release covidcast-indicators
  6. On prod, clear the pulled_until cache. Swap out the production params file with one that will export from 2020-02 to present. Run the indicator. Swap the params file back to the production version.

@jingjtang
Copy link
Contributor Author

jingjtang commented Jan 31, 2022

@krivard

Do you happen to know the actual rate of new/updated figures vs duplicates in each submission?

I do not have that kind of data in hand. Do you want me to check that?

Choice between A and B: the runtime will be longer than 40 mins but ~60 mins or so, no more than 120 mins according to my experience.

Choice for deletion: I think Dii is better. We can have the new signals produced as soon as possible and have those corrections for the historical release.

We do need correction (deleting) for the past issues. So I won't choose E.

@krivard
Copy link
Contributor

krivard commented Feb 1, 2022

Do you happen to know the actual rate of new/updated figures vs duplicates in each submission?

I do not have that kind of data in hand. Do you want me to check that?

Yes please; just for a week or two worth of data should be fine.

@jingjtang
Copy link
Contributor Author

@krivard Here is a table showing the percentage of 0 diff

  • compare the val reported today with the val reported yesterday.
  • zero_diff_pct = 90, means on issue_date D for a specific signal, we have 90% of the reported val unchanged across all counties considered and across all dates reported (-50 days to -5 days)
  • If yesterday's report for location i and date d is NA, but today's is not NA, then there is a non-zero diff.
  • issue_date ranging from 2021-12-21 to 2021-12-28

We can see that for the smallest geographical level (county), we have zero_diff_pct >= 80 for most of the signals. It's good to apply archive-differ to quidel_covidtest too.
quidel_zero_diff_pct.csv

@krivard
Copy link
Contributor

krivard commented Feb 2, 2022

Excellent news!

So if the update to the smoothing means we expect to drop 15-50% of new rows we currently report each day for quidel,
and adding archive-differ means we expect to drop 80-90% of new rows we currently report each day for quidel,
that gives us a 5x-20x total reduction depending on how those two sets overlap.

Let's install archive-differ, release this, see where we are, and then revisit expanding the export range once we have a better idea of where we wind up. That'll look like E without the initialization step, or:

  1. Set up archive-differ bucket in S3 Done: quidel
  2. Add archiver section to quidel params
  "archive": {
    "aws_credentials": {
      "aws_access_key_id": "{{ delphi_aws_access_key_id }}",
      "aws_secret_access_key": "{{ delphi_aws_secret_access_key }}"
    },
    "bucket_name": "delphi-covidcast-indicator-output",
    "cache_dir": "./archivediffer_cache",
    "indicator_prefix": "quidel"
  }
  1. Merge this PR
  2. Release covidcast-indicators
  3. On prod, add the archivediffer_cache directory. Clear the pulled_until cache. Swap out the production params file with one that will export from 2020-02 to present. Run the indicator. Swap the params file back to the production version.

@jingjtang
Copy link
Contributor Author

jingjtang commented Feb 6, 2022

@krivard File created to provide deletion info: https://cmu.box.com/s/y41vjkj8z8cjvaqgtz8g5phagms1i3t0
Support code here: https://github.com/cmu-delphi/covidcast-indicators/blob/quidel_deletion/quidel_covidtest/delphi_quidel_covidtest/quidel_deletion_info.py

I did the sanity check by

  • randomly check the value and stderr to see whether they match what we have in our API
  • the sample size should be all 50
  • the number of the reports for each date at county level seems reasonable.

But it's still better to have another one to help do another sanity check for this.

@krivard krivard merged commit ea31835 into main Feb 8, 2022
@krivard krivard deleted the Add_Age_Group_to_QuidelCovidtest branch February 8, 2022 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update Quidel Covid pipeline

5 participants