Util fix #105

srandall02 · 2023-08-30T02:39:14Z

No description provided.

mindoftea · 2023-09-11T17:03:55Z

Sorry for the delay in writing back, but like we talked about, this is looking good. The logic for handling combinations of min_date/max_date/ndays to select the right date range seems complete. At this point, I think the only remaining challenge is to combine your date-filtering code with the prevalence-filtering logic from the original code and test by comparing the two locally: the behavior should be the same when min_date and max_date aren't specified. The intended behavior for this version is that a lineage will be grouped into 'other' iff it is (a) not in keep_lineages, and (b) does not reach the prevalence_threshold on at least nday_threshold days within the date range.
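For reference, here is a minimal sketch of that rule as a plain predicate. It assumes the reading used in the later review comments, where a lineage is retained when it has enough days at or above the threshold, and it uses `>=` to match the comparison in the diff. The function name `goes_to_other` and the `daily_prevalence` argument are hypothetical, not part of util.py.

```python
def goes_to_other(lineage, daily_prevalence, keep_lineages,
                  prevalence_threshold, nday_threshold):
    """Return True if `lineage` should be relabelled as 'other'.

    daily_prevalence holds one prevalence value per day for this lineage,
    already restricted to the selected date range (an assumption of this sketch).
    """
    # (a) lineages the caller explicitly asked to keep are never grouped
    if lineage in keep_lineages:
        return False
    # (b) count the days on which the lineage reaches the threshold
    days_above = sum(1 for p in daily_prevalence if p >= prevalence_threshold)
    return days_above < nday_threshold


# A kept lineage stays named even if it is rare
print(goes_to_other("B.1", [0.10, 0.01], ["B.1"], 0.05, 2))
# Two days at or above 0.05 is enough when nday_threshold is 2
print(goes_to_other("X", [0.10, 0.20, 0.01], [], 0.05, 2))
# Only one such day is not enough
print(goes_to_other("X", [0.10, 0.01, 0.01], [], 0.05, 2))
```

Comparing the output of the old and new code on the same input with min_date and max_date left unset would be one way to check that both follow this rule.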

mindoftea · 2023-10-03T23:36:10Z

web/handlers/genomics/util.py

-    keep_lineages.append(lineages_to_retain)
+        date_limit = dt.strptime(max_date, "%Y-%m-%d") - timedelta(days=ndays) # searches from max_date to ndays back
+        df = df[(df["prevalence"] >= prevalence_threshold) & (df['date'] < max_date) & (df['date'] > date_limit)]
+        num_unique_dates = df[df["date"] >= date_limit]["date"].unique().shape[0]


Since you've already filtered the df by date at this point, I think you can skip the [df["date"] >= date_limit] part of this line and the two similar ones above, and then just have a single line that computes num_unique_dates after the if statement.
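A rough sketch of that shape, assuming the `date` column holds pandas datetimes: the window filter runs once, and the unique-date count is taken from the already-filtered frame with no second date mask. The helper name `filter_to_window` and the sample data are made up for illustration.

```python
from datetime import datetime as dt, timedelta

import pandas as pd


def filter_to_window(df, max_date, ndays):
    """Keep rows with date_limit < date < max_date (strict bounds, as in the diff)."""
    max_dt = dt.strptime(max_date, "%Y-%m-%d")
    date_limit = max_dt - timedelta(days=ndays)  # searches from max_date to ndays back
    return df[(df["date"] > date_limit) & (df["date"] < max_dt)]


# Hypothetical sample data
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-01", "2023-06-01", "2023-06-01", "2023-06-15", "2023-07-01"]
    ),
    "lineage": ["A", "A", "B", "A", "A"],
})

window = filter_to_window(df, "2023-07-01", ndays=60)

# One count after the branch; the frame is already restricted to the window
num_unique_dates = window["date"].nunique()
print(len(window), num_unique_dates)
```

`nunique()` here is equivalent to `.unique().shape[0]` in the diff.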

mindoftea · 2023-10-03T23:38:20Z

web/handlers/genomics/util.py

@@ -209,48 +209,28 @@ def get_major_lineage_prevalence(df, index_col = "date", min_date = None, max_da

    df['prevalence'] = df['total_count']/df['lineage_count']
    df = df.sort_values(by="date") #Sort date values
-    min_date = dt.strptime(min_date, "%Y-%m-%d")
-    max_date = dt.strptime(max_date, "%Y-%m-%d")
+

    if min_date and max_date:
        df = df[(df["date"].between(min_date, max_date)) & (df["prevalence"] >= prevalence_threshold)]


I think you just want to filter by date here, and not by prevalence just yet. Later, you're removing the lineages which don't have enough days above the prevalence threshold, but we still want to return data for low-prevalence days for lineages that are above the threshold.
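To illustrate the difference on made-up data (the lineage name and values below are hypothetical), this sketch compares the combined date-and-prevalence filter with a date-only filter. With the combined filter, the low-prevalence day drops out of the frame entirely; with the date-only filter it is still there to be returned.

```python
import pandas as pd

# Hypothetical data: one lineage, three days in June and one outside the range
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-03", "2023-08-01"]),
    "lineage": ["XBB.1.5"] * 4,
    "prevalence": [0.20, 0.01, 0.30, 0.25],
})
min_date, max_date = pd.Timestamp("2023-06-01"), pd.Timestamp("2023-06-30")
prevalence_threshold = 0.05

in_range = df["date"].between(min_date, max_date)  # inclusive on both ends by default

# Current version: the 2023-06-02 row is lost, leaving a gap in the series
date_and_prevalence = df[in_range & (df["prevalence"] >= prevalence_threshold)]

# Suggested version: every in-range row survives, including the low-prevalence day
date_only = df[in_range]

print(len(date_and_prevalence), len(date_only))
```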

mindoftea · 2023-10-03T23:39:53Z

web/handlers/genomics/util.py

-
+    if num_unique_dates < nday_threshold:
+        nday_threshold = round((nday_threshold/ndays) * num_unique_dates) 
+    lineage_counts = df["lineage"].value_counts() #number of times lineage is found in df


If you put your prevalence threshold filter down here instead, you should get the same counts as your current version, but there won't be gaps in the dataframe on low-prevalence days.
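One possible way to do that, sketched under the assumption that the frame has one row per lineage per day (so row counts equal day counts): apply the prevalence mask only to the frame used for counting, and leave `df` itself untouched. The function name `lineages_meeting_threshold` and the sample data are hypothetical.

```python
import pandas as pd


def lineages_meeting_threshold(df, prevalence_threshold, nday_threshold, ndays):
    """Return lineages with enough days at or above the prevalence threshold."""
    num_unique_dates = df["date"].nunique()
    # Scale the day requirement down when fewer dates are available, as in the diff
    if num_unique_dates < nday_threshold:
        nday_threshold = round((nday_threshold / ndays) * num_unique_dates)
    # The prevalence mask is applied to a temporary view used only for counting
    above = df[df["prevalence"] >= prevalence_threshold]
    lineage_counts = above["lineage"].value_counts()
    return lineage_counts[lineage_counts >= nday_threshold].index.tolist()


# Hypothetical data: A is common on two of three days, B is always rare
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-03"] * 2),
    "lineage": ["A"] * 3 + ["B"] * 3,
    "prevalence": [0.50, 0.02, 0.60, 0.01, 0.02, 0.01],
})

kept = lineages_meeting_threshold(df, prevalence_threshold=0.05,
                                  nday_threshold=2, ndays=3)
print(kept, len(df))
```

Because only the temporary `above` frame is masked, the low-prevalence row for A on 2023-06-02 remains in `df`.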

srandall02 added 2 commits August 24, 2023 21:03

added date params

1b98334

updating get_major_lineage_prevalence

546ad8b

srandall02 requested a review from gkarthik August 30, 2023 02:39

mindoftea self-requested a review September 6, 2023 16:35

srandall02 added 2 commits September 6, 2023 18:22

updating get_major_lineage_prevalence

64daf21

Merge remote-tracking branch 'origin/util_fix' into util_fix

0797d1f

fixed logic

233ee00

srandall02 marked this pull request as ready for review October 2, 2023 04:42

mindoftea reviewed Oct 3, 2023

final changes

849427b
