Util fix #105
base: master
Sorry for the delay in writing back, but like we talked about this is looking good. The logic for handling combinations of |
web/handlers/genomics/util.py
Outdated
keep_lineages.append(lineages_to_retain)
date_limit = dt.strptime(max_date, "%Y-%m-%d") - timedelta(days=ndays) # searches from max_date to ndays back
df = df[(df["prevalence"] >= prevalence_threshold) & (df['date'] < max_date) & (df['date'] > date_limit)]
num_unique_dates = df[df["date"] >= date_limit]["date"].unique().shape[0]
Since you've already filtered the df at this point, I think you can skip the [df["date"] >= date_limit] part of this line and the two similar ones above, and then just have one line that does this after the if statement.
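To illustrate the suggestion, here is a minimal sketch with made-up data. The column names follow the diff, but the sample values and the `with_refilter` variable are assumptions for the example only. Once the frame has been restricted to the date window, re-applying the lower bound selects the same rows, so the unique dates can be counted directly:

```python
from datetime import datetime as dt, timedelta

import pandas as pd

# Hypothetical inputs; the real values come from the handler's arguments.
max_date = dt.strptime("2021-03-10", "%Y-%m-%d")
ndays = 5
date_limit = max_date - timedelta(days=ndays)  # 2021-03-05

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-06", "2021-03-06", "2021-03-08", "2021-03-09"]
    ),
    "lineage": ["A", "A", "B", "A", "B"],
    "prevalence": [0.20, 0.10, 0.30, 0.40, 0.02],
})

# Restrict to the date window once.
df = df[(df["date"] < max_date) & (df["date"] > date_limit)]

# Re-filtering on the lower bound changes nothing at this point...
with_refilter = df[df["date"] >= date_limit]["date"].unique().shape[0]

# ...so a single direct count after the if statement gives the same number.
num_unique_dates = df["date"].nunique()
```

Both counts agree here because every remaining row already satisfies `date > date_limit`.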
web/handlers/genomics/util.py
Outdated
@@ -209,48 +209,28 @@ def get_major_lineage_prevalence(df, index_col = "date", min_date = None, max_da

df['prevalence'] = df['total_count']/df['lineage_count']
df = df.sort_values(by="date") #Sort date values
min_date = dt.strptime(min_date, "%Y-%m-%d")
max_date = dt.strptime(max_date, "%Y-%m-%d")

if min_date and max_date:
    df = df[(df["date"].between(min_date, max_date)) & (df["prevalence"] >= prevalence_threshold)]
I think you just want to filter by date here, and not by prevalence just yet. Later, you're removing the lineages which don't have enough days above the prevalence threshold, but we still want to return data for low-prevalence days for lineages that are above the threshold.
web/handlers/genomics/util.py
Outdated
if num_unique_dates < nday_threshold:
    nday_threshold = round((nday_threshold/ndays) * num_unique_dates)
lineage_counts = df["lineage"].value_counts() #number of times lineage is found in df
If you put your prevalence threshold filter down here instead, you should get the same counts as your current version, but there won't be gaps in the dataframe on low-prevalence days.
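A rough sketch of the ordering suggested in this thread, using made-up data. The column names come from the diff. The sample values, the `nday_threshold` of 2, and the `above`, `major`, and `result` names are assumptions for illustration, not the PR's actual code. The frame is filtered by date only, the prevalence threshold is applied on a temporary view just for counting, and every row is kept for lineages that pass:

```python
import pandas as pd

prevalence_threshold = 0.05
nday_threshold = 2  # hypothetical value for this example

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-02", "2021-03-03",
         "2021-03-01", "2021-03-02", "2021-03-03"]
    ),
    "lineage": ["A", "A", "A", "B", "B", "B"],
    "prevalence": [0.10, 0.01, 0.20, 0.02, 0.06, 0.01],
})
min_date = pd.Timestamp("2021-03-01")
max_date = pd.Timestamp("2021-03-03")

# Step 1: filter by date only (between() is inclusive on both ends).
df = df[df["date"].between(min_date, max_date)]

# Step 2: count days above the threshold per lineage on a temporary view,
# leaving df itself untouched.
above = df[df["prevalence"] >= prevalence_threshold]
lineage_counts = above["lineage"].value_counts()

# Step 3: keep every row, including low-prevalence days, for lineages
# with enough days above the threshold.
major = lineage_counts[lineage_counts >= nday_threshold].index
result = df[df["lineage"].isin(major)]
```

In this toy data, lineage A has two days above the threshold and is retained with all three of its rows, including the 0.01 day that an early prevalence filter would have dropped.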
No description provided.