Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Util fix #105

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Util fix #105

wants to merge 6 commits into from

Conversation

srandall02
Copy link
Collaborator

No description provided.

@mindoftea mindoftea self-requested a review September 6, 2023 16:35
@mindoftea
Copy link
Collaborator

Sorry for the delay in writing back, but like we talked about this is looking good. The logic for handling combinations of mindate/maxdate/ndays to select the right date range seems complete. At this point, I think the only remaining challenge is to combine your date-filtering code with the prevalence-filtering logic from the original piece of code and test by comparing the two locally -- the behavior should be the same when min_date and max_date aren't specified. The intended behavior for this version is that a lineage will be grouped into 'other' iff it is (a) not in keep_lineages, and (b) occurs with a prevalence greater than the prevalence_threshold for at least nday_threshold days within the date range.

@srandall02 srandall02 marked this pull request as ready for review October 2, 2023 04:42
keep_lineages.append(lineages_to_retain)
date_limit = dt.strptime(max_date, "%Y-%m-%d") - timedelta(days=ndays) # searches from max_date to ndays back
df = df[(df["prevalence"] >= prevalence_threshold) & (df['date'] < max_date) & (df['date'] > date_limit)]
num_unique_dates = df[df["date"] >= date_limit]["date"].unique().shape[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've already filtered the df at this point, I think you can skip the [df["date"] <= date_limit] part of this line and the two similar ones above, and then just have one line that does this after the if statement.

@@ -209,48 +209,28 @@ def get_major_lineage_prevalence(df, index_col = "date", min_date = None, max_da

df['prevalence'] = df['total_count']/df['lineage_count']
df = df.sort_values(by="date") #Sort date values
min_date = dt.strptime(min_date, "%Y-%m-%d")
max_date = dt.strptime(max_date, "%Y-%m-%d")


if min_date and max_date:
df = df[(df["date"].between(min_date, max_date)) & (df["prevalence"] >= prevalence_threshold)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just want to filter by date here, and not by prevalence just yet. Later, you're removing the lineages which don't have enough days above the prevalence threshold, but we still want to return data for low-prevalence days for lineages that are above the threshold.


if num_unique_dates < nday_threshold:
nday_threshold = round((nday_threshold/ndays) * num_unique_dates)
lineage_counts = df["lineage"].value_counts() #number of times lineage is found in df
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put your prevalence threshold filter down here instead, you should get the same counts as your current version, but there won't be gaps in the dataframe on low prevalence days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants