Frequencies #497

Merged
merged 27 commits into master from frequencies
Feb 23, 2018

jameshadfield commented Feb 9, 2018

This PR implements a frequency stream graph (the devil is in the details). cc @huddlej @trvrb @rneher

Availability:

git checkout frequencies
bash scripts/get_data.sh
python scripts/convert_augur_frequency_json.py # creates _pivots.json from _frequencies.json
npm run start:local

How frequencies are calculated

The global_clade:X entries in the frequencies JSON are extracted into a separate smaller JSON (called _pivots.json). For each tip in the tree (not internal branches) the frequencies (i.e. the array with n entries, where n is the number of pivots) are binned by the colorBy value (or the bounds if the scale is continuous). Non-visible tips (e.g. via filtering) are not included. For each of these bins, at each pivot point, the frequencies are summed and potentially normalized (see below). This allows the data to reflect the selected colorBy as well as any enabled filters.
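The binning and summing described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the tip fields `category` (the tip's colorBy value or bin), `visible`, and `frequencies` are hypothetical names chosen for the example, and the numbers are made up.

```python
from collections import defaultdict


def bin_frequencies(tips, n_pivots):
    """Sum per-tip frequency arrays within each colorBy bin.

    Tips that are not visible (e.g. filtered out) are skipped, so the
    result reflects both the colorBy and any active filters.
    """
    streams = defaultdict(lambda: [0.0] * n_pivots)
    for tip in tips:
        if not tip["visible"]:
            continue
        bin_totals = streams[tip["category"]]
        for i, value in enumerate(tip["frequencies"]):
            bin_totals[i] += value
    return dict(streams)


# Hypothetical data: four tips, two pivots, coloured by region.
tips = [
    {"category": "asia", "visible": True, "frequencies": [0.25, 0.5]},
    {"category": "asia", "visible": True, "frequencies": [0.25, 0.0]},
    {"category": "europe", "visible": True, "frequencies": [0.5, 0.25]},
    {"category": "europe", "visible": False, "frequencies": [0.0, 0.25]},
]
streams = bin_frequencies(tips, n_pivots=2)
```

Each key of `streams` corresponds to one stream in the graph, and each list holds that stream's height at every pivot.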

The frequencies panel

  • Displayed as a stream graph below entropy.
  • The streams represent the combined frequencies across pivots for each colorBy (i.e. each entry in the tree legend).
  • A tooltip on hover shows more information.
  • The data can be normalized (the values of each pivot point sum to 1) via the toggle (top right).
  • When forecasted frequencies are available these can easily be displayed.
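As a rough sketch of what the normalization toggle does, the snippet below rescales the streams so that the values at each pivot sum to 1. It assumes the streams are held as a mapping from a colorBy label to a list of per-pivot values; that layout and the function name `normalize` are assumptions for illustration, not the data structures used in this PR.

```python
def normalize(streams):
    """Rescale streams so that, at every pivot, the values sum to 1.

    Pivots where the total is zero are left at 0.0 to avoid dividing by zero.
    """
    if not streams:
        return {}
    n_pivots = len(next(iter(streams.values())))
    totals = [sum(values[i] for values in streams.values()) for i in range(n_pivots)]
    return {
        name: [v / t if t > 0 else 0.0 for v, t in zip(values, totals)]
        for name, values in streams.items()
    }


# Hypothetical streams whose pivots each sum to 0.5 before normalization.
normalized = normalize({"a": [0.25, 0.125], "b": [0.25, 0.375]})
```

Without normalization the total height at a pivot can fall below 1 when filters hide some tips; with it, every pivot fills the full height of the panel.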

Known bugs / things to do

  • genotype doesn't work
  • browser resize breaks things
  • styling needs improving
  • tips are visible outside of the selected date range if the branch intersects the visible range (see #473, "tree branch visibility is incorrect")
  • tips continue to have some visibility long after their collection date (potential augur bug, not sure). I've filtered out any values < 0.0002
  • rounding errors in the frequency JSON mean that totals for all tips don't sum to 1.
  • having a colorBy of clade / named clades (or some ability to select this) would result in these being different streams in the graph.
  • no legend
  • sometimes the stacking order is wrong

Screenshots

  • 12y coloured by date, all tips selected
    image

  • 12y coloured by antigenic advance, all tips selected
    image

  • same data but restricted to asia
    image

  • 12y coloured by region, all tips selected
    image

trvrb commented Feb 9, 2018

@jameshadfield:

This is so cool. Thanks for getting this up. I'm totally sold this is a good direction. I want to dig in more, but I had a couple immediate thoughts:

  • I think it would look more natural with stacking flipped. For example, the blue advanced clade (or stream) should start at the bottom of the panel and grow upwards.

image

  • The display of 12y date makes it look like there is an error somewhere:

image

This shows the 2008-2009 stream as persisting to 2018 at 5%. This could be an issue in the underlying frequency calculations of course, but still should be figured out.

jameshadfield commented Feb 9, 2018

@trvrb looking at something like http://localhost:4000/flu/h3n2/ha/12y?c=num_date&dmax=2014-02-23&m=num_date (i.e. max date Feb 2014), there are hundreds of tips that have >0.0001 frequency at the final pivot. Here are a few...

Pivot 2018.17 strain A/Maracay/FLA6546/2009 (clade # 927 ) carried frequency of 0.0002
Pivot 2018.17 strain A/Wisconsin/13/2008 (clade # 1171 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Perth/5/2008 (clade # 1172 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Auckland/121/2008 (clade # 1174 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Sydney/2/2008 (clade # 1176 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Christchurch/19/2008 (clade # 1178 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Auckland/104/2008 (clade # 1180 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Darwin/1/2008 (clade # 1181 ) carried frequency of 0.0021
Pivot 2018.17 strain A/Toyama/123/2008 (clade # 1184 ) carried frequency of 0.0017
Pivot 2018.17 strain A/Japan/AF1456/2008 (clade # 1185 ) carried frequency of 0.0017
Pivot 2018.17 strain A/Singapore/631/2008 (clade # 1303 ) carried frequency of 0.0023
Pivot 2018.17 strain A/Japan/WRAIR1059P/2008 (clade # 1304 ) carried frequency of 0.0023
Pivot 2018.17 strain A/Boston/97/2008 (clade # 1306 ) carried frequency of 0.0012
Pivot 2018.17 strain A/Kansas/UR07-0110/2008 (clade # 1308 ) carried frequency of 0.0002
Pivot 2018.17 strain A/Kuwait/WRAIR1561P/2009 (clade # 1311 ) carried frequency of 0.0032
Pivot 2018.17 strain A/Tanger/533/2009 (clade # 1312 ) carried frequency of 0.0032
Pivot 2018.17 strain A/Brisbane/14/2009 (clade # 1315 ) carried frequency of 0.0023
Pivot 2018.17 strain A/Shizuoka-C/52/2008 (clade # 1316 ) carried frequency of 0.0023
Pivot 2018.17 strain A/Novosibirsk/707/2009 (clade # 1319 ) carried frequency of 0.0023
Pivot 2018.17 strain A/Novosibirsk/628/2009 (clade # 1320 ) carried frequency of 0.0023
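A listing like the one above could be produced with a check along these lines. This is a sketch under assumptions, not the script that was actually run: the tip fields (`strain`, `clade`, `frequencies`), the function name, and the sample values are all hypothetical.

```python
def report_lingering_tips(tips, pivots, threshold=0.0001):
    """List tips whose frequency at the final pivot exceeds the threshold."""
    final_pivot = pivots[-1]
    lines = []
    for tip in tips:
        value = tip["frequencies"][-1]
        if value > threshold:
            lines.append(
                f"Pivot {final_pivot:.2f} strain {tip['strain']} "
                f"(clade # {tip['clade']} ) carried frequency of {value:.4f}"
            )
    return lines


# Hypothetical tips: only the first still carries frequency at the last pivot.
pivots = [2014.0, 2018.17]
tips = [
    {"strain": "strain_A", "clade": 1, "frequencies": [0.01, 0.0021]},
    {"strain": "strain_B", "clade": 2, "frequencies": [0.02, 0.0]},
]
lines = report_lingering_tips(tips, pivots)
```

Running the same check against tips already hidden by the date filter is one way to tell whether the lingering values come from the frequency JSON itself or from the visibility logic.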

image

jameshadfield commented

Streams rise up in frequency from the bottom in 6acacd3
image

jameshadfield commented

closes #357

jameshadfield commented

The stacking order is now determined by the rise over time, which looks great for genotypes, but a bit weird for date (probably due to the above issues).

I'm going to stop development on this for a while until the validity of the tip frequencies is sorted out. Latest version is up on https://auspice-dev.herokuapp.com/flu

trvrb mentioned this pull request Feb 20, 2018

trvrb commented Feb 21, 2018

Thanks for the adjustments @jameshadfield. Regarding stacking order, it seems most natural to me to keep it in the same order as the color legend. For continuous colorBys like epitope it makes a lot more sense to have it cleanly ordered by value. Right now epitope looks funny to me:

epitope-frequencies

I see why you did this for genotypes as currently legend ordering doesn't correspond to appearance, like so:

genotype-at-159

I would suggest keeping frequencies tied exactly to legend ordering, but fixing legend ordering for genotypes to reflect when particular genotypes were first observed.
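One way to read the suggestion about genotype legend ordering is sketched below: sort genotypes by the earliest date among the tips that carry them. The tip field `genotype` and the function name are hypothetical; `num_date` mirrors the attribute that appears in the URL earlier in this thread, and the dates are made up.

```python
def legend_order_by_first_observed(tips):
    """Order genotypes by the earliest num_date among tips carrying them.

    Ties are broken alphabetically so the order is deterministic.
    """
    first_seen = {}
    for tip in tips:
        genotype, date = tip["genotype"], tip["num_date"]
        if genotype not in first_seen or date < first_seen[genotype]:
            first_seen[genotype] = date
    return sorted(first_seen, key=lambda g: (first_seen[g], g))


# Hypothetical tips coloured by the residue at a single site.
tips = [
    {"genotype": "Y", "num_date": 2014.5},
    {"genotype": "F", "num_date": 2009.1},
    {"genotype": "S", "num_date": 2016.0},
    {"genotype": "Y", "num_date": 2013.2},
]
order = legend_order_by_first_observed(tips)
```

If the streams are then stacked in this same order, the legend and the graph stay in step, with the earliest genotype at one end.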

jameshadfield commented Feb 21, 2018

@trvrb order now the same as tree legend. Here's the screenshot now.

image

jameshadfield merged commit 4fb549e into master Feb 23, 2018
jameshadfield deleted the frequencies branch February 23, 2018 22:55
jameshadfield mentioned this pull request Feb 24, 2018
tsibley added a commit that referenced this pull request Feb 15, 2022
This property was introduced with the original frequencies work¹ as an
anticipated need², but it was never used.  Omit it for now to avoid
carrying around unnecessary baggage; it can be added back in the future
easily if its time comes.

I uncovered this while authoring a JSON Schema for the tip-frequencies
format.³

¹ In PR #497 as a7bda1e.
² nextstrain/augur#83 (comment)
³ nextstrain/augur#852