-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
export tip frequencies #83
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an awesome idea. My only concern is with backwards compatibility of the frequency JSONs. See inline comments for more.
base/auspice_export.py
Outdated
@@ -39,6 +39,23 @@ def process_freqs(freq): | |||
else: | |||
process.log.notify("Cannot export frequencies - pivots do not exist") | |||
|
|||
def export_tip_frequency_json(process, prefix, indent): | |||
def process_freqs(freq): # 6dp here, 4 above | |||
return [round(x,6) for x in freq] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for rounding to 6 instead of 4. Instead of defining a replacement function inside a function, we should just define a constant at the top of the module and use the same value and process_freqs
in all exports for consistency of code and data. Maybe FREQUENCY_PRECISION = 6
or EXPORT_PRECISION
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base/auspice_export.py
Outdated
if clade in clade_tip_map: | ||
freq_json[clade_tip_map[clade]] = process_freqs(freq) | ||
|
||
write_json(freq_json, prefix+'_tipfrequencies.json', indent=indent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably just be "_tip_frequencies.json" to avoid run-on file names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we can no longer use _
to easily split filenames. Can be tip-frequencies
if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base/auspice_export.py
Outdated
|
||
for clade, freq in process.tree_frequencies["global"].iteritems(): | ||
if clade in clade_tip_map: | ||
freq_json[clade_tip_map[clade]] = process_freqs(freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to switch the format of the JSON to indexing frequencies by clade name instead of clade id? Along those same lines, can maintain information about which region the frequencies are from in the data themselves?
It would be great if the tip frequencies JSON was backwards compatible with the complete JSON, so code that refers to one could refer to the other without any changes. Annotating the region will make the data more interpretable to anyone who is just looking at the JSON and it will also allow us to export tip frequencies for multiple regions in the future (which is a likely goal, I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base/auspice_export.py
Outdated
|
||
clade_tip_map = {n.clade: n.name for n in process.tree.tree.get_terminals()} | ||
|
||
for clade, freq in process.tree_frequencies["global"].iteritems(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of looping through all the clades here, you can just loop through the tips and use tip.clade
as the key to get the corresponding frequencies from process.tree_frequencies["global"][tip.clade]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are up on |
Also, let's future proof this and add a
We'll make augur smart about weighting these later, but I think important to have in the data model. |
I don't think separating regions is the correct way to go here. We should discuss. |
Final
The proper |
This property was introduced with the original frequencies work¹ as an anticipated need², but it was never used. Omit it for now to avoid carrying around unnecessary baggage; it can be added back in the future easily if its time comes. I uncovered this while authoring a JSON Schema for the tip-frequencies format.³ ¹ In PR #497 as a7bda1e. ² nextstrain/augur#83 (comment) ³ nextstrain/augur#852
what
A
tipfrequencies.JSON
is exported at the same time asfrequencies.JSON
. This is a dictionary with keyspivots
(same as frequencies) as well as each strain name -> frequencies at pivots. The frequencies are 6dp as opposed to 4dp to reduce error.why
Auspice frequencies uses this information. This is a subset of the larger frequencies JSON to improve auspice fetch time.