-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto-detect column and row coordinates from niid data #175
Conversation
Similar to vidrl autodetect PR: #157
c89136d
to
47c06e8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see all the work you put into titer block pays off for processing NIID data as well 🌟
I only left non-blocking comments so feel free to merge if this works well with previous Excel files.
m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE) | ||
if m is None: | ||
m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE) | ||
serum_strain = "" | ||
if m: | ||
serum_strain = m.group(1) | ||
if not serum_strain.startswith(flutype + "/"): | ||
serum_strain = flutype + "/" + serum_strain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Seems like this should be replaced by the serum_map
, am I missing the reason why the serum map cannot be used?
m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE) | |
if m is None: | |
m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE) | |
serum_strain = "" | |
if m: | |
serum_strain = m.group(1) | |
if not serum_strain.startswith(flutype + "/"): | |
serum_strain = flutype + "/" + serum_strain | |
serum_strain = serum_mapping[serum_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly I just wasn't confident my mapping was matching the behavior of the original mapping. For example:
envdir ~/nextstrain/env.d/seasonal-flu/ \
python tdb/niid_upload.py \
-db niid_tdb \
--virus flu \
--subtype h1n1pdm \
--assay_type hi \
--path /Users/jchang3/nextstrain/fludata/NIID-Tokyo-WHO-CC/raw-data/A/H1N1pdm/HI/2024/ \
--fstem H1pdm09_HIdata\(20241003\) \
--ftype niid \
--preview &>> my_log/niid_H1pdm09_HIdata\(20241003\).txt
titer_parse.py gave serum_mapping values like:
serum_mapping = {
'Wisconsin/67/22CellNo.111': 'A/Wisconsin/67/2022',
'Victoria/4897/22EggNo.121': 'A/Victoria/4897/2022',
'Victoria/4897/22(IVR-238)No.112': 'A/Victoria/4897/2022 IVR-238',
'Norway/25089/22CellNo.152': 'A/Norway/25089/2022',
'Sydney/5/21CellNo.131': 'A/Sydney/5/2021',
'India/PUN-NIV323546/21CellNo.112': 'A/India/PUN-NIV323546/2021',
'KANAGAWA/IC1920/20CellNo.111': 'A/KANAGAWA/IC1920/2020',
'OKINAWA/93/19CellNo.171': 'A/OKINAWA/93/2019',
}
While checking the original parsing of serum_names (2nd column) in visidata, gave serum names like:
visidata data/tmp/H1pdm09_HIdata\(20241003\).tsv
#> Example, only showing 2nd column of serum_names
#> A/Wisconsin/67/22
#> A/Victoria/4897/22
#> A/Victoria/4897/22(
#> A/Norway/25089/22
#> A/Sydney/5/21
#> A/India/PUN-NIV323546/21
#> A/KANAGAWA/IC1920/20
#> A/OKINAWA/93/19
Therefore I relied on the original behavior code, till I had time to dig further into this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, agree this can wait till later!
From first glance, the serum mapping includes the complete strain name so it could be switched over. However, I just noticed the A/Victoria/4897/2022 IVR-238
strain, which needs to drop IVR-238
🤔
serum_passage_pattern = r".+(Egg|Cell).+" | ||
serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*" | ||
crick = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Since the returned serum_passage_row_idx
, serum_abbrev_row_idx
, and serum_mapping
are not being used, do we even need to set these patterns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! The serum_abbrev_row_idx
pattern is used to get the row coordinate. I miswrote that it was not being used:
[Edit after I read the comment more carefully
re: serum_abbrev_row_id
VIDRL data had separate serum_id
and serum_abbr
rows while NIID data combines them into one row. You're correct I could have set this to None
and rely on default pattern:
Lines 230 to 231 in 47c06e8
if serum_abbrev_pattern is None: | |
serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*" |
]
Regarding the others:
serum_passage_row_idx
: Could be optional, I left it in to potentially validate that the serum passage matches the same row asserum_abbrev_row_idx
and isn't missing.serum_mapping
: Could be optional, or eventually refined to become the official NIID serum mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should have been clearer in my comment. I was confused by the serum_passage_pattern
and serum_abbrev_pattern
because that the actual data parsing was using completely separate patterns (serum strain patterns and passage patterns).
Simplifies the log messages when parsing NIID excel data. Since serum_mapping is not being used, don't print it out. Instead be explicit about how the serum strain and passage are parsed from the serum id. Co-authored-by: Jover Lee <joverlee521@gmail.com>
Description of proposed changes
This PR extends the functionality for autodetecting titer blocks (previously implemented for VIDRL Excel files) to NIID excel files. By replacing the reliance on hardcoded row and column start and stop coordinates, this enhancement provides greater flexibility and reduces the need for frequent updates when file formats change.
Basically changing from hardcoded row and columns code:
fauna/tdb/niid_upload.py
Lines 70 to 89 in ecce160
To automatically detecting the titer block
fauna/tdb/niid_upload.py
Lines 54 to 66 in 47c06e8
This PR reuses the
titer_block.py
functions introduced during #157, which automatically detect titer coordinates in VIDRL Excel files. However, it substitutes NIID-specific pattern matching logic to identify key features:fauna/tdb/niid_upload.py
Lines 41 to 47 in 47c06e8
During the update of the
niid_upload.py
script, noticed that certain fields -- such as serum strain, ID, and passage --- undergo additional NIID-specific processing compared to VIDRL files. Therefore, this PR preserves the original NIID-specific looping behavior for parsing virus, serum, passage, and ID data, without introducing changes to those portions.This script was successfully used to process NIID files on 20 December 2024 https://github.com/fludata/NIID-Tokyo-WHO-CC/commit/7d766375a17433969d7fa00190f89b4b6ee4f9c3, the results manually checked, and no processing issues detected.
Related issue(s)
vidrl_upload
to auto-detect titer info #157Checklist