Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-detect column and row coordinates from niid data #175

Merged
merged 3 commits into from
Jan 6, 2025

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Dec 20, 2024

Description of proposed changes

This PR extends the functionality for autodetecting titer blocks (previously implemented for VIDRL Excel files) to NIID excel files. By replacing the reliance on hardcoded row and column start and stop coordinates, this enhancement provides greater flexibility and reduces the need for frequent updates when file formats change.

Basically changing from hardcoded row and columns code:

fauna/tdb/niid_upload.py

Lines 70 to 89 in ecce160

if subtype == "h3n2":
serum_id_row_index = 5 #5
start_row = 7
virus_id_col_index = 1
start_col = 4
elif subtype == "h1n1pdm":
serum_id_row_index = 5
start_row = 6
virus_id_col_index = 1
start_col = 4
elif subtype == "vic":
serum_id_row_index = 4
start_row = 5
virus_id_col_index = 1
start_col = 4
elif subtype == "yam":
serum_id_row_index = 3
start_row = 5
virus_id_col_index = 1
start_col = 4

To automatically detecting the titer block

fauna/tdb/niid_upload.py

Lines 54 to 66 in 47c06e8

# autodetecting titer, virus, serum blocks
titer_block = find_titer_block(worksheet)
if len(titer_block["col_start"]) == 0:
print("No titer block found.")
break
titer_coords = {
'col_start': titer_block["col_start"][0][0],
'col_end': titer_block["col_end"][0][0],
'row_start': titer_block["row_start"][0][0],
'row_end': titer_block["row_end"][0][0]
}

This PR reuses the titer_block.py functions introduced during #157, which automatically detect titer coordinates in VIDRL Excel files. However, it substitutes NIID-specific pattern matching logic to identify key features:

fauna/tdb/niid_upload.py

Lines 41 to 47 in 47c06e8

# Set NIID patterns
virus_pattern = r"[A-Z]/[\w\s-]+/.+/\d{4}"
virus_passage_pattern = r"(MDCK|SIAT|E\d+|hCK)"
serum_id_pattern = r".+(No\.|no\.).+"
serum_passage_pattern = r".+(Egg|Cell).+"
serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"
crick = False

During the update of the niid_upload.py script, noticed that certain fields -- such as serum strain, ID, and passage --- undergo additional NIID-specific processing compared to VIDRL files. Therefore, this PR preserves the original NIID-specific looping behavior for parsing virus, serum, passage, and ID data, without introducing changes to those portions.

This script was successfully used to process NIID files on 20 December 2024 https://github.com/fludata/NIID-Tokyo-WHO-CC/commit/7d766375a17433969d7fa00190f89b4b6ee4f9c3, the results manually checked, and no processing issues detected.

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 changed the title wip: Auto-detect column and row coordinates from niid data Auto-detect column and row coordinates from niid data Jan 2, 2025
@j23414 j23414 marked this pull request as ready for review January 3, 2025 00:25
@joverlee521 joverlee521 self-requested a review January 3, 2025 23:05
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see all the work you put into titer block pays off for processing NIID data as well 🌟

I only left non-blocking comments so feel free to merge if this works well with previous Excel files.

tdb/niid_upload.py Outdated Show resolved Hide resolved
Comment on lines +142 to +149
m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE)
if m is None:
m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE)
serum_strain = ""
if m:
serum_strain = m.group(1)
if not serum_strain.startswith(flutype + "/"):
serum_strain = flutype + "/" + serum_strain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Seems like this should be replaced by the serum_map, am I missing the reason why the serum map cannot be used?

Suggested change
m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE)
if m is None:
m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE)
serum_strain = ""
if m:
serum_strain = m.group(1)
if not serum_strain.startswith(flutype + "/"):
serum_strain = flutype + "/" + serum_strain
serum_strain = serum_mapping[serum_id]

Copy link
Contributor Author

@j23414 j23414 Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I just wasn't confident my mapping was matching the behavior of the original mapping. For example:

envdir ~/nextstrain/env.d/seasonal-flu/ \
  python tdb/niid_upload.py \
  -db niid_tdb \
  --virus flu \
  --subtype h1n1pdm \
  --assay_type hi \
  --path /Users/jchang3/nextstrain/fludata/NIID-Tokyo-WHO-CC/raw-data/A/H1N1pdm/HI/2024/ \
  --fstem H1pdm09_HIdata\(20241003\) \
  --ftype niid \
  --preview &>> my_log/niid_H1pdm09_HIdata\(20241003\).txt

titer_parse.py gave serum_mapping values like:

serum_mapping = {
    'Wisconsin/67/22CellNo.111': 'A/Wisconsin/67/2022',
    'Victoria/4897/22EggNo.121': 'A/Victoria/4897/2022',
    'Victoria/4897/22(IVR-238)No.112': 'A/Victoria/4897/2022 IVR-238',
    'Norway/25089/22CellNo.152': 'A/Norway/25089/2022',
    'Sydney/5/21CellNo.131': 'A/Sydney/5/2021',
    'India/PUN-NIV323546/21CellNo.112': 'A/India/PUN-NIV323546/2021',
    'KANAGAWA/IC1920/20CellNo.111': 'A/KANAGAWA/IC1920/2020',
    'OKINAWA/93/19CellNo.171': 'A/OKINAWA/93/2019',
}

While checking the original parsing of serum_names (2nd column) in visidata, gave serum names like:

visidata data/tmp/H1pdm09_HIdata\(20241003\).tsv

#> Example, only showing 2nd column of serum_names
#> A/Wisconsin/67/22
#> A/Victoria/4897/22
#> A/Victoria/4897/22(
#> A/Norway/25089/22
#> A/Sydney/5/21
#> A/India/PUN-NIV323546/21
#> A/KANAGAWA/IC1920/20
#> A/OKINAWA/93/19

Therefore I relied on the original behavior code, till I had time to dig further into this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, agree this can wait till later!

From first glance, the serum mapping includes the complete strain name so it could be switched over. However, I just noticed the A/Victoria/4897/2022 IVR-238 strain, which needs to drop IVR-238 🤔

Comment on lines +45 to +47
serum_passage_pattern = r".+(Egg|Cell).+"
serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"
crick = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Since the returned serum_passage_row_idx, serum_abbrev_row_idx, and serum_mapping are not being used, do we even need to set these patterns?

Copy link
Contributor Author

@j23414 j23414 Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! The serum_abbrev_row_idx pattern is used to get the row coordinate. I miswrote that it was not being used:

[Edit after I read the comment more carefully
re: serum_abbrev_row_id VIDRL data had separate serum_id and serum_abbr rows while NIID data combines them into one row. You're correct I could have set this to None and rely on default pattern:

fauna/tdb/titer_block.py

Lines 230 to 231 in 47c06e8

if serum_abbrev_pattern is None:
serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"

]

Regarding the others:

  • serum_passage_row_idx: Could be optional, I left it in to potentially validate that the serum passage matches the same row as serum_abbrev_row_idx and isn't missing.
  • serum_mapping: Could be optional, or eventually refined to become the official NIID serum mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have been clearer in my comment. I was confused by the serum_passage_pattern and serum_abbrev_pattern because that the actual data parsing was using completely separate patterns (serum strain patterns and passage patterns).

Simplifies the log messages when parsing NIID excel data. Since serum_mapping is not being used, don't print it out. Instead be explicit about how the serum strain and passage are parsed from the serum id.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
@j23414 j23414 merged commit 9c5b8dc into master Jan 6, 2025
@j23414 j23414 deleted the auto-detect-niid branch January 6, 2025 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants