Auto-detect column and row coordinates from niid data #175

j23414 · 2024-12-20T15:04:22Z

Description of proposed changes

This PR extends the functionality for autodetecting titer blocks (previously implemented for VIDRL Excel files) to NIID excel files. By replacing the reliance on hardcoded row and column start and stop coordinates, this enhancement provides greater flexibility and reduces the need for frequent updates when file formats change.

Basically changing from hardcoded row and columns code:

fauna/tdb/niid_upload.py

Lines 70 to 89 in ecce160

    
           if subtype == "h3n2": 
        
               serum_id_row_index = 5 #5 
        
               start_row = 7 
        
               virus_id_col_index = 1 
        
               start_col = 4 
        
           elif subtype == "h1n1pdm": 
        
               serum_id_row_index = 5 
        
               start_row = 6 
        
               virus_id_col_index = 1 
        
               start_col = 4 
        
           elif subtype == "vic": 
        
               serum_id_row_index = 4 
        
               start_row = 5 
        
               virus_id_col_index = 1 
        
               start_col = 4 
        
           elif subtype == "yam": 
        
               serum_id_row_index = 3 
        
               start_row = 5 
        
               virus_id_col_index = 1 
        
               start_col = 4

To automatically detecting the titer block

fauna/tdb/niid_upload.py

Lines 54 to 66 in 47c06e8

    
           # autodetecting titer, virus, serum blocks 
        
           titer_block = find_titer_block(worksheet) 
        
           if len(titer_block["col_start"]) == 0: 
        
               print("No titer block found.") 
        
               break 
        
           titer_coords = { 
        
               'col_start': titer_block["col_start"][0][0], 
        
               'col_end': titer_block["col_end"][0][0], 
        
               'row_start': titer_block["row_start"][0][0], 
        
               'row_end': titer_block["row_end"][0][0] 
        
           }

This PR reuses the titer_block.py functions introduced during #157, which automatically detect titer coordinates in VIDRL Excel files. However, it substitutes NIID-specific pattern matching logic to identify key features:

fauna/tdb/niid_upload.py

Lines 41 to 47 in 47c06e8

    
           # Set NIID patterns 
        
           virus_pattern = r"[A-Z]/[\w\s-]+/.+/\d{4}" 
        
           virus_passage_pattern = r"(MDCK|SIAT|E\d+|hCK)" 
        
           serum_id_pattern = r".+(No\.|no\.).+" 
        
           serum_passage_pattern = r".+(Egg|Cell).+" 
        
           serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*" 
        
           crick = False

During the update of the niid_upload.py script, noticed that certain fields -- such as serum strain, ID, and passage --- undergo additional NIID-specific processing compared to VIDRL files. Therefore, this PR preserves the original NIID-specific looping behavior for parsing virus, serum, passage, and ID data, without introducing changes to those portions.

This script was successfully used to process NIID files on 20 December 2024 https://github.com/fludata/NIID-Tokyo-WHO-CC/commit/7d766375a17433969d7fa00190f89b4b6ee4f9c3, the results manually checked, and no processing issues detected.

Related issue(s)

Refactor vidrl_upload to auto-detect titer info #157

Checklist

Checks pass

Similar to vidrl autodetect PR: #157

…l script

joverlee521

Glad to see all the work you put into titer block pays off for processing NIID data as well 🌟

I only left non-blocking comments so feel free to merge if this works well with previous Excel files.

tdb/niid_upload.py

joverlee521 · 2025-01-04T00:37:07Z

tdb/niid_upload.py

+                    m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE)
+                    if m is None:
+                        m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE)
+                    serum_strain = ""
+                    if m:
+                        serum_strain = m.group(1)
+                        if not serum_strain.startswith(flutype + "/"):
+                            serum_strain = flutype + "/" + serum_strain


non-blocking

Seems like this should be replaced by the serum_map, am I missing the reason why the serum map cannot be used?

Suggested change

m = re.search(r'^(\S+)(egg|cell|siat|hck|nib121|ivr|\(bvr)', serum_id, re.IGNORECASE)

if m is None:

m = re.search(r'^(\S+)(no\.)', serum_id, re.IGNORECASE)

serum_strain = ""

if m:

serum_strain = m.group(1)

if not serum_strain.startswith(flutype + "/"):

serum_strain = flutype + "/" + serum_strain

serum_strain = serum_mapping[serum_id]

Mostly I just wasn't confident my mapping was matching the behavior of the original mapping. For example:

envdir ~/nextstrain/env.d/seasonal-flu/ \ python tdb/niid_upload.py \ -db niid_tdb \ --virus flu \ --subtype h1n1pdm \ --assay_type hi \ --path /Users/jchang3/nextstrain/fludata/NIID-Tokyo-WHO-CC/raw-data/A/H1N1pdm/HI/2024/ \ --fstem H1pdm09_HIdata\(20241003\) \ --ftype niid \ --preview &>> my_log/niid_H1pdm09_HIdata\(20241003\).txt

titer_parse.py gave serum_mapping values like:

serum_mapping = { 'Wisconsin/67/22CellNo.111': 'A/Wisconsin/67/2022', 'Victoria/4897/22EggNo.121': 'A/Victoria/4897/2022', 'Victoria/4897/22(IVR-238)No.112': 'A/Victoria/4897/2022 IVR-238', 'Norway/25089/22CellNo.152': 'A/Norway/25089/2022', 'Sydney/5/21CellNo.131': 'A/Sydney/5/2021', 'India/PUN-NIV323546/21CellNo.112': 'A/India/PUN-NIV323546/2021', 'KANAGAWA/IC1920/20CellNo.111': 'A/KANAGAWA/IC1920/2020', 'OKINAWA/93/19CellNo.171': 'A/OKINAWA/93/2019', }

While checking the original parsing of serum_names (2nd column) in visidata, gave serum names like:

visidata data/tmp/H1pdm09_HIdata\(20241003\).tsv #> Example, only showing 2nd column of serum_names #> A/Wisconsin/67/22 #> A/Victoria/4897/22 #> A/Victoria/4897/22( #> A/Norway/25089/22 #> A/Sydney/5/21 #> A/India/PUN-NIV323546/21 #> A/KANAGAWA/IC1920/20 #> A/OKINAWA/93/19

Therefore I relied on the original behavior code, till I had time to dig further into this

Gotcha, agree this can wait till later!

From first glance, the serum mapping includes the complete strain name so it could be switched over. However, I just noticed the A/Victoria/4897/2022 IVR-238 strain, which needs to drop IVR-238 🤔

joverlee521 · 2025-01-04T00:59:22Z

tdb/niid_upload.py

+    serum_passage_pattern = r".+(Egg|Cell).+"
+    serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"
+    crick = False


non-blocking

Since the returned serum_passage_row_idx, serum_abbrev_row_idx, and serum_mapping are not being used, do we even need to set these patterns?

Good question! ~~The serum_abbrev_row_idx pattern is used to get the row coordinate. I miswrote that it was not being used:~~

[Edit after I read the comment more carefully
re: serum_abbrev_row_id VIDRL data had separate serum_id and serum_abbr rows while NIID data combines them into one row. You're correct I could have set this to None and rely on default pattern:

fauna/tdb/titer_block.py

Lines 230 to 231 in 47c06e8

if serum_abbrev_pattern is None:

serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"

]

Regarding the others:

serum_passage_row_idx: Could be optional, I left it in to potentially validate that the serum passage matches the same row as serum_abbrev_row_idx and isn't missing.

serum_mapping: Could be optional, or eventually refined to become the official NIID serum mapping.

Sorry, I should have been clearer in my comment. I was confused by the serum_passage_pattern and serum_abbrev_pattern because that the actual data parsing was using completely separate patterns (serum strain patterns and passage patterns).

Simplifies the log messages when parsing NIID excel data. Since serum_mapping is not being used, don't print it out. Instead be explicit about how the serum strain and passage are parsed from the serum id. Co-authored-by: Jover Lee <joverlee521@gmail.com>

j23414 changed the title ~~wip: Auto-detect column and row coordinates from niid data~~ Auto-detect column and row coordinates from niid data Jan 2, 2025

j23414 added 2 commits January 2, 2025 15:57

Autodetect titer block for niid xlsx processing

55d5495

Similar to vidrl autodetect PR: #157

Comment that niid serum mapping has not been changed from the origina…

47c06e8

…l script

j23414 force-pushed the auto-detect-niid branch from c89136d to 47c06e8 Compare January 3, 2025 00:02

j23414 marked this pull request as ready for review January 3, 2025 00:25

joverlee521 self-requested a review January 3, 2025 23:05

joverlee521 approved these changes Jan 4, 2025

View reviewed changes

NIID: simplify log messages

22c8253

Simplifies the log messages when parsing NIID excel data. Since serum_mapping is not being used, don't print it out. Instead be explicit about how the serum strain and passage are parsed from the serum id. Co-authored-by: Jover Lee <joverlee521@gmail.com>

j23414 merged commit 9c5b8dc into master Jan 6, 2025

j23414 deleted the auto-detect-niid branch January 6, 2025 17:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-detect column and row coordinates from niid data #175

Auto-detect column and row coordinates from niid data #175

j23414 commented Dec 20, 2024 •

edited

Loading

joverlee521 left a comment

joverlee521 Jan 4, 2025

j23414 Jan 6, 2025 •

edited

Loading

joverlee521 Jan 6, 2025

joverlee521 Jan 4, 2025

j23414 Jan 6, 2025 •

edited

Loading

joverlee521 Jan 6, 2025

	if subtype == "h3n2":
	serum_id_row_index = 5 #5
	start_row = 7
	virus_id_col_index = 1
	start_col = 4
	elif subtype == "h1n1pdm":
	serum_id_row_index = 5
	start_row = 6
	virus_id_col_index = 1
	start_col = 4
	elif subtype == "vic":
	serum_id_row_index = 4
	start_row = 5
	virus_id_col_index = 1
	start_col = 4
	elif subtype == "yam":
	serum_id_row_index = 3
	start_row = 5
	virus_id_col_index = 1
	start_col = 4

	# autodetecting titer, virus, serum blocks
	titer_block = find_titer_block(worksheet)

	if len(titer_block["col_start"]) == 0:
	print("No titer block found.")
	break

	titer_coords = {
	'col_start': titer_block["col_start"][0][0],
	'col_end': titer_block["col_end"][0][0],
	'row_start': titer_block["row_start"][0][0],
	'row_end': titer_block["row_end"][0][0]
	}

	# Set NIID patterns
	virus_pattern = r"[A-Z]/[\w\s-]+/.+/\d{4}"
	virus_passage_pattern = r"(MDCK\|SIAT\|E\d+\|hCK)"
	serum_id_pattern = r".+(No\.\|no\.).+"
	serum_passage_pattern = r".+(Egg\|Cell).+"
	serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"
	crick = False

	if serum_abbrev_pattern is None:
	serum_abbrev_pattern = r"\w+\s{0,1}\w+/\d+.*"

Auto-detect column and row coordinates from niid data #175

Auto-detect column and row coordinates from niid data #175

Conversation

j23414 commented Dec 20, 2024 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

joverlee521 left a comment

Choose a reason for hiding this comment

joverlee521 Jan 4, 2025

Choose a reason for hiding this comment

j23414 Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

joverlee521 Jan 6, 2025

Choose a reason for hiding this comment

joverlee521 Jan 4, 2025

Choose a reason for hiding this comment

j23414 Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

joverlee521 Jan 6, 2025

Choose a reason for hiding this comment

j23414 commented Dec 20, 2024 •

edited

Loading

j23414 Jan 6, 2025 •

edited

Loading

j23414 Jan 6, 2025 •

edited

Loading