Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

+ delimited dual index barcodes, and arbitrary raw barcodes #62

Merged
merged 4 commits into from
Mar 4, 2024

Conversation

theJasonFan
Copy link
Collaborator

@theJasonFan theJasonFan commented Feb 28, 2024

Addresses #60.

We now term barcodes delimited by "+" a "semantic" barcode --- a representation that is a 1-1 mapping between a fixed length barcodes (e.g, AAA+CCC) and its read structure 3B3B.

Tasks

  • update per_sample_metrics.tsv and per_project_metrics.tsv
  • update sample_barcode_hop_,metrics.tsv
  • update most_frequent_unmatched.tsv

Details for per-sample metrics

When using sample sheets, dual index barcodes are reported in metrics with + delimiter.

For example, the sample sheet:

[Demux]
[Data]
Sample_ID,Index1_Sequence,Index2_Sequence
s1,CCCCC,AAAAA

yields a barcode for s1 that appear as CCCCC+AAAAA

When using two-column format arbitrary barcodes are allowed. For example:

Sample_ID,Sample_Barcode
s1,TTT+=!-AAA

yields a barcode for s1 that appear as TTT+=!-AAA.

Details for sample barcode hop and unmatched barcode metrics

Barcodes now also appear also in their delimited semantic representations

Tests

  • For extracting semantic barcode representations from SampleMetadata, the above mentioned behaviors are now tested and exampled in included explicit test cases.
  • Added end-to-end tests with small examples to check basic values in output per-sample and barcode hopping metrics.

Future Todos

  • Add release note and review updates to README
  • Design schema for displaying and reading multi-segment barcodes
  • Validate multi-segment barcodes. As reflected in the test and implementation, we allow for arbitrary barcodes.

@theJasonFan theJasonFan marked this pull request as ready for review February 28, 2024 23:12
@theJasonFan theJasonFan linked an issue Feb 28, 2024 that may be closed by this pull request
3 tasks
Copy link
Collaborator

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theJasonFan minor nitpicks, please fixup, then merge at your leisure.

@@ -436,10 +436,11 @@ impl DemuxedGroupSampleMetrics {
best_barcode_template_count: usize,
sample_metadata: &SampleMetadata,
) -> SampleMetricsProcessed {
let barcode = Self::get_metrics_barcode(sample_metadata);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to inline this below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined. Please re-review.

@@ -457,6 +458,19 @@ impl DemuxedGroupSampleMetrics {
/ self.base_qual_counter.index_bases_total_seen as f64,
}
}

fn get_metrics_barcode(sample_metadata: &SampleMetadata) -> String {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. might as well add some function documentation
  2. why not make this a method on SampleMetadata?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved. Please re-review. Note that moving this to SampleMetadata highlights how we should be better about validating the struct.

.map(|(barcode, count)| {
let b1 = BString::from(&barcode[..b1_len]);
let b2 = BString::from(&barcode[b1_len..]);
let barcode = BString::from(format!("{}+{}", b1, b2));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @jiangweiyao, where hopped barcodes get delimited.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Now we have two places where this formatting occurs, once in the sample metadata and once in the metric modules. Can you refactor to once place please and call from both?
  2. What happens if b2 is empty? I don't think this is doing the right thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I strongly agree that this is code-smell coming from different barcode representations. Finding a solution might take more time / effort.
  2. For hop metrics, we know that the reported barcodes can be split and delimited. It is only used for dual index barcodes. The SampleBarcodeHopTracker only updates counts for delimited and hopped barcodes. See:
    let index1a = &barcode[0..self.checker.index1_length];

src/lib/run.rs Outdated
let delim = DelimFile::default();
let hop_metrics: Vec<BarcodeCount> = delim.read_tsv(&hop_metrics).unwrap();
assert_eq!(hop_metrics.len(), 1);
assert_eq!(hop_metrics[0].barcode, "TTT+AAA");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @jiangweiyao This test demultiplexes a small example of dual indexed data where a single read hops a pair of sample barcodes. We check for that barcode in the output barcode_hop_metrics.tsv

assert_eq!(per_sample_metrics.len(), 3);

assert_eq!(per_sample_metrics[0].barcode_name, "s1");
assert_eq!(per_sample_metrics[0].barcode, "TTT+AAA");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @jiangweiyao This test demultiplexes a small example of dual indexed data. We check for the barcode in the output per_sample_metrics. Admittedly, the test is does not exhaustively check for all the metrics (a test for that is above in the source). The focus here is on (a) the actual delimited barcode and the # of matches.

assert_eq!(per_sample_metrics[1].templates, 0);

assert_eq!(per_sample_metrics[2].barcode_name, "Undetermined");
assert_eq!(per_sample_metrics[2].barcode, "NNNNNN");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @jiangweiyao note that the "undetermined" barcode is not delimited.

/// **Note**: this expects that `self` is a valid SampleMetdata struct. i.e., a after
/// an update and sanitization with `update_with_and_set_demux_barcode()`, or is a sentinal
/// value where the barcode is all Ns.
pub fn get_semantic_barcode(&self) -> BString {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 refactored this into SampleMetadta and added a docstring per your previous comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will want to use this in metrics.rs given you are creating the unmatched barcode there. Sorry, you may have to revert this to being not a method on samplemetadata

@@ -788,4 +811,51 @@ Sample2,GGGG
];
assert_eq!(actual, expected);
}
#[test]
fn test_get_semantic_barcode() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 Moved and renamed this test after your suggestion to move the method into SampleMetadata

Copy link
Collaborator Author

@theJasonFan theJasonFan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @jiangweiyao ready for re-review

@theJasonFan
Copy link
Collaborator Author

@jiangweiyao Also updated to use delimited barcodes in most_frequent_unmatched.tsv

src/lib/run.rs Outdated
let delim = DelimFile::default();
let unmatched_metrics: Vec<BarcodeCount> = delim.read_tsv(&unmatched_metrics).unwrap();
assert_eq!(unmatched_metrics.len(), 1);
assert_eq!(unmatched_metrics[0].barcode, "TTT+AAA");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also check the unmatched barcode

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to test unmatched metrics for a single-barcode (not dual) given the above comment

Copy link
Collaborator

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the unmatched delimiting of the barcode has a bug. Can you take a look?

.map(|(barcode, count)| {
let b1 = BString::from(&barcode[..b1_len]);
let b2 = BString::from(&barcode[b1_len..]);
let barcode = BString::from(format!("{}+{}", b1, b2));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Now we have two places where this formatting occurs, once in the sample metadata and once in the metric modules. Can you refactor to once place please and call from both?
  2. What happens if b2 is empty? I don't think this is doing the right thing.

src/lib/run.rs Outdated
let delim = DelimFile::default();
let unmatched_metrics: Vec<BarcodeCount> = delim.read_tsv(&unmatched_metrics).unwrap();
assert_eq!(unmatched_metrics.len(), 1);
assert_eq!(unmatched_metrics[0].barcode, "TTT+AAA");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to test unmatched metrics for a single-barcode (not dual) given the above comment

/// **Note**: this expects that `self` is a valid SampleMetdata struct. i.e., a after
/// an update and sanitization with `update_with_and_set_demux_barcode()`, or is a sentinal
/// value where the barcode is all Ns.
pub fn get_semantic_barcode(&self) -> BString {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will want to use this in metrics.rs given you are creating the unmatched barcode there. Sorry, you may have to revert this to being not a method on samplemetadata

Copy link
Collaborator Author

@theJasonFan theJasonFan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 added the test with single indexing run unmatched barcodes. Other end-to-end tests should cover other single indexing run metrics with matched barcodes.

Copy link
Collaborator

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@theJasonFan theJasonFan merged commit 80f50fb into main Mar 4, 2024
6 checks passed
@theJasonFan theJasonFan deleted the 60/jf/multi-barcode-metrics branch March 5, 2024 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement dual index barcodes, delimited by +.
2 participants