Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial fix of coordinate casting to short in MarkDuplicates #1442

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 7 additions & 6 deletions src/main/java/picard/sam/markduplicates/MarkDuplicates.java
Original file line number Diff line number Diff line change
Expand Up @@ -483,14 +483,15 @@ private void buildSortedReadEndLists(final boolean useBarcodes) {
log.info("Will retain up to " + maxInMemory + " data points before spilling to disk.");

final ReadEndsForMarkDuplicatesCodec fragCodec, pairCodec, diskCodec;
final double scale = OPTICAL_DUPLICATE_PIXEL_DISTANCE/(double) OpticalDuplicateFinder.DEFAULT_OPTICAL_DUPLICATE_DISTANCE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the typical use case this might actually make a difference since we are compressing the pixel distance by 2500/100 or 25 and then int scaling, which will end up introducing approximately a 1% truncation in every case. That is probably fine given how unlikely it is to substantially change the optical duplicate finding. I do not however think this should not be defined by OpticalDuplicateFinder.DEFAULT_OPTICAL_DUPLICATE_DISTANCE and we should fix some other variable instead since the error range shouldn't really have anything to do with the default pixel distance.

It would also seem prudent to ensure that we don't ever set a value below 1 here (i.e. we don't ever want to "inflate" the pixel coordinates if the user selects a pixel distance < 100).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to think of a scaling factor that keeps the default behaviour the same....I guess I could add an additional argument, but I thought this admitted hack is good enough....if there are other ideas, I'm happy to consider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable. I do think it seems bizzare that if we were to change the default to say 2500 (which I think is the standard for most of our pipelines) that we would be compressing by a factor of 25 less than we were before which could cause confusion of unintended consequences. I think we should make a separate variable that pegs the to 100 to make it a little clearer why we are doing it this way. I still think we should avoid ever scaling by fractional value however.

final double scale = Math.max(OPTICAL_DUPLICATE_PIXEL_DISTANCE/(double) OpticalDuplicateFinder.OPTICAL_DUPLICATE_PIXEL_SCALING_FACTOR, 1.0);

if (useBarcodes) {
fragCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec();
pairCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec();
diskCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec();
fragCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec(scale);
pairCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec(scale);
diskCodec = new ReadEndsForMarkDuplicatesWithBarcodesCodec(scale);
} else {
fragCodec = new ReadEndsForMarkDuplicatesCodec();
pairCodec = new ReadEndsForMarkDuplicatesCodec();
diskCodec = new ReadEndsForMarkDuplicatesCodec();
fragCodec = new ReadEndsForMarkDuplicatesCodec(scale);
pairCodec = new ReadEndsForMarkDuplicatesCodec(scale);
diskCodec = new ReadEndsForMarkDuplicatesCodec(scale);
}

this.pairSort = SortingCollection.newInstance(ReadEndsForMarkDuplicates.class,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,10 @@
import picard.util.GraphUtils;

import java.io.Serializable;
import java.util.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
* Contains methods for finding optical/co-localized/sequencing duplicates.
Expand Down Expand Up @@ -327,16 +330,19 @@ private PhysicalLocation keeperOrNull(final List<? extends PhysicalLocation> lis
/** Simple method to test whether two physical locations are close enough to each other to be deemed optical dupes. */
private boolean closeEnough(final PhysicalLocation lhs, final PhysicalLocation rhs, final int distance) {
return lhs != rhs && // no comparing an object to itself (checked using object identity)!
lhs.hasLocation() && rhs.hasLocation() && // no comparing objects without locations
lhs.getReadGroup() == rhs.getReadGroup() && // must be in the same RG to be optical duplicates
lhs.getTile() == rhs.getTile() && // and the same tile
Math.abs(lhs.getX() - rhs.getX()) <= distance &&
Math.abs(lhs.getY() - rhs.getY()) <= distance;
lhs.hasLocation() && rhs.hasLocation() && // no comparing objects without locations
lhs.getReadGroup() == rhs.getReadGroup() && // must be in the same RG to be optical duplicates
lhs.getTile() == rhs.getTile() && // and the same tile
closeEnoughShort(lhs, rhs, distance);
}

private boolean closeEnoughShort(final PhysicalLocation lhs, final PhysicalLocation rhs, final int distance) {
return lhs != rhs &&
Math.abs(lhs.getX() - rhs.getX()) <= distance &&
Math.abs(lhs.getY() - rhs.getY()) <= distance;
// Since the X and Y coordinates are constrained to short values, they can also become negative when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this method should have been named something less overloaded like "closeEnoughQuick" to make it clearer that this checks fewer things and not short casted things given that the underlying values could be ints or shorts... Now that you have made this change you perhaps is a correct description as now this method will always check adjacency in short space even if the underlying physical location still has integers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to do some arithmetic to convince myself that this works. Assuming your goal is to make it so that both 32768 and 32767 adjacent without disturbing the 65536 and 65535 adjacency I think this succeeded. However I do not think it is necessary to perform the operation forwards and backwards, the absolute value operation (even accounting for overflowing from the shorts) will not be changed by flipping the order by virtue of the fact that this calculation is being performed in integer space whereas the underlying data overflowed in short space. Unless your scaling factor is in the region of Short.MAX_VALUE/2 you should never have an overflow here thus you should be able to get away without flipping the comparison.

// cast from int to short. This code compares them while taking that into account.
(Math.abs((lhs.getX() - rhs.getX()) % Short.MAX_VALUE) <= distance ||
Math.abs((rhs.getX() - lhs.getX()) % Short.MAX_VALUE) <= distance) &&
(Math.abs((lhs.getY() - rhs.getY()) % Short.MAX_VALUE) <= distance ||
Math.abs((rhs.getY() - lhs.getY()) % Short.MAX_VALUE) <= distance);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,26 @@
import htsjdk.samtools.util.SortingCollection;
import picard.PicardException;

import java.io.*;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Codec for ReadEnds that just outputs the primitive fields and reads them back. */
public class ReadEndsForMarkDuplicatesCodec implements SortingCollection.Codec<ReadEndsForMarkDuplicates> {
protected DataInputStream in;
protected DataOutputStream out;

final protected double scaleFactor;

public ReadEndsForMarkDuplicatesCodec(final double coordinateAccuracy) {
this.scaleFactor = coordinateAccuracy;
}

public SortingCollection.Codec<ReadEndsForMarkDuplicates> clone() {
return new ReadEndsForMarkDuplicatesCodec();
return new ReadEndsForMarkDuplicatesCodec(this.scaleFactor);
}

public void setOutputStream(final OutputStream os) { this.out = new DataOutputStream(os); }
Expand Down Expand Up @@ -66,8 +77,12 @@ public void encode(final ReadEndsForMarkDuplicates read) {

this.out.writeShort(read.readGroup);
this.out.writeShort(read.tile);
this.out.writeShort((short)read.x);
this.out.writeShort((short)read.y);

// scaling this so that in cases where there may be overflow, but the local accuracy is not so important
// the low-level bits are forgotten instead of overflowing

this.out.writeShort((short) (read.x / this.scaleFactor));
this.out.writeShort((short) (read.y / this.scaleFactor));
this.out.writeByte(read.orientationForOpticalDuplicates);
this.out.writeInt(read.duplicateSetSize);
} catch (final IOException ioe) {
Expand Down Expand Up @@ -99,8 +114,10 @@ public ReadEndsForMarkDuplicates decode() {

read.readGroup = this.in.readShort();
read.tile = this.in.readShort();
read.x = this.in.readShort();
read.y = this.in.readShort();

// reversing the scaling that was done during writing.
read.x = (int) (this.scaleFactor * this.in.readShort());
read.y = (int) (this.scaleFactor * this.in.readShort());

read.orientationForOpticalDuplicates = this.in.readByte();
read.duplicateSetSize = this.in.readInt();
Expand Down
Original file line number Diff line number Diff line change
@@ -1,26 +1,26 @@
/*
* The MIT License
*
* Copyright (c) 2015 The Broad Institute
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
* The MIT License
*
* Copyright (c) 2015 The Broad Institute
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/

package picard.sam.markduplicates.util;

Expand All @@ -34,9 +34,13 @@
*/
public class ReadEndsForMarkDuplicatesWithBarcodesCodec extends ReadEndsForMarkDuplicatesCodec {

public ReadEndsForMarkDuplicatesWithBarcodesCodec(final double coordinateAccuracy) {
super(coordinateAccuracy);
}

@Override
public SortingCollection.Codec<ReadEndsForMarkDuplicates> clone() {
return new ReadEndsForMarkDuplicatesWithBarcodesCodec();
return new ReadEndsForMarkDuplicatesWithBarcodesCodec(this.scaleFactor);
}

@Override
Expand All @@ -47,7 +51,7 @@ public void encode(final ReadEndsForMarkDuplicates read) {
super.encode(read);

try {
final ReadEndsForMarkDuplicatesWithBarcodes val = (ReadEndsForMarkDuplicatesWithBarcodes)read;
final ReadEndsForMarkDuplicatesWithBarcodes val = (ReadEndsForMarkDuplicatesWithBarcodes) read;
out.writeInt(val.barcode);
out.writeInt(val.readOneBarcode);
out.writeInt(val.readTwoBarcode);
Expand All @@ -59,7 +63,9 @@ public void encode(final ReadEndsForMarkDuplicates read) {
@Override
public ReadEndsForMarkDuplicates decode() {
final ReadEndsForMarkDuplicates parentRead = super.decode();
if (null == parentRead) return null; // EOF
if (null == parentRead) {
return null; // EOF
}
final ReadEndsForMarkDuplicatesWithBarcodes read = new ReadEndsForMarkDuplicatesWithBarcodes(parentRead);
try {
read.barcode = in.readInt();
Expand All @@ -70,5 +76,4 @@ public ReadEndsForMarkDuplicates decode() {
throw new PicardException("Exception writing ReadEnds to file.", ioe);
}
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -193,6 +193,7 @@ public void testOpticalDuplicateDetection(final File sam, final long expectedNum
markDuplicates.TMP_DIR = CollectionUtil.makeList(outputDir);
// Needed to suppress calling CommandLineProgram.getVersion(), which doesn't work for code not in a jar
markDuplicates.PROGRAM_RECORD_ID = null;
markDuplicates.OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500;
Assert.assertEquals(markDuplicates.doWork(), 0);
Assert.assertEquals(markDuplicates.numOpticalDuplicates(), expectedNumOpticalDuplicates);
IOUtil.recursiveDelete(outputDir.toPath());
Expand All @@ -204,6 +205,7 @@ public Object[][] testOpticalDuplicateDetectionDataProvider() {
return new Object[][] {
{new File(TEST_DATA_DIR, "optical_dupes.sam"), 1L},
{new File(TEST_DATA_DIR, "optical_dupes_casava.sam"), 1L},
{new File(TEST_DATA_DIR, "GH1141.optical_dups.sam"), 1L},
};
}

Expand Down
11 changes: 11 additions & 0 deletions testdata/picard/sam/MarkDuplicates/GH1141.optical_dups.sam
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
@HD VN:1.6 SO:coordinate
@SQ SN:1 LN:200000000
@SQ SN:2 LN:200000000
@RG ID:00001_1#1 LB:22222222 PL:ILLUMINA SM:AW99
@PG ID:MarkDuplicates VN:2.21.4-2-ga3021a7-SNAPSHOT CL:MarkDuplicates TAG_DUPLICATE_SET_MEMBERS=true TAGGING_POLICY=All INPUT=[short_test.sam] OUTPUT=short_test_out.sam METRICS_FILE=short_test_stat.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false PN:MarkDuplicates
AW7_00001:1:2101:29376:32760 99 1 2239203 60 151M = 2239280 228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT AAFFFJJJJJJJJJJJJJJJJJJJJFJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFJJFJF MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2
AW7_00001:1:2101:29346:33023 1123 1 2239203 60 151M = 2239280 228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT AAFFFJJJJJJJJJJJJJJJJJJJAJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJAJJJJJF MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2 DT:Z:LB
AW7_00001:1:2101:29346:98296 1123 1 2239203 60 151M = 2239280 228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT AAFFFJJJJJJJJJJJJJJJJJJJAJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJAJJJJJF MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2 DT:Z:LB
AW7_00001:1:2101:29376:32760 147 1 2239280 60 151M = 2239203 -228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT JJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2
AW7_00001:1:2101:29346:33023 1171 1 2239280 60 151M = 2239203 -228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT JJJJJFJJJJJJJJJJJJJJJJJJJJJFJJF7JJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFFAAA MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2 DT:Z:LB
AW7_00001:1:2101:29346:98296 1171 1 2239280 60 151M = 2239203 -228 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT JJJJJFJJJJJJJJJJJJJJJJJJJJJFJJF7JJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFFAAA MC:Z:151M PG:Z:MarkDuplicates RG:Z:00001_1#1 DI:i:0 DS:i:2 DT:Z:LB