Seqlet calling issues #24

Open
bovagner opened this issue Oct 17, 2024 · 2 comments

@bovagner
Contributor

Thank you for adding functionality for calling seqlets, this is really useful. I encountered a few issues when experimenting with the new functions.

With tfmodisco_seqlets:

  1. When setting flank=0, no seqlets are returned. I think it's because of this line, which then sets all entries equal to negative infinity, since [:, -0:] is the same slice as [:, 0:] (see the snippet after this list): X_sum[:, -flank:] = -numpy.inf

  2. The device argument lacks an explanation. I got an error because some tensors were on cpu and some on cuda with the default device='cuda'. Also, it's not possible to put the input tensor on cuda, so the function appears to work only with device='cpu'.
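
A minimal sketch of the flank=0 pitfall from item 1, using a toy array in place of the real X_sum:

import numpy

# With flank = 0, the negative slice [:, -flank:] degenerates to [:, 0:],
# so every column is set to -inf rather than none of them.
X_sum = numpy.ones((2, 5))
flank = 0
X_sum[:, -flank:] = -numpy.inf
print(X_sum)  # every entry is -inf, so no position can seed a seqlet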

With recursive_seqlets:
It seems that returned seqlets are skewed to the right, i.e. often extending beyond the core 'motif' on the right side, but not on the left side. This is most apparent when using a rather large value for min_seqlet_len, such as 10. It could be related to this line:
end = min(end + min_seqlet_len + additional_flanks - 1, l)
where min_seqlet_len is added to end.
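
A hypothetical illustration with made-up numbers, assuming the recursion has already found a core span [start, end) when this line runs:

start, end, l = 20, 25, 100               # made-up core span and sequence length
min_seqlet_len, additional_flanks = 10, 0
end = min(end + min_seqlet_len + additional_flanks - 1, l)
print(end)  # 34: the span grew 9 positions on the right and none on the left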

With plot_logo:
plot_logo fails with seqlets if there are more annotations than n_tracks and show_extra is True,
since motif = row.values[0] is a float and so has no length.
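
The failure mode in miniature (a standalone sketch, not tangermeme code):

motif = 3.300204  # an attribution score (float), not a motif string
try:
    len(motif)
except TypeError as e:
    print(e)  # object of type 'float' has no len()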

@jmschrei
Owner

  1. fixed, thanks!
  2. I've removed device. At one point, I wanted to make tfmodisco seqlet calling work on a GPU, but the use of sklearn's isotonic regression function made that infeasible (see the sketch below).
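
For context, a minimal sketch of that constraint, using only sklearn's public API on toy data:

import numpy
from sklearn.isotonic import IsotonicRegression

# sklearn's isotonic regression operates on CPU numpy arrays only, so a
# pipeline that calls it cannot stay on the GPU end to end.
x = numpy.arange(10, dtype=float)
y = numpy.sort(numpy.random.randn(10))
y_hat = IsotonicRegression().fit_transform(x, y)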

For recursive_seqlets, can you show a screenshot of an example? I wouldn't necessarily expect the high-attribution characters to fall in the middle; rather, the seqlet will be whichever span gives the best score. If centered seqlets are what you want, I'd use additional_flanks, which adds more nucleotides on either side regardless of their effect on the p-value.
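
A hedged usage sketch; additional_flanks is the parameter referenced above, while the toy tensor and the value 3 are illustrative assumptions:

import torch
import tangermeme.seqlet

X = torch.randn(50, 100)  # toy attribution scores
# Pad every called seqlet by 3 extra positions on each side, regardless of
# the effect on the p-value.
seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01,
    additional_flanks=3)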

plot_logo issue is fixed.

Will be available next release, hopefully next week.

@bovagner
Contributor Author

bovagner commented Oct 21, 2024

I've tried making a reproducible example.

import torch
import tangermeme.seqlet
import tangermeme.plot
import matplotlib.pyplot as plt
g_cpu = torch.Generator()
g_cpu.manual_seed(2147483647)  # fixed seed for reproducibility

Generate 50 random sequences of length 100 with random attributions in the interval [-1, 1):

X = torch.rand((50, 100), generator=g_cpu) * 2 - 1

Call recursive seqlets:

seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01)
seqlets
	example_idx	start	end	attribution	p-value
0	23	17	21	3.300204	0.000402
1	9	89	93	3.036839	0.004016

The start and end are supposed to be BED-formatted, meaning end is excluded, and indeed this agrees: X[9, 89:93].sum() returns tensor(3.0368).

I would expect the results to be invariant to reversing the attributions along the length of the sequence:

X_rev = torch.flip(X, dims=(1,))
rseqlets = tangermeme.seqlet.recursive_seqlets(X_rev, threshold=0.01)
rseqlets['rev_start'] = 100 - rseqlets['end']
rseqlets['rev_end'] = 100 - rseqlets['start']
rseqlets
	example_idx	start	end	attribution	p-value	rev_start	rev_end
0	9	6	10	3.046904	0.003210	90	94
1	15	1	5	-2.931187	0.008232	95	99

However, one seqlet is found in a different sequence, and the one in sequence 9 is shifted by 1.

X_rev[9, 6:10].sum(), X_rev[9, 7:11].sum()  # (tensor(3.0469), tensor(3.0368))

This hints at some left-right asymmetry.
Let's try inserting a symmetrical 'motif' into sequence 0.

motif = torch.arange(0.1, 2.1, 0.2)         # ramps from 0.1 up to 1.9 in steps of 0.2
X[0, 10:20] = motif                         # rising flank
X[0, 20:30] = torch.flip(motif, dims=(0,))  # falling flank; symmetric peak at 19-20
seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01)
seqlets
	example_idx	start	end	attribution	p-value
0	0	16	27	15.5	0.001315

The peak of the 'motif' is at positions 19 and 20. We thus have 3 positions to the left of the peak and 6 positions to the right of the peak.

idx = 0
plt.figure(figsize=(12, 2))
ax = plt.subplot(111)

# Put the attributions on the first channel; the other three channels are zero.
t = torch.stack([X[idx], torch.zeros_like(X[idx]), torch.zeros_like(X[idx]),
    torch.zeros_like(X[idx])])
tangermeme.plot.plot_logo(t, ax=ax, start=0, end=100,
    annotations=seqlets[seqlets['example_idx'] == idx],
    score_key='attribution', show_extra=False)
plt.ylim(-1, 2)

plt.xlabel("Genomic Coordinate")
plt.ylabel("Attribution")
plt.title("DeepLIFT/SHAP")
plt.show()

[figure: attribution logo; the annotated seqlet extends further to the right of the peak than to the left]

When we plot it, the seqlet even extends 4 positions further to the right of the peak than to the left. I think the extra position in the plot comes from plot_logo treating end as inclusive, which is misleading. Still, the algorithm appears to have a preference for extending the seqlet to the right.
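
For reference, the arithmetic behind that observation (0-based positions, end exclusive as returned):

start, end = 16, 27   # the called seqlet from the table above
peak = (19, 20)       # the two maximal positions of the inserted motif
print(peak[0] - start)       # 3 positions to the left of the peak
print((end - 1) - peak[1])   # 6 positions to the right of the peak
# If plot_logo draws through `end` inclusively, the right side shows 7.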

Another artefact is that increasing min_seqlet_len can produce more seqlets, which is a bit counterintuitive. This probably derives from the fact that only subseqlets down to min_seqlet_len need to have a p-value below the threshold. A quick check is sketched below.
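
A quick way to reproduce that check (the two values of min_seqlet_len are arbitrary choices; X is the random tensor from above):

# Count seqlets for two values of min_seqlet_len on the same attributions.
# More rows for the larger value would confirm the artefact.
for k in (4, 10):
    s = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01, min_seqlet_len=k)
    print(k, len(s))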
