Seqlet calling issues #24

Open
bovagner opened this issue Oct 17, 2024 · 2 comments

@bovagner
Contributor

Thank you for adding functionality for calling seqlets, this is really useful. I encountered a few issues when experimenting with the new functions.

With tfmodisco_seqlets:

  1. When setting flank=0, no seqlets are returned. I think it's because of this line, which then sets all entries equal to negative infinity, since [:, -0:] is the same slice as [:, 0:] (see the snippet after this list): X_sum[:, -flank:] = -numpy.inf

  2. The device argument lacks an explanation. I got an error because some tensors were on cpu and some on cuda with the default device='cuda'. Also, it's not possible to put the input tensor on cuda, so the function appears to work only with device='cpu'.
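
A minimal sketch of the flank=0 pitfall from item 1, using a toy array in place of the real X_sum:

import numpy

# With flank = 0, the negative slice [:, -flank:] degenerates to [:, 0:],
# so every column is set to -inf rather than none of them.
X_sum = numpy.ones((2, 5))
flank = 0
X_sum[:, -flank:] = -numpy.inf
print(X_sum)  # every entry is -inf, so no position can seed a seqlet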

With recursive_seqlets:
It seems that returned seqlets are skewed to the right, i.e. often extending beyond the core 'motif' on the right side, but not on the left side. This is most apparent when using a rather large value for min_seqlet_len, such as 10. It could be related to this line:
end = min(end + min_seqlet_len + additional_flanks - 1, l)
where min_seqlet_len is added to end.
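
A hypothetical illustration with made-up numbers, assuming the recursion has already found a core span [start, end) when this line runs:

start, end, l = 20, 25, 100               # made-up core span and sequence length
min_seqlet_len, additional_flanks = 10, 0
end = min(end + min_seqlet_len + additional_flanks - 1, l)
print(end)  # 34: the span grew 9 positions on the right and none on the left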

With plot_logo:
plot_logo fails with seqlets if there are more annotations than n_tracks and show_extra is True,
since motif = row.values[0] is a float and so has no length.
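
The failure mode in miniature (a standalone sketch, not tangermeme code):

motif = 3.300204  # an attribution score (float), not a motif string
try:
    len(motif)
except TypeError as e:
    print(e)  # object of type 'float' has no len()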

@jmschrei
Owner

  1. fixed, thanks!
  2. I've removed device. At one point, I wanted to make tfmodisco seqlet calling work on a GPU, but the use of sklearn's isotonic regression function made that infeasible (see the sketch below).
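
For context, a minimal sketch of that constraint, using only sklearn's public API on toy data:

import numpy
from sklearn.isotonic import IsotonicRegression

# sklearn's isotonic regression operates on CPU numpy arrays only, so a
# pipeline that calls it cannot stay on the GPU end to end.
x = numpy.arange(10, dtype=float)
y = numpy.sort(numpy.random.randn(10))
y_hat = IsotonicRegression().fit_transform(x, y)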

For recursive_seqlets, can you show a screenshot of an example? I wouldn't necessarily expect the high-attribution characters to fall in the middle; rather, the seqlet will be whichever span gives the best score. If centered seqlets are what you want, I'd use additional_flanks, which adds more nucleotides on either side regardless of their effect on the p-value.
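
A hedged usage sketch; additional_flanks is the parameter referenced above, while the toy tensor and the value 3 are illustrative assumptions:

import torch
import tangermeme.seqlet

X = torch.randn(50, 100)  # toy attribution scores
# Pad every called seqlet by 3 extra positions on each side, regardless of
# the effect on the p-value.
seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01,
    additional_flanks=3)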

plot_logo issue is fixed.

Will be available next release, hopefully next week.

@bovagner
Contributor Author

bovagner commented Oct 21, 2024

I've tried making a reproducible example.

import torch
import tangermeme.seqlet
import tangermeme.plot
import matplotlib.pyplot as plt
g_cpu = torch.Generator()
g_cpu.manual_seed(2147483647)  # fixed seed for reproducibility

Generate 50 random sequences of length 100 with random attributions in the interval [-1, 1):

X = torch.rand((50, 100), generator=g_cpu) * 2 - 1

Call recursive seqlets:

seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01)
seqlets
	example_idx	start	end	attribution	p-value
0	23	17	21	3.300204	0.000402
1	9	89	93	3.036839	0.004016

The start and end are supposed to be BED-formatted, meaning end is excluded, and indeed this agrees: X[9, 89:93].sum() returns tensor(3.0368).

I would expect the results to be invariant to reversing the attributions along the length of the sequence:

X_rev = torch.flip(X, dims=(1,))
rseqlets = tangermeme.seqlet.recursive_seqlets(X_rev, threshold=0.01)
rseqlets['rev_start'] = 100 - rseqlets['end']
rseqlets['rev_end'] = 100 - rseqlets['start']
rseqlets
	example_idx	start	end	attribution	p-value	rev_start	rev_end
0	9	6	10	3.046904	0.003210	90	94
1	15	1	5	-2.931187	0.008232	95	99

However, one seqlet is found in a different sequence, and the one in sequence 9 is shifted by 1.

X_rev[9, 6:10].sum(), X_rev[9, 7:11].sum()  # (tensor(3.0469), tensor(3.0368))

This hints at some left-right asymmetry.
Let's try inserting a symmetrical 'motif' into sequence 0.

motif = torch.arange(0.1, 2.1, 0.2)         # ramps from 0.1 up to 1.9 in steps of 0.2
X[0, 10:20] = motif                         # rising flank
X[0, 20:30] = torch.flip(motif, dims=(0,))  # falling flank; symmetric peak at 19-20
seqlets = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01)
seqlets
	example_idx	start	end	attribution	p-value
0	0	16	27	15.5	0.001315

The peak of the 'motif' is at positions 19 and 20. We thus have 3 positions to the left of the peak and 6 positions to the right of the peak.

idx = 0
plt.figure(figsize=(12, 2))
ax = plt.subplot(111)

# Put the attributions on the first channel; the other three channels are zero.
t = torch.stack([X[idx], torch.zeros_like(X[idx]), torch.zeros_like(X[idx]),
    torch.zeros_like(X[idx])])
tangermeme.plot.plot_logo(t, ax=ax, start=0, end=100,
    annotations=seqlets[seqlets['example_idx'] == idx],
    score_key='attribution', show_extra=False)
plt.ylim(-1, 2)

plt.xlabel("Genomic Coordinate")
plt.ylabel("Attribution")
plt.title("DeepLIFT/SHAP")
plt.show()

[figure: attribution logo; the annotated seqlet extends further to the right of the peak than to the left]

When we plot it, the seqlet even extends 4 positions further to the right of the peak than to the left. I think the extra position in the plot comes from plot_logo treating end as inclusive, which is misleading. Still, the algorithm appears to have a preference for extending the seqlet to the right.
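
For reference, the arithmetic behind that observation (0-based positions, end exclusive as returned):

start, end = 16, 27   # the called seqlet from the table above
peak = (19, 20)       # the two maximal positions of the inserted motif
print(peak[0] - start)       # 3 positions to the left of the peak
print((end - 1) - peak[1])   # 6 positions to the right of the peak
# If plot_logo draws through `end` inclusively, the right side shows 7.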

Another artefact is that increasing min_seqlet_len can produce more seqlets, which is a bit counterintuitive. This probably derives from the fact that only subseqlets down to min_seqlet_len need to have a p-value below the threshold. A quick check is sketched below.
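
A quick way to reproduce that check (the two values of min_seqlet_len are arbitrary choices; X is the random tensor from above):

# Count seqlets for two values of min_seqlet_len on the same attributions.
# More rows for the larger value would confirm the artefact.
for k in (4, 10):
    s = tangermeme.seqlet.recursive_seqlets(X, threshold=0.01, min_seqlet_len=k)
    print(k, len(s))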
