Error when chromosome names don't start with chr #20

Open
Avsecz opened this issue Nov 11, 2018 · 1 comment

Comments

@Avsecz
Contributor

Avsecz commented Nov 11, 2018

  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
    return_predictions=return_predictions)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
    for i, batch in enumerate(tqdm(it)):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
    return self._process_next_batch(batch)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
    i = self.index[rname]
KeyError: 'chr1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
    ret = self.seq_dl[idx]
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
    seq = self.fasta_extractors.extract(interval)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
    seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
    seq = self.from_file(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.

minimal.vcf

##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1    15791   1:15791_C_T     C       T       .       .       .
1    69487   1:69487_G_A     G       A       .       .       .
1    69569   1:69569_T_C     T       C       .       .       .
1    139853  1:139853_C_T    C       T       .       .       .
1    693731  1:693731_A_G    A       G       .       .       .

The FASTA file contained the correct chromosome names (matching the VCF, without a "chr" prefix), e.g. >1...
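To double-check that the two input files really agree on naming, a small standard-library sketch like the one below can compare the names in each. It is not part of the original report; `fasta_names` and `vcf_chroms` are hypothetical helper names, and the code only assumes that FASTA record names are the first whitespace-delimited token after `>` and that the VCF chromosome is the first column.

```python
def fasta_names(lines):
    """Collect record names from FASTA header lines (first token after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no name
                names.add(fields[0])
    return names


def vcf_chroms(lines):
    """Collect the CHROM column from VCF data lines, skipping headers and blanks."""
    chroms = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        # split() handles both tab- and space-separated columns
        chroms.add(line.split()[0])
    return chroms


if __name__ == "__main__":
    fasta = [">1 some description", "ACGTACGT", ">2", "TTGGCCAA"]
    vcf = [
        "##fileformat=VCFv4.0",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t15791\t1:15791_C_T\tC\tT\t.\t.\t.",
    ]
    # Chromosomes used in the VCF that the FASTA does not provide
    print(vcf_chroms(vcf) - fasta_names(fasta))
```

If the difference is empty while the lookup still fails on `chr1`, the prefix is being introduced somewhere between the input files and the FASTA lookup rather than by the files themselves.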

@krrome
Contributor

krrome commented Nov 11, 2018

OK, we need a way to deal with that. I think it is either the job of the dataloader, or we catch the KeyError in kipoi_veff.
Problem is:

  • VCF files tend to have chromosome names without a leading "chr", indicating that the positions in them are 1-based
  • FASTA does not have any restrictions on chromosome naming

An argument towards handling it within the dataloader:

  • BED files have to have a "chr" prefix for genomic coordinates (because they are UCSC-standard / 0-based). Therefore your FASTA file would raise the exact same error with any BED file.

An argument to not handle it automatically:

  • The FASTA file could contain entries named ">1" and ">chr1" that are not identical. I think we can ignore this case, as intuitively it wouldn't make sense.
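One way the automatic handling discussed above could look is sketched below. This is only an illustration, not existing kipoi_veff or kipoiseq code: `resolve_chrom` is a hypothetical helper, and `available` stands for whatever set of record names the FASTA index exposes. An exact match always wins, so a FASTA that contains both ">1" and ">chr1" is never silently remapped.

```python
def resolve_chrom(chrom, available):
    """Map `chrom` onto a name in `available`, tolerating a "chr" prefix mismatch.

    Hypothetical helper: `available` is assumed to be the collection of
    record names found in the FASTA index.
    """
    # Exact match first, so files with both "1" and "chr1" keep their meaning
    if chrom in available:
        return chrom
    # Otherwise try the other naming convention
    if chrom.startswith("chr"):
        alternative = chrom[len("chr"):]
    else:
        alternative = "chr" + chrom
    if alternative in available:
        return alternative
    raise KeyError(
        "Chromosome {!r} (or {!r}) not found in FASTA".format(chrom, alternative)
    )


if __name__ == "__main__":
    names = {"1", "2", "X"}
    print(resolve_chrom("chr1", names))  # falls back to the un-prefixed name
    print(resolve_chrom("X", names))     # exact match is returned unchanged
```

Whether this lives in the dataloader or wraps the lookup in kipoi_veff is the open design question; the helper itself is independent of that choice.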

2 participants