Add more tests for filtering mutations, more specific var-type options, option for controlling number of parallel threads (--jobs) #14

Merged
merged 11 commits into from
Jun 10, 2022


matthuska
Collaborator

Kind of a potpourri of a PR (sorry). It adds extra tests for the feature filtering (skipping indels, trimming mutations at the ends of the genome) and fixes for the filtering issues that came up along the way.
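To make the filtering described above concrete, here is a minimal sketch of trimming substitutions near the genome ends. It is not the code from this PR: the `SUBSTITUTION` pattern, the function name `filter_substitutions`, and the exact boundary rule (drop positions up to `trim_start` and within the last `trim_end` bases) are assumptions for illustration. The defaults are taken from the values printed in the log further down.

```python
import re

# Hypothetical pattern for a nucleotide substitution such as "C241T":
# reference base, 1-based position (captured as group 1), alternate base.
SUBSTITUTION = re.compile(r"^[ACGTN](\d+)[ACGTN]$")


def filter_substitutions(features, trim_start=264, trim_end=228, ref_length=29903):
    """Keep substitutions whose position lies outside the trimmed genome ends."""
    kept = []
    for feature in features:
        match = SUBSTITUTION.match(feature)
        if match is None:
            # Anything that is not a plain substitution (for example an indel
            # written in another notation) is skipped in this sketch.
            continue
        pos = int(match.group(1))
        # Assumed boundary rule: drop the first trim_start and last trim_end bases.
        if trim_start < pos <= ref_length - trim_end:
            kept.append(feature)
    return kept
```

Under these assumptions, `G210T` and `C241T` fall inside the trimmed start region and are dropped, while `C1441T` is kept.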

I also took this as an opportunity to add an option for setting the number of parallel threads used by the distance calculation (--jobs), which formerly had to be set using the OMP_NUM_THREADS environment variable. Lastly, I added a few extra checks to ensure that the program aborts in certain cases, such as when there are duplicate sequence IDs in the input table.
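For context on the thread setting, the sketch below shows the environment-variable route that `--jobs` is meant to replace from the user's side. How the PR wires `--jobs` internally is not stated here, so `apply_jobs` is a hypothetical helper, not the project's implementation.

```python
import os


def apply_jobs(jobs, env=None):
    """Export the requested thread count as OMP_NUM_THREADS.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to try out without side effects.
    """
    if jobs < 1:
        raise ValueError("jobs must be a positive integer")
    env = os.environ if env is None else env
    # OpenMP runtimes typically read this variable when they initialise,
    # so it should be set before the numerical library is first loaded.
    env["OMP_NUM_THREADS"] = str(jobs)
    return env
```

With a helper like this, a call such as `apply_jobs(4)` early in the program would have the same effect as launching the tool with `OMP_NUM_THREADS=4` set in the shell.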

@matthuska matthuska requested a review from denisbeslic June 7, 2022 09:38
@matthuska
Collaborator Author

Closes #13
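A duplicate-ID check of the kind this PR describes could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, and the real check may operate on the input table directly rather than on a plain list of IDs.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    counts = Counter(ids)
    return [seq_id for seq_id, n in counts.items() if n > 1]


def check_unique_ids(ids):
    """Abort with an error if any sequence ID is repeated."""
    duplicates = find_duplicate_ids(ids)
    if duplicates:
        raise ValueError(f"Duplicate sequence IDs in input: {duplicates}")
```

Raising an exception before clustering starts means a malformed table fails fast instead of producing clusters with ambiguous members.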

@matthuska matthuska linked an issue Jun 7, 2022 that may be closed by this pull request
@denisbeslic
Contributor

breakfast --input-file output/nextclade.tsv --var-type nextclade_dna --id-col "seqName" --clust-col "substitutions" --sep2 ","
throws the following error:

Clustering sequences
  Input file = output/nextclade.tsv
  Input file separator = '	'
  ID column = seqName
  clustering feature type = nextclade_dna
  clustering feature column = substitutions
  clustering feature column separator = ','
  max dist = 1
  minimum cluster size = 2
  trim start (bp) = 264
  trim end (bp) = 228
  reference length (bp) = 29903
  skip deletions = True
  skip insertions = True
  Input cache file = None
  Output cache file = None
Number of sequences: 10000
nextclade_dna
Skipping invalid feature: 'G210T'
Skipping invalid feature: 'C241T'
Skipping invalid feature: 'C1441T'
Skipping invalid feature: 'C3037T'
Skipping invalid feature: 'G4181T'
Skipping invalid feature: 'C6402T'
Skipping invalid feature: 'C7124T'
Skipping invalid feature: 'C8986T'
Skipping invalid feature: 'G9053T'
...
Traceback (most recent call last):
  File "X/.conda/envs/breakfast-dev/bin/breakfast", line 5, in <module>
    main()
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "X/breakfast-dev/breakfast/src/breakfast/console.py", line 83, in main
    meta["feature"] = breakfast.filter_features(
  File "X/breakfast-dev/breakfast/src/breakfast/breakfast.py", line 160, in filter_features
    pos = int(submatch.group(1))
IndexError: no such group
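
For readers unfamiliar with this message: Python's `re` module raises `IndexError: no such group` when `group(n)` is called on a match whose pattern defines fewer than `n` capture groups. The snippet below reproduces that behaviour in isolation. The two patterns are illustrative assumptions and are not taken from `breakfast.py`, so this shows one way the error can arise rather than the confirmed cause here.

```python
import re

feature = "G210T"

# Without parentheses the pattern defines no capture group, so group(1) fails.
no_group = re.match(r"[A-Z]\d+[A-Z]", feature)
try:
    no_group.group(1)
except IndexError as err:
    message = str(err)

# With the digits in parentheses, group(1) returns the position as text.
with_group = re.match(r"[A-Z](\d+)[A-Z]", feature)
pos = int(with_group.group(1))
```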

matthuska and others added 7 commits June 9, 2022 12:41
Var-type is printed already at the start
* Switch to using pathlib Paths, and create output cache dir if missing
Features are now joined with user-defined sep instead of space at the end of filter_features
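
The last commit message above is easiest to see with a small round trip. The sketch assumes a hypothetical `join_features` helper and a comma as the user-defined separator (as passed via `--sep2 ","` in the command above); it is not the actual body of `filter_features`.

```python
def join_features(features, sep=","):
    """Join filtered features with the same separator used to split them."""
    return sep.join(features)


features = ["C1441T", "C3037T"]

# Joining with the user-defined separator survives a later split on it.
round_trip = join_features(features, sep=",").split(",")

# Joining with a space does not: splitting on "," yields a single item.
broken = " ".join(features).split(",")
```

Here `round_trip` equals the original list, while `broken` collapses to one combined string, which is the kind of mismatch the commit addresses.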
@denisbeslic denisbeslic merged commit 6ba721e into develop Jun 10, 2022
@matthuska matthuska deleted the feature/moretests branch June 10, 2022 10:36
Successfully merging this pull request may close these issues.

Check for duplicates of sequence IDs