Add more tests for filtering mutations, more specific var-type options, option for controlling number of parallel threads (--jobs) #14

matthuska · 2022-06-07T09:38:39Z

Kind of a potpourri of a PR (sorry): it adds extra tests for the feature filtering (skipping indels, trimming mutations at the ends of the genome) and fixes for the filtering issues that came up along the way.

I also took the opportunity to add an option (--jobs) for setting the number of parallel threads used by the distance calculation, which previously had to be set via the OMP_NUM_THREADS environment variable. Lastly, I added a few extra checks so that the program aborts in certain cases, such as when the input table contains duplicate sequence IDs.
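
For illustration, a minimal sketch of what a duplicate-ID check could look like, assuming the sequence IDs are available as a plain list. The helper names `find_duplicate_ids` and `check_unique_ids` are hypothetical and not the functions used in this PR:

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [seq_id for seq_id, n in counts.items() if n > 1]


def check_unique_ids(ids):
    """Abort (here: raise ValueError) if any sequence ID is duplicated."""
    duplicates = find_duplicate_ids(ids)
    if duplicates:
        raise ValueError(
            "Duplicate sequence IDs in input: " + ", ".join(duplicates)
        )
```

Raising an exception rather than silently dropping rows keeps the failure visible to the caller, which matches the "abort" behaviour described above.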

matthuska · 2022-06-07T09:50:51Z

Closes #13

denisbeslic · 2022-06-08T15:52:19Z

breakfast --input-file output/nextclade.tsv --var-type nextclade_dna --id-col "seqName" --clust-col "substitutions" --sep2 ","
throws the following error:

Clustering sequences
  Input file = output/nextclade.tsv
  Input file separator = '	'
  ID column = seqName
  clustering feature type = nextclade_dna
  clustering feature column = substitutions
  clustering feature column separator = ','
  max dist = 1
  minimum cluster size = 2
  trim start (bp) = 264
  trim end (bp) = 228
  reference length (bp) = 29903
  skip deletions = True
  skip insertions = True
  Input cache file = None
  Output cache file = None
Number of sequences: 10000
nextclade_dna
Skipping invalid feature: 'G210T'
Skipping invalid feature: 'C241T'
Skipping invalid feature: 'C1441T'
Skipping invalid feature: 'C3037T'
Skipping invalid feature: 'G4181T'
Skipping invalid feature: 'C6402T'
Skipping invalid feature: 'C7124T'
Skipping invalid feature: 'C8986T'
Skipping invalid feature: 'G9053T'
...
Traceback (most recent call last):
  File "X/.conda/envs/breakfast-dev/bin/breakfast", line 5, in <module>
    main()
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "X/.conda/envs/breakfast-dev/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "X/breakfast-dev/breakfast/src/breakfast/console.py", line 83, in main
    meta["feature"] = breakfast.filter_features(
  File "X/breakfast-dev/breakfast/src/breakfast/breakfast.py", line 160, in filter_features
    pos = int(submatch.group(1))
IndexError: no such group
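
A note on the error itself: `IndexError: no such group` is what Python's `re` module raises when `group(1)` is requested from a match whose pattern contains no capture group. The sketch below shows one way to parse nextclade-style substitutions with the position captured. The pattern, the function name `filter_substitutions`, and the exact trimming bounds are assumptions for illustration, not the code in breakfast.py:

```python
import re

# Assumed pattern for a nucleotide substitution such as "C3037T".
# Group 1 captures the position. Without the parentheses around \d+,
# match.group(1) raises "IndexError: no such group".
SUBSTITUTION = re.compile(r"^[A-Z](\d+)[A-Z]$")


def filter_substitutions(features, sep=",", trim_start=264, trim_end=228,
                         ref_length=29903):
    """Keep substitutions that lie outside the trimmed genome ends.

    The defaults mirror the values printed in the log above. The bounds
    (exclusive at the start, inclusive at the end) are an assumption.
    """
    kept = []
    for feature in features.split(sep):
        feature = feature.strip()
        match = SUBSTITUTION.match(feature)
        if match is None:
            continue  # not a plain substitution, skip it
        pos = int(match.group(1))
        if trim_start < pos <= ref_length - trim_end:
            kept.append(feature)
    # Join with the same separator that was used to split the input.
    return sep.join(kept)
```

With these defaults, 'G210T' and 'C241T' from the log fall inside the trimmed start region and would be dropped, while 'C1441T' would be kept.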

matthuska added 4 commits June 2, 2022 14:39

Test and fail on missing columns, duplicate ids. Use regex for dna subs, indel matching.

fa50ac0

Allow the user to set the number of threads (--jobs) OpenMP uses

4dfc0f3

More specific var-type (for covsonar vs nextclade differences). Regexes for all. Need to test everything that's not 'covsonar_dna'

4485936

Document --jobs parameter and new --var-type's for program-specific mutation parsers

0ceeaad

matthuska requested a review from denisbeslic June 7, 2022 09:38

matthuska linked an issue Jun 7, 2022 that may be closed by this pull request

Check for duplicates of sequence IDs #13

Closed

matthuska and others added 7 commits June 9, 2022 12:41

Fix typo (nextstrain -> nextclade) in filter function

d1b9bd6

Actually add the new mutation filtering tests that this branch is named after

4e7cd3d

Add tests for nextclade

a3c5bac

Remove print statement

26a6e82

Var-type is printed already at the start

Create output cache dir if it doesn't exist. Depends on PR #14. (#15)

2fdfdbc

* Switch to using pathlib Paths, and create output cache dir if missing

Add additional tests for nextclade and vartype case

f7238f2

Fix issue with user-defined feature sep

07e64b5

Features are now joined with user-defined sep instead of space at the end of filter_features
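
As a rough illustration of the cache-directory commit above (2fdfdbc), creating a missing output directory with pathlib could look like the sketch below. The helper name `prepare_cache_path` is hypothetical, and the commit's actual code may differ:

```python
from pathlib import Path


def prepare_cache_path(output_cache):
    """Return output_cache as a Path, creating its parent directory if missing."""
    cache_path = Path(output_cache)
    # parents=True creates intermediate directories as needed;
    # exist_ok=True makes the call a no-op when the directory already exists.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    return cache_path
```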

denisbeslic merged commit 6ba721e into develop Jun 10, 2022

matthuska deleted the feature/moretests branch June 10, 2022 10:36
