Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: rename/refactor recursive_stem and add clearer error message #525

Closed
3 tasks
NickleDave opened this issue Jun 23, 2022 · 4 comments · Fixed by #566
Closed
3 tasks

CLN: rename/refactor recursive_stem and add clearer error message #525

NickleDave opened this issue Jun 23, 2022 · 4 comments · Fixed by #566
Labels
CLN: clean / refactor code cleanup, e.g. refactoring, maintenance, typos

Comments

@NickleDave
Copy link
Collaborator

Moving from #524 to a separate issue here since this is not really a documentation thing, but it's a fix I've been meaning to make anyway

See #524 and #523 for context

  • In general the recursive_stem function is cryptically named. Rename find_notated_ext or something like that?
  • In recursive_stem refer to naming convention in error message. Ideally link directly to the docs
  • give @sthaar ideas credit
@NickleDave NickleDave added the CLN: clean / refactor code cleanup, e.g. refactoring, maintenance, typos label Jun 23, 2022
@NickleDave
Copy link
Collaborator Author

Looking at this more.
Part of what's confusing is that this function is used in two ways:

  • to get a stem (filename without extension) from an audio file
  • also to get a stem from spectrogram or annotation files, that by convention should have the audio file extension in them (e.g. bird1-song1.wav.npz and bird1-song1.wav.csv)

What's confusing is when the latter case causes an error because the user doesn't know about that convention. We need to specifically capture that case and provide an informative error with a link to the docs explaining the convention.

@NickleDave
Copy link
Collaborator Author

Here is the whole traceback for an instance of when this happens (that I got by naming files without following the convention, to raise the error intentionally)

Key point is it's happening when vak.io.audio calls source_annot_map

$ vak prep make-prep-raise-stem-error.toml 
/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
determined that purpose of config file is: train
will add 'csv_path' option to 'TRAIN' section
purpose for dataset: train
will split dataset
making array files containing spectrograms from audio files in: /home/pimienta/Documents/data/vocal/tweety-tech-support/arnav/2022-07-24/TweetyNet_Arnav/train-csv-without-wav-in-filename
Traceback (most recent call last):
  File "/home/pimienta/miniconda3/envs/tweety-article-env/bin/vak", line 10, in <module>
    sys.exit(main())
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/cli/cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/cli/prep.py", line 132, in prep
    vak_df, csv_path = core.prep(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/core/prep.py", line 205, in prep
    vak_df = dataframe.from_files(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/io/dataframe.py", line 133, in from_files
    spect_files = audio.to_spect(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/io/audio.py", line 174, in to_spect
    audio_annot_map = source_annot_map(audio_files, annot_list)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 207, in source_annot_map
    keys = [recursive_stem(annot.audio_path) for annot in annot_list]
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 207, in <listcomp>
    keys = [recursive_stem(annot.audio_path) for annot in annot_list]
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 172, in recursive_stem
    raise ValueError(f"unable to compute stem of {path}")
ValueError: unable to compute stem of /home/pimienta/Documents/data/vocal/tweety-tech-support/arnav/2022-07-24/TweetyNet_Arnav/train-csv-without-wav-in-filename/Sapp165_20220509-01_2224.0435

@NickleDave
Copy link
Collaborator Author

NickleDave commented Aug 17, 2022

Ok so this is more nuanced...
what happens is that inside vak.io.dataframe we get the annotations from crowsetta.Transcriber

but the 'simple' format assumes that the name of the annotated file is "name of annotation file, minus the .csv extension"
https://github.com/vocalpy/crowsetta/blob/d3c4f4c476481661e7042e802d3602e2acdd3449/src/crowsetta/simple.py#L97

This means if you name the annotation file without following the "includes audio filename" convention, you will end up with an audio_path that doesn't include an audio format extension. E.g., a file in 'simple-seq' format with the name bird1-day3-142512.txt will have audio_path=bird1-day3-142512, which causes recursive_stem to fail.

So it's not crashing because we try to find a stem from the annotation filenames themselves, it's crashing because we try to find a stem from the audio_path attribute of the annotation objects, which has been created by crowsetta

This matters because I was expecting to wrap some code that tries to get stems from the annotation files themselves in a try/except block.

I can't just say "this audio file could not be stemmed" since a user won't know where the "audio file" is coming from (it's a filename made up by crowsetta).

Simplest thing to do is catch any ValueError inside source_annot_map when we call audio_stem_from_path inside list comprehension and at that point raise the informative message saying "make sure you follow the naming convention". Would be nice to include filenames that trigger exception though

@NickleDave
Copy link
Collaborator Author

Also: rename source_annot_map to map_annotated_to_annotation

  • more verbose yes
  • but matches naming now uses by crowsetta (annotation / annotated) instead of the enigmatic "source"
  • and having it be a verb helps readability and prevents clashes when we name a variable that ends in map, i.e., prefer audio_annot_map = map_annotated_to_annotation(audio_files, annot_files) over source_annot_map = annotation.source_annot_map(audio_files, annot_files)

NickleDave added a commit that referenced this issue Aug 17, 2022
Adds a page on file naming conventions
to the reference section of the docs,
that we can link to from other pages
(as in #524)
and from error messages (see #525).

Also adds a page with Frequently Asked
Questions,
with information on how to annotate
data for use with `vak`
and links to other how-tos. Fixes #424.

Squashed commits:
- Add target to howto_user_annot.md

  Add target to header "spectrogram file naming conventions"
  so we can reference it from a separate page on
  file naming conventions.

- Add doc/reference/filenames.md

  that explicitly documents naming conventions,
  so that we can reference this elsewhere,
  as described in #524.

- Add doc/faq.md, fixes #424
- Add "getting help" to doc/index.md

  with link to just-added FAQ,
  as well as links to forum and issue tracker.

- Update 'support/contributing' section of README

  to match the "Getting Help" section
  on the index of the docs,
  with links to FAQ, forum, issue tracker,
  and additionally the contributing docs.

- Add faq to toctree in doc/index.md
- Add link to naming conventions in howto_user_annot callout
NickleDave added a commit that referenced this issue Aug 18, 2022
- Rename `vak.annotation.recursive_stem` -> `audio_stem_from_path`

  Trying to convey that we are specifically looking for
  an audio filename, and then returning
  only its stem, while keeping function name concise.

- Rename `annotation.source_annot_map` -> `map_annotated_to_annot`
NickleDave added a commit that referenced this issue Aug 18, 2022
- Add custom exception `AudioFilenameNotFound`
- Have `audio_stem_from_path` raise this exception
  when it fails to find a valid audio format extension
  and then runs out of extensions to remove.
  The error clearly states that an audio filename
  was not found in the path, instead of short
  cryptic "unable to stem".
- Have `map_annoted_to_annot` look for this custom
  exception in a try-except block when it first
  creates the keys for the map.
  If we catch an AudioFilenameNotFound exception,
  then we raise a verbose ValueError
  explaining what happened, including both `audio_path`
  and the path to the annotation itself,
  and a link to the file naming convention page
  added in #564 and to the how-to-user-annotation-format
  page.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLN: clean / refactor code cleanup, e.g. refactoring, maintenance, typos
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant