CLN: rename/refactor `recursive_stem` and add clearer error message #525

NickleDave · 2022-06-23T00:18:06Z

Moving from #524 to a separate issue here since this is not really a documentation thing, but it's a fix I've been meaning to make anyway

See #524 and #523 for context

In general the recursive_stem function is cryptically named. Rename find_notated_ext or something like that?
In recursive_stem refer to naming convention in error message. Ideally link directly to the docs
give @sthaar ideas credit

The text was updated successfully, but these errors were encountered:

NickleDave · 2022-08-16T02:42:48Z

Looking at this more.
Part of what's confusing is that this function is used in two ways:

to get a stem (filename without extension) from an audio file
also to get a stem from spectrogram or annotation files, that by convention should have the audio file extension in them (e.g. bird1-song1.wav.npz and bird1-song1.wav.csv)

What's confusing is when the latter case causes an error because the user doesn't know about that convention. We need to specifically capture that case and provide an informative error with a link to the docs explaining the convention.

NickleDave · 2022-08-16T13:16:06Z

Here is the whole traceback for an instance of when this happens (that I got by naming files without following the convention, to raise the error intentionally)

Key point is it's happening when vak.io.audio calls source_annot_map

$ vak prep make-prep-raise-stem-error.toml 
/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
determined that purpose of config file is: train
will add 'csv_path' option to 'TRAIN' section
purpose for dataset: train
will split dataset
making array files containing spectrograms from audio files in: /home/pimienta/Documents/data/vocal/tweety-tech-support/arnav/2022-07-24/TweetyNet_Arnav/train-csv-without-wav-in-filename
Traceback (most recent call last):
  File "/home/pimienta/miniconda3/envs/tweety-article-env/bin/vak", line 10, in <module>
    sys.exit(main())
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/cli/cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/cli/prep.py", line 132, in prep
    vak_df, csv_path = core.prep(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/core/prep.py", line 205, in prep
    vak_df = dataframe.from_files(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/io/dataframe.py", line 133, in from_files
    spect_files = audio.to_spect(
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/io/audio.py", line 174, in to_spect
    audio_annot_map = source_annot_map(audio_files, annot_list)
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 207, in source_annot_map
    keys = [recursive_stem(annot.audio_path) for annot in annot_list]
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 207, in <listcomp>
    keys = [recursive_stem(annot.audio_path) for annot in annot_list]
  File "/home/pimienta/miniconda3/envs/tweety-article-env/lib/python3.8/site-packages/vak/annotation.py", line 172, in recursive_stem
    raise ValueError(f"unable to compute stem of {path}")
ValueError: unable to compute stem of /home/pimienta/Documents/data/vocal/tweety-tech-support/arnav/2022-07-24/TweetyNet_Arnav/train-csv-without-wav-in-filename/Sapp165_20220509-01_2224.0435

NickleDave · 2022-08-17T03:12:49Z

Ok so this is more nuanced...
what happens is that inside vak.io.dataframe we get the annotations from crowsetta.Transcriber

but the 'simple' format assumes that the name of the annotated file is "name of annotation file, minus the .csv extension"
https://github.com/vocalpy/crowsetta/blob/d3c4f4c476481661e7042e802d3602e2acdd3449/src/crowsetta/simple.py#L97

This means if you name the annotation file without following the "includes audio filename" convention, you will end up with an audio_path that doesn't include an audio format extension. E.g., a file in 'simple-seq' format with the name bird1-day3-142512.txt will have audio_path=bird1-day3-142512, which causes recursive_stem to fail.

So it's not crashing because we try to find a stem from the annotation filenames themselves, it's crashing because we try to find a stem from the audio_path attribute of the annotation objects, which has been created by crowsetta

This matters because I was expecting to wrap some code that tries to get stems from the annotation files themselves in a try/except block.

I can't just say "this audio file could not be stemmed" since a user won't know where the "audio file" is coming from (it's a filename made up by crowsetta).

Simplest thing to do is catch any ValueError inside source_annot_map when we call audio_stem_from_path inside list comprehension and at that point raise the informative message saying "make sure you follow the naming convention". Would be nice to include filenames that trigger exception though

NickleDave · 2022-08-17T12:58:04Z

Also: rename source_annot_map to map_annotated_to_annotation

more verbose yes
but matches naming now uses by crowsetta (annotation / annotated) instead of the enigmatic "source"
and having it be a verb helps readability and prevents clashes when we name a variable that ends in map, i.e., prefer audio_annot_map = map_annotated_to_annotation(audio_files, annot_files) over source_annot_map = annotation.source_annot_map(audio_files, annot_files)

Adds a page on file naming conventions to the reference section of the docs, that we can link to from other pages (as in #524) and from error messages (see #525). Also adds a page with Frequently Asked Questions, with information on how to annotate data for use with `vak` and links to other how-tos. Fixes #424. Squashed commits: - Add target to howto_user_annot.md Add target to header "spectrogram file naming conventions" so we can reference it from a separate page on file naming conventions. - Add doc/reference/filenames.md that explicitly documents naming conventions, so that we can reference this elsewhere, as described in #524. - Add doc/faq.md, fixes #424 - Add "getting help" to doc/index.md with link to just-added FAQ, as well as links to forum and issue tracker. - Update 'support/contributing' section of README to match the "Getting Help" section on the index of the docs, with links to FAQ, forum, issue tracker, and additionally the contributing docs. - Add faq to toctree in doc/index.md - Add link to naming conventions in howto_user_annot callout

- Rename `vak.annotation.recursive_stem` -> `audio_stem_from_path` Trying to convey that we are specifically looking for an audio filename, and then returning only its stem, while keeping function name concise. - Rename `annotation.source_annot_map` -> `map_annotated_to_annot`

- Add custom exception `AudioFilenameNotFound` - Have `audio_stem_from_path` raise this exception when it fails to find a valid audio format extension and then runs out of extensions to remove. The error clearly states that an audio filename was not found in the path, instead of short cryptic "unable to stem". - Have `map_annoted_to_annot` look for this custom exception in a try-except block when it first creates the keys for the map. If we catch an AudioFilenameNotFound exception, then we raise a verbose ValueError explaining what happened, including both `audio_path` and the path to the annotation itself, and a link to the file naming convention page added in #564 and to the how-to-user-annotation-format page.

NickleDave added the CLN: clean / refactor code cleanup, e.g. refactoring, maintenance, typos label Jun 23, 2022

NickleDave mentioned this issue Jun 23, 2022

DOC: Better document naming convention #524

Closed

3 tasks

NickleDave mentioned this issue Aug 17, 2022

ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

Closed

4 tasks

This was referenced Aug 18, 2022

Rename and refactor recursive_stem, add clearer error messages in vak/annotation.py #566

Merged

DEP: use crowsetta 5.0.0 (post pyOpenSci review) #526

Closed

NickleDave closed this as completed in #566 Aug 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLN: rename/refactor `recursive_stem` and add clearer error message #525

CLN: rename/refactor `recursive_stem` and add clearer error message #525

NickleDave commented Jun 23, 2022

NickleDave commented Aug 16, 2022

NickleDave commented Aug 16, 2022

NickleDave commented Aug 17, 2022 •

edited

Loading

NickleDave commented Aug 17, 2022

CLN: rename/refactor recursive_stem and add clearer error message #525

CLN: rename/refactor recursive_stem and add clearer error message #525

Comments

NickleDave commented Jun 23, 2022

NickleDave commented Aug 16, 2022

NickleDave commented Aug 16, 2022

NickleDave commented Aug 17, 2022 • edited Loading

NickleDave commented Aug 17, 2022

CLN: rename/refactor `recursive_stem` and add clearer error message #525

CLN: rename/refactor `recursive_stem` and add clearer error message #525

NickleDave commented Aug 17, 2022 •

edited

Loading