BUG: Issue converting raven.txt file to simple-seq #261

Closed
sfcooke96 opened this issue Jan 22, 2024 · 3 comments

sfcooke96 commented Jan 22, 2024

Hi there @NickleDave,

I'm running the following on a Mac with crowsetta v5.0.1.

I tried using the following script (suggested here: yardencsGitHub/tweetynet#223) to convert my raven.txt files to simple-seq for use with vak and tweetynet.

import crowsetta
import numpy as np

example = crowsetta.data.get('raven')
raven = crowsetta.formats.bbox.Raven.from_file(example.annot_path, annot_col='Species')
annot = raven.to_annot()
onsets_s = []
offsets_s = []
labels = []
for bbox in annot.bboxes:
    onsets_s.append(bbox.onset)
    offsets_s.append(bbox.offset)
    labels.append(bbox.label)
onsets_s = np.array(onsets_s)
offsets_s = np.array(offsets_s)
labels = np.array(labels)
simpleseq = crowsetta.formats.seq.SimpleSeq(
    onsets_s=onsets_s,
    offsets_s=offsets_s, 
    labels=labels,
    annot_path='/dummy/path'
)
simpleseq.to_csv('example-data.csv')

After running this I got:

AttributeError: 'SimpleSeq' object has no attribute 'to_csv'

I adjusted the script slightly (the raven = ... line and the final simpleseq.to_file(...) call) to the following:

import crowsetta
import numpy as np


example = crowsetta.data.get('raven')
raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Species')
annot = raven.to_annot()
onsets_s = []
offsets_s = []
labels = []
for bbox in annot.bboxes:
    onsets_s.append(bbox.onset)
    offsets_s.append(bbox.offset)
    labels.append(bbox.label)
onsets_s = np.array(onsets_s)
offsets_s = np.array(offsets_s)
labels = np.array(labels)
simpleseq = crowsetta.formats.seq.SimpleSeq(
    onsets_s=onsets_s,
    offsets_s=offsets_s, 
    labels=labels,
    annot_path='/Users/training_data'
)

simpleseq.to_file("data.csv")

I have 10 .txt files in my directory (each with more than 15 rows) to convert to simple-seq format, but the resulting output is only the following (this is the complete file):

onset_s,offset_s,label
154.387792767,154.911598217,EATO
167.526598245,168.17302044,EATO
183.609636834,184.097751553,EATO
250.527480604,251.160710509,EATO
277.88724277,278.480895806,EATO
295.52970757,296.110168316,EATO

I tried adjusting this line from the code above

raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Species')

by changing annot_col to 'Annotation' (the header for the annotation column in my .txt files) and received the following output:

(tweetynet) Stephens-MacBook-Pro:HAV_TN_Training stephencooke$ python test.py 
Traceback (most recent call last):
  File "/Users/stephencooke/Library/CloudStorage/OneDrive-UniversityofArizona/Tweetynet/HAV_TN_Training/test.py", line 6, in <module>
    raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Annotations')
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/crowsetta/formats/bbox/raven.py", line 107, in from_file
    df = RavenSchema.validate(df)
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/model.py", line 306, in validate
    cls.to_schema().validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/container.py", line 375, in validate
    return self._validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/container.py", line 404, in _validate
    return self.get_backend(check_obj).validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/backends/pandas/container.py", line 172, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: column 'annotation' not in dataframe
   Selection           View  Channel  begin_time_s  end_time_s  low_freq_hz  high_freq_hz Species
0          1  Spectrogram 1        1    154.387793  154.911598       2878.2        4049.0    EATO
1          2  Spectrogram 1        1    167.526598  168.173020       2731.9        3902.7    EATO
2          3  Spectrogram 1        1    183.609637  184.097752       2878.2        3975.8    EATO
3          4  Spectrogram 1        1    250.527481  251.160711       2756.2        3951.4    EATO
4          5  Spectrogram 1        1    277.887243  278.480896       2707.5        3975.8    EATO

I've attached example data, the Python script, and the output file here: troubleshooting.zip

Another question while we're here: will training the model on simple-seq annotations restrict the predicted annotations to onset - offset borders without including high and low frequency bounds? I'm interested because I was hoping to estimate frequency ranges with the output data. Apologies if I'm misunderstanding how prediction output will be formatted.

Thanks for your help!

NickleDave (Collaborator) commented Jan 22, 2024

Hi @sfcooke96!

Thank you for providing a detailed bug report and the zip with a couple samples to test with. 🙏

I think I might have confused you with my snippet on the other issue.

When you use your data, you'll want to specify the path to those files as the first argument to crowsetta.formats.bbox.Raven.from_file, like so:

crowsetta.formats.bbox.Raven.from_file(
    'troubleshooting/data1.txt'
)

I was able to do this and load the file without issue.
You don't need to specify the annot_col since it has the default name for Raven (the example data we have is from a dataset that uses a different name for their annotations column). Seems like we handle extra columns gracefully (I guess I programmed the class better than I thought 😏 ).
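
A quick way to see which column a given Raven selection table uses for annotations is to read only its header. The following is a minimal sketch, assuming the file is tab-delimited (as Raven selection tables usually are). The helper name `raven_columns` and the sample text are made up for illustration, and this uses pandas directly rather than any crowsetta API.

```python
import io

import pandas as pd


def raven_columns(path_or_buffer):
    """Return the column names of a tab-delimited Raven selection table.

    Hypothetical helper for illustration; not part of crowsetta.
    """
    # nrows=0 reads only the header row, so this stays cheap for large files
    header = pd.read_csv(path_or_buffer, sep="\t", nrows=0)
    return list(header.columns)


# Made-up sample standing in for a real file such as 'troubleshooting/data1.txt'
sample = io.StringIO(
    "Selection\tView\tChannel\tBegin Time (s)\tEnd Time (s)\t"
    "Low Freq (Hz)\tHigh Freq (Hz)\tAnnotation\n"
    "1\tSpectrogram 1\t1\t154.387793\t154.911598\t2878.2\t4049.0\tEATO\n"
)
columns = raven_columns(sample)
print(columns)
```

With a real file, pass its path instead of the `StringIO` buffer. If the column holding the labels is not named 'Annotation', that name is what would go in `annot_col`.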

You'll also need to loop over all your files and save each of them with a separate name, so you don't overwrite the previous one you saved.
Please try this short script and see if you get separate files, each with the appropriate number of rows.

import pathlib

import crowsetta
import numpy as np

# this is where we get our files from
src_dir = pathlib.Path('./troubleshooting')
# next line: sorted because 
# https://www.vice.com/en/article/zmjwda/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies
src_txt_files = sorted(src_dir.glob('*.txt'))

# this is where we save the files (so we don't overwrite the originals)
dst_dir = pathlib.Path('./annots-simple-seq')
dst_dir.mkdir(exist_ok=True)

# to save ourselves from a typo
assert dst_dir != src_dir

for txt_file in src_txt_files:
    print(
        f"Converting Raven file to simple-seq format: {txt_file}"
    )
    annot = crowsetta.formats.bbox.Raven.from_file(
        txt_file
    ).to_annot()

    onsets_s = []
    offsets_s = []
    labels = []
    for bbox in annot.bboxes:
        onsets_s.append(bbox.onset)
        offsets_s.append(bbox.offset)
        labels.append(bbox.label)
    onsets_s = np.array(onsets_s)
    offsets_s = np.array(offsets_s)
    labels = np.array(labels)
    simpleseq = crowsetta.formats.seq.SimpleSeq(
        onsets_s=onsets_s,
        offsets_s=offsets_s, 
        labels=labels,
        annot_path='/dummy/path/doesnt/matter/here'
    )
    dst_txt_file = dst_dir / txt_file.name
    print(
        f"Saving converted simple-seq file: {dst_txt_file}"
    )
    simpleseq.to_file(dst_txt_file)
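
To check that each converted file has the appropriate number of rows, one option is to compare row counts between a source file and its converted counterpart. This is a rough sketch under a few assumptions: each Raven file is tab-delimited with one header row, each simple-seq file is comma-separated with one header row, and `count_data_rows` is a hypothetical helper. The temporary files below are made-up stand-ins for real data.

```python
import csv
import pathlib
import tempfile


def count_data_rows(path, delimiter):
    """Count non-empty rows after the header in a delimited text file."""
    with open(path, newline="") as f:
        rows = [row for row in csv.reader(f, delimiter=delimiter) if row]
    return max(len(rows) - 1, 0)  # subtract the header row


# Made-up files standing in for './troubleshooting' and './annots-simple-seq'
tmp = pathlib.Path(tempfile.mkdtemp())
raven_file = tmp / "raven_example.txt"
raven_file.write_text(
    "Selection\tBegin Time (s)\tEnd Time (s)\tAnnotation\n"
    "1\t154.39\t154.91\tEATO\n"
    "2\t167.53\t168.17\tEATO\n"
)
seq_file = tmp / "simple_seq_example.csv"
seq_file.write_text(
    "onset_s,offset_s,label\n"
    "154.39,154.91,EATO\n"
    "167.53,168.17,EATO\n"
)

n_src = count_data_rows(raven_file, delimiter="\t")
n_dst = count_data_rows(seq_file, delimiter=",")
print(f"{raven_file.name}: {n_src} rows, {seq_file.name}: {n_dst} rows")
```

A mismatch is a cue to look more closely at that file rather than proof of a bug, since a Raven table can list the same selection under more than one view.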

Just let me know if you have any questions about what this is doing!
Happy to share the ~five things I've managed to learn about Python and just keep recycling 😜

> Another question while we're here

Re: the TweetyNet model, please see my reply on the issue on the TweetyNet repo: yardencsGitHub/tweetynet#223 (comment)

sfcooke96 (Author) commented

@NickleDave, thank you - this solution seems to have worked! On to prepping, training, and predicting.

Thanks a lot for your active support here! 🙏

NickleDave (Collaborator) commented

Of course, glad to hear it's working @sfcooke96!
I will go ahead and close this issue.
