Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokenize_midi_dataset reproducing source file tree, overwrite_mode #82

Merged
merged 7 commits into from
Oct 13, 2023

Conversation

Natooz
Copy link
Owner

@Natooz Natooz commented Oct 11, 2023

Following #79, this PR make tokenize_midi_dataset reproduce the source files file tree in out_dir.
It also allows to handle file overwriting if files already exist in the saving path.

The file tree reproduction has also to be made for data_augmentation_dataset

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (ed0d500) 90.49% compared to head (599a129) 90.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
- Coverage   90.49%   90.41%   -0.09%     
==========================================
  Files          31       31              
  Lines        4463     4536      +73     
==========================================
+ Hits         4039     4101      +62     
- Misses        424      435      +11     
Files Coverage Δ
miditok/__init__.py 68.42% <100.00%> (+3.71%) ⬆️
miditok/constants.py 100.00% <100.00%> (ø)
miditok/pytorch_data/datasets.py 100.00% <100.00%> (ø)
miditok/tokenizations/cp_word.py 95.07% <100.00%> (ø)
miditok/tokenizations/mmm.py 95.21% <100.00%> (ø)
miditok/tokenizations/remi.py 95.93% <100.00%> (ø)
miditok/tokenizations/structured.py 97.03% <100.00%> (ø)
miditok/tokenizations/tsd.py 97.99% <100.00%> (ø)
setup.py 0.00% <ø> (ø)
tests/test_bpe.py 90.10% <100.00%> (ø)
... and 7 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Natooz Natooz marked this pull request as ready for review October 12, 2023 07:15
@Natooz
Copy link
Owner Author

Natooz commented Oct 12, 2023

@leleogere I'll leave this open a few days in case you want to review / test 😄

if isinstance(midi_paths, str):
midi_paths = Path(midi_paths)
root_dir = midi_paths
midi_paths = sum(list(midi_paths.glob(f"**/*.{ext}")) for ext in MIDI_FILES_EXTENSIONS)
Copy link
Contributor

@leleogere leleogere Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try this? I suspect this would cause an error, I think that it should be more something like the following:

sum((list(midi_paths.glob(f"**/*.{ext}")) for ext in MIDI_FILES_EXTENSIONS), [])
    ^                                                                     ^^^^^

EDIT: I confirm that after testing, this raises an error. The [] is needed to specify to sum the start of the aggregation.
EDIT2: You sould remove a dot either in the glob or in the MIDI_FILES_EXTENSIONS items as currently there is 2 dots and therefore midi_paths is empty after that.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well found thank you!
For the extension we should just remove the dot in the line above.
Do you want to make the code change suggestion or do I make the changes myself ?

@Natooz
Copy link
Owner Author

Natooz commented Oct 12, 2023

Thank you for the review, waiting for your response.
In the meantime I'll fix what has been broken here during the day

Copy link
Contributor

@leleogere leleogere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest proposed changes (Now that I know how to do a code suggestion, I'll do that directly next time!)

@Natooz
Copy link
Owner Author

Natooz commented Oct 12, 2023

Committed!
Yes, it allows you to also be author of the changes, as you are the one contributing :)

@Natooz
Copy link
Owner Author

Natooz commented Oct 12, 2023

I'll try to improve coverage before merging

@Natooz Natooz merged commit 8cd6a67 into main Oct 13, 2023
14 checks passed
@Natooz Natooz deleted the tokenize-dataset-file-tree branch October 31, 2023 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants