In this document, we demonstrate our team's standard data processing pipeline. By following the instructions and running the corresponding Python scripts, you can easily generate and customize your own dataset.
We collect audio clips of piano performances from YouTube.
- run Google Magenta's Onsets and Frames to transcribe the audio clips into MIDI
In this step, we use madmom for beat/downbeat tracking. Next, we interpolate 480 ticks between two adjacent beats and map each absolute time to its corresponding tick. Lastly, we infer the tempo changes from the time intervals between adjacent beats. We choose a beat resolution of 480 because it is a common setting in modern DAWs. Note that we do not quantize any timing in this step, so tiny timing offsets are preserved for future use.
- run `synchronizer.py`
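
As a rough illustration of the synchronization logic (not the actual `synchronizer.py`), the sketch below tracks beats with madmom and linearly maps note onset times onto a 480-ticks-per-beat grid; the file name and onset times are placeholders.

```python
# A minimal sketch of the beat-to-tick mapping described above, assuming
# madmom is installed and note onset times (in seconds) come from the
# transcription step. Variable names are illustrative only.
import numpy as np
from madmom.features.downbeats import RNNDownBeatProcessor, DBNDownBeatTrackingProcessor

TICKS_PER_BEAT = 480  # common resolution in modern DAWs

def track_beats(audio_path):
    """Return beat times (seconds) and each beat's position inside the bar."""
    activations = RNNDownBeatProcessor()(audio_path)
    tracker = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
    beats = tracker(activations)          # shape (n_beats, 2): [time, beat-in-bar]
    return beats[:, 0], beats[:, 1].astype(int)

def seconds_to_ticks(onset_times, beat_times):
    """Interpolate TICKS_PER_BEAT ticks between adjacent beats."""
    beat_ticks = np.arange(len(beat_times)) * TICKS_PER_BEAT
    # np.interp keeps the fractional offset, so micro-timing is preserved
    return np.interp(onset_times, beat_times, beat_ticks)

def infer_tempi(beat_times):
    """Derive a bpm value per beat from the inter-beat interval."""
    return 60.0 / np.diff(beat_times)

if __name__ == "__main__":
    beat_times, beat_in_bar = track_beats("example.wav")  # beat_in_bar == 1 marks downbeats
    note_onsets = [1.02, 1.51, 2.73]                       # from the transcribed MIDI
    print(seconds_to_ticks(note_onsets, beat_times))       # fractional ticks, unquantized
    print(infer_tempi(beat_times))                         # tempo curve in bpm
```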
In this step, we use our in-house rule-based symbolic melody extraction and chord recognition algorithms to obtain the desired information. The code for chord recognition is open-sourced here. We plan to open-source the code for melody extraction in the future.
- run `analyzer.py`
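
Since the melody extraction code is not yet released, the snippet below only shows the classic "skyline" baseline (keep the highest sounding pitch at any moment) using miditoolkit; it is not the in-house algorithm used in `analyzer.py`.

```python
# Skyline melody extraction baseline, for illustration only.
from miditoolkit.midi.parser import MidiFile

def skyline_melody(midi_path):
    """Return the notes forming the upper pitch envelope of the first track."""
    midi = MidiFile(midi_path)
    notes = sorted(midi.instruments[0].notes, key=lambda n: (n.start, -n.pitch))
    melody = []
    for note in notes:
        # skip this note if the last kept note is still sounding at a higher (or equal) pitch
        if melody and melody[-1].end > note.start and melody[-1].pitch >= note.pitch:
            continue
        melody.append(note)
    return melody
```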
We quantize everything (duration, velocity, bpm) in this step, and append an EOS (end of sequence) token to each piece.
- run `midi2corpus.py`
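
A minimal sketch of this kind of quantization follows; the bin edges are illustrative assumptions, not the exact settings used in `midi2corpus.py`.

```python
# Snap continuous attributes to discrete bins and append an EOS marker.
import numpy as np

# illustrative bins (assumptions, not the project's exact settings)
DURATION_BINS = np.arange(60, 3841, 60)   # ticks, 32nd-note steps at 480 ticks/beat
VELOCITY_BINS = np.linspace(1, 127, 32)   # 32 velocity levels
TEMPO_BINS = np.arange(32, 225, 3)        # bpm grid

def quantize(value, bins):
    """Snap a continuous value to the nearest bin."""
    return bins[np.argmin(np.abs(bins - value))]

def to_corpus(notes, tempo_changes):
    """Quantize every note and tempo change, then close the piece with EOS."""
    corpus = []
    for note in notes:
        note["duration"] = quantize(note["duration"], DURATION_BINS)
        note["velocity"] = quantize(note["velocity"], VELOCITY_BINS)
        corpus.append(note)
    tempi = [quantize(t, TEMPO_BINS) for t in tempo_changes]
    corpus.append({"type": "EOS"})        # explicit end-of-sequence marker
    return corpus, tempi

if __name__ == "__main__":
    notes = [{"pitch": 60, "duration": 473, "velocity": 66}]
    print(to_corpus(notes, [121.7]))
```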
We have 2 kinds of representation, Compound Word (CP) and REMI, and 2 tasks, unconditional and conditional generation, which results in 4 combinations. Go to the corresponding folder `<task>/<repre>` and run the scripts.
- run `corpus2events.py`: to generate human-readable tokens and re-arrange the data.
- run `events2words.py`: to build the dictionary and renumber the tokens (a dictionary-building sketch follows this list).
- run `compile.py`: to discard disqualified songs that exceed the length limit, reshape the data for Transformer-XL, and generate the masks for variable lengths (a reshaping sketch follows this list).
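
For reference, the sketch below shows one common way to build a dictionary and renumber events, in the spirit of `events2words.py`; the event format, helper names, and output file name are assumptions.

```python
# Build an event vocabulary and convert event sequences into integer tokens.
import collections
import pickle

def build_dictionary(all_event_sequences):
    """Collect every distinct event and assign it an integer id."""
    counter = collections.Counter(
        f"{e['name']}_{e['value']}" for seq in all_event_sequences for e in seq
    )
    event2word = {event: idx for idx, (event, _) in enumerate(counter.most_common())}
    word2event = {idx: event for event, idx in event2word.items()}
    return event2word, word2event

def events_to_words(seq, event2word):
    """Renumber a sequence of events into integer tokens."""
    return [event2word[f"{e['name']}_{e['value']}"] for e in seq]

if __name__ == "__main__":
    songs = [[{"name": "Note Pitch", "value": 60}, {"name": "Note Duration", "value": 480}]]
    e2w, w2e = build_dictionary(songs)
    print(events_to_words(songs[0], e2w))
    with open("dictionary.pkl", "wb") as f:
        pickle.dump((e2w, w2e), f)
```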
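
Similarly, here is a rough sketch of the compile step: drop over-long songs, pad and reshape the token sequences into fixed-length segments for Transformer-XL, and build the corresponding masks. The length limit, segment length, and pad id below are assumptions, not the values used in `compile.py`.

```python
# Reshape token sequences into fixed-length segments with padding masks.
import numpy as np

MAX_LEN = 4096        # discard songs longer than this (assumption)
SEGMENT_LEN = 512     # Transformer-XL segment length (assumption)
PAD_ID = 0            # id of the padding token in the dictionary (assumption)

def compile_songs(token_sequences):
    """Drop over-long songs, pad to a multiple of SEGMENT_LEN, build masks."""
    kept, masks = [], []
    for tokens in token_sequences:
        if len(tokens) > MAX_LEN:
            continue  # disqualified: exceeds the length limit
        n_segments = int(np.ceil(len(tokens) / SEGMENT_LEN))
        padded = np.full(n_segments * SEGMENT_LEN, PAD_ID, dtype=np.int64)
        mask = np.zeros(n_segments * SEGMENT_LEN, dtype=np.float32)
        padded[:len(tokens)] = tokens
        mask[:len(tokens)] = 1.0  # 1 = real token, 0 = padding
        # one row per segment, so Transformer-XL can consume the song segment
        # by segment while carrying its memory across segments
        kept.append(padded.reshape(n_segments, SEGMENT_LEN))
        masks.append(mask.reshape(n_segments, SEGMENT_LEN))
    return kept, masks

if __name__ == "__main__":
    x, m = compile_songs([list(range(1, 700))])
    print(x[0].shape, m[0].sum())  # (2, 512) 699.0
```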
Alternatively, you can refer to here to obtain the entire workspace and the pre-processed training data that were originally used in our paper.