Data formatting issue: missing unique IDs & stray HTML #9
Comments
Thanks Gwern. Yes, there are a few HTML lines left in the raw data. For our experiments with the tokenized transcriptions, we only add X: to generated transcriptions when we want to notate them. Please let us know about your results!
I see. That may be fine with your particular tool, but others, like abc2midi, require them. You can see the current GPT-2 ABC samples at https://mega.nz/#!rDhTBYCA!6DK6dnPFAyIjwad7On_S3H59rI0Lze4VfRPjM6swM3I (9742 steps at n=5, loss of 0.46), and valid samples compiled to OGG Vorbis: https://www.dropbox.com/s/liknp4tz61jvgac/2019-10-21-gwern-gpt2-folkrnn-samples.ogg?dl=0
We don't want a model to waste resources learning about X: and other non-musical information, so we leave those things out. We use a simple script to render the model output to compliant ABC.

Thanks for sharing some of your results! So that I understand what you have done: you have taken the GPT-2 language model, trained it on a concatenation of our datasets, and then had the resulting model generate new text data one character at a time? Did you prime the model with anything? I see some titles are repeated within samples. How did you select these samples?

Some of the results are very good! Like this one: X: 16017
That has good repetition and variation. It's a rather boring polka, but it works.
Here's a nice and simple strathspey:
This is a good reel: the two parts are nicely related.
Another great reel:
This one goes a bit off the rails: it doesn't sound particularly Irish, but I like the first two parts and how they work against the time signature.
Here's a nice goofy hornpipe:
The following four-part reel holds together nicely. I am surprised that the model returns to the same ending in each part. X: 71612
Finally found one that is not as good as the above: the last part is only seven bars.
Here's a poor slide. Here's a better slide.

Very impressive results in general!
Any model which is unable to learn that a field
Yes, but it's not quite 'one character'; it's a 'byte-pair encoding' adaptively chosen on English text (see the GPT-2 OA paper), which is an encoding intermediate between character and 'word'. It can still represent all individual characters; it is just also able to chunk much larger segments of English text, up to whole common words. For GPT-2 training, we usually freeze the BPE embedding to save a ton of VRAM, so it won't adapt; and since ABC format isn't very English-vocab-like, this is probably roughly equivalent to character-level training, because the model is forced to fall back on character-equivalent BPEs to represent unusual fragments like 'A2B'. I don't think this is much of a problem, because GPT-2 has a window of somewhere around 500, so an entire ABC transcription should fit within its window already (which is why it's so easy for it to repeat or vary itself and maintain some degree of global coherency). But one could go back, unfreeze the embedding, retrain, and see if that helps.

The samples here are entirely unprimed; these were generated unconditionally. (The sampling is the new 'nucleus sampling' method, an improved variant on the old temperature Boltzmann sampling.) These samples are also entirely randomly selected: I just dumped 1000 'samples' (which internally switch between multiple pieces). I'm not sure why titles repeat. It looks very nonrandom, but the model should vary titles more than that, because the separate pieces are clearly separated by the

Overall the main problem with the samples appears to be the
We want to model music and not peculiarities of ABC, and we want a compact and expressive vocabulary that facilitates interpretability (see my growing series https://highnoongmt.wordpress.com/2019/08/21/making-sense-of-the-folk-rnn-v2-model-part-12 ). The resulting folk-rnn models are surprisingly successful, even though they have on the order of 1000 times fewer parameters than GPT-2 (https://www.youtube.com/channel/UC7wzmG64y2IbTUeWji_qKhA , https://soundcloud.com/oconaillfamilyandfriends ). From what I have seen, folkrnn models produce transcriptions of about the same quality as yours, for both Irish and Swedish dance music. (Do you know about https://folkrnn.org/ and https://themachinefolksession.org/ ?) I can see what you mean by saving effort, the spirit of push-button systems, etc., but trying to stay true to ABC is no less an arbitrary decision. It of course depends on what you are trying to do. So, what are you trying to do?
In the training data, which comes from thesession.org, tunes appear with settings, so one might have several settings of Connaughtman's Rambles one after another. That could be why.
There are three versions of our data. The first is just a cleaned-up version of the dataset from thesession.org; we used this dataset to create our v1 model (which was just char-rnn). The second is a tokenized version, in which we transposed all tunes to have a root of C and removed all titles; this created our v2 model (which is implemented at https://folkrnn.org). The third dataset is a further tokenized version, where we transpose all tunes to have a root of C and C#, remove titles, make all pitches explicit, and change the K: field to a mode. If a generated tune has a lot of sharps, just change K:min to K:C#min, etc.; if a generated tune does not have a lot of sharps, change K:maj to K:Cmaj, etc. Have you trained your model on a single concatenation of all these datafiles? Do you have your models online somewhere to try?
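For anyone following along, here is a minimal sketch of the kind of post-processing described above (turning a mode-only K: field back into compliant ABC). The sharp-counting heuristic and the function name are my assumptions for illustration, not the authors' actual script.

```python
# Hypothetical post-processing sketch, not the authors' actual script:
# re-attach a root to a mode-only K: field based on how sharp-heavy the
# generated transcription is, and give it an X: reference number so that
# tools like abc2midi will accept it.
import re

def render_to_abc(generated: str, tune_number: int) -> str:
    # Crude heuristic (an assumption): lots of '^' accidentals => treat as C#.
    root = "C#" if generated.count("^") > 3 else "C"
    body = re.sub(r"K:(maj|min|dor|mix)",
                  lambda m: f"K:{root}{m.group(1)}",
                  generated)
    return f"X: {tune_number}\n{body}"

print(render_to_abc("M:6/8\nK:min\n|:^c^de ^f^g^a:|", 1))
```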
I posted a blog about some of these tunes! https://highnoongmt.wordpress.com/2019/10/28/machine-folk-from-gpt-2/
I haven't put the models up so far because it's not really done. For example, the 1k sample from above had a loss of 0.46, but it turns out it can go as low as 0.09 on just the original corpus (minus the other transformed versions) before my simple plagiarism checks using
So that's where the max-likelihood model currently is. I'm still struggling with the RL tuning. My hacks worked fine for poetry generation, but led to really bad divergence in every training run for the currently-final 117M model I trained, despite my rating 800+ pairs of samples. The average reward would get worse every iteration, and it would degenerate, until sometimes it discovered a highly repetitive sequence like 'X. X. X.' which somehow earned a relatively high reward from the reward model, and would continue with that...

My current theory as to what is going on is that the OpenAI RL tuning code is framed as a conditional generation task: a prompt -> response. I've been using it for unconditional generation by hacking the config to change one of the tasks to poetry and expecting the model to simply ignore the random English words being used to prompt it. This is OK for the poetry model, because it just ignores the prompts or works off of them (poetry is robust like that), but I think what happens with the ABC 117M reward model is that it is so finely tuned to ABC notation that when it sees a string with a random English word in it, it knows it's invalid ABC and penalizes it harshly, so every generated sample looks bad to it, and this destroys the training dynamics. What I need to do is somehow eliminate the prompt or make it innocuous, like zeroing it out with spaces, so all the generated samples can be valid ABC and the reward model can start producing a sane training signal for the generator...

I haven't quite done that, because all of the rating of the poems and music samples has really exhausted me. I switched back to working on the poems while I meditated on what was going wrong with the music, since adding more ratings clearly was not fixing things. (The poetry isn't suffering any weird problems like that, but the improvements from RL training are thus far small, and I suspect poetry is intrinsically much more difficult than music and may simply require way more ratings to train a reward model good enough to make a dramatic difference.)
Ah. Now I see what you mean. Yes, that would explain it. I think that's OK to leave in place? I thought it was some sort of error by the model, but since they are all settings or 'variants' in the original (right?), then it'd be interesting to see the model also generate multiple 'variants' on a theme.
Sure. You can always get decent performance with small models, because the log-loss decreases roughly logarithmically with parameter count. And NNs are always highly overparameterized, so you should be able to cut down the trained GPT-2 by a factor of 10 or 100 without degrading quality too badly. My belief is that even a 117M is more than big enough and expressive enough to generate great music, and the problems are architectural or training-related (optimizing for the wrong thing). I did hope that starting with OA's GPT-2-117M would make the English titles more interesting because the original model knows all sorts of names and objects, but it seems once you've trained it far enough to generate good ABC, it's just largely memorized the titles. Too small a dataset to benefit from the pretraining. Oh well.
Yes.
Oh, my only real goal here was to see if switching to the RL setting could yield a qualitative improvement over likelihood training using the same data as the starting point, because RL seems to me philosophically the right way to approach the problem, while likelihood fundamentally optimizes for the wrong thing. Aside from that, I am mildly curious how much a Transformer can improve over an older RNN.
To update this: I figured out the problem that was causing divergence. It turned out not to be the prompt issue, but once I fixed that, the real problem became more obvious. The OA code has a sort of heuristic where it looks for a particular character/BPE as a validity check, and ignores the score in favor of a fixed penalty if the check fails; the check they had was to search for '.', which they interpret as indicating a valid English sentence. Unfortunately, periods show up very infrequently in ABC music... So every sample was being given a heavy penalty and quality was being ignored. Once I removed that check entirely, the RL tuning code now runs just fine.

I've begun the iterative loop and so far have done ~3 loops and have rated ~2136 pairs of samples. My first impressions are that the RL finetuning is improving the overall syntactic validity of the ABC samples, and has largely eliminated the repetition failure mode (like in the poetry samples). I think it has also made them more 'musical', but it's hard for me to say for sure. (For example, there seems to be much more in the way of 'endings' than I usually see with char-RNN output, but I'm not sure how much the RL finetuning is helping there versus the Transformer seeing the entire ABC sample.)
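To make the failure mode concrete, here is a minimal sketch in my own words (not the OpenAI preference-learning code) of a validity-check heuristic like the one described: when the check demands a '.' and ABC transcriptions almost never contain one, every sample collapses to the fixed penalty and the reward model's score never reaches the training signal.

```python
# Illustrative sketch only (not the OpenAI code): the learned reward is
# replaced by a fixed penalty whenever a "validity" check fails.
def shaped_reward(sample: str, reward_model_score: float,
                  require_period: bool = True, penalty: float = -1.0) -> float:
    if require_period and "." not in sample:
        # ABC rarely contains '.', so with this check on, every music sample
        # gets the same penalty and its actual quality is ignored.
        return penalty
    return reward_model_score

abc = "X: 1\nM:6/8\nK:Cdor\nGEC GEC|GEC E2D|"
print(shaped_reward(abc, reward_model_score=0.8))                        # -1.0
print(shaped_reward(abc, reward_model_score=0.8, require_period=False))  # 0.8
```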
Incidentally, I noticed some more problems with the original data. There are English comments dumped in at random in various places. For example, if you grep for 'should' you can find lines like
which don't look like ABC to me and make abc2midi unhappy when they pop up in GPT-2-generated outputs.
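For illustration, a rough cleaning pass along these lines might look like the sketch below; the heuristics (keep ABC header fields and bar-containing lines, drop stray </html> lines and free-text commentary) are my assumptions, not the repository's actual cleaning script.

```python
# Rough cleanup sketch (an assumption, not the repository's cleaning script):
# drop </html> remnants and lines of English commentary that are neither ABC
# header fields (X:, T:, M:, K:, ...) nor tune-body lines containing bar lines.
import re

ABC_FIELD = re.compile(r"^[A-Za-z]:")

def clean_abc(text: str) -> str:
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append(line)        # blank lines separate tunes; keep them
            continue
        if stripped == "</html>":
            continue                 # stray HTML remnant
        if ABC_FIELD.match(stripped) or "|" in stripped:
            kept.append(line)        # header field or music line with bars
        # anything else (e.g. "This should be played slowly") is dropped
    return "\n".join(kept)
```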
Great! Please post samples once the model has finished training and I can take a closer look. The "cleaned" data apparently still has a ways to go! Contributors to thesession.org have left comments on tune settings, and my cleaning script has missed many of them. You are seeing those. Hopefully the parsed datasets don't have these.
So, an update. I've partially written up all the work to date at https://www.gwern.net/GPT-2-music. The model has enjoyed two major updates since you listened to the samples:
The RL preference learning is still ongoing, but I decided to leave that for another writeup, and cover just the baseline ABC GPT-2. You mention in the blog post you were going to compare it to folk-RNN pieces, but I hope you'll redo your assessment with the latest GPT-2 model samples before you do that!
To update: I regret to report that the preference-learning was a failure. However, there is good news in that we are now experimenting with training GPT-2 with context windows up to 30,000 (that is not a typo), and with generating the ABC-equivalents of MIDI files.
Very interesting experiments! Might you be interested in participating in this challenge: https://www.kth.se/en/eecs/om-oss/konferenser-och-event/aimusic2020/ai-music-generation-challenge-2020-1.946478
OT: Given that the jig is a dance form, I think danceability should be added to the rating system.
It is a part of stage 1, point b, and stage 2, point a(ii).
I think I would have a hard time understanding and passing all the constraints: that's a lot of conditions to backfit onto generation. You could probably do reasonably well by training on all the ABC available, if it came with labels identifying double jigs and other forms to condition generation on, and maybe tacking on a filtering step with a reward model. I doubt I'll do it, because once we finish with the 30k-context-window GPT-2 (it should be long since done, but our TPUs got preempted by a massive surge in TPU demand) we'll be moving on to scaling up StyleGAN 2 to try to generate 1024px anime images.
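A minimal sketch of the conditioning-by-prefix idea just mentioned, purely as an assumption about how one might set it up (nothing here was actually implemented in this thread): keep a rhythm label at the top of every training example so generation can later be prompted with the desired dance form.

```python
# Hypothetical label-conditioning sketch, not anything implemented here:
# prepend a rhythm label (ABC's R: field) to every training example so the
# trained model can later be prompted with the desired form.
def make_training_example(tune_body: str, rhythm_label: str, n: int) -> str:
    return f"X: {n}\nR: {rhythm_label}\n{tune_body}\n"

print(make_training_example("M:6/8\nK:Gmaj\nGAB gdB|...", "double jig", 1))

# At sampling time, prompt with the form you want, then filter the outputs
# with a reward model or an ABC validator:
prompt = "X: 1\nR: double jig\n"
# samples = model.generate(prompt, ...)   # hypothetical generation call
```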
Yes, "solving" the task will require thinking about the problem instead of just throwing data at it. Especially so since the dataset for comparison only has 365 double jigs. There will be a variety of baseline models for comparison, including folkrnn. |
We've finished and written up the MIDI generation, with lots of samples, at https://www.gwern.net/GPT-2-music#generating-midi-with-10k30k-context-windows. Interested to hear what anyone makes of our post-October-2019 results. (StyleGAN 2 has proven, as I suspected, unable to generate large-scale complex images due to its architecture, although we got further than most people would have predicted, so we've moved on to BigGAN and have quite nice results already; 512px is looking feasible, although you guys probably don't care about generating anime etc.)
Thanks for the update, Gwern. I listened to several of the examples, but didn't hear any that resemble Irish traditional music. IMO, the two-orders-of-magnitude increase in model size over the folk-rnn models is not showing any advantages. I did enjoy some of your generated examples, though. This one reminds me of Steve Martland: https://www.gwern.net/docs/ai/music/2020-04-15-gpt2-midi-thesessionsabc-6473931123.mp3. It's a large amount of money you have spent training your model! Why are you doing this work?
If you wanted to reliably get out just Irish music, you'd probably need to finetune it. I didn't try that, because I thought the original model covered that task pretty well and I was more interested in generating a broader variety of music. The Session is, all things considered, a very restricted niche of music, so it's not too surprising that very small models can do a decent job on it. Generating MIDI in general is a far broader and harder task: you say that the increased model size isn't helping, but nevertheless, it underfit the MIDI dataset substantially, to my surprise. If I had known we'd be able to hit a loss of only 0.2, I would've started with a larger model than just GPT-2-117M, and accepted any necessary limits on context window. Since likelihood loss seems to track quality so closely and the final few increments of loss make a big difference perceptually, it's possible we could've done a lot better. As for why: well, no one else is trying, and I could, and it's interesting. It works better than most people would've expected, which is good enough for me.
Thanks. There are several researchers achieving compelling results with ML modeling and generation of different styles of music, e.g., https://magenta.tensorflow.org/music-transformer, https://chrisdonahue.com/, http://dadabots.com/, https://musicai.citi.sinica.edu.tw/, https://csl.sony.fr/team/gaetan-hadjeres/, https://www.york.ac.uk/music/staff/academic/tom-collins/, not to mention a variety of start-ups like AIVA, Melodrive, and Jukedeck (now part of TikTok). Many of these groups are also studying their systems in contexts of music creation, seeing how useful they actually are, i.e., how they can contribute to music creation. It is a simple matter to throw a bunch of data at ML wizardry. Have you thought about going further, to see what your model can contribute by working with people who create music? If I find some time, I will download your models and interact with them in the same way I do with folkrnn in creating music.
I've been trying out ABC with GPT-2 along the lines of my poetry generation (max-likelihood training, and then I'll use OA's RL preference-learning for finetuning), and I've found that some of the ABC files in data/ appear to have two issues:

1. There are ~2k stray </html> lines, which should be removed, as this is not ABC music (unsure whether it is syntactically invalid, but they definitely shouldn't be there). These can be easily removed with any search-and-replace.
2. Lack of unique IDs: some ABC compilers (specifically, abc2midi) require unique IDs like X: 1 for each ABC song, and http://abcnotation.com/wiki/abc:standard:v2.1#xreference_number notes that this field is required. (I worked around this by using an Emacs macro to prefix each tune with an incrementing X: integer.)
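As an alternative to the Emacs-macro workaround mentioned above, here is a small sketch that handles both issues; it assumes tunes in the file are separated by blank lines, which is my assumption about the data layout rather than something verified here.

```python
# Hypothetical preprocessing sketch (not part of this repository): strip the
# stray </html> lines and give every blank-line-separated tune a unique X: number.
import sys

def fix_abc(text: str) -> str:
    text = text.replace("</html>", "")
    tunes = [t.strip() for t in text.split("\n\n") if t.strip()]
    fixed = []
    for i, tune in enumerate(tunes, start=1):
        if not tune.startswith("X:"):
            tune = f"X: {i}\n{tune}"
        fixed.append(tune)
    return "\n\n".join(fixed) + "\n"

if __name__ == "__main__":
    sys.stdout.write(fix_abc(sys.stdin.read()))
```

Run it as, e.g., `python fix_abc.py < input.abc > output.abc` (the filenames are placeholders).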