Data formatting issue: missing unique IDs & stray HTML #9

gwern opened this issue Oct 20, 2019 · 20 comments

@gwern

gwern commented Oct 20, 2019

I've been trying out ABC with GPT-2 along the lines of my poetry generation (max-likelihood training and then I'll use OA's RL preference-learning for finetuning), and I've found that some of the ABC files in data/ appear to have 2 issues:

  1. there are ~2k stray </html> lines which should be removed, as they are not ABC music (unsure whether they are syntactically invalid, but they definitely shouldn't be there). They can easily be removed with any search-and-replace.

  2. lack of unique IDs: some ABC compilers (specifically, abc2midi) require unique IDs like X: 1 for each ABC song. http://abcnotation.com/wiki/abc:standard:v2.1#xreference_number notes that

    The X: field may be empty, although this is not recommended.

    (I worked around this by using an Emacs macro to prefix each tune with an X: field and an incrementing integer; a rough script equivalent is sketched below.)
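For anyone who'd rather not use Emacs, a rough Python equivalent of that cleanup might look like the following (just a sketch: it assumes tunes in the file are separated by blank lines, which may not hold for every file in data/):

```python
#!/usr/bin/env python3
# Sketch: drop stray </html> lines and give every tune a unique, incrementing X: field.
import sys

def clean(lines):
    out, tune_id, in_tune = [], 0, False
    for line in lines:
        stripped = line.strip()
        if stripped == "</html>":          # issue 1: stray HTML, just drop it
            continue
        if stripped == "":                 # a blank line ends the current tune
            in_tune = False
            out.append(line)
            continue
        if not in_tune:                    # issue 2: number the first line of each tune
            tune_id += 1
            out.append(f"X: {tune_id}\n")
            in_tune = True
            if stripped.startswith("X:"):  # replace any existing (possibly empty) X: field
                continue
        out.append(line)
    return out

if __name__ == "__main__":
    sys.stdout.writelines(clean(sys.stdin.readlines()))
```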

@boblsturm
Collaborator

Thanks, Gwern. Yes, there are a few of the HTML lines left in the raw data. For our experiments with the tokenized transcriptions, we only add X: to generated transcriptions when we want to notate them. Please let us know about your results!

@gwern
Author

gwern commented Oct 21, 2019

I see. That may be fine with your particular tool, but others, like abc2midi, will simply error out, so leaving the X: fields out is not a great situation.

You can see the current GPT-2 ABC samples at https://mega.nz/#!rDhTBYCA!6DK6dnPFAyIjwad7On_S3H59rI0Lze4VfRPjM6swM3I (9742 steps at n=5, loss of 0.46). And valid samples compiled to OGG Vorbis: https://www.dropbox.com/s/liknp4tz61jvgac/2019-10-21-gwern-gpt2-folkrnn-samples.ogg?dl=0

@boblsturm
Collaborator

We don't want a model to waste resources learning about X: and other non-musical information, so we leave those things out. We use a simple script to render the model output to compliant ABC.

Thanks for sharing some of your results! So that I understand what you have done, you have taken the GPT-2 language model and trained it on a concatenation of our datasets, and then had the resulting model generate new text data one character at a time? Did you prime the model with anything? I see some titles are repeated within samples. How did you select these samples?

Some of the results are very good! Like this one:

X: 16017
T: Johnny O'Leary's
M: 2/4
L: 1/8
K: Gmaj
|: B/A/ | G>A Bd | ed e/f/g | G>A B/c/d | e2 dg |
G>A B/c/d | ed e/f/g | G>A B/c/d | e2 d :|
|: g | gf ed | ed e/f/g | G>A B/c/d | e2 d2 |
g>f ed | ed e/f/g | G>A B/c/d | e2 d :|

That has good repetition and variation. It's a rather boring polka, but it works.

Here's a nice and simple strathspey:
X: 77304
T:The Heathery Cruach
M:4/4
K:Cmaj
|:c>A|G2C>D E<G G<c|e>dc>edB|c>AG<CEe|d>Ge>dc2:|
|:B>c|dfd>cA<c|e>dc>edf|e>fg>ed>cd>e|f>de>dc2:|

This is a good reel:
X: 90323
T:The White Crow
M:4/4
K:Cdor
|:g2fgec(3ccc|efgabgfg|bgfgec(3ccc|efgec2ce|
g2fgec(3ccc|efgab3c'|bgfdefga|(3bagfdec(3ccc:|
|:b3gc'2c'b|gbfbefga|b3gc'2c'b|gbfdec(3ccc|
b3gc'2c'b|gbfgb3c'|bgfdefga|(3bagfdec(3ccc:|

The two parts are nicely related.

Another great reel:
X: 114292
M:4/4
K:maj
=G|:=G=C=E=C=G=C=A=C|=G=C=E=C=G=C=E=C|=D=A,=D=E=F=D=E=C|=D=A,=D=E=F=E=F=A|=G=C=E=C=G=C=A=C|=G=C=E=C=G=C=E=C|=D=A,=D=E=F=G=A=B|1=c=A=G=E=D2=E=F:||2=c=A=G=E=D=A,=D=E|:=F3=G=A2=A=B|=c=d=c=A=A=G=E=D|=C3=E=G=C=E=C|=F=A=G=F=E=D=C=E|=F3=G=A3=B|=c=d=c=A=A=G=E=C|=D=A,=D=E=F=G=A=B|1=c=A=G=E=D=A,=D=E:||2=c=A=G=E=D2=E=F|

This one goes a bit off the rails:
X: 107310
M:6/8
K:Cmix
|:=c3=B=A=G|=G=A=B=d2=B|=c2=d=e=d=c|=f2=d=e=d=c|
=B=d=G=B=d=G|=B=d=G=B=d=G|=c3=d=c=B|=G3=G3:|
|:=g2=a=b=a=g|=f=d=e=f2=g|=a=f=d=f=g=a|=b3=c'=b=a|
=b=d'=b=c'=a=g|=b=d'=b=c'=a=g|=b=a=g=f=e=d|=g3=g3:|
|:=d3=c3|=d=e=f=g=a=b|=d3=c3|=d=g=f=g=a=c'|
=b=a=g=d3|=d=e=f=g=a=c'|=b=a=g=f=g=a|=b=a=g=f=d=c|
=d3=c3|=d=e=f=g=a=b|=d3=c3|=d=g=f=g=a=c'|
=b=a=g=a=b=c'|=d'=b=c'=a=b=g|=a=f=d=e=f=g|=a=d=d=e=f=g|
=a=b=c'=a=b=c'|=d'=b=c'=a=b=g|=a=f=d=e=f=g|=a=d=d=e=f=g|
=a=b=c'=a=b=c'|=d'=b=c'=a=b=g|=a=f=d=e=f=g|=a=d=d=e=f=g|=a

This doesn't sound particularly Irish, but I like the first two parts and how they work against the time signature.

Here's a nice goofy hornpipe:
X: 2439
T: Corney Is Coming
M: 4/4
L: 1/8
K: Gmaj
|: (3Bc^c | dBgd egdB | cBAG FEDF | EGce DFAc | Bdgd cBAc |
(3Bc^c dB gdBG | eBcA F3 D | E3 c D3 F | (3EFE AG (3FGF :|
|: Dd | cDFA DFAc | BAGA B2 AB | cDFA DFAc | (3BcB (3ABA G2 AB |
cDFA DFAc | BAGA B2 AB | c3 A (3Bcd gd | cAFA G2 :|

The following four-part reel holds together nicely. I am surprised that the model returns to the same ending in each part.

X: 71612
T:The Cup Of Tea
M:4/4
K:Cmin
|:GccBcdec|B2FBDBFB|GccBc2ef|gbfgecBA|
GccBcdec|B2FBDBFB|GccBc2ef|gbfgecc2
|:dffde2ce|dBFBDBFB|dffde2ce|gbfgecc2|
dffde2ce|dBFBDBFB|GccBc2ef|gbfgecBA
|:GcecBcec|BFF2DFF2|GccBc2ef|gbfgecBA|
GcecBcec|BFF2DFF2|GccBc2ef|1gbfgecBA:||2gbfgecc2
|:e2gebege|d2fdadfd|e2gebege|gbfgecc2|
e2gebege|d2fdadfd|GccBc2ef|1gbfgecc2:||2gbfgecBA|

Finally found one that is not as good as the above:
X: 89143
T:The Pipe Slang
M:6/8
K:Cmaj
|:C2ccGE|F2FFED|C2ccGE|FDEFED|C2ccGE|F2FFED|E3D3|EGEFED:|
|:EGGAGG|AGGGFE|EGGAGG|cBcdBG|cBcdBG|cGED3|EGEFED:|

The last part is only seven bars.

Here's a poor slide:
X: 38347
T: Mind Your Leg Of The Train Maid Anne
M: 12/8
L: 1/8
K: Amaj
|: E3 A2B c2A A2F | E3 A2F E3 E2F |
E3 A2B c2A A2c |1 BAG F2A A3 A2F :|2 BAG F2A A3 A2e ||
a3 f2e d3 d2f | a2g f2e d3 c2e |
a3 f2e d3 d2f | a2g f2e d3 d2e |
a3 f2e d3 d2f | a2g f2e d3 d2e |
f3 e3 d2B c2d | B2A F2E A3 A2F |]

Here's a better slide:
X: 138045
M:6/8
K:C#maj
|:^G|=F>^F=F=F^D^C|^C>=F^G^c2^d|=f>^d^c^A>=c^c|^G^F=F^D2^G|=F2=F=F^D^C|^C>=F^G^c2^d|=f>^d^c^A^G^A|^c3^c2:|
|:^d|=f^d^c^c>^A^c|=f^d^c^A2^c|=f^d^c^c=c^A|^G3^G2^d|=f^d^c^c^A^c|=f^d^c^A2^f|=f>^d^c^A^G^A|^c3^c2:|

Very impressive results in general!

@gwern
Author

gwern commented Oct 22, 2019

We don't want a model to waste resources learning about X: and other non-musical information, so we leave those things out. We use a simple script to render the model output to compliant ABC.

Any model which is unable to learn, with only trivial capacity, that a field X: $RANDOM_INT is present in every entry is probably too simple-minded to be of much interest, and it saves effort not to have to post-process; I also feel it's more in the spirit of the generative approach - the more you have to do to clean it up post hoc, the less interesting it is.

So that I understand what you have done, you have taken the GPT-2 language model and trained it on a concatenation of our datasets, and then had the resulting model generate new text data one character at a time? Did you prime the model with anything? I see some titles are repeated within samples. How did you select these samples?

Yes, but it's not quite 'one character': it's a 'byte pair encoding' (BPE) adaptively chosen on English text (see OA's GPT-2 paper), which is an encoding intermediate between character and word level. It can still represent all individual characters; it is just also able to chunk much larger segments of English text, up to whole common words. For GPT-2 training, we usually freeze the BPE embedding to save a ton of VRAM, so it won't adapt; and since ABC notation isn't very English-vocab-like, the result is probably roughly equivalent to character-level modeling, because the model is forced to fall back on character-equivalent BPEs to represent unusual fragments like 'A2B'. I don't think this is much of a problem, because GPT-2 has a context window of somewhere around 500 BPEs, so an entire ABC transcription should fit within its window already (which is why it's so easy for it to repeat or vary itself and maintain some degree of global coherency). But one could go back, unfreeze the embedding, retrain, and see if that helps.
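To illustrate the fallback concretely (this is not part of my pipeline; the Hugging Face tokenizer here is just a convenient stand-in for OA's encoder, which uses the same BPE vocabulary):

```python
# Compare how the stock GPT-2 BPE vocabulary splits English prose vs. an ABC line.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

english = "The quick brown fox jumps over the lazy dog."
abc = "|: B/A/ | G>A Bd | ed e/f/g | G>A B/c/d | e2 dg |"

print(tok.tokenize(english))  # mostly whole words ('The', 'Ġquick', 'Ġbrown', ...)
print(tok.tokenize(abc))      # mostly 1-2 character fragments, i.e. near character-level
```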

The samples here are entirely unprimed. These were generated unconditionally. (The sampling is the new 'nucleus sampling' method, an improved variant on the old temperature Boltzmann sampling.)

These samples are also entirely randomly selected - I just dumped 1000 'samples' (which internally switch between multiple pieces). I'm not sure why titles repeat. It looks very nonrandom, but the model should vary titles more than that, because the pieces are clearly separated by the <|endoftext|> token, which it should have learned means that the next piece is completely unrelated to the current one (including the title); possibly it hasn't quite learned that, and the titles are stereotypical enough not to challenge it.

Overall the main problem with the samples appears to be the K field, where abc2midi complains the key is invalid for something like a third of them. For those, my rating-comparison script (for making the pairwise comparisons for the preference learning RL phase) simply defines failing to compile as being a loss of a comparison, so in theory, it should be fixed very quickly in the RL phase. I also think that there's overall too much repetition, which is a very typical problem with likelihood-trained models whether char-RNN or GPT-2, and something which is a severe problem in my poetry models, but also something that the RL training phase fixes so far for the poetry GPT-2. Finally, another problem is that some are too short - that one is mostly an artifact of the sampling (the code stops sampling after a fixed number of BPEs so a lot of songs just get cut off without having finished yet).
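The compile check in that rating-comparison script amounts to something like this sketch (assuming abc2midi is on the PATH; exactly how abc2midi signals failure varies by version, so the error test below is a guess):

```python
import os
import subprocess
import tempfile

def compiles(abc_text: str) -> bool:
    """Return True if abc2midi accepts the transcription."""
    with tempfile.NamedTemporaryFile("w", suffix=".abc", delete=False) as f:
        f.write(abc_text)
        path = f.name
    try:
        result = subprocess.run(["abc2midi", path, "-o", os.devnull],
                                capture_output=True, text=True)
        return result.returncode == 0 and "Error" not in (result.stdout + result.stderr)
    finally:
        os.unlink(path)

def compare(sample_a: str, sample_b: str, ask_human):
    """Failing to compile counts as losing the pairwise comparison outright."""
    ok_a, ok_b = compiles(sample_a), compiles(sample_b)
    if ok_a != ok_b:
        return "A" if ok_a else "B"
    return ask_human(sample_a, sample_b)  # otherwise fall back to a normal human rating
```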

@boblsturm
Collaborator

Any model which is unable to learn, with only trivial capacity, that a field X: $RANDOM_INT is present in every entry is probably too simple-minded to be of much interest, and it saves effort not to have to post-process; I also feel it's more in the spirit of the generative approach - the more you have to do to clean it up post hoc, the less interesting it is.

We want to model music and not peculiarities of ABC, and we want a compact and expressive vocabulary that facilitates interpretability (see my growing series https://highnoongmt.wordpress.com/2019/08/21/making-sense-of-the-folk-rnn-v2-model-part-12 ). The resulting folk-rnn models are surprisingly successful, even though they have on the order of 1000 times fewer parameters than GPT-2 (https://www.youtube.com/channel/UC7wzmG64y2IbTUeWji_qKhA , https://soundcloud.com/oconaillfamilyandfriends ). From what I have seen, folkrnn models produce transcriptions of about the same quality as yours, for both Irish and Swedish dance music. (Do you know about https://folkrnn.org/ and https://themachinefolksession.org/ ?) I can see what you mean by saving effort, the spirit of push-button systems, etc., but trying to stay true to ABC is no less an arbitrary decision. It of course depends on what you are trying to do. So, what are you trying to do?

I'm not sure why titles repeat.

In the training data, which comes from thesession.org, tunes appear with settings, so one might have several settings of Connaughtman's Rambles one after another. That could be why.

Overall the main problem with the samples appears to be the K field, where abc2midi complains the key is invalid for something like a third of them.

There are three versions of our data:

  1. A cleaned-up version of the dataset from thesession.org. We used this dataset to create our v1 model (which was just char-rnn).
  2. A tokenized version: we transposed all tunes to have a root of C and removed all titles. This created our v2 model (which is implemented here: https://folkrnn.org).
  3. A further tokenized version, where we transpose all tunes to have a root of C or C#, remove titles, make all pitches explicit, and change the K: field to a mode only.

If a generated tune has a lot of sharps, just change K:min to K:C#min, etc.; if it does not have a lot of sharps, change K:maj to K:Cmaj, etc. (A sketch of this post-processing follows.)
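Schematically, that post-processing might look like the following (the numeric threshold standing in for "a lot of sharps" is arbitrary; adjust to taste):

```python
import re

def fix_key(tune: str, sharp_threshold: float = 0.25) -> str:
    """Rewrite a mode-only K: field (K:maj, K:min, ...) from the third dataset
    into a concrete key, choosing C# if the tune uses 'a lot of' sharps."""
    body = "\n".join(l for l in tune.splitlines() if not re.match(r"^[A-Za-z]:", l))
    sharps = body.count("^")                    # explicit sharp accidentals
    notes = len(re.findall(r"[A-Ga-g]", body))  # rough note count
    root = "C#" if notes and sharps / notes > sharp_threshold else "C"
    return re.sub(r"^K:\s*(maj|min|mix|dor)\s*$",
                  lambda m: f"K:{root}{m.group(1)}",
                  tune, flags=re.MULTILINE)
```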

Have you trained your model on a single concatenation of all these datafiles?

Do you have your models online somewhere to try?

@boblsturm
Collaborator

I posted a blog about some of these tunes! https://highnoongmt.wordpress.com/2019/10/28/machine-folk-from-gpt-2/

@gwern
Author

gwern commented Oct 28, 2019

I haven't put the models up so far because the work isn't really done. For example, the 1k sample from above had a loss of 0.46, but it turns out the loss can go as low as 0.09 on just the original corpus (minus the other transformed versions) before my simple plagiarism checks using grep begin turning up hits of ABC from the original corpus, indicating memorization of more than just titles. (Where more 'musical'-level plagiarism begins, I couldn't say.) When I went back to modify the corpus further to make it more acceptable to abc2midi and trained it some more, I discovered I had simply given up too early in training. After I got down to 0.09, I retrained on the concatenated corpus down to 0.29, so the loss was roughly halved. The result is that the syntax seems much more correct, most samples now pass abc2midi, and the musical quality seems somewhat better to me. Anyway:

So that's where the max-likelihood model currently is.

I'm still struggling with the RL tuning. My hacks worked fine for poetry generation, but led to really bad divergence in every training run for the currently-final 117M model I trained, despite my rating 800+ pairs of samples. The average reward would get worse every iteration, and it would degenerate, until sometimes it discovered a highly repetitive sequence like 'X. X. X.' which somehow earned a relatively high reward from the reward model, and would continue with that...

My current theory as to what is going on is that the OpenAI RL tuning code is framed as a conditional generation task: a prompt -> response. I've been using it for unconditional generation by hacking the config to change one of the tasks to poetry, expecting the model to simply ignore the random English words being used to prompt it. This is OK for the poetry model, because it just ignores the prompts or works off of them (poetry is robust like that), but I think what happens with the ABC 117M reward model is that it is so finely tuned to ABC notation that when it sees a string with a random English word in it, it knows it's invalid ABC and penalizes it harshly; so every generated sample looks bad to it, and this destroys the training dynamics. What I need to do is somehow eliminate the prompt or make it innocuous, like zeroing it out with spaces, so all the generated samples can be valid ABC and the reward model can start producing a sane training signal for the generator... I haven't quite done that, because all of the rating of the poems and music samples has really exhausted me. I switched back to working on the poems while I meditated on what was going wrong with the music, since adding more ratings clearly was not fixing things.

(The poetry isn't suffering any weird problems like that, but the improvements from RL training are thus far small, and I suspect poetry is intrinsically much more difficult than music and may simply require way more ratings to train a reward model good enough to make a dramatic difference.)

In the training data, which comes from thesession.org, tunes appear with settings, so one might have several settings of Connaughtman's Rambles one after another. That could be why.

Ah. Now I see what you mean. Yes, that would explain it. I think that's OK to leave in place? I thought it was some sort of error by the model, but since they are all settings or 'variants' in the original (right?), then it'd be interesting to see the model also generate multiple 'variants' on a theme.

The resulting folk-rnn models are surprisingly successful, even though they have on the order of 1000 times fewer parameters than GPT-2 (https://www.youtube.com/channel/UC7wzmG64y2IbTUeWji_qKhA , https://soundcloud.com/oconaillfamilyandfriends )

Sure. You can always get decent performance with small models, because the log-loss decreases roughly logarithmically with parameter count. And NNs are always highly overparameterized, so you should be able to cut down the trained GPT-2 by a factor of 10 or 100 without degrading quality too badly. My belief is that even a 117M is more than big enough and expressive enough to generate great music, and the problems are architectural or training-related (optimizing for the wrong thing).

I did hope that starting with OA's GPT-2-117M would make the English titles more interesting because the original model knows all sorts of names and objects, but it seems once you've trained it far enough to generate good ABC, it's just largely memorized the titles. Too small a dataset to benefit from the pretraining. Oh well.

Have you trained your model on a single concatenation of all these datafiles?

Yes.

So, what are you trying to do?

Oh, my only real goal here was to see if switching to the RL setting could yield a qualitative improvement over likelihood training using the same data as the starting point, because RL seems to me philosophically the right way to approach the problem, while likelihood training is fundamentally optimizing for the wrong thing. Aside from that, I am mildly curious how much a Transformer can improve over an older RNN.

@gwern
Author

gwern commented Nov 1, 2019

To update this: I figured out the problem which was causing divergence. It turned out not to be the prompt issue, but once I fixed that, the real problem became more obvious. The OA code has a sort of heuristic where it looks for a particular character/BPE as a validity check, and ignores the score in favor of a fixed penalty if the check fails; the check they had was to search for '.', which they interpret as indicating a valid English sentence. Unfortunately, periods show up very infrequently in ABC music... So every sample was being given a heavy penalty and its quality was being ignored. Once I removed that check entirely, the RL tuning code runs just fine.
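Schematically, the offending behavior was roughly the following (my paraphrase, not OA's actual code; the penalty value is just a stand-in):

```python
FIXED_PENALTY = -1.0  # stand-in value; the real penalty is whatever the config sets

def passes_validity_check(sample: str) -> bool:
    # The heuristic in question: a 'valid' completion is assumed to contain a period,
    # on the theory that real English sentences do.
    return "." in sample

def effective_reward(sample: str, reward_model) -> float:
    if not passes_validity_check(sample):
        # ABC transcriptions almost never contain '.', so essentially every sample
        # was landing here and the reward model's score never mattered.
        return FIXED_PENALTY
    return reward_model(sample)
```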

I've begun the iterative loop and so far have done ~3 loops and have rated ~2136 pairs of samples.

My first impressions are that the RL finetuning is improving the overall syntactic validity of the ABC samples, and has largely eliminated the repetition failure mode (like in the poetry samples). I think it has also made them more 'musical' but it's hard for me to say for sure. (For example, there seem to be much more in the way of 'endings' than I usually see with char-RNN stuff, but I'm not sure how much the RL finetuning is helping there versus the Transformer seeing the entire ABC sample.)


Incidentally, I noticed some more problems with the original data. There are English comments dumped in at random in various places. For example, if you grep for 'should' you can find lines like

End of second last line should read: |{cd}e3dF2|G6|G6:|
The last bar of the B part should be: |afeg fedB||
Martin Connolly says that the 1st bar should read |AFD ECA,| instead of |AFD EDA,|. I've had another listen to my recording of Brendan Bulger and he's definitely playing a D in his version.
I think something is wrong with ABC. It should read |A3G E2DE| or |A3G E2D2|.
This transcription is inaccurate. There should be a 2nd time ending on the B-part. Last 2 bars are |2 cefe ~a3f|ecBd cAA2||
peakfiddler, a small change in your ABC: |"G"1 should be |1"G", and |"G"2 should be |2"G", just so the repeat counting is next to the bar signs.
Last 2 measures should probably be  |efg  edB| A3 A2z:|
So, ceolachan, my  |Acd/e/|f  should in fact have been  |Ace/f/|g.F|D2F|FGF|E2E|GAG|CDE|A,B,C|DCD|A,B,C|D2F|FGF|E2E|GAG|CEG|ADE|F3|F2:|D|D2G|BGB,/C/|D2F|AFD|D2C|B,2A,|G,B,G,|A,B,C|D3|=C3|B,3|^G,3|A,GF|EDA|F3|F2:|
After listening to CD I reckon last bar should read, | ed cA B2 :||
Bars 3 & 4 (A music) should read;  | FD/F/ EC | D/E/D/C/ B,/A,/G, |Bars 7 & 8 (A music) should read;  | FD/F/ EC | DB, C2 :|Bars 10 & 11 (B music) should read;  | ED/C/ B,>C | D/C/D/E/ F/E/D/C/ | Bars 14 & 15 (B music) should read;  | ED/C/ B,>C | DD/E/ F/E/D/C/ |
I had a note missing on the second ending of the B part. Should be |2GFGF E3B||

These don't look like ABC to me, and they make abc2midi unhappy when they pop up in GPT-2-generated outputs.
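A crude heuristic would catch most of them, something like this sketch (my own guess at a filter, not your cleaning script):

```python
import re
import sys

ABC_HEADER = re.compile(r"^[A-Za-z]:")        # K:, M:, T:, etc.
ENGLISH_WORD = re.compile(r"\b[a-z]{4,}\b")   # rough proxy for prose

def looks_like_prose(line: str) -> bool:
    if ABC_HEADER.match(line):                # keep header fields (titles included)
        return False
    return len(ENGLISH_WORD.findall(line)) >= 3  # several long lowercase words = a comment

# Usage: python find_comments.py <datafile>; prints the suspect lines for review.
with open(sys.argv[1], errors="ignore") as f:
    for line in f:
        if looks_like_prose(line):
            print(line.rstrip())
```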

@boblsturm
Collaborator

Great! Please post samples once the model has finished training and I can take a closer look.

The "cleaned" data apparently still has a ways to go! Contributors to thesession.org have left comments on tune settings, and my cleaning script has missed many of them. You are seeing those. Hopefully the parsed datasets don't have these.

@gwern
Author

gwern commented Dec 13, 2019

So an update. I've partially written up all the work to date: https://www.gwern.net/GPT-2-music

The model has enjoyed 2 major updates since you listened to the samples:

  1. We found that there's some sort of subtle bug in how the GPT-2 text encoding handles spaces, which makes it difficult or impossible to generate completely syntactically valid ABC music and appears to damage learning as well; we bypassed it by simply eliminating spaces from the ABC, which apparently is legal (see the sketch after this list).
  2. We scraped ABCnotation.com to add their 157,240 pieces to the dataset and retrained on that, which took a good week. We had to remove a lot of metadata, because many pieces include quite a lot of it.
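The space-stripping itself (point 1) is trivial; roughly (a sketch; whether to also strip header fields like T: is a detail I'm glossing over):

```python
import re
import sys

HEADER = re.compile(r"^[A-Za-z]:")

def strip_spaces(abc: str) -> str:
    out = []
    for line in abc.splitlines():
        if HEADER.match(line):
            out.append(line)                   # leave header fields (T:, K:, ...) alone
        else:
            out.append(line.replace(" ", ""))  # spaces in the tune body are optional ABC
    return "\n".join(out) + "\n"

if __name__ == "__main__":
    sys.stdout.write(strip_spaces(sys.stdin.read()))
```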

The RL preference learning is still ongoing, but I decided to leave that for another writeup, and cover just the baseline ABC GPT-2.

You mention in the blog post you were going to compare it to folk-RNN pieces, but I hope you'll redo your assessment with the latest GPT-2 model samples before you do that!

@gwern
Author

gwern commented Jan 28, 2020

To update: I regret to report that the preference-learning was a failure. However, there is good news in that we are now experimenting with training GPT-2 with context windows up to 30,000 (that is not a typo), and with generating the ABC-equivalents of MIDI files.

@boblsturm
Collaborator

Very interesting experiments! Might you be interested in participating in this challenge: https://www.kth.se/en/eecs/om-oss/konferenser-och-event/aimusic2020/ai-music-generation-challenge-2020-1.946478

@joebowbeer

OT: Given that the jig is a dance form, I think danceability should be added to the rating system.

@boblsturm
Collaborator

It is a part of stage 1, point b, and stage 2, point a(ii).

@gwern
Author

gwern commented Feb 6, 2020

Might you be interested in participating in this challenge

I think I would have a hard time understanding and passing all the constraints: that's a lot of conditions to backfit onto generation. You could probably do reasonably well by training on all available ABC if it came with labels identifying double jigs and other forms to condition on during generation, and maybe tacking on a filtering step using a reward model. I doubt I'll do it, because once we finish with the 30k-context-window GPT-2 (it should be long since done, but our TPUs got preempted by a massive surge in TPU demand) we'll be moving on to scaling up StyleGAN 2 to try to generate 1024px anime images.

@boblsturm
Collaborator

Yes, "solving" the task will require thinking about the problem instead of just throwing data at it. Especially so since the dataset for comparison only has 365 double jigs. There will be a variety of baseline models for comparison, including folkrnn.

@gwern
Author

gwern commented Apr 25, 2020

We've finished and written up the MIDI generation, with lots of samples, at https://www.gwern.net/GPT-2-music#generating-midi-with-10k30k-context-windows. I'm interested to hear what anyone makes of our post-October-2019 results.

(StyleGAN 2 has proven, as I suspected, unable to generate large-scale complex images due to its architecture, although we got further than most people would have predicted, so we've moved on to BigGAN and have quite nice results already; 512px is looking feasible, although you guys probably don't care about generating anime etc.)

@boblsturm
Collaborator

Thanks for the update Gwern. I listened to several of the examples, but didn't hear any that resemble Irish traditional music. IMO, the two orders of magnitude increase in the model size from folk-rnn models is not showing any advantages. I did enjoy some of your generated examples though. This one reminds me of Steve Martland: https://www.gwern.net/docs/ai/music/2020-04-15-gpt2-midi-thesessionsabc-6473931123.mp3.

That's a large amount of money to have spent training your model! Why are you doing this work?

@gwern
Author

gwern commented Apr 26, 2020

If you wanted to reliably get out just Irish music, you'd probably need to finetune it. I didn't try that, because I thought the original model covered that task pretty well and I was more interested in generating a broader variety of music. The Session is, all things considered, a very restricted niche of music, so it's not too surprising that very small models can do a decent job on it.

Generating MIDI in general is a far broader and harder task - you say that the increased model size isn't helping, but nevertheless, it underfit the MIDI dataset substantially, to my surprise. If I had known we'd be able to hit a loss of only 0.2, I would've started with a larger model than just GPT-2-117M, and accepted any necessary limits on the context window. Since likelihood loss seems to track quality so closely & the final few increments of loss make a big difference perceptually, it's possible we could've done a lot better.

As for why: well, no one else is trying, and I could, and it's interesting. It works better than most people would've expected, which is good enough for me.

@boblsturm
Collaborator

Thanks. There are several researchers achieving compelling results with ML modeling and generation of different styles of music, e.g., https://magenta.tensorflow.org/music-transformer, https://chrisdonahue.com/, http://dadabots.com/, https://musicai.citi.sinica.edu.tw/, https://csl.sony.fr/team/gaetan-hadjeres/, https://www.york.ac.uk/music/staff/academic/tom-collins/, not to mention a variety of start-ups like AIVA, Melodrive, and Jukedeck (now part of TikTok). Many of these groups are also studying their systems in the context of music creation, to see how useful they actually are, i.e., how they can contribute to creating music. It is a simple matter to throw a bunch of data at ML wizardry. Have you thought about going further, to see what your model can contribute by working with people who create music? If I find some time, I will download your models and interact with them in the same way I do with folkrnn when creating music.
