Adding a Judge
Up until now, I’ve been developing my music generator by making a change and then listening to a bunch of samples to decide whether the change was good or not. This process was helpful initially (and occasionally hilarious, when the generations were really bad). I definitely have a better intuition for the problem now, after listening to many different kinds of generations. However, it’s obviously neither a scientific nor a scalable approach.
This week I worked on creating a separate music critic network that would judge the samples and only give me the best one. Also, I’m working on a more general metric (a musical Inception score) to judge my various music generator networks.
Music Critic Network
The music samples from my generator are all decent, but as I listen to them, some are clearly better than others. I wondered this week if I could improve the quality by having a setup that generated 100 samples, and then a critic that would pick the best one.
I trained a simple music critic network to judge whether a sample piece is real or generated. This is very much akin to how the discriminator in a GAN works, but I am not feeding the score back to train the generator; I’m only using it to select the “best” piece.
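As a rough sketch of this selection step: generate a batch of candidates, score each with the critic, and keep the one the critic rates as most likely to be real. The names `pick_best` and `toy_critic` are made up for illustration, and the toy critic is only a stand-in so the example runs on its own; the real critic is a trained network.

```python
from typing import Callable, List, Sequence, Tuple


def pick_best(
    samples: Sequence[List[str]],
    critic_score: Callable[[List[str]], float],
) -> Tuple[int, float]:
    """Return (index, score) of the sample with the highest critic score.

    critic_score is assumed to return a higher number for pieces that
    look more like real music.
    """
    if not samples:
        raise ValueError("need at least one sample to choose from")
    scores = [critic_score(sample) for sample in samples]
    best = max(range(len(samples)), key=scores.__getitem__)
    return best, scores[best]


def toy_critic(sample: List[str]) -> float:
    """Hypothetical stand-in critic: fraction of distinct tokens in the sample."""
    return len(set(sample)) / len(sample)


candidates = [
    ["a", "a", "a", "a"],
    ["a", "b", "c", "d"],
    ["a", "b", "a", "b"],
]
best_index, best_score = pick_best(candidates, toy_critic)
```

Because the score is never fed back into the generator, this only filters the outputs; the generator itself is unchanged.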
Initially, I thought I would take a copy of my generator, take off the top layer, add a classifier layer instead, and then fine tune that to judge whether pieces were real or not. For the fake pieces, I generated many samples from current and past generators, and divided those all into train/fake and test/fake folders. For real samples, I chose 2000-note samples from real training data at random (I again divided samples into train/real and test/real folders).
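Drawing the real examples could look something like the sketch below, which cuts a random fixed-length window out of a tokenized piece. The function name `random_clip` and the list-of-tokens representation are assumptions for illustration, not the code actually used here.

```python
import random
from typing import List, Optional


def random_clip(
    tokens: List[int],
    length: int = 2000,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return a random contiguous window of `length` tokens from one piece."""
    if rng is None:
        rng = random.Random()
    if len(tokens) < length:
        raise ValueError("piece is shorter than the requested clip length")
    # Every valid start position is equally likely.
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]


piece = list(range(5000))  # placeholder for an encoded real piece
clip = random_clip(piece, length=2000, rng=random.Random(0))
```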
I don’t yet know why, but copying the weights from my generator to my critic and fine-tuning from there actually led to much lower accuracy than training a critic with randomly initialized weights. I initially thought that the critic would benefit from getting the pretrained embedding weights and early LSTM layer weights from the generator (indeed, this is why we use word2vec, or more recently, why people use pretrained language models). It seems that these pretrained weights are pushing the critic into a poor local minimum of the loss, and even though a better minimum exists, training isn’t able to reach it. (I’m not surprised that it could fail to reach this point when I only fine-tune the final layer, but even when I allow all layers to be trainable, the model is still stuck.)
Switching to Parallel Generation
Initially, I just set up a simple loop around my music generator and had it create 50 or 100 samples in a row. I then realized, of course, that this was unnecessarily slow, and that it would be much better to generate the samples in parallel. Since the generator was already set up to train on mini-batches, it has no trouble creating a batch of songs rather than an individual one. This was a fairly straightforward matter of vectorizing the prompt to the generator (so that it is fed 16 or 32 different song samples in parallel). At each time step I then sample a batch of outputs, which I feed back into the model. At each output step, I randomly decide whether to take the most likely output or to sample an output from the model’s predicted probabilities. I use a batch-sized vector of random numbers to make this decision, so that at a given time step, some generations select the most likely output and others select a more random option.
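The per-step sampling decision can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `sample_next_tokens` is a made-up name, `probs` stands for the model’s predicted distribution for each song in the batch, and `greedy_prob` (the chance that a given song takes the most likely token) is a placeholder parameter whose actual value isn’t given here.

```python
import numpy as np


def sample_next_tokens(
    probs: np.ndarray,
    greedy_prob: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Pick one next token per song in the batch.

    probs: array of shape (batch, vocab); each row sums to 1.
    Returns an integer array of shape (batch,).
    """
    batch, vocab = probs.shape

    # Option 1: the most likely token for every song.
    greedy = probs.argmax(axis=1)

    # Option 2: a token drawn from each row's distribution (inverse CDF).
    cdf = np.cumsum(probs, axis=1)
    u = rng.random(batch)[:, None]
    sampled = (u >= cdf).sum(axis=1)
    sampled = np.minimum(sampled, vocab - 1)  # guard against rounding in the CDF

    # One random number per song decides which option that song uses.
    use_greedy = rng.random(batch) < greedy_prob
    return np.where(use_greedy, greedy, sampled)


rng = np.random.default_rng(0)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])
next_tokens = sample_next_tokens(probs, greedy_prob=0.5, rng=rng)
```

In a full generation loop, `next_tokens` would be fed back to the model as the input for the next time step of every song in the batch.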
Adapting Music Encoding
Finally, I’ve continued to refine the way I encode the MIDI files. Previously, I was consolidating the timing markers (so that if I had ‘wait’ ‘wait’ next to each other, I would turn that into ‘wait2’). I realized this was making my vocab unnecessarily large. In particular, if there were pauses between movements in my training set, it was meaningless to ask my model to try to predict ‘wait223’ instead of ‘wait224’. I changed this so it consolidates a maximum of 12 at a time. (So to encode ‘wait24’, I’d now have it write ‘wait12’ ‘wait12’.)
This week I’ll continue to explore ways to tweak the encoding. I’m also curious about reordering the piano and violin notes. Right now, for each time step, the model first predicts all the piano notes, then lists the violin notes (if any), and then outputs a ‘wait’. I wonder if putting the violin notes first for each step might change the generation. I’d like to explore this for two reasons. First, it seems an easier task to learn to predict a violin note after a ‘wait’ than to learn to predict a violin note after some piano note. I hypothesize that this will lead to more violin notes, even when far from the known training prompts. Second, I wonder if this could lead to more varied melodies while keeping sensible harmonies, particularly if I can find a way to select only the violin notes with more randomness, and then choose the most likely piano notes given that violin choice.
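The capped ‘wait’ consolidation described above can be sketched like this. The function name `consolidate_waits` and the note tokens `p60` and `v72` are made up for illustration, and writing a lone ‘wait’ as ‘wait1’ is an assumption; the actual encoder may leave it as ‘wait’.

```python
from typing import List


def consolidate_waits(tokens: List[str], max_wait: int = 12) -> List[str]:
    """Merge runs of 'wait' tokens into 'waitN' tokens with N <= max_wait."""
    out: List[str] = []
    run = 0  # number of consecutive 'wait' tokens seen so far

    def flush() -> None:
        # Emit the pending run in chunks of at most max_wait.
        nonlocal run
        while run > 0:
            step = min(run, max_wait)
            out.append(f"wait{step}")
            run -= step

    for tok in tokens:
        if tok == "wait":
            run += 1
        else:
            flush()
            out.append(tok)
    flush()
    return out


encoded = consolidate_waits(["p60"] + ["wait"] * 24 + ["v72"])
# encoded == ["p60", "wait12", "wait12", "v72"]
```

With the cap at 12, the vocabulary holds at most 12 distinct wait tokens no matter how long the pauses in the training data are.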
In addition to the Music Critic Network, I’m looking this week to add a critic that will guess the composer, given a 2000-step sample of music. I’ll train it on clips from composers already in my music generator training set. My mentor Karthik has been pushing me to come up with some way of scoring the generator models. This way I’ll be able to train several different music generators with slightly different architectures and hyperparameters, and I’ll have a fast and concrete way of judging which generators are better.