The program is starting to be able to generate music that is both new and musical. Last week my problem was that the LSTM could overfit to Mozart and reproduce it exactly, but it had a hard time generalizing: it struggled to create anything that was good without being an exact copy. I naively thought it would be straightforward to throw in more training data, add more dropout, and experiment with sampling the generated output in different ways. This worked somewhat, but there was still the fundamental problem that if the network saw a progression of time steps that was very far from its training experience, it panicked and often just output all rests.
I think there is still room to push on this model, and I plan to return to it soon. However, in the meantime, I tried switching to a character-based approach. I refer to my previous word-like attempt as “chord-wise generation” and this new approach as “note-wise generation”.
Currently, I’m dividing each quarter note into 12 “musical time steps”. I chose this division so that the model would be able to generate both triplets (splitting the quarter into thirds) and sixteenths (splitting it into fourths). It also makes trills sound better. Previously, when going back and forth quickly between two notes, I was simply assigning both notes to the same musical time step, so it sounded like the notes were played simultaneously. Now I’m able to capture the alternation. The tradeoff is that this makes the timing prediction more complicated, and if the model doesn’t learn well, it generates odd-sounding rhythms.
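To make the arithmetic concrete, here is a minimal sketch of mapping a note's position (measured in quarter notes) onto a 12-step grid. The function name `to_time_step` and the use of `Fraction` are my own choices for illustration, not the code from this project.

```python
from fractions import Fraction

STEPS_PER_QUARTER = 12  # 12 musical time steps per quarter note, as described above


def to_time_step(offset_in_quarters):
    """Map an offset in quarter notes to the nearest time step index (hypothetical helper)."""
    return round(Fraction(offset_in_quarters) * STEPS_PER_QUARTER)


# Triplets split the quarter into thirds, sixteenths into fourths.
# Both land on whole steps because 12 is divisible by 3 and by 4.
triplet_steps = [to_time_step(Fraction(k, 3)) for k in range(3)]
sixteenth_steps = [to_time_step(Fraction(k, 4)) for k in range(4)]

print(triplet_steps)    # steps for the three notes of a triplet
print(sixteenth_steps)  # steps for the four sixteenths
```

Any grid size divisible by both 3 and 4 would give exact positions for these two rhythms; 12 is the smallest such number.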
In generating the new “compositions”, the biggest factor is how I choose to sample from the LSTM’s output at each time step. I have tried several different options (and I currently have the program set so that I can easily toggle between any of these):
- Choose the LSTM’s best guess at that time step
- Choose randomly from the top n guesses. In practice, I usually set n to 2 or 3, choosing the most likely guess 95% of the time and a random guess the other 5% of the time.
- Choose randomly, weighted by the LSTM’s output probabilities
- Choose randomly, weighted by the LSTM’s output probabilities, but truncated so that only the top 4 or 5 guesses are considered
Currently, I’m finding that the last option is usually the best. It breaks down, however, if I train a model for too long: the output guesses become too definite (the probability for guess #1 is so much higher than for guess #2 that the model never selects #2 randomly). At that point, the forced randomness of the second option generally sounds more interesting.
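The four sampling options can be sketched roughly as follows. This is an illustration using only Python's standard library, not the project's actual code: the function names are hypothetical, `probs` stands in for the LSTM's output probabilities at one time step, and for the second option I am assuming that "the random guess" means a uniform pick among the other top-n guesses.

```python
import random


def top_indices(probs, n):
    """Indices of the n largest probabilities, most likely first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]


def sample_greedy(probs):
    # Option 1: always take the single most likely guess.
    return top_indices(probs, 1)[0]


def sample_top_n_forced(probs, n=3, p_best=0.95, rng=random):
    # Option 2: take the best guess with probability p_best; otherwise pick
    # uniformly among the remaining top-n guesses (an assumed interpretation).
    top = top_indices(probs, n)
    if len(top) == 1 or rng.random() < p_best:
        return top[0]
    return rng.choice(top[1:])


def sample_weighted(probs, rng=random):
    # Option 3: draw from the full output distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_truncated(probs, k=5, rng=random):
    # Option 4: draw from the distribution restricted to the top k guesses.
    top = top_indices(probs, k)
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


# A made-up output distribution over five possible next tokens.
probs = [0.05, 0.60, 0.20, 0.10, 0.05]
print(sample_greedy(probs))
print(sample_truncated(probs, k=4, rng=random.Random(0)))
```

The sketch also shows why the truncated option stalls on an over-trained model: if `probs` is something like `[1.0, 0.0, 0.0]`, weighted draws always return the first guess, while `sample_top_n_forced` still picks another guess a fixed fraction of the time regardless of how peaked the distribution is.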
I also spent some time this week cleaning up the way I convert MIDI files into training data for the LSTM. I’m still using MIT’s music21, but I spent a while learning the details of both the MIDI representation and the music21 stream representation.
Additionally, I wrote a program to scrape MIDI recordings from the Classical Archives, so that I’d be able to train on significantly more data.
Dividing the beat into 12 also seems to push the LSTM to generate music that sounds very “busy”. This is partially a reflection of the training set, but I suspect the model is also having a hard time learning to output longer rests. I plan to explore this during the upcoming week, to see if I can make it easier for the model to learn calmer pieces.
I’m also planning to try adding an additional instrument (probably violin or cello) to the generation. I have some ideas for how to tackle this (hopefully including pretraining on the large corpus of solo piano repertoire), but I’m still experimenting.