I also spent two days this week working on generating musical patterns. I initially thought I would just take the character-prediction RNNs I worked on earlier in the week and switch to predicting notes instead of characters. Instead, I decided it would be much more fun to try creating harmonies as well. With this in mind, rather than asking the model to predict a single next note at each timestep, I let it say yes or no to each of the 88 piano notes independently. So instead of 88 possible states at each timestep, we now have 2^88 different possibilities.
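To make the representation concrete, here is a minimal sketch (my own illustration, not the code from this project) of the multi-hot encoding described above: each timestep is a binary vector over the 88 piano keys, so any combination of notes can sound at once.

```python
import numpy as np

N_KEYS = 88  # keys on a standard piano

def timestep_vector(active_keys):
    """Encode a set of active key indices (0-87) as a binary vector.

    With one bit per key, a single timestep has 2**88 possible states,
    versus only 88 when predicting one note at a time.
    """
    v = np.zeros(N_KEYS, dtype=np.int8)
    v[list(active_keys)] = 1
    return v

# A C-major triad: with key index 0 = A0 (the lowest piano key),
# C4 is index 39, E4 is 43, and G4 is 46.
c_major = timestep_vector([39, 43, 46])
```

The key-index convention (0 = A0) is an assumption for the example; any fixed mapping from index to pitch works the same way.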
I began with MIDI files of short chord progressions, which I found online here. Since the dataset was small, I augmented it by shifting the progressions slightly in time and by modulating them up or down a few steps.
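The two augmentations can be sketched as operations on a piano roll. This is a hedged illustration under my own assumptions (a timesteps × 88 binary layout; the function names are mine, not the project's):

```python
import numpy as np

def transpose(roll, semitones):
    """Shift a (T, 88) binary piano roll up or down by `semitones`.

    Notes shifted past either edge of the keyboard are dropped.
    """
    out = np.zeros_like(roll)
    if semitones >= 0:
        out[:, semitones:] = roll[:, :88 - semitones]
    else:
        out[:, :semitones] = roll[:, -semitones:]
    return out

def time_shift(roll, steps):
    """Rotate the progression by `steps` timesteps."""
    return np.roll(roll, steps, axis=0)

roll = np.zeros((4, 88), dtype=np.int8)
roll[0, [39, 43, 46]] = 1       # C-major triad on the first beat
up_whole_step = transpose(roll, 2)   # now a D-major triad
shifted = time_shift(roll, 1)        # triad lands on the second beat
```

Each original progression yields several augmented copies this way, which helps a lot when the starting dataset is tiny.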
I then used a GAN setup to train a model that would generate chord-progression-like patterns, given noise as input. (GANs pit two neural nets against each other: a generator against a critic. Each pushes the other to get better, like a skilled art forger making better and better works to try to fool a great detective.)
In the musical case, the adversary classifies whether generated data looks like a real chord progression or not. The generative model is pushed to make increasingly good fakes, and the adversary gets increasingly good at telling a fake from a real pattern.
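The alternating training loop is the heart of the setup. Here is a deliberately tiny, self-contained sketch of it on 1-D data rather than chord progressions (entirely my simplification, not the project's code): a linear generator learns to imitate samples from N(4, 1), while a logistic critic learns to separate real from fake.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Generator G(z) = a*z + b, critic D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0   # generator parameters
w, c = 0.1, 0.0   # critic parameters
lr = 0.05

for step in range(2000):
    z = rng.normal(size=32)               # noise fed to the generator
    fake = a * z + b
    real = rng.normal(loc=4.0, size=32)   # the "real" distribution

    # Critic update: push D(real) toward 1 and D(fake) toward 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        p = sigmoid(w * x + c)
        grad = p - label                  # d(logistic loss)/d(w*x + c)
        w -= lr * np.mean(grad * x)
        c -= lr * np.mean(grad)

    # Generator update: push D(fake) toward 1, backpropagating
    # through the critic's weights.
    p = sigmoid(w * fake + c)
    grad = (p - 1.0) * w
    a -= lr * np.mean(grad * z)
    b -= lr * np.mean(grad)

# After training, generated samples should cluster near the real mean of 4.
fake_mean = float(np.mean(a * rng.normal(size=1000) + b))
```

The same alternating structure carries over to the musical case; only the data and the network architectures change.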
In the audio samples above, the first track is an example of a random stream. The neural net outputs an array of numbers, which I convert with MIT’s music21 into notes, then musical streams, and finally MIDI files. For the sake of this blog post, I used TiMidity++ to convert the MIDI files to wav (the fuzziness in the sound of the individual notes comes from that conversion).
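Before the music21 step, each output frame has to be thresholded and mapped to pitches. A sketch of that piece alone, assuming the key-index layout from earlier (index 0 = A0, which is MIDI pitch 21; the threshold value is my own stand-in):

```python
import numpy as np

LOWEST_MIDI = 21  # A0, the bottom note of an 88-key piano

def frame_to_pitches(frame, threshold=0.5):
    """Turn one 88-float network output frame into MIDI pitch numbers."""
    return [LOWEST_MIDI + k for k in range(88) if frame[k] > threshold]

frame = np.zeros(88)
frame[[39, 43, 46]] = 0.9        # the net says "yes" to C4, E4, G4
pitches = frame_to_pitches(frame)  # [60, 64, 67]
```

From there, each list of pitches becomes a music21 chord or note, the frames are appended to a stream, and the stream is written out as MIDI.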
Initially, I tried basic fully-connected networks for both the generator and the adversary. The second track is an example of music generated after training both networks. Next, I changed the adversary network so that its first layer is a convolutional step: the kernel covers all 88 pitches, but only 1 timestep. This pushes the network to encode each timestep (into 32 output channels) before looking at the sequence across time. Tracks 3 and 4 are two results of generation against this adversary. Here the generator learns that it needs to create recognizable chords in order to convince the adversary.
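Because the kernel spans all 88 pitches but only a single timestep, this first layer reduces to one shared 88 → 32 linear map applied independently at every timestep. A small sketch of that equivalence (random stand-in weights; the actual model's weights and framework are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(88, 32))  # kernel: 88 pitches -> 32 channels
bias = np.zeros(32)

def encode_timesteps(roll):
    """(T, 88) piano roll -> (T, 32) per-timestep encodings.

    Equivalent to a convolution with kernel size (88 pitches, 1 timestep),
    followed by a ReLU.
    """
    return np.maximum(roll @ W + bias, 0.0)

roll = rng.integers(0, 2, size=(16, 88)).astype(float)
codes = encode_timesteps(roll)   # shape (16, 32)
```

The rest of the adversary then operates on this (T, 32) sequence of per-timestep codes, which is what pushes the generator toward producing coherent chords at each step.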
It’s interesting that music generation is far less robust to noise than image generation. Where we might not care about a blurry section or wrong pixel in an image, we very clearly hear occasional wrong notes.
Next, I’ll continue working on music generation, using data from Classical Archives to try training a GAN on the music of Brahms or Mozart.
I’m also moving towards word generation, and planning to use a GAN to improve language generators, particularly focusing on medical text.