[Music]

Cool, thank you. Just so I know what level to speak at: raise your hands if you know who Bach is. Great. Raise your hand if you know what a neural network is. Oh, this is the perfect crowd, awesome. If you don't know, don't worry, I'm going to cover the very basics of both. So let's talk about Bach. I'm going to play you some music.

[Music]

Now, what you just heard is what's known as a chorale. There are four parts to it, soprano, alto, tenor, and bass, playing at the exact same time, and there's very regular phrasing structure, where you have the beginning of a phrase, the termination of a phrase, followed by the next phrase. Except that wasn't Bach. Rather, that was a computer algorithm called BachBot, and that was one sample of its outputs. If you don't believe me, it's on SoundCloud, it's called "sample one", go listen for yourself. So instead of talking about Bach today, I'm going to talk to you about BachBot. Hi, my name is Feynman, and it's a pleasure to be here in Amsterdam. Today we'll talk about automatic stylistic composition using long short-term memory.

A bit of background about myself: I'm currently a software engineer at Gigster, where I work on interesting automation problems around taking contracts, dividing them into subcontracts, and then freelancing them out. The work on BachBot was done as part of my master's thesis, which I did at the University of Cambridge with Microsoft Research Cambridge. In line with the track here, I do not have a PhD, and I can still do machine learning. This is a fact: you can do machine learning without a PhD.

For those of you who just want to know what's going to happen and then get out of here if it's not interesting, here is the executive summary. I'm going to talk to you about how to train, end to end, starting from dataset preparation all the way to model tuning and deployment, a deep recurrent neural network for music. This neural network is capable of polyphony, multiple simultaneous voices at the same time. It's capable of automatic composition, generating a composition completely from scratch, as well as harmonization: given some fixed parts, such as the soprano line of the melody, generate the remaining supporting parts. This model learns music theory without being told to do so, providing empirical validation of what music theorists have been using for centuries. And finally, it's evaluated on an online musical Turing test, where, out of 1,700 participants, only nine percent are able to distinguish actual Bach from BachBot.

When I set off on this research, there were three primary goals. The first question I wanted to answer was: what is the frontier of computational creativity? Now, creativity is something we take to be innately human, innately special in some sense; computers ought not to be able to replicate this about us. Is this actually true? Can we have computers generate art that is convincingly human? The second question I wanted to answer was: how much does deep learning impact automatic music composition? Now, automatic music composition is a special field. It has been dominated by symbolic methods, which utilize things like formal grammars or context-free grammars, such as this parse tree. We saw connectionist methods in the early 1990s; however, they have fallen in popularity, and most recent systems have used symbolic methods. With the work here, I wanted to see whether the new advances in deep learning in the last ten years can be transferred over to this particular problem domain. And finally, the last question I wanted to look at is: how do we evaluate these generative models? We've seen, in the previous talk, a lot of models that generate art; we look at it, and as the author we say, oh, that's convincing, oh, that's beautiful. That might be a perfectly valid use case, but it's not sufficient for publication. To publish something, we need to establish a standardized benchmark, and we need to be able to evaluate all of our models against it, so we can objectively say which model is better than the other.

Now, if you're still here, I'm assuming you're interested. This is the outline. We'll start with a quick primer on music theory, giving you just the basic terminology you need to understand the remainder of this presentation. We'll talk about how to prepare a dataset of Bach chorales. We'll then give a primer on recurrent neural networks, which is the actual deep learning model architecture used to build BachBot. We'll talk about the BachBot model itself, the tips, tricks, and techniques that we used in order to train it, have it run successfully, and deploy it. And then we'll show the results. We'll show how this model is able to capture statistical regularities in Bach's musical style, and we'll provide, not prove, very convincing evidence that music theory does have empirical justification. Finally, I'll show the results of the musical Turing test, which was our proposed evaluation methodology for saying: yes, the task of automatically composing convincing Bach chorales is more closed than open of a problem as a result of BachBot. And if you're a hands-on type of learner, we've containerized the entire deployment, so if you go to my website here, I have a copy of the slides with all of these instructions. You run these eight lines of code and it runs this entire pipeline right here, where it takes the chorales, preprocesses them, puts them into a data store, trains the deep learning model, samples the deep learning model, and produces outputs that you can listen to.

Let's start with basic music theory. Now, when people think of music, this is usually what you think about: you've got these bar lines, you've got notes, and these notes are at different horizontal and vertical positions. Some of them have interesting ties, some of them have dots, and there's this interesting little weird hat-looking thing. We don't need all of this; we need three fundamental concepts. The first is pitch. Pitch is often described as how low or how high a note is, so if I play this, we can distinguish that some notes are lower and some notes are higher in frequency, and that corresponds to the vertical axis here: as the notes sound ascending, they appear ascending on the staff. The second attribute we need is duration, and this is really how long a note is. So this one note, these two notes, these four, and these eight all have equal total duration, but each is a halving of the previous, so if we take a listen...

The general intuition is that the more bars there are on the note stems, the faster the notes go. With just those two concepts, this is starting to make a little bit more sense: this right here is twice as fast as this note, we can see this note is higher than this note, and you can generalize this to the remainder of the piece. But there's still this funny hat-looking thing. We'll get to the hat in a second, but with pitch and duration we can rewrite the music like so. Rather than representing it using notes, which may be kind of cryptic, we show it here as a matrix, where on the x-axis we have time, so the duration, and on the y-axis we have pitch, how high or low in frequency that note is. What we've done is we've taken the symbolic representation of music and turned it into a digital, computable format that we can train models on.
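To make that matrix idea concrete, here is a tiny sketch, my own toy example rather than BachBot's exact encoding, of a piano roll with pitch on one axis and sixteenth-note time steps on the other:

```python
import numpy as np

# Toy piano roll: rows are MIDI pitches (0-127), columns are sixteenth-note time steps.
notes = [(60, 0, 4), (64, 4, 2), (67, 6, 2)]   # (midi_pitch, start_step, duration_in_steps)
roll = np.zeros((128, 8), dtype=np.uint8)
for midi_pitch, start, dur in notes:
    roll[midi_pitch, start:start + dur] = 1    # mark the cells where the note is sounding
```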

Back to the hat-looking thing: this is called a fermata, and Bach used it to denote the ends of phrases. We had originally set out on this research completely neglecting fermatas, and we found that the phrases generated by the model just kind of wandered; they never seemed to end, there was no sense of resolution or conclusion, and that was unrealistic. But by adding these fermatas, all of a sudden the model turned around and we suddenly found realistic phrasing structure. Cool, and that's all the music you need to know; the rest of it is machine learning.

Now, the biggest part of a machine learning engineer's job is preparing their datasets. This is a very painful task: you usually have to scour the internet or find some standardized dataset that you train and evaluate your models on, and usually these datasets have to be preprocessed and massaged into a format that's amenable for learning. For us it was no different. Bach's works, however, have fortunately been transcribed over the years into, excuse my German, the Bach-Werke-Verzeichnis, BWV, which is how I've been referring to this corpus. It contains all of the roughly 438 harmonizations of Bach chorales, and conveniently it is available through a software package called music21. This is a Python package that you can just pip install and then import, and now you have an iterator over a collection of music.
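For example, a minimal sketch (not the BachBot preprocessing code itself) of how music21 exposes the chorale corpus:

```python
from music21 import corpus

# Iterate over the bundled Bach chorale corpus; each item is a music21 Score.
for chorale in corpus.chorales.Iterator():
    part_names = [part.id for part in chorale.parts]   # e.g. ['Soprano', 'Alto', 'Tenor', 'Bass']
    print(chorale.metadata.title, part_names)
    break   # just show the first one
```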

The first preprocessing step we did is we took the original music here and we did two things: we transposed it, and then we quantized it in time. Now, you can notice the transposition by looking at these accidentals right here, these two little funny backwards-or-forwards b's, the flats, and then they're absent over here. Furthermore, that note has shifted up by half a line; that's a little hard to see, but it's happening. The reason why we did this is that we didn't want to learn key signature. Key signature is usually something decided by the author before the piece has even begun to be composed, and so key signature itself can be injected as a pre-processing step where we sample over all the keys Bach did use. So we removed key signatures from the equation through transposition, and I'll justify why that's an okay thing to do in the next slide. This first measure is a progression of five notes written in C major, and then what I did in the next measure is I just moved it up by five whole steps.

[Music]

So yes, the pitch did change; it's relatively higher, it's absolutely higher, on all accounts. But the relations between the notes didn't change, and the sensation, the motifs that the music is bringing out, those still remain fairly constant even after transposition. Quantization, however, is a different story. If I go back to the slides, you'll notice quantization took this thirty-second note and turned it into a sixteenth note by removing that second bar; we've distorted time. Is that a problem? It's not perfect, but it's a very minor problem. Over here I've plotted a histogram of all of the durations inside the chorale corpus, and this quantization affects only 0.2% of all the notes that we're training on. The reason that we do it is that by quantizing in time we're able to get discrete representations in both time and pitch, whereas on a continuous time axis you have a problem: computers are discrete and unable to operate on the continuous representation, so it has to be quantized into a digital format somehow.
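A rough music21 sketch of these two preprocessing steps, under my own simplifying assumptions (the real pipeline handles more edge cases, such as octave placement):

```python
from music21 import corpus, interval, pitch

chorale = corpus.parse('bach/bwv269')                       # one chorale from the corpus
k = chorale.analyze('key')                                  # estimate its key
tonic_target = pitch.Pitch('C') if k.mode == 'major' else pitch.Pitch('A')
chorale = chorale.transpose(interval.Interval(k.tonic, tonic_target))  # to C major / A minor
chorale.quantize(quarterLengthDivisors=[4], inPlace=True)   # snap offsets/durations to sixteenths
```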

The last challenge: polyphony. Polyphony is the presence of multiple simultaneous voices. So far, in the examples that I've shown you, you've just heard a single voice playing at any given time, but a chorale has four voices: the soprano, the alto, the tenor, and the bass. So here's a question for you: if I have four voices, and they can each represent 128 different pitches, that's the constraint in the MIDI representation of music, how many different chords can I construct? Very good, yes, 128^4, that's correct. I put a big O there because you can rearrange the ordering, but more or less, yeah, that's correct. And why is this a problem? Well, this is the problem: most of these chords are actually never seen, especially after you've transposed to C major and A minor. In fact, looking at the dataset, we can see that just the first 20 chords, or 20 notes rather, occupy almost 90% of the entire dataset. So if we were to represent all of these chords, we would have a ton of symbols in our vocabulary which we had never seen before.

The way we deal with this problem is by serializing. That is, instead of representing all four notes as one individual symbol, we represent each individual note as a symbol itself, and we serialize in soprano, alto, tenor, bass order. What you end up getting is a reduction from 128 to the 4th possible chords down to just 128 possible pitches. Now, this may seem a little unjustified, but it's actually done all the time in sequence processing: if you take a look at traditional language models, you can represent them either at the character level or at the word level; similarly, you can represent music either at the note level or at the chord level.

After serializing, the data looks like this. We have a symbol denoting the start of a piece, and this is used to initialize our model. We then have the four notes of the chord, soprano, alto, tenor, bass, followed by a delimiter indicating the end of this frame, that time has advanced one step into the future, followed by another soprano, alto, tenor, bass. We also have these funny-looking dot things, which I came up with to denote the fermata, so that we can encode where the end of a phrase is in our input training data.
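Here is a toy sketch of that serialized encoding; the token names are my own illustration, not necessarily BachBot's exact vocabulary. Each chord is flattened into SATB order, a frame delimiter advances time by one step, and a separate symbol marks fermatas:

```python
START, END, FRAME_END, FERMATA = "START", "END", "|||", "(.)"

def encode(chorale_frames):
    """chorale_frames: list of (soprano, alto, tenor, bass, has_fermata) tuples,
    with MIDI pitch numbers, one tuple per sixteenth-note time step."""
    tokens = [START]
    for s, a, t, b, has_fermata in chorale_frames:
        if has_fermata:
            tokens.append(FERMATA)
        tokens.extend([s, a, t, b])      # serialize in soprano-alto-tenor-bass order
        tokens.append(FRAME_END)         # advance time by one step
    tokens.append(END)
    return tokens

print(encode([(67, 64, 60, 48, False), (69, 65, 62, 50, True)]))
```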

After all of our preprocessing, our final corpus looks like this. There are only 108 symbols left, so not all 128 pitches are used in Bach's works, and there are about four hundred thousand tokens total, where we split three hundred and eighty thousand into a training set and forty thousand into a validation set. We split between training and validation in order to prevent overfitting: we don't want to just memorize Bach chorales; rather, we want to be able to produce very similar samples which are not exactly identical. And that's it. With that you have the training set, and it's encapsulated by the first three commands on that slide I showed earlier: bachbot make dataset, bachbot extract vocabulary.

The next step is to train the recurrent neural network. To talk about recurrent neural networks, let's break the term down: recurrent, neural, network. I'm going to start with neural. Neural just means that we have very basic building blocks called neurons, which look like this. They take a d-dimensional input x1 through xd, these are numbers like 0.9 or 0.2, and they're all added together in a linear combination, so what you end up getting is this activation z, which is just the sum of these inputs weighted by the w's. If a neuron really cares about, say, x2, then w2 will be large while w1 and the rest will be zeros, and so this lets the neuron preferentially select which of its inputs it cares more about and allows it to specialize for certain parts of its input. This activation is passed through this S-shaped thing called an activation function, commonly a sigmoid, but all it does is introduce a non-linearity into the network and expand the types of functions you can approximate, and then we have the output, called y. You take these neurons, you stack them horizontally, and you get what's called a layer. So here I'm just showing four neurons in this layer, three neurons in this layer, two neurons in this top layer, and I represent the network like this: we take the input x, this bottom part, we multiply by a matrix now, because we've replicated the neurons horizontally, where the W's represent the weights, and we pass it through this sigmoid activation function to get the first layer's outputs. This is done recursively through all the layers until you get to the very top, where we have the final outputs of the model. The W's here, the weights, those are the parameters of the network, and these are the things that we need to learn in order to train the neural network. Great.
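In symbols (generic notation, not the exact symbols from the slide), a single neuron and a whole layer compute roughly:

$$
z = \sum_{i=1}^{d} w_i x_i, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathbf{h} = \sigma(W\mathbf{x})
$$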

We now know feed-forward neural networks; let's introduce the word recurrent. Recurrent just means that the previous input, or the previous hidden state, is used in the next time step's prediction. So what I'm showing here, if you just pay attention to this input area, this layer right here, and this output, this part right here is the same thing as the network from before. However, we've added this funny little loop coming back; this is electrical-engineering notation for a unit time delay, and what it's saying is: take the hidden state from time t minus 1 and also include it as input into the time-t prediction. In equations it looks like this: the current hidden state is an activation of the weighted current inputs plus the weighted previous hidden state, and the output is a function of just the current hidden state.
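Written out in generic notation (the slide's own symbols aren't reproduced here), that recurrence is approximately:

$$
h_t = \sigma\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right), \qquad y_t = f\!\left(W_{hy}\, h_t\right)
$$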

We can take this loop right here, oh, sorry, before I go there: this is called an Elman-type recurrent neural network. This memory cell is very basic; it's just doing the exact same thing a normal neural network would do. It turns out there are some problems with just using the basic architecture, and so the architecture that the field has been converging towards is known as long short-term memory. It looks really complicated; it's not. You take the inputs and the hidden states and you put them into three spots right here: an input gate, a forget gate, and an output gate. The point of adding all this complexity is to solve a problem known as the vanishing gradient problem, where this constant error carousel of the hidden state being fed back to itself over and over and over results in signals converging toward zero or diverging to infinity. Fortunately, this is usually available as just a black-box implementation in most software packages; you just specify that you want to use an LSTM, and all of this is abstracted away from you.

Now, here, if you squint, you can kind of see the memory cell that I've shown previously, where we have the inputs and the hidden state feeding back into itself to generate an output; I've abstracted it away like this and I've stacked it up on top of itself, so rather than just having the outputs come out of this h right here, I've actually made them the inputs to another memory cell. This is where the word deep comes from: deep networks are just networks that have a lot of layers, and by stacking I get to use the word deep inside my deep LSTM model. But I'll show you later that I'm not just doing it for the buzzword; depth actually matters, as we'll see in the results. Another operation that's important for LSTMs is unrolling, and what unrolling does is take this unit time delay and replicate the LSTM units over time. So rather than showing the delay like this, I've shown the t-minus-first hidden unit passing state into the t-th hidden unit, passing state into the t-plus-first hidden unit. Your input is variable length, and to train the network what you do is expand this graph, you unroll the LSTM to the same length as your variable-length input, in order to get these predictions up at the top. Great, we know all we need to know about music and RNNs.

Let's move on to how BachBot works. To train BachBot, we apply a sequential prediction criterion. Now, I've borrowed this figure from Andrej Karpathy's GitHub, but the principles are the same. Suppose we're given the input characters "hello" and we want to model it using a recurrent neural network. The training criterion is: given the current input character and the previous hidden state, predict the next character. So notice, down here I have an "h" and I'm trying to predict "e"; I have "e" and I'm trying to predict "l"; I have "l" and I'm trying to predict "l"; and I have "l" and I'm trying to predict "o". If we take this analogy to music: I have all of the notes I've seen up until this point in time, and I'm trying to predict the next note. I can iterate this process forwards to generate compositions.

The criterion we want to use is, and so the output layer here is actually a probability distribution, sorry, take the previous slide and now I put it on top of my unrolled network. So, given the initial hidden state, which we just initialize to all zeros because we have a unique start symbol used to initialize our pieces, and the RNN dynamics, this is the probability distribution over the next state given the current state; this y_t stands for that, and it's a function of the current input x_t as well as the previous hidden state from t minus 1. We need to choose the RNN parameters, these weight matrices, the weights of all the connections between all the neurons, in order to maximize this probability right here: the probability of the real Bach chorales. So down here we have all the notes of the real Bach chorale, and up here we have the next notes of those. In an ideal world, if we just initialized it with some Bach chorale, it would just memorize and return the remainder, and that would do great on this prediction criterion, but that's not exactly what we want.

exactly what we want but nevertheless

once we have this criteria the way that

the model is actually trained is by

using the chain rule from calculus where

we take partial derivatives up here we

have an error signal so I know this is

the real Bach note the real note that

Bach used and this is the thing my model

is predicting ok they’re a little bit

different how do I change the parameters

this weight matrix between the hidden

state the outputs this weight matrix

between the previous in stay in the

current hidden state and this weight

matrix between the hidden state the

inputs how can I change those around how

do I wiggle those to make this output up

here closer to what Bach actually had

produced now this training criteria can

be just formalized

used by taking gradients using calculus

and iterating and then optimization

known as stochastic gradient descents

and when applied to neural networks it’s

an algorithm called back propagation

well back propagation through time if

you want to get nitty-gritty because

we’ve unrolled the neural network over

time but again this is also abstraction

that need not concern you because this

is also usually provided for you as a

black box inside of common frameworks

such as tensor flow and caris we now
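As a hedged illustration of what that black box consumes, the sequential-prediction setup just shifts the serialized token stream by one step; the framework then handles backpropagation through time:

```python
import numpy as np

def make_training_pairs(token_ids, seq_len=128):
    """Slice a serialized corpus into (input window, next-token targets) pairs."""
    xs, ys = [], []
    for start in range(0, len(token_ids) - seq_len - 1, seq_len):
        window = token_ids[start:start + seq_len + 1]
        xs.append(window[:-1])   # inputs  x_1 .. x_T
        ys.append(window[1:])    # targets x_2 .. x_{T+1}, shifted by one step
    return np.array(xs), np.array(ys)

# e.g. model.fit(*make_training_pairs(corpus_ids))  # gradients handled by the framework
```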

We now have the BachBot model, but there are a couple of parameters that we need to look at: I haven't told you exactly how deep BachBot is, nor have I told you how big these layers are. Before we start, when optimizing models, this is a very important learning, and it's probably obvious by now: GPUs are very important for rapid experimentation. I did a quick benchmark and I found that a GPU delivers an 8x speedup, making my training time go down from 256 minutes to just 28 minutes. So if you want to iterate quickly, getting a GPU will make you roughly eight times more productive.

Did I just put the word deep onto my neural network because it was a good buzzword? It turns out no, depth actually matters. What I'm showing you here are the training losses as well as the validation losses as I change the depth. The training loss is how well my model is doing on the training dataset, which I'm letting it see and letting it tune its parameters to do better on, and the validation loss is how well my model is doing on data that I didn't let it see, so how well it is generalizing beyond just memorizing its inputs. What we notice here is that with just one layer, the validation error is quite high; two layers gets you down here, three gets you this red curve, which is as low as it goes, and if you keep going with four, it goes back up. Should this be surprising? It shouldn't, and the reason is that as you add more layers, you're adding more expressive power. Notice that here, with four layers, you're actually doing just as well as the red curve on the training set, so you're doing great on the training set, but because your model is now so expressive, you're memorizing the inputs, and so you generalize more poorly. A similar story can be told about the hidden state size, so how wide those memory cells are, how many units we have in them. As we increase the hidden state size, we get improvements in generalization from this blue curve all the way down until we get to 256 hidden units, this green curve. After that, we see the same kind of behavior, where the training set error goes lower and lower, but because you're memorizing the inputs, because your model is now too powerful, your generalization error actually gets worse.

Finally, LSTMs. They're pretty complicated; the reason why I introduced them is that they're actually critical for performance. The basic Elman-type recurrent neural network, which just reuses the standard recurrent neural network architecture for the memory cell, is shown here in this green curve, which actually doesn't do too badly, but by using long short-term memory you get this yellow curve at the very bottom; it's doing the best out of all the architectures we looked at in terms of memory cells. Gated recurrent units are a simpler variant of LSTMs; they haven't been used as much, and so there's less literature about them, but on this task they also appear to be doing quite well. Cool.

After all of this experimentation and all of this manual grid search, we finally arrived at a final architecture, where notes are first embedded into a 32-dimensional real vector, and then we have a three-layer stacked long short-term memory recurrent neural network which processes these note sequences over time, and we trained it using standard gradient descent with a couple of tricks. We use this thing called dropout, with a setting of 30%, and what this means is that between subsequent connections between layers we randomly turn 30% of the neurons off. That seems a little bit counterintuitive, why might you want to do that? It turns out that by turning off neurons during training you actually force the neurons to learn more robust features that are independent of each other. Without dropout, two neurons may end up learning correlated features, essentially the exact same feature; if those connections are not always reliably available, each neuron has to learn features that are useful on their own. With dropout, we'll actually show on the next slide that generalization improves as we increase this number, up to a certain point. We also use something called batch normalization, which basically just takes your data, centers it back around zero, and rescales the variance, so that you don't have to worry about floating-point overflows or underflows. And we use 128-step truncated backpropagation through time, again another thing that your optimizer will handle for you, but at a high level, rather than unrolling the entire network over the entire input sequence, which could be tens of thousands of notes long, we only unroll it 128 steps and truncate the error signals; we basically say that after 128 time steps, whatever you do over here is not going to affect the future too much.
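Here is a hedged Keras sketch of the architecture just described (32-dimensional note embedding, three stacked LSTM layers of 256 units, 30% dropout between layers). The original BachBot was not written in Keras, so anything beyond the sizes stated in the talk is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 108   # symbols in the preprocessed corpus

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 32),                 # notes -> 32-d vectors
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),                              # randomly drop 30% of activations between layers
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),
    layers.Dense(VOCAB_SIZE, activation="softmax"),   # next-symbol probability distribution
])
# (batch normalization and the 128-step truncation are omitted here for brevity)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```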

Here's my promised slide about dropout. Counter-intuitively, as we start dropping out, turning off random neurons or random neuron connections, we actually generalize better. We see that without dropout the model starts to overfit dramatically: it gets better at generalizing at first, then worse and worse, because it's got so many connections it can learn so much. You turn dropout up to 0.3 and you get this purple curve at the bottom, where you've turned off just the right amount, so that the features the model is learning are robust and can generalize independently of other features. And if you turn it up too high, now you're dropping out so much that you're injecting more noise than you are regularizing the model, and you actually don't generalize that well. The story on the training side is also consistent: as we increase dropout, you do strictly worse on training, and that makes sense too, because this isn't generalization, this is just how well the model can memorize its input data, and if you turn inputs off, you won't memorize as well.

Great. With the trained model we can do many things: we can compose and we can harmonize. The way we compose is the following. We have the hidden states, we have the inputs, and we have the model weights, and so we can use the model weights to form this predictive distribution: what is the probability of my current note given all of the previous notes I've seen before? From this probability distribution we've just written down, we pick out a note according to how that distribution is parameterized, so up here this could be, I think "l" has the highest weight here. After we sample it, we just set x_t equal to whatever we sampled out of there and treat it as truth; we assume that whatever the output was right there is now the input for the next time step, and then we iterate this process forwards. So, starting with no notes at all, you begin from the start symbol and you just keep going until you sample the end symbol, and in that way we're able to generate novel automatic compositions.
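A hedged sketch of that sampling loop (helper names like `model.predict_next` are illustrative, not a real BachBot API):

```python
import numpy as np

def compose(model, start_token, end_token, max_len=1000, rng=np.random.default_rng()):
    tokens = [start_token]
    state = None                                              # hidden state starts at zeros
    while tokens[-1] != end_token and len(tokens) < max_len:
        probs, state = model.predict_next(tokens[-1], state)  # P(x_t | x_1 .. x_{t-1})
        next_token = int(rng.choice(len(probs), p=probs))     # sample from the distribution
        tokens.append(next_token)                             # feed the sample back in as truth
    return tokens
```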

Harmonization is actually a generalization of composition. In composition, what we basically did was: I've got a start symbol, fill in the rest. Harmonization is where you say: I've got the melody, I've got the bass line, or I've got these certain notes, fill in the parts that I didn't specify. For this we actually proposed a suboptimal strategy. I'm going to let alpha denote the stuff that we're given, so alpha could be, say, 1, 3, 7, the points in time where the notes are fixed, and the harmonization problem is that we need to choose the notes that aren't fixed: we need to choose the entire sequence x_1 through x_T such that the notes we're given, x_alpha, remain fixed. So our decision variables are the positions that are not in alpha, and we need to maximize this probability distribution. My kind of greedy solution, which I've received a lot of criticism for, is: okay, you're at this point in time, just take the most likely thing at the next point in time. The reason why this gets criticized is that if you greedily choose without looking at what influence this decision could have on your future, you might choose something that sounds really good right now but doesn't make any sense in the future harmonic context. It's kind of like acting without thinking about the consequences of your action. But the testament to how well this actually performs is not how bad it could be theoretically; it's how well it does empirically, is it still convincing, and we'll find out soon.
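A hedged sketch of that greedy harmonization strategy, where `fixed` maps the time indices in alpha to the given notes and everything else is filled in from the model's predictive distribution (`model.predict_next` is again an illustrative helper, not a real BachBot API):

```python
import numpy as np

def harmonize(model, fixed, length, start_token):
    tokens, state = [start_token], None
    for t in range(1, length):
        probs, state = model.predict_next(tokens[-1], state)
        if t in fixed:
            tokens.append(fixed[t])               # constrained position: keep the given note
        else:
            tokens.append(int(np.argmax(probs)))  # free position: greedily take the most likely note
    return tokens
```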

But before we go there, let's uncover the black box. I've been talking about neural networks as just this thing which you can optimize, throw data at, and it'll learn things. Let's take a look inside and see what's actually going on. What I've done here is I've taken the various memory cells of my recurrent neural network and unrolled them over time, so on the x-axis you see time, and on the y-axis I'm showing you the activations of all of the hidden units. So this is neuron number one through neuron number 32; this is neuron number one through neuron number 256 in the first hidden layer; and similarly this is neuron number one through neuron number 256 in the second hidden layer. Do you see any pattern there? I don't, I mean, I kind of do, there's this little smear right here and it seems to show up everywhere, as well as right here, but there's not too much intuitive sense that I can make out of this image. And this is a common criticism of deep neural networks: they're like black boxes where we don't know how they really work on the inside, but they seem to do awfully well.

As we get closer to the output, things start to make a little bit more sense. So, I previously was showing the hidden units of the first and second layers; now I'm showing the third layer, as well as a linear combination of the third layer, and finally the outputs of the model. As you get towards the end, you start seeing, oh, there's this little dotty pattern; this almost looks like a piano roll. If you remember the representation of music I showed earlier, where we had time on the x-axis and pitch on the y-axis, this looks awfully similar to that. And this isn't surprising either: recall we trained the neural network to predict the next note given the current note, or all the previous notes. If the network was doing perfectly, we would expect to just see the input here delayed by a single time step, and so it's unsurprising that we do see something that resembles the input. But it's not quite exactly the input; sometimes we see multiple predictions at one point in time, and this is really representing the uncertainty inside our predictions. Since it represents a probability distribution, we're not just saying the next note is this; rather, we're saying we're pretty sure the next note is this with this probability, but it could also be that with that probability. I call this the probabilistic piano roll; I don't know if that's standard terminology.

Here's one of the most interesting insights that I found from this model: it appears to actually be learning music theory concepts. What I'm showing here is some input that I provided to the model, and here I picked out some neurons, and no, these neurons are randomly selected, so I didn't go fishing for the ones that behave like this; rather, I just ran a random number generator, got eight of them out, and then I handed them off to my music-theorist collaborator and asked, hey, is there anything there? Here are the notes he made for me. He said that neuron 64, this one, and layer one neuron 138, this one, appear to be picking out perfect cadences with root-position chords in the tonic key, more music theory than I can understand, but if I look up here, it's like: that shape right there on the piano roll looks like that shape on the piano roll, looks like that shape on the piano roll. Interesting. Layer one neuron 151, I believe that is this one, A minor cadences ending phrases two and four, no, that's this one, sorry, and again I look up here and, okay, yeah, that kind of chord right there looks kind of like that chord right there. They seem to be specializing in picking out specific types of chords. Okay, so it's learning Roman numeral analysis and tonics and root-position chords and cadences. And the last ones, layer one neuron 87 and layer two neuron 37, I believe that's this one and this one, are picking out I6 chords. I have no idea what that means.

So, I showed you automatic composition at the beginning of the presentation, when I took some BachBot music and allegedly claimed it was Bach. I'll now show you what harmonization sounds like, and this is with the suboptimal strategy that I proposed. So we take a melody such as...

[Music]

We tell the model: this has to be the soprano line, what are the others likely to be? That's kind of convincing; it's almost like a Baroque C major chord progression. What's really interesting, though, is that not only can we harmonize simple melodies like that, we can actually take popular tunes such as this...

[Music]

...and we can generate a novel Baroque harmonization of what Bach might have done had he heard Twinkle Twinkle Little Star during his lifetime. Now, I'm going down the track of: oh, this is my model, it looks so good, it sounds so realistic. But that's exactly what I was criticizing at the beginning of the talk. My third research goal was how we can determine a standardized way to quantitatively assess the performance of generative models for this particular task, and the one which I recommend for all of automatic composition is to do a subjective listening experiment. So what we did is we built bachbot.com, and it looks like this: it's got a splash page and it's kind of trying to go viral, asking, can you tell the difference between Bach and a computer? They used to say man versus machine. The interface is simple: you're given two choices, one of them is Bach, one of them is BachBot, and you're asked to distinguish which one was the actual Bach. We put this up on the internet.

We got around nineteen hundred participants from all around the world. Participants tended to be within the eighteen to forty-five age group, and we got a surprisingly large number of expert users who decided to contribute. We defined expert as a researcher, someone who has published, or a teacher, someone with professional accreditation as a music teacher; advanced as someone who has studied in a degree program for music; and intermediate as someone who plays an instrument. And here's how they did. I've coded these with SATB to represent the parts that were asked to be harmonized, so this is: given the alto, tenor, and bass, harmonize the soprano; this here was: given just the soprano and bass, harmonize the middle two; and this is: compose everything, I'm going to give you nothing. This is the result that I've been quoting this entire talk: participants are only able to distinguish Bach from BachBot 7% better than random chance.

But there are some other interesting findings in here. Well, I guess this isn't too surprising: if you delete the soprano line, then BachBot has to create a convincing melody, and it doesn't do too well, whereas if you delete the bass line, BachBot does a lot better. Now, I think this is actually a consequence of the way I chose to deal with polyphony, in the sense that I serialized the music in soprano, alto, tenor, bass order, and so by the time BachBot got to figuring out what the bass note might be, it had already seen the soprano, alto, and tenor notes within that time instant, and so it already had a very strong harmonic context about what note might sound good. Whereas when it's got to generate the soprano note, BachBot has no idea what the alto, tenor, and bass notes might be, and so it's just going to make a guess that could be totally out of place. To validate this hypothesis, which is work left for the future, you could serialize in a different order, such as bass, tenor, alto, soprano, run this experiment again, and you would expect to see this go down like this if the hypothesis is true, and differently if not.

Here I've taken the exact same plot from the previous slide, except I've now broken it down by music experience. Unsurprisingly, you kind of see this curve where people do better as they get more experienced: the novices are only about three percent better, where the experts are sixteen percent better; they probably know Bach, they've got it memorized, so they can tell the difference. But the interesting one is here: the experts do significantly worse than random chance when comparing Bach versus BachBot bass harmonizations. I actually don't have a good explanation why, but it's surprising to me; it seems the experts think BachBot is more convincing than actual Bach.

So, in conclusion, I've presented a deep long short-term memory generative model for composing, completing, and generating polyphonic music. And this model isn't just research that I'm talking about that no one ever gets to use; it's actually open source, it's on my GitHub, and moreover Google Brain's Magenta project has actually integrated it already, so if you use the polyphonic recurrent neural network model in the Magenta and TensorFlow projects, you'll be using the BachBot model. The model appears to learn music theory without any prior knowledge: we didn't tell it this is a chord, this is a cadence, this is a tonic; it just decided to figure that out on its own in order to optimize performance on an automatic composition task. To me, this suggests that music theory, with all of its rules and all of its formalisms, actually is useful for composing; in fact, it's so useful that a machine trained to optimize composition decided to specialize on these concepts. Finally, we conducted the largest musical Turing test to date, with 1,700 participants, who could distinguish Bach from BachBot only 7% better than random chance. Obligatory note for my employer: we do freelance outsourcing, so if you need a development team, let me know. Other than that, thank you so much for your attention; it was a pleasure speaking to you all.

[Applause]
