[Music]

Cool, thank you. Just so I know what level to speak at: raise your hands if you know who Bach is. Great. Raise your hand if you know what a neural network is. Oh, this is the perfect crowd, awesome. If you don't know, don't worry, I'm going to cover the very basics of both. So let's talk about Bach. I'm going to play you some music.

[Music]

Now, what you just heard is what's known as a chorale. There are four parts to it, soprano, alto, tenor, and bass, playing at the exact same time, and there's very regular phrasing structure, where you have the beginning of a phrase, the termination of a phrase, followed by the next phrase. Except that wasn't Bach. Rather, that was a computer algorithm called BachBot, and that was one sample of its outputs. If you don't believe me, it's on SoundCloud, it's called "sample one", go listen for yourself. So instead of talking about Bach today, I'm going to talk to you about BachBot. Hi, my name is Feynman, and it's a pleasure to be here in Amsterdam. Today we'll talk about automatic stylistic composition using long short-term memory.

A bit of background about myself: I'm currently a software engineer at Gigster, where I work on interesting automation problems around taking contracts, dividing them into subcontracts, and then freelancing them out. The work on BachBot was done as part of my master's thesis, which I did at the University of Cambridge with Microsoft Research Cambridge. In line with the track here, I do not have a PhD, and I can still do machine learning. This is a fact: you can do machine learning without a PhD.

For those of you who just want to know what's going to happen and then get out of here if it's not interesting, here is the executive summary. I'm going to talk to you about how to train, end to end, starting from dataset preparation all the way to model tuning and deployment, a deep recurrent neural network for music. This neural network is capable of polyphony, multiple simultaneous voices at the same time. It's capable of automatic composition, generating a composition completely from scratch, as well as harmonization: given some fixed parts, such as the soprano line of the melody, generate the remaining supporting parts. This model learns music theory without being told to do so, providing empirical validation of what music theorists have been using for centuries. And finally, it's evaluated on an online musical Turing test, where, out of 1,700 participants, only nine percent are able to distinguish actual Bach from BachBot.

When I set off on this research, there were three primary goals. The first question I wanted to answer was: what is the frontier of computational creativity? Now, creativity is something we take to be innately human, innately special in some sense; computers ought not to be able to replicate this about us. Is this actually true? Can we have computers generate art that is convincingly human? The second question I wanted to answer was: how much does deep learning impact automatic music composition? Now, automatic music composition is a special field. It has been dominated by symbolic methods, which utilize things like formal grammars or context-free grammars, such as this parse tree. We saw connectionist methods in the early 1990s; however, they have fallen in popularity, and most recent systems have used symbolic methods. With the work here, I wanted to see whether the new advances in deep learning in the last ten years can be transferred over to this particular problem domain. And finally, the last question I wanted to look at is: how do we evaluate these generative models? We've seen, in the previous talk, a lot of models that generate art; we look at it, and as the author we say, oh, that's convincing, oh, that's beautiful. That might be a perfectly valid use case, but it's not sufficient for publication. To publish something, we need to establish a standardized benchmark, and we need to be able to evaluate all of our models against it, so we can objectively say which model is better than the other.

Now, if you're still here, I'm assuming you're interested. This is the outline. We'll start with a quick primer on music theory, giving you just the basic terminology you need to understand the remainder of this presentation. We'll talk about how to prepare a dataset of Bach chorales. We'll then give a primer on recurrent neural networks, which is the actual deep learning model architecture used to build BachBot. We'll talk about the BachBot model itself, the tips, tricks, and techniques that we used in order to train it, have it run successfully, and deploy it. And then we'll show the results. We'll show how this model is able to capture statistical regularities in Bach's musical style, and we'll provide, not prove, very convincing evidence that music theory does have empirical justification. Finally, I'll show the results of the musical Turing test, which was our proposed evaluation methodology for saying: yes, the task of automatically composing convincing Bach chorales is more closed than open of a problem as a result of BachBot. And if you're a hands-on type of learner, we've containerized the entire deployment, so if you go to my website here, I have a copy of the slides with all of these instructions. You run these eight lines of code and it runs this entire pipeline right here, where it takes the chorales, preprocesses them, puts them into a data store, trains the deep learning model, samples the deep learning model, and produces outputs that you can listen to.

Let's start with basic music theory. Now, when people think of music, this is usually what you think about: you've got these bar lines, you've got notes, and these notes are at different horizontal and vertical positions. Some of them have interesting ties, some of them have dots, and there's this interesting little weird hat-looking thing. We don't need all of this; we need three fundamental concepts. The first is pitch. Pitch is often described as how low or how high a note is, so if I play this, we can distinguish that some notes are lower and some notes are higher in frequency, and that corresponds to the vertical axis here: as the notes sound ascending, they appear ascending on the staff. The second attribute we need is duration, and this is really how long a note is. So this one note, these two notes, these four, and these eight all have equal total duration, but each is a halving of the previous, so if we take a listen...

The general intuition is that the more bars there are on the note stems, the faster the notes go. With just those two concepts, this is starting to make a little bit more sense: this right here is twice as fast as this note, we can see this note is higher than this note, and you can generalize this to the remainder of the piece. But there's still this funny hat-looking thing. We'll get to the hat in a second, but with pitch and duration we can rewrite the music like so. Rather than representing it using notes, which may be kind of cryptic, we show it here as a matrix, where on the x-axis we have time, so the duration, and on the y-axis we have pitch, how high or low in frequency that note is. What we've done is we've taken the symbolic representation of music and turned it into a digital, computable format that we can train models on.
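To make that matrix idea concrete, here is a tiny sketch, my own toy example rather than BachBot's exact encoding, of a piano roll with pitch on one axis and sixteenth-note time steps on the other:

```python
import numpy as np

# Toy piano roll: rows are MIDI pitches (0-127), columns are sixteenth-note time steps.
notes = [(60, 0, 4), (64, 4, 2), (67, 6, 2)]   # (midi_pitch, start_step, duration_in_steps)
roll = np.zeros((128, 8), dtype=np.uint8)
for midi_pitch, start, dur in notes:
    roll[midi_pitch, start:start + dur] = 1    # mark the cells where the note is sounding
```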

Back to the hat-looking thing: this is called a fermata, and Bach used it to denote the ends of phrases. We had originally set out on this research completely neglecting fermatas, and we found that the phrases generated by the model just kind of wandered; they never seemed to end, there was no sense of resolution or conclusion, and that was unrealistic. But by adding these fermatas, all of a sudden the model turned around and we suddenly found realistic phrasing structure. Cool, and that's all the music you need to know; the rest of it is machine learning.

Now, the biggest part of a machine learning engineer's job is preparing their datasets. This is a very painful task: you usually have to scour the internet or find some standardized dataset that you train and evaluate your models on, and usually these datasets have to be preprocessed and massaged into a format that's amenable for learning. For us it was no different. Bach's works, however, have fortunately been transcribed over the years into, excuse my German, the Bach-Werke-Verzeichnis, BWV, which is how I've been referring to this corpus. It contains all of the roughly 438 harmonizations of Bach chorales, and conveniently it is available through a software package called music21. This is a Python package that you can just pip install and then import, and now you have an iterator over a collection of music.
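For example, a minimal sketch (not the BachBot preprocessing code itself) of how music21 exposes the chorale corpus:

```python
from music21 import corpus

# Iterate over the bundled Bach chorale corpus; each item is a music21 Score.
for chorale in corpus.chorales.Iterator():
    part_names = [part.id for part in chorale.parts]   # e.g. ['Soprano', 'Alto', 'Tenor', 'Bass']
    print(chorale.metadata.title, part_names)
    break   # just show the first one
```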

The first preprocessing step we did is we took the original music here and we did two things: we transposed it, and then we quantized it in time. Now, you can notice the transposition by looking at these accidentals right here, these two little funny backwards-or-forwards b's, the flats, and then they're absent over here. Furthermore, that note has shifted up by half a line; that's a little hard to see, but it's happening. The reason why we did this is that we didn't want to learn key signature. Key signature is usually something decided by the author before the piece has even begun to be composed, and so key signature itself can be injected as a pre-processing step where we sample over all the keys Bach did use. So we removed key signatures from the equation through transposition, and I'll justify why that's an okay thing to do in the next slide. This first measure is a progression of five notes written in C major, and then what I did in the next measure is I just moved it up by five whole steps.

[Music]

So yes, the pitch did change; it's relatively higher, it's absolutely higher, on all accounts. But the relations between the notes didn't change, and the sensation, the motifs that the music is bringing out, those still remain fairly constant even after transposition. Quantization, however, is a different story. If I go back to the slides, you'll notice quantization took this thirty-second note and turned it into a sixteenth note by removing that second bar; we've distorted time. Is that a problem? It's not perfect, but it's a very minor problem. Over here I've plotted a histogram of all of the durations inside the chorale corpus, and this quantization affects only 0.2% of all the notes that we're training on. The reason that we do it is that by quantizing in time we're able to get discrete representations in both time and pitch, whereas on a continuous time axis you have a problem: computers are discrete and unable to operate on the continuous representation, so it has to be quantized into a digital format somehow.
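A rough music21 sketch of these two preprocessing steps, under my own simplifying assumptions (the real pipeline handles more edge cases, such as octave placement):

```python
from music21 import corpus, interval, pitch

chorale = corpus.parse('bach/bwv269')                       # one chorale from the corpus
k = chorale.analyze('key')                                  # estimate its key
tonic_target = pitch.Pitch('C') if k.mode == 'major' else pitch.Pitch('A')
chorale = chorale.transpose(interval.Interval(k.tonic, tonic_target))  # to C major / A minor
chorale.quantize(quarterLengthDivisors=[4], inPlace=True)   # snap offsets/durations to sixteenths
```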

The last challenge: polyphony. Polyphony is the presence of multiple simultaneous voices. So far, in the examples that I've shown you, you've just heard a single voice playing at any given time, but a chorale has four voices: the soprano, the alto, the tenor, and the bass. So here's a question for you: if I have four voices, and they can each represent 128 different pitches, that's the constraint in the MIDI representation of music, how many different chords can I construct? Very good, yes, 128^4, that's correct. I put a big O there because you can rearrange the ordering, but more or less, yeah, that's correct. And why is this a problem? Well, this is the problem: most of these chords are actually never seen, especially after you've transposed to C major and A minor. In fact, looking at the dataset, we can see that just the first 20 chords, or 20 notes rather, occupy almost 90% of the entire dataset. So if we were to represent all of these chords, we would have a ton of symbols in our vocabulary which we had never seen before.

The way we deal with this problem is by serializing. That is, instead of representing all four notes as one individual symbol, we represent each individual note as a symbol itself, and we serialize in soprano, alto, tenor, bass order. What you end up getting is a reduction from 128 to the 4th possible chords down to just 128 possible pitches. Now, this may seem a little unjustified, but it's actually done all the time in sequence processing: if you take a look at traditional language models, you can represent them either at the character level or at the word level; similarly, you can represent music either at the note level or at the chord level.

After serializing, the data looks like this. We have a symbol denoting the start of a piece, and this is used to initialize our model. We then have the four notes of the chord, soprano, alto, tenor, bass, followed by a delimiter indicating the end of this frame, that time has advanced one step into the future, followed by another soprano, alto, tenor, bass. We also have these funny-looking dot things, which I came up with to denote the fermata, so that we can encode where the end of a phrase is in our input training data.
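Here is a toy sketch of that serialized encoding; the token names are my own illustration, not necessarily BachBot's exact vocabulary. Each chord is flattened into SATB order, a frame delimiter advances time by one step, and a separate symbol marks fermatas:

```python
START, END, FRAME_END, FERMATA = "START", "END", "|||", "(.)"

def encode(chorale_frames):
    """chorale_frames: list of (soprano, alto, tenor, bass, has_fermata) tuples,
    with MIDI pitch numbers, one tuple per sixteenth-note time step."""
    tokens = [START]
    for s, a, t, b, has_fermata in chorale_frames:
        if has_fermata:
            tokens.append(FERMATA)
        tokens.extend([s, a, t, b])      # serialize in soprano-alto-tenor-bass order
        tokens.append(FRAME_END)         # advance time by one step
    tokens.append(END)
    return tokens

print(encode([(67, 64, 60, 48, False), (69, 65, 62, 50, True)]))
```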

After all of our preprocessing, our final corpus looks like this. There are only 108 symbols left, so not all 128 pitches are used in Bach's works, and there are about four hundred thousand tokens total, where we split three hundred and eighty thousand into a training set and forty thousand into a validation set. We split between training and validation in order to prevent overfitting: we don't want to just memorize Bach chorales; rather, we want to be able to produce very similar samples which are not exactly identical. And that's it. With that you have the training set, and it's encapsulated by the first three commands on that slide I showed earlier: bachbot make dataset, bachbot extract vocabulary.

The next step is to train the recurrent neural network. To talk about recurrent neural networks, let's break the term down: recurrent, neural, network. I'm going to start with neural. Neural just means that we have very basic building blocks called neurons, which look like this. They take a d-dimensional input x1 through xd, these are numbers like 0.9 or 0.2, and they're all added together in a linear combination, so what you end up getting is this activation z, which is just the sum of these inputs weighted by the w's. If a neuron really cares about, say, x2, then w2 will be large while w1 and the rest will be zeros, and so this lets the neuron preferentially select which of its inputs it cares more about and allows it to specialize for certain parts of its input. This activation is passed through this S-shaped thing called an activation function, commonly a sigmoid, but all it does is introduce a non-linearity into the network and expand the types of functions you can approximate, and then we have the output, called y. You take these neurons, you stack them horizontally, and you get what's called a layer. So here I'm just showing four neurons in this layer, three neurons in this layer, two neurons in this top layer, and I represent the network like this: we take the input x, this bottom part, we multiply by a matrix now, because we've replicated the neurons horizontally, where the W's represent the weights, and we pass it through this sigmoid activation function to get the first layer's outputs. This is done recursively through all the layers until you get to the very top, where we have the final outputs of the model. The W's here, the weights, those are the parameters of the network, and these are the things that we need to learn in order to train the neural network. Great.
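In symbols (generic notation, not the exact symbols from the slide), a single neuron and a whole layer compute roughly:

$$
z = \sum_{i=1}^{d} w_i x_i, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathbf{h} = \sigma(W\mathbf{x})
$$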

We now know feed-forward neural networks; let's introduce the word recurrent. Recurrent just means that the previous input, or the previous hidden state, is used in the next time step's prediction. So what I'm showing here, if you just pay attention to this input area, this layer right here, and this output, this part right here is the same thing as the network from before. However, we've added this funny little loop coming back; this is electrical-engineering notation for a unit time delay, and what it's saying is: take the hidden state from time t minus 1 and also include it as input into the time-t prediction. In equations it looks like this: the current hidden state is an activation of the weighted current inputs plus the weighted previous hidden state, and the output is a function of just the current hidden state.
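Written out in generic notation (the slide's own symbols aren't reproduced here), that recurrence is approximately:

$$
h_t = \sigma\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right), \qquad y_t = f\!\left(W_{hy}\, h_t\right)
$$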

We can take this loop right here, oh, sorry, before I go there: this is called an Elman-type recurrent neural network. This memory cell is very basic; it's just doing the exact same thing a normal neural network would do. It turns out there are some problems with just using the basic architecture, and so the architecture that the field has been converging towards is known as long short-term memory. It looks really complicated; it's not. You take the inputs and the hidden states and you put them into three spots right here: an input gate, a forget gate, and an output gate. The point of adding all this complexity is to solve a problem known as the vanishing gradient problem, where this constant error carousel of the hidden state being fed back to itself over and over and over results in signals converging toward zero or diverging to infinity. Fortunately, this is usually available as just a black-box implementation in most software packages; you just specify that you want to use an LSTM, and all of this is abstracted away from you.

Now, here, if you squint, you can kind of see the memory cell that I've shown previously, where we have the inputs and the hidden state feeding back into itself to generate an output; I've abstracted it away like this and I've stacked it up on top of itself, so rather than just having the outputs come out of this h right here, I've actually made them the inputs to another memory cell. This is where the word deep comes from: deep networks are just networks that have a lot of layers, and by stacking I get to use the word deep inside my deep LSTM model. But I'll show you later that I'm not just doing it for the buzzword; depth actually matters, as we'll see in the results. Another operation that's important for LSTMs is unrolling, and what unrolling does is take this unit time delay and replicate the LSTM units over time. So rather than showing the delay like this, I've shown the t-minus-first hidden unit passing state into the t-th hidden unit, passing state into the t-plus-first hidden unit. Your input is variable length, and to train the network what you do is expand this graph, you unroll the LSTM to the same length as your variable-length input, in order to get these predictions up at the top. Great, we know all we need to know about music and RNNs.

Let's move on to how BachBot works. To train BachBot, we apply a sequential prediction criterion. Now, I've borrowed this figure from Andrej Karpathy's GitHub, but the principles are the same. Suppose we're given the input characters "hello" and we want to model it using a recurrent neural network. The training criterion is: given the current input character and the previous hidden state, predict the next character. So notice, down here I have an "h" and I'm trying to predict "e"; I have "e" and I'm trying to predict "l"; I have "l" and I'm trying to predict "l"; and I have "l" and I'm trying to predict "o". If we take this analogy to music: I have all of the notes I've seen up until this point in time, and I'm trying to predict the next note. I can iterate this process forwards to generate compositions.

The criterion we want to use is, and so the output layer here is actually a probability distribution, sorry, take the previous slide and now I put it on top of my unrolled network. So, given the initial hidden state, which we just initialize to all zeros because we have a unique start symbol used to initialize our pieces, and the RNN dynamics, this is the probability distribution over the next state given the current state; this y_t stands for that, and it's a function of the current input x_t as well as the previous hidden state from t minus 1. We need to choose the RNN parameters, these weight matrices, the weights of all the connections between all the neurons, in order to maximize this probability right here: the probability of the real Bach chorales. So down here we have all the notes of the real Bach chorale, and up here we have the next notes of those. In an ideal world, if we just initialized it with some Bach chorale, it would just memorize and return the remainder, and that would do great on this prediction criterion, but that's not exactly what we want.

exactly what we want but nevertheless

once we have this criteria the way that

the model is actually trained is by

using the chain rule from calculus where

we take partial derivatives up here we

have an error signal so I know this is

the real Bach note the real note that

Bach used and this is the thing my model

is predicting ok they’re a little bit

different how do I change the parameters

this weight matrix between the hidden

state the outputs this weight matrix

between the previous in stay in the

current hidden state and this weight

matrix between the hidden state the

inputs how can I change those around how

do I wiggle those to make this output up

here closer to what Bach actually had

produced now this training criteria can

be just formalized

used by taking gradients using calculus

and iterating and then optimization

known as stochastic gradient descents

and when applied to neural networks it’s

an algorithm called back propagation

well back propagation through time if

you want to get nitty-gritty because

we’ve unrolled the neural network over

time but again this is also abstraction

that need not concern you because this

is also usually provided for you as a

black box inside of common frameworks

such as tensor flow and caris we now
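As a hedged illustration of what that black box consumes, the sequential-prediction setup just shifts the serialized token stream by one step; the framework then handles backpropagation through time:

```python
import numpy as np

def make_training_pairs(token_ids, seq_len=128):
    """Slice a serialized corpus into (input window, next-token targets) pairs."""
    xs, ys = [], []
    for start in range(0, len(token_ids) - seq_len - 1, seq_len):
        window = token_ids[start:start + seq_len + 1]
        xs.append(window[:-1])   # inputs  x_1 .. x_T
        ys.append(window[1:])    # targets x_2 .. x_{T+1}, shifted by one step
    return np.array(xs), np.array(ys)

# e.g. model.fit(*make_training_pairs(corpus_ids))  # gradients handled by the framework
```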

We now have the BachBot model, but there are a couple of parameters that we need to look at: I haven't told you exactly how deep BachBot is, nor have I told you how big these layers are. Before we start, when optimizing models, this is a very important learning, and it's probably obvious by now: GPUs are very important for rapid experimentation. I did a quick benchmark and I found that a GPU delivers an 8x speedup, making my training time go down from 256 minutes to just 28 minutes. So if you want to iterate quickly, getting a GPU will make you roughly eight times more productive.

Did I just put the word deep onto my neural network because it was a good buzzword? It turns out no, depth actually matters. What I'm showing you here are the training losses as well as the validation losses as I change the depth. The training loss is how well my model is doing on the training dataset, which I'm letting it see and letting it tune its parameters to do better on, and the validation loss is how well my model is doing on data that I didn't let it see, so how well it is generalizing beyond just memorizing its inputs. What we notice here is that with just one layer, the validation error is quite high; two layers gets you down here, three gets you this red curve, which is as low as it goes, and if you keep going with four, it goes back up. Should this be surprising? It shouldn't, and the reason is that as you add more layers, you're adding more expressive power. Notice that here, with four layers, you're actually doing just as well as the red curve on the training set, so you're doing great on the training set, but because your model is now so expressive, you're memorizing the inputs, and so you generalize more poorly. A similar story can be told about the hidden state size, so how wide those memory cells are, how many units we have in them. As we increase the hidden state size, we get improvements in generalization from this blue curve all the way down until we get to 256 hidden units, this green curve. After that, we see the same kind of behavior, where the training set error goes lower and lower, but because you're memorizing the inputs, because your model is now too powerful, your generalization error actually gets worse.

Finally, LSTMs. They're pretty complicated; the reason why I introduced them is that they're actually critical for performance. The basic Elman-type recurrent neural network, which just reuses the standard recurrent neural network architecture for the memory cell, is shown here in this green curve, which actually doesn't do too badly, but by using long short-term memory you get this yellow curve at the very bottom; it's doing the best out of all the architectures we looked at in terms of memory cells. Gated recurrent units are a simpler variant of LSTMs; they haven't been used as much, and so there's less literature about them, but on this task they also appear to be doing quite well. Cool.

After all of this experimentation and all of this manual grid search, we finally arrived at a final architecture, where notes are first embedded into a 32-dimensional real vector, and then we have a three-layer stacked long short-term memory recurrent neural network which processes these note sequences over time, and we trained it using standard gradient descent with a couple of tricks. We use this thing called dropout, with a setting of 30%, and what this means is that between subsequent connections between layers we randomly turn 30% of the neurons off. That seems a little bit counterintuitive, why might you want to do that? It turns out that by turning off neurons during training you actually force the neurons to learn more robust features that are independent of each other. Without dropout, two neurons may end up learning correlated features, essentially the exact same feature; if those connections are not always reliably available, each neuron has to learn features that are useful on their own. With dropout, we'll actually show on the next slide that generalization improves as we increase this number, up to a certain point. We also use something called batch normalization, which basically just takes your data, centers it back around zero, and rescales the variance, so that you don't have to worry about floating-point overflows or underflows. And we use 128-step truncated backpropagation through time, again another thing that your optimizer will handle for you, but at a high level, rather than unrolling the entire network over the entire input sequence, which could be tens of thousands of notes long, we only unroll it 128 steps and truncate the error signals; we basically say that after 128 time steps, whatever you do over here is not going to affect the future too much.
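Here is a hedged Keras sketch of the architecture just described (32-dimensional note embedding, three stacked LSTM layers of 256 units, 30% dropout between layers). The original BachBot was not written in Keras, so anything beyond the sizes stated in the talk is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 108   # symbols in the preprocessed corpus

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 32),                 # notes -> 32-d vectors
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),                              # randomly drop 30% of activations between layers
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.3),
    layers.Dense(VOCAB_SIZE, activation="softmax"),   # next-symbol probability distribution
])
# (batch normalization and the 128-step truncation are omitted here for brevity)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```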

Here's my promised slide about dropout. Counter-intuitively, as we start dropping out, turning off random neurons or random neuron connections, we actually generalize better. We see that without dropout the model starts to overfit dramatically: it gets better at generalizing at first, then worse and worse, because it's got so many connections it can learn so much. You turn dropout up to 0.3 and you get this purple curve at the bottom, where you've turned off just the right amount, so that the features the model is learning are robust and can generalize independently of other features. And if you turn it up too high, now you're dropping out so much that you're injecting more noise than you are regularizing the model, and you actually don't generalize that well. The story on the training side is also consistent: as we increase dropout, you do strictly worse on training, and that makes sense too, because this isn't generalization, this is just how well the model can memorize its input data, and if you turn inputs off, you won't memorize as well.

Great. With the trained model we can do many things: we can compose and we can harmonize. The way we compose is the following. We have the hidden states, we have the inputs, and we have the model weights, and so we can use the model weights to form this predictive distribution: what is the probability of my current note given all of the previous notes I've seen before? From this probability distribution we've just written down, we pick out a note according to how that distribution is parameterized, so up here this could be, I think "l" has the highest weight here. After we sample it, we just set x_t equal to whatever we sampled out of there and treat it as truth; we assume that whatever the output was right there is now the input for the next time step, and then we iterate this process forwards. So, starting with no notes at all, you begin from the start symbol and you just keep going until you sample the end symbol, and in that way we're able to generate novel automatic compositions.
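A hedged sketch of that sampling loop (helper names like `model.predict_next` are illustrative, not a real BachBot API):

```python
import numpy as np

def compose(model, start_token, end_token, max_len=1000, rng=np.random.default_rng()):
    tokens = [start_token]
    state = None                                              # hidden state starts at zeros
    while tokens[-1] != end_token and len(tokens) < max_len:
        probs, state = model.predict_next(tokens[-1], state)  # P(x_t | x_1 .. x_{t-1})
        next_token = int(rng.choice(len(probs), p=probs))     # sample from the distribution
        tokens.append(next_token)                             # feed the sample back in as truth
    return tokens
```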

Harmonization is actually a generalization of composition. In composition, what we basically did was: I've got a start symbol, fill in the rest. Harmonization is where you say: I've got the melody, I've got the bass line, or I've got these certain notes, fill in the parts that I didn't specify. For this we actually proposed a suboptimal strategy. I'm going to let alpha denote the stuff that we're given, so alpha could be, say, 1, 3, 7, the points in time where the notes are fixed, and the harmonization problem is that we need to choose the notes that aren't fixed: we need to choose the entire sequence x_1 through x_T such that the notes we're given, x_alpha, remain fixed. So our decision variables are the positions that are not in alpha, and we need to maximize this probability distribution. My kind of greedy solution, which I've received a lot of criticism for, is: okay, you're at this point in time, just take the most likely thing at the next point in time. The reason why this gets criticized is that if you greedily choose without looking at what influence this decision could have on your future, you might choose something that sounds really good right now but doesn't make any sense in the future harmonic context. It's kind of like acting without thinking about the consequences of your action. But the testament to how well this actually performs is not how bad it could be theoretically; it's how well it does empirically, is it still convincing, and we'll find out soon.
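A hedged sketch of that greedy harmonization strategy, where `fixed` maps the time indices in alpha to the given notes and everything else is filled in from the model's predictive distribution (`model.predict_next` is again an illustrative helper, not a real BachBot API):

```python
import numpy as np

def harmonize(model, fixed, length, start_token):
    tokens, state = [start_token], None
    for t in range(1, length):
        probs, state = model.predict_next(tokens[-1], state)
        if t in fixed:
            tokens.append(fixed[t])               # constrained position: keep the given note
        else:
            tokens.append(int(np.argmax(probs)))  # free position: greedily take the most likely note
    return tokens
```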

But before we go there, let's uncover the black box. I've been talking about neural networks as just this thing which you can optimize, throw data at, and it'll learn things. Let's take a look inside and see what's actually going on. What I've done here is I've taken the various memory cells of my recurrent neural network and unrolled them over time, so on the x-axis you see time, and on the y-axis I'm showing you the activations of all of the hidden units. So this is neuron number one through neuron number 32; this is neuron number one through neuron number 256 in the first hidden layer; and similarly this is neuron number one through neuron number 256 in the second hidden layer. Do you see any pattern there? I don't, I mean, I kind of do, there's this little smear right here and it seems to show up everywhere, as well as right here, but there's not too much intuitive sense that I can make out of this image. And this is a common criticism of deep neural networks: they're like black boxes where we don't know how they really work on the inside, but they seem to do awfully well.

As we get closer to the output, things start to make a little bit more sense. So, I previously was showing the hidden units of the first and second layers; now I'm showing the third layer, as well as a linear combination of the third layer, and finally the outputs of the model. As you get towards the end, you start seeing, oh, there's this little dotty pattern; this almost looks like a piano roll. If you remember the representation of music I showed earlier, where we had time on the x-axis and pitch on the y-axis, this looks awfully similar to that. And this isn't surprising either: recall we trained the neural network to predict the next note given the current note, or all the previous notes. If the network was doing perfectly, we would expect to just see the input here delayed by a single time step, and so it's unsurprising that we do see something that resembles the input. But it's not quite exactly the input; sometimes we see multiple predictions at one point in time, and this is really representing the uncertainty inside our predictions. Since it represents a probability distribution, we're not just saying the next note is this; rather, we're saying we're pretty sure the next note is this with this probability, but it could also be that with that probability. I call this the probabilistic piano roll; I don't know if that's standard terminology.

Here's one of the most interesting insights that I found from this model: it appears to actually be learning music theory concepts. What I'm showing here is some input that I provided to the model, and here I picked out some neurons, and no, these neurons are randomly selected, so I didn't go fishing for the ones that behave like this; rather, I just ran a random number generator, got eight of them out, and then I handed them off to my music-theorist collaborator and asked, hey, is there anything there? Here are the notes he made for me. He said that neuron 64, this one, and layer one neuron 138, this one, appear to be picking out perfect cadences with root-position chords in the tonic key, more music theory than I can understand, but if I look up here, it's like: that shape right there on the piano roll looks like that shape on the piano roll, looks like that shape on the piano roll. Interesting. Layer one neuron 151, I believe that is this one, A minor cadences ending phrases two and four, no, that's this one, sorry, and again I look up here and, okay, yeah, that kind of chord right there looks kind of like that chord right there. They seem to be specializing in picking out specific types of chords. Okay, so it's learning Roman numeral analysis and tonics and root-position chords and cadences. And the last ones, layer one neuron 87 and layer two neuron 37, I believe that's this one and this one, are picking out I6 chords. I have no idea what that means.

So, I showed you automatic composition at the beginning of the presentation, when I took some BachBot music and allegedly claimed it was Bach. I'll now show you what harmonization sounds like, and this is with the suboptimal strategy that I proposed. So we take a melody such as...

[Music]

We tell the model: this has to be the soprano line, what are the others likely to be? That's kind of convincing; it's almost like a Baroque C major chord progression. What's really interesting, though, is that not only can we harmonize simple melodies like that, we can actually take popular tunes such as this...

[Music]

...and we can generate a novel Baroque harmonization of what Bach might have done had he heard Twinkle Twinkle Little Star during his lifetime. Now, I'm going down the track of: oh, this is my model, it looks so good, it sounds so realistic. But that's exactly what I was criticizing at the beginning of the talk. My third research goal was how we can determine a standardized way to quantitatively assess the performance of generative models for this particular task, and the one which I recommend for all of automatic composition is to do a subjective listening experiment. So what we did is we built bachbot.com, and it looks like this: it's got a splash page and it's kind of trying to go viral, asking, can you tell the difference between Bach and a computer? They used to say man versus machine. The interface is simple: you're given two choices, one of them is Bach, one of them is BachBot, and you're asked to distinguish which one was the actual Bach. We put this up on the internet.

We got around nineteen hundred participants from all around the world. Participants tended to be within the eighteen to forty-five age group, and we got a surprisingly large number of expert users who decided to contribute. We defined expert as a researcher, someone who has published, or a teacher, someone with professional accreditation as a music teacher; advanced as someone who has studied in a degree program for music; and intermediate as someone who plays an instrument. And here's how they did. I've coded these with SATB to represent the parts that were asked to be harmonized, so this is: given the alto, tenor, and bass, harmonize the soprano; this here was: given just the soprano and bass, harmonize the middle two; and this is: compose everything, I'm going to give you nothing. This is the result that I've been quoting this entire talk: participants are only able to distinguish Bach from BachBot 7% better than random chance.

But there are some other interesting findings in here. Well, I guess this isn't too surprising: if you delete the soprano line, then BachBot has to create a convincing melody, and it doesn't do too well, whereas if you delete the bass line, BachBot does a lot better. Now, I think this is actually a consequence of the way I chose to deal with polyphony, in the sense that I serialized the music in soprano, alto, tenor, bass order, and so by the time BachBot got to figuring out what the bass note might be, it had already seen the soprano, alto, and tenor notes within that time instant, and so it already had a very strong harmonic context about what note might sound good. Whereas when it's got to generate the soprano note, BachBot has no idea what the alto, tenor, and bass notes might be, and so it's just going to make a guess that could be totally out of place. To validate this hypothesis, which is work left for the future, you could serialize in a different order, such as bass, tenor, alto, soprano, run this experiment again, and you would expect to see this go down like this if the hypothesis is true, and differently if not.

Here I've taken the exact same plot from the previous slide, except I've now broken it down by music experience. Unsurprisingly, you kind of see this curve where people do better as they get more experienced: the novices are only about three percent better, where the experts are sixteen percent better; they probably know Bach, they've got it memorized, so they can tell the difference. But the interesting one is here: the experts do significantly worse than random chance when comparing Bach versus BachBot bass harmonizations. I actually don't have a good explanation why, but it's surprising to me; it seems the experts think BachBot is more convincing than actual Bach.

So, in conclusion, I've presented a deep long short-term memory generative model for composing, completing, and generating polyphonic music. And this model isn't just research that I'm talking about that no one ever gets to use; it's actually open source, it's on my GitHub, and moreover Google Brain's Magenta project has actually integrated it already, so if you use the polyphonic recurrent neural network model in the Magenta and TensorFlow projects, you'll be using the BachBot model. The model appears to learn music theory without any prior knowledge: we didn't tell it this is a chord, this is a cadence, this is a tonic; it just decided to figure that out on its own in order to optimize performance on an automatic composition task. To me, this suggests that music theory, with all of its rules and all of its formalisms, actually is useful for composing; in fact, it's so useful that a machine trained to optimize composition decided to specialize on these concepts. Finally, we conducted the largest musical Turing test to date, with 1,700 participants, who could distinguish Bach from BachBot only 7% better than random chance. Obligatory note for my employer: we do freelance outsourcing, so if you need a development team, let me know. Other than that, thank you so much for your attention; it was a pleasure speaking to you all.

[Applause]
