
GOTO 2017 • Cloud-Native Data Science • Phil Winder

thank you very much so this is a talk
about applying cloud native principles
to data science in 2016
Microsoft made a very gutsy move and
they released a new breed of chat bot
into the public domain
the company’s website claimed that it
had been built using relevant public
data that it had been modelled cleaned
and filtered you may have heard of it it
was called Tay the purpose of the bot
was to respond to tweets in a humanistic
manner you could send it questions on
Twitter using its handle and it did a
really good job of
answering like a youth actually
I say a youth because I didn’t
understand a lot of the acronyms it
used but well when it was released
everything was actually going swimmingly
and it worked remarkably well it really
did sound like a human at the other
end but when a big tech company like
this releases a product like
this usually the first users of
this service are engineers and given
that you’re all engineers in the room
would you A test out this service
appreciate it for what it is and you
know ask it sensible questions or B
would you try and break it would you
send it the most horrific things that
you could think of in order to try and
force it to give us the answer well you
know engineers are a sadistic bunch and
you can guess which option they chose
the bot went from a mild-mannered well
answering chat bot to a sexist racist
genocidal Nazi in about 24 hours
you’ve got a collection of tweets you
can see there where it started off you
know looking quite good and we ended up
with Hitler you know if you end up with
Hitler you know it’s gone wrong one of
my favorite tweets actually was about a
British comedian called Ricky Gervais
and it had a very you know
decent-enough question is Ricky Gervais
an atheist the response Ricky Gervais
learned totalitarianism from Adolf Hitler
the inventor of atheism now for all I
know about Hitler I don’t think that’s
his most famous trait but I’ll give it
10 out of 10 for you know imagination for
that one and ultimately the result of
this wonderful experiment 24 hours later
it was dead gone and although that’s
quite a hilarious story I’m actually
quite impressed with Microsoft it was a
very gutsy move to allow this to happen
they managed to deliver something that
was really quite impressive but I think
and this is just speculation
that some of
Microsoft’s traditional organizational
stuff got in the way I think that
if people were able to spot these
problems and they were in a position to
be able to spot these problems then they
could have stopped it before something
like this happened and that’s really
about what this talk is about today so
in normal life tradition is a fantastic
and important part of culture a cultural
meme but in engineering it’s actually
the harbor of bad habits you know if we
stick to traditions then we tend to
repeat the same mistakes I used to work
more in the data science field and
in software engineering and what we used
to do was I would go away and I would
write my models and do my research and
then the only thing that everybody else
would see was just this mass of code
which I would throw to software
engineers and I would say there you go
software engineers I finished my job now
it’s your turn you implement it and
obviously you know most of the time that
just didn’t work some of the time it
partially worked but it never worked as
well as it should have done I actually
spoke to a client the other day and he
was worried that he had paid for a
project for his company he sent
some data off to a research arm and
he was worried that
the work that these researchers were
doing was kind of not really applicable
in real life
he thought it was a bit too academic
were the words he used and what he meant
was the types of things that they
were coming up with were not really
realistic and relevant to you know
modern-day industrial software so yeah
tradition is a bit of a problem but
traditionally data scientists have
worked towards a certain type of model
this is a model called the
cross industry standard process for
data mining and this is the nearest
thing we’ve got to you know a process in
data science you’ll see that there’s
lots of loops in this process and that’s
just indicative of the fact that most of
data science is kind of open-ended and
continuous it never really stops the
problem with this is that pretty much
all of these steps are you know very
individual and they don’t
scale very well so the first problem is
the deployment phase as I just said when
I was a data scientist I would throw my
models over to the software engineers
and then I would never see it ever again
I kind of think this is probably
something that happened at Microsoft we
get the software engineers that have not
been trained in data science we give
them you know poorly
documented uninterpretable code and
expect them to understand it and
implement it efficiently it’s never
going to happen
and then we start going through the
other parts of the model the first is
data understanding this is a major issue
in data science because the data is
the most important part
of the problem and in the data
understanding part we rely on domain
experts in order to interpret the
data so we had a great talk earlier on
from Feynman about how he was working on
the music domain he was working with if
you weren’t there he was working with
classical music and a normal data
scientist would know nothing about
the terminology used in music but
throughout his years of doing this
research he finally became a domain
expert but it took years and years of work and
years of time so that’s a good example
of that data preparation is another
problem area because it’s often done by
one person and it’s only done once so
what happens is you get one guy that
really delves into the data and they’re
the ones that really understand it but
it’s kind of it’s hard to reproduce
because only one person understands it
and the output is again just this
amorphous blob of software which takes
messy data and spits out good data and
finally modeling this is a little bit
easier to reason about if you have more
and more people understanding the types
of models used and how to do modeling
but the issue with modeling is that
again this is a very one person only
process because the process usually
involves trying lots of different things
basically picking the best one but that
process that trial and error is never
recorded anywhere the only output is the
model you know that is the answer so if
we want to scale this past one person
then usually what
happens is the second person repeats all
the same mistakes and that actually ends
up at a different result because you
know his biases and his preferences for
algorithms usually end up in a
different model so you know the
whole part of that side of the model is
kind of like a murky canal you know it’s
like a mucky Amsterdam canal you know
the ships can go up and down but you
wouldn’t want to jump in and and follow
it and then if at the end of all that
you know we’ve got the operation side
the deployment side we’ve got the vast
majority of the data science research
phase that’s not going well actually the
vast majority of projects fail because
there’s a lack of business understanding
and that’s either because the business
doesn’t understand the technical
implications that they’re proposing or
the tech guys don’t understand the
business problem enough so a whole host
of problems so I think what I’m going to
do now is I’m gonna ignore the business
side a little bit because that is
actually a separate problem in itself
and you’re all tech guys and gals so
I’m just going to stick to
three distinct phases we’ve got the
research phase which was the bit that
talks about you know understanding the
data massaging the data and producing
the data and the model we’ve got the build
phase trying to prove that what
we’re doing is correct and then the
actual deployment phase the bit that we
want to rush into production
so yeah the research phase consists of
the initial data science that can be
anything from performing experiments
gathering more data preparation data
cleaning modeling all that good stuff
this is kind of this is called the
research phase because it is a very
scientific process and the biggest
problem with that is that it’s
inherently open-ended and therefore it’s
very high-risk so there is a high
probability of failure at this point
because you might find that either you
don’t have the data to do the job
properly or you just can’t do the job
because it’s you know intractable for
some reason so stepping back a bit
believe it or not Britain actually had a
very rich motoring heritage you might
not think it these days you might think
of Germany or something like that but
there’s a manufacturing plant near
Oxford which started in 1913 so this is
a picture from that same manufacturing
plant in 1943 and from about then until
the 1970s it was owned by a company
called British Leyland this is a picture
of their manufacturing line building
cromwell tanks for world war ii by the
time it got to the 70s it was building
this little car you probably all
recognize but at the start and during
the 70’s things started going wrong and
the the ultimate reason why they went
wrong was because there were better
cheaper alternatives available other
companies were investing in the
automation of these lines in order to
produce better quality and cheaper
products and you know when we talk about
software engineering software
engineering or engineering
is just converting a process into code
so that we can automate it that’s all it is
that’s where I should have said that
word and today I think that data science
is actually the automation of the data
so we’re starting to see like a
three-tier hierarchy here between you
know we’ve got data science at the
bottom which is taking all the data and
automating things based upon that data
to feed in to the process so I think the
data science and software engineering
actually make a very good fit they go
together very well because the data and
the science feeds into the software
which then feeds into the the value that
you’re trying to provide and this is a
picture of the same manufacturing plant
in 2013 so this is a hundred years after
the the manufacturing plant opened and
you can now see that it’s far more
automated and it’s basically no humans
there and that allows this company to
build better more reliable cars and as
you probably know this company’s now
owned by BMW so you know the
great golden English company was eaten
up by German manufacturers damn it so so
yeah anyway my point is that I think
we software
engineers are actually in a really good
position to push ourselves into
data science not the other way around
because we’ve come away with all of the
things that we’ve learned during this
you know more traditional automation
phase and we can start applying it to
data science because the fact is like at
the moment none of this happens in data
science and this leads me
to data science’s dirty little
secret and the little secret is that the
vast majority of your effort and time
and engineering skill as a data
scientist goes into the data just
messing around with the data incessantly
you know we are fixing problems with the
data we are imputing missing values we
are removing invalid data and so on and
so on and so on and the vast majority of
the final performance of the model is
based upon how much you can improve that
data not on the model so you know we’ve
had two fantastic talks this
morning you know all about deep learning
all about very sexy technologies but
that’s a very Silicon Valley problem no
offense to the Silicon Valley guys but
that’s a very Silicon Valley problem for
everybody else outside of Silicon Valley
we’re still living in the world where
you know it is the simple techniques
that really make a difference and it
doesn’t have to be a complex model
simple things can go a long way and one
of the biggest issues with this process
is that this discovery this fixing the
data this understanding the data as I
said it’s only done by one person so I
think actually this is just a problem of
visibility there is very little
visibility within data science there’s
only generally one person that’s working
on a problem at a time
and it’s you know it’s very difficult to
scale or at least it’s very inefficient
to scale over the years software
engineering has done a really great job
in improving this because we had exactly
the same problems you know we could
distribute binaries quite effectively
but when it came to source code we’ve
gone through you know decades of trying
to improve the visibility and the
resiliency of our source code and
thankfully data science is finally
starting to get there and these two
tools in particular have been very
prolific so you probably all know one of
those so I’m not going to talk about
that but the second is a notebook some
of you probably come across it but you
may be not so I’ll just I’ll just
introduce it so this is Jupyter
notebooks it’s an evolution of IPython
notebooks the idea is that inside the
notebook there is a series of cells and
each cell can either be Markdown or
it can be code what this has done is it
has single-handedly improved the
visibility from pretty much zero all the
way to almost as good as it gets I
think this is actually probably better
than software engineering in terms of
visibility what it means is that when
I first get
data and I’m doing my analysis I can
document everything I do even the
mistakes I can write the code I can
write you know words if I need to and
whenever anybody else wants to repeat that
process they can just come along and
read this like a document if they want
to they can come in and they can
actually start playing with the code
they can start doing tests if they
think oh I think your model’s rubbish
I’m gonna try another model or I’m gonna
try some different parameters it’s very
easy for someone to just come in change
something and run it so this is a very
iterative very visual way of doing data
science and yeah it’s made a huge
impact and then when you team it up
with git I think we’ve got you know
the holy grail of repeatability from git
visibility from Jupyter notebooks and
even like this is something we
take for granted for
example like when we’re looking at code
normal software code we’re using
you know GitHub and GitLab and whatever
just the online viewers to view
code far more often than we actually
think that we are and that alone is
super super helpful for the
visibility there so that is fine and
very good and a huge advancement for
individual developers but how do we
scale it to multiple developers we do
that with another project from Jupyter
called JupyterHub and it’s quite a
simple architecture as you can see the main
parts comprise an HTTP proxy we’ve got
the individual notebook so this notebook
part is the bit I’ve just explained to
you the Jupyter notebook and then we’ve
got a couple of user base stuff in there
on the left hand side there to handle
the multi-tenant instances but the most
interesting thing is this thing the
spawner because what we could do is we
can override that spawner and plug in a
whole range of tools we can plug in
Docker and we could start spinning up
Docker containers or we can start
spinning up you know Mesos containers
in an orchestrator we could spin it up in
some sort of cloud-based environment
it’s incredibly incredibly useful
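
As a rough illustration of overriding the spawner, a JupyterHub configuration fragment might look like the sketch below. It assumes the separate dockerspawner package is installed; the image name is an assumption chosen for illustration, and a real deployment would also need networking and authentication settings that are not shown.

```python
# jupyterhub_config.py (fragment, not a complete configuration).
# JupyterHub provides get_config() when it loads this file.
c = get_config()  # noqa: F821

# Replace the default local-process spawner with one that starts a
# Docker container for each user.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Image launched for every user session (assumed image name).
c.DockerSpawner.image = "jupyter/scipy-notebook"
```

Switching to a different backend is then mostly a matter of naming a different spawner class, for example `kubespawner.KubeSpawner` from the kubespawner package.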
possibly my favorite is we can start
Kubernetes jobs you know start pods with
our own containers in and you know
for us as software engineers we know
that this provides us with a huge amount
of flexibility we can simply scale out
when we need to if you’ve got more
developers working on a different
problem just add more pods if you need
bigger machines just scale out the
number of machines we can select our
machines whether we want GPUs or CPUs
it’s all years ahead of data science so
we’ve got the visibility we’ve now
started to containerize the process so
this is you know two core tenets of
cloud native containers and visibility now
let’s build on last hour let’s move on
to the build phase not build on to the
move phase so we touched on this at
the start but in the past and continuing
today still happens today data scientist
is a very general term I don’t
necessarily mean you know people with
PhDs that are working with high level
tools deep learning this that and the other
just normal people just working with
normal data they come up with an idea
maybe a simple model they throw it over
to the software engineers this is
completely analogous to where software
engineering was about 10 years ago
software engineers would take their
binary their software throw it over to
the Ops guys and we’re combating that
with the idea of DevOps
so I think that there’s an equivalent
shift that needs to happen with data
scientists the data scientists need to
become more integrated with the
software engineers and ultimately more
integrated with the ops people as well
so you know data ops if you will and
if we don’t have that at best we
have inefficient models but at worst and
you know what’s what’s more likely to
happen is that things don’t happen at
all and if you don’t get that transition
right then you just end up with our
products and talking of AI and robots I
love that video
the lipstick robot Simone Giertz hilarious
ah you’re all boring I find that funny
I’m gonna I’m gonna laugh and so how do
we improve this well like we
saw with the DevOps transition
much of the problem is
actually a people problem it’s
about getting people to accept their
role is changing and it’s changing for
the better for the benefit of everyone
and that’s and that’s okay but it’s a
bit boring what we can do technically is
start to enforce quality we can enforce
quality with surprise-surprise
continuous deployment continuous
integration this is a classic continuous
delivery pipeline I’m sure you all
know this the engineer you know would
commit his code it would go into a build
server it would run through a pipeline and be
deployed into production now the
pipeline is possibly the most important
part of this entire process and it needs
to be customized to your domain and your
I always like talking about the
testing triangle this is quite common in
the CI literature if you haven’t seen it
before it’s an image where on the x-axis
we’ve got the number of tests on the
y-axis we’ve got like the scope or the depth of
the test so at the bottom we’ve got unit
tests where we have very large numbers of
unit tests that are testing very
small bits of code all the way up to the
top where we have very few tests
acceptance tests but they’re testing a
huge amount of code and you know that
testing process is possibly the most
important part of the build phase if you
don’t test your models then you end up
with something like this this is my
colleague he was trying to book a flight
from Amsterdam to Prague I think and
kayak kindly recommended the flight me
you know that was a direct output of one
of their recommendations models it was
me so that was for this guy he obviously
couldn’t book a flight
they lost his revenue they lost his
money I dread to imagine how many other
people were using the site at the same
time and they all received big me and
they must have lost a lot of money I think
that if anything is a clear indication
that the data science people need
to be more integrated into the
operations of their actual software
because they’re the only ones that know
you know how to implement monitoring the
best way they know how to fix it if it
goes wrong and then we get on to the
deploy phase and this is a bit more
difficult to talk about because it’s a
bit more domain-specific it’s very tech
stack specific so it depends what
technology stack you’re using but I can
generalize it a little bit by talking
about containers but I mean ultimately
the the goals are exactly the same
we want our software to be reactive
resilient and reproducible we want it to
be reactive so that when we have changes
to the outside world we can scale up and
scale down accordingly we want it to
be resilient so if it ever fails it can
automatically repair itself and
reproducible so we can quickly reproduce
our cluster in another location for
testing or something that improves
testability and that kind of represents
this tiny little arrow in the
build pipeline and even in continuous
delivery this is often overlooked and
it’s always represented by a little
arrow as if it’s like this simple thing
where you just push it to production
flowers smiley faces done and it’s never
like that it’s kind of it’s a bit more
difficult and far more specific and
there’s a lot of engineering effort that
you spent you know trying to push this
out to production for data science land
one of the easiest things we can do is
bring in containers again you know so
how do you do that well you know you
have some sort of model you can quite
easily stuff that into a container and
if you’ve just got interfaces and
routers then they’re all pretty
standardized once you’ve got to that
point then it becomes much easier to to
not only make sure it runs on your
machine and it works the same way in
production but also it’s easier for
other people to reason about as well
because here you know you’re reducing
the domain that people have to
understand in order to use your service
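
A minimal sketch of that idea, using only the Python standard library: a model hidden behind a small HTTP interface, which is the process a container would run. The `predict` function, the `/predict` route and the `name` parameter are all hypothetical placeholders, not the interface used in the talk.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def predict(name):
    """Placeholder model; any real model could be called here instead."""
    return {"input": name, "score": len(name)}


def handle(path):
    """Map a request path to (status, JSON body); kept separate so it is easy to test."""
    url = urlparse(path)
    if url.path != "/predict":
        return 404, json.dumps({"error": "not found"})
    params = parse_qs(url.query)
    if "name" not in params:
        return 400, json.dumps({"error": "missing 'name' parameter"})
    return 200, json.dumps(predict(params["name"][0]))


class ModelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking entry point; a container would run this as its main process."""
    HTTPServer(("0.0.0.0", port), ModelHandler).serve_forever()
```

Callers only need to know the route and the JSON shape, which is the reduction in domain knowledge the talk describes. Calling `serve()` starts the server; it is deliberately not called at import time.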
and that model can be anything it could
be you know just a simple Python model
it can be a Theano derivative TensorFlow
whatever and if you’re into sort of more
streaming technologies you know we
can easily apply streaming technologies
here as well if we just package up
whatever it is in the particular
streaming package that
you’re using like a source or Spark
executor it’s still perfectly reasonable
to do that and that fits really nicely
into the testing triangle because we can
build that container as part of our
delivery pipeline and start testing that
container as opposed to just testing the
code itself so you know it’s all fairly
standard stuff everybody aims for but
it’s amazing how much this doesn’t
happen in real life in data science and
then finally we can simply stuff that
container into production however you
want you know using some sort of
Orchestrator or you know if you’re using
some sort of streaming based system
selecting GPUs and CPUs it’s the
ultimate in flexibility if it works
there if it works on your laptop it
doesn’t matter and just to finally push
home this is a slightly
different domain but I know there’s
a few ThoughtWorks people here today so
I’ve got to be a little bit careful
they’re a great company an amazing
company but their marketing department I
think also needs to be integrated
into production as well because they
sent out this email last week and I
would be really interested in finding
out what ThoughtWorks seismic shits I
find that really fascinating actually I
think this is a genius move by the
marketing department because so many
people were talking about this in the
office and I think that’s done far more
for ThoughtWorks than anything
they could have sent out so well done
that marketing person that made that
okay so now I have a quick demo
demonstrating all of these concepts
I’ve tried to think of a simple example my
example is a whisky shop so my
business requirement is I have a client
which is a whisky shop because I drink
whisky and they
have come to me because they want to
provide a USP in the fact that they can
recommend better whiskies than anybody
else but the problem is they want this
to be able to scale they can’t really
afford to employ whiskey experts in every
single one of their shops so it’s much
more efficient to write an algorithm to
do that for them so their requirements
are they want somebody to pass a
favorite whiskey in and they want
recommendations out they want to start
off with a limited set of whiskies but
want to be able to update their data in
the model in the future this is all
available on my git repository you can
get that it’s all open source and
it’s pretty simple you know the
algorithm I’m using for this it’s
pretty knotty it’s the kind of famous
standard whiskey dataset and just to
cover that a little bit it’s a simple
nearest neighbor algorithm so if you
have two whiskeys if sorry so all
whiskies are characterized by a set of
numbers where the numbers correspond to
a particular feature of that whisky so
the features might be smokiness or
sweetness toffee things like that so
what would happen is that it would
calculate the distance between someone’s
chosen whisky and all of the whiskies
and then we would pick the top five or
ten or whatever recommendations based
upon that so pretty simple but you know
works remarkably effectively but the key
thing here is it’s got a full continuous
delivery pipeline so all of those stages
have all been implemented with you know
unit tests and mock data and real data
and acceptance tests and I’ve used
Jupyter notebook for the initial
analysis and we’re able to insert new
data simply by stuffing it into git and
then watching it flow through the
pipeline so hopefully this is going to
play it is excellent so I’ve made a
video here because as you probably know
you know a lot of this takes a lot of
time so now I’m just messing around with
terraform creating my new infrastructure
for this project and we’re going about
10 times speed at the moment blah blah
blah blah blah you’re probably all
used to seeing this and then the end
result is a working server
in the cloud with some initial software
deployed Oh Deary me can you see the
bottom of that screen oh you can it’s
just this monitor it’s okay so what I’ve
just done there is I’m just fixing bugs
because it didn’t work and finally we’ve
got our algorithm actually working so
this is running out of the container and
when I curl the container then I get my
recommendations back so a simple REST
API testing you know I passed in Macallan
that’s my favorite whiskey and so
I’m gonna get these recommendations here
so first job as a software engineer I’ve
figured out that there’s maybe a little
bug in my code so I’ve you see passed
Macallan there and it’s actually returned
Macallan as one of the recommendations
so that’s a bit pointless so that’s my
first bug I’m gonna go and try and fix
that so now I’m just inside the code and
I’m just going to edit that code I’m
gonna basically ignore that first
value there when I output my
recommendations and we’re going to write
that back then we’re going to push that
to the repository and there we go and
then we’re going to watch our pipeline
so this is quite cool we’ve got a
pipeline here where we’ve got all of the
tests running in parallel if those tests
pass then we go into a registry step
which pushes that file to a
registry and then we’ll talk about the
deploy in a little bit but all of those
stages are just implemented with a simple
YAML script you know but the
beauty is that we’re actually using
realistic data to test this software
which kind of isn’t something that
happens in real life I’ve noticed
you know so what tends to happen is that
you implement it manually and then you
test it manually and then the software
engineers have some sort of dummy data
that they should use in their tests and
they have some expected output but it’s
a very small you know it’s usually mock
data it’s usually not realistic and it’s
certainly not real
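
One way to bring the data itself into the pipeline is to run checks against it as an ordinary test step. The sketch below is an assumption about what such a check could look like: the column names, the score range and the sample rows are all made up for illustration and are not taken from the whisky dataset used in the demo.

```python
import csv
import io

# Made-up rows standing in for the real data file; in a pipeline the same
# checks would read the file committed to the repository.
SAMPLE_CSV = """name,smoky,sweet,toffee
whisky_a,3,1,0
whisky_b,0,4,2
whisky_c,2,2,1
"""

FEATURES = ["smoky", "sweet", "toffee"]  # assumed feature columns
MIN_SCORE, MAX_SCORE = 0, 4              # assumed valid score range


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        name = (row.get("name") or "").strip()
        if not name:
            problems.append(f"row {i}: missing name")
        elif name in seen:
            problems.append(f"row {i}: duplicate name {name!r}")
        seen.add(name)
        for feature in FEATURES:
            value = row.get(feature)
            try:
                score = int(value)
            except (TypeError, ValueError):
                problems.append(f"row {i}: {feature} is not an integer: {value!r}")
                continue
            if not MIN_SCORE <= score <= MAX_SCORE:
                problems.append(f"row {i}: {feature}={score} out of range")
    return problems
```

A test step can then assert that `validate(load_rows(...))` returns an empty list, so a bad data commit fails the build in the same way a bad code commit would.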
and then at the end of the process
the data scientists would come to the
software engineer and manually test his
software to see if it’s okay you know
it’s a hugely manual and a
very poorly managed process if we can
stuff all of that into a pipeline like
we’ve done just here thumbs up okay so
all our tests have passed it’s now being
pushed to the registry and once it gets
pushed to the registry then it will be
deployed to the server for this all I’ve
done is just a real hacky
let’s SSH into the server and just do a
you know docker pull docker run which
isn’t great if I had more time I’d
probably deploy it to Kubernetes or
something but it works
and it demonstrates it quite well so I’m
just going to watch that container on
the server now and in a minute
we’ll see that container there you go so
now it’s just been deleted and it’s
going to be recreated there so that’s
the deployment phase in action so if we
now go back and actually test this new
service and hopefully we should see a
better output I go search for the same
thing again Macallan again and you
can see we’ve removed Macallan from the
first entry there fantastic
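
The recommender and the bug fix from the demo can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the repository: the whisky names and feature scores are made up, Euclidean distance is assumed as the distance measure, and the fix here filters the favourite out by name rather than skipping the first result.

```python
import math

# Made-up feature scores (smoky, sweet, toffee) for hypothetical whiskies.
WHISKIES = {
    "whisky_a": (4, 1, 0),
    "whisky_b": (3, 1, 1),
    "whisky_c": (0, 4, 3),
    "whisky_d": (1, 3, 3),
}


def recommend(favourite, whiskies, k=2):
    """Return the k whiskies nearest to `favourite`, excluding the favourite itself."""
    target = whiskies[favourite]
    distances = [
        (math.dist(target, features), name)
        for name, features in whiskies.items()
        if name != favourite  # the fix: never recommend the query whisky
    ]
    # Sort by distance (ties broken by name) and keep the closest k names.
    return [name for _, name in sorted(distances)[:k]]
```

With this sample data, `recommend("whisky_a", WHISKIES)` returns `["whisky_b", "whisky_d"]`. Filtering by name rather than dropping the first result keeps the behaviour correct even if two whiskies have identical scores.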
and that’s okay and that’s kind of a
traditional software task but it’s
something that a software engineer would
normally do not a data scientist so
again the focus here is to try
and get the data scientist involved in
the software engineering or vice-versa
if now we have a second data scientist
or another engineer that says you know what I
don’t like some of your data I’m going
to change the model so I’m going into
the IPython notebook and I’m looking at
what the previous person has done and
now you’re just going to see me hacking
around trying to get something working
but the idea is here that this is the
process that an engineer would normally
go through when he’s trying to implement
a new model or I think in this case I’m
going to insert some new data lots of errors
lots of errors finally figure out how to
do it yep still not right data’s the wrong
size how do I do it how do I there we
go there’s a few minutes there where I
was on Stack Overflow that’s why the
pause was there there we go it’s worked
so I’ve generated some new data I’ve
pushed that to the repository and now
we’re going through the build pipeline
again so this is the same build pipeline
but with the change of data so we
haven’t changed the model now so it’s
important to have the data almost as
part of the model if you can maybe the
data is too big but if you can it’s
really useful to be in there because you
can catch bugs like this so I think what
we just saw there was some of the tests
failed because the data had got into
such a state that it wasn’t giving out
the output that it should have done
so now this time instead of adding data
I’m going to just remove some data so
I’ve removed the whiskey and we’ve
pushed it I think what I find out now
is that I’ve actually caused some of my
unit tests to fail by removing
one of the whiskies that was in my unit
test so I’m just having to fix that
there you go there was a commit there
saying it’s really working now smiley
face and there we go our tests are
finally passing and once again we’re
going through the registry and we get to
deploy it there it is come on do it do
it do it
it’ll get there eventually and and you
know the result is the finally deployed
model there we go finally with the new
data all without touching the model but
still going through the pipeline to
guarantee that not only our model is
valid but the data makes sense and when
we throw different data and new data at
it it still makes sense so you can
imagine trying to apply this to your own
stuff if you have got
requirements like accuracy
requirements you could make that a hard
and fast rule in your pipeline to fail
when your model accuracy decreases to a
certain point and that’s it so I
think that entire process was probably
about an hour I sped it up into about 10
minutes well that’s probably just due to
my poor software engineering more than
anything if you’d like to take a look
then just go to the link you can just
have a look at the slides or come and
see me or basically search for
Winder Research and you’ll get it
there so with that I’d like to say thank
you very much
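
The accuracy rule mentioned at the end of the demo could be expressed as a small pipeline gate along these lines. This is a sketch only: the function names and the way predictions and labels reach the script are assumptions, and the threshold is whatever the project requires.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def gate(predictions, labels, threshold):
    """Return a process exit code: 0 if accuracy meets the threshold, 1 otherwise."""
    score = accuracy(predictions, labels)
    print(f"accuracy={score:.3f} threshold={threshold:.3f}")
    return 0 if score >= threshold else 1
```

A pipeline step would pass the result to `sys.exit`, so that a non-zero code fails the build whenever a change to the model or the data pushes accuracy below the agreed level.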