Press "Enter" to skip to content

GOTO 2015 • Scalable Data Science and Deep Learning with H2O • Arno Candel


So yes, I spent a lot of years in physics, in high-performance computing for particle physics on the largest supercomputers of the world at SLAC, working together with CERN. That was my background, and then I switched into machine learning startups; I've been doing this for the last three and a half years or so. Last year I got nominated and called a Big Data All-Star by Fortune magazine, so that was a nice surprise. You can follow me at @ArnoCandel, and if anybody would be willing to take a picture and tweet it to me, that would be great. Thanks so much.
So today we're going to introduce H2O and then talk about deep learning in a little more detail, and then there will be a lot of live demos, as much as time allows. I will go through all these different things: we'll look at different datasets and different APIs, and I'll make sure that you get a good impression of what H2O can do for you and what it looks like, so that you definitely get an idea of what we can do here.
H2O is an in-memory machine learning platform. It's written in Java, it's open source, and it distributes across your cluster; it sends the code around, not the data, so your data can stay on the cluster. Say you have a large dataset and you want to build models on the entire dataset: you don't want to downsample and lose accuracy that way, but usually the problem is that the tools don't allow you to scale to these big datasets, especially for building machine learning models. We're not just talking about summing up stuff or computing aggregates; we're talking about sophisticated models like gradient boosting machines or neural networks. H2O allows you to do this, so you get the scalability and the accuracy from the big dataset, at scale. As I mentioned earlier, we have a lot of APIs that you'll get to see today, and we also have a scoring engine, which is kind of a key point of the product. We're about 35 people right now. We had our first H2O World conference last year in the fall, and it was
a huge success. Sri Satish Ambati is our CEO; he has a great mindset, and culture is everything to him, so he likes to do meetups every week, even twice a week, to get feedback from customers and so on. So we are very much community driven, even though we write most of the code at this point. You can see the growth here: machine learning is really trending, and we think it's the next SQL, and prediction is the next search. There's not just predictive analytics, there's also prescriptive analytics, where you're not just trying to say what's going to happen tomorrow, you're telling the customers what to do so that they can affect tomorrow. So you can see the growth here; lots and lots of companies
are now using H2O. And why is that? Well, because it's a distributed system built by experts in-house. We have Cliff Click, our CTO; he basically wrote large parts of the Java JIT compiler, so in every cell phone of yours there are parts of his code that are executed all the time. He architected the whole framework: it's a distributed in-memory key-value store based on a non-blocking hash map, and it has a MapReduce paradigm built in, our own MapReduce, which is fine-grained and makes sure that all the threads are working at all times when you're processing your data, and of course all the nodes are working in parallel, as you'll see later. We also compress the data, similar to the Parquet data format, so you really store only the data you need, and it's much cheaper to decompress on the fly in the registers of the CPU than to send the numbers across the wire.
Once you have this framework in place, you can write algorithms that use this MapReduce paradigm, and you can also do less than a full algorithm: you can just compute aggregates, for example, which is like a mini-algorithm if you want. So you can do all these things, and in the end you end up with a model that makes a prediction about the future, as you'd expect from machine learning, and that code can then be exported; I'll show you that in a minute. And of course we can suck in data from pretty much anywhere, and you can talk to it from R or Python, or via JSON from a web browser. I routinely check the status of my jobs from my cell phone, for example.
There's a bunch of customers using us right now; these are the ones that are referenceable at this point. There are a lot more that we can't talk about at the moment, but you'll hear about them soon. They're basically doing big data: hundreds of gigabytes, dozens of nodes, and they're processing data all the time. They have faster turnaround times, and they're saving millions by deploying these models, such as this fraud detection model, which has saved PayPal millions in fraud.
So it's very easy to download: you just go to h2o.ai and you can find the download button. You download it, and once it's downloaded you unzip that file, you go in there and type java -jar, and that's it, H2O will be running on your system. There are no dependencies, it's just one single file that you need, and you're basically running. You can do the same thing on a cluster: you copy the file everywhere and you launch it. That would be a bare-bones installation. If you don't want to do bare-bones, you can do Hadoop, you can do YARN, Spark, and you can launch it from R and from Python as well.
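(For reference, launching or attaching to H2O from Python looks roughly like this; this is a minimal sketch assuming the h2o Python package is installed, and the exact arguments may differ between versions.)

```python
# Minimal sketch: start (or connect to) a local H2O instance from Python.
import h2o

# Starts a single-node H2O in the background, or connects to one already
# running on localhost:54321, the default port mentioned later in the talk.
h2o.init(max_mem_size="8G")   # heap size is just an example value
```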
So let's do a quick demo here; this is GLM. I'm going to a cluster here; this cluster has my name on it, we got a dedicated cluster for this demo. This cluster is an eight-node cluster on EC2; it has, I think, 30 gigabytes of heap per machine, yep, and basically it's just there waiting for me to tell it what to do. One thing I did earlier is parse this airlines dataset, and I'm going to do this again. The airlines dataset has all the flights from 2007 all the way back to 1987, and it's parsing this right now. Let's go look at the CPU usage here: you can see that all the nodes are active right now, sucking in the data, parsing it, tokenizing it, compressing it into these reduced representations, which are lossless of course. So when the numbers in a column fit into one byte, you make a one-byte column; once you see that the numbers have more dynamic range than just one byte, you take two bytes, and so on; you basically just store what you need.
OK, so now it parsed this file in 35 seconds. Let's go look at the file: there's a frame summary that I'm requesting from the server, and the server now returns it and says, here, 160 million rows. Can you see this? There are 160 million rows, 30 columns, about 4 gigabytes of compressed space. You see all these different columns here; they have a summary, a cardinality; some of them are categorical, so in effect there are about 700 predictors in this dataset. And we're trying to predict whether a plane is delayed or not based on things like its departure time, origin and destination airport, and so on. So if I want to do this, I'll just click here, Build Model, and I'll say generalized linear model, that's one that is fast. The training frame is chosen here, and I will now choose some columns to use. I'll first ignore all of them, because there are a lot of columns I don't want to use, and then I'll add year, month, day of the month, day of the week; let's see, we want to know the departure time, maybe the carrier, not the flight number, that doesn't mean much, maybe the origin and destination. And then all we really care about is whether it's delayed or not, so that will be my response; everything else you don't need, because it would give away the answer. So departure delay is what I'm going to try to predict, and it's a binomial problem, so yes or no is the answer. Now I just have to press Go, and it's building this model as we speak, and I can go to the water meter to see the CPU usage, and you can see that all the nodes are busy computing this model right now.
In a few seconds it will be done; you see the objective value doesn't change anymore. Yep, so it's done in 19 seconds, and I can look at the model. I can see that we have an AUC of about 0.65, so it's a little more than 0.5, it's not just random. We have variable importances here; we can see that certain airlines, like Eastern Airlines, have a negative correlation with the response, which means that if you take this carrier you're rarely going to be delayed; that's because it didn't have a schedule, it was always on time by definition, for example. So this is one bit of insight that comes out of this model. Another thing is that Chicago and Atlanta are often delayed when your journey starts there, as you know; or for example San Francisco: if you want to fly to San Francisco, there are a lot of people who want to do that, so that's why it's also often delayed. As I mentioned earlier, the accuracy here flatlined after the first few iterations, so the model could have been done even faster. If you look at the metrics here, for example, you can see that there's a mean squared error reported, an R-squared value reported, all this data science stuff, an AUC value of 0.65, and so on. And there's even a POJO that we can look at; you know what a POJO is, a plain old Java object. It's basically Java code, the scoring code that you can take into production, that actually scores your flights in real time. You could say, OK, if you're this airline and you're at this time of day, then you're going to have this probability of being delayed or not. And this is the optimal threshold computed from the ROC curve, the curve you saw earlier, that tells you where best to pick your operating regime to say delayed or not, based on the false positives and true positives and so on that you're balancing. So to summarize: the data science is all baked in for you, you get the answer right away. And this was on 160 million rows, and we just did this live.
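(For reference, roughly the same GLM exercise could be scripted from Python instead of Flow. This is a hedged sketch, not the exact demo; the file path is a placeholder and the column names are the usual ones from the airlines dataset, which I'm assuming here.)

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# Import the airlines data (placeholder path).
airlines = h2o.import_file("airlines_all.csv")

# Assumed standard airlines column names.
predictors = ["Year", "Month", "DayofMonth", "DayOfWeek",
              "DepTime", "UniqueCarrier", "Origin", "Dest"]
response = "IsDepDelayed"   # yes/no, so a binomial problem

glm = H2OGeneralizedLinearEstimator(family="binomial")
glm.train(x=predictors, y=response, training_frame=airlines)

print(glm.auc())    # area under the ROC curve
print(glm.coef())   # coefficients, the GLM analogue of variable importances
```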
So you saw the POJO scoring code. There are more models that you can build in the Flow user interface, the GUI that you saw earlier. There's a Help button on the right side here; to bring this back up, there's Help, I go down, and I can see Packs here, so there's a bunch of example packs that come with it. If I click on this here, I'll do this on my laptop now, I'll show you how to run this on a laptop. I just downloaded the package from the website, and it only contains two files: one is an R package and one is the actual Java jar file. I'm going to start this on my laptop, and I'm going to check the browser at localhost, port 54321, that's our default port. Now I'm connected to this Java JVM that I just launched, and I can ask it; this is a little too big, let's make it smaller, here we go. I can look at the cluster status: yep, it's a one-node cluster, I gave it 8 gigs of heap, you can see that, and it's all ready to go. So now I'm going to launch this flow from the example pack, this million songs flow; I'm going to load that notebook, and you can see this is the million song binary classification demo.
We basically have a dataset with 500,000 observations and 90 numerical columns, and we're going to split that and store the splits; well, that's done, those files are already there for you. So now we just have to parse them in here, and I put them on my laptop already, so I can just import them into the H2O cluster. I'll take the non-zipped version because that's faster; this file is a few hundred megabytes and it's done in three seconds. This one here is the test set, and I'm also going to parse it. You can see that you can even specify the column types: if you wanted to turn a number into an enum for classification, you can do this here explicitly, if you're not happy with the default behavior of the parser. But the parser is very robust and can usually handle that, so if you have missing values, if you have all kinds of categoricals, ugly strings, stuff that's wrong, we'll handle it. It's very robust, it's really made for enterprise-grade datasets; it'll go through your dirty data and just spit something out that's usually pretty good.
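(The same kind of explicit type override is available from the Python API at import time. A small sketch; the col_types argument is part of the h2o import call as far as I know, but the path and the column name "C1" are placeholders, not the demo's actual values.)

```python
import h2o
h2o.init()

# Force one column to be treated as a categorical (enum), so the problem
# becomes classification instead of regression.
songs = h2o.import_file(
    "milsongs-cls-train.csv",     # placeholder path
    col_types={"C1": "enum"}      # override the parser's default type guess
)
print(songs["C1"].levels())       # inspect the categorical levels
```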
OK, so now we have these datasets; let's see what else we have here. Let me go back out here to give you a view: you can click on Outline on the right and you can see all these cells that I pre-populated. One of them says build a random forest, one says build a gradient boosting machine, one says build a linear model, a logistic regression, and one says build a deep learning model. And I can just say, OK, fine, let's build one; let's go down to the GBM cell and execute this cell. Now it's building a gradient boosting machine on this dataset, you can see the progress bar here, and while it's building I can say, hey, how do you look right now, let me see how you're doing. So right now it's already giving me two scoring history points where the error went down; there's already an ROC curve with an AUC of something like 0.7, I would hope, yes, 0.7 AUC already, in just seconds. That's pretty good for this dataset. If I do it again, it's already down here; the error just keeps going down, and you can keep looking at that model, the feature importances for which variables matter the most, all in real time. I can also look at the POJO again; this time it's a tree model, not a logistic regression model, so you would expect some decisions in this tree structure. If I go down, there are all these classes, it's all Java code; I think the tree should be somewhere, let me see, I might have to refresh this model. Oh, here we go, so these are all the forests here; you see there are a lot of forests being scored, and now we just have to find this function somewhere down there, and here it is. So here you can see that this is decision tree logic: if your data is less than 4,000 in this column, and less than this, and less than that, then in the end your prediction will be so-and-so much, otherwise it will be this other number. So basically this is the scoring code of this model that you can put right into production, in Storm or any other API that you want to use, your own basically; it's just Java code without any dependencies. And you can build the same thing with deep learning: you can build a deep learning model on the same dataset at the same time that the other one is building, and you can build a random forest model here also at the same time, or a GLM, and this is all on my laptop right now. So I'm building different models at the same time, and I can ask, hey, what's the status of them; I can just go to the right here in the outline and click on, give me my deep learning model. Oh, it's already done. Let's see how well we're doing here: also a good AUC, and feature importances, and the scoring history, and the metrics. You can even get a list of optimal metrics, like what's the best precision I can get, what's the best accuracy I can get, and at what threshold. So this is all geared towards the data scientist understanding what's happening.
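(The same models can be kicked off programmatically. Here is a rough Python sketch of building the GBM and deep learning models side by side and comparing their validation AUC; the paths, the label column name, and the parameter values are placeholders, not the exact demo script.)

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("milsongs-cls-train.csv")   # placeholder paths
test  = h2o.import_file("milsongs-cls-test.csv")

y, x = "C1", train.columns[1:]     # assumed layout: label first, then 90 numeric features
train[y] = train[y].asfactor()     # make it a binary classification problem
test[y]  = test[y].asfactor()

gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=x, y=y, training_frame=train, validation_frame=test)

dl = H2ODeepLearningEstimator(hidden=[200, 200])
dl.train(x=x, y=y, training_frame=train, validation_frame=test)

for model in (gbm, dl):
    print(model.model_id, model.auc(valid=True))     # validation AUC for each model
```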
All right, so while my laptop is churning out some more models, we can continue here and talk about deep learning in more detail. Deep learning, as you all know, is basically just connected neurons, and it's similar to logistic regression except that there are more multiplications going on. You take your feature times the weight, you get a number, and then you add it all up; you do this for all these connections, where each connection is a product of the weight times the input, which gives you some output, and then you apply a nonlinear function like a tanh, something like a smoothed step function. You do this again and again and again, and at the end you have a hierarchy of nonlinear transformations, which leads to very complex nonlinearities in your model. So you can describe really weird stuff that you would otherwise not be able to describe with, say, a linear model, or a simple random forest that doesn't go as deep and can't make up all these nonlinearities between all these features. So this is basically the machinery you need for nonlinearities in your dataset.
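(To make the "weights times inputs, sum, nonlinearity, repeat" idea concrete, here is a tiny NumPy sketch of a forward pass through a network with two tanh hidden layers. This is purely an illustration of the arithmetic, not H2O's internal code; the layer sizes are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases):
    """Each layer: weighted sum plus bias, then a tanh nonlinearity.
    The final layer uses a sigmoid to turn the result into a probability."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)
    logits = a @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))

# 8 input features -> 20 -> 20 -> 1 output
sizes = [8, 20, 20, 1]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(1, 8))          # one example with 8 features
print(forward(x, weights, biases))   # predicted probability
```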
And we do this in a distributed way, again using MapReduce, so we're doing this on all the threads, as you saw earlier for GLM where everything was green. Deep learning is also green, it's known to be green; I usually burn up the whole cluster when I'm running my models and everybody else has to step back. Well, of course there's the Linux scheduler that takes care of that, but still, some claim it's not necessarily fair if I'm running some big model, so I haven't done that lately, and that's why I'm using these EC2 clusters now, or maybe my laptop from time to time. Anyway, you can see here we have a lot of little details built in: it works automatically on categorical data, it automatically standardizes your data, you don't need to worry about that, it automatically imputes missing values, it automatically does regularization for you if you specify the option, and it does checkpointing, load balancing, everything. You just need to say go, and that's it, so it should be super easy for anyone to just run it.
If you want to know how it works in detail, the architecture here basically first distributes the dataset onto the whole cluster. Let's say you have a terabyte of data and 10 nodes: every node will get 100 gigabytes of different data. Then you say, OK, I'll make an initial deep learning model, which is a bunch of weight and bias values, all just numbers, and I'll put that into some place in the store, and then I spread that to all the nodes; all my 10 nodes get a copy of the same model. Then I say: train on your local data. So all the models get trained on their local data with multi-threading, and there are some race conditions here that make this not reproducible. But in the end you will have n models on your cluster, in this case, as I just mentioned, 10; you will have 10 such models that have each been trained on a part of those hundred gigabytes. You don't have to process all of the hundred gigabytes, you can just sample some of it. Then, when you're done with that, you reduce: the models basically automatically get averaged back into one model, and that one model is the one that you look at from your browser, from R, from Python. Then you do this again, and every pass goes through a fraction of the data, or all of the data, or more than all of your data; you can just keep iterating without communicating, you can tell each node to just run for six weeks and then communicate, but by default it's done in a way that you spend about two percent of your time communicating on the cluster and ninety-eight percent computing. This is all automatically done, so you don't need to worry about anything; you just say go, and it'll basically process the data in parallel and make a good model.
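(Conceptually, the loop is: each node trains a copy of the model on a sample of its local shard, and the copies get averaged back into one global model between passes. Here is a toy NumPy sketch of that averaging scheme, purely illustrative and not H2O's actual implementation; the "training" step is a stand-in.)

```python
import numpy as np

def train_locally(weights, local_data, lr=0.01):
    """Stand-in for multi-threaded SGD on one node's shard of the data.
    The real thing runs many stochastic gradient descent updates."""
    return weights + lr * local_data.mean(axis=0)

n_nodes = 10
weights = np.zeros(5)                                          # initial global model
shards = [np.random.randn(1000, 5) for _ in range(n_nodes)]    # one data shard per node

for iteration in range(20):
    # Map: every node gets a copy of the current model and trains on its local data.
    local_models = [train_locally(weights.copy(), shard) for shard in shards]
    # Reduce: average the node-local models back into one global model.
    weights = np.mean(local_models, axis=0)
```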
This averaging-of-models scheme works, and there's a paper about it, but I'm also working on a new scheme called consensus ADMM, where you basically have a penalty for how far you drift from the average, but you keep your local model. That keeps everybody kind of going on their own path in optimization land without averaging all the time; you just know when you're drifting too far, so you get pulled back a little, but you still have your own model. So that is going to be a promising upgrade soon that you can look forward to; already, as it is, it works fairly well. So this is MNIST: the digits 0 to 9, handwritten digits, 784 grayscale pixels, and you need to figure out which digit it is from the grayscale pixel values. With a couple of lines here in R you can get what is actually the world record: no one has published a better number on this without using convolutional layers or other distortions; this is purely on the 60,000 training samples, no distortions, no convolutions. You can see here all the other implementations, Geoff Hinton's and Microsoft's; 0.83 percent is the world record. Of course you could say the last digit is not quite statistically significant, because you only have ten thousand test set points, but still, it's good to get down there.
So now let's do a little demo here: this is anomaly detection. I'll show you how we can detect the ugly digits in this MNIST dataset on my laptop in a few seconds. I just have this instance up and running here from before, so I'm going to go into R. In R I have this R unit test; this runs every day, every time we commit something these tests are being run, so you can definitely check those out from our GitHub page right now if you want. This one says: build an autoencoder model, which is learning what's normal. So it connects to my cluster right now, and it learns what's normal, what a normal digit is, without knowing what the digits are; it just says, look at all the data and learn what's normal. And how does it do that? Well, it takes the 784 pixels, it compresses them into, in this case, 50 neurons, 50 numbers, and then tries to make them back into 784. So it's learning the identity function of this dataset in a compressed way: if you can somehow represent the data with these 50 numbers, and you know the weights connecting in and out, then these 50 numbers mean something; that's what it takes to represent those 10 digits. Let's say that's roughly five numbers per digit, and those five numbers are enough to say there's an edge here, a round thing here, a hole here, something like that, like the features. With these 50 numbers in the middle, and of course the connectivity that makes up the reconstruction, basically the encoding and the decoding, you can now say what's normal or not. Because now I'll take the test set, I let it go through this network, and I see what comes out on the other side; if it doesn't look like the original input, then it didn't match my vision of what it should look like. So I'm going to let the test set go through this model, but first I need to train the model. Right now it's building this model on my laptop: 50 hidden neurons, a tanh activation function, and autoencoder set to true, and I had a couple of extra options, but that's just to say don't drop any of the constant columns that are all zero, because I want to plot them at the end. OK, so now let's look at the outlierness of every point: we just scored the test set and computed the reconstruction error, so how different is the outcome from the income, how bad is the identity mapping that I learned for the test set points. For those points that are kind of ugly, they won't match what's normal in the training data; that's an intuitive thing.
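(The R unit test driving this demo essentially does the following; a Python equivalent would look roughly like this. It's a sketch: the MNIST paths are placeholders, and I'm assuming the label is the last column.)

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("mnist_train.csv")   # placeholder paths
test  = h2o.import_file("mnist_test.csv")
pixels = train.columns[:-1]                  # 784 grayscale pixel columns (label assumed last)

# Unsupervised autoencoder: compress 784 pixels down to 50 numbers and back.
ae = H2ODeepLearningEstimator(
    autoencoder=True,
    hidden=[50],
    activation="Tanh",
    ignore_const_cols=False,   # keep constant pixels so they can be plotted later
)
ae.train(x=pixels, training_frame=train)

# Per-row reconstruction error on the test set: high error = outlier digit.
errors = ae.anomaly(test)
print(errors.as_data_frame().describe())
```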
All right, so now let's plot the ones that match the best, the top 25; that's the reconstruction. And now let's look at the actual ones: well, same thing, they match the best, so they have to look the same. These are the ones that are the easiest to learn, to represent in your identity function; just take the middle ones and keep them, basically. Now let's look at the ones in the middle out of 10,000, the ones with the median reconstruction error. These are still reasonably good; you can tell that they're digits, but they're already not quite as pretty anymore. And now let's look at the ugliest ones, the outliers so to speak, in the test set. These are all digits that are coming out of my network, but they're not really digits anymore; something went wrong, basically, the reconstruction failed, the model said these are ugly. And if you look at them, they are kind of ugly; some of them are almost not digits anymore, cut off, or the top right one for example is ugly, like the bottom line in an optics test, the vision exam. Let's go look at my slides: totally different. So every time I run it, it's different, because it's neural nets with multithreading. I can turn it on to be reproducible, but then I have to say use one thread, don't do any of this Hogwild race-condition updating of the weight matrix by multiple threads at the same time, just run one thread right through, and give it a seed, and then wait until that one thread is done, and then it will be reproducible. In this case I chose not to do that, because it's faster this way and the results are fine anyway: every time you run it you'll get something like this, you will not get the ugly digits coming out as the good ones. So this shows you basically that this is a robust thing.
And again, here, this is the network topology. I can also go back to the browser now, go to localhost, and clean up everything; by the way, this just ran all the models, so if I say get models, I should see all the models that were built. The last four are the models that were built on the million song dataset earlier, and the top one is the one I built from R, the autoencoder. You can see the autoencoder reconstruction error started at 0.08 mean squared error and now it's at 0.02, so it got it down, it improved from random noise. For autoencoders you always want to check this convergence; it has to learn something, namely the identity mapping. You can also see here the status of the neuron layers, the thing I showed you earlier, and of course you can also get a POJO again. In this case it's a neural net, so you would expect some weights here; what is this, oh, those are the neurons, here we go. I would expect the model to show up somewhere; see, there are a lot of declarations you have to make for all these 784 features, so if this is too big for the preview, then we have to look at the other model we have. Let's go back to get models and click on the other deep learning model we made earlier, on the million song dataset, and look at its POJO; that should be smaller, because there were only 90 predictors. OK, here we go: now you should see the deep learning math actually printed out in plain text. You can always check here: activation, something with numericals, something with categoricals if you had any, in this case there are none, and then it sets the activations and biases and does this matrix-vector multiplication. So "ax + y": this is the matrix-vector multiplication that's inside the deep learning model, and you can see here we do some partial-sum tricks to be faster, to basically allow the CPU to do more additions and multiplications at the same time. All of this is optimized for speed, and this is as fast as any C++ implementation or anything, because we don't really have GC issues here; all the arrays are allocated one time and then just processed.
All right, so now let's get back to the bigger problems: deep learning and the Higgs boson. Who has seen this dataset before? OK, great. So this is physics: a 13-billion-dollar project, the biggest scientific experiment ever. This dataset has 10 million rows, which are detector events; each detector event has 21 numbers coming out saying, this is what I measured for certain things, and then the physicists come up with seven more numbers that they compute from those 21, something like the square root of this squared minus that squared, and those formulas, or formulae, actually help. You can see this down there: if you take just the low-level numbers, this is the AUC you get, where 0.5 is random and 1 would be perfect, and it goes up by something like 10 points if you add those extra features, so it's very valuable to have physicists around to tell you what to do. But CERN basically had this baseline here of 0.81; that was how good it was working for them, using gradient boosted trees and neural networks with one hidden layer, so their baseline was 0.81 AUC. Then this paper came along last summer saying, we can do better than that with deep learning, and they published some numbers, and now we're going to run the same thing and see what we can do. So I'm going back to my cluster, my EC2 eight-node cluster, and I'll say get frames,
and I have the Higgs dataset there already, because I parsed it earlier. You can see here: 11 million rows and 29 columns, 2 gigabytes compressed; there's not much to compress because it's all doubles. Now I'm going to run a deep learning model. I already saved the flow for that, and this flow says: take the split dataset, which I split into ninety percent and five and five percent, so ten million rows and half a million each, take the training data and the validation data, and tell me how you're doing along the way. So, go: it builds a three-layer network and uses a rectifier activation, everything else is default, and now it's running.
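(The flow I just loaded corresponds to something like this in Python: split the Higgs frame, then train a three-hidden-layer rectifier network with a validation frame so you can watch the AUC as it trains. A hedged sketch; the path, the response column name, and the layer sizes are placeholders.)

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
higgs = h2o.import_file("higgs.csv")               # 11M rows, 29 columns (placeholder path)
higgs["response"] = higgs["response"].asfactor()   # response column name assumed

train, valid, test = higgs.split_frame(ratios=[0.9, 0.05], seed=42)

dl = H2ODeepLearningEstimator(
    hidden=[100, 100, 100],       # three layers; the live demo used a small network
    activation="Rectifier",
    epochs=10,
)
dl.train(x=higgs.columns[1:], y="response",
         training_frame=train, validation_frame=valid)

print(dl.auc(valid=True))         # compare against the 0.81 baseline
```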
Let's go look at the water meter. OK, here we go: deep learning is taking over the cluster, and now it's communicating, and now it's sending that back out and then computing again. This might be the initial phase where it first rebalances the dataset or something; usually you'll see it go up, down, up, down, so let's wait until the next communication. But you'll see that all the CPUs are busy updating weights with stochastic gradient descent, which means it takes a point, it trains, goes through the network, makes a prediction, says how wrong it is, and corrects the weights; all the weights that are affected get fixed, basically, by every single point. There's no mini-batch or anything, every single point updates the whole model, and that's done by all the threads in parallel. So you'll have eight threads in parallel changing those weights: I read, you write, I read, you write, whatever, we just compete, but usually we write different weights; there are millions of weights, so you don't overwrite each other too often, even if someone else is reading at the same time. So you can see here it's mostly busy computing. If you want to know what exactly it's doing, you can also click on the profiler here, and it will show you a stack trace, sorted by count, of what's happening. So this was basically just communicating; let's do this again, now it's going to be slightly different. Oh, I see, so now it was saying these are basically idle, because we have eight nodes and there are seven others, with one thread for read and one for write, so we've got 14 threads actively listening for communication here, and these are in the back-propagation, some of them are in the forward propagation. So you can see exactly what is going on at any moment in time for every node; you can go to a different node and you can see the same behavior, so they're all just busy computing.
So while this model is building, we can ask how well it's doing; remember that 0.81 baseline with the human features. Let's see what we have here on the validation dataset: it's already at 0.79. This already beats all the random forest, gradient boosting, and neural net methods that they had at CERN for many years; those models there on the left that had 0.75, 0.76 are already beaten by this deep learning model we just ran, and this wasn't even a good model, it was just small, like a hundred neurons per layer. So this is very powerful, and by the time we finish it will actually get up to over 0.87 AUC. The paper reported 0.88, and they trained for weeks on a GPU, and of course they had only this dataset and nothing else to worry about, and this is still a small dataset. But you can see the power of deep learning, especially if you feed it more data and give it more neurons; it'll train and learn everything. It's like a brain that's trying to learn, like a baby's brain, it's just sucking up all the information. And after 40 minutes you'll get a 0.84 AUC, which is pretty impressive: it beats all the other baseline methods, even the ones with the human features, and this is without using the human features. You don't need to know anything, you just take the sensor data out of your machine and say go.
All right, another use case is deep learning used for crime detection, and this is actually Chicago; who can recognize this pattern? My colleagues Alex and Michal wrote an article that you can read here, on Datanami, just a few days ago. They're using Spark and H2O together to take three different datasets and turn them into something you can use to predict whether a crime is going to lead to an arrest or not. So you take the crime dataset, you take the census dataset, to know something about the socioeconomic factors, and you take the weather, because the weather might have an impact on what's happening, and you put them all together in Spark. First you parse them in H2O, because we know that the parser works and it's fine, so in our demo we just suck it all into H2O, we send it over to Spark in the same JVM, and then we do an SQL join, and once we're done we split it again in H2O and then we build a deep learning model and, for example, a GBM model; I think these two are being built by the demo script that's available. So again, H2O's and Spark's memory is shared, it's the same JVM; there's no Tachyon layer or anything, they're basically able to transparently go from one side to the other.
And the product, of course, is called Sparkling Water, which was a brilliant idea, I think. All right, so this is the place on GitHub where you would find this example. You would download Sparkling Water from our download page, then you would go into that directory, set two environment variables pointing to Spark and saying how many nodes you want, and then you would start the sparkling shell and copy-paste this code into it, for example, if you want to do it interactively. You can see here there are a couple of imports: you import deep learning and GBM and some Spark stuff, and then you basically connect to the H2O cluster. We parse the datasets this way; this is just a function definition that gets used by these other functions that actually do the work to load the data. Then you can drop some columns and do some simple munging; in this case we do some date manipulations to standardize the three datasets to have the same date format, so that we can join on it later. You basically just take these three datasets, which are just small ones for a demo, but in reality you would of course use the whole dataset on a cluster. Once you have these three datasets in memory as H2O objects, we convert them to Spark data structures with this call here, and now they become Spark RDDs, on which you can just call something like a select statement in SQL, and then a join and another join and all that. It's very nice; this is a well-understood API that people can use, and H2O does not have this at this point, but we're working on it, so at some point we'll have more munging capabilities, but for now you can definitely benefit from the whole Spark ecosystem to do what it's good for. So here in this case we say, here's the joined crime-weather dataset, and we split it; I think we bring it back into H2O, yes, this is an H2O helper function to split. Now we basically have a joined dataset that knows all about the socioeconomic factors and the weather for a given time at a given place, and then we can build a deep learning model, just like you would do in Java; Scala is very similar, you don't need to do much porting, it's just the same members that you're setting, and then you say train the model, and at the end you have a model available that you can use to make predictions, and it's very simple.
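(The demo is in Scala, but the same pattern is available from Python through PySparkling. Here is a rough sketch of the Spark-side joins plus the hand-off back to H2O. Treat the H2OContext method names and signatures, the file paths, and the column names as assumptions; they have shifted between Sparkling Water versions and this is not the demo's actual code.)

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext            # ships with Sparkling Water
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

spark = SparkSession.builder.appName("chicago-crime").getOrCreate()
hc = H2OContext.getOrCreate()                 # shares the JVMs between Spark and H2O (signature assumed)

# Parse in H2O (robust parser), then hand the frames to Spark as DataFrames.
crime   = hc.asSparkFrame(h2o.import_file("chicagoCrimes.csv"))     # placeholder paths
census  = hc.asSparkFrame(h2o.import_file("chicagoCensus.csv"))
weather = hc.asSparkFrame(h2o.import_file("chicagoAllWeather.csv"))

# Spark does what it's good at: the joins (join keys assumed).
joined = (crime.join(census, on="Community_Area", how="left")
               .join(weather, on=["year", "month", "day"], how="left"))

# Back into H2O for model building.
frame = hc.asH2OFrame(joined)
train, test = frame.split_frame(ratios=[0.8])
dl = H2ODeepLearningEstimator()
dl.train(y="Arrest", training_frame=train, validation_frame=test)   # response name assumed
```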
You can definitely follow the tutorials; in the interest of time I'll just show you the sparkling shell start here. I'm basically able to do this on my laptop as well, while the other one is still running. Here you see Spark being launched, and now it's scheduling those three worker nodes to come up; once it's ready I can copy-paste some code in there, and the code I would get from the website here, the Chicago Crime demo, it's all on GitHub. So in the Sparkling Water GitHub project, under examples, there are some scripts, and I can just take this stuff here and copy-paste it all, oops; I'm sure you believe me that this is all doable. So here Spark is now ready, I just copy-paste this in, and there it goes. So that's how easy it is to do Spark and H2O together. And then also, once you have something in memory in the H2O cluster, the model for example, or some datasets, you can just ask Flow to visualize it: you can just type this JavaScript, or CoffeeScript rather, expression and plot anything you want against anything, and you'll see these interactive plots that you can mouse over and it will show you what it is, and so on, so it's very cool. You can plot, for example, the arrest rate versus the relative occurrence of a crime type. For example, gambling always leads to an arrest, and why is that? Well, because otherwise you wouldn't know that the gambling person was cheating or something, so you basically have to arrest them, otherwise you don't know what's happening; some of these things go undetected. But theft, for example, doesn't always lead to an arrest, because someone can know that something was stolen without the person actually being caught. So you have to be careful about all this data science stuff, but basically you can plot whatever you want against whatever you want, and that's pretty powerful.
And we have data.table in-house now: Matt Dowle joined us recently, he wrote data.table, the fastest data-processing engine in R, and it's used by financial institutions that like to do aggregates a lot. So just like what you saw on the previous slide, we'll soon have all of this in H2O in a scalable way, so that we can do fast joins, aggregates, and so on.
The same thing of course goes for Python: you have IPython notebooks, and there's an example for the Citi Bike company in New York City, where you want to know how many bikes you need at each station so that you don't run out of bikes. Let's say you have 10 million rows of historical data and you have some weather data; you can imagine joining those two, and then basically, based on location, time, and weather, you can predict how many bikes you'll need. So if I know today, or tomorrow, is going to have that weather, I know I need 250 bikes at that station, or something. Cliff, our CTO, who wrote the JIT, basically also wrote this data science example here, so you can see there's a group-by and so on, all from IPython notebooks. And to show you that this is also possible live, I'll do it here: I type "ipython notebook citibike small", and up pops my browser with the IPython notebook. I will delete all the output cells so we don't cheat, and I say go, and now it's connecting to the cluster that I started 30 minutes ago, which means I still have a little bit of time left. I will load some data here, up we go, and then let's look at the data, describe it: you can see here some mean, max, and so on, and this is like a distribution of the chunks of the frame, how many rows are on each machine, in this case there's only one machine; basically some statistics that tell you how the data is distributed across the cluster, what kinds of columns I have, what their mean and max are, and so on, all available from Python. Then you can do a group-by; you don't need to know all of that, but basically you want to know, at what time of the day, or on what day, how many bikes are at each station, and so on. You can see that there's a big distribution here: some places only need 9 bikes, others need a hundred bikes or even more, and so on.
And you can do quantiles: you see the quantiles here from one percent all the way to ninety-nine percent, and you see that there are some pretty big numbers here. You can make new features, say whether it's a weekend or not, and you can build models, so this is the fun part: we build a GBM, we build a random forest, we build a GLM, and we build a deep learning model, all on the same data that was joined earlier. So now let's do this, go; now it's building a GBM, all on my laptop, so if I went to my laptop right now I could say get models and these models would just magically pop up. And this is deep learning, and now we can see how well they're doing, and you get the idea: now we get a 0.92 AUC from deep learning and a 0.93 AUC from GBM, but deep learning even took a little less time than GBM, so you could say that both are very powerful methods; they beat the random forests and the linear models here. But of course nothing beats the linear model in terms of time: 0.1 seconds to get a 0.81, and you see it's pretty remarkable, it's 50 times faster than a random forest. All right, so you believe me that IPython works as well. Once you join the weather data with a simple merge command, here in the middle somewhere, you get a little lift, because then you can predict even better whether you need bikes or not, based on the weather. That makes sense: if it rains, you might need fewer bikes.
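(In the notebook the munging is done with H2OFrame operations; here is a condensed sketch of the group-by, quantile, and weather-merge steps. The file paths, column names, and response are assumptions about the CitiBike schema, not the notebook's exact code.)

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
bikes   = h2o.import_file("citibike_2013.csv")     # placeholder paths
weather = h2o.import_file("weather_2013.csv")

# Rides per station per day (column names assumed).
per_station_day = (bikes
                   .group_by(by=["start station name", "Days"])
                   .count()
                   .get_frame())

# Distribution of daily demand, from the 1% to the 99% quantile.
print(per_station_day["nrow"].quantile(prob=[0.01, 0.25, 0.5, 0.75, 0.99]))

# Merge the weather on the shared date column(s), then model demand with a GBM.
joined = per_station_day.merge(weather, all_x=True)
train, valid = joined.split_frame(ratios=[0.8])
gbm = H2OGradientBoostingEstimator()
gbm.train(y="nrow", training_frame=train, validation_frame=valid)   # predict daily ride count
```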
For anything you might wonder what to do with, GBMs, linear models, deep learning, there are booklets for that, and we're currently rewriting them for the new version of H2O, which will have slightly updated APIs and such, for consistency across R, Python, Scala, JSON, and so on. So it's going to be very nice; we rewrote everything from scratch, a major effort, but now we're basically going to be ready for release, I think this week actually. Another exciting point is that we're currently number one in this Kaggle challenge. Marc Landry, who just joined us and who has been on team H2O for a while, he was at H2O World last fall, is actually going to spend almost half his time on Kaggle challenges using H2O, so we'll be excited to see this go across the finish line, and we will share how we did it, or rather he will share how he did it, because so far Marc did most of the work, next week at H2O in Mountain View, and it'll be live-streamed as well, so if you can make it, be sure to listen in. And these are some examples of other Kaggle applications.
We have demo scripts that are posted and available; for example, this one I posted maybe a month ago or so, a GBM random hyperparameter tuning example, where you basically just make ten models with random parameters and see which one is best. That's sometimes useful, especially if you have many dimensions to optimize over. We don't have Bayesian optimization yet, but this can be more efficient than a brute-force grid search, because the machine gets lucky more often than you would tell it to be lucky, if you want; that's why Monte Carlo integration works in higher dimensions, and the same thing is true for hyperparameter search, so don't shy away from these random approaches, they're very powerful.
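(The posted script does essentially this: draw a handful of random parameter combinations, train a GBM for each, and keep whichever validates best. A Python sketch of the same idea; the parameter ranges, paths, and column names are made up for illustration.)

```python
import random
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
train, valid = h2o.import_file("train.csv").split_frame(ratios=[0.8])   # placeholder path
y = "response"                                                          # response name assumed
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor(); valid[y] = valid[y].asfactor()

best_model, best_auc = None, 0.0
for i in range(10):                             # ten models with random parameters
    gbm = H2OGradientBoostingEstimator(
        ntrees=random.randint(20, 200),
        max_depth=random.randint(2, 10),
        learn_rate=random.uniform(0.01, 0.3),
        col_sample_rate=random.uniform(0.3, 1.0),
    )
    gbm.train(x=x, y=y, training_frame=train, validation_frame=valid)
    auc = gbm.auc(valid=True)
    if auc > best_auc:                          # keep the model that validates best
        best_model, best_auc = gbm, auc

print("best validation AUC:", best_auc)
```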
So this is the outlook: lots of stuff to do for data science, now that we have this machinery in place that can scale to big datasets. Customers are saying, well, do I need to find parameters myself? Yeah, sure, automatic hyperparameter tuning is great, we'll do that for you soon. You'll have ensembles, like a framework where you can, in the GUI and all, properly define what you want to blend together and in what way, non-negative least squares to stack models of different kinds, like a random forest and a GBM and so on, all on the holdout sets and so on. Then we want to have convolutional layers for deep learning, for example, for people who want to do more image-related stuff. But all these things are on a to-do list, and we have to prioritize them based on customer demand; that's what our customers get to do, the paying customers get to tell us basically what they want, and we'll take that into account. Natural language processing is high up there, especially now that we have this framework where we can characterize each string as an integer and then process all of that fast. And we have a new method called generalized low-rank models, which comes right out of Stanford, brand new; it can do all these methods, PCA, SVD, k-means, matrix factorization, all this stuff, fixing missing values for you based on something like a Taylor expansion of your dataset, very powerful stuff that can also be used for recommender systems. And we have lots and lots of other JIRA tickets and stuff to work on, so if you're interested in joining the effort, please do. I hope I left you with an impression of what you can do with H2O and what the state of the art is right now in machine learning on big datasets, and thank you for your attention.