
GOTO 2017 • Improving Business Decision Making with Bayesian Artificial Intelligence • Michael Green


[Music]
[Applause]
So hi everyone, as you heard my name is Michael Green, and I'm indeed here to tell you about a different approach to building algorithms and building machine learning methods. Really, I'm also going to argue that they are fundamentally the same thing, and you'll see that a little bit later in my talk. But let's get cracking. Basically, I'll talk about the overview of AI and machine learning; I'm not the first one to do this, and there are lots of people who have their take on it, but this will be my take. I'll also try to extend to you the idea and concept of why this is not enough. We are very good at telling ourselves that we have come really far in AI, and I would actually tend to disagree with that. I think we're playing around in the paddling pool, and it's simply not good enough. We need to innovate this area, we need to be better. I will also talk about how perception versus inference can work in a computer. I will make a short note about our Bayesian brains, because that's fundamentally how we reason as people, at least from a macroscopic perspective. I'll also talk a little bit about probabilistic programming and why I see that as a very key point for marrying two very differentiated fields today, and in the end I'll tie all of it together so that you can see how you can actually, practically, deploy a solution like this.
But basically, if we just go back to basics: I know a lot of different definitions of artificial intelligence, there are a lot of them out there, and none of them says "the ability to drive a car while not crashing". That's simply not artificial intelligence. That is something that solves a domain-specific problem; challenging, yes, but it's not AI. Neither is diagnosing a disease in a patient that comes into the ER. That's also not AI. Neither, actually, is what I do in my company. That's also not AI. All of those are examples of narrow AI, where we try to use machines to do more clever things than an individual person could do at the same task. My definition of AI is basically that it's the behavior shown by an agent that you put into an environment, where that behavior in itself seems to optimize the concept of future freedom. Now, that is the closest definition of artificial intelligence that I can come to, because it doesn't say anything like "optimize the least squares error" or "do backpropagation to make sure that the cross-entropy error looks good". All of those things are man-made, and I assure you our brains do not do backpropagation. It's simply not true.
No one is telling our children how to stand up. They're not getting smacked on the hands for failing. My son failed several times this morning, but he actually succeeded when I left the room, so without my encouragement he actually did better. That might say something about my pedagogical skills, or about the fact that he doesn't need my training to do these things. So there's a fundamental thing that's missing. There's a missing piece in our understanding of how knowledge is represented, accumulated and acted upon, and that is what fascinates me more than anything. I'm sure you've seen this before; it's just a definition of what AI is today. There are a lot of things here, but basically we are at the top level: every single application you have ever seen or heard of today is in this field, artificial narrow intelligence. There is no such thing as artificial general intelligence. It doesn't exist today, and if someone says they have it, they're lying, because we don't have the representation of how to capture knowledge. No one has that. You simply cannot express this in Python or R or whatever language you want. It doesn't exist. We need to figure out how to represent this.
So artificial general intelligence, that is really the task of saying: how could we actually take an AI that knows how to drive a car, put it into a different environment, and make it utilize the skills it learned while learning to drive the car, applied to a completely different field? That is domain transfer, and that is something that no AI can do today.

Now, artificial superintelligence. The only reason I'm mentioning this is because it's really, really far away. The only thing super about it is how super far away it is in the future. There's been a lot of people battling about this. One of the famous guys, Elon Musk, is more of a doomsday kind of guy with respect to this, and he should be, because that gets money into his company. It's a very smart move to say "AI is going to destroy the world, so I'm creating a start-up that's going to sort of regulate that". Imagine how hard it was to raise money for that venture. There are other things to consider about superintelligence, and that's that it is conceptually possible. Sooner or later, if we do capture how to represent knowledge, how to transfer knowledge, how to accumulate knowledge, if we know that, then there is no stopping us from deploying this into the world. For all practical purposes (now sounding a lot like Musk) what we released at that time would basically be a god to us, and the scary part about that is: will it be a nice god? Nobody knows. But then again, there's very little proof in history that intelligence feeds violence. If anything, the world is a safer place than it's ever been before, and I would like to see that as an evolution of our intelligence, as an evolution of our compassion. I don't see intelligence being a necessity for murderous robots, so I'm not very afraid of that scenario. I know we won't be the smartest cookies in the world anymore, but maybe that's not so bad. That was always going to happen, and evolution will make sure of that no matter what.
But basically, the landscape looks like this. You have this term "artificial intelligence" that is sort of ubiquitous and describes everything from doing a linear regression in Excel, to a self-driving car, to identifying melanoma on a cell phone, and all of these things are not artificial intelligence. It has just become a buzzword, just like big data; I very much agree with the previous speakers about this. The way I see it, AI today is two things: there are perception machines and there are inference machines. By inference I don't only mean forecasting or prediction, I mean real inference, where you actually predict without actually having any data. Now, on the perception part we've come a long way. Perception machines are everywhere. Those are the machines that know how to drive a car; those are the machines that know how to identify the kites in the images that we saw. All of those deep learning applications are basically perception machines: they can conceptualize something they get as input, either through visual stimuli or auditory stimuli, and they can sort of categorize it, but they cannot make sense of it, and I'll show you examples of that. That's why I argue that we need more. We need to move into proper inference, where we actually have a causal understanding, a representation of the world that we're living in, and only then can we actually talk about real intelligence. But we can get closer, and I'll show you how to do that. The biggest problem in data science today, which is also another term for applied artificial intelligence, is that data is actually not as ubiquitous and available as you might think.
For many interesting domains there is simply no data, and the data there is, is exceedingly noisy. It might be a flat-out lie; it might be based on surveys, and we know that people lie in surveys. That's also a problem. Structure is a problem too: how do you represent a concept in a mathematical structure, not necessarily in parameter space, but just structurally? How do you construct your layers in a neural network, for example? Then there's identifiability. What I mean by that is that for any given data set there are millions of models that fit that data set and generalize from it equally well, and many of them do not correspond to the physical reality that we're living in. So there are statistical truths, parameter truths, and there are physical realities, and they're not the same thing. That's why my previous field, theoretical physics, is sometimes problematic: quantum theory, which I sort of specialized in, has many different interpretations, and nobody really knows what's going on, but we know we can calculate stuff from it. So it makes sense in the math, but as soon as we push the question of what's really happening, well, we're basically screwed, because no one knows. A lot of people like to pretend that they know, and then there are some people, like the Copenhagen interpretation, who say "just shut up and do the math", which basically means "don't ask the questions, because they cannot be answered". Hawking adheres to this school, by the way. He's also one of the guys who's super scared of superintelligence, funnily enough, because he's a clever cookie.
There's also the thing about priors. Every time you address a problem as a human, whatever problem I give you as an individual, you will have a lot of prior knowledge. You'll have half or a whole life, depending on how old you are, of knowledge that you've accumulated. This knowledge might transfer from another person who just told you about something, but you can apply it to the problem at hand; you can represent that knowledge in the domain of the problem that you're trying to solve. And that is something we can actually mimic today, through the concept of priors. That is basically a way of encoding an idea, or a piece of knowledge, as a statistical prior, a statistical distribution that can be put on par with data. I'll show you later how to do that as well. The last part, but not the least important one, is uncertainty. I cannot stress enough how important uncertainty is for optimal decision-making. You basically cannot make optimal decisions without knowing what you don't know, and I will stress that point several times during this talk, during the remaining thirty-nine minutes of it; it's really great that I can actually see how little time I have left. So I will now show you some equations, and it's not because I'm particularly fond of them, but they do help express ideas.
So at the top level, this is basically a compact way of describing any problem that you might approach. It's a probability distribution over the data that you are fed. There are the X's and the Y's, the things that you want to be able to explain, and the thetas represent all of the different parameters of your model, stuff you don't know. They can also be latent variables: concepts that you know exist but that you don't have observational data for. All of that is the definition of a problem space. Now, what machine learning has traditionally done, ever since Fisher, is to look at this with a question that everybody knew was wrong. They basically asked: what is the probability distribution of the data that I got, pretending that it is random, given a fixed hypothesis that I don't know and that I'm actually searching for? So the problem, for all machine learning applications, became: which hypothesis could I generate that is most consistent with a data set that looks like my data set, but that really isn't my data set? You can ask whether that is a reasonable question, and I will tell you it is not. It is poppycock. That question is not worth asking. Why? Because you're basically just trying to find explanations to fit your truth. That is not science, ladies and gentlemen. There is only one way to do science: you postulate an idea, and then you observe data to see if you can verify that idea or disregard it. You cannot look at a data set, then generate a hypothesis that best explains it, and think that it somehow has any physical representation in this world, because it doesn't. And that's why a lot of machine learning approaches, a lot of statistical approaches, after several years of hardcore science, actually concluded that the biggest risk factor for dying from coronary artery disease is going to the hospital. Yeah, that's just not true. And nobody stopped and asked: why did this happen? Is it because the researchers are brain damaged? Could have been the reason, but it wasn't. It was the methodology. They were asking the wrong question, because if you ask that question, I can assure you that before you died at the hospital you had to go there. So it makes perfect sense statistically, but it has no representation of the problem you're trying to solve. What you should have asked is: given that you're sick, and you go to the hospital, and given that you have something that's worth visiting the hospital for, is that predictive of you actually being disposed to dying from coronary artery disease? So how do we fix this? We fix it by doing what we should have been doing from the beginning, and this is not new. This formula down here below asks a different question. What does it ask? It asks: what is the probability distribution of the parameters of my model (that I don't know, by the way), given that I have observed a data set that is real? It is not fake, it is not random; it is a data set that has been observed. What is the probability distribution of my parameters? Now, that is an interesting question to ask, and that is a scientific question to ask. But what does that require? It requires you to state your mind. The last factor in the numerator, the prior P(theta), says what you believe is true about your parameters before you see the data. That's very, very important, ladies and gentlemen, because this is the difference between something great and something completely insane.
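For reference, the formula he is describing is Bayes' theorem; written out in the notation used above (X for the observed data, theta for the parameters), it reads:

$$
p(\theta \mid X) \;=\; \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \;=\; \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, \mathrm{d}\theta}
$$

The posterior on the left is the scientific question he wants asked; the likelihood p(X | theta) is the piece that classical machine learning maximizes on its own; the prior p(theta) is where you state your mind; and the denominator, the evidence, is the integral he is about to call "an integral from hell".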
Now, then you might ask: okay, but why didn't we do this? Because it couldn't be done. We simply didn't have the computational power to do it. And it's not because of the term on the right-hand side there, and it's also not because of the left-hand term in the numerator; in fact, you can see that the left-hand term in the numerator is exactly what machine learning is doing today. Why is that? It's because they knew that the term in the denominator is an integral from hell, and it cannot be solved. It looks at every single value of every single parameter that you have and sums all of that out. You end up in a scenario where you have to calculate more things than there are atoms in the universe, and there are a lot of atoms in the universe, even in just the part that we can see. That basically meant that all of this was out of the question. So someone realized: hey, I don't need to calculate that. I don't care about probabilities; I can just say that the point that is the maximum will be the same, because the other thing is just a normalizing factor, it's a constant. Okay, good enough, we remove that. Done deal. Then they said: but the prior, what if I don't know anything? What if I don't want to say anything, I don't want to state my mind and put my knowledge into the problem? So that just becomes a uniform distribution over minus infinity to infinity, and whoop-de-doo, this equation has been reduced to only the likelihood. But you made a lot of assumptions there, and people just forgot that these assumptions are not true. It also ends in maximum likelihood, which is a horrible way of doing things, basically because you assume that everything is independent. You assume, even when you're doing time-series regression, that observation one is independent of observation two. That's like saying that last year I was not one year younger than I am today. Of course I was, and that's important.
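Spelled out (my own rendering of the chain of simplifications he just described), maximum likelihood falls out of the Bayesian formula through two rarely stated assumptions:

$$
p(\theta \mid X) \;\propto\; p(X \mid \theta)\, p(\theta)
\;\;\xrightarrow{\;p(\theta)\,\propto\,1\;}\;\;
\hat{\theta} \;=\; \arg\max_{\theta}\, p(X \mid \theta)
\;\;\xrightarrow{\;\text{i.i.d.}\;}\;\;
\arg\max_{\theta}\, \prod_{i=1}^{N} p(x_i \mid \theta)
$$

Dropping the evidence is harmless for finding the maximum, but the flat improper prior and the independence assumption behind the final product are exactly the assumptions he is warning about.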
All of those things that are temporally related are extremely important, and the reason why I'm saying this today is that there's no need to cheat anymore. There's no need for these crazy statistical-only results. You can state your mind, you can do the inference, and all of it can be done with probabilistic programming. There are many frameworks for this today, including in Python and also built on top of TensorFlow, by the way, so there's really no excuse not to do this. The best thing about it is that it's actually easier than adhering to classical statistics, because in the statistics you were taught tools. They said that if you have two populations and they are varying together, then you use this magical tool; if they are independent, then you use another magical tool. Nobody really understood why. "Here is the t-test; in this case it's a paired t-test; this one is the Wilcoxon; at this point you should do a logistic regression; in this one you should just do a normal linear regression; in this one you use a support vector machine." They are all the same thing. They are not different. There are different assumptions in the likelihood functions, there are different assumptions in your priors, there are different assumptions in the physical structure of your model. That is all; there is no other difference. All of it comes back to probabilistic modeling, and if you can learn how to make these assumptions explicit, then you have a modeling language without limitations. Then you don't have to know the difference between logistic regressions and linear regressions, because there is none. It is exactly the same thing.
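A minimal sketch of that point in Python, using statsmodels as an example library (my choice, not one mentioned in the talk): "linear regression" and "logistic regression" are the same generalized linear model fitted with a different likelihood assumption.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))                 # design matrix with intercept
y_cont = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)           # continuous outcome
y_bin = (X @ [1.0, 2.0, -1.0] + rng.logistic(size=200)) > 0    # binary outcome

# Same GLM machinery, different likelihood assumption:
linear = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()               # "linear regression"
logistic = sm.GLM(y_bin.astype(int), X, family=sm.families.Binomial()).fit()  # "logistic regression"

print(linear.params)
print(logistic.params)
```

The only thing that changes between the two fits is the `family` argument, i.e. the assumed likelihood; everything else is the same model.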
And that's perhaps the most important thing. No, wait, the most important thing that I'm going to say today, given that you think it's important, is that you cannot do science without assumptions. That is impossible. This is not my belief, this is just hardcore fact: you cannot do science without assumptions, and don't rest your minds until you understand this. Without actually risking something, you can get no answers.
So let's have a look at neural networks. How many of you have taken a neural networks class in their days? Okay, then most of you have solved this problem. How many people have solved this problem before? Okay, a few guys and girls. So basically this problem is highly nonlinear. It's a classification task: your job is to separate the blue dots from the red dots by some line. You can see this is sort of a spiral, and it's non-stationary. Quite nasty, isn't it? Now, a neural network: how many hidden nodes do you think I have to have in a one-layer neural network to solve this? 10, 20, 50, 100? Let's see. Well, with ten hidden nodes I can learn how to separate this. Not great, but there is some signal there. If you use thirty hidden nodes, you can do a lot better, not surprisingly, but it's still not good, because we know that this problem can be solved exactly, right? With a hundred hidden nodes you have almost perfect classification, and if you look at the accuracy table you will see that the area under the curve is 100% with the 100 nodes. Now, what is the problem with this? And this is on a test data set, mind you. The problem is that this looks great. This looks amazing. I mean, your job is done, right? Okay, so let's look at the decision surfaces that were generated from these models. On the left-hand side you have the decision surface based on 10 hidden neurons, and on the right-hand side you have the decision surface based on 100 hidden nodes. Now you can see here: do those decision surfaces look good to you? Does it look like they actually captured what you wanted them to capture? No, they did not. And this is exactly how neural networks work: they are over-parameterized, very flexible mathematical models that will do everything they can to minimize that sum of squares or cross-entropy error. There is no penalisation for finding statistical-only results. And what is the worst thing about this? The worst thing is that you see the regions on the outskirts that are colored red; that is a signal that the neural network is sure exists. There was no data out there at all, but it "knows" that that is a differentiated class. Now, this might not be a problem if you're trying to classify whether it will rain extra much tomorrow. But what if you have a drone with one target: kill insurgents, let civilians live? What if it identifies one of those outer regions, one that just "makes sense" to it but was never part of the training set? This is a "truth" that has been learned by a network where the data never actually showed it this, and there's no penalisation for it. The reason why I'm saying this is not to say "don't use AI" or "don't use machine learning". In fact I'm saying the opposite. What I want to say here is: be responsible. Every time you deploy a machine learning algorithm you have to understand exactly what it does, because lack of understanding is the most dangerous thing that can exist today. It doesn't have to be artificial superintelligence; all it requires is a screw-up by the engineer or the scientist who built the network, and it can have dramatic consequences, especially today, in the time of self-driving cars and all these things.
And here I will show you another example of why I think this is interesting. This is just a representation, and mind you, this is only a single-layer neural network, no super-deep structures, which would have even more parameters. I just want to show you that this problem, represented in Cartesian coordinates, is what was being fed to the neural network. What the neural network should have realized is that in polar coordinates it looks a lot simpler, doesn't it? Now I know that problem: I can separate it with just one hidden node. And this is my point. You can over-parameterize and throw a lot of data at things, but what if you start to think about the problem at hand, if we teach machines to learn how to think, how to reason, how to look at data, instead of just number crunching? This is why, today, I'm not scared of artificial intelligence or artificial superintelligence: because I could have solved this in half a second. Even if you don't have a degree in physics, you should realize that these are just two sine functions with increasing radius. It's not hard. But a neural network would never get this, nor would any other machine learning algorithm, by the way. Impossible, because they don't work that way. That's not their goal, and we can't be angry at them for not solving it.
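To make that concrete, here is a small sketch (my own, with made-up two-spiral data) of the change of representation he is describing: in Cartesian coordinates the arms are badly entangled, while a single smooth feature in polar coordinates separates them almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
r = np.linspace(0.5, 4.0, n)                  # increasing radius along each arm

# Two spiral arms offset by pi: the classic two-spirals problem
theta0 = 1.5 * r
theta1 = 1.5 * r + np.pi
x0 = np.c_[r * np.cos(theta0), r * np.sin(theta0)] + rng.normal(0, 0.05, (n, 2))
x1 = np.c_[r * np.cos(theta1), r * np.sin(theta1)] + rng.normal(0, 0.05, (n, 2))
X = np.vstack([x0, x1])
y = np.r_[np.zeros(n), np.ones(n)]

# Change of representation: Cartesian (x, y) -> polar (radius, angle)
radius = np.hypot(X[:, 0], X[:, 1])
angle = np.arctan2(X[:, 1], X[:, 0])

# Along arm 0 the unwrapped phase (angle - 1.5 * radius) sits near 0,
# along arm 1 it sits near pi, so one smooth polar feature separates the classes.
pred = (np.cos(angle - 1.5 * radius) < 0).astype(int)
print("accuracy with a single polar feature:", (pred == y).mean())
```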
Now I just want to show you a take on probabilistic programming with this, and also explain to you what probabilistic programming is. It's basically an attempt to unify general-purpose programming (and by general purpose I mean Turing-complete programs, which we all like because they can basically compute anything) with probabilistic modeling, which is what everyone should be doing. Everyone. Whatever model you are creating, you are doing probabilistic modeling; you have just accepted a lot of assumptions that you didn't make yourself. The realization is that even though you can choose not to care about it, you have to know about it. You have to know the assumptions behind the algorithms that you're using. That's why, even though it's very tempting to fire up your favorite programming language and load scikit-learn or TensorFlow or whatever framework you're using (MXNet, doesn't matter), it's still important to understand the concepts. You don't have to be an expert in the math behind it, that's not what I'm saying, but you have to understand conceptually what these methods do and, more importantly, what they don't do, because that makes all the difference. So this is just to say that you could have written this model a lot more easily. This is also a breaking point of HTML5 presentations, by the way; this is actually supposed to be on the right-hand side, so thank you, Windows. Even so, those few lines of code up there are basically a probabilistic way of specifying the model that solves this exactly, and they can be expressed in a probabilistic programming language. The neural network I wrote to fit that problem took a lot more coding, I can assure you.
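The slide's code itself isn't in the transcript, but a probabilistic specification in that spirit might look roughly like this PyMC3-style sketch. This is entirely my own guess at the model, with made-up parameter names, reusing X and y from the spiral snippet above.

```python
import numpy as np
import pymc3 as pm

radius = np.hypot(X[:, 0], X[:, 1])
angle = np.arctan2(X[:, 1], X[:, 0])

with pm.Model() as spiral_model:
    # Unknowns: how fast the spiral winds and how sharp the class boundary is
    winding = pm.Normal("winding", mu=0.0, sigma=3.0)
    amplitude = pm.Normal("amplitude", mu=0.0, sigma=10.0)

    # Which arm of the spiral you sit on determines the class probability
    p = pm.math.sigmoid(amplitude * pm.math.cos(angle - winding * radius))
    pm.Bernoulli("obs", p=p, observed=y)

    trace = pm.sample(1000, tune=1000)
```

A handful of lines, because the structure of the problem (an angle that advances with radius) is written directly into the model instead of being rediscovered by a hundred hidden nodes.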
So the take-home message here is that if you go back to basics and view things as what they are, probabilistic statements about data, about concepts, about what you're trying to model, you gain a generative model, you gain an understanding of what is actually happening, and that also means you don't get any crazy statistical-only solutions due to identifiability problems. This is something we really have to get away from; identifiability is something that will be problematic. I'm not going to talk about deep learning. I just wanted to show what it is, but I think you've had enough talks about that, so max pooling and all of that I'm pretty sure we can skip. What I do want to say, though, is that neural networks per default are degenerate. What I mean by that is that the energy landscape they're running around in, where they are trying to optimize things, has multiple locations corresponding to parameters that minimize the error and that are equivalent, yet they correspond to very different physical realities. So how is the neural network supposed to know? And this is not something that we can design our way out of, because the whole idea of the neural network is this degeneracy; that's why the optimization space is so problematic. I just want to visualize, with a simple neural network, why this happens. You can see these two networks describe exactly the same thing; they solve exactly the same problem, but the parameters are different. If you go from x1 with weight w11 = 5 into hidden node 1, you could just as well relabel the hidden nodes, send that same weight into hidden node 2 instead, and move all the other weights along with it. If you basically turn this on its head and shift these weights around, you get exactly the same solution. Now, this is one source of degeneracy, and there are many of those. So just imagine that you're stacking a lot of layers on top of each other and you have hundreds of neurons: how many permutations do you think you will be able to reach? A lot is the answer. I didn't do the math, but trust me, it's a lot.
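Here is a tiny self-contained sketch (my own) of that permutation symmetry: relabel the hidden units of a one-hidden-layer network, carry their weights along, and you get a different parameter vector that computes exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # five inputs with two features

W1 = rng.normal(size=(2, 3))           # input -> 3 hidden units
w2 = rng.normal(size=(3, 1))           # hidden -> output

def net(X, W1, w2):
    return np.tanh(X @ W1) @ w2        # one hidden layer, tanh activation

perm = [2, 0, 1]                       # relabel the hidden units
same = np.allclose(net(X, W1, w2), net(X, W1[:, perm], w2[perm]))
print(same)                            # True: different weights, identical network
```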
In energy space, in one dimension, it looks like the plot on the left-hand side: you see two distinct points, they are equivalent in the solution space, and you cannot differentiate between them. This is also why regularization is such a good idea in neural networks, because it basically forces you into one of those attractors. In two-dimensional space you can see that it corresponds to these two attractors in this colorized plot, and if you visualize this in all the dimensions that the neural network is actually operating in, which is typically a huge number of dimensions, then you can just imagine how many of those attractors you have, and how different the depths of those attractors are. So I want to drive my point home. If you missed my point, I've tried to state it several times, but sometimes I'm clumsy in the way I state things, so I'm going to be very blunt. This is one of the best neural networks for this task: as of 2015 or 2016, a version of LeNet that was trained to recognize digits, and it does that perfectly. Like we said before, we're so far along in this area of perception that we don't have to worry about not being able to do it. It's actually done, and it's much better than humans at recognizing these things. Okay, so let's put it to the test, shall we? Let's generate some random noise images and ask it: what is this? In every single image here you see the network is 99% sure that it's a 1, or a 2, all the way up to 9. For all four images under the 0, it is convinced, with a likelihood of 99%, that this is a 0. Can you in any way understand why this is a zero? I can't, and nor can the network, because it was never penalized on the basis that you're not allowed to find structures that your data does not support. It has no requirement to stay true to some sort of physical reality. And this happens.
Now, back to my point. What if it's not the number zero? What if it's recognizing, in unknown noise, the face of a known terrorist with a kill-on-sight command? And this is just numbers, ladies and gentlemen; imagine the complexity of faces. This is exactly how dangerous this technology is if you don't respect it. It's not about the machines being too intelligent; it's about us not being stupid. That is really important to remember. We have a responsibility to build applications that do not have this confirmation bias in them, and that is something I hope all of you will think of when you go out and build the next awesome machine learning application, because I can't see any numbers in these images anywhere. If you want to, you can read the paper by these guys; as I said, you'll get the slides afterwards, and it's a very interesting paper. They basically tried all they could to see how the network would generalize to things it hadn't seen before, in different areas from what it was supposed to see.

Another thing I want to say is that events are not temporally independent. Everything that you do today, everything that you see, hear, perceive, think about is affected by what you saw yesterday. And it's the same with data: data is not independent. You cannot assume that two data points are independent; that is a wild and crazy assumption that we have been allowed to make for far too long. This is just a small visualization from the domain that I was working in, where we try to work out how a TV exposure affects the purchasing behavior of people moving into the future. Of course, if you see a TV commercial today, it might affect you to buy something far into the future, and it might affect no one to do something today. Those are causal, temporal dependencies that also need to be taken into account. If you think about causal dependencies, if you think about concepts, if you really think about the structure of things, then you end up with something that looks like a deep learning neural network, but where you actually have structure that is inherent to the problem at hand. That's basically you forging connections between concepts, between variables, between parameters, in a way that solves the problem at hand but doesn't have this over-parameterization. This is a visualization of one of the models that we're running at Blackwood for one of our clients, and this is sort of the complexity that you need to solve the everyday problems. Every node that you see here is basically a representation of a variable or a latent variable, and the relationships between them are the edges. And basically there's no point in this thing spinning; I just thought it looked cool, and it helped me raise money back in the days. Actually, I think the spinning was the differentiator, because in one of the pitches I did, it didn't spin and we didn't get the money, and then all of a sudden it was spinning and we got the money. I don't know if that's the whole reason, but the spinning, in my mind, helped. There's no visual improvement based on it, though. How many people have seen this before?
Okay, well, that's just no fun. But before I saw it the first time, interestingly enough, I had not seen it. The problem here is that you're supposed to judge whether A and B, the squares there, are of the same hue or not. From my point of view they look extremely different, but the problem is that they're not; they're actually the same. The reason why a lot of people think they are different is because we are predicting based on the shadow being cast from a light source whose position we know, because we have recognized this pattern earlier in our lives. That is also a kind of confirmation bias, but it's a good one, because that's what allows us to actually live our lives. Sometimes we're wrong, like in these contrived images, but it does prove a point: our brains are very biased based on what we know already, and we make predictions based on what we know.
So, basically, probabilistic programming: what is that? It basically allows us to specify any kind of model that we want. You don't have to think about layers, you don't have to think about pooling, you don't have to think about all the weights. All you have to think about is how variables might relate to each other, which parameters might be there, and how they relate to the variables at hand. If you have that freedom, then there is nothing you cannot model. The catch is that you cannot fit this with maximum likelihood. You cannot do it that way, because you can't assume independent observations, and you can't assume that everything is uniform. Well, you can, but it's not very smart. You can't assume that any given parameter can take any value from minus infinity to plus infinity; in general that just makes no sense. Just think about the fact that you're supposed to predict house prices, for example. If you allow your model to predict something which is negative, then you have something that might make sense in statistical space, because there's no reason why you shouldn't be able to mirror things, right, you just look at the positive part. But what about the part of your model that says negative sales prices are also possible? That's just nonsense, and these things you shouldn't allow. That's why you should specify your priors and the concepts of your models very rigorously.
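As a small sketch of what "specifying your priors and concepts rigorously" can look like in practice (PyMC3-style, with made-up data and variable names of my own), here the likelihood itself is chosen so the model simply cannot produce a negative price, and the prior encodes that bigger houses are not cheaper:

```python
import numpy as np
import pymc3 as pm

area = np.array([55.0, 70.0, 90.0, 120.0, 140.0])    # made-up floor areas
price = np.array([1.8, 2.3, 2.9, 4.1, 4.6])          # made-up prices, in millions

with pm.Model() as house_model:
    alpha = pm.Normal("alpha", mu=0.0, sigma=5.0)
    beta = pm.HalfNormal("beta", sigma=1.0)           # prior: price does not fall with area
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    mu = alpha + beta * np.log(area)
    # Lognormal likelihood: negative prices are impossible by construction
    pm.Lognormal("obs", mu=mu, sigma=sigma, observed=price)

    trace = pm.sample(2000, tune=1000)
```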
And the best thing about probabilistic programming is that we no longer have to be experts in Markov chain Monte Carlo. Before, you had to be, but today you don't. You don't have to understand what a Hamiltonian is in this space, you don't have to understand quantum mechanics; you just have to learn how to program in a probabilistic programming language, which is very easy, by the way. Super easy. If you know Python or R or Julia or C++ or C or Java, learning a probabilistic programming language is a walk in the park, and it's still Turing complete, mind you. There are a lot of different things we get out of this. We get full Bayesian inference with Markov chain Monte Carlo, through algorithms such as Hamiltonian Monte Carlo and the No-U-Turn sampler; that's what you really want to do. The problem is that, still today, it takes some time. There's another emerging tool called automatic differentiation variational inference, which is just a lot of different words that say: turn the inference problem into a maximization problem. They have gotten somewhere with that, which makes these inference machines a lot easier to fit. The best thing is that the underlying math library already automates the differentiation, so you don't have to express that either. Again, all you have to do is learn a probabilistic programming language, or learn a framework in Python that supports it, like Edward for example. There are many other frameworks that do the same thing.
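A minimal sketch of those two fitting routes side by side (PyMC3-style, with toy data of my own): full MCMC with the No-U-Turn sampler, and the ADVI approximation that turns the same inference problem into an optimization problem.

```python
import numpy as np
import pymc3 as pm

data = np.random.default_rng(1).normal(loc=0.5, scale=2.0, size=200)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

    # Full Bayesian inference: Hamiltonian MC with the No-U-Turn sampler
    trace = pm.sample(2000, tune=1000)            # NUTS is the default here

    # Cheaper approximation: automatic differentiation variational inference
    approx = pm.fit(n=20000, method="advi")
    advi_samples = approx.sample(2000)
```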
A note about uncertainty. What if I gave you a task: your task right now is to take one million American dollars and invest them in either a radio campaign or a TV campaign. Now, I'm going to tell you that the average performance of each campaign has been 0.5, so the return on investment for an average radio campaign has been 0.5, and the return on investment for an average TV campaign has also been 0.5. My question to you is: how would you invest? Does it matter? Well, based on this information, I would say I'll just split it 50/50. I mean, why not, they have the same performance, right? But what if I also told you to look at ROI as a distribution? If you look over all the different radio campaigns that have been run and all the different TV campaigns that have been run, if you look beyond the average and look at the individual results, what do you have then? Well, then you have that both radio and TV, for example, have historically had a return on investment of 0, which basically means it didn't work. That could be like some of the commercials you see on TV sometimes that are less than good; you know, sometimes you see these naked gnomes running on a grass field trying to sell cell phone subscriptions, and nobody understood the connection. It didn't work. Well, I didn't quantify that, but it didn't work on me. Then I'm going to tell you that the maximum radio and TV performance that has been observed is that radio has had, in its history, a return on investment of 9.3, while TV has only had 1.4. How would you invest now? Would you still split it fifty-fifty? I wouldn't. Now, what if I tell you that this is probably not the real solution either? In order to answer this question, you have to ask another question in return. You have to ask: what is the probability of me realizing a return on investment greater than, for example, 0.3? Let's just take that as what I want to achieve. Now we have specified what our question is, and then we can give it a probabilistic answer. The answer to this question is that it's about 40 percent probable for radio to get a return on investment above 0.3 for any given instance, but it's about 90% for TV. How does that go hand in hand with the fact that radio has outperformed TV historically as a maximum, and they have the same average? Well, it's because things are distributions. Things are distributions, and they are not Gaussian.
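Here is a small simulation (my own, with made-up distributions whose parameters are tuned so the numbers land roughly where the talk's example does) showing how two ROI distributions with the same mean of 0.5 can answer the question P(ROI > 0.3) completely differently:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# "Radio": heavily right-skewed, most campaigns do almost nothing, a few are huge
radio = rng.lognormal(mean=-1.5, sigma=1.3, size=n)
radio *= 0.5 / radio.mean()                 # rescale so the mean is exactly 0.5

# "TV": tight and nearly symmetric around the same mean
tv = rng.gamma(shape=9.0, scale=1.0, size=n)
tv *= 0.5 / tv.mean()

for name, roi in [("radio", radio), ("tv", tv)]:
    print(f"{name}: mean={roi.mean():.2f}  P(ROI > 0.3)={(roi > 0.3).mean():.2f}")
```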
Now, this is the source of failure of every statistical method that you have probably tried before, because it assumes that everything is symmetric and Gaussian. Nature makes no such promise. It has never said "thou shalt not use Cauchy"; never has that been part of any sort of commandment or information given to us by nature. There is nothing special about the Gaussian distribution. Well, there are a few things special about it, but let's just ignore the central limit theorem for now, because we don't have enough data to actually approach that regime anyway. The point here is that the distribution for radio looks like this, and the distribution for TV looks like the one below. Here you can see they have the same average, very different minima and maxima, and very different skewness. This is why you cannot make optimal decisions without knowing what you don't know. You cannot make optimal decisions without knowing the uncertainty, even if you know the average performance. Average performance is such a huge culprit in bad science and bad inference; I cannot state this enough. That's also why you should never, ever, ever treat the parameters of your model as if they were constants, because they are not. It's also not interesting to ask the question "how uncertain is my data about this fixed parameter?" That is also a nonsense question, not interesting. And that is why we have to go back to basics and do it right, because until we do, we will never get any further.
So, if I can tie this all together: I created sort of a way for you to start playing around with this. I made a Docker image, basically, which is called rbayesian. R is the host language there, but you can basically use whatever language you want, it doesn't really matter. What I want to show here is basically how easy it is to deploy a Docker container with a Bayesian inference engine that can model any problem known to man. There is nothing you cannot do with this framework, nothing. It is more general than anything that you have ever tried, because it can simulate everything that you have ever tried, and most of the things you have ever tried come from probability theory; this is just a pure application of probability theory. So this is a very easy way to just grab that Docker container, and the best thing is that the functions you write in there are automatically converted to a REST API that you can expose through this Docker service. You get a REST-API-ready inference machine that is very much true to the scientific principle, with no limitations, and the only thing you have to pay for it is that you have to think twice. Now, for those of you who don't like R, I can make a version with Python or Julia or whatever; it's not about the language. What I really want to convey is that modeling needs to be rebooted. We need to think again about how we define our models, how we specify our models, how we think about our models, how we relate to our models. We can never relate to our models without uncertainty; we will always fail. That's why I think playing around with this is a good way to learn more about these things. This is just an example of how you would actually use it. I wrote a very, very stupid container that's called "stupid weather", and it's stupid because it always gives you the same answer: no matter what you send in as a parameter, it always gives you something stupid. That's just to show you how you write a function; it's not supposed to convey any intelligence, it's just a placeholder, just boilerplate code for you to plug your own algorithm into. But it shows neatly how you transform this into a REST API, and it's as simple as this: just docker run, and then you have it. So even if you're not a back-end developer or a full-stack developer, it's still easy to deploy and run your own solutions. And Docker containers can run anywhere in the cloud: they can run on Google, they can run on Amazon, I think they can even run on Microsoft's cloud. Probably; I didn't try that, but I would assume they can run Docker containers.
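The speaker's container is R-based, but the same idea in Python might look roughly like this sketch (my own, using Flask): a placeholder "stupid weather" function wrapped as a REST endpoint that you can bake into a Docker image.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict/weather")
def stupid_weather():
    # Placeholder "inference": always the same answer, just like the
    # boilerplate container described in the talk. Swap in a real model here.
    city = request.args.get("city", "Copenhagen")
    return jsonify({"city": city, "forecast": "rain", "probability": 0.99})

if __name__ == "__main__":
    # Inside a container you would typically bind to 0.0.0.0
    app.run(host="0.0.0.0", port=8000)
```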
So if I can leave you with one conclusion, it is basically: think again about everything that you were ever taught. Every statistics class you had, every applied machine learning class, all of it. Rethink it, re-evaluate it, be critical of whatever you were told, because I can assure you that in most cases it was a flat-out lie. And that lie didn't happen because people wanted to lie to you; it's based on ignorance, and it's based on decades of malpractice in this field. But computation has caught up with us. Before, it was okay to do; it was done because we had no other choice. Today it is no longer okay. We have all the choices in the world. It's not hard getting a computational cluster with 200 gigabytes of RAM and 64 CPUs, or even 5000 GPUs; those things are at our disposal. We don't need to take the same shortcuts as we did, dangerous shortcuts no less. So I hope you will think about that. Another thing: whenever you're solving a problem, I would like you to remember that whatever problem you're solving, whatever machine learning application you're writing, it is an application of the scientific principle. Please stay true to that. There's a reason why we have it. Science is a way for us to not be biased; science is a way for us to discover truths about the world that we live in. This should not be ignored or taken lightly, and that's why crazy people like Trump can get away with saying that there is no such thing as global warming: because he does not adhere to the scientific principle. So you can either be Trump or you can stay true to the scientific principle, and those two are the only extremes, my friends. Another thing that I want to say is: always state your mind. Whatever you know about the problem, I assure you that that knowledge is critical and important. Do not pretend, and do not fall into the trap of "I want to do unbiased research". There's no such thing. No such thing; understand this. There is no bias-free research, and there is no scientific result that can be achieved without assumptions. You are free to evaluate your assumptions again and restate them. That's good, that's progress, that is science. But before you observe data, state your mind. You have to, because otherwise you've got nothing. You got a result out, but it was just picked out of thin air; there is nothing special about those coefficients that came out, nothing at all. And until people realize this, we will still have applications that believe that Central Park is a red light, and it is not, even though it might look like it from a different scale. We need to do better, and we can do better. Maybe the most important thing of all is that with this framework, and with this principle of thinking, you are able to be free, you are able to be creative, and most of all you are able to have so much more fun building your models, because you are not forced into a paradigm that someone else defined for you because it made the math nice. Thanks.
I think we have time for one question. Somebody asked: where can I read more about this, any good resources? Yes, there are a few great books that I can warmly recommend, and I will give them in order of mathematical requirements. If you're a hardcore mathematician or a theoretical physicist, or anyone with a computational background and a deep understanding of mathematics, you can go directly to a book called the Handbook of Markov Chain Monte Carlo. That is a very technical book, and it describes the processes behind probabilistic modeling. If you are a little bit less mathematical, but still want quite a bit of mathematics, you should read the section about graphical models by Bishop, in the book called Pattern Recognition and Machine Learning. But the most important book of all to read is perhaps the one called Statistical Rethinking, and that book explains a lot of the concepts that I've been badgering on about, that somewhere along the line we just got lost. It has both text that is consumable by people and a little bit of math, so you can put it in context. Those are really the books I would recommend in this area. Okay, thank you, and I'll tweet the resources to the GOTO hashtag, #gotocph. Okay, thank you. Thank you.
[Applause]