Press "Enter" to skip to content

A Friendly Introduction to Machine Learning


Hi, and welcome to this machine learning video from Udacity. What we're going to talk about today is: what is machine learning? Well, this is the world, and in the world we have humans and we have computers. One of the main differences between them is that humans learn from past experiences, whereas computers need to be told what to do; they need to be programmed, and they follow instructions. Now the question is: can we get computers to learn from experience too? The answer is yes, we can, and that's precisely what machine learning is. For computers, past experiences have a name: data. So in the next few minutes I'm going to show you a few examples in which we can teach a computer to learn from previous data, and most importantly, I'm going to show you that these algorithms are actually pretty easy, and that machine learning is really nothing to fear.
Let's go to the first example. Say we're studying the housing market, and our task is to predict the price of a house given its size. We have a small house that costs $70,000 and a big house that costs $160,000, and we'd like to estimate the price of this medium-sized house. How do we do it? Well, first we put them on a grid, where the x-axis represents the size of the house in square feet and the y-axis represents the price of the house in dollars. To help us out, we have collected some previous data in the form of these blue dots: other houses we've looked at, whose prices we've recorded with respect to their size. In this graph we can see that the small house is priced at $70,000 and the big house at $160,000.

So now it's time for a small quiz. What do you think is the best guess for the price of the medium house, given this data: $80,000, $120,000, or $190,000? Well, to help us out, we can see that these blue points roughly form a line, so we can draw the line that best fits the data. On this line, our best guess for the price of the house is this point over here, which corresponds to $120,000. So if you said $120,000, that is correct. This method is known as linear regression.

Now you may ask: how do we find this line? Let's look at a simple example with three points, where we try to find the best line that fits through them.
Obviously "best line" is subjective, but we'll try to find a line that works well. Since we're teaching the computer how to do it, and a computer can't really eyeball a line, we get it to draw a random line and then measure how bad that line is. To see how bad the line is, we calculate the error: we look at the lengths of the distances from the line to the three points, and we simply say that the error of the line is the sum of those three red lengths.

Now what we're going to do is move the line around and see if we can reduce this error. Say we move it in this direction and calculate the error, now given by the yellow distances; we add them up and realize we've increased the error, so that's not a good direction to go. Let's try moving in the other direction. We move the line here, calculate the error, now given by the sum of the three green distances, and we see that the error is smaller, so we've actually reduced it. So we take that step, and we're a little closer to our solution. If we repeat this procedure several times, always decreasing the error, we finally arrive at a good solution in the form of this line. This general procedure is known as gradient descent.

Now, in real life we don't want to deal with negative distances, which would correspond to a point being on one side of the line or the other. So what we do instead is add the square of the distance from each point to the line. This procedure is called least squares.
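The two ideas above, squared distances for the error and small downhill steps to reduce it, fit in a few lines of code. Here is a minimal sketch in Python, using three made-up points (the data and the learning rate are my own illustrative choices, not from the video):

```python
# Fit a line y = m*x + b to three points by gradient descent,
# minimizing the sum of squared vertical distances (least squares).
points = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.0)]  # hypothetical data

m, b = 0.0, 0.0          # start with an arbitrary (here: flat) line
learning_rate = 0.01

for _ in range(10000):
    # Gradient of the squared error with respect to m and b
    grad_m = sum(2 * ((m * x + b) - y) * x for x, y in points)
    grad_b = sum(2 * ((m * x + b) - y) for x, y in points)
    # Take a small step downhill
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

error = sum(((m * x + b) - y) ** 2 for x, y in points)
print(m, b, error)
```

For these points the loop settles on a slope near 1.0 and an intercept near 0.83, the same line a closed-form least-squares formula would give.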
You can picture gradient descent as descending from a mountain, say Mount Everest. On this mountain, the higher we are, the larger the error, so descending means reducing the error. How do we do gradient descent? We look at our surroundings and try to figure out which way lets us descend the most. For example, here we can go in two directions, right or left. If we go to the left, we're going up, so the error is increasing; this is equivalent to moving the line downwards, farther from the three points. But if we go to the right instead, we're actually descending, which means the error is decreasing; this is equivalent to moving the line upwards, closer to the three points. So we decide to take a step toward the right. Then we repeat this procedure again and again until we successfully descend from the mountain. This is equivalent to reducing the error until we find its minimum value, which gives us the best line fit.

So you can think of linear regression as a painter who will look at your data and draw the best-fitting line. This method is actually much stronger than that: if the data doesn't form a line, then with a very similar method we can draw a circle through it, or a parabola, or even a higher-degree curve. For the data here, for example, we can fit a cubic polynomial.
Okay, let's move on to the next example. In this example we're going to build an email spam detection classifier: something that tells us whether an email is spam or not. How do we do it? By looking at previous data. Our previous data is 100 emails that we've already looked at; out of these 100 emails, we have flagged 25 of them as spam and 75 as not spam.

Now let's think of features that spam emails may be likely to display, and analyze those features. One feature could be containing the word "cheap": it seems reasonable to think that an email containing the word "cheap" is likely to be spam. So let's analyze this claim. We look for the word "cheap" in all 100 emails and find that 20 of the spam ones and 5 of the non-spam ones contain it. So we can forget about the rest of the emails and focus only on the ones that contain the word "cheap".

Okay, time for a quiz. Based on our data, if an email contains the word "cheap", what is the probability that it is spam: 40%, 60%, or 80%? Well, to help us out, we can see that out of the 25 emails with the word "cheap", 20 are spam and 5 are not, an 80/20 split. So the correct answer is 80%; if you said 80%, you were correct. From analyzing the data we can conclude a rule: if an email contains the word "cheap", the probability of it being spam is 80%. So we associate this feature with the probability 80%, and we'll use it to flag future messages as spam or not spam.

We can also look at other features and find their associated probabilities. Say we look at emails containing a spelling mistake and find that the probability of such an email being spam is 70%; or we look at emails that are missing a title and find that the probability of those being spam is 95%; and so on. Now, when future emails come in, we can combine these features to guess whether they are spam or not. This algorithm is known as the Naive Bayes algorithm.
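The "cheap" calculation, and one simple way to combine several feature probabilities, can be sketched like this. Note the combination step is my own illustrative simplification: it multiplies the evidence for and against spam and normalizes, which is the naive-independence idea under an assumed equal prior, not necessarily the exact computation the full algorithm uses:

```python
# Estimate P(spam | feature) from counts, as in the 'cheap' example:
# 20 of the 25 emails containing 'cheap' were spam.
def prob_spam_given_feature(spam_with_feature, ham_with_feature):
    total = spam_with_feature + ham_with_feature
    return spam_with_feature / total

p_cheap = prob_spam_given_feature(20, 5)
print(p_cheap)  # 0.8

# Naively combining features (the 'naive' independence assumption):
# multiply the spam evidence and the not-spam evidence, normalize.
def combine(probs):
    spam_score, ham_score = 1.0, 1.0
    for p in probs:
        spam_score *= p
        ham_score *= (1 - p)
    return spam_score / (spam_score + ham_score)

# 'cheap' (80%) together with a spelling mistake (70%)
print(combine([0.8, 0.7]))
```

Combining the two features pushes the spam probability above 90%, higher than either feature alone, which is the point of stacking evidence.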
Okay, now another example. We are the App Store, or Google Play, and our goal is to recommend apps to users: to each user we'll recommend the app they're most likely to download. We have gathered a table of data that we'll use to build the rules. The table contains six people; for each of them we have recorded their gender, their age, and the app they downloaded. For example, the first person is a 15-year-old female, and she downloaded Pokémon Go.

So here's a small quiz: between gender and age, which one seems like the more decisive feature for predicting which app a user will download? To help us out, let's look at gender first. If we split the users by gender, the females downloaded Pokémon Go and WhatsApp, whereas the males downloaded Pokémon Go and Snapchat, so there's not much of a split here. On the other hand, if we look at age, we realize that everybody under 20 years old downloaded Pokémon Go, whereas everybody 20 or older didn't. That's a nice split. So the feature that best splits the data is age; if you said age, that was correct.

So what we're going to do is add a question here: are you younger than 20? If yes, we'll recommend Pokémon Go; if not, we'll see. What happens if you're 20 or older? Then we look at gender: it seems that the females downloaded WhatsApp, whereas the males downloaded Snapchat. So we add another question: are you female or male? If you're female, we recommend WhatsApp, and if you're male, we recommend Snapchat.

What we end up with is a decision tree, where the decisions are given by the questions we asked. This decision tree was built from the data, and now whenever we get a new user, we can run them through the tree and recommend whatever app the tree suggests. For example, for a young person we recommend Pokémon Go; for an older person we check their gender, and recommend WhatsApp if female and Snapchat if male. Obviously there won't always be a tree that perfectly fits the data, but in this class we're going to learn an algorithm that finds the best-fitting tree for your table of data.
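The tree we just built is small enough to write out by hand as plain if/else rules (this is the tree from the example written directly in code, not the general tree-learning algorithm):

```python
# The decision tree from the table, as plain if/else questions.
def recommend(age, gender):
    if age < 20:                 # first question: younger than 20?
        return "Pokemon Go"
    elif gender == "F":          # second question: female or male?
        return "WhatsApp"
    else:
        return "Snapchat"

print(recommend(15, "F"))  # the 15-year-old from the table
print(recommend(25, "F"))
print(recommend(30, "M"))
```

A tree-learning algorithm automates exactly the reasoning we did: at each node it picks the feature that best splits the data, then recurses.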
Okay, so let's go to the next example.
Now let's say we're the admissions office at a university, and we're trying to figure out which students to admit. We're going to admit or reject them based on two pieces of information: an entrance exam that we give them, and their grades from school. For example, here we have student 1, with scores of 9 out of 10 on the test and 8 out of 10 in grades; that student got accepted. We also have student 2, with a 3 on the test and a 4 in grades; that student did not get accepted. Then a new student comes in, student 3, with scores of 7 and 6, and the question is: should we accept them or not?

Let's first put them on a grid, where the x-axis represents the score on the test and the y-axis represents the grades. Student 1 lies over here, at the point with coordinates (9, 8), since their scores were 9 and 8, and student 2 lies right here, at the point (3, 4), since their scores were 3 and 4. To decide whether to accept or reject student 3, we look at the previous data: all the students we've already accepted or rejected. The previous data looks like this: the green dots represent students we previously accepted, and the red dots represent students we previously rejected.

So, time for a quiz: based on the previous data, do we think student 3 gets accepted, yes or no? To answer this, let's look closely at the data. The red and green dots seem to be nicely separated by a line. Here's the line: most of the points over it are green and most of the points under it are red, with some exceptions. This makes sense, since the students with high scores are over the line and got accepted, while the students with lower scores are under the line and didn't. So we're going to say that this line is our model. Now, every time a new student comes in, we check their scores and plot them on this graph; if they end up over the line, we predict they'll get accepted, and if they end up below it, we predict they'll get rejected. Since student 3 has scores of 7 and 6, they end up here, at the point (7, 6), which is over the line, so we conclude that this student gets accepted. If you said yes, that's the correct answer. This method is known as logistic regression.
Another question is: how do we find the line that best cuts the data in two? Let's look at a simple example with six points, three green and three red, where we try to draw the line that best separates the green points from the red ones. Again, a computer can't really eyeball a line, so we start by drawing a random line, like this one, and we randomly say that the region over the line is labeled green and the region under the line is labeled red. Just as with linear regression, we'll measure how bad this first line is; here, the measure of how bad it is will be how many points we misclassify. We'll call that number of misclassified points the error. This line, for example, misclassifies two points, one red and one green, so we say it has two errors.

So again, as with linear regression, we move the line around and try to minimize the number of errors using gradient descent. If we move the line a bit in this direction, we start correctly classifying one of the points, bringing the number of errors down to one, and if we move it a little more, we correctly classify the other one, bringing the number of errors down to zero. In reality, since the gradient descent method uses calculus, it turns out that what we need to minimize is not the number of errors itself, but something that captures it, called the log loss function. The idea behind the log loss function is that it assigns a large value to misclassified points and a small value to correctly classified points.
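Here is a minimal sketch of evaluating a candidate line with log loss, using hypothetical points and lines of my own choosing. Each point's distance from the line is pushed through a sigmoid to get a probability of "green", and misclassified points contribute large loss terms:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical labeled points: (x, y, label), 1 = green, 0 = red
points = [(1, 1, 0), (2, 1, 0), (1, 2, 0),
          (3, 3, 1), (4, 3, 1), (3, 4, 1)]

def log_loss(a, b, c):
    """Log loss of the line a*x + b*y + c = 0 on the points above.
    Misclassified points contribute large terms, well-classified
    points contribute small ones."""
    total = 0.0
    for x, y, label in points:
        p = sigmoid(a * x + b * y + c)  # predicted P(green)
        total += -math.log(p) if label == 1 else -math.log(1 - p)
    return total

# A line oriented the wrong way vs. one that separates the data:
print(log_loss(-1, -1, 5))   # poor line: high loss
print(log_loss(1, 1, -5))    # good separator: much lower loss
```

Gradient descent on this function nudges a, b, and c step by step toward the low-loss line, just as it nudged m and b for linear regression.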
Okay, let's look more carefully at this model for accepting or rejecting students. Say we have a student 4 who got a 9 on the test and a 1 in grades. This student gets accepted according to our model, since they land on top of the line, but that seems wrong: a student with very low grades shouldn't get accepted no matter what their test score was. So maybe it's simplistic to think this data can be separated by just one line, right? Maybe the real data looks more like this, where the students over here, who got a low test score or low grades, don't get accepted. Now it seems like a single line won't cut the data in two. So what's the next thing after a line? Maybe a circle; a circle could work. Maybe two lines; that could work too. Actually, it looks like two lines work better, so let's go with that.

Now the question is: how do we find these two lines? Again, we can do it using gradient descent to minimize a similar log loss function. And this model is called a neural network. Why is it called a neural network? Well, let's see. We have this green area here, bounded by the two lines. This area can be constructed as an intersection, namely the intersection between the green area on top of one of the lines and the green area to the right of the other one. So we're going to graph it like this: we have two nodes, where each node is a line that separates the plane into two regions, and from the two nodes we get the intersection, which is the desired area. The reason this is called a neural network is that it mimics the behavior of the brain. In the brain we have neurons that connect to each other and either fire electricity or not. They resemble the nodes in our graph, which split the plane into regions, fire if a given point belongs to one of those regions, and don't fire if it doesn't.

So we can think of logistic regression as a ninja who will look at your data and cut it in half based on the labels, and we can think of a neural network as a team of ninjas who will look at your data and cut it into regions based on the labels.
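The two-nodes-plus-intersection picture can be written as a tiny network. Here the weights are set by hand purely to show the structure (two hypothetical lines of my choosing); in practice they would be learned by gradient descent as described above:

```python
# A tiny 'neural network': two linear nodes, each checking one side
# of a line, combined by an AND at the output node.
def node(w1, w2, bias, x, y):
    """Fires (returns True) when the point is on the positive side."""
    return w1 * x + w2 * y + bias > 0

def network(x, y):
    # Node 1: above the line y = 2  ->  0*x + 1*y - 2 > 0
    above = node(0, 1, -2, x, y)
    # Node 2: right of the line x = 3  ->  1*x + 0*y - 3 > 0
    right = node(1, 0, -3, x, y)
    # Output node: the intersection of the two green half-planes
    return above and right

print(network(5, 4))  # inside both regions -> True
print(network(1, 4))  # left of x = 3      -> False
```

The output node is itself just another neuron: it fires only when both input neurons fire, which carves out the green corner region.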
Okay, so let's dive a bit deeper into the art of splitting data in two. Look at these points, three green and three red. There seem to be many lines that can split them: for example, there is this yellow line and this purple line. Quiz: which of these two lines do you think cuts the data better, the purple one or the yellow one? Well, the yellow line seems close to failing: it's too close to two of the points, so if we were to wiggle it a little, we would misclassify some of them. The purple one, on the other hand, seems nicely spaced, as far as possible from all the points. So it seems the best line is the purple one.

Now the question is: how do we find the purple line? The first observation is that we don't really need to worry about the points that are far from the boundary, so we can forget about them and focus only on the points that are close. And what we use now is not gradient descent but linear optimization, to find the line that maximizes the distance to the boundary points. This method is called a support vector machine. You can think of a support vector machine as a surgeon who will see your data and cut it, but who first carefully looks for the best way to separate the data in two, and only then makes the cut.
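The quantity being maximized, the distance from the line to its closest point, is easy to compute directly. This sketch only scores two hand-picked candidate lines on hypothetical points; a real SVM solves an optimization problem to find the best line automatically:

```python
import math

# Hypothetical points near the boundary: two green, two red
boundary_points = [(1, 2), (2, 3), (3, 1), (4, 2)]

def margin(a, b, c):
    """Distance from the line a*x + b*y + c = 0 to the closest point."""
    norm = math.hypot(a, b)
    return min(abs(a * x + b * y + c) / norm for x, y in boundary_points)

yellow = margin(1, -1, 0)      # a line that hugs two of the points
purple = margin(1, -1, -0.5)   # a line evenly spaced between the classes
print(yellow, purple)          # the purple line has the larger margin
```

Both lines separate these points, but the purple one leaves more room on each side, so it is the safer cut.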
Okay, now let's say we have these four points arranged like this, and we want to split them. It seems like a line won't do the job, since the red ones are on the sides and the green ones are in the middle, so we need to think outside the box. One way to think outside the box is to use a curve like this to split them. Another way is to think outside the plane: to think of the points as lying in a three-dimensional space. So here are the points on the plane, and here we add an extra axis, the z-axis, for the third dimension. If we can find a way to lift the two green points, then we'd be able to separate them with a plane. So which seems like the better solution, the curve over here or the plane over here? Well, it turns out these two are actually the same method. Don't worry if that seems confusing; we'll get into a little more detail later. This method is called the kernel trick, and it's widely used in support vector machines.

So let's study it in more detail, starting with the curve trick. Let's put coordinates on the points: this one is the point (0, 3), this one is (1, 2), this one is (2, 1), and this one is (3, 0). What we need is a way to separate the green points from the red points. If the points' coordinates are x and y, then we need an equation in the variables x and y that gives large values for the green points and small values for the red points, or vice versa. So, quiz: which of the following equations could come to our rescue: the sum x + y, the product x times y, or x squared (the first coordinate squared)? This is not an easy question, so let's make a table with the values of these equations on each of the four points, with the four points on the top row and one of the functions in each of the other rows:

  point   (0,3)   (1,2)   (2,1)   (3,0)
  x + y     3       3       3       3
  x * y     0       2       2       0
  x^2       0       1       4       9

We fill in the first row the following way: 0 + 3 is 3, 1 + 2 is 3, 2 + 1 is 3, and 3 + 0 is 3. For the second row we take the products: 0 times 3 is 0, 1 times 2 is 2, 2 times 1 is 2, and 3 times 0 is 0. And for the third row, x squared is the first coordinate squared: 0 squared is 0, 1 squared is 1, 2 squared is 4, and 3 squared is 9.

So which of these equations separates the green points from the red ones? The sum x + y gives 3 at every point, so it doesn't really separate them. x squared gives a different value for every point, but we get 0 and 9 for the red points and 1 and 4 for the green ones, so this one doesn't separate them either. But the product x times y gives 0 for the red points and 2 for the green ones, so that one does the job: it's a function that can tell them apart. That's the equation we're going to use. For the red points (x, y) the product xy equals 0, and for the green points the product xy equals 2. And what separates a 0 and a 2? Well, a 1, so the equation xy = 1 will separate them. And what is xy = 1? It's the same as y = 1/x, and the graph of y = 1/x is precisely this hyperbola over here; that is the curve we wanted. That is the kernel trick.
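The whole table argument fits in a few lines; this just re-checks the quiz answer in code using the four points from the example:

```python
# The four points from the example, and the product x*y that the
# quiz picked out as the separating function.
red = [(0, 3), (3, 0)]
green = [(1, 2), (2, 1)]

print([x * y for x, y in red])    # [0, 0]
print([x * y for x, y in green])  # [2, 2]

# Since xy = 0 on the red points and xy = 2 on the green ones,
# the curve xy = 1 (the hyperbola y = 1/x) separates the classes:
def is_green(x, y):
    return x * y > 1

print([is_green(x, y) for x, y in red + green])
```

The last line prints False for both red points and True for both green ones, confirming that the hyperbola splits the classes.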
Now we can also see it in 3D. Here we have the points (0, 3), (1, 2), (2, 1), and (3, 0), and we're going to consider them in 3-space, using the map that takes the point (x, y) to (x, y, xy). So where does (0, 3) go? It goes to (0, 3, 0), since the product of 0 and 3 is 0. The point (1, 2) goes to (1, 2, 2), so it goes all the way up, since the third coordinate is the height. The point (2, 1) also goes up, to (2, 1, 2), and the point (3, 0) goes to (3, 0, 0). And there we go: we can split them using a plane.

So you can think of a support vector machine with a kernel method as a slightly confused surgeon trying to split some apples and oranges, who all of a sudden comes up with a great idea. The idea consists of moving the apples up and the oranges down, and then successfully making a straight cut between them.
Okay, let's move to the next example. Say we have a chain of pizza parlors, and we want to put three of them in this city. So we make a study and find that the people who eat pizza the most live in these locations, and we need to know the optimal places to put our three pizza parlors. Well, it looks like the houses are nicely split into three groups, red, blue, and yellow, so it makes sense to put one pizza parlor in each of the three clusters. But we're teaching a computer how to do this, and a computer can't just eyeball the three clusters: we need an algorithm. So here's one algorithm that works.

We start by choosing three random locations for the pizza parlors; they're here, where the stars are, red, blue, and yellow. Now it makes sense to say that each house should go to the pizza parlor closest to it. In that case we can color the map like this, where the yellow houses go to the yellow pizza parlor, the blue houses go to the blue one, and the red houses go to the red one. But now look at where the yellow houses are located: it would make a lot of sense to move the yellow pizza parlor to the center of those houses, and the same goes for the blue houses and the red houses. So let's do that: we move every pizza parlor to the center of the houses it serves.

But now look at these blue points: they're a lot closer to the yellow pizza parlor than to the blue one, so we might as well color them yellow. And look at these red points: they're closer to the blue pizza parlor than to the red one, so let's color them blue. Now we repeat the step that sends each pizza parlor to the center of the houses it serves. Then again, these red houses are much closer to the blue pizza parlor, so we turn them blue, and again we move every pizza parlor to the center of the houses it serves. And now we've reached an optimal solution: starting with random points and iterating this process got us to the best locations for the pizza parlors. This algorithm is called k-means clustering.
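The two alternating steps, assign each house to its nearest parlor, then move each parlor to the center of its houses, can be sketched as a bare-bones k-means. The coordinates and starting locations are hypothetical:

```python
# Bare-bones k-means: assignment step, then update step, repeated.
houses = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
parlors = [(5, 5), (8, 8)]          # arbitrary starting locations

def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)),
               key=lambda i: (point[0] - centers[i][0]) ** 2
                           + (point[1] - centers[i][1]) ** 2)

for _ in range(10):                 # iterate until (hopefully) stable
    # Step 1: assign each house to its nearest parlor
    groups = [[] for _ in parlors]
    for h in houses:
        groups[closest(h, parlors)].append(h)
    # Step 2: move each parlor to the mean of its group
    parlors = [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        if g else p
        for g, p in zip(groups, parlors)
    ]

print(parlors)  # one parlor per cluster of houses
```

For this data the parlors settle at the centers of the two obvious clusters, near (1/3, 1/3) and (31/3, 31/3). With unlucky starting positions k-means can settle on a worse arrangement, which is why it's often run several times.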
But now let's say we don't want to specify the number of clusters to begin with; that gives a different way to group the houses. Say they're arranged like this. It would make sense to say the following: if two houses are close, they should be served by the same pizza parlor. So, going by this rule, let's try to group the houses. Which houses are closest to each other? These two over here, so we group them. Now, what are the next two closest houses? These two over here, so we group them. The next two closest houses are these two, so again we group them. The next two closest houses are these two, so we unite their groups. The next two are right here, so we group them, and the next two clusters are here, so we join the groups. The next two closest houses are here, but now let's say that's too far. All we need to do is specify a distance and say: when you reach this distance, stop clustering the houses. And now we get our clusters. This algorithm is called hierarchical clustering.
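The keep-merging-the-closest-groups-until-too-far procedure can be sketched as single-linkage agglomerative clustering with a distance cutoff. The house coordinates and the cutoff are hypothetical:

```python
import math

# Single-linkage agglomerative clustering with a distance cutoff.
houses = [(0, 0), (1, 0), (5, 5), (6, 5), (20, 20)]
max_distance = 3.0

clusters = [[h] for h in houses]    # start with each house alone

def cluster_gap(c1, c2):
    """Distance between the closest pair of houses across two groups."""
    return min(math.dist(p, q) for p in c1 for q in c2)

while len(clusters) > 1:
    # Find the two closest clusters
    pairs = [(i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))]
    i, j = min(pairs,
               key=lambda ij: cluster_gap(clusters[ij[0]], clusters[ij[1]]))
    if cluster_gap(clusters[i], clusters[j]) > max_distance:
        break                       # too far apart: stop clustering
    clusters[i] += clusters.pop(j)

print(len(clusters))  # the two nearby pairs plus the far-off house
```

With these points, the two close pairs each merge, the lone far-off house stays alone, and the loop stops with three clusters; no cluster count was specified in advance.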
So congratulations! In this video we've learned many of the main algorithms of machine learning. We learned to predict house prices using linear regression, to detect spam email using Naive Bayes, to recommend apps using decision trees, to create a model for an admissions office using logistic regression, to improve that model using neural networks, and to improve it even more using support vector machines. And finally, we learned to locate pizza parlors around a city using clustering algorithms.

Many questions may now arise in your head, such as: are there more algorithms? The answer is yes. Which ones should we use? That's not easy: given a dataset, how do we know which algorithm to pick? How do we compare and evaluate two algorithms, and decide which one is better for a given dataset, considering running time, accuracy, and so on? Are there example projects, and real data I can get my hands dirty with? The answers to all these questions, and more, are in the Udacity Machine Learning Nanodegree, so if this interests you, you should take a look at it. Thank you!