Perceptron is the most basic form of a neural network and therefor a good place to start. Perceptrons are a type of binary classifiers — given data representing some object as an input, a perceptron determines to which class the object belongs. It is called ‘binary’ because it only has two classes to choose from. These classes can be, for example ‘cat’ and ‘dog’, but they can also be, in the more general case, ‘cat’ and ‘not a cat’.
Let’s build an example that will remain with us throughout this journey:
Let’s say you’re a movie enthusiast and you want a method to determine whether a given movie in IMDB is worth watching. You want a calculation that, based on movie characteristics, will give you either ‘1’ — go see the movie, or ‘0’ — don’t waste your time. So you take 100 movies you’ve already watched (and either liked or didn’t like), and review their characteristics. Based on your research, you came up with the following formula for your personal movie preference:
(where IMDBScore is normalized to values between 0 and 1).
You chose the coefficients in such a way, that will maximize the correlation between movies you liked (out of the 100) and a positive personal score (we’ll discuss later how to get those coefficients right). Now, each time a score is ≥ 0, you turn it into 1. each time the score is < 0, you turn it into 0 (this is called the Heaviside step function), and based on that you decide to watch the movie or play the guitar instead. Before we continue, let’s use a shorter notation for the computation we just described:
where the w values are called the weights (and it’s a vector with 2 elements: [0.7, 0.1], the x values are the elements of the input vector ([IMDBScore, BradPittMinutes]), and b is a bias term (equal to -2.0). The sigma means that we sum over all the terms xi, wi (where i goes from 0 to 1. We like to count elements starting at 0). If we plot the personal score for all input combinations using the weights and bias we defined, (before applying the step function), we get the following 2D surface:
The vertical axis is our personal score (before applying the step function). Points on the surface that are at Personal Score ≥ 0 indicate a movie worth watching, and those with score <0 indicate a movie that isn’t worth watching. After applying the step function, the non-negative values become 1 and the negative values become 0, and the plot becomes:
and now the personal score for every given set of coordinates is either 1 or 0, corresponding to whether you should or shouldn’t watch the movie whose features are described by that set of coordinates.
If we look at the last 3-D plot from above (and rotate it, to get our axes right), we find something interesting:
In this 2-D plot we see a distinct border between the positive (yellow), and zero (dark blue) outputs.
(A friend said that on her screen it looks purple, but to me it seems dark blue so we’ll stick with that).
That border defines the boundary in our data space (the IMDB score and Brad Pitt minutes plane) between the two classes: ‘movie worth watching’ and ‘movie isn’t worth watching’. This makes it very easy to examine what to do with each movie we want to test.
Let’s test this with the movie 12 Monkeys (1995). It has an IMDB score of 8 (which we normalize to 0.8), and (let’s say) 18 Brad Pitt minutes. The calculation boils down to a PersonalScore = 0.56 + 1.8 -2.0 = 0.36 > 0, which means — ‘movie worth watching’ (an excellent movie, by the way). Alternatively, we could take a quick look at the coordinate (0.8, 18) in the last figure and see that it falls in the yellow side of the classifier boundary.
On the other hand, the movie The Room (2003) scored 3.7 on IMDB and doesn’t have Brad Pitt in it, giving it a PersonalScore = 0.259 + 0 -2.0 = -1.741. Take a look at coordinates (0.37, 0) and see that it’s far into the dark blue side which means you should opt the guitar.
The decision system you built is a perceptron.
The perceptron, introduced in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, is a binary classifier and is also the simplest form of a neural network.
The perceptron receives an input vector (whose elements are also known as features) characterizing the object we want to classify, and emits a single value as its answer. The perceptron’s output is binary, i.e., it can be either 0 or 1.
A graphical way to describe the perceptron we built looks like:
where the blocks inside the gray rectangle are components in the perceptron, and the orange boxes are input features (in our case x0 is the normalized IMDB score and x1 is Brad Pitt minutes. The yellow boxes are the perceptron’s weights (in our case 0.7 and 0.1), the green box is the bias (-2.0), and the blue circles are the familiar arithmetic operations ‘X’ and ‘+’.
In general the input vector can have any number of features, and there is exactly one weight element for each input feature. But there will always be just one bias term (which, in some cases, can have a value of zero). The weights and bias are what comprises the equation our mechanism is solving, and they are called the parameters of the model. We can say that the more parameters our perceptron has, the more complex our model.
How are the model parameters learned?
So far we just assumed that we somehow found the best values to use for the weights and bias terms. but how do we find them? Can we come up with a way for the perceptron to learn its optimal parameter values? This is where AI comes in.
Let’s lay out our strategy first:
To enable our perceptron to learn the best parameters, or to train, we’ll be using a concept known as ‘supervised learning’:
We start with a set of data objects to which we already know what the correct answers are (also referred to as ‘labels’). In the movie preference example, the data will be the 100 input vectors for the 100 movies you’ve already watched, and the label for each data entry is either 1 or 0, depending on whether you liked or didn’t like it. Correct labels that are supplied with a dataset are also known as ‘ground truth’. This set of labeled data is called the training set.
Now that we’ve set up our training set, we want to initiate the perceptron’s parameters to some random (or zero) value. Then, we feed it with data from the training set, and incrementally change the parameters according to whether our perceptron returned the correct label at its output. If the perceptron’s output (be it 0 or 1) is identical to the ground truth then it means the perceptron is correct and no change will be made. If the perceptron’s output is different from the ground truth, then our algorithm will modify the parameters to give the correct answer (or something closer to the correct answer) for that specific data entry. So, returning to our previous example with the movies, we start with the training set points (or vectors) in our data space (I plotted 8 data points but let’s assume it’s 100):
And start iterating over the points. Each configuration of the perceptron’s parameters will give a boundary, like this one:
Which is clearly not optimal, since it got 2 movies wrong: one green dot (movie known to be good) is mistakenly labeled in the blue half plane, and one red dot (movie known to be bad) is labeled in the yellow half-plane. But after a few iterations, or a few epochs (a full pass through our training set) hopefully, the model parameters will settle down with a boundary that satisfies all the data points:
When this goal is achieved, we want our perceptron to stop the training process and remain with the final parameters.
The reason this method is called ‘supervised learning’ is because the training process is supervised: our model’s output is continually compared with the ground truth and rewarded based on the resemblance of its results to the ground truth.
How do we do it? Let’s get technical.
Firstly, note that the sigma operation we introduced earlier, is, in fact, a dot product between two vectors —a data vector and a weight vector.
Secondly, recall that each data point is a vector (a vector always starts at the origin) in the data space, and that the dot product of two vectors (a0, a1)(b0, b1) is equal the cosine of the angle between the two vectors multiplied by the product of the vector lengths:
And a cosine function is equal to +1 when the angle is 0 (two vectors pointing at the same direction) and gradually falls off as the angle increases, until it reaches -1 when the vectors point in opposite directions.
Thirdly, recall that the elements of a vector are the also coefficients of the plane equation for a plane that is normal to the vector and crosses the origin. If our vector is (w0, w1) and we call our space directions x0 and x1, then the equation for a plane perpendicular to that vector is:
In other words, we want to find a weight vector whose normal plane can serve as a boundary between our two training set data classes. The correct position of that boundary will be such that for every data point whose ground truth is ‘1’, its dot product with the weights (plus the bias) will be positive, and for each data point with ground truth ‘0’, its dot product with the weight vector (plus the bias) will be negative (The Heaviside step function will turn the positive values to ‘1’ and negative values to ‘0’).
To do an actual training example, I want to start with a simpler scenario than the example with the movies — a scenario that doesn’t require a bias term — and after we’re done, we will discuss why we sometimes need a bias term and what this term means in the feature space.
So our new scenario has 4 training set data points, and note that the data space vector, (x0, x1) has values that can be both positive and negative (I added dotted arrows to remind us that those data points are vectors. You can keep imagining them in the following illustrations as well).
We’ll refer to our training set as pairs of data points p and ground truth lable d in the following format: p :d. So we have:
(-10, 5) :0, (-2, 3) :0, (10, 11) :1, (12, 6) :1
We start by initializing our weight vector to (w0, w1) = (0, 0), and then repeat the following procedure:
Let’s say our first data point is (-2, 3), with the label 0. The perceptron’s output is y=Heaviside((0, 0)(-2, 3)) = Heaviside(0) = 1. This is a mistake since d = 0 and we enter the second ‘if’ condition, where we modify the weights to be (0, 0) — (-2, 3) = (2, -3). This creates a weight vector pointing away from our data point, with a normal plane that puts that data point in the negative (correct) side of the dot product (or zero side of the perceptron’s output):
Now, our procedure moves to the next data point. Let’s say it’s the lower green one — (12, 6) with label 1. Our calculation gives y=Heaviside((2, -3)(12, 6)) = Heaviside(24–18) = Heaviside(6) = 1, which is the correct label for that point, so no modification is made (as we could expect from the last illustration). We choose another point, let’s say the top green one: (10, 11) with label 1. The perceptron’s output is y=Heaviside((2, -3)(10, 11))= Heaviside(20–33) = Heaviside(-13) = 0, while the label is d=1. So we enter the first ‘if’ condition and update the weight vector to (2, -3) + (10, 11) = (12, 8).
When running the procedure with the last remaining (red) data point, no modification is required since it’s correctly labeled as ‘0’, and we start a new epoch, where, sooner or later we encounter the red data point (-2, 3) which is labeled incorrectly because its center is slightly in the yellow half plane. According to our algorithm the weight vector is modified to (12, 8) — (-2, 3) = (14, 5) — this shifts the boundary a little bit clockwise, and fixes the problem.
Now all of our training points are labeled correctly. When we run through epoch 3 without any modifications, the training procedure stops and we have our finalized model parameters: w0 = 14, w1 = 5.
Congratulations! Now you know how to train a basic neural network, one which contains a single perceptron!
The model in the last example had 2 features, but a pereceptron can take any number of features as input. The corresponding model will have one weight element for each feature (and a bias term). A multi feature model is impossible to capture graphically in one plot, but the math is valid nonetheless.
When is the bias required?
A bias is a term that is added to the vector dot product and shifts the result by a fixed value, independent on the input.
Our last example didn’t require and didn’t use a bias. But let’s examine the next configuration (which is close in its spatial distribution to our movie example):
Here, there is no boundary plane (or line) that both separates the two classes and passes through the origin (the second requirement is because the weight vector always starts at the origin). The clever solution is to add a dimension to the problem:
By adding a height feature of magnitude 1 to all of the data points (we assign this new dimension the name x2), we are able to create a plane that passes through the origin and neatly divides the spaces between the two classes. Now, each data point Pi becomes (x0i, x1i, 1). Our weight vector will also have 3 elements to account for the new dimension: (w0, w1, b). Now the dot product between the weight vector and the data vector becomes:
showing that the weight component in the new, uniform dimension is exactly the bias term! So we see how the bias term can be treated as a weight element, and this goes for the training algorithm that we introduced earlier as well. Just remember that each data point now has another dimension with magnitude 1.
You might have noticed that alternatively, we could have normalized our feature space in a way that puts the two classes on opposite sides of the origin (using the following transformation: f((x0, x1))=(x0–6, x1–6)), and return to the simple case where no bias is required. While this is true, when working with a multi-feature space, it’s not always easy to see an option for this type of manipulation, so it’s better to use a bias — let the machine find the solution for you!
The training process described above is guaranteed to converge (i.e. finish successfully in a finite number of steps) if the data is linearly separable (i.e. if there indeed exists a plane that separates the two classes). We will not prove it here but you can take a look at this paper for a discussion about the subject.
Sometimes the original feature data is not linearly separable, but with some manipulation we can transform it into a linearly separable problem, by transforming the features — a procedure that may require some experience and intuition. For example:
There is no plane separating between the green and the red groups. A circle will do the job, but we can’t form a circular boundary using a perceptron. What we can do is represent the data in another way: if we perform the following transformation: x0new = sqrt(x0² + x1²), we get a configuration which is easily separable with a plane:
This concludes our chapter about perceptrons — Hopefully it was clear. Comments and questions are welcome.