Support Vector Machine Intuition

by andy

Introduction

The Support Vector Machine, or SVM, is a very common machine learning algorithm. An SVM lets you predict which category a new data point belongs to, based on examples you’ve already seen; we’ll see an example shortly.

In this post I’m going to walk you through the concept and intuition behind SVMs — to understand the content here, you need no technical background. I’ll be following up with another post containing the math behind SVMs, which will be substantially more advanced. Stay tuned!

Example 1

Let’s imagine that we have a dataset of houses that have been on the market. We’ll imagine that we have three pieces of information for each: the house size in square feet, the list price of the house, and whether the house has been sold or not. Here’s a graph of that data (1):

svm_c_data_unmod

Now let’s say that I give you the size and list price of a new house, as below:

svm_c_data_new

If I asked you to predict whether you think the house has sold or not, you’d have a pretty clear answer: yes, it has probably sold. This simple bit of pattern recognition is exactly what we’re trying to get a computer to do when we talk about writing a machine learning algorithm!

Now, if I pushed you to describe the pattern you saw, you might say that there seems to be a sort of dividing line between the sold houses and the unsold houses. You might draw something like this and say that it seems like everything above the line is unsold and everything below the line is sold.

svm_c_data_line

This is a fairly simple observation, but as it turns out, this is exactly what an SVM algorithm is doing on a basic level!

Effectively, what SVMs do is take your data and draw the line (or hyperplane, but we’ll get to that in a minute) that divides your dataset into groups of positive (sold) and negative (unsold) observations. Then, when you feed it a new data point, the algorithm figures out which side of the line the data point is on and spits back the predicted classification!

Of course, I’m simplifying the algorithm a great deal here, but this is the basic idea of a Support Vector Machine algorithm.
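To make the "which side of the line" step concrete, here is a minimal Python sketch. The slope and intercept of the line, the function name, and the example houses are all made up for illustration; a real SVM would learn the line from the data rather than having it hard-coded.

```python
def predict_sold(size_sqft, price, slope=150.0, intercept=50_000.0):
    """Predict whether a house has sold from which side of a line it falls on.

    The dividing line is price = slope * size_sqft + intercept. Both numbers
    are hypothetical; an SVM would learn them from the training data.
    """
    line_price = slope * size_sqft + intercept
    # Below the line (cheap for its size) -> predicted sold.
    return price < line_price


# Two hypothetical houses of the same size at different list prices.
print(predict_sold(2000, 300_000))  # below the line
print(predict_sold(2000, 400_000))  # above the line
```

The prediction step really is this small: once the line is known, classifying a new house is a single comparison.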

Example 2

The example above is all well and good, but it has one characteristic that makes it a bit too simple: the fact that I can draw a line that divides the dataset perfectly, with no observations falling on the wrong side of the line. That dataset is what we call a linearly separable dataset, though in reality, datasets are rarely linearly separable. Let’s take the same example but make the data just a touch more realistic.

svm_c_data2_new

Now we can’t draw a line that perfectly divides the dataset into sold and unsold houses! On the conceptual end of things, this actually doesn’t change much for us. If I asked you to predict the status of the “new” data point again, you’d probably still tell me that it has been sold.

If I asked you to show me the pattern again, you’d probably still draw a line that looks something like the one below and say that anything below the line has probably been sold and anything above the line has probably not been sold.

svm_c_data2_line

Maybe you’d be a little less confident with your predictions this time around, but it still seems like the obvious pattern.

It turns out that this is pretty much how an SVM works with non-linearly separable data, too. It draws the best dividing line it can, using what is called a regularization parameter to control how heavily points on the wrong side of the line are penalized, and then it predicts new data points based on which side of the line they lie on.
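One common way to make "the best dividing line it can" precise is the soft-margin objective. Each point that is misclassified, or that sits too close to the line, adds a penalty (the hinge loss), and the regularization parameter, usually written C, sets how much those penalties count relative to keeping the line's weights small. The sketch below only scores a candidate line; it does not do any training, and the function names and toy data are assumptions for illustration.

```python
def hinge_loss(w, b, x, y):
    """Penalty for one point; y is +1 (sold) or -1 (unsold)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Zero when the point is on the correct side with room to spare,
    # growing linearly as it moves toward or past the line.
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, points, labels, C=1.0):
    """Quantity a soft-margin linear SVM tries to minimize."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    penalty = sum(hinge_loss(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + C * penalty


# Toy, made-up data as (size, price) in rescaled units.
# The last point is a sold house that sits on the "unsold" side.
points = [(1.0, 3.0), (2.0, 4.0), (3.0, 1.0), (4.0, 2.0), (3.5, 4.0)]
labels = [-1, -1, 1, 1, 1]
w, b = (1.0, -1.0), 0.0  # candidate line: size - price = 0

for C in (0.1, 10.0):
    print(C, soft_margin_objective(w, b, points, labels, C))
```

With a small C the one stray point barely changes the score of this line; with a large C it dominates, which pushes the algorithm toward a line that tries harder to get every point right.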

More Features

In the examples above, we only had two features: the house size and the list price. Because of this, we could draw a nice 2-dimensional plot and draw a line to divide the data. But SVMs work with any number of features, whether there are 1, 2, 3, or thousands. With our two features, we could draw a dividing line. But with 3 features, we need to draw a dividing plane. Here’s what that might look like:

svm_c_data3_plane

Then, intuitively, we can classify new data points by determining which side of the plane they fall on.

Once we get beyond 3 features, we can no longer effectively visualize the data (or at least, I can’t see 4-dimensional plots!). With four or more features, the SVM creates what is called a hyperplane, which is the higher-dimensional generalization of a plane. (Strictly speaking, the line in 2-d and the plane in 3-d are hyperplanes too.) Just like the line or the plane, the hyperplane divides the feature space into two sides and allows you to classify new points by determining which side of the hyperplane they fall on.
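The "which side" check looks the same in any number of dimensions. A hyperplane can be described by one weight per feature plus an offset, and the sign of the weighted sum tells you the side. In this sketch the weights, the offset, and the four features are invented for illustration, not learned from data.

```python
def side_of_hyperplane(weights, offset, features):
    """Return +1 or -1 depending on which side of the hyperplane a point is on.

    The hyperplane is the set of points where
    weights[0]*features[0] + ... + weights[n-1]*features[n-1] + offset == 0.
    """
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    score = sum(w * f for w, f in zip(weights, features)) + offset
    # Points exactly on the hyperplane are assigned to +1 by convention here.
    return 1 if score >= 0 else -1


# Hypothetical 4-feature houses: size, price, bedrooms, age (all rescaled).
weights = [0.8, -1.0, 0.3, -0.2]
offset = 0.1
print(side_of_hyperplane(weights, offset, [0.5, 0.2, 0.6, 0.1]))
print(side_of_hyperplane(weights, offset, [0.2, 0.9, 0.3, 0.5]))
```

The same function handles the 2-feature case from the earlier examples: pass two weights and two features and it is checking which side of a line the point is on.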

Conclusion

In the end, the concept behind a Support Vector Machine algorithm is pretty simple: draw a hyperplane that divides your training data (2) into positive and negative samples as well as possible, then classify new data points by determining which side of the hyperplane they lie on.

One note I want to make clear is that it’s impossible to guarantee that you’ll predict correctly, of course! Even if your training data is perfectly linearly separable, a new data point could lie on one side of the hyperplane but actually belong to the other class. Machine learning algorithms can only give you guesses for how your data should be classified; they can’t definitively tell you one way or the other.
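A common way to see how often those guesses are right is to hold back some houses whose outcome you already know, predict them, and count the matches. The sketch below assumes a classifier is just a function from a data point to a label; the dividing line and the data are made up for illustration.

```python
def accuracy(classify, points, labels):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(1 for x, y in zip(points, labels) if classify(x) == y)
    return correct / len(points)


def classify(point):
    """Hypothetical fixed dividing line: sold if price is below size."""
    size, price = point
    return "sold" if price < size else "unsold"


# Made-up held-out houses as (size, price) in rescaled units,
# together with their true outcomes.
test_points = [(3.0, 1.0), (4.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.5, 4.0)]
test_labels = ["sold", "sold", "unsold", "unsold", "sold"]

print(accuracy(classify, test_points, test_labels))  # 4 of 5 correct
```

The last house sold even though it sits above the line, so the classifier gets it wrong. That is exactly the kind of miss described above.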

There are a ton of directions to go from here, like using an SVM to predict multiple classes (rather than just classifying houses as sold or unsold, you could classify flowers as red, yellow, or blue) or optimizing an SVM in various ways to apply to certain problems. In an upcoming post I’m going to go over the math involved in what I discussed in this post, and then perhaps I’ll explore one of these directions.

I’ve attempted to explain SVMs in a conceptual way here, and in doing so, I’ve made a number of simplifications and skipped over parts of the algorithm. Please take this information for what it is: an intuitive way to think about Support Vector Machines, and not a rigorous examination.

I’d love to hear what you think about this post (or if I made any mistakes!) — feel free to email me or comment below!


  1. This data isn’t real — I randomly generated it. 

  2. The “training data” is the labeled data you have at the beginning, when you’re building your model. You then run the model on separate “test data” that it hasn’t seen, to check how well it predicts.