# Support Vector Machine Intuition

#### by andy

## Introduction

One of the most common machine learning algorithms is the Support Vector Machine, or SVM. SVMs let you make predictions about new data based on patterns in data you've already seen; we'll walk through an example shortly.

In this post I’m going to walk you through the concept and intuition behind SVMs — to understand the content here, you need no technical background. I’ll be following up with another post containing the math behind SVMs, which will be substantially more advanced. Stay tuned!

## Example 1

Let’s imagine that we have a dataset of houses that have been on the market, with three pieces of information for each: the house size in square feet, the list price of the house, and whether the house has been sold or not. Here’s a graph of that data¹:

Now let’s say that I give you the size and cost of a new house, as below:

If I asked you to predict whether you think the house has sold or not, you’d have a pretty clear answer: yes, it has probably sold. This simple bit of pattern recognition is exactly what we’re trying to get a computer to do when we talk about writing a machine learning algorithm!

Now, if I pushed you to describe the pattern you saw, you might say that there seems to be a sort of dividing line between the sold houses and the unsold houses. You might draw something like this and say that it seems like everything above the line is unsold and everything below the line is sold.

This is a fairly simple observation, but as it turns out, this is exactly what an SVM algorithm is doing on a basic level!

Effectively, what SVMs do is take your data and draw the line (or hyperplane, but we’ll get to that in a minute) that divides your dataset into groups of positive (sold) and negative (unsold) observations. Then, when you feed it a new data point, the algorithm figures out which side of the line the data point is on and spits back the predicted classification!

Of course, I’m simplifying the algorithm a great deal here, but this is the basic idea of a Support Vector Machine algorithm.
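To make the "which side of the line?" idea concrete, here's a minimal Python sketch. The slope and intercept below are made-up numbers standing in for a learned dividing line, not values an SVM actually produced from real housing data:

```python
# A hypothetical dividing line for the house example, written as
# price = m * size + c.  Points below the line are predicted "sold",
# points above it "unsold".  The slope and intercept are made up.

def predict(size_sqft, price_thousands, m=0.15, c=50):
    line_price = m * size_sqft + c   # price the line assigns to this size
    return "sold" if price_thousands < line_price else "unsold"

print(predict(2000, 250))  # below the line: "sold"
print(predict(1000, 400))  # above the line: "unsold"
```

The real algorithm's job is choosing that slope and intercept from the training data; once the line is fixed, prediction is just this one comparison.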

## Example 2

The example above is all well and good, but it has one characteristic that makes it a bit too simple: the fact that I can draw a line that divides the dataset perfectly, with no observations falling on the wrong side of the line. That dataset is what we call a *linearly separable* dataset, though in reality, datasets are rarely linearly separable. Let’s take the same example but make the data just a touch more realistic.

Now we can’t draw a line that perfectly divides the dataset into sold and unsold houses! On the conceptual end of things, this actually doesn’t change much for us. If I asked you to predict the status of the “new” data point again, you’d probably still tell me that it has been sold.

If I asked you to show me the pattern again, you’d probably still draw a line that looks something like the one below and say that anything below the line has *probably* been sold and anything above the line has *probably* not been sold.

Maybe you’d be a little less confident with your predictions this time around, but it still seems like the obvious pattern.

It turns out that this is pretty much how an SVM works with non-linearly separable data, too. It uses what is called a regularization parameter to draw the best dividing line it can, and then it predicts new data points based on which side of the line they lie on.
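For the curious, here's a toy sketch of that trade-off in plain Python. It trains a soft-margin SVM by gradient descent on the hinge loss, where the constant `C` is the regularization parameter: larger values punish points on the wrong side of the line more heavily, smaller values tolerate a few mistakes in exchange for a wider margin. The data, learning rate, and `C` value are all made up for illustration:

```python
import random

random.seed(0)
# Two fuzzy clusters that overlap a little: "unsold" houses around
# (-1.5, -1.5) and "sold" houses around (+1.5, +1.5), labels -1 and +1.
data = ([((random.gauss(-1.5, 1.0), random.gauss(-1.5, 1.0)), -1) for _ in range(20)] +
        [((random.gauss(+1.5, 1.0), random.gauss(+1.5, 1.0)), +1) for _ in range(20)])

def train(data, C=1.0, lr=0.01, epochs=500):
    """Fit a line w.x + b = 0 by gradient descent on the soft-margin objective."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw = [w[0], w[1]]           # gradient of the margin-width term
        gb = 0.0
        for x, label in data:
            # Points inside the margin or on the wrong side contribute,
            # weighted by the regularization parameter C.
            if label * (w[0] * x[0] + w[1] * x[1] + b) < 1:
                gw[0] -= C * label * x[0]
                gw[1] -= C * label * x[1]
                gb -= C * label
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

w, b = train(data)
correct = sum((w[0] * x[0] + w[1] * x[1] + b > 0) == (label > 0) for x, label in data)
print(f"training accuracy: {correct}/{len(data)}")
```

Because the clusters overlap, the line usually can't classify every training point correctly, and that's fine: the goal is the best dividing line, not a perfect one.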

## More Features

In the examples above, we only had two features: the house size and the house price. Because of this, we could plot the data in two dimensions and draw a line to divide it. But SVMs work with any number of features, whether you have 1, 2, 3, or thousands. With our two features, a dividing line was enough. With 3 features, we need a dividing *plane* instead. Here’s what that might look like:

Then, intuitively, we can classify new data points by determining which side of the plane they fall on.

Once we get beyond 3 features, we can no longer effectively visualize the data (or at least, *I* can’t see 4-dimensional plots!). With four or more features, the SVM draws what is called a hyperplane: the higher-dimensional analogue of a line or a plane. But just like the line in 2-d or the plane in 3-d, the hyperplane divides the feature space in half and lets you classify new points by determining which side of the hyperplane they fall on.
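The nice thing is that the "which side?" test looks exactly the same in any number of dimensions: the hyperplane is the set of points where a weighted sum of the features plus an intercept equals zero, and the sign of that sum tells you the side. Here's a sketch; the four features, weights, and intercept are hypothetical:

```python
# "Which side of the hyperplane?" in any number of dimensions.
# A hyperplane is w . x + b = 0; the sign of w . x + b picks the side.

def side_of_hyperplane(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score > 0 else "negative"

# Four hypothetical house features: size (sq ft), price ($1000s), age, bedrooms.
# The weights and intercept are made up, not learned.
w = (0.004, -0.02, -0.1, 0.5)
b = 1.0
print(side_of_hyperplane((2000, 300, 10, 3), w, b))  # "positive"
```

Nothing in that function cares whether `x` has two entries or two thousand, which is why the same picture carries over to high-dimensional data.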

## Conclusion

In the end, the concept behind a Support Vector Machine algorithm is pretty simple: draw something that divides your training data² into positive and negative samples as well as possible, then classify new data points by determining which side of the dividing line (or hyperplane) they lie on.

One note I want to make clear: it’s impossible to guarantee that you’ll predict correctly, of course! Even if your training data is perfectly linearly separable, a new data point could lie on one side of the hyperplane while actually belonging to the class on the other side. Machine learning algorithms can only give you educated guesses about how your data should be classified; they can’t definitively tell you one way or the other.

There are a ton of directions to go from here, like using an SVM to predict multiple classes (rather than just classifying houses as sold or unsold, you could classify flowers as red, yellow, or blue) or optimizing an SVM in various ways to apply to certain problems. In an upcoming post I’m going to go over the math involved in what I discussed in this post, and then perhaps I’ll explore one of these directions.

I’ve attempted to explain SVMs in a conceptual way here, and in doing so, I’ve made a number of simplifications and skipped over parts of the algorithm. Please take this information for what it is: an intuitive way to think about Support Vector Machines, not a rigorous examination.

I’d love to hear what you think about this post (or if I made any mistakes!) — feel free to email me or comment below!

Andy, good intro to SVMs. Quick question about SVMs: are they limited to binary classification?

I’m enjoying many of your posts. I wonder if you’ll be considering a career as a teacher because you have a gift for explanation.

Chris, thank you for the comment! A normal SVM is limited to binary classification, but there are ways to extend the concept to allow them to do multi-category classification. (And I don’t think I’ll be a teacher but I appreciate the compliment!)

Your future may not include you becoming a teacher in a professional sense, but you are definitely teaching people.

Personally, I’ve only had the patience to teach those that are interested (and engaged) in the subject matter I was explaining. I work in a very technical profession (an engineer for a wireless telecom), and metaphors have always been my friend when trying to explain something technical to a lay person (which are typically the types I don’t like to “teach”).

My point is this: If you can explain the concept of software predictive pattern recognition to lay people (just like you did here), you will likely go very far in whatever career you pursue.

BTW, You taught me something I didn’t know. Thank you.

Jeremy, I really appreciate your comment. Thanks for stopping by.

Chris, there are some packages like ‘libsvm’ and ‘lssvm’ for multi-class classification in MATLAB.

Jeremy, you are absolutely right. Andy, you explained the concept very well for lay people.