Linear Regression Part 1
It's finally time to get into the interesting topics of Machine Learning.
I am sorry it took a while for me to post. My workload at school is hefty, but I'm trying to find as much free time in my schedule as possible.
There are two major branches of Machine Learning: Supervised and Unsupervised. Supervised means that when our model learns and makes guesses, we know what the right answer should be (only for the data we have, of course). This means we have labeled data: we explicitly tell the model what the output should be. In unsupervised learning, we don't tell the program what it should output or what patterns to recognize; it picks up on meaningful information by itself. The line between Supervised and Unsupervised learning can sometimes get blurry.
We will start with Supervised Learning. It's not that Unsupervised is hard, it just helps to start off with Supervised because it's simpler.
We are going to start off the journey into Machine Learning with Linear Regression. I will break this into two posts to keep everything from getting cluttered. In the first post I will talk about the idea of Linear Regression and what it entails. In the second, we will create a Linear Regression model.
There will also be an optional post in between on how to set up your Virtual Machine to keep everything clean.
This post will be about Linear Regression; I will talk about more complex topics in the future. Linear Regression is a really good place to get started with Machine Learning.
First, we need to know what "Linear Regression" means.
Linear Regression is just modeling the linear relationship between two variables. It's like the simple y = mx + b formula you learned in high school: the relationship between the x and y variables.
This looks like the same boring graph you see all the time, but there's a lot more to this than meets the eye. In the figure above, "x" and "y" are just whole numbers from 0 to infinity with no special meaning. In the world of Machine Learning, "x" and "y" carry a lot of information.
Here is an example of this in Machine Learning:
Imagine that you are a real-estate agent and you want to figure out what the price of a house (target house) should be. You do this by looking at other houses near your target house and evaluating what the target house price is depending on the other houses.
There are a lot of factors that affect house prices, but let's just imagine that the only deciding factor of price is the square footage of the houses.
So to figure out the price of the target house, you need to look at the prices and square footage of the houses near you and compare. What you end up doing is creating a rough relationship in your head between the square footage and the price of the houses around you. Then you look at the target house and see how it falls into that relationship.
This is a rough estimate in your head, so let's put it on paper. First, you plot the square footage and prices of the houses you're familiar with in the neighborhood. In this example, you have 14 houses in your neighborhood.
The dots are houses, the X-axis (Independent Variable) is the square footage of the houses and the Y-axis (Dependent Variable) is the price.
Now you want to find the relationship between the houses and price, so you put a line on your graph that looks like the "line-of-best-fit" to you.
The red line tries to best estimate the relationship between the square footage and the price of the houses.
You might say that the relationship could be better defined if a line with curves was used since there are a few outliers and scattered data. And to that, I would say you are absolutely right, but we are not going to worry about that now.
The information above could now be used to estimate the price of a new house for sale in the neighborhood. Let us say that the new house you want to sell is 3321 square feet. Using the graph and the line of best fit, you can approximate what the price of the house could be.
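For instance, if your eyeballed line happened to have a slope of about $100 per square foot and a y-intercept of about $50,000 (made-up numbers, purely for illustration), the estimate for the 3321-square-foot house would be roughly 50,000 + 100 × 3321 ≈ $382,000.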
The idea behind Linear Regression is just that. You try to find the relationship between variables in the most effective way possible.
Of course, that raises the question: what does "most effective" mean?
When the graph was made above, it was a rough estimate. There was no solid logic or math behind it; we just eyeballed it. Linear Regression is using math to find the "line of best fit".
Now, we need to start thinking of this problem as a Machine Learning one. That means that a few terms need to be changed.
Instead of using y = mx + b, we will be using more formal terms:
$$Y = \theta_0 + \theta_1X$$
θ0 is called the bias, which is the same thing as the y-intercept.
θ1 is called the coefficient, which is the same thing as the slope.
The symbol θ is called theta.
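To make this concrete, here is a minimal sketch of what a prediction with those two parameters looks like in code (the θ values are made up, just for illustration):

```python
def predict(x, theta0, theta1):
    """Predicted y for a given x using y = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameters: a $50,000 base price plus $100 per square foot.
print(predict(3321, theta0=50_000, theta1=100))  # 382100
```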
Let's think about how we are going to get a line that fits the data in the best way. We are going to need some way to check how far away our line is from each actual dot. Let's call that difference the "error."
We have a visual idea of what the error is, but we also want a number to describe the error.
That number would have to be the distance between the house (n) and the line.
$$distance = (Actual - Prediction)^2 $$
Notice that the error is squared. That is because the actual value could be below the line or above it, but we don't want a negative error.
That formula would give you the error for one house. The total error for all "m" houses is
$$error = \sum_{n=1}^m distance_n$$
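As a rough sketch, here is what computing that total error could look like in code (the house data below is made up purely for illustration):

```python
def total_squared_error(xs, ys, theta0, theta1):
    """Sum of squared differences between the actual values and the line's predictions."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x
        total += (y - prediction) ** 2
    return total

# Hypothetical square footages and prices for a few houses.
sqft = [1500, 2100, 2800, 3321]
price = [210_000, 255_000, 330_000, 380_000]
print(total_squared_error(sqft, price, theta0=50_000, theta1=100))
```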
The goal of linear regression is to minimize the error. In other words, we want to find the line that is as close as possible to all the dots on the graph.
So how do we do this?
We have many options in front of us. One really popular approach is called "Gradient Descent".
Gradient Descent
There are a few steps to properly execute GD. The first one is to pick a random prediction for a "line of best fit."
Obviously, this is not a good line, so we calculate the error for this particular line. We use the error to update our line for a better prediction.
This improved the line by a lot. In practice, the improvement is not usually this large; the line gets better little by little over many iterations so that we don't overshoot and fail.
By fixing our line incrementally, we have found the "line of best fit" through approximation.
I talked only about the high-level idea of GD, but I didn't get into the math, as it is out of scope for this post.
I would show the formulas here, but they are pretty daunting if I don't explain them more thoroughly.
Don't be too disappointed though: I will talk about GD in more detail later.
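If you're curious anyway, here is a very rough sketch of what those incremental updates could look like, just to show the idea; the starting values, learning rate, and number of steps are assumptions you would have to tune for your own data:

```python
def gradient_descent(xs, ys, learning_rate=1e-7, steps=10_000):
    """Nudge theta0 and theta1 a little at a time to shrink the squared error."""
    theta0, theta1 = 0.0, 0.0  # start with an arbitrary (flat) line
    m = len(xs)
    for _ in range(steps):
        # How far off the current line is, and in which direction, for each parameter.
        grad0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) * 2 / m
        grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) * 2 / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1
```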
Another approach we can take towards this problem is called the "Ordinary Least Squares" method.
Ordinary Least Squares Method
This method comes from the world of statistics. This is a world that we will be visiting very often for many different methods and ideas.
The proof behind this method is understandable, but explaining it here would take too long.
You can read up on this method here.
We will need the means of "X" and "Y" to make this method work; the bar on top of a variable denotes its mean.
$$\theta_1 = \frac{\sum_{n=1}^m(x_n - \bar x)(y_n - \bar y)}{\sum_{n=1}^m(x_n - \bar x)^2}$$
That is how we solve for the slope (or θ1).
Now that we have the slope, we have to find the y-intercept (or θ0).
$$\theta_0 = \bar y - \theta_1 \bar x$$
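Here is a small sketch of those two formulas turned into code (the data is made up, and in practice you would probably lean on a library like NumPy or scikit-learn instead):

```python
def fit_ols(xs, ys):
    """Ordinary least squares for one variable: returns (theta0, theta1)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    theta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
              / sum((x - x_mean) ** 2 for x in xs))
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Hypothetical square footages and prices.
sqft = [1500, 2100, 2800, 3321]
price = [210_000, 255_000, 330_000, 380_000]
theta0, theta1 = fit_ols(sqft, price)
print(theta0, theta1)
```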
By solving for the two variables, we will have found the slope and the y-intercept for the "line of best fit."
"Ordinary Least Squares" and "Gradient Descent" are only two of the many techniques for minimizing error.
What we looked at here was solving for the "line of best fit" when there is only one independent variable. In the case of this post, it was square footage. This is called "Simple Linear Regression" (SLR).
But in real life, we rarely only have one independent variable. For predicting house prices we would need more information such as the number of rooms, windows in each room, type of flooring, etc. This is called "Multiple Linear Regression" (MLR).
You might notice that some of these attributes cannot be represented directly as a numerical value. To get around this we use "one-hot" encoding, which I will talk about in more detail in later posts.
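Just as a quick taste of what that looks like (a made-up attribute with three categories):

```python
# Each category of a non-numeric attribute (e.g. flooring type) gets its own 0/1 column.
categories = ["carpet", "hardwood", "tile"]
one_hot = {c: [1 if c == other else 0 for other in categories] for c in categories}
# {'carpet': [1, 0, 0], 'hardwood': [0, 1, 0], 'tile': [0, 0, 1]}
```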
It doesn't matter if you have only one independent variable or one hundred, the idea remains the same: all we're trying to do is minimize the error.
Instead of having just one number to represent "x," we will have vectors. But let's not get into that now.
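Just as a peek, the formula simply grows by one term per variable:
$$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_n X_n$$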
I want to show you guys how to actually implement this idea, but there is not enough time to do it in this post. In my next post, I'll talk about how to set up environments in your "VM" (virtual machine) to keep everything clean. After that, we'll get into the code.
If you understood the ideas here, the code is not very hard. All the data you need will also be provided in that post.
Phew... we're done. This is what you need to know to understand linear regression. I know I didn't get into the math, and I'm sorry about that. But it would have been too complicated to explain it here. I'll include a few resources if you want to dive into the math, but don't get bogged down by it. The hardest part is the notation.
You guys probably read this post in like 10 minutes, but let me tell you, it took so long to write this. It was a lot of fun, but it took a while. I'm sorry about that. I'll try my hardest to not be so late next time.
Now I understand why YouTubers take so long to upload videos.
Resources:
- Linear Regression
- Linear Regression Analysis by George A. F. Seber and Alan J. Lee
- Gradient Descent
- Ordinary Least Squares
Until next time, ladies and gentlemen.