Machine Learning - Linear Regression

Daniel Levin
8 min read · Jun 10, 2023

In this article, we will study linear regression using a commonly encountered example: house price prediction. We will assume that the price of a house is determined solely by its size, measured in square meters (m²). Additionally, for the time being, let's assume that all houses within a particular area share the same price per square meter.

Under these assumptions, we can establish a mathematical model that relates the price of a given house to its size. This model can be represented as f(x) = θ₀ + θ₁x, where f denotes the total price of the house based on its size x in square meters. Here, θ₀ represents an initial value, and θ₁ corresponds to the price of 1 square meter (price per unit). In the context of machine learning, the house price serves as the “label” or target variable to be determined, while x represents a “feature” on which the label depends.
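In code, this model is nothing more than an affine function. A minimal sketch, with θ₀ and θ₁ as placeholders to be determined later:

def f(x, theta0, theta1):
    # Predicted price for a house of size x: f(x) = θ0 + θ1·x
    return theta0 + theta1 * x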

For instance, if the price per square meter in a specific area is 2000 (with no initial value), the price of any house in that area can be calculated using the formula f(x) = 2000x, where x > 0. Plotting the house prices against their respective sizes would yield the following graph:
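A short sketch that reproduces this graph; the size range is chosen arbitrarily:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 300, 100)   # house sizes in m² (arbitrary range, x > 0)
price = 2000 * x               # f(x) = 2000x, i.e. θ0 = 0 and θ1 = 2000

plt.plot(x, price)
plt.xlabel('size/m²')
plt.ylabel('Price')
plt.title('f(x) = 2000x')
plt.show()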

If one knows the size of a house, it is possible to determine the price of that house by substituting the size into the equation or by reading the corresponding value on the y-axis of a graph.

However, in the real world, the price of a house is influenced by multiple factors, commonly referred to as features. Even in a simplistic model that considers only the size of a house as a determining factor for its price, the prices would still vary from one house to another within the same area.

To illustrate this point, let’s examine the house prices provided in the following table [1]:

Since the data is from the United States, the house sizes are given in square feet rather than square meters. The analysis remains the same: we focus on the "sqft_living" column and disregard the other variables for now. Moreover, to narrow the analysis to a specific area, we limit the data to houses located in the city of Sammamish, which lets us investigate the housing-market dynamics within that particular region.

Within Sammamish, it is apparent that the price per square foot is not constant; it varies from house to house. However, if we only seek an approximate value for a given house in that city, we can still employ a simple approach without resorting to linear regression.

One way to estimate the price of a particular house is by determining the mean value of the price per square foot within the city of Sammamish. We can calculate this by taking the average of all the price_per_sqft values available in the dataset. Once we have the mean price per square foot, we can multiply it by the size of the given house to obtain an approximate value for that specific property.

While this approach provides a rough estimate, it is important to note that it does not take into account other factors that may significantly impact the house price, such as location, condition, amenities, and market trends. Thus, the result obtained through this simplified method is for educational purposes only.

import pandas as pd

# Load the Kaggle house-prices dataset [1]; the file name is assumed here
data = pd.read_csv('data.csv')

data['price_per_sqft'] = data['price'] / data['sqft_living']
filtered_data = data[data['city'] == 'Sammamish']
subset = filtered_data[['price', 'sqft_living', 'price_per_sqft']]
subset.head()  # first rows of the subset (displayed in a notebook)
price_mean = subset['price_per_sqft'].mean()

# price_mean = 251.25

Setting θ₁ = 251.25 (the mean value), one gets the following plot:
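A minimal sketch of that plot, assuming filtered_data and price_mean from the snippet above:

import matplotlib.pyplot as plt

# Naive model: θ0 = 0 and θ1 = mean price per square foot
plt.scatter(filtered_data['sqft_living'], filtered_data['price'], color='red', marker='x')
plt.plot(filtered_data['sqft_living'], price_mean * filtered_data['sqft_living'],
         color='green', label='f(x) = 251.25x')
plt.xlabel('size/sqft')
plt.ylabel('Price/$')
plt.title('Mean price-per-sqft model')
plt.legend()
plt.show()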

The objective of this article is to construct a regression model and leverage it to forecast the price of a random house within the specified area. To begin, let us examine the dataset through a scatter plot visualization.

import matplotlib.pyplot as plt

plt.scatter(filtered_data['sqft_living'], filtered_data['price'], color='red', marker='x')
plt.xlabel('size/sqft')
plt.ylabel('Price/$')
plt.title('House prices in Sammamish')
plt.show()

In the plotted graph, it is evident that most values cluster around a hypothetical line, while a few points deviate from this trend. Points whose prices lie far from the main cluster are referred to as outliers.

With this dataset in consideration, our objective is to determine the optimal model, denoted as f(x) = θ₀ + θ₁x, for predicting the price of any new house not included in the dataset. Note that while x is the user-provided feature from which the label f(x) is computed, it is the parameters θ₀ and θ₁ that define the linear function and thereby shape the line depicted in the upcoming plot.

from scipy.stats import linregress

x = filtered_data['sqft_living']
y = filtered_data['price']

# Ordinary least-squares fit: slope corresponds to θ1, intercept to θ0
slope, intercept, r_value, p_value, std_err = linregress(x, y)
regression_line = slope * x + intercept

plt.scatter(x, y, color='red', marker='x')
plt.plot(x, regression_line, color='blue', label='Regression Line')
plt.xlabel('size/sqft')
plt.ylabel('Price/$')
plt.title('House prices in Sammamish')
plt.legend()
plt.show()

To determine the function that provides the best fit for the given data, we need to find the appropriate values for θ₀ and θ₁ in equation f(x) = θ₀ + θ₁x. The goal is to obtain values that enable accurate price predictions for houses not present in the dataset.

Given that the data points do not align in a perfectly straight line, it is impossible to find a straight line that passes through all the given points. Even if we were to devise a complex mathematical function that managed to pass through every data point, such a model would likely perform poorly in predicting prices for houses outside the dataset (this is known as overfitting).

Therefore, our aim is to find a line that minimizes the distances (errors) to the given data points. We seek a line that optimally approximates the overall trend. In the upcoming discussion, we will delve deeper into the specifics of this approach to gain a clearer understanding.

In the given dataset, there is a specific house with a size of 3690 sqft. The actual price of this house is known to be 865,000, whereas the model predicts a price of 823,970. The resulting discrepancy of 41,030 between the predicted value and the true value represents the error for this particular data point.
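With the slope and intercept from the linregress fit above, this error can be reproduced along the following lines (the quoted numbers come from the article; the exact output depends on the data subset):

# Error for the 3690 sqft house
actual_price = 865_000
predicted_price = slope * 3690 + intercept   # ≈ 823,970 for this fit
error = actual_price - predicted_price       # ≈ 41,030
print(round(predicted_price), round(error))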

Similarly, for each data point in the dataset, there exists an error that signifies the disparity between the value generated by the prediction model and the actual value. As part of creating a machine learning model, it is essential to examine the distribution of the dataset and, consequently, the distribution of the errors.

Various distributions are of interest in this context, including the Normal and Poisson distributions, along with Bayesian approaches; these will be explored in subsequent discussions to gain insight into their relevance and application to the ML model. Let

y₁, y₂, …, yₙ

denote the true prices for the houses given in the dataset. The errors between the model's predictions and the real values are then:

eᵢ = yᵢ − f(xᵢ) = yᵢ − θ₀ − θ₁xᵢ,  i = 1, …, n

As we consider each data point (xᵢ, yᵢ) up to the nth point, the sum of squared errors (SSE) measures the overall discrepancy between the predicted values and the actual values. By squaring the errors, negative and positive errors are treated equally, providing a measure of the overall error magnitude to be minimized in linear regression:

Q(θ₀, θ₁) = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − θ₀ − θ₁xᵢ)²

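To make this concrete, here is a minimal sketch of Q as a function; it assumes x, y, slope, and intercept from the regression snippet above and price_mean from earlier, and compares the fitted line with the naive mean-based model:

import numpy as np

def sse(theta0, theta1, x, y):
    # Q(θ0, θ1) = Σ (yᵢ − θ0 − θ1·xᵢ)²
    errors = y - (theta0 + theta1 * x)
    return np.sum(errors ** 2)

# The least-squares fit should yield a smaller Q than the naive model
print(sse(intercept, slope, x, y))   # fitted regression line
print(sse(0, price_mean, x, y))      # θ0 = 0, θ1 = mean price per sqft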
Let’s get more mathematical

Let’s recall that, at this stage, we have yet to determine the prediction model; the error above could only be stated because the correct answer was known beforehand. As mentioned earlier, we begin by squaring all the errors and summing the squared values. In mathematical terms, finding a minimum or maximum value is an optimization problem that can be addressed by taking the derivative of a function and setting it equal to zero. When we differentiate a sum, we can differentiate each term individually and then add the results. Since Q is a function of two parameters, θ₀ and θ₁ (remember that we are optimizing for θ₀ and θ₁ rather than x), we need to calculate the partial derivative (∂) with respect to each parameter while holding the other one constant. Consequently, the partial derivatives are as follows:

∂Q/∂θ₀ = −2 Σᵢ₌₁ⁿ (yᵢ − θ₀ − θ₁xᵢ)

∂Q/∂θ₁ = −2 Σᵢ₌₁ⁿ xᵢ(yᵢ − θ₀ − θ₁xᵢ)

With

∂Q/∂θ₀ = 0 and ∂Q/∂θ₁ = 0,

we now have the following system of equations to be solved:

nθ₀ + θ₁ Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ

θ₀ Σᵢ₌₁ⁿ xᵢ + θ₁ Σᵢ₌₁ⁿ xᵢ² = Σᵢ₌₁ⁿ xᵢyᵢ

After performing some algebraic manipulations, we arrive at the following two expressions:

θ₁ = (n Σ xᵢyᵢ − Σ xᵢ Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)

θ₀ = (Σ yᵢ − θ₁ Σ xᵢ) / n

where, for the sake of readability, the summation indexes have been omitted.

Upon observing the last two equations, it becomes evident that θ₁ solely relies on the input data, whereas θ₀ depends on both θ₁ and the input data. Hence, the approach is to first compute θ₁ and subsequently utilize that value to determine θ₀.
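These closed-form expressions translate directly into code. A minimal NumPy sketch, assuming x and y as defined earlier; the results should match the slope and intercept returned by linregress:

import numpy as np

x_arr = np.asarray(x, dtype=float)
y_arr = np.asarray(y, dtype=float)
n = len(x_arr)

# θ1 depends only on the data; θ0 is then computed from θ1 and the data
theta1 = (n * np.sum(x_arr * y_arr) - np.sum(x_arr) * np.sum(y_arr)) / \
         (n * np.sum(x_arr ** 2) - np.sum(x_arr) ** 2)
theta0 = (np.sum(y_arr) - theta1 * np.sum(x_arr)) / n

print(theta0, theta1)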

To summarize, when confronted with data in which the target variable appears to exhibit a linear relationship with a given feature, we assume a linear predictive model known as linear regression. The parameters θ₀ and θ₁ define this linear equation. In this article, we explored the application of the least squares method to ascertain these parameters. It is important to note that the parameter determination process can be extended to encompass multiple features, and not all target variables (labels) exhibit a linear dependence on the given features. These topics will be explored in subsequent articles.

[1] https://www.kaggle.com/datasets/lespin/house-prices-dataset?resource=download


Daniel Levin

M.Sc. Physics, B.Ed., and A.S. in Software Development. Teaching for 15 years and coding professionally since 2021.