Univariate linear regression is a statistical technique that models the relationship between independent variables and a single dependent variable.
Such a model can be utilised to predict a dependent variable for a given independent variable.
The term ‘univariate’ simply means ‘one variable’.
Examples of such relationships might include:
- Egg size and required boiling time
- Laundry size and required amount of detergent
- Height and weight
This article will examine simple linear regression, where the model only includes one independent variable.
The model is expressed as:
$$ \begin{aligned} h(x) = wx + b \end{aligned} $$
where:
- w and b are constant numbers.
- x is an independent variable (e.g. egg size, laundry size, height). It is the input for the model, also called the regressor, predictor, explanatory variable, or feature.
- h(x) is a dependent variable (e.g. boiling time, amount of detergent, weight). This is the variable that is predicted using the model. The letter ‘h’ refers to the term ‘hypothesis’: a possible explanation, or a function that describes the given data. A model, on the other hand, is the final implementation of a hypothesis that is ready to make predictions. The dependent variable is often referred to as the response, outcome, or target variable, and is also denoted as y or f(x).
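To make the notation concrete, here is a minimal Python sketch of such a hypothesis function (the parameter values are made up purely for illustration):

```python
def h(x, w, b):
    """Hypothesis: predict the dependent variable from the independent variable x."""
    return w * x + b

# Hypothetical model: detergent amount (ml) predicted from laundry weight (kg)
print(h(5.0, w=20.0, b=10.0))  # 20 * 5 + 10 = 110.0 ml
```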
What are the roles of w and b?
Assume that multiple pairs of independent (x) and dependent (y) values are marked on a graph, and that a single straight line can be drawn through all of the points, as illustrated below:

This line is the graph of a linear function; if each point of that function were marked on the coordinate system, the result would appear as the same line (just infinitely longer).
The point at which the line intersects the vertical axis is referred to as the ‘intercept’. It corresponds to the output value (y or h(x)) when the input (x) equals 0. The intercept is the value of b, and it is not always intuitive. For example, a relatively accurate model that predicts weight based on height could use a non-zero value for the intercept (i.e. for the output when height is 0).
w is referred to as the ‘slope’ because it determines how steep the line is.

To calculate the slope, it is necessary to select two points on the line and determine the change in y from one point to the other. A similar procedure is followed for x: calculate the change in x from point 1 to point 2. Then, the change in y is divided by the change in x: $ w = \frac{\Delta y}{\Delta x} $
When the change in y is divided by the change in x, it will give the change in y for each 1 unit of change in x. In other words, how much y changes if x is increased by 1 (or decreased by 1).
The slope is the function’s rate of change.
Below is another example of how the slope is calculated:

$ \frac{y_2 - y_1}{x_2 - x_1} $ and $ \frac{y_1 - y_2}{x_1 - x_2} $ yield the same result.
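As a quick illustration, the slope could be computed from two points in Python as follows (the points are made up and lie on the line y = 2x + 1):

```python
def slope(point_1, point_2):
    """Slope w = delta_y / delta_x between two (x, y) points."""
    x1, y1 = point_1
    x2, y2 = point_2
    return (y2 - y1) / (x2 - x1)

# Both orderings of the points yield the same result
print(slope((1, 3), (4, 9)))  # (9 - 3) / (4 - 1) = 2.0
print(slope((4, 9), (1, 3)))  # (3 - 9) / (1 - 4) = 2.0
```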
Cost Function
Suppose some observations have been made and plotted on a graph. Given this, the problem that linear regression is intended to solve is to find the optimal line, or rather the best-fitting linear function, that describes the data:

In order to determine whether one function is better than another, the overall error made by the function is used as a criterion. An error on a graph is the distance between the prediction that was made and the actual observed value:

These distances are called ‘residuals’ in machine learning. A residual is the difference between an observed value and its corresponding predicted value, whereas errors are deviations of predictions from the true values, which are often unobservable.
In supervised learning, data is divided into two subsets: training (used for optimising parameters and training the model) and testing (used for evaluating the model’s real-world performance).
Since errors can be calculated directly from the test set, distances in the training set are often referred to as residuals, while distances in the testing set are called errors.
For simplicity, the term “error” will be used to refer to both residuals and errors for the remainder of this article.

Errors are calculated using the following formula:
$$ \begin{aligned} \varepsilon^{(i)} = \hat{y}^{(i)} - y^{(i)} \end{aligned} $$
where:
- $ i $ denotes the index of an observation.
- $ \varepsilon^{(i)} $ (epsilon) is the error in the $ i^{th} $ observation.
- $ \hat{y}^{(i)} $ (y-hat) is the predicted value for the corresponding x value. The alternative notation in this article is h(x).
- $ y^{(i)} $ is the true observed value.
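As a small illustration, the errors for a handful of made-up observations and a candidate hypothesis h(x) = 2x + 1 could be computed in Python like this:

```python
# Made-up observations
xs = [1, 2, 3]
ys = [3.2, 4.9, 7.1]

# Candidate hypothesis h(x) = 2x + 1
w, b = 2.0, 1.0

# epsilon_i = y_hat_i - y_i
errors = [(w * x + b) - y for x, y in zip(xs, ys)]
print(errors)  # approximately [-0.2, 0.1, -0.1]
```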
How much a model errs is measured by a cost function. There are different types of cost functions; the most commonly used one is the Mean Squared Error (MSE):
$$ \begin{aligned} J(w,b) = \frac {1} {2m} \sum_{i=1}^{m} (\varepsilon^{(i)})^2 \end{aligned} $$
The Greek letter $ \sum $ (sigma) denotes summation. The formula means “sum all squared errors, then divide the sum by 2m”. As m is the number of observations, this results in an average squared error. Errors are squared so that negative errors (points above the line yield a negative error) do not reduce the cost function when summed. It is important to note that larger errors are penalised to a greater extent. Also, squaring and dividing by 2 make the math calculations more straightforward later.
The cost function can be decomposed into simpler elementary terms by substituting the error definition, $ \varepsilon^{(i)} = wx^{(i)} + b - y^{(i)} $, directly into the formula:
$$ J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2 $$
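To make the cost concrete, here is a minimal Python sketch of the MSE cost function (the data is made up and lies exactly on y = 2x + 1, so the cost at w = 2, b = 1 is 0):

```python
def cost(w, b, xs, ys):
    """MSE cost J(w, b) = 1 / (2m) * sum((w * x + b - y) ** 2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                 # exactly y = 2x + 1
print(cost(2.0, 1.0, xs, ys))     # 0.0 for the perfect fit
print(cost(1.0, 0.0, xs, ys))     # 6.75, a larger cost for a worse fit
```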
In simple linear regression the optimal line that best fits the data is determined by minimising the cost function. While it is straightforward to identify this line in simple linear regression by brute force, this becomes impractical when there are more than two independent variables; each independent variable requires an additional axis, or dimension, on a graph. Although computers can visualise such graphs, it is impossible to understand what exactly is being plotted.
One of the methods used to minimise the cost function is called ‘gradient descent’.
Gradient Descent
Gradient descent is an iterative optimisation algorithm for minimising a differentiable function.
It requires the following steps:
- Calculate the function’s gradient, i.e. the vector of its partial derivatives.
- Update parameters based on the gradient.
- Repeat until convergence (very small change in cost or gradient).
Again, the intercept b is assumed to be 0 for now. For the cost function:
$$ J(w) = \frac {1} {2m} \sum_{i=1}^{m} (wx^{(i)}-y^{(i)})^2 $$
the derivative is the following:
$$ \frac{dJ}{dw} = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)}-y^{(i)})x^{(i)} $$
Since taking the derivative is a matter of applying the right formulas, there is little value in presenting the exact derivation from the cost function, so it is omitted here.
The update rule is:
$$ w_{new} := w - \alpha \times \frac{dJ}{dw} $$
where $ \alpha $ is referred to as the ‘learning rate’. It is just a small number that scales the derivative; it is usually set to around 0.01 when features are scaled.
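Below is a minimal sketch of this update loop in Python for the b = 0 case, using a fixed number of iterations instead of a formal convergence check (the data and settings are made up for illustration):

```python
def gradient_descent_w(xs, ys, alpha=0.01, iterations=1000):
    """Minimise J(w) = 1 / (2m) * sum((w * x - y) ** 2) with b fixed at 0."""
    m = len(xs)
    w = 0.0
    for _ in range(iterations):
        # dJ/dw = 1/m * sum((w * x - y) * x)
        gradient = sum((w * x - y) * x for x, y in zip(xs, ys)) / m
        w = w - alpha * gradient
    return w

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]                    # data on y = 2x
print(gradient_descent_w(xs, ys))    # approaches 2.0
```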
For a non-zero intercept, one more dimension should be introduced on the cost function’s graph:

This is the cost function’s surface for data whose best fit is y = 2x + 1.
Each red point represents a combination of slope and intercept and the corresponding cost.
Since the cost function now depends on two values, for the cost function:
$$ J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2 $$
the partial derivatives should be taken:
- $$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})x^{(i)} $$
- $$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)}) $$
The parameters of the hypothesis should be updated simultaneously (see the sketch after the update rules):
- $$ w_{new} := w - \alpha \times \frac{\partial J}{\partial w} $$
- $$ b_{new} := b - \alpha \times \frac{\partial J}{\partial b} $$
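Putting the pieces together, here is a minimal Python sketch of gradient descent with both parameters updated simultaneously; a fixed iteration count stands in for a proper convergence check, and the data is made up so that the best fit is y = 2x + 1:

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=20000):
    """Fit h(x) = w * x + b by minimising the MSE cost with gradient descent."""
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Partial derivatives of J(w, b) with respect to w and b
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
        # Simultaneous update: both derivatives use the old w and b
        w, b = w - alpha * dw, b - alpha * db
    return w, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                  # data on y = 2x + 1
print(gradient_descent(xs, ys))    # approaches (2.0, 1.0)
```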
Next article: Multiple Linear Regression