Intro

Simple Linear Regression (SLR) has been tickled to death. One interesting tidbit about SLR is the set of different Sum of Squares formulations that exist and how they tie into just about everything. This post tries to deconstruct the Sum of Squares formulations into alternative equations.

Definitions

In the least technical terms possible….

Sum of Squares provides a measurement of the total variability of a data set by squaring each point and then summing the squares:

$$\sum_{i=1}^{n} x_i^2$$

More often, we use the Corrected Sum of Squares, which compares each data point to the mean of the data set to obtain a deviation and then squares it:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

where the mean is defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

When we talk about Sum of Squares, it will always be the latter definition. Why? Squaring raw values produces far larger numbers than squaring deviations from the mean, so the initial definition is much more likely to overflow or lose precision when working with large numbers (e.g. 1000000000000^2 vs. (1000000000 - 1000000)^2).
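To get a feel for the difference in magnitude, here is a small sketch in R. The three values are made up for illustration and sit near one billion; the variable names are not used elsewhere in this post.

```r
# Hypothetical data: three values clustered around one billion
x <- 1e9 + c(0, 1, 2)

# Uncorrected sum of squares: each term is on the order of 1e18
raw_ss <- sum(x^2)

# Corrected sum of squares: the deviations from the mean are -1, 0, 1
corrected_ss <- sum((x - mean(x))^2)

raw_ss        # roughly 3e18
corrected_ss  # 2
```

The corrected version works with the small deviations rather than the huge raw values, which is why it is the safer default.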

Arrangements

There are three key equations:

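  1. Sum of Squares over $x$: $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
  2. Sum of Squares over $y$: $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
  3. Sum of $x$ times $y$: $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$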

Psst… The last one isn’t a square! In fact, it’s part of what’s called covariance. It’s listed here because of the similarities in manipulations that you will see later on.
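As a quick numeric illustration of the three quantities and of the covariance connection, here is a sketch on a tiny made-up data set. The `x` and `y` values are assumptions chosen so the arithmetic is easy to check by hand; `var()` and `cov()` are base R functions that divide by $n - 1$.

```r
# Hypothetical toy data, chosen so the arithmetic is easy to check by hand
x <- c(1, 2, 3)
y <- c(2, 4, 6)
n <- length(x)

# The three key quantities, computed directly from their definitions
S_xx <- sum((x - mean(x))^2)                 # 2
S_yy <- sum((y - mean(y))^2)                 # 8
S_xy <- sum((x - mean(x)) * (y - mean(y)))   # 4

# Dividing by n - 1 recovers the sample variance and sample covariance
all.equal(S_xx / (n - 1), var(x))      # TRUE
all.equal(S_xy / (n - 1), cov(x, y))   # TRUE
```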

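These initial arrangements can be modified to take on different forms such as:

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x}) x_i$$

and

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x}) y_i$$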

The next two sections go into depth on how to manipulate these equations. The manipulations rely on just two tools: the definition of the mean and some basic summation properties.
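For reference, the summation properties in question can be summarized as below, where $c$ stands for any quantity that does not depend on the index $i$ (such as $\bar{x}$ or $n$). This is a short summary of standard identities, not an exhaustive list:

$$
\begin{aligned}
\sum_{i=1}^{n} (a_i + b_i) &= \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i \\
\sum_{i=1}^{n} c\, a_i &= c \sum_{i=1}^{n} a_i \\
\sum_{i=1}^{n} c &= n c \\
\sum_{i=1}^{n} x_i &= n\bar{x} \quad \text{(the mean definition, rearranged)}
\end{aligned}
$$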

Providing different forms of the Sum of Squares for $S_{xx}$ and $S_{yy}$

These arrangements can be modified rather nicely to alternative expressions.

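For instance, both 1 and 2 can be rewritten as follows (shown here for $S_{xx}$):

$$
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}
$$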

We’ll call this result the alternative definition.

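We can further manipulate this expression by using $n\bar{x} = \sum_{i=1}^{n} x_i$:

$$
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i \\
&= \sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i \right) \\
&= \sum_{i=1}^{n} (x_i - \bar{x}) x_i
\end{aligned}
$$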

The last result we’ll refer to as the exterior definition.

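Therefore, as stated previously, we have:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x}) x_i$$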

Psst… For $S_{yy}$, simply replace every $x$ you see above with a $y$.

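e.g.

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \sum_{i=1}^{n} (y_i - \bar{y}) y_i$$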

Exploring the different forms of $S_{xy}$

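Based on the previous section, what comes next should not be very surprising. The only real differences between these two sections are the inclusion of a second variable AND the requirement that $x$ and $y$ have the same number of observations (i.e. $n_{x} = n_{y} = n$).

$$
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
&= \sum_{i=1}^{n} \left( x_i y_i - x_i \bar{y} - \bar{x} y_i + \bar{x}\bar{y} \right) \\
&= \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i - \bar{x} \sum_{i=1}^{n} y_i + n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{aligned}
$$

The result is a modified version of the alternative definition.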

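We can obtain a form similar to the one in the previous section, except this time we must choose to have either $y_i$ or $x_i$ on the exterior… Let’s start by opting for $y_i$ on the exterior:

$$
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - \bar{x} \sum_{i=1}^{n} y_i \\
&= \sum_{i=1}^{n} (x_i - \bar{x}) y_i
\end{aligned}
$$

Alternatively, we can go the opposite route and have $x_i$ on the exterior:

$$
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i \\
&= \sum_{i=1}^{n} (y_i - \bar{y}) x_i
\end{aligned}
$$

Both are forms of the exterior definition.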
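One way to see why either variable can sit on the exterior is that deviations from the mean always sum to zero, so the extra mean term contributes nothing to the total. Here is a small sketch in R; the toy values and variable names are made up for illustration only.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 6)
y <- c(3, 5, 4)

# Deviations from the mean sum to zero (up to rounding error in general)
dev_x <- x - mean(x)
sum(dev_x)  # 0

# So subtracting mean(y) inside the product adds nothing to the total
full_form  <- sum(dev_x * (y - mean(y)))
y_exterior <- sum(dev_x * y)
x_exterior <- sum((y - mean(y)) * x)

all.equal(full_form, y_exterior)  # TRUE
all.equal(full_form, x_exterior)  # TRUE
```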

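Therefore, we have the following equations:

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x}) y_i = \sum_{i=1}^{n} (y_i - \bar{y}) x_i$$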

A simple test

We can check the manipulations above numerically. To do so, let’s quickly write a few R functions and compare their output.

# For reproducibility
set.seed(1337)

# Generate some random data
x = rnorm(10000,3,2)
y = rnorm(10000,1,4)

Let’s formalize the definitions as functions.

# Sxx and Syy definition
s.xx = function(x){
  sum((x-mean(x))^2)
}

# Sxx and Syy Alternative definition
s.xx.alt = function(x){
  n = length(x)
  sum(x^2) - n*mean(x)^2
}

# Sxx and Syy Exterior definition
s.xx.ext = function(x){
  sum((x-mean(x))*x)
}

# Sxy Definition
s.xy = function(x,y){
  sum((x-mean(x))*(y-mean(y)))
}

# Sxy Alternative Definition
s.xy.alt = function(x,y){
  n = length(x)
  sum(x*y) - n*mean(x)*mean(y)
}

# Sxy Exterior Definition
s.xy.ext = function(x,y){
  sum((x-mean(x))*y)
}

Now, let’s see the results of each function:

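### Sxx and Syy

# Do all three forms give the same value for Sxx?
# all.equal() compares only its first two arguments (a third positional
# argument is treated as the tolerance), so compare pairwise instead.
isTRUE(all.equal(s.xx(x), s.xx.alt(x))) &&
  isTRUE(all.equal(s.xx(x), s.xx.ext(x)))
## [1] TRUE

# What is the value?
s.xx.ext(x)
## [1] 40066.65

### Sxy

# Do all three forms give the same value for Sxy?
isTRUE(all.equal(s.xy(x,y), s.xy.alt(x,y))) &&
  isTRUE(all.equal(s.xy(x,y), s.xy.ext(x,y)))
## [1] TRUE

# What is the value?
s.xy.ext(x,y)
## [1] 330.3306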

Timing

Aside from the derivations and the simple tests, there is one other item to consider: the amount of time it takes to calculate each equation.

# install.packages("microbenchmark")

# Load microbenchmark
library(microbenchmark)

# Benchmark Sxx definition against x data
microbenchmark(s.xx(x), s.xx.alt(x), s.xx.ext(x))
## Unit: microseconds
##         expr    min     lq      mean  median      uq      max neval
##      s.xx(x) 44.128 45.930  91.04903 47.4310 58.9880 3066.181   100
##  s.xx.alt(x) 39.025 41.127 100.49005 42.4775 55.8365 3418.609   100
##  s.xx.ext(x) 43.829 46.080  59.45064 47.2810 54.7855  904.184   100

# Benchmark Syy definition against y data
microbenchmark(s.xx(y), s.xx.alt(y), s.xx.ext(y))
## Unit: microseconds
##         expr    min     lq     mean median      uq      max neval
##      s.xx(y) 44.129 46.530 70.16155 47.130 48.0315 1257.813   100
##  s.xx.alt(y) 39.926 41.727 44.42283 42.327 43.9790   67.844   100
##  s.xx.ext(y) 44.429 46.830 70.35370 48.031 50.5830 1065.088   100

# Benchmark Sxy Definition
microbenchmark(s.xy(x,y), s.xy.alt(x,y), s.xy.ext(x,y))
## Unit: microseconds
##            expr    min     lq      mean  median      uq      max neval
##      s.xy(x, y) 78.651 81.352 201.17819 84.0545 92.1600 4452.476   100
##  s.xy.alt(x, y) 65.743 67.544 119.99969 69.1950 72.3465 4884.755   100
##  s.xy.ext(x, y) 46.531 48.932  52.09587 49.8320 52.0835   72.948   100

In this case, judging by the median times, the alternative definition is fastest for $S_{xx}$ and $S_{yy}$ (by roughly 5 microseconds), whereas the exterior definition is fastest for $S_{xy}$.