Benefits of Using a Sparse Matrix | The Coatless Professor

Intro

Oddly enough, I spend a lot of my time talking about different kinds of data structures and when they should be used. Recently, one of my fellow graduate students discovered why using sparse matrices instead of regular matrices are beneficial. Principally, the sparse matrix format reduces the size of the overall data structure as it discards uneventful values. For instance, if an entry value is zero or very close to it, then retaining such values wouldn’t be informative. Instead, there would be a preference to only store the locations inside the matrix that have non-zero entries.

The Matrices War: Regular Matrices vs. Sparse Matrices

To start the discussion, let’s create a regular matrix that is “sparse”. For data, let’s aim to use the binomial distribution to replicate a coin flip under a low probability. Thus, heads would be a 1 and tails would be a 0. So, we’re only going to be interested in knowing when heads lands.

# Set a seed for reproducibility
set.seed(9134)

# Generate a matrix containing 1's and 0's with
# a low probability of 1.
standard_matrix = matrix(
  rbinom(100*100, size = 1, prob = .06), 
  nrow = 100
)

# Show the first four rows and columns
standard_matrix[1:4, 1:4]

##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    1
## [4,]    0    0    0    0

If we convert the matrix to a sparse matrix, the zero entries will be discarded while the non-zero entries are preserved and stored by coordinate. To convert to a sparse matrix, either use the as() or Matrix() functions that reside in the Matrix R package.

# Load the Matrix package
library("Matrix")

# Convert the regular matrix to a sparse matrix
sparse_matrix = as(standard_matrix, "sparseMatrix")

# Alternatively, we could use:
# Matrix(sparse_matrix, sparse = TRUE)

# Preview the sparse matrix
sparse_matrix[1:4, 1:4]

## 4 x 4 sparse Matrix of class "dgCMatrix"
##             
## [1,] . . . .
## [2,] . . . .
## [3,] . . . 1
## [4,] . . . .

Notice, wherever there was a 0 entry a period, e.g. ., now acts as a placeholder. As a result of discarding zero entries, the size of the matrix decreases greatly.

# Standard matrix
print(object.size(standard_matrix), units="Kb")

## 39.3 Kb

# Sparse matrix
print(object.size(sparse_matrix), units="Kb")

## 8.4 Kb

This is largely because the sparseMatrix format only saves the non-zero values at a coordinate location under the scheme of:

i is row index;
j is column index;
x is the value (optional if values other than 1 preferred); and
dim is the dimension of the matrix.

With scheme in hand, the sparse matrix can be directly constructed using sparseMatrix() function in the Matrix R package.

# Construct a new sparse matrix from a sample.
sample_sparse_matrix = sparseMatrix(
  i = c(1, 2, 4),    # Row ID
  j = c(1, 3, 5),    # Column ID
  x = c(8, 1, 7),    # Contents
  dims = list(4, 5)  # Dimension of matrix
)

# Show sparse matrix
sample_sparse_matrix

## 4 x 5 sparse Matrix of class "dgCMatrix"
##               
## [1,] 8 . . . .
## [2,] . . 1 . .
## [3,] . . . . .
## [4,] . . . . 7

Fin

In short, sparse matrices are incredibly powerful for storing data that has a small amount of entries. Consider using a sparse matrix if you are encountering memory issues.