Using the ff package to load the Kaggle Criteo data set into R

Intro

The goal of this post is to demonstrate how to load the Criteo data set from the associated Kaggle competition into R using the ff and ffbase packages. I’ll also present a way to model the data with the biglm package, which requires cleaning the data before running the modeling command.

In order to proceed with this guide, you will need a computer with AT LEAST 4 gigs of RAM. Preferably, you should also save the data to a magnetic hard drive and NOT a solid-state drive (SSD). This advice comes from personal experience of quickly wearing out an SSD while manipulating big data.

Background

The size of the dataset is about 10 gigs with 45,840,617 observations and 40 variables. Yes, you read that correctly… The data set has over 45 million observations!
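If you want to confirm what you are dealing with before committing to the full load, checking the raw file’s size on disk is cheap. A small sketch (the path assumes the directory layout used later in this post):

# Size of the raw Criteo training file in gigabytes
file.info("F:/bigdata/kaggle_criteo/dac/train.txt")$size / 1024^3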

A note…

The file paths provided below point to the storage drive I use for working with big data. You may not have an F:/ drive; you may need to place the data on your C:/ drive or within a specific volume on OS X or Linux.

As a result, before running the script, please change the file paths so that they point to the right locations on your machine!
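The script below keeps my hard-coded paths, but if you prefer, a small helper like this keeps every machine-specific path in one place (the variable names are illustrative only and are not reused later in the post):

# Define the base directory once and build all other paths from it
base_dir   <- "F:/bigdata/kaggle_criteo/dac"          # <- change this for your machine
train_file <- file.path(base_dir, "train.txt")        # raw Criteo training data
temp_dir   <- file.path(base_dir, "temp")             # ff temporary directory
ff_archive <- file.path(base_dir, "ffdata", "ffdac")  # base name for the ffsave archive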

Load Script

# Any package that is required by the script below is given here:
# Check to see if packages are installed, if not install.
inst_pkgs = load_pkgs =  c("ff","ffbase","biglm")
inst_pkgs = inst_pkgs[!(inst_pkgs %in% installed.packages()[,"Package"])]
if(length(inst_pkgs)) install.packages(inst_pkgs)

# Dynamically load packages
pkgs_loaded = lapply(load_pkgs, require, character.only=T)

# Set Working Directory to where big data is
setwd("F:/bigdata/kaggle_criteo/dac/")

# Check temporary directory ff will write to (avoid placing on a drive with SSD)
getOption("fftempdir")

# Set new temporary directory
options(fftempdir = "F:/bigdata/kaggle_criteo/dac/temp")

# Load in the big data
ffx = read.table.ffdf(file="train.txt", # File Name
                      sep="\t",         # Tab separator is used
                      header=FALSE,     # No variable names are included in the file
                      fill = TRUE,      # Pad any rows that have fewer fields than expected
                      colClasses = c(rep("integer",14),rep("factor",26)) 
                      # Column classes: label plus 13 integer features, then 26 categorical features
                      )

# Assign names to the columns: the label, 13 integer features, and 26 categorical features
colnames(ffx) = c("Label",paste0("I",1:13),paste0("C",1:26))
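Once the import finishes (expect it to take a while), a quick sanity check confirms the load without pulling everything into RAM. A minimal sketch:

# Verify dimensions and peek at the first few rows
dim(ffx)      # should report 45,840,617 rows and 40 columns
ffx[1:5, ]    # extracts only these rows into an ordinary data.frame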

Quicker load on subsequent runs

Instead of recreating the ffdf object each time we open the workspace, we can save it to disk once using:

# Export created R Object by saving files 
ffsave(ffx, # ffdf object to save
       file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Permanent storage location
       # The last part of the path is the base name for the archive files 
       # (e.g. ffdac.RData and ffdac.ffData)
       rootpath="F:/bigdata/kaggle_criteo/dac/temp")     # Temporary write directory
       # where the ff files were initially created via the options(fftempdir) statement
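If you want to confirm the save worked, ffsave should leave two files per archive, a .RData file and a .ffData zip:

# The archive base name "ffdac" should yield ffdac.RData and ffdac.ffData
list.files("F:/bigdata/kaggle_criteo/dac/ffdata")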

After the ffdf object has been saved, we can reopen it in later sessions using:

# Load Data R object on subsequent runs (saves ~ 20 mins)
ffload(file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Load data from archive
       overwrite = TRUE) # Overwrite any existing files with new data

Note: if we modify the ffdf object (through data cleaning, for example), we need to RESAVE the object! Otherwise, our modifications will not be written to the permanent directory and will only exist for the duration of the R session.
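For example, after a cleaning pass you would simply repeat the save. A sketch, assuming that re-running ffsave over the same base name updates the existing ffdac archive (if your zip tool complains, delete the old ffdac.* files first):

# Resave after modifying ffx so the changes persist beyond this session
ffsave(ffx,
       file = "F:/bigdata/kaggle_criteo/dac/ffdata/ffdac",
       rootpath = "F:/bigdata/kaggle_criteo/dac/temp")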

Sample modeling using ffbase’s biglm hook

One of the nice benefits of using ffbase is the range of options it provides for working with ff data. In particular, it includes a wrapper that lets us feed an ffdf object straight into the biglm package without having to worry about manually breaking the data into chunks that fit in memory.

# Model
# Get predictor variable names (the 13 integer features plus the first 3 categorical features)
data_variables = colnames(ffx)[c(-1,-(18:40))]

# Create model formula statement
model_formula = as.formula(paste0("Label ~", paste0(data_variables, collapse="+")))

## YOU MUST CLEAN THE DATA BEFORE RUNNING THE REGRESSION! Running the regression with missing values will produce an error referring to eta.
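For illustration only, here is a minimal sketch of one possible cleaning step, assuming you simply impute missing values in the integer predictors with 0 (the imputation strategy is your call, and the categorical predictors may need similar attention). It uses ff's chunk() helper to process each column in RAM-sized pieces:

# Impute missing integer values with 0, one chunk at a time
int_vars = paste0("I", 1:13)         # the integer predictors used in the model
for (v in int_vars) {
  col = ffx[[v]]                     # ff column (a reference to the on-disk data)
  for (i in chunk(col)) {            # iterate over RAM-sized chunks of the column
    vals = col[i]                    # pull one chunk into memory
    vals[is.na(vals)] = 0L           # impute missing values with 0 (assumed strategy)
    col[i] = vals                    # write the imputed chunk back to disk
  }
}

After any such cleaning, remember to resave the ffdf object as noted above.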

# Use ffbase's bigglm method for ffdf objects so the data is fed to biglm
# in chunks instead of being converted to a single in-memory data.frame
model_out = bigglm.ffdf(model_formula, family=binomial(), data=ffx,
                        chunksize=100, na.action=na.exclude)
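Once the call finishes (it can take a long time at this chunk size), the usual biglm accessors apply to the result:

# Inspect the fitted model
summary(model_out)   # coefficient table with standard errors
coef(model_out)      # coefficient estimates only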