The goal of this post is to demonstrate how to load the Criteo data set from the associated Kaggle competition into R using the ff and ffbase R packages. I'll also present a way to model the data using the biglm R package, which requires cleaning the data before running the modeling command.
To follow this guide, you will need a computer with AT LEAST 4 GB of RAM. Preferably, you should also save the data to a magnetic hard drive and NOT a solid-state drive (SSD). This remark comes from personal experience of quickly wearing out an SSD while manipulating big data.
The data set is about 10 GB, with 45,840,617 observations and 40 variables. Yes, you read that correctly: the data set has over 45 million observations!
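To see why loading this with a plain read.table() call is hopeless, a rough back-of-envelope estimate helps. The figure below assumes 4 bytes per stored value, which is a deliberate lower bound; R's object overhead and the copies made during parsing push the real footprint considerably higher.

```r
# Lower-bound estimate of the in-memory footprint of the full table.
# Assumption: 4 bytes per value (integer storage); factor codes and
# R's overhead make the true cost substantially larger.
n_obs  <- 45840617
n_cols <- 40
bytes  <- n_obs * n_cols * 4
round(bytes / 1024^3, 1)   # ~6.8 GB as a bare minimum
```

Even this optimistic lower bound exceeds the working memory left over on a 4 GB machine, which is why ff's on-disk storage is used instead.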
The file paths below point to the storage drive I use for working with big data. You may not have an F:/ drive; you may need to place the data on your C:/ drive or within a specific volume on OS X or Linux. As a result, before running the script, please change the file paths so that they are appropriate for your machine!
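For example, on a machine without an F:/ drive you might point everything at a directory under your home folder. The paths below are purely illustrative; dir.create() builds the ff temporary directory if it does not already exist.

```r
# Illustrative only -- substitute a location appropriate for your machine.
base_dir <- "~/bigdata/kaggle_criteo/dac"      # e.g. on Linux / OS X
temp_dir <- file.path(base_dir, "temp")

# Create the temp directory for ff's backing files if it is missing
if (!dir.exists(temp_dir)) dir.create(temp_dir, recursive = TRUE)

setwd(base_dir)                 # working directory for the raw data
options(fftempdir = temp_dir)   # where ff will write its temporary files
```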
# Any package that is required by the script below is given here:
# Check to see if packages are installed; if not, install them.
inst_pkgs = load_pkgs = c("ff", "ffbase", "biglm")
inst_pkgs = inst_pkgs[!(inst_pkgs %in% installed.packages()[, "Package"])]
if (length(inst_pkgs)) install.packages(inst_pkgs)

# Dynamically load packages
pkgs_loaded = lapply(load_pkgs, require, character.only = TRUE)

# Set working directory to where the big data lives
setwd("F:/bigdata/kaggle_criteo/dac/")

# Check the temporary directory ff will write to (avoid placing it on an SSD)
getOption("fftempdir")

# Set a new temporary directory
options(fftempdir = "F:/bigdata/kaggle_criteo/dac/temp")

# Load in the big data
ffx = read.table.ffdf(file = "train.txt",   # File name
                      sep = "\t",           # Tab separator is used
                      header = FALSE,       # No variable names are included in the file
                      fill = TRUE,          # Missing values are represented by NA
                      colClasses = c(rep("integer", 14), rep("factor", 26))  # Import type of each column
                      )

# Assign names to the columns
colnames(ffx) = c("Label", paste0("I", 1:13), paste0("C", 1:26))
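Once the import finishes, it is worth a quick sanity check. These calls read metadata or a small slice from disk rather than pulling all 45 million rows into RAM:

```r
# Quick sanity checks on the ffdf -- none of these load the full data set
class(ffx)       # should report an "ffdf" object
dim(ffx)         # expect 45840617 rows by 40 columns
ffx[1:5, 1:5]    # small slice: only these rows/columns are read from disk
```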
Quicker load on subsequent runs
Instead of recreating the ffdf object each time we open the workspace, we opt to save the ffdf object using:
# Export the created R object by saving its backing files
ffsave(ffx,  # ffdf object
       # Permanent storage location; the last name in the path is the base
       # name for the saved files, e.g. ffdac.Rdata, ffdac.ff, etc.
       file = "F:/bigdata/kaggle_criteo/dac/ffdata/ffdac",
       # Temporary write directory where the data was initially loaded
       # via the options(fftempdir) statement
       rootpath = "F:/bigdata/kaggle_criteo/dac/temp")
Once the ffdf object has been saved, we are able to reopen the ffdf object using:
# Load the saved R object on subsequent runs (saves ~20 minutes)
ffload(file = "F:/bigdata/kaggle_criteo/dac/ffdata/ffdac",  # Load data from the archive
       overwrite = TRUE)  # Overwrite any existing files with the new data
Note: if we modify the ffdf object, e.g. through data cleaning, we need to RESAVE the object! Otherwise, our modifications will not be stored in the permanent directory and will exist only for the duration of the R session.
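As a sketch of what that cleaning might look like, the loop below zero-imputes missing values in the integer predictors chunk by chunk, so only one chunk of rows is ever in RAM at a time. Zero-imputation is an assumption made purely for illustration, not a recommendation; choose an imputation strategy that suits your model, and remember to ffsave() the object again afterwards.

```r
# Crude chunkwise zero-imputation for the integer predictors I1..I13.
# chunk() (from the bit/ff packages) yields range indexes over an ff vector,
# so each pass touches only one chunk of rows.
for (v in paste0("I", 1:13)) {
  for (ch in chunk(ffx[[v]])) {
    vals <- ffx[[v]][ch]        # pull one chunk of the column into RAM
    vals[is.na(vals)] <- 0L     # replace missing values (illustrative choice)
    ffx[[v]][ch] <- vals        # write the cleaned chunk back to disk
  }
}
```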
Sample modeling using ffbase’s biglm hook.
One of the nice benefits of using ffbase is the many options it provides for working with ff data. In particular, there is a wrapper that allows us to feed information into the biglm package without having to worry about converting the ffdf object into a regular data.frame.
# Model
# Get predictor variable names (only 1 categorical variable is included)
data_variables = colnames(ffx)[c(-1, -(18:40))]

# Create the model formula statement
model_formula = as.formula(paste0("Label ~", paste0(data_variables, collapse = "+")))

## YOU MUST CLEAN THE DATA BEFORE RUNNING THE REGRESSION!
## RUNNING THE REGRESSION WITH MISSING VALUES WILL YIELD AN ETA ERROR!

# Use a modified version of bigglm so that bigglm will not try to
# convert the ffdf to a regular data.frame
model_out = bigglm.ffdf(model_formula,
                        family = binomial(),
                        data = ffx,
                        chunksize = 100,
                        na.action = na.exclude)
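If the regression completes, the fitted object can be inspected much like an ordinary glm fit. As far as I'm aware, summary() and coef() both work on bigglm objects:

```r
# Inspect the fitted model: coefficients, standard errors, and p-values
summary(model_out)

# Just the coefficient vector, e.g. for building a scoring function by hand
coef(model_out)
```

Because biglm accumulates sufficient statistics chunk by chunk, the chunksize argument trades memory for speed; 100 rows per chunk is conservative, and larger values should reduce the number of passes' overhead on a machine with RAM to spare.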