Intro

A while ago, I was given a binary file reading script written in MATLAB. The script’s goal was to read a binary output file from an Inertial Measurement Unit (IMU) into a MATLAB matrix. My task was to convert the code from MATLAB to C++, specifically for use with the Armadillo C++ matrix library. Furthermore, I wanted to embed it within the gmwm R package, which provides a method for modeling IMU error processes. ‘Twas here that the tale begins.

Binary 1-0

Binary file formats are a bit different from the ASCII text formats we are accustomed to seeing. A binary file stores raw bytes rather than readable characters; written out bit by bit, the bytes “01000010 01101100 01100001 01100011 01101011” spell the word “Black”. To view such files, you really cannot use your traditional IDE, since more often than not it will try to interpret the contents in a text encoding such as UTF-8 or a Windows/ISO code page. Instead, you should seek out a binary file editor or viewer (Windows: Hexplorer, OS X: Hex Fiend, Linux: the xxd -b shell command).
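To make the bit pattern above concrete, here is a minimal sketch that renders each byte of a string as eight binary digits. The function name to_bits is made up for illustration, and the sketch assumes a C++11 compiler.

```cpp
#include <bitset>
#include <string>

// Render each byte of `text` as eight binary digits, separated by spaces.
// `to_bits` is a hypothetical helper name used only for this illustration.
std::string to_bits(const std::string& text) {
  std::string out;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (i > 0) out += ' ';
    // Cast through unsigned char so bytes above 127 are not sign-extended.
    out += std::bitset<8>(static_cast<unsigned char>(text[i])).to_string();
  }
  return out;
}
```

Calling to_bits("Black") gives back the five groups of bits quoted above; note that the leading 01000010 is an uppercase “B”.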

Conversion from MATLAB to C++

The function was easily ported from MATLAB to C++. The main differences between the two versions centered on how I structured the IMU data type and, as you guessed, how the binary file was read. The latter was very problematic as it related to cross-platform deployment. The entire IMU function is available on GitHub for you to peruse or use in your standalone applications per the LICENSE.

The Bug

The binary record format was either all double or a double and 6 longs for int. On Windows, everything just “worked” on the C++ port. I was able to easily recover the data within the matrix. However, on OS X, it was a mess. Only IMU binary record formats that used double exclusively were able to be loaded. When a record format contained long, the results were really, really odd.

The worst part is that the bug came up while I was doing a live demo on OS X. When I went back to my Windows development machine, I couldn’t replicate it at all. Thinking it was just a corrupted file, I tried the demo again. Yet again, the bug reared its head.

When I started this section, it might have seemed odd that I wrote the data type as a long for int instead of long int. Part of the reason was that the original code showed long as the primitive type. The other part is that I wanted to emphasize that long on its own is shorthand for long int. So, long is equivalent to long int.
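If you want the compiler to confirm that equivalence, a small sketch along these lines should do it. The function names are hypothetical, and std::is_same requires C++11.

```cpp
#include <cstddef>
#include <type_traits>

// True when every spelling below names the same type as plain `long`.
// (Hypothetical helper name, for illustration only.)
constexpr bool long_spellings_agree() {
  return std::is_same<long, long int>::value &&
         std::is_same<long, signed long>::value &&
         std::is_same<long, signed long int>::value;
}

// The spellings always agree; the width of that one type is what
// can differ from one platform to the next.
constexpr std::size_t long_bytes() { return sizeof(long); }
```

long_spellings_agree() is true on any conforming compiler, while long_bytes() is the part that depends on the platform.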

With that said, after several rounds of running the code on my Windows machine and then on OS X, I narrowed the problem down to the data types I was using. I coded up a straightforward comparison of primitive data type sizes:

#include <Rcpp.h>    // Bundles C++ with R; drop it for a standalone C++ program
#include <iostream>  // std::cout, std::endl
#include <stdint.h>  // int32_t

// [[Rcpp::export]]
void sizeme() {
  std::cout << "Size of double: " << sizeof(double) << std::endl;
  std::cout << "Size of long double: " << sizeof(long double) << std::endl;

  std::cout << "Size of float: " << sizeof(float) << std::endl;

  std::cout << "Size of int: " << sizeof(int) << std::endl;
  std::cout << "Size of long int: " << sizeof(long int) << std::endl;
  std::cout << "Size of long: " << sizeof(long) << std::endl;
  
  std::cout << "Size of int32_t: " << sizeof(int32_t) << std::endl;
}

/*** R
sizeme()
*/

Under gcc on Windows, the data types* have the following sizes:

  • Size of double: 8
  • Size of long double: 16
  • Size of float: 4
  • Size of int: 4
  • Size of long int: 4
  • Size of long: 4
  • Size of int32_t: 4

Under clang on OS X, the data types* have the following sizes:

  • Size of double: 8
  • Size of long double: 16
  • Size of float: 4
  • Size of int: 4
  • Size of long int: 8
  • Size of long: 8
  • Size of int32_t: 4

* Sizes are given in bytes.

Notice that the size of long int differs between the two platforms: 4 bytes on Windows versus 8 bytes on OS X. At long last, both the cause of the issue and its solution were becoming clear.
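To see why those 4 bytes matter, consider the mixed record of one double and six integers. The sketch below works out how many bytes a reader consumes per record for a given integer type. The helper names are hypothetical, and it assumes an 8-byte double and a file that stores each integer in 4 bytes, which is what the working Windows build implies.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes consumed per record when it is read as one double followed by
// six integers of type IntT (hypothetical helper, for illustration).
template <typename IntT>
constexpr std::size_t record_bytes() {
  return sizeof(double) + 6 * sizeof(IntT);
}

// Number of whole records a reader believes a file of `file_bytes` holds.
template <typename IntT>
constexpr std::size_t record_count(std::size_t file_bytes) {
  return file_bytes / record_bytes<IntT>();
}
```

With a 4-byte integer each record is 8 + 6 × 4 = 32 bytes; with an 8-byte integer the reader consumes 8 + 6 × 8 = 56 bytes. A 3200-byte file therefore holds 100 records, but a reader using the wider type sees only 57, and every record after the first starts at the wrong offset.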

Solution

Use portable integer types!

| Type | Description | Value range [min, max] |
|---|---|---|
| int8_t | 8-bit signed integer | $[-2^{7}, 2^{7} - 1]$ |
| uint8_t | 8-bit unsigned integer | $[0, 2^{8} - 1]$ |
| int16_t | 16-bit signed integer | $[-2^{15}, 2^{15} - 1]$ |
| uint16_t | 16-bit unsigned integer | $[0, 2^{16} - 1]$ |
| int32_t | 32-bit signed integer | $[-2^{31}, 2^{31} - 1]$ |
| uint32_t | 32-bit unsigned integer | $[0, 2^{32} - 1]$ |
| int64_t | 64-bit signed integer | $[-2^{63}, 2^{63} - 1]$ |
| uint64_t | 64-bit unsigned integer | $[0, 2^{64} - 1]$ |

These primitive data types are guaranteed to have the same width on every platform that provides them. They are declared in <stdint.h> (C99) and, since C++11, in <cstdint>. However, depending on the C++ compiler, they may or may not be declared! Since the world has moved on from the era of C++98 (except for R), this is less likely, but I would be remiss if I didn’t mention it as a potential future source of error.

Specifically, I ended up using int32_t in place of long so that each integer field is read as exactly 4 bytes on every platform, matching what is stored in the file.
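As a rough sketch of that idea (not the actual gmwm reader), the code below reads records made of one double followed by six int32_t values. The struct, its field names, and the function name are all hypothetical, and it assumes the file's byte order matches the machine reading it.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// The reader below relies on an 8-byte double; fail at compile time otherwise.
static_assert(sizeof(double) == 8, "expected an 8-byte double");

// Hypothetical record layout: one double followed by six 4-byte integers.
struct ImuRecord {
  double first;
  std::int32_t rest[6];
};

// Read every complete record from `path`. Fields are read one at a time,
// so struct padding never matters. Byte order is assumed to match the host.
std::vector<ImuRecord> read_imu_file(const std::string& path) {
  std::vector<ImuRecord> records;
  std::ifstream in(path, std::ios::binary);
  ImuRecord rec;
  while (in.read(reinterpret_cast<char*>(&rec.first), sizeof rec.first) &&
         in.read(reinterpret_cast<char*>(rec.rest), sizeof rec.rest)) {
    records.push_back(rec);  // only reached when both reads succeeded
  }
  return records;
}
```

Because sizeof rec.rest is 24 bytes wherever int32_t exists, the reader consumes 32 bytes per record on Windows and OS X alike. A trailing partial record is dropped rather than pushed half-filled.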