Removing Large Files and Sensitive Information using BFG Repo Cleaner

Intro

As group projects and student-created visualizations get underway, it is almost inevitable that some group will commit a data set that is too large for its repository. GitHub treats a push as problematic when a file is:

  1. Over 50 MB, which triggers a warning for the commit.
  2. Over 100 MB, which yields an error and causes the commits to be rejected.

Similarly, a repository is flagged as problematic if its total size exceeds 1 GB.
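Before committing, it can help to check the working tree for files that would trip these limits. The following is a minimal sketch, not part of the GitHub workflows: the function name `list_large_files` is made up for illustration, and the `M` size suffix (counted in units of 1,048,576 bytes) is assumed to be supported by the installed `find`, as it is in the GNU and BSD versions.

```shell
# List working-tree files larger than a threshold (default: 50, in units of
# 1,048,576 bytes), skipping Git's own .git directory.
# NOTE: list_large_files is a hypothetical helper name.
list_large_files() {
  dir="${1:-.}"
  limit="${2:-50}"
  find "$dir" -name .git -prune -o -type f -size "+${limit}M" -print
}
```

Calling `list_large_files . 50` from the top of a project prints one path per offending file and prints nothing when every file is under the threshold.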

Workflow

There are two workflows given by GitHub that seek to address the “large file” or “sensitive data” problem:

  1. Removing a file using git rm and git commit --amend
  2. Removing sensitive data from a repository using bfg or git filter-branch

Neither approach, as stated, allowed for a quick cleanup of a repository. However, the BFG Repo-Cleaner highlighted by the second workflow proved useful: with additional flags, it quickly rewrote the commits and returned the repos to a functional state.

For future use, the workflow is given as:

# Download the BFG Repo Cleaner to working directory
wget https://repo1.maven.org/maven2/com/madgag/bfg/1.13.0/bfg-1.13.0.jar

# Clone the repository to a new directory
git clone git@github.com:org-name/sample-repo.git sample-repo

# Strip every blob larger than 100M from sample-repo; --no-blob-protection
# lets BFG also remove such files from the latest commit
java -jar bfg-1.13.0.jar --no-blob-protection --strip-blobs-bigger-than 100M sample-repo

# Change into the repo folder after purging
cd sample-repo

# Expire the reflog and garbage-collect so the stripped blobs are actually deleted
git reflog expire --expire=now --all && git gc --prune=now --aggressive
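
One way to confirm that the cleanup worked is to list any blobs still reachable from the repository's refs that exceed a size threshold. The sketch below is an illustration rather than part of the original workflow: `list_large_blobs` is a hypothetical helper name, and the default threshold of 104857600 bytes assumes that 100M means 100 × 1,048,576 bytes.

```shell
# Print "<size-in-bytes> <path>" for every blob reachable from any ref that is
# larger than the given byte threshold (default: 104857600).
# NOTE: list_large_blobs is a hypothetical helper name.
list_large_blobs() {
  limit_bytes="${1:-104857600}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$limit_bytes" \
      '$1 == "blob" && $2 > limit { print substr($0, 6) }'
}
```

Run inside `sample-repo`, an empty result suggests no oversized blobs remain in the history. The rewritten history exists only in the local clone at this point; publishing it would typically require a forced push such as `git push --force`, a step the workflow above does not show and that is assumed here.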

Common Issues

Did BFG warn that no “large blobs” were found and ask whether the repo needs to be packed?

Using repo : /cloud/project/.git

Scanning packfile for large blobs completed in 16 ms.
Warning : no large blobs matching criteria found in packfiles - does the repo need to be packed?

To pack the repository's loose objects, run the following and then re-run BFG:

git gc
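
The warning mentions packfiles because that is where BFG scans for blobs, so objects that Git still stores loose are not seen. A rough way to check for this ahead of time is to read the loose-object count from `git count-objects -v`. The helper names below (`loose_object_count`, `pack_if_needed`) are hypothetical and only sketch the idea.

```shell
# Number of loose (unpacked) objects in the current repository.
# NOTE: both function names are hypothetical helpers.
loose_object_count() {
  git count-objects -v | awk '$1 == "count:" { print $2 }'
}

# Pack the repository only when loose objects are present.
pack_if_needed() {
  if [ "$(loose_object_count)" -gt 0 ]; then
    git gc --quiet
  fi
}
```

Running `pack_if_needed` inside the repository before invoking BFG should leave the large blobs in a packfile where the scan can find them.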

Appendix

Sample output of a successful cleaning run:

Using repo : /cloud/project/.git

Scanning packfile for large blobs: 30
Scanning packfile for large blobs completed in 56 ms.
Found 1 blob ids for large blobs - biggest=120226125 smallest=120226125
Total size (unpacked)=120226125
Found 0 objects to protect
Found 4 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/origin/HEAD, refs/remotes/origin/master

Protected commits
-----------------

You're not protecting any commits, which means the BFG will modify the
contents of even *current* commits.

This isn't recommended - ideally, if your current commits are dirty, you
should fix up your working copy and commit that, check that your build still
works, and only then run the BFG to clean up your history.

Cleaning
--------

Found 5 commits
Cleaning commits:       100% (5/5)
Cleaning commits completed in 375 ms.

Updating 1 Ref
--------------

        Ref                 Before     After
        ---------------------------------------
        refs/heads/master | c2da9bdd | a732bc9c

Updating references:    100% (1/1)
...Ref update completed in 18 ms.

Commit Tree-Dirt History
------------------------

        Earliest      Latest
        |                  |
         .   .   D   D   D

        D = dirty commits (file tree fixed)
        m = modified commits (commit message or parents changed)
        . = clean commits (no changes to file tree)

                                Before     After
        -------------------------------------------
        First modified commit | b56cbdde | 6056ad01
        Last dirty commit     | c2da9bdd | a732bc9c

Deleted files
-------------

        Filename       Git id
        ----------------------------------
        psam_p17     | d8be4dfb (114.7 MB)
        psam_p17.csv | d8be4dfb (114.7 MB)
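
The byte count and the "MB" figure in the output above can be reconciled with a little arithmetic. The sketch below uses a hypothetical helper, `bytes_to_mib`, and assumes that the "MB" shown in the report is 1,048,576 bytes, which is consistent with the numbers in this sample.

```shell
# Convert a byte count to units of 1,048,576 bytes with one decimal place.
# NOTE: bytes_to_mib is a hypothetical helper name.
bytes_to_mib() {
  awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1048576 }'
}

bytes_to_mib 120226125   # blob size from the sample output; prints 114.7
```

The result matches the 114.7 MB listed under "Deleted files", so the reported blob is comfortably above the 100M threshold passed to BFG.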