How to clean up large and sensitive files in Git repository history

If a large file has ended up in your Git commit history, or if you have committed sensitive data such as a password or SSH key to a Git repository, you can remove it from the history. To purge unwanted files from a Git repository’s history, you can use either the git filter-repo tool or the open-source BFG Repo-Cleaner.

The git filter-repo tool and the BFG Repo-Cleaner both rewrite your Git repository’s history, which changes the SHAs of the commits you alter and of any commits that depend on them. Changed commit SHAs may affect open pull requests in your repository, so we recommend merging or closing all open pull requests before removing files.

For this tutorial, we will use the BFG Repo-Cleaner to demonstrate the purging process.

BFG Repo-Cleaner

The BFG removes large or troublesome blobs as git-filter-branch does, but faster, and it is written in Scala. It is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

The git-filter-branch command is enormously powerful and can do things that the BFG can’t – but the BFG is much better for the tasks above, because:

  • Faster: 10 to 720x faster than git-filter-branch
  • Simpler: The BFG isn’t particularly clever, but is focused on making the above tasks easy
  • Beautiful: If you need to, you can use the beautiful Scala language to customize the BFG. Which has got to be better than Bash scripting at least some of the time.

Download BFG Repo Cleaner

Prerequisite: the Java Runtime Environment (Java 8 or above) is required. (To install Java on Windows, watch this tutorial)
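As a quick way to check the Java prerequisite, here is a minimal sketch that pulls the major version out of the banner printed by `java -version`. The helper name `java_major_version` is our own invention for illustration, and the parsing assumes the usual `version "..."` banner format.

```shell
# Hypothetical helper: extract the major version from a `java -version` banner.
# Handles the legacy "1.8.0_281" scheme (major 8) and the newer "17.0.2" scheme.
java_major_version() {
  # Pull out the quoted version string from the first line that has one
  v="$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  case "$v" in
    '') return 1 ;;            # no version string found
    1.*) v="${v#1.}" ;;        # legacy scheme: drop the leading "1."
  esac
  # Keep only the leading digits (the major version)
  printf '%s\n' "${v%%[!0-9]*}"
}
```

Typical use would be `java_major_version "$(java -version 2>&1)"`, since Java prints its banner to standard error. A result of 8 or higher satisfies the requirement above.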

Purge/Cleanup steps

First clone a fresh copy of your repo, using the --mirror flag:

$ git clone --mirror git://example.com/some-big-repo.git

This is a bare repo, which means your normal files won’t be visible, but it is a full copy of the Git database of your repository. At this point, make a backup of the repo to ensure you don’t lose anything.
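One simple way to take that backup is to archive the mirror directory. The sketch below is only an illustration: the helper name `backup_mirror` and the archive naming scheme are our own assumptions, and any copy of the directory would serve just as well.

```shell
# Hypothetical helper: archive a bare mirror clone before rewriting it.
# Usage: backup_mirror some-big-repo.git [backup-dir]
backup_mirror() {
  repo="${1%/}"                      # strip a trailing slash, if any
  dest="${2:-.}"                     # default to the current directory
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest/$(basename "$repo")-backup-$stamp.tar.gz"
  # -C makes paths inside the archive relative to the repo's parent directory
  tar -czf "$archive" -C "$(dirname "$repo")" "$(basename "$repo")" || return 1
  printf '%s\n' "$archive"           # report where the backup was written
}
```

Running `backup_mirror some-big-repo.git` should print the path of the archive it created. If the cleanup goes wrong, extracting that archive restores the untouched mirror.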

Now you can run the BFG to clean your repository up:

$ java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git

The BFG will update your commits and all branches and tags so they are clean, but it doesn’t physically delete the unwanted stuff. Examine the repo to make sure your history has been updated, and then use the standard git gc command to strip out the unwanted dirty data, which Git will now recognize as surplus to requirements:

$ cd some-big-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
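To examine the repo as suggested above, one option is to ask Git whether any commit on any ref still touches a given path. This is a minimal sketch. The helper name `path_in_history` is hypothetical, and it checks a Git pathspec rather than a bare file name, so a plain name such as id_rsa only matches at the repository root.

```shell
# Hypothetical helper: succeed if any commit reachable from any ref
# still touches the given pathspec.
# Usage: path_in_history <repo-dir> <pathspec>
path_in_history() {
  # rev-list prints at most one matching commit; empty output means no match
  [ -n "$(git -C "$1" rev-list --all --max-count=1 -- "$2")" ]
}
```

For example, `path_in_history some-big-repo.git id_rsa && echo "still present"` prints a warning only if the file survives somewhere in the history. To match a file name in any directory, a glob pathspec such as `':(glob)**/id_rsa'` can be passed instead.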

Finally, once you’re happy with the updated state of your repo, push it back up (note that because your clone command used the --mirror flag, this push will update all refs on your remote server):

$ git push

At this point, you’re ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It’s best to delete all old clones, as they’ll have dirty history that you don’t want to risk pushing back into your newly cleaned repo.

Examples

Delete all files named ‘id_rsa’ or ‘id_dsa’:

$ java -jar bfg.jar --delete-files 'id_{dsa,rsa}' my-repo.git

Remove all blobs bigger than 50 megabytes:

$ java -jar bfg.jar --strip-blobs-bigger-than 50M  my-repo.git
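To choose a sensible size threshold, it can help to see which blobs in the history are the largest. The sketch below combines standard Git plumbing commands for that purpose. The helper name `largest_blobs` is our own and not part of Git or the BFG.

```shell
# Hypothetical helper: list the N largest blobs reachable from any ref.
# Output: size in bytes, then the path the blob was found under.
# Usage: largest_blobs <repo-dir> [N]
largest_blobs() {
  git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |           # keep blobs only, drop the type column
    sort -k1,1nr |                   # largest first
    head -n "${2:-10}"               # default to the top ten
}
```

Running `largest_blobs my-repo.git 5` against the mirror clone lists the five biggest blobs, which gives a starting point for the value passed to --strip-blobs-bigger-than.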

Replace all passwords listed in a file (prefix lines ‘regex:’ or ‘glob:’ if required) with ***REMOVED*** wherever they occur in your repository:

$ java -jar bfg.jar --replace-text passwords.txt  my-repo.git
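For illustration, a passwords.txt file could be created as sketched below. Every value in it is a made-up placeholder. The `==>` separator for supplying a custom replacement reflects our understanding of the BFG's expressions-file format and is not described in this article, so treat it as an assumption and confirm it against the BFG documentation.

```shell
# Write a sample expressions file; all values are invented placeholders.
# Line 1: a literal string, replaced with the default ***REMOVED***
# Line 2: a literal string with a custom replacement after ==> (assumed syntax)
# Line 3: a regular expression, marked by the regex: prefix
cat > passwords.txt <<'EOF'
hunter2
oldDbPassword==>CHANGED
regex:password=\w+==>password=
EOF
```

The quoted 'EOF' delimiter stops the shell from interpreting backslashes or dollar signs inside the file, which matters once regular expressions are involved.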

Remove all files matching the wildcard ‘*.jar’ (quote the pattern so that the shell does not expand it to bfg.jar or other local files first):

$ java -jar bfg.jar --delete-files '*.jar' my-repo.git

After running any of these commands, examine the repo to make sure your history has been updated, then expire the reflog, run git gc to strip out the unwanted data, and push the result:

$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push origin

Team,
DataHackr
