How to remove secrets from your codebase

André Eleuterio


At most early-stage startups, some security concerns are deprioritized in favor of speed and getting to a functioning product. That was the case at Sourcegraph: in the beginning we didn't have the bandwidth to handle all the secrets in our source code automatically. Since our infrastructure is fully managed through code, most of our infrastructure and service account passwords were stored in private repositories. We always saw this as a big risk for the organization, so we tackled it earlier this year. We relied heavily on Sourcegraph and its search capabilities to reassure us that we weren't missing anything.

We broke the effort down into three parts:

  • Finding all secrets in our codebase
  • Removing the secrets from code and storing them in a secret vault
  • Rotating the secrets

Searching for secrets in our codebase

To get started and properly estimate the work, we needed to find all secrets in our codebase. We had an idea of where most secrets would be: Kubernetes (.yaml) and Terraform (.tf) files. That served as a starting point, but we needed to be very thorough.

There is no industry-standard tool for finding secrets in source code, but there are a few that aim to help. truffleHog and Gitrob are popular OSS tools for this purpose. Each has strengths and weaknesses, usually trading off between two approaches:

  • Searching for patterns that look dangerous, such as a password= string. This approach usually produces more findings but with less precision.
  • Searching for known patterns, such as GitHub or AWS tokens, which have a distinctive format. This approach finds every instance of that token type but misses secrets that have no known format.

To ensure we were thoroughly covered we combined automated tooling (truffleHog), manual reviews, and (especially) Sourcegraph searches. truffleHog also served as a great source of patterns to search for.

We started with a high-precision search targeting known patterns. This search was developed targeting secrets we already knew were in our source code and had an identifiable pattern. We were able to find many secrets with a low number of false positives:

repo:[our targeted repos]$
patterntype:regex

// ===== PATTERNS (or-delimited) =====

// Quoted strings of 32 or more characters, possibly base64-encoded
("[a-z0-9+/]{32,}=?"|'[a-z0-9+/]{32,}=?'|`[a-z0-9+/]{32,}=?`) or

// Private keys
-----BEGIN (RSA )?PRIVATE KEY----- or

// Lines ending with "=" (likely base64 values)
[a-z0-9+/]+==?(['"],?)?\n or

// Lines containing a significant number of base64 characters,
// but not necessarily ending with "=",
// (acknowledging that not all base64 strings end with "="),
// prefixed with a keyword indicating that they are sensitive
(token|secret|password|credential|key|private|sensitive)[^a-z0-9+/\n]+[a-z0-9+/]{16,}(['"],?)?\n or

// Likely k8s secrets
(kind: secret|kind secret|kubectl create secret) or

// Slack
(xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32}) or

// GitHub
[gG][iI][tT][hH][uU][bB].*['|\"][0-9a-zA-Z]{35,40}['|\"] or

// Google, GCP, GSuite
AIza[0-9A-Za-z\-_]{35} or
[0-9]+-[0-9A-Za-z_]{32}\.apps\.googleusercontent\.com or
ya29\.[0-9A-Za-z\-_]+
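
To spot-check individual files outside of Sourcegraph, a few of the patterns above can be tried locally. The following is a minimal Python sketch, not the search we actually ran: the `PATTERNS` table and the `scan` helper are hypothetical names, only a subset of the patterns is included, and the trailing `\n` in the keyword pattern is written as an end-of-line anchor in Python's regex syntax.

```python
import re

# A subset of the patterns from the search above. Case-insensitive matching
# is an assumption made here to mimic a case-insensitive code search.
PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (RSA )?PRIVATE KEY-----"),
    "slack_token": re.compile(
        r"xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32}", re.I
    ),
    "google_api_key": re.compile(r"AIza[0-9A-Za-z\-_]{35}"),
    # A sensitive keyword followed by 16+ base64-like characters at end of line.
    "keyword_then_value": re.compile(
        r"(token|secret|password|credential|key|private|sensitive)"
        r"[^a-z0-9+/\n]+[a-z0-9+/]{16,}(['\"],?)?$",
        re.I | re.M,
    ),
}


def scan(text):
    """Return sorted (line_number, pattern_name) pairs for every match in text."""
    hits = set()
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Line numbers are 1-based: count the newlines before the match.
            line_number = text.count("\n", 0, match.start()) + 1
            hits.add((line_number, name))
    return sorted(hits)
```

Calling `scan` on the contents of a file returns the lines worth a manual look; like the search itself, it trades some false positives for coverage.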

We then moved to searches with a wider scope and, as a result, more false positives to be triaged. These searches look for keywords with the added power of regular expressions:

r:[our targeted repositories]$
patterntype:regex

private[\s_-]?key or
api[\s_-]?key or
secret[\s_-]?key or
session[\s_-]?id or
auth[\s_-]?token or
license[\s_-]?key

r:[our targeted repositories]$
patterntype:regex

(credential|secret|private|\Wkey\W|token|sensitive|password|session|auth|license|\Wid\W) or
[sS][eE][cC][rR][eE][tT].*['|\"][0-9a-zA-Z]{32,45}['|\"]

We triaged the false positives manually and cross-checked our results against truffleHog's output. truffleHog is a great tool, but on its own it produced a very high number of false positives and duplicate results. For example, it flagged container image SHA digests as secrets. As a cross-check for our own searches, though, it proved very valuable.
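
Some of that noise can be filtered mechanically before manual triage. The sketch below is an illustration under assumptions rather than the process we followed: `triage` is a hypothetical helper, and `findings` is assumed to be a list of (path, line text) pairs collected from a scanner's output. It drops exact duplicates and lines containing a container image digest.

```python
import re

# Container image digests look like "sha256:" followed by 64 hex characters.
IMAGE_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def triage(findings):
    """Drop duplicate findings and findings on lines with an image digest.

    `findings` is a hypothetical list of (path, line_text) tuples.
    """
    # dict.fromkeys removes exact duplicates while preserving order.
    unique = list(dict.fromkeys(findings))
    return [(path, line) for path, line in unique if not IMAGE_DIGEST.search(line)]
```

A line that contains both a digest and a real secret would be dropped too, so a filter like this is only safe as a first pass before a manual review.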

As we found more secrets we searched those patterns individually for further coverage. We confirmed around 150 secrets that needed to be removed from our code.

Removing secrets from code

The goal was to remove all secrets from our source code and store them in GCP Secret Manager, our secret vault of choice for this project. We use Terraform to manage our cloud assets, which allows us to fetch secrets from GCP and inject them into Kubernetes Secrets in our clusters. After this initial setup and writing some Terraform modules, we were ready to start moving secrets to the Secret Manager.
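
We did this step in Terraform, but the data flow is easy to picture in a few lines of Python. The sketch below is an illustration only: `build_k8s_secret` is a hypothetical helper, and `fetch` stands in for whatever client reads a value from the vault. It relies on the fact that a Kubernetes Secret stores its `data` values base64-encoded.

```python
import base64


def build_k8s_secret(name, namespace, keys, fetch):
    """Build a Kubernetes Secret manifest as a plain dict.

    `keys` maps each key in the Kubernetes Secret to a secret ID in the vault.
    `fetch` is a hypothetical callable that takes a secret ID and returns the
    secret value as a string.
    """
    data = {
        key: base64.b64encode(fetch(secret_id).encode("utf-8")).decode("ascii")
        for key, secret_id in keys.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }
```

The point of the design is that the repository only ever contains secret IDs; the values are resolved at apply time and never committed.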

Sourcegraph was very valuable in situations where the same secret was used in multiple places, such as service accounts. With Sourcegraph we can search code across multiple repositories, giving us certainty that we weren't missing any places where it was used. Sourcegraph allowed us to run wide-ranging searches instead of targeting specific folders or files. Finding all instances of a specific secret allowed us to organize our work better and tackle types of secrets in batches. At this point we didn’t find any secrets our initial searches had missed, reassuring us that we were fully covered.

This was the longest part of the work. Many, many PRs later we had all secrets moved to the Secret Manager. This was a big accomplishment and our team celebrated!

Rotating secrets

It felt great to have these secrets removed from our repositories, but they were all still in our Git history, so we had to treat every one of them as exposed. Now that they lived in the Secret Manager, we needed to rotate them all. The rotation process varied greatly by secret type. We prioritized anything that let us create a new credential, token, or key, deploy it, and then invalidate the old value.
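
The ordering matters: the old value is invalidated only after the new one is in place. A minimal sketch of that sequence follows, with every name hypothetical. The four callables stand in for provider-specific steps, and the `verify` health check is an assumption added for illustration, not something described above.

```python
def rotate(create_new, deploy, verify, revoke_old):
    """Rotate one secret: create, deploy, verify, then revoke the old value.

    All four arguments are hypothetical callables supplied by the caller.
    The old credential is revoked only if the new one is verified to work.
    """
    new_value = create_new()
    deploy(new_value)
    if not verify():
        raise RuntimeError("new credential failed verification; old one left active")
    revoke_old()
    return new_value
```

If verification fails, the old credential stays valid, so a bad rollout degrades to "not yet rotated" rather than an outage.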

For our services to pick up changes to a Kubernetes Secret, we needed to restart the pods that used it. Sourcegraph's complete search across multiple repositories was once again invaluable: it let us confirm that we were restarting every workload that used a given secret and that none was missed. Without it, some services would likely not have been restarted and would later have failed with invalid credentials. Sourcegraph combined with infrastructure-as-code proved to be immensely helpful.
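
Once a search has produced the list of workloads that reference a secret, turning it into restart commands is mechanical. The sketch below assumes a hypothetical `workloads` list (for example, assembled by hand from search results) and assumes the workloads are Deployments; `kubectl rollout restart` is the standard command for restarting a Deployment's pods.

```python
def restart_commands(workloads, rotated_secret):
    """Return kubectl commands restarting every deployment that uses a secret.

    `workloads` is a hypothetical list of dicts with "name", "namespace",
    and "secrets" (the Kubernetes Secret names the deployment references).
    """
    return [
        f"kubectl rollout restart deployment/{w['name']} -n {w['namespace']}"
        for w in workloads
        if rotated_secret in w["secrets"]
    ]
```

Printing the commands rather than running them keeps a human in the loop, which is useful when a restart touches production services.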

Try it yourself!

The searches above targeted our own code, and we wanted to make sure you can use Sourcegraph to search for secrets in your own code.

Using sourcegraph.com and adding your private repositories is the quickest way to try out these searches on your code. You can also run Sourcegraph locally on your computer and run the searches, although it can be resource intensive if targeting many repositories.

Click on this link to open the search on sourcegraph.com, then follow the comments in the query to tune the search for your repositories. You can also request a demo.

This project highlights just one of the ways Sourcegraph can support security efforts across a large codebase with a lot of history. Keep an eye out for future blog posts about other ways to use Sourcegraph to protect your business, or let us know if you've found any ways of your own!
