strings: the Sourcegraph blog

A collection of characters, stories, and other elements

Code search turned code checker

Rijnard van Tonder

I find code checkers like linters and lightweight static analyzers most valuable when they teach me better ways to code in a language or framework. For example, the Go staticcheck tool finds expensive string comparisons like:

if strings.ToLower(string1) == strings.ToLower(string2) {
  ...
}

and suggests instead:

if strings.EqualFold(string1, string2) {
  ...
}

These short-and-sweet replacements are a great way to learn framework idioms or library functions, like strings.EqualFold in Go.1 And as a codebase grows, small inefficiencies like this one, and inconsistencies opportunities compound. Code patterns creep in that affect readability and performanceβ€”and it matters.

Running code checks quickly, easily, and comprehensively

Tools like staticcheck and linters need setting up: a local clone or project build on your machine, or a continuous integration (CI) or editor set up. When I learn about patterns like strings.EqualFold, I want to know where else they might be lurking: in my code, in my team's code, or in open source projects. To do that I really need a lightweight workflow, not something that needs cloning repositories, CI or editor setup. Too much hassle. What I'm really after is a nimble way to find patterns in a bunch of code over many repositories, quickly and comprehensively. Something that feels a lot more like searching code than running analyzers.

Naturally, a plaintext search with grep can find snippets of EqualFold calls. In practice though, this plaintext treatment can't offer the fidelity of dedicated checkers that understand more about code structure and type information. But I believe there's a midway. What about a lightweight wokflow where that EqualFold check is a simpler but comparably effective search query that could completely eradicate all those inefficient comparisons in my code, my organization's code, or even all of open source code?

Example: turning code checks into code search queries

Earlier this year Sourcegraph introduced structural search to search over code syntax. Structural search uses comby to implement a basic building block in traditional code checkers: it interprets programs as concrete syntax trees, not just plaintext. Using file filters and our new support for or clauses, it's possible to write configurable code checks as self-contained search queries. Let's explore this idea!

Here's a search query inspired by a check where strings.Index comparisons can be replaced with strings.Contains:

language:go
not file:test
not file:vendor
not file:Godeps

strings.Index(..., ...) > -1

or

strings.Index(..., ...) >= 0

or

strings.Index(..., ...) != -1

This query matches all .go files, excluding file paths that contain test, vendor, and Godeps. It's sensible to exclude these file paths if we want to actually propose changes to a project (more on that later). The patterns strings.Index(..., ....) match syntax of strings.Index calls, and the ... ellipses are special placeholders that match at least two arguments.2 The or keyword separates individual patterns.

We can search over the top 100 Go repositories on GitHub (by stars) those by adding repogroup:go-gh-100 to the query. Have a look at some of the results:

πŸ”˜ Find ways to improve code in popular Go projects β†—Side note: our multiline query editor is in a proof-of-concept phase.

Some search hits in projects like Go and Kubernetes where a simpler strings.Contains can be used instead. The query finds matches in some of the most popular Go projects in a couple of seconds. An exhaustive search shows that there are more than 10 matches at the time of writing. For this flavor of syntactic change, I have a good sense that these are real hits of code that we can fix up.

Turning more code checks into search queries

Because structural search only looks at syntax, it can't yet operate at the level of a tool like staticcheck, which knows more about static properties like type information and variable scope to implement checks. At the same time, staticcheck isn't a search tool, it's an entire toolset that includes a suite of pre-written, high-precision checks that's very effective in certain workflows, like CI. The question is not necessarily whether a search tool can achieve parity with a tool like staticcheck. But given the overlap with now-expressible search queries, I wanted to know how this search workflow stacks up: how far can we push structural code search to find similarly actionable code checks? I.e., checks that match real cases of code waiting to be improved, minus the hassle.

Approach

So, taking inspiration from staticcheck, I wanted to see how many of its checks translate to search queries that I could have high confidence in (i.e., all patterns find legitimate issues; zero or very-near-zero false positives). I chose to look at staticcheck for its clear documentation, which made it easy to find examples.3 I ran my search queries against staticcheck's own test files to check that they don't match unintended patterns (false positives) and don't miss real patterns (false negatives). Each check may have more than one syntactic variant, so I tried to implement patterns for as many variants as I could find in tests. It's a neat exercise to develop patterns against the reference tests and discover which variants to cover, all in a self-contained search webapp. Here's an example where the query matches all the true hits in the test file, annotated with // want strings.Contains ...:

πŸ”˜ Example query to match known patterns in test filesβ†—

Using the test files isn't a guarantee that I've implemented all the checks or that it's entirely precise, but it adds a lot more confidence than inferring patterns only from documentation.

Results

Without making any claims about completeness, I was able to implement at least one variant for 20 out of 34 checks that I feel confident about. I relied only on patterns in staticcheck's test data to discover syntactic variants for checks, so I don't know if I covered all the patterns that staticcheck implements in its code.

The majority of checks that I couldn't write required type information to implement correctly (13 of 34). Other checks I couldn't write required more complex syntax matching rules (8 of 34). This table roughly quantifies the expressive needs for implementing checks:

Description# of checks
Total34
Works (at least one variant)20
Works (all variants)11
Needs type info13
Needs additional syntax matching8
Note that some checks require type info and additional syntax matching. Also, one working variant for a check may not require type info, but another variant for the same check may. I.e., the values overlap and do not sum to Total.

You can see and run the final query here:

πŸ”˜ Run the query for all code checks on popular Go projects β†—


Side note: patterns run in order and search stops after finding at least 30 matches. Feel free to delete some patterns to see others in action.

It was tempting to compare more directly by downloading the 100 Go repositories to disk, running staticcheck, and comparing matches. For staticcheck to be effective, the project typically needs to be built first (my experience was that running staticcheck on individual files can be hit-and-miss, and understandably so). I didn't like the idea of doing all that work, so I punted. That's really the takeaway of these results: many useful checks can fit into a lightweight search workflow that scales and generalizes to a lot of projects, without ad-hoc setup for project builds and dedicated tools.5

Extending search queries to access static properties like type information is a also natural extension for writing better code checks, and an area of code intelligence work that we're exploring.

Code checks beyond dedicated tools

The direct comparison to staticcheck is interesting, but the ease of a search workflow means there are also different benefits over dedicated tools. For example, in just the last couple of weeks I learned about a more elegant way to write code for appending bytes in Go. A pattern like this:

b = append(b, 'f', 'o', 'o')

can instead be written:

b = append(b, "foo"...)

This one isn't available in staticcheck, but I could immediately implement the check and find hits using Sourcegraph:

πŸ”˜ Find appends of three or more bytes β†—

There's also the notion of project-specific checks that will never make it into a general tool like staticcheck. For example, in our Sourcegraph codebase we've mostly moved away from testing values with != and instead use cmp.Diff. But I know there are still places where we do comparisons the old way. This query highlights some of the remnants, and how how the inner block uses the compared values:

πŸ”˜ file:test if want != got {...} β†—

So code checks can also be specific and customizable to your organization that you won't find in off-the-shelf tools.

Takeaway

Code search can be so much more than finding your favorite provider called ChocolateCake identifier↗. It can be a lightweight workflow for revealing short-and-sweet ways to make your code better. Perhaps the most powerful idea is that a universal code search can wholesale find and eradicate the code slips we're always bound to make.

Starting small, the solution could look like a search query that finds issues in active and popular Go projects at just the push of a button. Why not start there? I'm excited about the idea that we can run that query for everything at once:

πŸ”˜ Run all the code checks on popular Go projects β†—


The great thing about code checks as queries is that it's easy to simply delete patterns that we don't find as valuable. When I explored some individual queries, it was also reassuring to discover that no patterns occur in any of the Go repositories. For example, S1035 checks that there are unneeded http.CanonicalHeaderKey calls on the first argument of certain functions:

headers.Add(http.CanonicalHeaderKey(...), ...) or
headers.Del(http.CanonicalHeaderKey(...)) or
headers.Get(http.CanonicalHeaderKey(...)) or
headers.Set(http.CanonicalHeaderKey(...), ...)

There are no matches β†— for this pattern in the Go repositories, but there are some matches β†— in vendored files, when we we remove the not file:vendor filter.

Proposing changes

The most exciting part of finding real hits with any code checking tool is that we discover an actionable way to improve the code! I've narrowed down the search query so that it (hopefully) finds only real and uncontroversial patterns to improve and contribute to active Go projects.

Part of an effective contribution means that we need to avoid matches in files that are tests, vendored, or external dependencies. While anyone can use the results to make contributions, it's important to be mindful of contributor guidelines, clearly communicate and motivate proposed changes in pull requests, and validate that the change passes a project's tests or CI checks. And, while the query does exclude common test and vendored files, it's best to check that matches occur in files that really are part of the project.

If you're interested in turning the results of the complete query into open source contributions, e-mail me at rijnard@sourcegraph.com, I can help.

What's next for search queries

There are many languages and code checkers to explore, and this post only skims the surface. We want to make high-quality code checks available for your projects and languages. If you have a project or query in mind, feel free to open an issue on GitHub with your thoughts. We're still working on performance and usability enhancements so that you can search your code and not just popular projects, so stay tuned! In the meantime, you can find an all-in-one config for replacing the patterns on your local machine at comby-tools/go-patterns. If you want to make large scale, org-wide changes, talk to us about Batch Changes. And if you find code checking valuable and want to learn more about our work at Sourcegraph, reach us at feedback@sourcegraph.com. Happy searching!


1 Another one of my favorites tools using the same principle is Kibit for Clojure.


2 See the structural search reference for more details about special syntax.


3 These are all the simple static check of the form S<number>, excluding SA and ST checks.


4 There's plenty of opportunity to pick other tools or languages, I picked staticcheck because its the nicely illustrative.


5 Also, it probably makes sense for these projects to run staticcheck in their CI, and a quick search helps make it more obvious which ones do not πŸ˜›.


Acks: Thanks @lguychard, @thorstenball, @stefanhengl @beyang, @sqs for feedback on the content of this post.