Going beyond regular expressions with structural code search


We're introducing a new way to search code at Sourcegraph with structural code search. Structural code search lets you match nested expressions and whole code blocks that can be difficult or awkward to match using regular expressions.

Structural code search is the idea that you can search for syntactic structures in code that correspond more closely to a program's underlying concrete syntax tree (or parse tree). For example, for loops in Rust look something like this:

for var in expression {
    code
}

The code block can contain nested for loops, if statements, and so on. To match all of the code block contents for these expressions, and search for patterns inside them, a search engine must understand that code exists inside the balanced braces { ... }. Regular expressions can go a long way to match such syntactic structures but they are not ideal. In practice we use parsing to interpret and convert syntax for nested expressions like {...} into trees, which encode richer structural properties than the textual representation.

Nested expressions figure Figure 1: Nested expressions can expand inside code blocks. Parsing converts nested expressions into tree data structures.

Most code search today is not based on true parsing or tree data structures. Instead, we use literal strings or regular expressions which is good enough for many kinds of searches. But these methods make it tricky to match precisely on blocks or expressions that can expand inside statements like the loop in Figure 1. We could more easily and precisely search for richer syntactic patterns if today's search tools also treated code as syntax trees, and that's the key idea behind structural search.

Many neat developer and compiler tools already exist for querying or matching tree structures (see additional resources at the end of this post!). But none are available at your fingertips, just seconds away from running on some of today's largest and most popular codebases. That is why we are happy to announce that Sourcegraph now supports a first release of structural search available at scale, for nearly every language, directly from your browser.

Examples! Show me examples!

Let's look for things in the Linux kernel. After all, why not show structural search working on one of the largest and most popular projects in open source software?

One important function is copy_from_user, which copies content from userspace memory into the kernelspace memory. This function has a history of careful auditing because incorrect uses can (and have) lead to vulnerabilities. We can find all copy_from_user calls with a query like copy_from_user(:[args]). Try it live (the toggle means structural search is active):

The :[args] syntax is a structural hole that matches all text between balanced parentheses. The args part is just a descriptive identifier. We support comby syntax, which is currently the underlying engine behind structural search. You can find out more about the match syntax in our usage docs, but for now it's enough to just follow along this blog post!

Of course, we could have run a simpler regex search for the prefix with something like copy_from_user( and get results more quickly, and sometimes that's the right thing to do.

But in other cases we can do more interesting things with structural search that become awkward otherwise. For example, one result for the above query is:

copy_from_user(&txc.tick, &txc_p->tick, sizeof(struct timex32) -
			   offsetof(struct timex32, tick))

where the argument spans multiple lines. By default, :[hole] syntax matches across multiple lines just like code structures can. An interesting thing about the call above is that it calculates the size of memory using sizeof(...) - .... Calculating and checking the size of memory to copy can be more error-prone than simple or static values. So, one thing we could check is whether there are other calls that calculate the size of memory in a similar way to the above, using subtraction and sizeof:

This query breaks up the original :[args] hole into holes for the destination buffer dst, source buffer src, and the calculation for the memory size. The :[_] syntax is just a hole that we don't particularly care to name. This query finds just a handful of results, so this is a rather uncommon pattern in the code base!

Structural search can also identify patterns to clean up. For example, one cleanup patch in the kernel looks like this:

Linux clean up patch

Here's a query to easily find more of these patterns:

This time, we're using the same identifier :[x] twice, to make sure that the argument is the same for both list_del and list_add calls. We could choose any identifier, except for :[_], which is just a placeholder. The whitespace between the calls is interpreted to possibly include newlines, so there's no issue matching these calls across newlines.

Do note that structural search is purely syntactic, so there are some matches that cannot cleaned up:

	if (!list_empty(&page->lru))
		list_del(&page->lru);

	list_add(&page->lru, &pool->lru);

In this case, the problem is that the list_del call is in scope of the if block. However, the majority of matches carry our intended meaning. In the future, we are introducing rules to refine queries further, giving you greater control for avoiding unintended matches.

Structural search works on practically all languages, and understands the basic syntactic structures for them (like strings, comments, and code blocks). Here's a short list that gives just a taste of some patterns you can try out:

Java. Find try-catch-finally statements where the catch statement has no body (the catch clause could be omitted)

Python. Find old-style string formatted print statements

Rust. Find chained filter(...).next() that could simplified to .find(...) (based on lint)

๐Ÿ”Ž .filter(:[_]).next() and .filter(:[_]) .next() (the latter matches across newlines)

ReactJS. Look for opportunities to optimize away arrow functions (see the React FAQ)

Go. Find .type(...) switches that contain a nil: case

Clojure. Find cond expressions with an :else nil branch at the end

Dart. Find Image.asset constructors in the Flutter API where width is initialized to 100

You're ready to try it yourself

*Here are a couple of things to keep in mind

  • Structural search on sourcegraph.com is only enabled for roughly the top 10,000 most popular repositories on GitHub, and on the most recent commit of the default branch. We plan to open it up to all mirrored repositories and you can help make that happen faster.

  • You can set up Sourcegraph locally for your own code or any other repository you'd like and get all of the structural search goodness. Running locally will also likely be faster than using sourcegraph.com.

  • You might be running structural search for the first time on a repo! ๐Ÿ˜Ž If your query times out, give the page a refresh because we're probably warming up the cache for you.

  • See our usage documentation for more help and the comby FAQ for more details and known limitations of the matching engine.

*A quick note on regular expressions: How is structural search different?

Structural search is not a replacement for regexp search. It's another tool in your toolkit that works well for matching blocks of code or expressions, and simplifies catching buggy syntactic patterns. If you only want to find a simple string or pattern, consider using Sourcegraph's literal or regexp search, because these queries are typically much faster! For a more detailed breakdown, see the short comparison at the end of this post.

We have more features and improvements planned. If you want to see any of these features arrive more quickly, please +1 the related issue tracker so that we can prioritize our engineering to deliver them sooner!

  • Structural search enabled for all mirrored repositories. We want structural search to be available for your repository, and not just the really popular ones. +1 this feature on GitHub
  • Regular expression support in structural holes. We want to add support for regexp syntax inside holes for refine search, like \d+ to match only numerical digits. +1 this feature on GitHub
  • Make it faster. Structural search is typically slower than our regexp search because it does more work. If you find it valuable, we want to make it faster. +1 this feature on GitHub
  • A richer query language. There are ways to refine structural search with rules allowing richer queries. To get a taste of rules, have a look at this CactusCon talk (subtitles recommended if audio is hard to follow). If you have a use case for this and want support sooner, +1 this feature on GitHub.

That's it for this blog post. You'll find some additional resources and discussion below if you're interested in reading more. Happy searching!


Additional resources

There is an immense amount of existing parsing and query tools for syntax trees. Most compilers today offer a library or visitor framework, and linters or static analyzers may build on them to implement checks. Here is a non-exhaustive short list of tools related to structural search and matching that you may be familiar with or find interesting:

  • IntelliJ IDE support for structural search and replace, or SSR [1]
  • Coccinelle for C code [1]. Our list example above is taken from the Coccinelle work, and there are many more patterns to browse through and try out!
  • Sgrep, or Syntactical grep (multiple languages) [1], [2]
  • tree-sitter, parsing and query framework (multiple languages) [1]
  • gogrep for declaratively matching Go syntax trees [1]
  • Spoon for declaratively matching Java code [1], [2]
  • Spoofax, AST querying using the Spoofax Language Workbench (multiple languages) [1]
  • CodeQL, querying tree and graph properties for a number of poular languages [1]

At Sourcegraph we're continually looking to improve developer tools, and to integrate richer search functionality. If you find these tools or others valuable, share your thoughts with us at [email protected].


Here are some key differences and comparisons to regexp-based text search:

  • Structural search is language-aware. For example, it understands certain pieces of syntax for code blocks, string delimiters, and comments. The language can be forced by specifying the lang: filter. If omitted, we perform a best-effort to infer the language based on matching file extensions, or fall back to a generic structural matcher.

  • :[hole] syntax matches across multiple lines by default.

  • Whitespace matching is fuzzy: a space in the pattern will match contiguous whitespace including newlines in the code.

  • Delimiters like {}, [], () are expected to always be balanced (depending on language). For example, a dangling parenthesis in Java is considered a syntax error and can't be matched. A dangling delimiter in the pattern implies a syntax error (prefer regexp search if you want to match dangling delimiters).

  • Built-in equality constraints when using the same identifier in patterns like foo(:[x], :[x]). This is similar to, e.g., backreferences in regular expressions.

  • No explicit support for matching regexp character classes like \d+ yet (see planned improvements).

For a complete overview, refer to comby.dev.


Feedback


ย ย 
ย