Using LLMs for static analysis is a bad idea

I've spent 4 years on this problem. Here's why it's not as easy as you might think.

If you've tried GitHub Copilot, you'd think LLMs would be great at detecting bad code with static analysis. If it can generate code, why not detect problems in it, right?

For the uninitiated: static analysis is a technique of "analyzing" source code without actually executing it (hence "static") to find bad patterns in it. Most commonly, the source code is converted to its abstract syntax tree (AST) representation, which you can then "walk" through.
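To make that concrete, here's a rough sketch of parsing a snippet into an AST and walking it. It uses acorn and acorn-walk purely as example libraries; any JavaScript parser and walker would work the same way.

```js
// Parse a snippet of JavaScript into an AST, then walk the tree.
// acorn + acorn-walk are just one example of a parser/walker pair.
const acorn = require("acorn");
const walk = require("acorn-walk");

const source = "if (x == null) { doSomething(); }";

// The parser produces a tree of nodes:
// Program -> IfStatement -> BinaryExpression -> ...
const ast = acorn.parse(source, { ecmaVersion: 2022 });

// Visit every BinaryExpression node in the tree.
walk.simple(ast, {
  BinaryExpression(node) {
    console.log(`Found a '${node.operator}' comparison at offset ${node.start}`);
  },
});
```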

If you know a bad pattern, you can write an AST walker that looks for it in the tree and raises an error when it's found. For instance, in JavaScript, using === is preferred over == for equality comparisons. Here's how eslint (partly) implements this check.

[Image: implementation of the eqeqeq rule in eslint]
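The real rule handles several options and edge cases, but a stripped-down sketch of an eslint rule for this check (not the actual eslint source) looks roughly like this:

```js
// A minimal eqeqeq-style rule. The real eslint rule also handles
// options like "smart" and "allow-null", typeof comparisons, and autofixing.
module.exports = {
  meta: {
    type: "suggestion",
    messages: {
      unexpected: "Expected '{{expected}}' and instead saw '{{actual}}'.",
    },
  },
  create(context) {
    return {
      // eslint calls this visitor for every BinaryExpression node in the AST.
      BinaryExpression(node) {
        if (node.operator === "==" || node.operator === "!=") {
          context.report({
            node,
            messageId: "unexpected",
            data: { expected: node.operator + "=", actual: node.operator },
          });
        }
      },
    };
  },
};
```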

As the rule is hand-written, detection is deterministic and highly accurate. You can handle edge cases and false-positive scenarios so you don't see noise. But most importantly, you're relying on an explicit, unambiguous declaration by eslint that this is a bad pattern. Developers trust eslint because it's reliable in this sense. All the rules are hand-written, each backed by someone's opinion of what is and isn't a bad pattern: in this case, the eslint community (which derives from the broader JS community).

That's not the case with LLMs.

LLMs learn from publicly available code. They generate code suggestions based on what they've already seen in the wild (in this case, millions of open-source repositories). It's pattern matching, mixed with your local context, and that's what makes the experience feel magical. But there are infinite ways of writing a function, so if you get a wrong suggestion, it doesn't matter: you can move on to the next one to see if something else works, or discard it and write the code yourself.

It's just a SUGGESTION. Take it or leave it.

[Image: a false positive thrown by Copilot]

But static analysis, unlike suggestions, is JUDGMENT. If you rely on learning from open-source projects, how will you know something is a bad pattern? Open-source is full of examples of bad patterns, even in some popular repos. Who decides which pattern is wrong? The LLM can't. That's why "AI-based" static analysis tools have historically failed so spectacularly: their false-positive rates have been too high (beyond 35% for tools that have used AI for this).

As a developer, would you use a tool that shows you false warnings all the time?

In 2018, Google published a paper on their experience with implementing static analysis for Java. They found that the maximum acceptable false-positive rate for a static analysis tool to be useful is 10%. LLMs, by the very nature of how they learn, cannot promise that level of accuracy (at this moment, at least). As an aside, you cannot leave the judgment of what is or isn't a good pattern to unsupervised learning.

We've spent the last 4 years building a static analysis tool that guarantees less than 5% false positives. If you're interested in learning more, DM me, and I'd love to share more details on how we do this!

Check out deepsource.io. 🙌🏼