Diffest

A Visibility Jam 2023 exploration.

Introduction

My research question for this project was

What if your diff tool worked with code, not text?

When I say "diff tool" I mean anything that can compare two different versions of a code base and visualize the changes between those versions. There are various command-line tools that can compare two files or directory trees and output the result in multiple formats. Graphical tools can visually show you changes by highlighting modified text. Your version control system and your software forge both have this ability built in and your IDE shows colored bars in the editor gutter signifying which lines have changed since the last commit. What's common between all these tools is that they are only working on the text in the file and not on the code which that text represents.

In my "I'm in"-post when I announced the project I would be working on (for the full post see the activity below), I contrasted two types of tools:

those only working with the plain text your source code is in and
those that transform that plain text into some higher-level and more useful data structures

Diff tools have mostly been in the former category. With Diffest I tried to explore where a diff tool of the latter type could be helpful. A diff tool that is (at least in some capability) aware of the code you write.

In that post I also listed six advantages a code-aware diff tool could have. Let me repeat those here and provide a bit more context for each:

Look past changes in formatting to focus on what's actually relevant. This one seems pretty obvious at first: You only want to review code that was modified and not any formatting changes that might also have happened. Especially with an auto-formatter in your IDE adding some formatting changes to your commit is really easy to do. However, formatting is not just whitespace or indentation, any of the following can also be irrelevant formatting changes:
- trailing commas
- superfluous semicolons
- ordering of imports
Group related changes so renaming a constant is one change instead of several dozen. Only reviewing the renaming of some variable once and then having the code-aware diff tool see all other places where the same variable was renamed as also reviewed would save lots of time, especially in large refactoring commits. Ideally this could also be extended to any change that happened multiple times, such as
- the signature of an interface method changing
- a (magic) number being changed to a constant
- accessing a field directly changed to a get method
- or whatever else your IDE helped you refactor
Examine the entire code base instead of handling every file individually. This will let us easily see when code is moved from one file to another, something no text-based diff is good at since they all only compare individual files. If you rename a file that gets picked up generally but otherwise? You wouldn't even need a code-aware diff tool for that, it's like the lowest of low-hanging fruit.
Track where code came from and where it was moved to. Now that we are aware of the entire code base, finding the bigger patterns, the broad strokes of what happened is way easier.
Present the entirety of the changes in a sensible and ordered way. Ideally the person proposing the changes would do this for you. Practically you get one commit with eight different types of smaller changes and you need to unravel them yourself. If your diff tool could tell you that your coworker renamed that constant, updated that method signature, introduced an interface there, and then show you the three fixed typos, that would sure improve all our lives.
And finally, figure out what changes weren't even made. How often have you changed something and then figured out that you didn't update a comment that now contains old information. With some string-matching at least some of these cases could be caught.

And so, for the jam, I wanted to make advances on at least some of these points.

A deep-dive through the internals

Let's see where I got to and let me walk you through everything that Diffest does.

It compares source code by working on the syntax trees that source code represents. It takes an abstract syntax tree (AST) and enriches it with nodes from a concrete syntax tree (CST) and raw bytes from the source code. The resulting syntax tree still follows the form of an AST but contains additional nodes to ensure every byte from the source code is captured. For the nodes comprising the AST I'm using the term black nodes while everything added is a gray node.

I'm using Tree Sitter to parse the source code. Tree Sitter produces an AST and gives me the option to get the byte slice every node in the tree came from, even for non-leaf nodes. It distinguishes two types of nodes:

named nodes which contain the relevant node-content of the AST and
anonymous nodes which contain fixed tokens and mandatory punctuation

I'm iterating that resulting tree and transforming it to my own node type which contains a few more goodies that will come in handy later. I'm also splitting the two types given by Tree Sitter into my black and gray nodes.

At this point however, not every byte from the source file is accounted for. I fix this now by introducing even more gray nodes, these new ones contain mostly optional whitespace and punctuation. As a last step before the syntax tree is finished I hash the content of every node (ie. the subtree starting at the node) which will make it faster to compare them later.

Now I've got one syntax tree per file. Next I combine all these by building a new syntax tree following the file and directory structure of the code base I have and attaching the existing syntax trees on the leaves.

With this I've reduced the number of trees to two. And these then get diffed. I take the two root nodes and calculate an edit distance between the list of children. To ensure that I don't try to match up two nodes of completely different types I upgraded the algorithm so that it only tries to edit between two nodes of the same type. While matching the two lists I also take the text-based edit distance between the content of the nodes into account. Nodes that I've determined need to be deleted or inserted I recursively walk through the relevant tree, quickly adding the nodes into my result. I then recursively step into nodes that should be edited, ie. where something differs between the two subtrees and perform the entire thing again. All the while I mark the nodes of the resulting (compared) tree as equal, deleted, inserted, or edited.

The first post-processing steps concerns the gray nodes that were marked as changed. Lots of these gray nodes are just irrelevant formatting changes so I find and unmark all these. Everywhere an entire black node is marked as changed all gray nodes somewhere below can inherit that marking. This removes noisy markings on whitespace while keeping relevant markings intact. After all, if you add an entire function, all the whitespace inside that function was also added.

The second post-processing step looks for common changes. I start by building out a map from the hash of the content to a list of nodes that have that content. The content here is the raw bytes of that node and not connected to the information whether it was added, removed, or not changed. Every node that exists only once is obviously not a common change and gets removed right away. Next I check that every node with the same content hash actually had some change. For example: You have a variable named i and use it 12 times. If one of these nodes was added but 11 weren't changed, that entire group is not interesting. Often there are several of these groups nested inside each other. I remove every group but the biggest that contains all instances of the given change. In order to try to increase the size of the common change marking even more I check if the previous or next siblings of the current group all match and if so expand the group. And finally, once I have found the perfect common change groups, I mark them all with unique numbers so that I can then visualize them.

That was the last part, the only thing left is to show the result now. I did this as HTML, highlighting deleted and inserted code with red and green backgrounds as you would expect. Common groups I displayed with colored boxes around the changes.

The good, the bad, and the ugly

Looking at the output Diffest produces, it can be split into three different categories. They're maybe no all quite the same size but they are all relevant.

The good

Diffest is able to look at your entire code base. It can track moved code and relate changes to each other across all files. That works reliably and across code bases consisting of several programming languages.

If you rename a constant that change is neatly shown in-line. Changes that happen several times across the code base (like renaming a constant) are related to each other and marked as such. These marked changes are made to encompass the biggest amount of context possible.

Tokens are treated as the smallest unit. You don't have random individual letters flying around that are somehow present in both states of the codebase. Numbers are one thing to compare and not a series of digits.

The bad

There are quite a number of cases where Diffest produces a result comparable to a naive text diff. For only a small sub-set of these that diff is the best possible; for most of these cases a diff tool should be able to do better.

There are also some cases where the diff makes sense, per se, but in order for it to be useful you'd need to have a deep understanding of the syntax tree both versions of the code produce. Generally you're looking at diffs when you're reviewing code, so you're looking at code you're not yet understanding.

The ugly

I've got so many test cases where the diff that Diffest produces is down right ugly. For half of the test cases I made at the beginning of the jam, I've not shared any results because the so-called code-aware diff is literally worse than the most naive text diff possible. Like, in some cases it's worse than not having a diff at all.

Anywhere indentation changes the hope to get a useful diff goes right out the window. Introduced a new namespace? Nope. Guarding some code in an if? Not helpful. Refactoring something out into a method? Get ready to look at a text-based diff.

Additionally doing two things at the same time is a recipe for disaster. Moving some code and renaming a variable in there? You'll have to put the pieces together yourself.

The future

As we've seen, comparing trees level-by-level leaves a lot to desire. Having code that changes depth is really common across any code change and a new iteration of this tool should produce more useful tree diffs.

What I've built now doesn't scale well. A code-aware diff tool is more useful the bigger the code base is, so having it work on huge code bases as well is paramount. The bigger the code base, the more changes there are to track. Especially code that was moved from one place to another and changed during the move should be recognized up to some level.

The way I mark changes as belonging together by then grouping these markings again or combining non-marked changes with marked ones. The output of that effort could then be used to create some coherent story of the changes as something the reviewer can work along.

Also, with some text matching and clever searching through the syntax trees a tool could find relevant code that should have been changed but wasn't. At least seeing if the old name of a renamed constant is still somewhere in a comment is doable.

Finally, everything up to now assumed that we only got two states of the code base to compare. Let's check the version control system, there are probably more states or a common ancestor to include and guide the output.

All in all, a lot of stuff worthy of future explorations.

Conclusion

Fusing the abstract and concrete syntax trees with the actual bytes from the source code worked out. It gave me a tree structure with several levels of detail which were very useful when comparing two trees.

Both while comparing the trees and during the various post-processing steps I'm iterating through the trees multiple times. Several of these are nested and I probably should have cached some stuff somewhere in between. The runtime complexity has long surpassed $O(n^2)$ and is probably approaching $O(scary)$. Additionally I would still classify Diffest as a project that falls apart if you look at it wrongly.

Check the Activity below for the updates I gave during the jam. They include pictures.