Diffest 2

A Visibility Jam 2024 exploration.

Introduction

Trees are found in multiple places in software, among them:

Your filesystem (mostly) follows a tree structure.
Various file types for data exchange or configuration organize their content in a tree, for example xml, json, yaml, ...
Code is structured in an abstract syntax tree.

In all of these places two versions of the same tree need to be compared from time to time.

What if you want to back up the rest of your vacation photos but have reorganized your directory structure since the last backup? With bad software, you'll end up saving a bunch of photos twice, wasting storage space. With better software, every photo will be backed up only once but your folders will be a mess combining elements of both the old and new directory structures. Good software could automatically figure out what went were and make it easy to apply the changes to your backup drive.

What if you spilled your box of DevOps and need to compare the resulting mess with a known good version to get your production Kubernetes cluster back up and running? I mean, it sounds like the clock is ticking, you'd better have some good software to help you do the comparison. This is especially true in the case of yaml where tree structure is both essential and hard to get right in so many edge cases that it's disingenuous to even call them edge cases.

What if you build a code-aware diff tool during a programming jam in an online community that is building software by hand but realize that it produces ugly results whenever the abstract syntax tree does abstract syntax tree things? Is that just me? Okay, maybe that one's just me.

In mathematics, the problem of comparing two trees has been considered in more theoretical approaches from a bunch of angles. Trees are a subset of all graphs in graph theory. The computer scientist's favorite directed acyclic graphs are a thing between graphs and trees and have many uses in computer science. However they are to broad to consider here. Generic graph-theory trees however are still a strict superset of the trees that appear in software. The software trees additionally have the constraint that one node is defined as the root and/or all edges are directed. (I'm saying "and/or" here because smart math people don't agree among themselves and aren't sure about this either. Some believe that one implies the other.) Trees in set theory are pretty similar to trees in software but come with more complicated notations. In both types of trees the children of a node are generally not in any particular order. In software they often are ordered, though sometimes implicitly (for example sorted alphabetically). In this project I'm assuming an explicit order.

With filesystem trees generally not compared as trees, config trees compared as such but not generalizable to other tree types, and math being complicated as always, I believe there is space for my little exploration to find some new land or at least signpost land that was already found. I don't know which it is, either.

With the theory and the math behind us, and with the Why handled, let's get to the How.

The Algorithm

The first stage is Preparation. Here the tree gets parsed into a series of nodes. Every node contains its content (which is simply a string) and a list of child nodes. I can now pass around the node of the root of the tree and access the entire tree through it. To simplify working with the nodes later on, every node gets a pointer to its parent; in case of the root node that pointer is nil. Additionally I take the chance to save a hash of its content and the content of its children in every node, forming a Merkle tree.

In order to prepare for the second stage, the tree is now flattened. I iterate the tree in depth-first order and put all nodes into a linked list.

In the second stage the Comparison between the two linked lists happens. I follow a modified form of the Wagner-Fischer algorithm to explore the space of possible solutions. If you haven't heard of that algorithm before, you might have talked about edit distances in general or the Levenshtein distance in particular in your algorithms and data structures course - those are all forms of the same thing. While Wagner and Fischer would want us to build up the whole matrix, that would imply a space- and time-complexity of O(n*m), and I really don't want to. Following the principle of sunk cost I instead decide to abandon possible solutions once it becomes clear they won't be winning. I don't succumb to the fallacy and cut my losses once first results are in.

More math: I haven't fully thought through the complexity values of this solution, but it feels like this:

In worst case, we're worse than Wagner-Fischer, so let's not talk about that

In best case, our time-complexity is O(max(n,m)) and our space complexity a constant factor on top of that

In average case we're somewhere in-between

It probably makes a difference in practice.

How do we get all these optimizations? The idea is simple: We path-find our way through the Wagner-Fischer matrix. I'm using A* for this. That way, I'm not limited by the strictness of the Wagner-Fischer matrix which would allow me to implement wormholes, short-cuts through the solution space. Explored elements are recorded in merged nodes that note the nodes of both original trees that were reached, the last operation to get here, and the accumulated cost until here. Using a priority queue, the currently-shortest path (ie. cheapest solution) is explored one step further. If it finds a series of elements that are equal in both original nodes, it greedily travels down that path, getting it ever closes to the goal. Once I reach the goal, I've found a possible solution. That then gets passed on to the next stage.

The third stage handles Applying Penalties to the found solutions. Once a solution is found I backtrack through the solution space and apply penalties to structures that imply there might be a better solution close-by. Right now, these are:

a merged node that only contains one original node (meaning it describes a deleted or inserted node) but has children
a merged node where both original nodes are the same but the depths in their respective trees changed

Since the penalty makes the solution's cost worse, this stage is actually included in the previous one. Only once the exploration of the solution space has exhausted all options cheaper than the currently cheapest solution (including penalty), the stage is finished. It returns all solutions it found in that time, together with their costs.

Clean Up is what awaits the solutions in the fourth stage. As it comes from the comparison stage, the solution is backwards. Just reversing the list would be too simple, however. Both the comparison operation and its cost needs to be moved one element down to avoid having off-by-one errors in the list. Additionally, in case of a newly inserted b node, the a node would be a repeat from the last element, so I can set it to nil. Same for a deleted node. Also I can lob off the last element in the list since it basically amounts to the strings null terminator. Technically only the cheapest solution would need to be cleaned, because only the cheapest solution is returned, but I want to print all found solutions to the console, for debugging purposes, so I clean them all.

The fifth stage, Treeifying, is the most complex one. It's not terrifying although the code might look that way. The goal is to turn that list of merged nodes back into a tree. The first node of the list gets added as the root node. The original nodes inside that first merged node were both root nodes of their respective trees, so the merged node gets that same privilege. If a merged node that needs to be added only has one original node, we can find that node's parent in the tree and add the new node there. If the merged node we want to add has two original nodes, it gets more complicated. I start from the node added last and walk upwards, looking for the parent of both the a and b original nodes. I look at their position relative to each other and the nodes in-between to determine the best place to put the new node. In some cases I have to insert additional deleted or inserted merged nodes to which I then attach the new node so that both original nodes have the parents they expect. For well-behaved test cases we can add all merged nodes from the list that way and get a resulting tree that, at least, didn't lose any data since stage one.

In-between stages here would be a great place for some post-processing steps.

The sixth and last stage handles Marking nodes. The content hashes I calculated many stages ago now come in handy. I build up a hash map mapping from hash to a list of merged nodes whose original nodes that have that hash. If that list only contains one merged node, that merged node only exists once and is therefore not interesting. If that list contains two merged nodes, one of them will probably contain a deleted original node and the other an inserted original node. Together those two form a node that was moved. This is the important information we're looking for and both merged nodes get a new mark number, counting up from 1. If that list contains three or more merged nodes, there has been some duplication in one or both original trees. The merged nodes get marked up all the same.

Before we get to visualization, here is another great place to insert some post-processing.

Finally all the results get turned into a webpage so they can be looked at easily.

Results

Adding or removing leaf nodes works as expected. These are all "Merged Trees", just the results of individual comparisons.

Adding or removing a node in the middle of the tree (a branch node) also works as one would expect.

If we move a node, the algorithm correctly detects that node as deleted in one place and inserted in another. The two nodes are marked as being one and the same. This not only works with leaf nodes, but with entire subtrees, even if it looks not as good as it could be in the latter case.

Since the main comparison is happening on a flattened list, the information about the rest of the tree is not taking into account at that stage. This can lead to correct results that non-the-less seem strange. In the screenshot below I show both the original trees (left and right) and the resulting tree (merged).

This is not an algorithmic problem but purely a presentational one. The algorithm could however output a result that would then look better.

Back in the comparison stage, the algorithm can also output some nodes as edited if the optimal solution would involve them being edited. Editing in this case is renaming of a non-leaf node, like renaming a directory.

These show up neatly in-line in the resulting merged tree.

Improvements

I've got ideas for improvements for basically every part of the algorithm. Here are some of them:

Implement better path finding. Worst case complexity at Wagner-Fischer levels should be possible.
If a big subtree is moved, I could make it cheaper to do it at once compared with moving every item individually. This adds a shortcut the algorithm exploring the solution space could take and use if advantageous.
Adding penalty costs to a solution after it is found could use some fine-tuning and potentially adding more penalties once they come up as useful.
As already mentioned, if a node at the end of a parent changes depth we run into edge cases with the algorithm. Depending on how much better move-handling gets, it might make sense to add a deleted and a new inserted node instead.

Many improvements are related to the visualization of the comparison. This topic could have been a whole project by itself. For example:

Instead of having a node that was moved show twice, once deleted and once inserted, add an arrow to make the action clearer. Marking the two places as related only helps a bit.
If the tree doesn't have an explicit order, nodes could be reordered to make the result look less intimidating.

A bunch of topics worthy of a deep-dive in a future exploration.

Conclusion

Doing the comparison on a flattened version of the tree worked well. Turning the comparison back into a tree was the most complex part of the algorithm and we got a few edge cases stemming from this approach, but the entire package ends up less complicated than trying to do the comparison while the trees still looked like trees. The entire thing is a series of steps with distinct artifacts passed between them, that makes it not only easier to parallelize but also helps with understanding the algorithm.

There are certainly a number of inputs that make the tool crash because of null pointer exceptions or deliberately unhandled code, so the algorithm would need to be taught about more edge cases before it could be used productively. There should also be some way for it to fall back to a more naive version of the algorithm in case it gets to some local maximum it can't get itself back out of.

All in all though, I think this exploration is a good effort in looking closer at this challenging problem.

For more, check the Activity below for the updates I gave during the jam.