How Graphtage Works

In general, optimally mapping one graph to another cannot be executed in polynomial time [1], and is therefore not tractable for graphs of any useful size [*]. This is true even for restricted classes of graphs like DAGs [2]. However, trees and forests are a special case that can be mapped in polynomial time, with reasonable constraints on the types of edits possible. Graphtage exploits this.

Why Mapping Trees is Complex

Ordered nodes in the tree (e.g., JSON lists) and, in particular, mappings (e.g., JSON dicts) are challenging. Most extant diffing algorithms and utilities assume that the structures are ordered. Take this JSON as an example:

Original JSON

Modified JSON

{
    "foo": [1, 2, 3, 4],
    "bar": "testing"
}
{
    "foo": [2, 3, 4, 5],
    "zab": "testing",
    "woo": ["foobar"]
}

Existing tools effectively canonicalize the JSON (e.g., sort dictionary elements by key and format lists with one item per line), and then perform a traditional diff:

$ cat original.json | jq -M --sort-keys > original.canonical.json
$ cat modified.json | jq -M --sort-keys > modified.canonical.json
$ diff -u original.canonical.json modified.canonical.json
 1{
 2-  "bar": "testing",
 3   "foo": [
 4-    1,
 5     2,
 6     3,
 7-    4
 8-  ]
 9+    4,
10+    5
11+  ],
12+  "woo": [
13+    "foobar"
14+  ],
15+  "zab": "testing"
16}

Not entirely useful, particularly if the input files are large. The problem is that changing dict keys breaks the diff: Since “bar” was changed to “zab”, the canonical representation changes and they are considered separate edits (lines 2 and 15 of the diff).

Matching Ordered Sequences

Graphtage matches ordered sequences like lists using an “online” [3], “constructive” [4] implementation of the Levenshtein distance metric [5], similar to the Wagner–Fischer algorithm [6]. The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered. This is implemented in the graphtage.levenshtein module.

Matching Unordered Collections

Dicts are matched by solving the minimum weight matching problem [7] on the complete bipartite graph from key/value pairs in the source dict to key/value pairs in the destination dict. This is implemented in the graphtage.matching module.

Footnotes