graphtage.levenshtein

An “online”, “constructive” implementation of the Levenshtein distance metric.

The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered.

levenshtein classes

EditDistance

class graphtage.levenshtein.EditDistance(from_node: TreeNode, to_node: TreeNode, from_seq: Sequence[TreeNode], to_seq: Sequence[TreeNode], insert_remove_penalty: int = 1)

Bases: SequenceEdit

An edit that computes the minimum sequence of sub-edits necessary to transform one node to another.

The edits used to transform the source sequence to the target sequence are graphtage.Match, graphtage.Remove, and graphtage.Insert.

The algorithm works by iteratively constructing the Levenshtein matrix one diagonal at a time, starting from the upper left cell and ending at the lower right cell. Each successive call to EditDistance.tighten_bounds() constructs a new diagonal of the matrix and fully tightens the bounds of its edits. Once the lower right cell is expanded, the matrix is complete and the optimal sequence of edits can be reconstructed.

Bounds of this edit are updated after each diagonal is added by observing that the final cost is bounded above by the minimum cost of an edit in the last-expanded diagonal. This results in a monotonically decreasing upper bound.

__init__(from_node: TreeNode, to_node: TreeNode, from_seq: Sequence[TreeNode], to_seq: Sequence[TreeNode], insert_remove_penalty: int = 1)

Initializes the edit distance edit.

Parameters:
  • from_node – The node that will be transformed.

  • to_node – The node into which from_node will be transformed.

  • from_seq – A sequence of nodes that comprise from_node.

  • to_seq – A sequence of nodes that comprise to_node.

  • insert_remove_penalty – The penalty for inserting or removing a node (default is 1).

__iter__() Iterator[Edit]

Returns an iterator over this edit’s sub-edits.

Returns:

The result of AbstractCompoundEdit.edits()

Return type:

Iterator[Edit]

__lt__(other)

Tests whether the bounds of this edit are less than the bounds of other.

_debug_tighten_bounds() bool

Adds debugging assertions when tightening bounds; for debugging only

bounds() Range

Calculates bounds on the cost of this edit.

If the Levenshtein matrix has been fully constructed, return the cost of the lower right cell.

If the matrix is incomplete, then use super().bounds().lower_bound as the lower bound and the minimum cost in the last completed matrix diagonal as the upper bound.

Returns:

The bounds on the cost of this edit.

Return type:

Range

edits() Iterator[Edit]

Returns an iterator over this edit’s sub-edits

from_node: TreeNode
has_non_zero_cost() bool

Returns whether this edit has a non-zero cost.

This will tighten the edit’s bounds until either its lower bound is greater than zero or its bounds are definitive.

initial_bounds: Range
is_complete() bool

An edit distance edit is only complete once its Levenshtein edit matrix has been fully constructed.

on_diff(from_node: EditedTreeNode)

A callback for when an edit is assigned to an EditedTreeNode in TreeNode.diff().

This default implementation adds the edit to the node, and recursively calls Edit.on_diff() on all of the sub-edits:

from_node.edit = self
from_node.edit_list.append(self)
for edit in self.edits():
    edit.on_diff(edit.from_node)
Parameters:

from_node – The edited node that was added to the diff

print(formatter: GraphtageFormatter, printer: Printer)

Prints this edit.

This is equivalent to:

formatter.get_formatter(self.sequence)(printer, self.sequence)
property sequence: SequenceNode

Returns the sequence being edited.

This is a convenience function solely to aid in automated type checking. It is equivalent to:

typing.cast(SequenceNode, self.from_node)
tighten_bounds() bool

Tightens the bounds of this edit, if possible.

If the Levenshtein matrix is not yet complete, construct and fully tighten the next diagonal of the matrix.

property valid: bool

Returns whether this edit is valid

levenshtein functions

levenshtein_distance

graphtage.levenshtein.levenshtein_distance(s: str, t: str) int

Canonical implementation of the Levenshtein distance metric.

Parameters:
  • s – the string from which to match

  • t – the string to which to match

Returns:

The Levenshtein edit distance metric between the two strings.

Return type:

int