graphtage.levenshtein
An “online”, “constructive” implementation of the Levenshtein distance metric.
The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered.
levenshtein classes
EditDistance
- class graphtage.levenshtein.EditDistance(from_node: TreeNode, to_node: TreeNode, from_seq: Sequence[TreeNode], to_seq: Sequence[TreeNode], insert_remove_penalty: int = 1)
Bases:
SequenceEdit
An edit that computes the minimum sequence of sub-edits necessary to transform one node to another.
The edits used to transform the source sequence to the target sequence are
graphtage.Match
,graphtage.Remove
, andgraphtage.Insert
.The algorithm works by iteratively constructing the Levenshtein matrix one diagonal at a time, starting from the upper left cell and ending at the lower right cell. Each successive call to
EditDistance.tighten_bounds()
constructs a new diagonal of the matrix and fully tightens the bounds of its edits. Once the lower right cell is expanded, the matrix is complete and the optimal sequence of edits can be reconstructed.Bounds of this edit are updated after each diagonal is added by observing that the final cost is bounded above by the minimum cost of an edit in the last-expanded diagonal. This results in a monotonically decreasing upper bound.
- __init__(from_node: TreeNode, to_node: TreeNode, from_seq: Sequence[TreeNode], to_seq: Sequence[TreeNode], insert_remove_penalty: int = 1)
Initializes the edit distance edit.
- Parameters:
from_node – The node that will be transformed.
to_node – The node into which
from_node
will be transformed.from_seq – A sequence of nodes that comprise
from_node
.to_seq – A sequence of nodes that comprise
to_node
.insert_remove_penalty – The penalty for inserting or removing a node (default is 1).
- __iter__() Iterator[Edit]
Returns an iterator over this edit’s sub-edits.
- Returns:
The result of
AbstractCompoundEdit.edits()
- Return type:
Iterator[Edit]
- __lt__(other)
Tests whether the bounds of this edit are less than the bounds of
other
.
- bounds() Range
Calculates bounds on the cost of this edit.
If the Levenshtein matrix has been fully constructed, return the cost of the lower right cell.
If the matrix is incomplete, then use
super().bounds().lower_bound
as the lower bound and the minimum cost in the last completed matrix diagonal as the upper bound.- Returns:
The bounds on the cost of this edit.
- Return type:
- has_non_zero_cost() bool
Returns whether this edit has a non-zero cost.
This will tighten the edit’s bounds until either its lower bound is greater than zero or its bounds are definitive.
- is_complete() bool
An edit distance edit is only complete once its Levenshtein edit matrix has been fully constructed.
- on_diff(from_node: EditedTreeNode)
A callback for when an edit is assigned to an
EditedTreeNode
inTreeNode.diff()
.This default implementation adds the edit to the node, and recursively calls
Edit.on_diff()
on all of the sub-edits:from_node.edit = self from_node.edit_list.append(self) for edit in self.edits(): edit.on_diff(edit.from_node)
- Parameters:
from_node – The edited node that was added to the diff
- print(formatter: GraphtageFormatter, printer: Printer)
Prints this edit.
This is equivalent to:
formatter.get_formatter(self.sequence)(printer, self.sequence)
- property sequence: SequenceNode
Returns the sequence being edited.
This is a convenience function solely to aid in automated type checking. It is equivalent to:
typing.cast(SequenceNode, self.from_node)