.. _library_string_distance:

``string_distance``
===================

This library provides string distance predicates with support for
different string representations: atoms, character lists, or character
code lists.
The predicates are defined in the ``string_distance(Representation)``
parametric object, where ``Representation`` can be one of:

- ``atom`` - strings are represented as atoms
- ``chars`` - strings are represented as lists of characters
- ``codes`` - strings are represented as lists of character codes

The parameter must be bound when sending messages to the object.

API documentation
-----------------

Open the
`../../apis/library_index.html#string_distance <../../apis/library_index.html#string_distance>`__
link in a web browser.

Loading
-------

To load all entities in this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(string_distance(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(string_distance(tester)).

Algorithms
----------

This library implements the following string distance algorithms:

- Levenshtein distance is probably the most well-known. It counts
  the minimum number of single-character edits (insertions, deletions,
  or substitutions) needed to transform one string into another. For
  example, the distance between "kitten" and "sitting" is 3.
- Damerau-Levenshtein distance extends Levenshtein by also allowing
transpositions of two adjacent characters as a single edit operation.
This makes it more practical for catching common typos like "ab" ->
"ba".
- Hamming distance only works on strings of equal length and counts
the number of positions where the corresponding characters differ.
It's very fast but limited by that equal-length constraint, making it
useful for things like comparing binary codes or fixed-format
identifiers.
- Jaro and Jaro-Winkler distance produce a similarity score between
0 and 1 rather than a raw edit count. Jaro considers matching
characters and transpositions, while Jaro-Winkler adds a
prefix-weighting boost: the idea being that strings matching at the
start are more likely to be the same (useful for name matching).
- Edit similarity is a normalized version of the edit distance,
defined as 1 - (edit distance / max length of the two strings). It
produces a similarity score between 0 and 1. It can be computed from
Levenshtein, Damerau-Levenshtein, Hamming, or Longest Common
Subsequence distances.
- Longest Common Subsequence (LCS) finds the length of the longest
sequence of characters that appears in both strings in the same
relative order (not necessarily contiguous). It's heavily used in diff
tools for comparing files.
- Longest Common Substring is similar to LCS but requires the shared
  characters to be contiguous. It is useful when contiguous blocks of
  shared text matter, rather than scattered matches.
- Cosine Similarity shifts the approach from character edits to
vector-based comparison. Strings are converted into vectors (often
using character n-grams or word tokens), and then the cosine of the
angle between them is computed. It's widely used in NLP and
information retrieval.
- Jaccard Index compares two sets of tokens (often words or n-grams)
by dividing the size of their intersection by the size of their union.
Simple and intuitive, though it ignores term frequency.
- Soundex/Metaphone are phonetic algorithms rather than strict
string comparisons. They encode strings based on how they sound, so
"Smith" and "Smythe" would match. Useful for name deduplication where
spelling varies but pronunciation is similar.
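
The edit-distance measures above are straightforward to sketch in a few
lines of ordinary code. The following Python functions are illustrative
only (they are not this library's Logtalk API) and implement
Levenshtein, Damerau-Levenshtein (the common "optimal string alignment"
variant), Hamming, and LCS length by dynamic programming:

```python
def levenshtein(a, b):
    # Classic dynamic programming: prev[j] holds the distance
    # between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (ca != cb))  # substitution
        prev = curr
    return prev[len(b)]


def damerau_levenshtein(a, b):
    # Levenshtein extended with transposition of two adjacent
    # characters counted as a single edit.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def hamming(a, b):
    # Only defined for strings of equal length.
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def lcs_length(a, b):
    # Length of the longest (not necessarily contiguous)
    # common subsequence, again by dynamic programming.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            curr[j] = prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]
```

For instance, ``levenshtein("kitten", "sitting")`` returns 3, matching
the example above, while ``damerau_levenshtein("ab", "ba")`` is 1 where
plain Levenshtein gives 2.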
General advice: use Levenshtein or Damerau-Levenshtein for
general-purpose edit distance, Jaro-Winkler for short strings like
names, LCS for diff-style comparisons, and cosine/Jaccard when working
at the word or document level rather than character level.
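
The normalized and token-based measures can be sketched in the same
spirit. This Python illustration (again independent of this library's
API) uses character bigrams as the tokens for the set- and vector-based
measures, and takes the edit distance as a parameter for the normalized
similarity:

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    # Character 2-grams; word tokens would work the same way.
    return [s[i:i + 2] for i in range(len(s) - 1)]


def jaccard(a, b):
    # |intersection| / |union| over bigram sets; frequencies are ignored.
    sa, sb = set(bigrams(a)), set(bigrams(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine_similarity(a, b):
    # Cosine of the angle between bigram count vectors.
    va, vb = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def edit_similarity(a, b, distance):
    # 1 - distance / max length, for any edit distance passed in.
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - distance(a, b) / m
```

For example, ``jaccard("night", "nacht")`` is 1/7: only the bigram
"ht" is shared out of seven distinct bigrams across the two strings.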