This implements an algorithm that determines the similarity between two Strings using an
alignment (edit distance) approach. In this special case of alignment, the distance between
numbers is regarded as zero, while every other mismatch is penalized with a cost of 1, as is
each deletion or insertion.
The edit distance is then transformed into a similarity by taking 1 - distance / max{|a|, |b|},
that is, 1 minus the number of costly edit operations used relative to the worst case. In the
worst case the whole first sequence is replaced by the whole second sequence using only
mismatches and is then lengthened or shortened as necessary, which costs max{|a|, |b|}. The
result can be interpreted as a confidence value that the two Strings represent the same content.
Author:
Benjamin Paassen - bpaassen(at)techfak.uni-bielefeld.de
This implements an algorithm that determines the similarity between two Strings using an
alignment (edit distance) approach. In this special case of alignment, the distance between
numbers is regarded as zero, while every other mismatch is penalized with a cost of 1, as is
each deletion or insertion.
The edit distance is then transformed into a similarity by taking 1 - distance / max{|a|, |b|},
that is, 1 minus the number of costly edit operations used relative to the worst case. In the
worst case the whole first sequence is replaced by the whole second sequence using only
mismatches and is then lengthened or shortened as necessary, which costs max{|a|, |b|}. The
result can be interpreted as a confidence value that the two Strings represent the same content.
Parameters:
a - the first string
b - the second string
Returns:
a confidence value (between 0 and 1) that the two Strings represent the same content
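Example:
The description above can be illustrated with a minimal sketch. It is not the project's actual implementation. The class name StringSimilaritySketch and the method names are hypothetical, and two assumptions are made that the description does not confirm: "numbers" is read as single digit characters, so replacing one digit with another costs 0, and two empty Strings are given a similarity of 1.

```java
// Hypothetical sketch of the described similarity, not the original source code.
final class StringSimilaritySketch {

    /** Cost of aligning character x with character y. */
    static int substitutionCost(char x, char y) {
        if (x == y) {
            return 0; // a match is free
        }
        // Assumption: "the distance between numbers is zero" means digit vs. digit costs 0.
        if (Character.isDigit(x) && Character.isDigit(y)) {
            return 0;
        }
        return 1; // any other mismatch costs 1
    }

    /** Returns 1 - distance / max{|a|, |b|}, a value between 0 and 1. */
    static double similarity(String a, String b) {
        int n = a.length();
        int m = b.length();
        int max = Math.max(n, m);
        if (max == 0) {
            return 1.0; // assumption: two empty Strings count as identical
        }
        // Standard edit distance dynamic program, keeping only two rows.
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // j insertions turn the empty prefix of a into b[0..j)
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i; // i deletions turn a[0..i) into the empty prefix of b
            for (int j = 1; j <= m; j++) {
                int substitute = prev[j - 1] + substitutionCost(a.charAt(i - 1), b.charAt(j - 1));
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev; // the row just filled becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return 1.0 - (double) prev[m] / max;
    }
}
```

Under these assumptions, similarity("abcd", "abc") is 0.75 (one deletion relative to a maximum length of 4), and similarity("page 12", "page 47") is 1.0 because the only differences are digit substitutions.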
Copyright (C) 2013, 2014 Raphael Dickfelder, Jan Göpfert, Benjamin Paassen, Andreas Stöckel, licensed under the AGPL v. 3: http://openresearch.cit-ec.de/projects/scie