Thursday, 26 November 2015

co.combinatorics - Finding a set with the maximum number of finite alphabet strings within a fixed Levenshtein distance of one-another

Please consider the set of all possible strings of some finite size $M$ alphabet $Sigma$, $alpha$ $= a_1, a_2, ..., a_k, ..., a_n$, of length $|alpha| = L$. The Levenshtein distance (or 'edit distance') between any two strings, $D(a_{k1},a_{k2})$ = $p$, can be defined as the number of single-character insertions, deletions, or substitutions necessarily to transform one string, $a_{k1}$, into another, $a_{k2}$. We are forbidding the transposition operation, giving the $L = 3$ example: $D("abc", "cba")$ = $2$.



Let $(s_{p1}, s_{p2}, ..., s_{pk}, ..., s_{pm}) in S_p$, be a particular set that has the largest possible number of strings over the finite alphabet within a fixed Levenshtein distance, $p$, of one-another. Can we find a particular instance of $S_p$, or perhaps more easily $||S_p||$, for an arbitrarily sized alphabet, $Sigma$, and an arbitrary string length, $L$?



Perhaps this is more of a definition, but we know (trivially) that $||S_0|| = 1$. It seems to be true that $||S_1||$ increases in a linear way with $L$ and $M$. But there doesn't appear to be a nice growth relationship for $||S_p||$ as a function of $M$ and $L$ if $p > 1$.



Update - For the provided data, $||S_2||$ appears to have a closed-form solution of $4*(L-1)^2$ for a three-character alphabet, and a closed-form solution of $frac{3}{2}(5*(L-1)^2+(L-1))$ for a four-character alphabet.



Update - Does anyone know if this problem is in class NP, similar to the complexity of generating many types of error-correcting codes? I was hoping the case here would be different, as we're asking for less information...




A computer search yields:



For $L = 1$:
$$eqalign
{&(||S_0||,||S_1||)=cr
&qquad (1, 1) sim M = 2 cr
&qquad (1, 2) sim M = 3cr
&qquad (1, 3) sim M = 4cr}
$$



For $L = 2$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||)=cr
&qquad (1, 2, 1) sim M = 2cr
&qquad (1, 4, 4) sim M = 3cr
&qquad (1, 6, 9) sim M = 4cr}
$$



For $L = 3$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||)=cr
&qquad (1, 3, 4, 1) sim M = 2cr
&qquad (1, 6, 16, 8) sim M = 3cr
&qquad (1, 9, 33, 27) sim M = 4cr}
$$



For $L = 4$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||)=cr
&qquad (1, 4, 9, 4, 1) sim M = 2cr
&qquad (1, 8, 36, 35, 16) sim M = 3cr
&qquad (1, 12, 72, 130, 81) sim M = 4cr}
$$



For $L = 5$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||)=cr
&qquad (1, 5, 16, 11, 5, 1) sim M = 2cr
&qquad (1, 10, 64, 124, 80, 32) sim M = 3cr
&qquad (1, 15, 126, 410, 438, 243) sim M = 4cr}
$$



For $L = 6$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||,||S_6||)=cr
&qquad (1, 6, 27, 28, 15, 6, 1) sim M = 2cr
&qquad (1, 12, 100, 306, 302, 192, 64) sim M = 3cr
&qquad (1, 18, 195, 942, 1938, 1458, 729) sim M = 4cr}
$$



For $L = 7$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||,||S_6||,||S_7||)=cr
&qquad (1, 7, 40, 60, 39, 21, 7, 1) sim M = 2cr
&qquad (1, 14, 144, 612, 1030, 712, 448, 128) sim M = 3cr
&qquad (1, 21, 279, 1807, 5803, 6966, 5103, 2187) sim M = 4cr}
$$



For $L = 8$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||,||S_6||,||S_7||,||S_8||)=cr
&qquad (1, 8, 56, 113, 97, 56, 28, 8, 1) sim M = 2cr
&qquad (1, 16, 196, 1074, 2716, 2575, 1792, 1024, 256) sim M = 3cr
&qquad (1, 24, 378, 3086, 13716, 28310, 23414, 17496, 6561) sim M = 4cr}
$$



For $L = 9$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||,||S_6||,||S_7||,||S_8||)=cr
&qquad (1, 9, 74, 194, 216, 135, 84, 36, 9, 1) sim M = 2cr
&qquad (1, 18, 256, 1724, 5956, 8315, 6309, 4608, 2304, 512) sim M = 3 cr}
$$



For $L = 10$:
$$eqalign
{&(||S_0||,||S_1||,||S_2||,||S_3||,||S_4||,||S_5||,||S_6||,||S_7||,||S_8||)=cr
&qquad (1, 10, 95, 311, 448, 321, 210, 120, 45, 10, 1) sim M = 2cr}
$$

No comments:

Post a Comment