laarcnew | comments | discord | tags | ask | show | place | submitlogin
Two-Way String-Matching (1991) [pdf] (researchgate.net)
5 points by burntsushi to compsci research pdf 629 days ago | 2 comments



A variant of this algorithm is what glibc uses for its implementation of memmem[1]. I say "variant" because it uses a shift table in some cases.[2]

I found this original paper describing the algorithm to be notable, in particular for how rigorous it is. I've read it a couple of times so far, but I still don't grok everything. One thing I'm particularly interested in is trying to simplify the algorithm so that you don't need to handle the short and long period cases differently. In particular, the paper hints that this is possible:

> The string-matching algorithm of Section 2 uses the period of the pattern. A previous computation of this period is possible. This precomputation can be made easily by using Knuth, Morris, and Pratt’s algorithm on the pattern. This precomputation, combined with both the computation of a critical factorization of the pattern and our string-matching algorithm, leads to an algorithm that is globally linear in time and space. But, since the string-matching algorithm operates in constant space, it is desirable to improve on the precomputation of the period of the pattern, with the aim of obtaining a global algorithm that is in linear time and constant space. There are two ways to achieve that goal. One is a direct computation of the period by an algorithm operating in linear time and constant space. Such an algorithm is described in [12]. It can also be obtained as a consequence of the results in [17]. Our approach here is different. We shall describe a modification of our string-matching algorithm that avoids the use of the period of the pattern, giving an algorithm that is linear time and constant space on the whole. The point is that the exact value of the period of the pattern is actually needed only when it is less than half the length of the pattern. When the period of the pattern is larger, an even simpler string-matching algorithm is provided.

But they don't seem to explain why they went the route of approximating the period even when they cite algorithms that satisfy their desired complexity constraints (linear time, constant space). So my next task is to read the cited papers and hopefully discern for myself.

[1] - https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memmem.c;h=4bf733f1f03cb27c289bd6dc61590909bb0eefdf;hb=HEAD

[2] - https://sourceware.org/git/?p=glibc.git;a=blob;f=string/str-two-way.h;h=b5011baafa77a2d211598be246657b9a33fd8a2e;hb=HEAD#l401

-----

4 points by rain1 628 days ago

Very good stuff, I learned about the duality between borders and periods and I understand how the KMP algorithm requires O(m) (pattern length) space to construct the table, but the critical factorization only takes up 2 integers O(1) so it's a much more space efficient matching algorithm.

-----




Welcome | Guidelines | Bookmarklet | Feature Requests | Source | Contact | Twitter | Lists

RSS (stories) | RSS (comments)

Search: