Commit 70b5056

---
yaml --- r: 136444 b: refs/heads/dist-snap c: d1bcd77 h: refs/heads/master v: v3
1 parent 1236525 commit 70b5056

2 files changed

+88
-11
lines changed

[refs]

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ refs/heads/try: 189b7332968972f34cdbbbd9b62d97ababf53059
 refs/tags/release-0.1: 1f5c5126e96c79d22cb7862f75304136e204f105
 refs/heads/ndm: f3868061cd7988080c30d6d5bf352a5a5fe2460b
 refs/heads/try2: 147ecfdd8221e4a4d4e090486829a06da1e0ca3c
-refs/heads/dist-snap: e9db8adebb9fe9c7f65266127fca926ff736b740
+refs/heads/dist-snap: d1bcd771a0c4c79e3952c27ece02ee81fdae8cf8
 refs/tags/release-0.2: c870d2dffb391e14efb05aa27898f1f6333a9596
 refs/tags/release-0.3: b5f0d0f648d9a6153664837026ba1be43d3e2503
 refs/heads/try3: 9387340aab40a73e8424c48fd42f0c521a4875c0

branches/dist-snap/src/libcore/str.rs

Lines changed: 87 additions & 10 deletions
@@ -419,8 +419,76 @@ struct TwoWaySearcher {
     memory: uint
 }

-// This is the Two-Way search algorithm, which was introduced in the paper:
-// Crochemore, M., Perrin, D., 1991, Two-way string-matching, Journal of the ACM 38(3):651-675.
+/*
+This is the Two-Way search algorithm, which was introduced in the paper:
+Crochemore, M., Perrin, D., 1991, Two-way string-matching, Journal of the ACM 38(3):651-675.
+
+Here's some background information.
+
+A *word* is a string of symbols. The *length* of a word should be a familiar
+notion, and here we denote it for any word x by |x|.
+(We also allow for the possibility of the *empty word*, a word of length zero.)
+
+If x is any non-empty word, then an integer p with 0 < p <= |x| is said to be a
+*period* for x iff for all i with 0 <= i <= |x| - p - 1, we have x[i] == x[i+p].
+For example, both 1 and 2 are periods for the string "aa". As another example,
+the only period of the string "abcd" is 4.
+
+We denote by period(x) the *smallest* period of x (provided that x is non-empty).
+This is always well-defined since every non-empty word x has at least one period,
+|x|. We sometimes call this *the period* of x.
+
+If u, v and x are words such that x = uv, where uv is the concatenation of u and
+v, then we say that (u, v) is a *factorization* of x.
+
+Let (u, v) be a factorization for a word x. Then if w is a non-empty word such
+that both of the following hold
+
+  - either w is a suffix of u or u is a suffix of w
+  - either w is a prefix of v or v is a prefix of w
+
+then w is said to be a *repetition* for the factorization (u, v).
+
+Just to unpack this, there are four possibilities here. Let w = "abc". Then we
+might have:
+
+  - w is a suffix of u and w is a prefix of v. ex: ("lolabc", "abcde")
+  - w is a suffix of u and v is a prefix of w. ex: ("lolabc", "ab")
+  - u is a suffix of w and w is a prefix of v. ex: ("bc", "abchi")
+  - u is a suffix of w and v is a prefix of w. ex: ("bc", "a")
+
+Note that the word vu is a repetition for any factorization (u, v) of a non-empty
+word x = uv, so every such factorization has at least one repetition.
+
+If x is a string and (u, v) is a factorization for x, then a *local period* for
+(u, v) is an integer r such that there is some word w such that |w| = r and w is
+a repetition for (u, v).
+
+We denote by local_period(u, v) the smallest local period of (u, v). We sometimes
+call this *the local period* of (u, v). Provided that x = uv is non-empty, this
+is well-defined (because each factorization of a non-empty word has at least one
+repetition, as noted above).
+
+It can be proven that the following is an equivalent definition of a local period
+for a factorization (u, v): any positive integer r such that x[i] == x[i+r] for
+all i such that |u| - r <= i <= |u| - 1 and such that both x[i] and x[i+r] are
+defined (i.e. i >= 0 and i + r < |x|).
+
+Using the above reformulation, it is easy to prove that
+
+    1 <= local_period(u, v) <= period(uv)
+
+A factorization (u, v) of x such that local_period(u, v) = period(x) is called a
+*critical factorization*.
+
+The algorithm hinges on the following theorem, which is stated without proof:
+
+**Critical Factorization Theorem** Any non-empty word x has at least one critical
+factorization (u, v) such that |u| < period(x).
+
+The purpose of maximal_suffix is to find such a critical factorization.
+
+*/
 impl TwoWaySearcher {
     fn new(needle: &[u8]) -> TwoWaySearcher {
         let (crit_pos1, period1) = TwoWaySearcher::maximal_suffix(needle, false);
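
The definitions in the background comment translate almost directly into brute-force checks. The following is a minimal sketch in current stable Rust (not the pre-1.0 dialect used in this diff). The function names `period`, `local_period` and `critical_cuts` are illustrative and are not part of `str.rs`; a factorization (u, v) is represented by its cut index |u|.

```rust
/// Smallest period of a non-empty word `x`: the least p in 1..=|x| such that
/// x[i] == x[i + p] for every i with 0 <= i <= |x| - p - 1.
fn period(x: &[u8]) -> usize {
    assert!(!x.is_empty(), "period(x) is only defined for non-empty x");
    (1..=x.len())
        .find(|&p| (0..x.len() - p).all(|i| x[i] == x[i + p]))
        .expect("|x| is always a period of x")
}

/// Smallest local period of the factorization (x[..cut], x[cut..]), using the
/// index-based reformulation: x[i] == x[i + r] for all i in |u| - r ..= |u| - 1
/// for which both x[i] and x[i + r] are defined.
fn local_period(x: &[u8], cut: usize) -> usize {
    assert!(!x.is_empty() && cut <= x.len());
    (1..=x.len())
        .find(|&r| {
            let lo = cut.saturating_sub(r); // clamp so that i >= 0
            // Indices with i + r >= |x| are skipped because x[i + r] is undefined.
            (lo..cut).all(|i| i + r >= x.len() || x[i] == x[i + r])
        })
        .expect("|x| is always a local period")
}

/// All cut positions |u| whose factorization (u, v) is critical,
/// i.e. local_period(u, v) == period(x).
fn critical_cuts(x: &[u8]) -> Vec<usize> {
    let p = period(x);
    (0..=x.len()).filter(|&cut| local_period(x, cut) == p).collect()
}
```

For the word "abaab" this yields period 3 and critical cuts 2 and 4, that is ("ab", "aab") and ("abaa", "b"). Only the first satisfies |u| < period(x), which is the kind of factorization the theorem promises.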
@@ -436,15 +504,19 @@ impl TwoWaySearcher {
             period = period2;
         }

+        // The byteset below isn't part of the original algorithm, as far as I'm aware.
         let byteset = needle.iter()
                             .fold(0, |a, &b| (1 << ((b & 0x3f) as uint)) | a);

-        // The logic here (calculating crit_pos and period, the final if statement to see which
-        // period to use for the TwoWaySearcher) is essentially an implementation of the
-        // "small-period" function from the paper (p. 670)
+        // A particularly readable explanation of what's going on here can be found
+        // in Crochemore and Rytter's book "Text Algorithms", ch 13. Specifically
+        // see the code for "Algorithm CP" on p. 323.
         //
-        // In the paper they check whether `needle.slice_to(crit_pos)` is a suffix of
-        // `needle.slice(crit_pos, crit_pos + period)`, which is precisely what this does
+        // We have some critical factorization (u, v) of the needle, and we want
+        // to determine whether u is a suffix of v.slice_to(period). If it is,
+        // we use "Algorithm CP1". Otherwise we use "Algorithm CP2", which is
+        // optimized for when the period of the needle is large.
+        // The comparison below performs exactly that suffix check.
         if needle.slice_to(crit_pos) == needle.slice(period, period + crit_pos) {
             TwoWaySearcher {
                 crit_pos: crit_pos,
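
The `byteset` fold in this hunk builds a 64-bit mask with one bit per value of `b & 0x3f` that occurs in the needle. The sketch below restates that computation in current stable Rust and adds a membership test. How the searcher actually consults the mask is not shown in this diff, so `may_contain` is a hypothetical helper, and `u64` is an assumption standing in for the original `uint`.

```rust
/// Builds a 64-bit mask: bit k is set iff some needle byte b has (b & 0x3f) == k.
fn byteset(needle: &[u8]) -> u64 {
    needle.iter().fold(0u64, |acc, &b| acc | (1u64 << (b & 0x3f)))
}

/// Hypothetical helper: returns false only if `byte` definitely does not occur
/// in the needle. A true result may be a false positive, because bytes that
/// share their low six bits map to the same mask bit.
fn may_contain(set: u64, byte: u8) -> bool {
    (set >> (byte & 0x3f)) & 1 != 0
}
```

With the needle "abc", `may_contain` rejects `b'z'` but accepts `b'!'`, since `'!'` (0x21) and `'a'` (0x61) agree in their low six bits. A mask like this can therefore only rule bytes out, never confirm a match.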
@@ -466,6 +538,11 @@ impl TwoWaySearcher {
         }
     }

+    // One of the main ideas of Two-Way is that we factorize the needle into
+    // two parts, (u, v), and begin trying to find v in the haystack by scanning
+    // left to right. If v matches, we try to match u by scanning right to left.
+    // How far we can jump when we encounter a mismatch is all based on the fact
+    // that (u, v) is a critical factorization for the needle.
     #[inline]
     fn next(&mut self, haystack: &[u8], needle: &[u8], long_period: bool) -> Option<(uint, uint)> {
         'search: loop {
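
The scanning order described in the comment above can be illustrated with a deliberately simplified search. This is a sketch under stated assumptions, not the code of `next`: the critical factorization is found by brute force rather than by `maximal_suffix`, the `memory` and `byteset` optimizations are omitted (so no linear-time bound is claimed), and the names `period`, `local_period` and `find_sketch` are illustrative. The two shift rules (past the mismatch when v fails, by the period when u fails) rely only on (u, v) being critical.

```rust
/// Smallest period of `x` (brute force, `x` non-empty).
fn period(x: &[u8]) -> usize {
    (1..=x.len())
        .find(|&p| (0..x.len() - p).all(|i| x[i] == x[i + p]))
        .expect("|x| is always a period")
}

/// Smallest local period at cut position `cut` (brute force, `x` non-empty).
fn local_period(x: &[u8], cut: usize) -> usize {
    (1..=x.len())
        .find(|&r| (cut.saturating_sub(r)..cut).all(|i| i + r >= x.len() || x[i] == x[i + r]))
        .expect("|x| is always a local period")
}

/// Index of the first occurrence of `needle` in `haystack`, if any.
fn find_sketch(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    let p = period(needle);
    // A critical cut with |u| < period(needle); the theorem guarantees one exists.
    let crit = (0..p)
        .find(|&c| local_period(needle, c) == p)
        .expect("guaranteed by the Critical Factorization Theorem");

    let mut pos = 0;
    while pos + needle.len() <= haystack.len() {
        // Match v = needle[crit..] scanning left to right.
        let mut j = crit;
        while j < needle.len() && needle[j] == haystack[pos + j] {
            j += 1;
        }
        if j < needle.len() {
            // Mismatch inside v: jump just past the mismatching position.
            pos += j - crit + 1;
            continue;
        }
        // v matched; match u = needle[..crit] scanning right to left.
        let mut i = crit;
        while i > 0 && needle[i - 1] == haystack[pos + i - 1] {
            i -= 1;
        }
        if i == 0 {
            return Some(pos);
        }
        // Mismatch inside u: no occurrence can start within the next p - 1 positions.
        pos += p;
    }
    None
}
```

For example, searching for "abaab" in "bbaabaab" first matches v = "aab" at alignment 0, fails on u, shifts by the period 3, and then reports the occurrence at index 3.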
@@ -520,9 +597,9 @@ impl TwoWaySearcher {
         }
     }

-    // returns (i, p) where i is the "critical position", the starting index of
-    // of maximal suffix, and p is the period of the suffix
-    // see p. 668 of the paper
+    // Computes the maximal suffix v of `arr` under the byte ordering selected by
+    // `reversed`. Returns (i, p), where i is the starting index of v and
+    // p = period(v). The ordering with the larger i gives a critical factorization.
     #[inline]
     fn maximal_suffix(arr: &[u8], reversed: bool) -> (uint, uint) {
         let mut left = -1; // Corresponds to i in the paper
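
As a cross-check on what `maximal_suffix` is documented to return, here is a brute-force reference in current stable Rust. It is a sketch only: `compare`, `period`, `maximal_suffix_naive` and `critical_position_naive` are illustrative names, the meaning given to `reversed` (flipping the byte order) is an assumption based on the parameter name, and taking the larger of the two starting indices follows the construction in the Crochemore and Perrin paper rather than anything visible in this hunk.

```rust
use std::cmp::Ordering;

/// Lexicographic comparison; `reversed` flips the order on bytes, but a proper
/// prefix is still smaller than the longer word.
fn compare(a: &[u8], b: &[u8], reversed: bool) -> Ordering {
    for (x, y) in a.iter().zip(b.iter()) {
        if x != y {
            return if reversed { y.cmp(x) } else { x.cmp(y) };
        }
    }
    a.len().cmp(&b.len())
}

/// Smallest period of a non-empty word (brute force).
fn period(x: &[u8]) -> usize {
    (1..=x.len())
        .find(|&p| (0..x.len() - p).all(|i| x[i] == x[i + p]))
        .expect("|x| is always a period")
}

/// Returns (i, p): i is the start of the maximal non-empty suffix of `arr`
/// under the chosen ordering, and p is the period of that suffix.
fn maximal_suffix_naive(arr: &[u8], reversed: bool) -> (usize, usize) {
    assert!(!arr.is_empty());
    let start = (0..arr.len())
        .max_by(|&a, &b| compare(&arr[a..], &arr[b..], reversed))
        .expect("arr is non-empty");
    (start, period(&arr[start..]))
}

/// Candidate critical position: the larger of the two maximal-suffix starts.
fn critical_position_naive(arr: &[u8]) -> usize {
    let (i1, _) = maximal_suffix_naive(arr, false);
    let (i2, _) = maximal_suffix_naive(arr, true);
    i1.max(i2)
}
```

For "abaab" the two orderings give the suffixes "baab" (start 1) and "aab" (start 2), both with period 3, so the candidate critical position is 2, which corresponds to the factorization ("ab", "aab").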
