yes, all longest regex matches in linear time is possible


in an earlier post i talked about how regex match semantics is a surprisingly overlooked topic. today i want to talk about a related gap that's even more fundamental, and one i've spent a long time thinking about: finding all matches is quadratic, even in "linear time" engines. in other words, even the engines built specifically to avoid this - RE2, Go, rust - break their own guarantees the moment you ask for all matches. for such a well-studied field, this is a strange thing to have gone unaddressed for so long.

finding all matches is a very common thing people - and these days, AI coding agents - do with regexes. it's a fundamental tool for navigating large amounts of data: ctrl+f in your editor, search-and-replace, grep across repositories. all of these need every match, not just the first one. yet almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". in practice, that iteration is where the linear-time guarantee quietly breaks down.

part of the reason is that the theory of regexes boils everything down to a single yes/no question - does this string match or not? that's clean and great for proving theorems, but it throws away nearly everything that matters in practice: where the matches are, how long they are, and how many there are. once you reduce regexes to "match or no match", the all-matches problem simply disappears from view. whether experts are consciously ignoring it or the framing just doesn't surface it as a question, i can't tell - and i won't hold it against them. but it really does make me question the value of those theorems. if you have to throw away the connection to real-world use, what exactly is it a theory of?

every regex engine that advertises linear-time matching - RE2, Go's regexp, rust's regex crate - means linear time for a single match. the moment you call find_iter or FindAll, that guarantee evaporates. the rust regex crate docs are the only ones honest enough to say it outright:

the worst case time complexity for iterators is O(m * n²). [...] if both patterns and haystacks are untrusted and you're iterating over all matches, you're susceptible to worst case quadratic time complexity. There is no way to avoid this. One possible way to mitigate this is to [...] immediately stop as soon as a match has been found. Enabling this mode will thus restore the worst case O(m * n) time complexity bound, but at the cost of different semantics.

the mechanism is simple. take the pattern .*a|b and a haystack of n b's. at each position, the engine tries .*a first - scans the entire remaining haystack looking for an a, finds none, fails. then the b branch matches a single character. advance one position, repeat. that's n + (n-1) + (n-2) + ... = O(n²) work to report n single-character matches. a textbook triangular sum. hit play to see it:

matching .*a|b against "bbbbbbbbbb" - finding all matches
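to make the triangular sum concrete, here's a toy simulation of the "find, advance, repeat" loop for this one pattern - not a real engine, just an accounting of the bytes it inspects (`naive_find_all` is a made-up name):

```rust
// toy simulation of the naive "find, advance, repeat" loop for the
// pattern `.*a|b`. at each start position the engine first tries `.*a`,
// which scans the whole remaining haystack looking for an 'a' before
// giving up, and only then falls back to the single-byte `b` branch.
fn naive_find_all(haystack: &[u8]) -> (usize, usize) {
    let mut matches = 0;
    let mut work = 0; // total bytes inspected
    let mut pos = 0;
    while pos < haystack.len() {
        // `.*a` branch: scan all the way to the end looking for an 'a'
        let rest = &haystack[pos..];
        work += rest.len();
        if let Some(i) = rest.iter().rposition(|&b| b == b'a') {
            // longest match ends at the last 'a'
            matches += 1;
            pos += i + 1;
        } else if rest[0] == b'b' {
            // fall back to the `b` branch: a one-byte match
            matches += 1;
            pos += 1;
        } else {
            pos += 1; // no match here, advance
        }
    }
    (matches, work)
}

fn main() {
    // n b's: n matches, but n + (n-1) + ... + 1 bytes of work
    let n = 1000;
    let haystack = vec![b'b'; n];
    let (matches, work) = naive_find_all(&haystack);
    assert_eq!(matches, n);
    assert_eq!(work, n * (n + 1) / 2); // the triangular sum
}
```

each of the n matches is one byte long, yet the loop touches half a million bytes for n = 1000.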

Russ Cox described this exact problem back in 2009, noting that even the original awk by Aho himself used the naive quadratic "loop around a DFA" for leftmost-longest matching. BurntSushi's rebar benchmark suite confirms it empirically across RE2, Go, and rust - the throughput halves when the input doubles. as he put it: "even for automata oriented engines, it provokes a case that is unavoidably O(m * n²)".

the standard escape hatch is switching to "earliest match" semantics (Hyperscan, rust regex-automata with earliest(true)), which reports a match the moment the DFA enters a match state instead of continuing to find the longest one. this avoids the quadratic scan, but it changes the results. for example, given the pattern a+ and the input aaaa:

leftmost-(greedy|longest): aaaa - one match

earliest: a a a a - four matches, each as short as possible

earliest match is a perfectly valid approach, but it's not what people want - they want the whole match, not the shortest prefix of it.
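the difference is easy to see with a hand-rolled matcher for just the pattern a+ - a sketch of the two semantics, not any engine's actual code:

```rust
// leftmost-longest: keep consuming while the DFA stays in a match state
fn find_all_longest(s: &[u8]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < s.len() {
        if s[i] == b'a' {
            let start = i;
            while i < s.len() && s[i] == b'a' { i += 1; }
            out.push((start, i));
        } else { i += 1; }
    }
    out
}

// earliest: report the moment the DFA first enters a match state
fn find_all_earliest(s: &[u8]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < s.len() {
        if s[i] == b'a' {
            out.push((i, i + 1));
            i += 1;
        } else { i += 1; }
    }
    out
}

fn main() {
    assert_eq!(find_all_longest(b"aaaa"), vec![(0, 4)]); // one match
    assert_eq!(find_all_earliest(b"aaaa"),
               vec![(0, 1), (1, 2), (2, 3), (3, 4)]);    // four matches
}
```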

now that we've established that it's impossible, let me show you that it isn't. this is a follow-up to the previous posts (the F# version, the rewrite).

true-linear all-matches

the reason i'm talking about this is that i've been working on RE#, and i wanted to include a hardened mode that guarantees linear time even on adversarial patterns and inputs. and to the best of my knowledge, this is really the first of its kind - a regex engine that can find all matches in two passes, regardless of the pattern or the input, without altering the semantics.

the algorithm doesn't find matches one at a time. instead it does two passes over the entire input - one right-to-left, one left-to-right - and marks all match boundaries in those two passes. the total work is linear in the input length regardless of how many matches there are. one match or ten thousand, it's the same two passes. no restarting from each position, no triangular sum. same example as before:

matching .*a|b against "bbbbbbbbbb" - finding all matches
traditional (find, advance, repeat):
resharp (two passes):
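to show the shape of the two-pass idea, here's a deliberately trivial version for the pattern a+ - this is an illustration of the structure only, not RE#'s actual algorithm (which handles arbitrary patterns, where the reverse pass does real work):

```rust
// toy illustration of the two-pass structure, specialized to `a+`:
// pass 1 (right-to-left) marks, for every position, whether a match
// could start there; pass 2 (left-to-right) reads those marks off as
// leftmost-longest spans. each pass touches every byte exactly once,
// so the total work is linear no matter how many matches there are.
fn two_pass_find_all(s: &[u8]) -> Vec<(usize, usize)> {
    // pass 1: right-to-left - can a match start at position i?
    // (trivial for `a+`; for a real pattern this runs a reversed DFA)
    let mut can_start = vec![false; s.len()];
    for i in (0..s.len()).rev() {
        can_start[i] = s[i] == b'a';
    }

    // pass 2: left-to-right - emit maximal non-overlapping spans
    let mut out = Vec::new();
    let mut i = 0;
    while i < s.len() {
        if can_start[i] {
            let start = i;
            while i < s.len() && can_start[i] { i += 1; }
            out.push((start, i));
        } else { i += 1; }
    }
    out
}

fn main() {
    assert_eq!(two_pass_find_all(b"xaaxaaax"), vec![(1, 3), (4, 7)]);
}
```

the key property: neither pass ever re-reads a byte, so there is no position to "restart" from and no triangular sum to accumulate.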

on patterns that produce many matches - log parsing, data extraction, search-and-replace across large files - the difference between O(n) and O(n²) is the difference between "instant" and "why is this taking so long". and the matches are still leftmost-longest (POSIX), the mathematically correct interpretation of regex union. a|ab and ab|a give the same results. boolean algebra works. you can refactor your patterns without changing the output.

hardened mode

even with a DFA-based engine, find_all can go quadratic on adversarial input. not from backtracking - RE# doesn't backtrack - but from overlapping match candidates in the bidirectional sweep. now RE# has a hardened mode that runs an O(n * S) forward scan instead, returning exactly the same leftmost-longest matches with no semantic tradeoff. on pathological input, the difference is dramatic:

input size | normal | hardened | speedup w/ hardened
1,000      | 0.7ms  | 28us     | 25x
5,000      | 18ms   | 146us    | 123x
10,000     | 73ms   | 303us    | 241x
50,000     | 1.8s   | 1.6ms    | 1,125x

normal mode goes quadratic; hardened stays linear. so why not make hardened the default? i went back and forth on this.

the quadratic blowup requires both halves: a pathological pattern, and an input that's long enough and structured in a way that triggers it. take a pattern like [A-Z][a-z]+ - every match starts at an uppercase letter and ends the moment the engine sees something that isn't lowercase. match boundaries are always unambiguous, so the engine never rescans the same input and the quadratic case is simply impossible. most real-world patterns share this property. stumbling into a pathological case by accident is unlikely enough that imposing a 3-20x constant-factor slowdown on every query felt wrong.
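you can verify the "never rescans" claim for [A-Z][a-z]+ by counting work in a naive scan - a sketch, not RE# internals:

```rust
// why `[A-Z][a-z]+` can't go quadratic even with the naive loop: a match
// ends the instant a non-lowercase byte appears, so the loop never has
// to re-inspect a byte it has already seen. counting the work shows it
// stays exactly linear.
fn naive_scan_work(s: &[u8]) -> (usize, usize) {
    let (mut matches, mut work, mut i) = (0, 0, 0);
    while i < s.len() {
        work += 1;
        if s[i].is_ascii_uppercase() {
            let start = i;
            i += 1;
            while i < s.len() && s[i].is_ascii_lowercase() {
                work += 1;
                i += 1;
            }
            if i > start + 1 { matches += 1; } // needs >= one lowercase
        } else {
            i += 1;
        }
    }
    (matches, work)
}

fn main() {
    let s = b"Foo bar Baz Quux";
    let (matches, work) = naive_scan_work(s);
    assert_eq!(matches, 3);       // Foo, Baz, Quux
    assert_eq!(work, s.len());    // every byte inspected exactly once
}
```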

but then again, "you probably won't hit it" is exactly the kind of reasoning that leads to production incidents. in the end i kept the fast path as the default, mostly because the slowdown is real and measurable on every single query, while the pathological case requires a genuinely hostile combination. there's also a practical reality: i'm trying to show that RE# is the fastest and most consistent regex engine for common workloads, and i'm trying to push correct match semantics (leftmost-longest) as something you don't have to compromise on. i can't afford losses here - there will always be someone ready to say "oh but it's 20% slower on this one benchmark" and dismiss the whole approach. i won't have it. hardened mode is there for when you're accepting patterns from the internet and can't trust what you're getting - and i'd rather have it be an explicit opt-in than a silent tax on everyone.

let re = Regex::with_options(
    pattern,
    EngineOptions::default().hardened(true)
)?;

the cost on normal text:

pattern        | normal | hardened | ratio
[A-Z][a-z]+    | 2.2ms  | 6.5ms    | 3.0x slower
[A-Za-z]{8,13} | 1.7ms  | 7.6ms    | 4.4x slower
\w{3,8}        | 2.6ms  | 22ms     | 8.7x slower
\d+            | 1.3ms  | 5.2ms    | 3.9x slower
[A-Z]{2,}      | 0.7ms  | 4.7ms    | 6.7x slower

patterns with lookarounds are currently rejected in hardened mode - not for any theoretical reason, just haven't gotten to it yet. been busy with CPU intrinsics for the common cases.

skip acceleration

speaking of CPU intrinsics - in the previous post i described how skip acceleration works and where RE# was losing to regex on literal-heavy patterns. since then i've been closing those gaps with hand-written AVX2 and NEON implementations - rare byte search, teddy multi-position matching, and range-based character class scanning.

these used to be significant losses - closing them was one of the more satisfying things to get working. i was also eager to see how RE# performs on rebar, BurntSushi's benchmark suite for regex engines:

benchmark                                     | resharp   | rust/regex
literal, english                              | 34.8 GB/s | 33.7 GB/s
literal, case-insensitive english             | 17.1 GB/s | 9.7 GB/s
literal, russian                              | 40.2 GB/s | 33.5 GB/s
literal, case-insensitive russian             | 10.3 GB/s | 7.7 GB/s
literal, chinese                              | 22.4 GB/s | 44.0 GB/s
literal alternation, english                  | 12.2 GB/s | 12.5 GB/s
literal alternation, case-insensitive english | 4.9 GB/s  | 2.5 GB/s
literal alternation, russian                  | 3.3 GB/s  | 5.9 GB/s

RE# does very well here now - most numbers are within the noise threshold of rust's regex. the few differences here and there come down to byte frequency tables and algorithmic choices in the skip loop. for context, a DFA by itself gets you somewhere near 1 GB/s - CPU vector intrinsics can opportunistically push that to 40+ on patterns where most of the input can be skipped.

streaming semantics

"does RE# support streaming?" is a question i've been asked tens of times now, and it's a good one. the short answer is:

  • any pattern + leftmost-longest semantics = no. this isn't an engine limitation - it's inherent to the semantics. if you ask for the longest match on an infinite stream, the answer might be "keep going forever." a+ on an infinite stream of a's never terminates. you might think leftmost-greedy avoids this since it works left-to-right, but it doesn't - .*a|b on a stream of b's has the same problem, the .*a branch keeps scanning forward looking for an a that may never come.
  • pattern with an unambiguous end boundary = yes. some patterns already have unambiguous boundaries and work fine as-is. for the ones that don't, in RE# you can intersect with a boundary - ^.*$ for lines, ~(_*\n\n_*) for paragraphs, or any delimiter you want - and now the pattern is compatible with streaming.
  • any pattern + earliest semantics = yes. report a match the moment the DFA enters a match state, no need to scan further. this is what Hyperscan does - it works on streams because it never needs to look ahead.

the core issue is that any pattern where the match boundary depends on what comes later requires buffering, and the worst case is unbounded.

so how does RE# handle this? is_match works on streams trivially - it's a single pass in either direction, one byte at a time, stop when the DFA enters a match state. no buffering needed. find_all is harder because the bidirectional algorithm needs a right-to-left pass. but there's a natural chunked approach: split the input at known non-matching boundaries (like newlines), run the bidirectional sweep on each chunk, and emit results as chunks complete. line-oriented streaming works well because you never need to buffer more than one line, and most real-world streaming use cases are line-oriented anyway.

the API doesn't expose a streaming interface yet - find_all takes &[u8] - but the architecture is ready for chunked streaming, and that's the natural next step.

what RE# doesn't do

worth being upfront about the limitations:

  • no capture groups - RE# returns match boundaries only, not sub-group captures. this isn't impossible - captures are a post-match operation that can be layered on top. we just haven't figured out how to make it work cleanly with intersection yet.
  • no lazy quantifiers - .*? isn't supported. RE# uses leftmost-longest (POSIX) semantics, which is the mathematically unambiguous interpretation. lazy quantifiers are a backtracking concept that doesn't translate to this model.

capture groups may come eventually, but lazy quantifiers are a deliberate architectural choice. if you need captures today, use regex. if you need the properties RE# offers (boolean operators, lookarounds, true-linear all-matches, POSIX semantics), these limitations are unlikely to matter.


a note on REmatch

before i get asked about it - i'm aware of REmatch, a regex engine published at VLDB 2023 (a top database conference) that finds all matches efficiently. REmatch is interesting work, but it operates under all-match semantics, meaning it returns every possible match, including all overlapping ones. for the .*a|b example on a haystack of n b's, there are O(n²) overlapping matches to report - so REmatch's output is itself quadratic. REmatch solves a different problem - enumerating all overlapping spans - and solves it well. but it's not the same problem as finding non-overlapping leftmost-longest matches, which is what grep, ctrl+f, and find_iter do.