X-Ray Jam. June 9-15, 2025.

R5

A rigorous rectilinear runaway regex rejecter

About R5


Regexes are a commonly used tool for ad-hoc parsing of simple strings. Unfortunately, specially crafted inputs can make their run time pathological. One mitigation is to time out on inputs that take too long to recognize (i.e. accept or reject the input). However, this is very unsatisfactory if the regex can instead be rewritten so that no input string (of some constant bounded length) can cause such behavior.

This project aims to be a verifier that rejects regexes that may induce quadratic, higher-polynomial, or exponential run time, for arbitrary inputs up to some constant size, in reasonably implemented backtracking regex engines. The verifier must also accept regexes that use features I consider indispensable, such as submatch extraction, non-capturing groups, and numbered repetition bounds. For correctness, false positives are allowed (i.e. a safe regex may be rejected), but false negatives are not (i.e. an unsafe regex must never be accepted).

The minimum viable product will be able to validate regexes and precisely indicate which section(s) are problematic. When possible, a sample partial input should be provided that illustrates how the abstract regex state machine could take two different paths at a branching point.

The asymptotic complexities for regex validation should be approximately O(n) for time and O(n) for extra space, where n is the length of the regex. The asymptotic complexities for input recognition on accepted regexes should be approximately O(n + m) for time and O(n) for extra space, where m is the length of the input string. These targets may not be achievable, but they're what I have in mind.

Stretch goals:

  • Add support for non-standard regex nesting/substitution, e.g. foo = "fo+" and bar = "^(?\foo)+bar*(?\foo)$" -> "^(?:fo+)+bar*fo+$" (a sketch follows this list)
  • Add support for non-standard regex unnesting/factoring, e.g. "^[0-9]{5}(?:-[0-9]{5}){4}$" -> a = "[0-9]{5}" and b = "^(?\a)(?:-(?\a)){4}$"
  • Add a regex engine implementation
  • Perform some limited canonicalization on a supplied regex
  • Add a web-based GUI with precision highlighting
  • Calculate tight bounds on time and memory costs for a specific implementation of the verifier and of the recognizer
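
As a rough sketch of that substitution syntax, here is how the expansion could work in Python. The expand helper is hypothetical, and unlike the example in the list it wraps every reference in a non-capturing group, even a trailing one where the group is redundant:

import re

def expand(pattern: str, defs: dict[str, str]) -> str:
    # Hypothetical helper: expand the non-standard (?\name) references
    # into inline non-capturing groups.
    return re.sub(r"\(\?\\(\w+)\)",
                  lambda m: "(?:" + defs[m.group(1)] + ")",
                  pattern)

print(expand(r"^(?\foo)+bar*(?\foo)$", {"foo": "fo+"}))
# ^(?:fo+)+bar*(?:fo+)$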

Recent Activity

jam day 7/7

I finished mere minutes before the deadline.

2025-06-16 04:58:54

jam day 6/7

I got distracted by a message and ended up working on another project for the whole day. Oops.

2025-06-15 15:56:44

jam day 5/7

I took a rest day to ponder the alternation issue a bit more. There are two full days left before the jam ends, so there should be enough time to reach a minimum viable product.

2025-06-14 15:02:59

jam day 4/7

The lexer now supports numbered repetition bounds, escape sequences for tab and newline, and lazy repetition. Its code has also been cleaned up to reduce copy-pasting.

I'm still unsure of how to handle Unicode. I'm currently favoring the byte-level parsing route, which would mean that "." is replaced with "[\x00-\x09\x0b-\xff]", to prevent regex engines that operate on Unicode codepoints from using a model different from the verifier's. Such a rewrite would be necessary even if I chose the codepoint-level parsing route, as apparently some regex engines exclude additional characters from ".", e.g. carriage return.
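
To illustrate the rewrite (a deliberately naive sketch of my own; a real pass would have to skip escaped dots and dots inside character sets):

DOT_AS_BYTES = rb"[\x00-\x09\x0b-\xff]"  # any byte except newline

def normalize_dot(pattern: bytes) -> bytes:
    # Naive illustration only: blindly replaces every "." byte.
    return pattern.replace(b".", DOT_AS_BYTES)

print(normalize_dot(rb"^foo...$"))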

Alternation is a bit trickier to handle than I initially thought. The problem comes from wanting to support regexes that are natural to write, e.g. "^(?:foo|foobar|bar)", while having consistent behavior across different regex engines. For an input of "foobarbaz", some engines will apparently return "foo" as the match, while others will return "foobar". A related issue is that the ordering of the cases may affect which branch is matched.
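
For instance, Python's re module, a backtracking engine, commits to the first alternative that matches:

import re

# Backtracking engines try alternatives left to right, so "foo" wins
# even though "foobar" is a longer match at the same position.
print(re.match(r"^(?:foo|foobar|bar)", "foobarbaz").group(0))  # foo
# A POSIX leftmost-longest engine (e.g. grep's ERE) would instead
# report "foobar" for the same pattern and input.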

I think the path forward for alternation, when one case may be a prefix of another, is to decompose it into an equivalent form and check that the cases are disjoint. This may reject more regexes than I'd like, but I'm currently out of ideas on how to resolve this.
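
One possible reading of that decomposition, worked by hand on the example above (my illustration, not settled design): factor out the shared prefix so the remaining cases begin with distinct characters.

import re

original   = r"^(?:foo|foobar|bar)"
decomposed = r"^(?:foo(?:bar)?|bar)"  # "foo..." and "bar" are now
                                      # disjoint at the first byte

for s in ("foobarbaz", "foobaz", "barbaz"):
    print(s,
          re.match(original, s).group(0),    # foo    / foo / bar
          re.match(decomposed, s).group(0))  # foobar / foo / bar

Note that the decomposed form pins down a single greedy behavior where the original's result varied by engine.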

$ python3 handmade_jam_r5.py
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>, False)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>, False), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>, False), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>, False)]
b'a{2}b{3,}c{5,7}'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_count: 12>, 2, 2, False), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_count: 12>, 3, None, False), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_count: 12>, 5, 7, False)]
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]
b'a??b*?c+?\\td{1}?e{4,}?f{9,16}?\\n'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_question: 9>, True), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_star: 10>, True), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_plus: 11>, True), (<Token_t.raw_string: 0>, b'\td'), (<Token_t.repeat_count: 12>, 1, 1, True), (<Token_t.raw_string: 0>, b'e'), (<Token_t.repeat_count: 12>, 4, None, True), (<Token_t.raw_string: 0>, b'f'), (<Token_t.repeat_count: 12>, 9, 16, True), (<Token_t.raw_string: 0>, b'\n')]

2025-06-13 14:09:06

jam day 3/7

I wasn't able to finish parsing, but I'm almost done with lexing. Currently, the lexer can't handle numbered repetition bounds, non-hex escapes, negative character sets, or lazy repetition. The code is also in dire need of a cleanup.

I'm not sure how I want to handle Unicode yet. My two ideas are to parse at either the byte level or the codepoint level. The first option provides full functionality at the cost of some verbosity. The second option simplifies usage, but complicates the implementation where it collides with regexes meant to be used on byte strings. I could transparently translate codepoint-specified regexes into their byte-specified equivalents, but that might be very confusing when mixed with byte-level parsing and the verifier rejects the regex.
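
For a sense of what the transparent translation would involve (my example, assuming UTF-8 as the encoding):

# A non-ASCII codepoint literal becomes its UTF-8 byte sequence, so
# the verifier can keep reasoning purely at the byte level.
print("é".encode("utf-8"))   # b'\xc3\xa9' (two bytes)
print("日".encode("utf-8"))  # b'\xe6\x97\xa5' (three bytes)
# Codepoint ranges are the hard part: [à-é] expands to an alternation
# over multi-byte sequences, not a single byte range.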

Another decision I need to make is which escapes I'll support. Escaping of metacharacters and bytes (via their hex codes) is already done. I'll likely add support for tab and newline, and likely won't add the digit/space/word class escapes, but I don't know about the rest yet.
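
A minimal sketch of that escape set, with a hypothetical decode_escape helper (not the project's actual code):

SIMPLE_ESCAPES = {b"t": b"\t", b"n": b"\n"}
META = set(rb"\.[]()|*+?^${}")  # metacharacters that may be escaped

def decode_escape(body: bytes) -> bytes:
    # body is what follows the backslash, e.g. b"x41", b"n", b"+"
    if body in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body]
    if body[:1] == b"x" and len(body) == 3:
        return bytes([int(body[1:], 16)])  # \xHH -> one raw byte
    if len(body) == 1 and body[0] in META:
        return body  # an escaped metacharacter stands for itself
    raise ValueError(f"unsupported escape: {body!r}")

print(decode_escape(b"x41"), decode_escape(b"n"), decode_escape(b"+"))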

$ python3 handmade_jam_r5.py 
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>,)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>,), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>,), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>,)]
b'a{2}b{3,}c{5,7}'
Exception
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]

2025-06-12 13:38:07

jam day 2/7

I didn't work on the project at all, which is setting me up for a time crunch in the next few days. To make up for today, I want to finish parsing and start on the control flow graph analysis by the end of day 3.

2025-06-11 10:56:41

jam day 1/7

I didn't get much done except for the basic skeleton for parsing.

$ python3 handmade_jam_r5.py 
bytearray(b'f')
bytearray(b'o')
bytearray(b'o')

2025-06-10 11:28:42