jam day 4/7

The lexer now supports numbered repetition bounds, escape sequences for tab and newline, and lazy repetition. Its code has also been cleaned up to reduce copy pasting.

I'm still unsure of how to handle unicode. I'm currently favoring the byte level parsing route, which would mean that "." will be replaced with "[\x00-\x09\x0b-\xff]", to prevent regex engines that operate on unicode codepoints from using a model different from the verifier's. This would be necessary even if I choose the codepoint level parsing route as apparently some regex engines have additional characters they exclude, e.g. carriage return.

Alternation is a bit trickier to handle than I initially thought. The problem comes from wanting to support regexes that are natural to write, e.g. "^(?:foo|foobar|bar)", while having consistent behavior across different regex engines. For an input of "foobarbaz", apparently some engines will return "foo" as the match, while others will return "foobar". A related issue is that the ordering of the cases may affect which branch is matched.

I think the path forward for alternation, when one case may be the prefix of another, is to decompose it to an equivalent form and check that the cases are disjoint. This may reject more regexes than I'd like, but I'm currently out of ideas on how to resolve this.

python3 handmade_jam_r5.py 
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>, False)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>, False), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>, False), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>, False)]
b'a{2}b{3,}c{5,7}'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_count: 12>, 2, 2, False), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_count: 12>, 3, None, False), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_count: 12>, 5, 7, False)]
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]
b'a??b*?c+?\\td{1}?e{4,}?f{9,16}?\\n'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_question: 9>, True), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_star: 10>, True), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_plus: 11>, True), (<Token_t.raw_string: 0>, b'\td'), (<Token_t.repeat_count: 12>, 1, 1, True), (<Token_t.raw_string: 0>, b'e'), (<Token_t.repeat_count: 12>, 4, None, True), (<Token_t.raw_string: 0>, b'f'), (<Token_t.repeat_count: 12>, 9, 16, True), (<Token_t.raw_string: 0>, b'\n')]

2025-06-13 14:09:06