r_edge

Recent Activity

jam day 7/7

I finished mere minutes before the deadline.

2025-06-16 04:58:54

jam day 6/7

I was distracted by a message that caused me to work on another project for the whole day. Oops.

2025-06-15 15:56:44

jam day 5/7

I took a rest day to ponder the alternation issue a bit more. There are two full days left before the jam ends, so there should be enough time to reach a minimum viable product.

2025-06-14 15:02:59

jam day 4/7

The lexer now supports numbered repetition bounds, escape sequences for tab and newline, and lazy repetition. Its code has also been cleaned up to reduce copy-pasting.

I'm still unsure of how to handle Unicode. I'm currently favoring the byte-level parsing route, which would mean that "." is replaced with "[\x00-\x09\x0b-\xff]" to prevent regex engines that operate on Unicode codepoints from using a model different from the verifier's. Some rewrite of this kind would be necessary even on the codepoint-level route, as apparently some regex engines exclude additional characters from ".", e.g. carriage return.
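As a rough illustration of the byte-level rewrite, here's a hypothetical expand_dot helper (not part of the lexer): once "." is spelled out as an explicit byte class, Python's re in bytes mode matches any byte except newline, which is the model the verifier wants every engine to share.

```python
import re

# Rewrite "." as an explicit byte class so every engine sees the
# same byte-level model: any byte except \x0a (newline).
DOT_AS_BYTES = rb'[\x00-\x09\x0b-\xff]'

def expand_dot(pattern: bytes) -> bytes:
    # Naive sketch: only safe when "." never appears inside a character
    # set or as an escaped literal; a real lexer tracks that context.
    return pattern.replace(b'.', DOT_AS_BYTES)

pattern = expand_dot(b'a.c')
assert re.fullmatch(pattern, b'a\xffc')          # any byte matches, even \xff
assert re.fullmatch(pattern, b'a\nc') is None    # newline stays excluded
```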

Alternation is a bit trickier to handle than I initially thought. The problem comes from wanting to support regexes that are natural to write, e.g. "^(?:foo|foobar|bar)", while having consistent behavior across different regex engines. For an input of "foobarbaz", apparently some engines will return "foo" as the match, while others will return "foobar". A related issue is that the ordering of the cases may affect which branch is matched.
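For example, Python's re (a leftmost-first, PCRE-style engine) commits to the first alternative that matches, so reordering the cases changes the result:

```python
import re

# Leftmost-first: "foo" wins even though "foobar" is a longer match.
m = re.match(rb'(?:foo|foobar|bar)', b'foobarbaz')
assert m.group(0) == b'foo'

# Reordering the alternatives changes the outcome. A POSIX-style
# leftmost-longest engine would return "foobar" in both cases,
# which is exactly the cross-engine consistency problem.
m = re.match(rb'(?:foobar|foo|bar)', b'foobarbaz')
assert m.group(0) == b'foobar'
```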

I think the path forward for alternation, when one case may be a prefix of another, is to decompose it into an equivalent form and check that the cases are disjoint. This may reject more regexes than I'd like, but I'm currently out of ideas on how to resolve this.
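Restricted to literal alternatives, the disjointness check could be sketched like this (a hypothetical helper, not the verifier's actual code): reject the alternation if any case is a prefix of another, since engines can then disagree on which branch wins.

```python
from itertools import combinations

# Sketch of the disjointness idea for literal-only alternatives:
# an alternation is safe if no case is a prefix of another case.
def literal_cases_disjoint(cases: list[bytes]) -> bool:
    return not any(a.startswith(b) or b.startswith(a)
                   for a, b in combinations(cases, 2))

assert literal_cases_disjoint([b'foo', b'bar', b'baz'])
assert not literal_cases_disjoint([b'foo', b'foobar', b'bar'])
```

A real decomposition would also have to handle cases containing character sets and repetition, which is where it gets hard.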

$ python3 handmade_jam_r5.py 
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>, False)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>, False), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>, False), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>, False)]
b'a{2}b{3,}c{5,7}'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_count: 12>, 2, 2, False), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_count: 12>, 3, None, False), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_count: 12>, 5, 7, False)]
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]
b'a??b*?c+?\\td{1}?e{4,}?f{9,16}?\\n'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_question: 9>, True), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_star: 10>, True), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_plus: 11>, True), (<Token_t.raw_string: 0>, b'\td'), (<Token_t.repeat_count: 12>, 1, 1, True), (<Token_t.raw_string: 0>, b'e'), (<Token_t.repeat_count: 12>, 4, None, True), (<Token_t.raw_string: 0>, b'f'), (<Token_t.repeat_count: 12>, 9, 16, True), (<Token_t.raw_string: 0>, b'\n')]

2025-06-13 14:09:06

jam day 3/7

I wasn't able to finish parsing, but I'm almost done with lexing. Currently, the lexer can't handle numbered repetition bounds, non-hex escapes, negative character sets, or lazy repetition. The code is also in dire need of a cleanup.

I'm not sure how I want to handle Unicode yet. My two ideas are to either parse on the byte level or parse on the codepoint level. The first option provides full functionality at the cost of some verbosity. The second option simplifies usage, but complicates the implementation when codepoint-level regexes collide with regexes meant to be used on byte strings. I could transparently translate codepoint-specified regexes into their byte-specified equivalents, but that might be very confusing when it's mixed with byte-level parsing and the verifier rejects the regex.
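For single codepoint literals, at least, the translation is mechanical; a minimal sketch of the idea (codepoint_to_bytes_literal is a hypothetical name, not the lexer's API):

```python
import re

# Lower a codepoint literal to its UTF-8 byte sequence, escaping each
# byte as \xNN so the result is safe in any position of a bytes regex.
def codepoint_to_bytes_literal(ch: str) -> bytes:
    return b''.join(b'\\x%02x' % b for b in ch.encode('utf-8'))

assert codepoint_to_bytes_literal('é') == rb'\xc3\xa9'
assert re.fullmatch(codepoint_to_bytes_literal('é'), 'é'.encode('utf-8'))
```

Ranges and negated sets are much messier, since a codepoint range does not map to a single byte range in UTF-8.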

Another decision I need to make is which escapes I'll support. The escaping of metacharacters and bytes (via hex codes) is already done. I'll likely add support for tab and newline, likely won't add the digit/space/word escapes, and don't know about the rest yet.
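A minimal sketch of that escape set, assuming a hypothetical decode_escape helper (the metacharacter list here is a guess at what the lexer treats as special):

```python
# Metacharacter escapes plus \t and \n; \d/\s/\w deliberately unsupported.
SIMPLE_ESCAPES = {
    b'\\': b'\\', b'.': b'.', b'+': b'+', b'*': b'*', b'?': b'?',
    b'(': b'(', b')': b')', b'[': b'[', b']': b']', b'|': b'|',
    b'^': b'^', b'$': b'$', b'{': b'{', b'}': b'}',
    b't': b'\t', b'n': b'\n',
}

def decode_escape(seq: bytes) -> bytes:
    # seq is the text after the backslash, e.g. b'n' or b'x41'.
    if seq[:1] == b'x' and len(seq) == 3:
        return bytes([int(seq[1:], 16)])
    if seq in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[seq]
    raise ValueError('unsupported escape: %r' % seq)

assert decode_escape(b'n') == b'\n'
assert decode_escape(b'x41') == b'A'
```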

$ python3 handmade_jam_r5.py 
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>,)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>,), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>,), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>,)]
b'a{2}b{3,}c{5,7}'
Exception
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]

2025-06-12 13:38:07

jam day 2/7

I didn't work on the project at all, which is setting me up for a time crunch in the next few days. To make up for today, I want to finish parsing and start on the control flow graph analysis by the end of day 3.

2025-06-11 10:56:41

jam day 1/7

I didn't get much done except for the basic skeleton for parsing.

$ python3 handmade_jam_r5.py 
bytearray(b'f')
bytearray(b'o')
bytearray(b'o')

2025-06-10 11:28:42