jam day 3/7
I wasn't able to finish parsing, but I'm almost done with lexing. Currently, the lexer can't handle numbered repetition bounds, non-hex escapes, negated character sets, or lazy repetition. The code is also in dire need of a cleanup.
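For the numbered repetition bounds, something like this hand-rolled helper would probably do (a sketch, not the jam code; the name and return shape are made up here):

```python
def lex_repeat_bounds(pattern: bytes, i: int):
    """Lex a numbered repetition bound like b'{2}', b'{3,}' or b'{5,7}'.
    i must point at the b'{'; returns ((lo, hi), index_past_the_brace),
    with hi = None for an open upper bound."""
    assert pattern[i:i+1] == b'{'
    j = pattern.index(b'}', i)          # raises ValueError if unterminated
    body = pattern[i + 1:j]
    if b',' in body:
        lo_s, hi_s = body.split(b',', 1)
        lo = int(lo_s)
        hi = int(hi_s) if hi_s else None    # b'{3,}' -> open upper bound
    else:
        lo = hi = int(body)                 # b'{2}'  -> exactly lo
    if hi is not None and hi < lo:
        raise ValueError('upper bound below lower bound')
    return (lo, hi), j + 1
```

A malformed body like b'{,}' falls through to `int(b'')` and raises, which is probably fine for a lexer that already throws on bad input.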
I'm not sure how I want to handle Unicode yet. My two ideas are to parse either at the byte level or at the codepoint level. The first option provides full functionality at the cost of some verbosity. The second option simplifies usage, but complicates the implementation when it collides with regexes meant to be used on byte strings. I could transparently translate codepoint-specified regexes into their byte-specified equivalents, but that could get very confusing when such a regex is mixed with byte-level parsing and the verifier rejects it.
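For single literals the codepoint-to-byte translation is trivial, since a codepoint just becomes the raw bytes of its UTF-8 encoding (a sketch with a hypothetical name, not the jam code):

```python
def translate_literal(ch: str) -> bytes:
    # A codepoint-level literal maps onto its UTF-8 byte sequence, so it
    # could reuse the existing raw_string machinery unchanged.
    return ch.encode('utf-8')

# ASCII literals are identical on both levels, so byte- and
# codepoint-specified regexes only diverge on non-ASCII input:
assert translate_literal('a') == b'a'
assert translate_literal('é') == b'\xc3\xa9'  # one codepoint, two bytes
```

Character ranges are where it gets ugly: a codepoint range that crosses a UTF-8 length boundary can't be expressed as a single byte range and has to be compiled into an alternation of per-length byte-range sequences.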
Another decision I need to make is which escapes I'll support. Escaping of meta characters and of arbitrary bytes (via their hex codes) is already done. I'll likely add support for tab and newline, and likely won't add the digit/space/word escapes, but I don't know about the rest yet.
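The escape set described above is small enough to resolve in one table-driven helper; this is just one way it might look (hypothetical names, not the jam code):

```python
# Escapes that stand for a control byte rather than themselves.
_NAMED_ESCAPES = {
    b't': b'\t',   # tab
    b'n': b'\n',   # newline
}
# Meta characters that lose their special meaning when escaped.
_META = set(b'.?*+|(){}[]^$\\-')

def lex_escape(pattern: bytes, i: int):
    """Resolve the escape whose body starts at pattern[i] (the byte
    right after the backslash). Returns (literal_bytes, next_index)."""
    c = pattern[i:i + 1]
    if c == b'x':                         # \xHH -> the raw byte 0xHH
        return bytes([int(pattern[i + 1:i + 3], 16)]), i + 3
    if c in _NAMED_ESCAPES:               # \t, \n
        return _NAMED_ESCAPES[c], i + 1
    if c and c[0] in _META:               # \+ etc. -> the literal byte
        return c, i + 1
    raise ValueError(f'unknown escape at offset {i}')
```

Rejecting unknown escapes outright (instead of passing them through as literals) keeps the door open for adding the digit/space/word classes later without silently changing behavior.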
$ python3 handmade_jam_r5.py
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>,)]
b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>,), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>,), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>,)]
b'a{2}b{3,}c{5,7}'
Exception
b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]
2025-06-12 13:38:07