R5 Blog

Unfortunately, R5 is currently not functional, so I wasn't able to test out the verifier aspect of the project, which also happens to be the entire point of this project. I've been thinking about making this project for a while, but I never got around to putting in the work since I usually do a manual check over the regexes I use. So the silver lining is that despite falling short of my goals for this jam, I'm more than halfway there, which is better than having nothing at all.

My current work-in-progress output is at the end of this post, since I'll be focusing on the things I've learned.

The most surprising fact I discovered is that different regex engines behave differently despite superficially sharing the same syntax, even on the most basic features. For example, the dot ".", which represents all characters (more specifically, unicode codepoints) except newlines, may include carriage return and unicode variants, depending on the engine used. This is on me not checking precisely what the dot means. My takeaway is that I should replace all dots in my regexes, either automatically via a function or manually, so that using different regex engines will result in the same behavior.

Alternations proved to be a hurdle in preventing backtracking due to the potential for different branches to overlap. My current solution is to have two levels of checks. The first is to identify which branches will never overlap due to having different starting characters. If all branches are disjoint, there is no potential for backtracking. The second is to force the starting character after the alternation to not be in any of the branches that the verifier cannot definitively prove will not overlap. This means that a branch will never "undermatch" due to having two or more possible choices.

Lastly, I've learned that a week sounds like a lot more than it is for a jam. In the future, I'll need to have a stronger focus on progress throughout the week.

$ python3 handmade_jam_r5.py 
b'foo'
[(<Token_t.raw_string: 0>, b'foo')]
group info:
0 None
alternation segments:
(0, 0) [(<Block_t.string: 0>, b'foo', 1)]

b'fo+'
[(<Token_t.raw_string: 0>, b'fo'), (<Token_t.repeat_plus: 11>, False)]
group info:
0 None
alternation segments:
(0, 0) [(<Block_t.string: 0>, b'f', 1), (<Block_t.string: 0>, b'o', 1), (<Block_t.string: 0>, b'o', (None, False))]

b'(?:abc|def)+([abc]*)?'
[(<Token_t.group_non_capture_start: 4>, 4), (<Token_t.raw_string: 0>, b'abc'), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'def'), (<Token_t.group_end: 6>, 0, None), (<Token_t.repeat_plus: 11>, False), (<Token_t.group_capture_start: 5>, 9), (<Token_t.char_set: 1>, [b'a', b'b', b'c']), (<Token_t.repeat_star: 10>, False), (<Token_t.group_end: 6>, 6, 0), (<Token_t.repeat_question: 9>, False)]
group info:
0 None
1 (<Token_t.group_non_capture_start: 4>, None)
2 (<Token_t.group_capture_start: 5>, 0)
alternation segments:
(0, 0) [(<Block_t.group: 2>, 1, 1), (<Block_t.group: 2>, 1, (None, False)), (<Block_t.group: 2>, 2, (1, False))]
(1, 0) [(<Block_t.string: 0>, b'abc', 1)]
(1, 1) [(<Block_t.string: 0>, b'def', 1)]
(2, 0) [(<Block_t.char_set: 1>, [(97, 3)], (None, False))]

b'a{2}b{3,}c{5,7}'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_count: 12>, 2, 2, False), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_count: 12>, 3, None, False), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_count: 12>, 5, 7, False)]
group info:
0 None
alternation segments:
(0, 0) [(<Block_t.string: 0>, b'a', 2), (<Block_t.string: 0>, b'b', 3), (<Block_t.string: 0>, b'b', (None, False)), (<Block_t.string: 0>, b'c', 5), (<Block_t.string: 0>, b'c', (2, False))]

b'^$|^foo...|bar$'
[(<Token_t.string_start: 7>,), (<Token_t.string_end: 8>,), (<Token_t.alternation: 3>,), (<Token_t.string_start: 7>,), (<Token_t.raw_string: 0>, b'foo'), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.any_char: 2>,), (<Token_t.alternation: 3>,), (<Token_t.raw_string: 0>, b'bar'), (<Token_t.string_end: 8>,)]
group info:
0 None
alternation segments:
(0, 0) [(<Token_t.string_start: 7>, 1), (<Token_t.string_end: 8>, 1)]
(0, 1) [(<Token_t.string_start: 7>, 1), (<Block_t.string: 0>, b'foo', 1), (<Token_t.any_char: 2>, 1), (<Token_t.any_char: 2>, 1), (<Token_t.any_char: 2>, 1)]
(0, 2) [(<Block_t.string: 0>, b'bar', 1), (<Token_t.string_end: 8>, 1)]

b'[\\x11-\\x14][-.][.-<][a-z0-9.-]'
[(<Token_t.char_set: 1>, [(b'\x11', b'\x14')]), (<Token_t.char_set: 1>, [b'-', b'.']), (<Token_t.char_set: 1>, [(b'.', b'<')]), (<Token_t.char_set: 1>, [(b'a', b'z'), (b'0', b'9'), b'.', b'-'])]
group info:
0 None
alternation segments:
(0, 0) [(<Block_t.char_set: 1>, [(17, 4)], 1), (<Block_t.char_set: 1>, [(45, 2)], 1), (<Block_t.char_set: 1>, [(46, 15)], 1), (<Block_t.char_set: 1>, [(45, 2), (48, 10), (97, 26)], 1)]

b'a??b*?c+?\\td{1}?e{4,}?f{9,16}?\\n'
[(<Token_t.raw_string: 0>, b'a'), (<Token_t.repeat_question: 9>, True), (<Token_t.raw_string: 0>, b'b'), (<Token_t.repeat_star: 10>, True), (<Token_t.raw_string: 0>, b'c'), (<Token_t.repeat_plus: 11>, True), (<Token_t.raw_string: 0>, b'\td'), (<Token_t.repeat_count: 12>, 1, 1, True), (<Token_t.raw_string: 0>, b'e'), (<Token_t.repeat_count: 12>, 4, None, True), (<Token_t.raw_string: 0>, b'f'), (<Token_t.repeat_count: 12>, 9, 16, True), (<Token_t.raw_string: 0>, b'\n')]
group info:
0 None
alternation segments:
(0, 0) [(<Block_t.string: 0>, b'a', (1, True)), (<Block_t.string: 0>, b'b', (None, True)), (<Block_t.string: 0>, b'c', 1), (<Block_t.string: 0>, b'c', (None, True)), (<Block_t.string: 0>, b'\t', 1), (<Block_t.string: 0>, b'd', 1), (<Block_t.string: 0>, b'e', 4), (<Block_t.string: 0>, b'e', (None, True)), (<Block_t.string: 0>, b'f', 9), (<Block_t.string: 0>, b'f', (7, True)), (<Block_t.string: 0>, b'\n', 1)]

b'(a(b(c(d(e(f(g+)0)1)2)3)4)5)'
[(<Token_t.group_capture_start: 5>, 27), (<Token_t.raw_string: 0>, b'a'), (<Token_t.group_capture_start: 5>, 25), (<Token_t.raw_string: 0>, b'b'), (<Token_t.group_capture_start: 5>, 23), (<Token_t.raw_string: 0>, b'c'), (<Token_t.group_capture_start: 5>, 21), (<Token_t.raw_string: 0>, b'd'), (<Token_t.group_capture_start: 5>, 19), (<Token_t.raw_string: 0>, b'e'), (<Token_t.group_capture_start: 5>, 17), (<Token_t.raw_string: 0>, b'f'), (<Token_t.group_capture_start: 5>, 15), (<Token_t.raw_string: 0>, b'g'), (<Token_t.repeat_plus: 11>, False), (<Token_t.group_end: 6>, 12, 6), (<Token_t.raw_string: 0>, b'0'), (<Token_t.group_end: 6>, 10, 5), (<Token_t.raw_string: 0>, b'1'), (<Token_t.group_end: 6>, 8, 4), (<Token_t.raw_string: 0>, b'2'), (<Token_t.group_end: 6>, 6, 3), (<Token_t.raw_string: 0>, b'3'), (<Token_t.group_end: 6>, 4, 2), (<Token_t.raw_string: 0>, b'4'), (<Token_t.group_end: 6>, 2, 1), (<Token_t.raw_string: 0>, b'5'), (<Token_t.group_end: 6>, 0, 0)]
group info:
0 None
1 (<Token_t.group_capture_start: 5>, 0)
2 (<Token_t.group_capture_start: 5>, 1)
3 (<Token_t.group_capture_start: 5>, 2)
4 (<Token_t.group_capture_start: 5>, 3)
5 (<Token_t.group_capture_start: 5>, 4)
6 (<Token_t.group_capture_start: 5>, 5)
7 (<Token_t.group_capture_start: 5>, 6)
alternation segments:
(0, 0) [(<Block_t.group: 2>, 1, 1)]
(1, 0) [(<Block_t.string: 0>, b'a', 1), (<Block_t.group: 2>, 2, 1), (<Block_t.string: 0>, b'5', 1)]
(2, 0) [(<Block_t.string: 0>, b'b', 1), (<Block_t.group: 2>, 3, 1), (<Block_t.string: 0>, b'4', 1)]
(3, 0) [(<Block_t.string: 0>, b'c', 1), (<Block_t.group: 2>, 4, 1), (<Block_t.string: 0>, b'3', 1)]
(4, 0) [(<Block_t.string: 0>, b'd', 1), (<Block_t.group: 2>, 5, 1), (<Block_t.string: 0>, b'2', 1)]
(5, 0) [(<Block_t.string: 0>, b'e', 1), (<Block_t.group: 2>, 6, 1), (<Block_t.string: 0>, b'1', 1)]
(6, 0) [(<Block_t.string: 0>, b'f', 1), (<Block_t.group: 2>, 7, 1), (<Block_t.string: 0>, b'0', 1)]
(7, 0) [(<Block_t.string: 0>, b'g', 1), (<Block_t.string: 0>, b'g', (None, False))]

2025-06-16 04:57:13