Regexes are an often used tool for ad-hoc parsing of simple strings. Unfortunately, they can end up being pathological in run time with specially-crafted inputs. One mitigation is to time out for inputs that take too long to recognize (i.e. accept or reject the input). However, this is very unsatisfactory if the regex can be rewritten such that there are no input strings (of some constant bounded length) that can cause such behavior.

This project aims to be a verifier that rejects regexes that may induce quadratic, higher polynomial, or exponential run time for arbitrary inputs, up to some constant size, in reasonably implemented backtracking regex engines. The verifier must also be able to accept regexes that use features I consider indispensible, such as submatch extraction, non-capturing groups, and numbered repetition bounds. For correctness, false positives are allowed (i.e. regex that is safe is rejected), but false negatives are not (i.e. regex that is unsafe is accepted).

The minimum viable product will be able to validate regexes and precisely indicate which section(s) is problematic. When possible, a sample partial input should be provided that illustrates how the abstract regex state machine may choose two different paths at a branching point.

The asymptotic complexities for regex validation should be approximately O(n) for time and O(n) for extra space, where n is the length of the regex. The asymptotic complexities for input recognition on accepted regexes should be approximately O(n + m) for time and O(n) for extra space, where m is the length of the input string. These may not be possible, but these are the targets I have in mind.

Stretch goals:

  • Add support for non-standard regex nesting/substitution, e.g. foo = "fo+" and bar = "^(?\foo)+bar*(?\foo)$" -> "^(?:fo+)+bar*fo+$"
  • Add support for non-standard regex unnesting/factoring, e.g. "^[0-9]{5}(?:-[0-9]{5}){4}$" -> a = "[0-9]{5}" and b = "^(?\a)(?:-(?\a)){4}$"
  • Add a regex engine implementation
  • Perform some limited canonicalization on a supplied regex
  • Add a web-based gui with precision highlighting
  • Calculate tight bounds on time and memory costs for a specific implementation of the verifier and of the recognizer