The Regular Expression Visualizer, Simulator and Cross-Compiler Tool

  • Hi, I'm the author of this page. This tool has a lot of complexity in it and a huge number of corner-cases. If you discover any bugs or XSS vulnerabilities, feel free to let me know.

  • Along these lines, https://regex101.com is consistently a godsend for writing regexes.

  • I would have loved this when I worked on a regex engine a few years back. That was before I took my CS theory course, and I had no idea what an automaton was. Let me tell you, having a formal theory and visualization in front of you while working on something is extremely helpful.

    This looks like it would be a great tool both for people looking to learn and for those looking to optimize their regular expressions.

  • When I read the title, I half hoped ‘cross-compiler’ referred to translating between different regex varieties. Anyone know of a tool that does this?

  • Has anyone written an AST-aware regex engine?

    Examples: s/(?ast:comment)foo/bar to replace foo with bar but only in comments. Or s/(\w+): (?ast:openparen)(.*?)(?ast:closeparen)\s+(?ast:openbracket)(.*?)(?ast:closebracket)/function $1($2)\n{\n$3\n}\n/ to convert from

        foo: (args) {
          codeblock
        }
    
    to

        function foo(args) {
          codeblock
        }
    
    etc..?

  • Nice graphical explanations! It would be nice to see positive/negative look ahead/behind support as they're pretty powerful.

    The `\K` escape (e.g. Perl flavor), which resets the start of the reported match so that everything matched before it is dropped from the result, is also a very useful regexp feature absent here (e.g. when testing `grep -P` commands).

  • Any suggestions for keeping regexes simple? They often grow to huge lengths and become hard to read and/or debug. It would be nice to keep them modular. Maybe using websites like this is the only way.

  • Note that when you use grep, awk, sed, libc regexec(), RE2 [1], or rust/regex, this isn't how your regexes are executed.

    They use automata-based engines whereas this is a backtracking engine, which can blow up on small inputs. (Although I will say the C code here is cool, because it gives you a fork() bomb, so it might exhaust your operating system and not just your CPU!)
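
    To make the blow-up concrete, here is a minimal sketch of a toy backtracking matcher run on the pathological case from the article linked below: the pattern a?a?...a?aa...a (n optional a's followed by n literal a's) against a string of n a's. The matcher, the names `bt` and `backtrack_steps`, and the `MAX_N` cap are all made up for illustration; this is not the code from the tool being discussed.

    ```c
    #include <stdbool.h>

    #define MAX_N 25            /* hypothetical cap to keep run time sane */

    static long steps;          /* number of calls made by bt() */

    /* Toy backtracking matcher (illustration only): supports literal
     * characters and "x?", and must match the whole text. */
    static bool bt(const char *p, const char *t)
    {
        steps++;
        if (*p == '\0')
            return *t == '\0';
        if (p[1] == '?') {
            /* Greedy: first try to consume a character, then try skipping. */
            if (*t == p[0] && bt(p + 2, t + 1))
                return true;
            return bt(p + 2, t);
        }
        return *t == p[0] && bt(p + 1, t + 1);
    }

    /* Matches (a?){n}a{n} against a{n}. Returns the number of bt() calls,
     * or -1 if n is out of range. */
    long backtrack_steps(int n, bool *matched)
    {
        char pat[3 * MAX_N + 1], txt[MAX_N + 1];
        int i, k = 0;

        if (n < 0 || n > MAX_N)
            return -1;
        for (i = 0; i < n; i++) { pat[k++] = 'a'; pat[k++] = '?'; }
        for (i = 0; i < n; i++) pat[k++] = 'a';
        pat[k] = '\0';
        for (i = 0; i < n; i++) txt[i] = 'a';
        txt[n] = '\0';

        steps = 0;
        *matched = bt(pat, txt);
        return steps;
    }
    ```

    Calling backtrack_steps(n, &matched) for increasing n should show the call count growing exponentially even though the input is only n characters long, because every "consume" choice has to be tried and abandoned before the all-skipped path succeeds.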

    Canonical reference: Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

    https://swtch.com/~rsc/regexp/regexp1.html

    Archive since it's down at the moment due to some App Engine stuff (!): https://web.archive.org/web/20200624195819/https://swtch.com...

    My blog post and sample code that verifies that GNU grep, awk, sed, and libc don't backtrack:

    Comments on Eggex and Regular Languages http://www.oilshell.org/blog/2020/07/eggex-theory.html

    https://github.com/oilshell/blog-code/tree/master/regular-la...

    [1] https://github.com/google/re2

    ----

    Also, the production-quality way of compiling regular expressions to C code looks like this [2], which is completely different from what's shown in the article:

    https://www.oilshell.org/release/0.8.pre8/source-code.wwz/_d...

    The regex is converted to an NFA, then a DFA, then a bunch of switch/goto statements in C that implement the DFA.

    This example is huge, but the switch/goto and lack of any notion of "threading" is the important point.

    Essentially, the instruction pointer is used as the DFA state. It's a very natural use of "goto" (one where you don't write or debug the gotos yourself).
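
    As a much smaller illustration of that style, here is a hand-written sketch of what such output might look like for the regex a(b|c)*d, anchored at both ends. This is not actual re2c output, and the function name `match_a_bc_star_d` is made up; it only shows the idea that each label is a DFA state and that no state variable or thread list exists.

    ```c
    #include <stdbool.h>

    /* Hand-written DFA for ^a(b|c)*d$ (illustration only).
     * "Which label we are at" is the DFA state. */
    bool match_a_bc_star_d(const char *s)
    {
        /* state 0: expect 'a' */
        switch (*s++) {
        case 'a': goto loop;
        default:  return false;
        }
    loop:   /* state 1: inside (b|c)*, or about to see 'd' */
        switch (*s++) {
        case 'b':
        case 'c': goto loop;
        case 'd': goto accept;
        default:  return false;
        }
    accept: /* state 2: accept only if the input ends here */
        return *s == '\0';
    }
    ```

    Each input character is read exactly once and there is nothing to backtrack to, so the run time is linear in the input length no matter what the input is.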

    [2] via http://re2c.org/ (not related to RE2)

  • neato bandito