For folks interested in finite-state transducers and other kinds of tooling available, check out XFST (Xerox Finite-State Transducer), which has been used in computational linguistics applications for a good 20 years now.
I remember a Finnish researcher from PARC coming to one of my classes at UT to show how you can use FSTs for handling Finnish morphology, which is, on its face, quite a feat.
If you are looking for an alternative to standard regex, especially if you have trouble with group logic and are looking for something maintainable, you might like the Rosie Pattern Language.
https://gitlab.com/rosie-pattern-language/rosie/-/blob/maste...
sweet, I did my "Diplom" CS thesis around 1997 on finite state transducers. it was much less trivial than I'd thought. the ask was to implement composition, and DFAs where possible, also for composed transducers: an "algebra of finite state transducers". use case was morphology. the topic was heavily underestimated and I had to finish halfway through. so: chapeau :-)
anyhow
wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
This doesn’t seem sufficient as soon as you want to perform some kind of structural substitution, for example doing the equivalent of s/"([^"]*)"/'$1'/. If it could do that and also somehow be able to replace any of the [^"] that match ['] by \', that would seem more useful.
More generally speaking, since regular expressions effectively define a parse tree on their matches, being able to perform more general transformations of those trees would be useful.
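For what it's worth, the quote-rewriting transform described above can be sketched with a replacement callback (using Python's re module here purely for illustration, since a single s/// pattern can't do the nested escaping in one pass):

```python
import re

def requote(match):
    # escape any single quotes inside the matched body, then re-wrap it
    body = match.group(1).replace("'", r"\'")
    return "'" + body + "'"

# "([^"]*)" -> '...' with any inner ' escaped as \'
text = 'say "don\'t panic" loudly'
print(re.sub(r'"([^"]*)"', requote, text))  # say 'don\'t panic' loudly
```

The point stands: the structural part of the substitution needs a callback (or a second pass over the captured group), which is exactly what plain substitution syntax lacks.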
> Regular expressions is a great tool for searching patterns in text. But I always found it unnatural for text editing.
The entire purpose of this project seems to hinge on this assertion, but there isn't a single example.
I don't understand what makes regex unnatural for editing? What is meant by editing? Why do people struggle with groups?
There are lots of examples of the syntax for this project, but why is it better than regular regex?
If there were a few examples showing "here's the vanilla regex version, here's my version, here's why my version makes this easier" I might be able to understand this project.
What a pleasure your C code is :) very nice (currently reading it)
My only quick comment is - the link to `theory.pdf` in README is broken (your pdf is in docs/ dir, so just need to change the url to include docs/).
> Avoid using * or + in the right part, as it can cause infinite loops
Why not just disallow this? I understand it would make the grammar more difficult to specify - but I don't see any good reason to keep it.
It's a cool exploration, but I'm missing examples of why it's actually better. (Which, TBF, might just be an indicator I spent too much time with regexps :)
I.e - why is trre "(cat):(dog)" an improvement over s/cat/dog? What's the improvement of "(x:)or" over s/xor/or? And so on. Pretty much all the examples match (at least in my head) to relatively easy regexps.
I think the core advantage, if there is one, would be in group logic, so maybe the examples should lean into that - even before explaining the basics. I'd explain why it's actually a better choice before explaining the full syntax, or hello-world use cases.
For the caesar cipher example, it screams for an "apply this in reverse" option - it's a pretty common request for a lot of text replacements, but it's super clear with this one. (Because programmer brain immediately screams "why express the logic twice?")
I don't know if it's useful (yet), but I think it's great you're trying to explore alternatives to a long-established status quo. (Caveat: That usually means there's a good chance it won't land. But it's still great to see an exploration)
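To make the comparison concrete, here are plain-sed renderings of the examples discussed above (my reading of them; the trre forms come from the project's README), including the Caesar case where "reversing" means writing the whole mapping out a second time:

```shell
# vanilla sed versions of the README examples
echo 'cat' | sed 's/cat/dog/'    # -> dog
echo 'xor' | sed 's/xor/or/'     # -> or

# Caesar shift by 1 via sed's transliteration command; note the
# inverse is the same mapping restated with source/target swapped:
echo 'hal' | sed 'y/abcdefghijklmnopqrstuvwxyz/bcdefghijklmnopqrstuvwxyza/'  # -> ibm
echo 'ibm' | sed 'y/bcdefghijklmnopqrstuvwxyza/abcdefghijklmnopqrstuvwxyz/'  # -> hal
```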
I feel like this is very underspecified. The very first example:
$ echo 'cat' | trre 'c:da:ot:g'
dog
Feels strange. What is happening here? The grammar says:

TRRE <- TRRE*
TRRE <- TRRE|TRRE
TRRE <- TRRE.TRRE
TRRE <- REGEX
TRRE <- REGEX:REGEX

What is the parse tree here? Why is "c" not being replaced with "da"? Or why isn't "c" being removed and "da" being replaced by "ot"?

I do like the idea of having a search/replace semantic that is more intuitive than grouping operators; back in MS-DOS days you could do "ren *.log *.txt" and it would work, which feels bananas to my modern bash-minded way of thinking, but it's very obvious looking at that command what it is supposed to do.
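For comparison, the one-step DOS rename mentioned above takes an explicit loop in bash (a minimal sketch, assuming the files end in .log):

```shell
# DOS: ren *.log *.txt  -- one declarative command.
# bash needs the logic spelled out per file:
for f in *.log; do
    mv -- "$f" "${f%.log}.txt"   # strip the .log suffix, append .txt
done
```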
Are the examples all actual outputs of the program? It's entirely possible that my understanding of the grammar is off, but it looks like these examples are wrong:
$ echo 'cat dog' | trre 'c:bat|d:hog'
bat hog

$ echo '' | trre ':a*'   # <- do NOT do this
dog

$ echo '' | trre ':(repeat-10-times){10}'
dog
Not sure the range operator is fully specified, this works and seems like it shouldn't:
echo "regular expressions" | ./trre "[a:A-z:a]"
REGULAR EXPRESSIONS
In fact, "[a:A-z:x]" seems to do the same thing as "[a:A-z:Z]" for all x.

In-place replacement of the text while parsing is a very powerful approach. Fingers crossed this flies.
This, combined with a probabilistic approach, could be even more interesting. Probabilistic approaches to regexes have existed since the '70s, if memory serves.
This feels like it would be interesting in the Helix editor. They don't have a good solution for traditional regexp search-and-replace, but this would fill that niche in a more elegant way.
Interesting! I did an internship where I tried to use transducers for fast information extraction. In theory, you can use FSTs for fast approximate parsing. It didn't really work out, but I had lots of fun implementing a library to compose FSTs and exploring cool algorithms for composing them. Not much business value was delivered, but I learned a lot.
I love the idea, wanna see where it goes. Gives me the same vibe as when jq came out all those years ago.
Check out Xerox XFST and its open source clone FOMA for finite state transducers as described by extended regular expressions, both of which describe the language of regular relations:
The Xerox FST book https://press.uchicago.edu/ucp/books/book/distributed/F/bo36... (Xerox XRCE's finite-state-tools comprising the compilers lexc, xfst and twolc, the languages they compile and the formal language finite state transducers describe including linguistic applications)
FOMA: the open source clone https://fomafst.github.io/
FOMA: the paper (Hulden, M. (2009) Proc. EACL) https://aclanthology.org/E09-2008.pdf
From the readme:
$ echo 'cat' | trre 'c:da:ot:g'
dog
Why?

Elsewhere, the readme says that ":" is "non-associative", and I had a look at the language grammar but haven't figured out how to parse a sequence of ":"s.
I would probably use this in a text editor if it was available. I don't struggle with group logic, but I often forget which tools use \ to reference capture groups, and which use $. (If it's Microsoft or Microsoft-adjacent, it's probably $.)
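To illustrate the split mentioned above: sed (and the vim/grep lineage) references capture groups with a backslash, while JavaScript-flavored tools like VS Code use $1, $2. The sed form:

```shell
# swap two captured words; sed-style tools write group references as \1, \2
echo 'cat dog' | sed -E 's/(cat) (dog)/\2 \1/'   # -> dog cat
```

In VS Code's find/replace box the same replacement would be written as `$2 $1`.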
Cool project. Here's my use of Finite State Transducers (det and nondet) for navigation agents:
This is nifty --- and small.
You could port this to like V7 Unix from 1979 or earlier; why didn't they get this idea? :)
Tools like sed build a transducer around the whole automaton: s/this/that/g.
Another alternative is a regex with replace-all. Then you can replace the regex cat with dog (or use a group for wrapping with spaces).
This is a common modus operandi for me in VS Code.
> $ echo 'xor' | '(x:)or' 'xor'
> cat
I got lost here.
I would give the colon operator lower precedence than concatenation and repetition.
I ran the installation lines and got this error:
make && sh test.sh
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_nft.c -o trre
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_dft.c -o trre_dft
test.sh: 14: Syntax error: "(" unexpected
Using bash fixed it.

Then I ran one of the generator examples:
echo '' | trre -g ':(0|1){,3}?'
And I got this error: ./trre: invalid option -- 'g'
Usage: ./trre [-d] [-m] expr [file]
you should call it `trex` :D
very inspiring idea, makes me wanna start projects I had in mind related to that. thanks and good luck
When I saw this headline, I got excited about the prospect of a new innovation in the application of Regular Expressions. After reading, I was scratching my head because trre doesn't provide any new capabilities and is essentially just yet another flavor of regex. Additional complexity without a significant upside.
Trre seems like an arbitrarily different version of `sed -e 's/a/b/' ..`. This method of search and replace has been ported essentially everywhere, from awk to vim to IntelliJ, and has always gotten the job done adequately and feels as natural as anything else I've learned.
Am I missing something?
p.s I just realized I've been regex'ing heavily for 21 (!) years now, time flies.
Another alternative that is similar would be Carmel (https://github.com/isi-nlp/Carmel-Repository)
My two penn'orth: I find the replacement part the easiest, and escaping all the characters which mean something in regexp the most annoying part.
Adding a new char to be escaped seems like another annoyance.
I try to avoid tools that make the hard bit harder, and the easy bit easier
so basically just `sed -e`?
Cool, I’m interested to see where you go with this.
I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.