Hacker News

Fast(er) regular expression engines in Ruby

by davidsojevic on 5/2/2025, 12:00:39 AM with 4 comments

by kayodelycaon on 5/2/2025, 11:28:13 PM
> Another nuance was found in ruby, which cannot scan the haystack with invalid UTF-8 byte sequences.
This is extremely basic Ruby: UTF-8 encoded strings must be valid UTF-8. This is not unique to Ruby. If I recall correctly, Python 3 does the same thing.
```
    2.7.1 :001 > haystack = "\xfc\xa1\xa1\xa1\xa1\xa1abc"
    2.7.1 :003 > haystack.force_encoding "ASCII-8BIT"
    => "\xFC\xA1\xA1\xA1\xA1\xA1abc" 
    2.7.1 :004 > haystack.scan(/.+/)
    => ["\xFC\xA1\xA1\xA1\xA1\xA1abc"]
```
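For anyone following along, here is a minimal sketch of the behavior described above, reusing the bytes from the irb session. The variable names (`error`, `binary_matches`, `scrubbed`) are only illustrative, and `String#scrub` is an alternative that the session above does not use; whether raw bytes or scrubbing fits better depends on what the benchmark is meant to measure.

```ruby
# Same bytes as in the irb session: the literal is tagged UTF-8,
# but the byte sequence is not valid UTF-8.
haystack = "\xfc\xa1\xa1\xa1\xa1\xa1abc"
haystack.valid_encoding? # => false

# Matching a regex against the broken string raises instead of scanning.
error = begin
  haystack.scan(/.+/)
  nil
rescue ArgumentError => e
  e
end

# Option 1: scan a binary (ASCII-8BIT) copy, so every byte is accepted.
# String#b returns a copy and leaves the original string untouched.
binary_matches = haystack.b.scan(/.+/)

# Option 2 (an assumption, not from the session above): replace the
# invalid bytes with U+FFFD first, then scan the now-valid UTF-8 string.
scrubbed = haystack.scrub
scrubbed_matches = scrubbed.scan(/.+/)
```

Using `String#b` rather than `force_encoding` avoids mutating the original haystack, which matters if the same string is reused across benchmark runs.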
This person is a senior engineer on their Team page. All they had to do was google "ArgumentError: invalid byte sequence in UTF-8". Or ask a coworker... the company has Ruby on Rails applications. headdesk
by DmitryOlshansky on 5/4/2025, 6:08:08 PM
I wonder how std.regex of dlang would fare in such test. Sadly due to a tiny bit of D’s GC use it’s hard to provide as a library for other languages. If there is an interest I might take it through the tests.
by yxhuvud on 5/2/2025, 9:29:06 PM
Eww, pretending to support utf8 matchers while not supporting them at all was not pretty to see.
by gitroom on 5/2/2025, 10:45:56 PM
Honestly that part bugs me, fake support is worse than no support imo