The fastest way to detect a vowel in a string

  • > The regex methods are unbelievably fast regardless of string length.

    I immediately pinned regex as the winner, and here is why:

    In Python, you can almost always count on specialized functionality in the stdlib to be faster than any Python code you write, because most of it has probably been optimized in CPython by now.

    Second, I have to ask, why would someone think that regular expressions, a tool specifically designed to search strings, would be slower than any other tool? Of course it's going to be the fastest! (At least in a language like Python.)

  • To paraphrase Arthur Dent, "this must be a strange new usage of the word 'fastest' with which I am not previously familiar".

    The fastest way to detect a vowel in a string on any reasonable architecture (Intel or AMD equipped with SIMD of some kind) is using 3-4 instructions which will process 16/32/64 (depends on SIMD length) bytes at once. Obviously access to these will require using a Python library that exposes SIMD.

    Leaving SIMD aside, a flat byte array of size 256 will outperform a bitmap since it's always faster to look up bytes in an array than bits in a bitmap, and the size is trivial.

  • A few years ago I came across the article "SIMD-friendly algorithms for substring searching". The "Generic SIMD" section into and example is small and quite understandable. I modified it to implement an LZ77 sliding window in Zig.*

    http://0x80.pl/notesen/2016-11-28-simd-strfind.html

    * Or rather, I tried my best. I burnt out on that project because I kept jumping back and forth between making a proper DEFLATE implementation or something bespoke. The SIMD stuff was really tough and once I got it "working", I figured I got all I needed from the project and let it go.

    https://github.com/rendello/compressor/blob/dev/src/str_matc...

  • Assume use of 8 bit characters. Declare a constant 256 entry array pre-filled with all False except for the five (or six) vowel characters. This is baked into the code and not initialized at runtime.

    Now for each character c in the input string, simply do an array index and see if it is true (a vowel) or not. This avoids either five conditionals, or a loop over the string 'aeiou'. The vowel test is constant time regardless of the character value.

  • This doesn’t even cover strings like “fly” and “naĂŻve”. ;)

  • Things don't have to be complicated or pretty to be fast or good. For a UTF-8 or ASCII (English) string:

      for(char c in string)
        if(c & 1 == 0)
          continue;
        switch(c & 0x1F)
          case(a & 0x1F)
            return true; //a or A
          case(e & 0x1F)
            return true; //e or E
          case(i & 0x1F)
            return true; //i or I
          case(o & 0x1F)
            return true; //o or O
          case(u & 0x1F)
            return true; //u or U
          default
            continue;
    
    Checking for consonants is about as free as it gets. 50-70% of characters in English text are consonants. By checking one bit, you can eliminate that many checks across the whole string. This should also more or less apply to any text encoding; this technique comes from an artifact of the alphabet itself. It just so happens that all English vowels (including Y and W) fall on odd indices within the alphabet.

    Characters aren't magic! They're just numbers. You can do math and binary tricks on them just like you would with any other primitive type. Instead of thinking about finding letters within a string, sometimes you get better answers by asking "how do I find one of a set of numbers within this array of numbers?". It seems to me that a lot of programmers consider these to be entirely disjoint problems. But then again, I'm an embedded programmer and as far as I'm concerned characters are only ever 8 bits wide. String problems are numeric problems for me.

    While I don't want to discourage people from exploring problem spaces, do understand that the problem space of ASCII has been trodden to the bedrock. Many problems like "does this string contain a vowel" have been optimally solved for decades. Your explorations should include looking at how we solved these problems in the 20th century, because those solutions are likely still extremely relevant.

  • Waiting for a few follow-ups:

    “What every programmer should know about vowels”

    “You and your vowels”

    “What the vowels have wrought”

    “We’re hiring! Vowels (YC 26)”

    And last but not least:

    “Things I Won’t Work With: Vowels”

  • need to add a bloom filter

  • LOL. I found that take time to find amazon affiliate link on this blog is so funny.

  • If your approach to "the fastest way to do x" starts with writing some python, you're already not taking the assignment seriously.

  • [flagged]