Hacker News

Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

by akarambir on 10/20/2018, 11:28:39 AM with 5 comments

by the_clarence on 10/20/2018, 4:26:04 PM
I see a lot of applications trying to take advantage of SIMD, but what happens when you try to run them on systems that don't support these instructions? My guess is that you need to write multiple implementations, each targeting a different instruction set, and then pick one at runtime with cpuid, but isn't that cumbersome, and doesn't it inflate the codebase dramatically?
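For illustration, here is a minimal sketch of that runtime-dispatch pattern, assuming GCC or Clang on x86. A plain ASCII check stands in for the real UTF-8 validator, and every name in it (`is_ascii`, `is_ascii_scalar`, `is_ascii_avx2`, `select_is_ascii`, `HAVE_X86_DISPATCH`) is hypothetical. It relies on the compiler's `target` function attribute and `__builtin_cpu_supports`, so the variants can live in one file rather than several.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1 /* hypothetical macro name */
#include <immintrin.h>
#endif

typedef bool (*ascii_fn)(const uint8_t *, size_t);

/* Portable fallback: always compiled, runs on any CPU. */
static bool is_ascii_scalar(const uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (p[i] & 0x80) return false;
    }
    return true;
}

#ifdef HAVE_X86_DISPATCH
/* AVX2 variant. The target attribute lets this one function use AVX2
 * even when the rest of the file is built without -mavx2. */
__attribute__((target("avx2")))
static bool is_ascii_avx2(const uint8_t *p, size_t n) {
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        /* movemask collects the top bit of each of the 32 bytes. */
        if (_mm256_movemask_epi8(v) != 0) return false;
    }
    /* Fewer than 32 bytes remain: finish with the scalar loop. */
    return is_ascii_scalar(p + i, n - i);
}
#endif

/* Ask the CPU what it supports and return the best implementation. */
static ascii_fn select_is_ascii(void) {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) return is_ascii_avx2;
#endif
    return is_ascii_scalar;
}

/* Public entry point: resolves the implementation on first use.
 * Not synchronized; a real library would resolve once at startup. */
bool is_ascii(const uint8_t *p, size_t n) {
    static ascii_fn impl = NULL;
    if (impl == NULL) impl = select_is_ascii();
    return impl(p, n);
}
```

Callers only ever see `is_ascii`, so the per-instruction-set cost is one extra function per variant plus the selector. On a non-x86 target, or with a compiler that lacks these extensions, the sketch compiles down to the scalar path alone.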
by bradleyjg on 10/20/2018, 1:55:02 PM
Under the new string model in Java 9 and later, a fairly frequent workflow is:
1) get external string
2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding
3) validate the byte stream
4) figure out if the code points in the incoming string can be represented in Latin-1
5) instantiate a Java string using either the Latin-1 encoder or the UTF-16 encoder
I know some or all of these steps are done using HotSpot intrinsics, and then the JIT/VM does inlining, folding and so on, but I wonder how fast a custom assembly function that does all these steps at once could be.
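As a rough illustration of fusing steps 3 and 4, here is a scalar C sketch that validates UTF-8 and, in the same pass, records whether every code point fits in Latin-1. It is not how HotSpot does it and it is not vectorized; the names `text_class` and `classify_utf8` are hypothetical, and it assumes the encoding has already been identified as UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TEXT_INVALID, TEXT_LATIN1, TEXT_NEEDS_UTF16 } text_class;

/* Hypothetical helper: one pass over the bytes that both validates
 * UTF-8 and checks whether all code points are <= U+00FF. */
static text_class classify_utf8(const uint8_t *s, size_t n) {
    text_class result = TEXT_LATIN1;
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        uint32_t cp;   /* decoded code point */
        uint32_t min;  /* smallest value allowed for this length */
        size_t len;
        if (b < 0x80) { i++; continue; }  /* ASCII fast path */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return TEXT_INVALID;  /* stray continuation or 0xF8..0xFF */

        if (n - i < len) return TEXT_INVALID;  /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return TEXT_INVALID;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return TEXT_INVALID;       /* overlong encoding */
        if (cp > 0x10FFFF) return TEXT_INVALID;  /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return TEXT_INVALID; /* surrogate */

        if (cp > 0xFF) result = TEXT_NEEDS_UTF16;
        i += len;
    }
    return result;
}
```

A caller would then pick the Latin-1 or UTF-16 path from the returned value. For example, `classify_utf8((const uint8_t *)"caf\xC3\xA9", 5)` yields `TEXT_LATIN1`, while the three bytes of the euro sign (`"\xE2\x82\xAC"`) yield `TEXT_NEEDS_UTF16`.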
by jwilk on 10/20/2018, 3:48:58 PM
Previous blog post on HN:
https://news.ycombinator.com/item?id=17081571
by kissiel on 10/20/2018, 1:29:52 PM
I wonder about the Joules per byte. AFAIK AVX units are quite expensive energy-wise.
by akarambir on 10/20/2018, 11:52:00 AM
What do Linux utilities like sed and awk use for text manipulation? They were very slow when I was changing a few table names in a SQL file.