Going harder on the topic of string validation

TL;DR: Simple string validation code for ASCII characters can be massively faster than regular-expression-based approaches. Jump tables are a technique with amazing speed characteristics for string validation and similar problem patterns. When optimizing performance, always include a wide range of sample data to cover best and worst case scenarios!

Two days ago, I read an interesting blog post by Maarten Balliauw on "Making string validation faster by not using a regular expression". The article caught my interest just because it was about performance and that's what generally gets me - waaaay too easily! Anyway, Maarten's blog post finally made me decide to start my own blog (thanks, Maarten!). So you are just looking at my very first blog post.

The bottom line of Maarten's blog post was roughly this: while regular expressions often seem like the obvious choice and the Swiss army knife for anything to do with strings, it sometimes pays off quite nicely on the performance side to handcraft a routine in plain, honest (readable, even!) C# code - ignoring the potential beauty and suspected benefits of any specialized frameworks.

Maarten's challenge was all about string validation, something most of us face on an unpleasantly regular basis. In this concrete case: how can we best tell if a parsed token contains a valid identifier, where "valid" means the only allowed characters are
- all uppercase letters: A-Z
- all lowercase letters: a-z
- all numbers: 0-9
- plus a handful of special characters: @/._-

His latest suggestion was to use the following method:
private static bool MatchesASCII(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = (value[i] >= 48 && value[i] <= 57) // 0-9
                      || (value[i] >= 65 && value[i] <= 90) // A-Z
                      || (value[i] >= 97 && value[i] <= 122) // a-z
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 
Now, this code is already way faster than the RegEx-based version. But is there anything else hiding in there that could be optimized? It turns out there is.

There are at least two issues with the above method. The first one is that it runs through all characters and performs a sequence of up to 11 checks on each character. One of the worst-case scenarios for it would be a string of the form "--------------------". In order to tell that this is actually a valid identifier (admittedly a bit of a weird one), we would need to traverse 20 characters and run all checks on each of them like so:
- Is our ASCII value >= 48 (check #1) and at the same time <= 57 (check #2)? No. --> Next check!
- Is our ASCII value >= 65 (check #3) and at the same time <= 90 (check #4)? No. --> Next check!
- Is our ASCII value >= 97 (check #5) and at the same time <= 122 (check #6)? No. --> Next check!
- Is our character value[i] == '@' (check #7)? No. --> Next check!
- Is our character value[i] == '/' (check #8)? No. --> Next check!
- Is our character value[i] == '.' (check #9)? No. --> Next check!
- Is our character value[i] == '_' (check #10)? No. --> Next check!
- Is our character value[i] == '-' (check #11)? Err. Yes. Hooray! Phew, we got a valid character. Let's check the next one!

Needless to say, this gets worse the more checks we need to perform. What if we had more allowed special characters? What if only some very specific, non-contiguous set of characters was allowed instead of the pleasantly wide ranges A-Z and a-z, which we can cover with range checks?

The second issue with this method is that its performance depends on the shape of the input (not only the length). If we were to pass in the following strings we should expect to see vastly different results:

"00000000000000000000" (20x zero)
"--------------------" (20x dash)
"@@@@@@@@@@@@@@@@@@@@" (20x at)

That, again, is caused by the order of all the checks in the big "if" statement which determines how many failed matching attempts we will have to go through before we identify a valid character. Measuring these cases with BenchmarkDotNet, I see the following results on my machine:
BenchmarkDotNet=v0.10.5, OS=Windows 6.1.7601
Processor=Intel Xeon CPU X5670 2.93GHz, ProcessorCount=24
Frequency=2857460 Hz, Resolution=349.9612 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1098.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1098.0;compatjit-v4.6.1098.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1098.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1098.0
Runtime=Clr
| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Custom code - ASCII only - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 48.40 ns | 1.2470 ns | 3.6768 ns | 49.92 ns |
| 'Custom code - ASCII only - 20 x dash' | LegacyJitX64 | LegacyJit | X64 | 133.33 ns | 3.1474 ns | 7.3569 ns | 129.01 ns |
| 'Custom code - ASCII only - 20 x at' | LegacyJitX64 | LegacyJit | X64 | 87.70 ns | 0.3725 ns | 0.3484 ns | 87.59 ns |
| 'Custom code - ASCII only - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 45.06 ns | 0.0205 ns | 0.0171 ns | 45.06 ns |
| 'Custom code - ASCII only - 20 x dash' | LegacyJitX86 | LegacyJit | X86 | 88.10 ns | 1.4674 ns | 1.1457 ns | 87.76 ns |
| 'Custom code - ASCII only - 20 x at' | LegacyJitX86 | LegacyJit | X86 | 71.33 ns | 2.4138 ns | 4.5925 ns | 68.75 ns |
| 'Custom code - ASCII only - 20 x zero' | RyuJitX64 | RyuJit | X64 | 41.90 ns | 1.6176 ns | 2.0458 ns | 40.79 ns |
| 'Custom code - ASCII only - 20 x dash' | RyuJitX64 | RyuJit | X64 | 82.16 ns | 0.0549 ns | 0.0486 ns | 82.16 ns |
| 'Custom code - ASCII only - 20 x at' | RyuJitX64 | RyuJit | X64 | 66.99 ns | 1.4145 ns | 3.4162 ns | 64.94 ns |
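
For reference, the harness that produced these numbers isn't shown here. A minimal BenchmarkDotNet setup could look roughly like the following sketch; the class, field and method names are my own, and MatchesASCII is assumed to be in scope for the benchmark class:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Jobs; // job attribute namespace in the 0.10.x era
using BenchmarkDotNet.Running;

[LegacyJitX64Job, LegacyJitX86Job, RyuJitX64Job]
public class AsciiValidationBenchmarks
{
    // the three input shapes measured above
    private readonly string _zeros = new string('0', 20);
    private readonly string _dashes = new string('-', 20);
    private readonly string _ats = new string('@', 20);

    [Benchmark(Description = "Custom code - ASCII only - 20 x zero")]
    public bool TwentyZeros() => MatchesASCII(_zeros);

    [Benchmark(Description = "Custom code - ASCII only - 20 x dash")]
    public bool TwentyDashes() => MatchesASCII(_dashes);

    [Benchmark(Description = "Custom code - ASCII only - 20 x at")]
    public bool TwentyAts() => MatchesASCII(_ats);
}

// somewhere in Main():
// BenchmarkRunner.Run<AsciiValidationBenchmarks>();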

So how can we improve all of that?

A looong time ago, some utterly nifty guys came up with the concept of "jump tables". The basic idea boils down to translating some input value (the "value under test") into an offset that maps directly into a small precomputed array (aka "a location in memory") where we find the answer to all our problems. This technique is something that e.g. most compilers use in order to optimize certain cases of switch statements.
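
To make that concrete, here is a small illustrative example of my own (not from Maarten's post). A switch over a dense range of values is exactly the kind of construct a compiler can lower into a jump table, turning a chain of comparisons into a single computed jump:

// A dense switch like this can be compiled down to the IL 'switch' opcode:
// the value serves as an index into a table of jump targets, so the cost
// is one range check plus one computed jump instead of n comparisons.
private static string DigitName(int digit)
{
    switch (digit)
    {
        case 0: return "zero";
        case 1: return "one";
        case 2: return "two";
        case 3: return "three";
        case 4: return "four";
        default: return "something else";
    }
}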

Sounds confusing? It's easy. Let's go!

Some background information (skip this if you're short of time) that we might want to understand first before we go ahead is something just as ancient as jump tables: the good ol' ASCII table. .NET uses UTF-16 to store strings internally. Luckily for us, though, the first 128 code points of UTF-16 are identical to the ASCII table, and we are looking for characters in the ASCII range only. So that's convenient, because we do not need to descend into the crazy lands of Unicode. All we need to know is ASCII. Also, we are not making an oversimplification here, since the original check in Maarten's blog post was using RegEx character ranges [A-Za-z] which again only match plain ASCII characters without umlauts, accents or any other fancy stuff.
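
A quick way to convince yourself of this in C#: casting a char (a UTF-16 code unit) to int yields its code point, which for this range equals the ASCII code.

Console.WriteLine((int)'0'); // 48
Console.WriteLine((int)'A'); // 65
Console.WriteLine((int)'z'); // 122 - the highest value we will need to handle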

Right. Where were we? Ah.


The first step towards our goal is creating our jump table. This table is nothing but an array, wide enough to hold one boolean value ("is the character allowed or not?") for the entire range of potentially valid characters. Since the highest ASCII value of the characters we need to check for is 122 (lowercase 'z') we can cut the array at 123 elements:
private static readonly bool[] _allowedChars = InitializeAllowedChars();

private static bool[] InitializeAllowedChars()
{
    var allowedChars = new bool[123]; // all values will be false by default
    for (int c = 'a'; c <= 'z'; c++)
    {
        allowedChars[c] = true;
    }
    for (int c = 'A'; c <= 'Z'; c++)
    {
        allowedChars[c] = true;
    }
    for (int c = '0'; c <= '9'; c++)
    {
        allowedChars[c] = true;
    }
    foreach (char c in "@/._-")
    {
        allowedChars[c] = true;
    }

    return allowedChars;
}
The cost of this is a little bit of memory on the heap (123*sizeof(bool) = 123 bytes), which should not cause us too much of a headache. Also, we run the initialization code only once per AppDomain during the static initialization phase, so we get implicit thread-safety for free. Nice!

And that is more or less it. Hold on! The actual check method is still missing. Let's fix that:
public static bool MatchesWithJumpTable(string value)
{
    if (!(value.Length > 0 && value.Length < 255)) return false;

    for (int i = 0; i < value.Length; i++)
    {
        // now, we can
        // a) first check the boundaries of our pre-populated array
        //    to see if the char has an integer representation higher than what we keep in the array
        //    in which case we would return false straight away or otherwise
        // b) jump to the right element in the array by using an offset (int value of the char)
        //    which will immediately tell us if the char is allowed or not.
        if (value[i] >= _allowedChars.Length || !_allowedChars[value[i]])
        {
            return false;
        }
    }

    return true;
}
And that's it, really. Still short and concise enough for my liking.
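
A few quick sample calls (made-up inputs of mine) to see it in action:

Console.WriteLine(MatchesWithJumpTable("user.name@host/path_1-2")); // True
Console.WriteLine(MatchesWithJumpTable("not valid!"));              // False: ' ' (32) and '!' (33) are not in the table
Console.WriteLine(MatchesWithJumpTable("héllo"));                   // False: 'é' (233) fails the bounds check
Console.WriteLine(MatchesWithJumpTable(""));                        // False: rejected by the length check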

We have managed to cut the number of required checks per character down to a constant two:
- the array dimension check and
- the actual jump table check.

We can now throw any number of arbitrary characters into our filter and the algorithm will still run at the same speed. Performance now depends only on the input length; the characteristics of the string no longer play a role.
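
For example, allowing a few extra special characters (a purely hypothetical set here) only changes the one-time initialization; the per-character lookup cost stays exactly the same:

// inside InitializeAllowedChars(): a longer special-character list costs
// the same at lookup time as a shorter one. Note that all of these stay
// below our 123-element cap; a character above 'z' would require growing the array.
foreach (char c in "@/._-+!#&")
{
    allowedChars[c] = true;
}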

Let's see what BenchmarkDotNet thinks about this algorithm (we only use the "20 x zero" test string because any kind of string will now exhibit the exact same performance characteristics):

| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Jump table - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 46.95 ns | 1.6837 ns | 4.9646 ns | 43.34 ns |
| 'Jump table - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 31.87 ns | 0.0474 ns | 0.0444 ns | 31.87 ns |
| 'Jump table - 20 x zero' | RyuJitX64 | RyuJit | X64 | 40.04 ns | 1.5401 ns | 2.1589 ns | 38.82 ns |

Not too surprisingly, these numbers are largely in line with the results above for the "20 x zero" best case. After all, for this special case the method with the multiple checks also evaluates only two conditions per character...

Now, as usual, there is one more thing we can do. Some like it, others might call the police. "Unsafe" is the magic keyword here. To cut a long story short, the fastest solution I could come up with looks like this:
public static unsafe bool UnsafeMatches(string value)
{
    if (!(value.Length > 0 && value.Length < 255)) return false;

    fixed (char* c = value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            int i2 = *(c + i);
            if (i2 >= _allowedChars.Length || !_allowedChars[i2])
            {
                return false;
            }
        }
    }

    return true;
}
Nothing fancy or surprising here. The exact same algorithm as above - simply rewritten to use a pointer.

BenchmarkDotNet tells us:
| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Unsafe jump table - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 32.73 ns | 0.0755 ns | 0.0631 ns | 32.70 ns |
| 'Unsafe jump table - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 26.85 ns | 0.0980 ns | 0.0765 ns | 26.84 ns |
| 'Unsafe jump table - 20 x zero' | RyuJitX64 | RyuJit | X64 | 37.36 ns | 2.0045 ns | 2.0585 ns | 36.22 ns |

Depending on the concrete usage scenario of this method, we could optimize it even further by splitting it into two parts the way Sergey Teplyakov describes on his blog. But that would require real-world sample data and loads of additional test runs.
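
I have not benchmarked that variant here, but if I read the technique correctly, the split would look roughly like this sketch of mine (not Sergey's actual code): keep a tiny fast path that the JIT can inline at the call site and move the loop into a separate method.

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool MatchesSplit(string value)
{
    // cheap early-outs stay in the small, inline-friendly wrapper
    if (value.Length == 0 || value.Length > 254) return false;
    return MatchesCore(value);
}

private static bool MatchesCore(string value)
{
    // the loop lives in its own method so the wrapper stays small enough to inline
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] >= _allowedChars.Length || !_allowedChars[value[i]])
        {
            return false;
        }
    }
    return true;
}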

So we are happy enough for today. Let's conclude:

- Jump tables provide a neat way of speeding up things where we can precalculate check results for a small range of candidates.

- When optimizing for performance, we need to always make sure to include a wide range of test candidates in order to cover the best, average and worst case scenarios.

- Unsafe code generally rocks when it comes to performance. It's not as easily digestible. But once you start micro-optimizing, oh well, you should be able to quickly grasp a short block of unsafe code like the above.

- I wonder if jump tables could potentially be useful in the case of a compiled RegEx where character ranges are known upfront, too. But that's one for the experts...

Private lesson for me:

- I should not blog. That stuff takes time...

Anyhoo, my next post might shed some more light on the topic of jump tables, and there's more in the pipeline, so stay tuned.

Comments

  1. Great article! And good style too. Although I believe it should be 123*sizeof(bool), not sizeof(char).

    1. Well spotted! Thanks! It's now corrected. It was way past midnight when I wrote this and I guess my brain was full of chars only...

  2. You could save more space with a BitArray.

    1. That's certainly correct and could prove valuable in situations where we have to deal with a lot more and more sparsely distributed elements. Have you tried implementing it? My quick and dirty test using the built-in System.Collections.BitArray type says that for this specific case here it would be significantly (~4 times) slower. And looking at the BitArray source code this result does not come as a surprise since the Get() method has several checks and bit operations: https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/collections/bitarray.cs#L194-L201

  3. Hello Daniel, Trying to reach you to see if we can get your input on this stackoverflow question that has been quite challenging. If you think I can connect with a charge for the hour you help with this, I am happy to do that. https://stackoverflow.com/questions/54818445/extract-a-particular-record-from-mongodb-where-the-records-use-object-guid-in-c My email is dmaczoner@gmail.com

  4. REs also suffer from the "non-streamable" issue.

    They work on full strings, so you need concatenation. An FSM gives you a streamable version where you feed in the input character by character.

    On top of that, you can do calculations and anything funny like looking things up in a dictionary, etc.

    Regexps are really nice for quickly implementing something, but they shouldn't be the automatic "go to" tool.

    It's also funny how people are frightened by REs and at the same time abuse them 😅

