Going harder on the topic of string validation

TL;DR: Simple string validation code for ASCII characters can be massively faster than regular-expression-based approaches. Jump tables are a technique with amazing speed characteristics for string validation and similar problem patterns. When optimizing performance, always include a wide range of sample data to cover best and worst case scenarios!

Two days ago, I read an interesting blog post by Maarten Balliauw on "Making string validation faster by not using a regular expression". The article caught my interest just because it was about performance and that's what generally gets me - waaaay too easily! Anyway, Maarten's blog post finally made me decide to start my own blog (thanks, Maarten!). So you are just looking at my very first blog post.

The bottom line of Maarten's blog post was roughly this: while regular expressions often seem like the obvious choice and the Swiss army knife for anything to do with strings, it sometimes pays off quite nicely on the performance side to handcraft a routine in plain, honest (readable, even!) C# code - ignoring the potential beauty and suspected benefits of any specialized frameworks.

Maarten's challenge was all about string validation, something most of us face on an unpleasantly regular basis. In this concrete case: how can we best tell if a parsed token contains a valid identifier, where "valid" means the only allowed characters are
- all uppercase letters: A-Z
- all lowercase letters: a-z
- all numbers: 0-9
- plus a handful of special characters: @/._-

His latest suggestion was to use the following method:
private static bool MatchesASCII(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = (value[i] >= 48 && value[i] <= 57) // 0-9
                      || (value[i] >= 65 && value[i] <= 90) // A-Z
                      || (value[i] >= 97 && value[i] <= 122) // a-z
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 
Now, this code is already way faster than the RegEx-based version. But is there anything else hiding in there that could be optimized? It turns out there is.

There are at least two issues with the above method. The first one is that it runs through all characters and performs a sequence of up to 11 checks on each character. One of the worst-case scenarios for it would be a string of the form "--------------------". In order to tell that this is actually a valid identifier (admittedly a bit of a weird one), we would need to traverse 20 characters and run all checks on each of them like so:
- Is our ASCII value >= 48 (check #1) and at the same time <= 57 (check #2)? No. --> Next check!
- Is our ASCII value >= 65 (check #3) and at the same time <= 90 (check #4)? No. --> Next check!
- Is our ASCII value >= 97 (check #5) and at the same time <= 122 (check #6)? No. --> Next check!
- Is our character value[i] == '@' (check #7)? No. --> Next check!
- Is our character value[i] == '/' (check #8)? No. --> Next check!
- Is our character value[i] == '.' (check #9)? No. --> Next check!
- Is our character value[i] == '_' (check #10)? No. --> Next check!
- Is our character value[i] == '-' (check #11)? Err. Yes. Hooray! Phew, we got a valid character. Let's check the next one!

Needless to say, this gets worse the more checks we need to perform. What if we had more allowed special characters? What if only some very specific, non-contiguous set of characters was allowed instead of the pleasantly wide ranges A-Z and a-z, which we can cover with range checks?

The second issue with this method is that its performance depends on the shape of the input (not only the length). If we were to pass in the following strings we should expect to see vastly different results:

"00000000000000000000" (20x zero)
"--------------------" (20x dash)
"@@@@@@@@@@@@@@@@@@@@" (20x at)

That, again, is caused by the order of all the checks in the big "if" statement which determines how many failed matching attempts we will have to go through before we identify a valid character. Measuring these cases with BenchmarkDotNet, I see the following results on my machine:
BenchmarkDotNet=v0.10.5, OS=Windows 6.1.7601
Processor=Intel Xeon CPU X5670 2.93GHz, ProcessorCount=24
Frequency=2857460 Hz, Resolution=349.9612 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1098.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1098.0;compatjit-v4.6.1098.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1098.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1098.0
Runtime=Clr
| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Custom code - ASCII only - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 48.40 ns | 1.2470 ns | 3.6768 ns | 49.92 ns |
| 'Custom code - ASCII only - 20 x dash' | LegacyJitX64 | LegacyJit | X64 | 133.33 ns | 3.1474 ns | 7.3569 ns | 129.01 ns |
| 'Custom code - ASCII only - 20 x at' | LegacyJitX64 | LegacyJit | X64 | 87.70 ns | 0.3725 ns | 0.3484 ns | 87.59 ns |
| 'Custom code - ASCII only - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 45.06 ns | 0.0205 ns | 0.0171 ns | 45.06 ns |
| 'Custom code - ASCII only - 20 x dash' | LegacyJitX86 | LegacyJit | X86 | 88.10 ns | 1.4674 ns | 1.1457 ns | 87.76 ns |
| 'Custom code - ASCII only - 20 x at' | LegacyJitX86 | LegacyJit | X86 | 71.33 ns | 2.4138 ns | 4.5925 ns | 68.75 ns |
| 'Custom code - ASCII only - 20 x zero' | RyuJitX64 | RyuJit | X64 | 41.90 ns | 1.6176 ns | 2.0458 ns | 40.79 ns |
| 'Custom code - ASCII only - 20 x dash' | RyuJitX64 | RyuJit | X64 | 82.16 ns | 0.0549 ns | 0.0486 ns | 82.16 ns |
| 'Custom code - ASCII only - 20 x at' | RyuJitX64 | RyuJit | X64 | 66.99 ns | 1.4145 ns | 3.4162 ns | 64.94 ns |
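
For reference, the harness that produced these numbers isn't shown here. A minimal BenchmarkDotNet setup could look roughly like the following sketch; the class, field and method names are my own, and MatchesASCII is assumed to be in scope for the benchmark class:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Jobs; // job attribute namespace in the 0.10.x era
using BenchmarkDotNet.Running;

[LegacyJitX64Job, LegacyJitX86Job, RyuJitX64Job]
public class AsciiValidationBenchmarks
{
    // the three input shapes measured above
    private readonly string _zeros = new string('0', 20);
    private readonly string _dashes = new string('-', 20);
    private readonly string _ats = new string('@', 20);

    [Benchmark(Description = "Custom code - ASCII only - 20 x zero")]
    public bool TwentyZeros() => MatchesASCII(_zeros);

    [Benchmark(Description = "Custom code - ASCII only - 20 x dash")]
    public bool TwentyDashes() => MatchesASCII(_dashes);

    [Benchmark(Description = "Custom code - ASCII only - 20 x at")]
    public bool TwentyAts() => MatchesASCII(_ats);
}

// somewhere in Main():
// BenchmarkRunner.Run<AsciiValidationBenchmarks>();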

So how can we improve all of that?

A looong time ago, some utterly nifty guys came up with the concept of "jump tables". The basic idea boils down to translating some input value (the "value under test") into an offset that maps directly into a small precomputed array (aka "a location in memory") where we find the answer to all our problems. This technique is something that e.g. most compilers use in order to optimize certain cases of switch statements.
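
To make that concrete, here is a small illustrative example of my own (not from Maarten's post). A switch over a dense range of values is exactly the kind of construct a compiler can lower into a jump table, turning a chain of comparisons into a single computed jump:

// A dense switch like this can be compiled down to the IL 'switch' opcode:
// the value serves as an index into a table of jump targets, so the cost
// is one range check plus one computed jump instead of n comparisons.
private static string DigitName(int digit)
{
    switch (digit)
    {
        case 0: return "zero";
        case 1: return "one";
        case 2: return "two";
        case 3: return "three";
        case 4: return "four";
        default: return "something else";
    }
}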

Sounds confusing? It's easy. Let's go!

Some background information (skip this if you're short of time) that we might want to understand first before we go ahead is something just as ancient as jump tables: the good ol' ASCII table. .NET uses UTF-16 to store strings internally. Luckily for us, though, the first 128 code points of UTF-16 are identical to the ASCII table, and we are looking for characters in the ASCII range only. So that's convenient, because we do not need to descend into the crazy lands of Unicode. All we need to know is ASCII. Also, we are not making an oversimplification here, since the original check in Maarten's blog post was using RegEx character ranges [A-Za-z] which again only match plain ASCII characters without umlauts, accents or any other fancy stuff.
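
A quick way to convince yourself of this in C#: casting a char (a UTF-16 code unit) to int yields its code point, which for this range equals the ASCII code.

Console.WriteLine((int)'0'); // 48
Console.WriteLine((int)'A'); // 65
Console.WriteLine((int)'z'); // 122 - the highest value we will need to handle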

Right. Where were we? Ah.


The first step towards our goal is creating our jump table. This table is nothing but an array, wide enough to hold one boolean value ("is the character allowed or not?") for the entire range of potentially valid characters. Since the highest ASCII value of the characters we need to check for is 122 (lowercase 'z') we can cut the array at 123 elements:
private static readonly bool[] _allowedChars = InitializeAllowedChars();

private static bool[] InitializeAllowedChars()
{
    var allowedChars = new bool[123]; // all values will be false by default
    for (int c = 'a'; c <= 'z'; c++)
    {
        allowedChars[c] = true;
    }
    for (int c = 'A'; c <= 'Z'; c++)
    {
        allowedChars[c] = true;
    }
    for (int c = '0'; c <= '9'; c++)
    {
        allowedChars[c] = true;
    }
    foreach (char c in "@/._-")
    {
        allowedChars[c] = true;
    }

    return allowedChars;
}
The cost of this is a little bit of memory on the heap (123*sizeof(bool) = 123 bytes), which should not cause us too much of a headache. Also, we run the initialization code only once per AppDomain during the static initialization phase, so we get implicit thread-safety for free. Nice!

And that is more or less it. Hold on! The actual check method is still missing. Let's fix that:
public static bool MatchesWithJumpTable(string value)
{
    if (!(value.Length > 0 && value.Length < 255)) return false;

    for (int i = 0; i < value.Length; i++)
    {
        // now, we can
        // a) first check the boundaries of our pre-populated array
        //    to see if the char has an integer representation higher than what we keep in the array
        //    in which case we would return false straight away or otherwise
        // b) jump to the right element in the array by using an offset (int value of the char)
        //    which will immediately tell us if the char is allowed or not.
        if (value[i] >= _allowedChars.Length || !_allowedChars[value[i]])
        {
            return false;
        }
    }

    return true;
}
And that's it, really. Still short and concise enough for my liking.
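
A few quick sample calls (made-up inputs of mine) to see it in action:

Console.WriteLine(MatchesWithJumpTable("user.name@host/path_1-2")); // True
Console.WriteLine(MatchesWithJumpTable("not valid!"));              // False: ' ' (32) and '!' (33) are not in the table
Console.WriteLine(MatchesWithJumpTable("héllo"));                   // False: 'é' (233) fails the bounds check
Console.WriteLine(MatchesWithJumpTable(""));                        // False: rejected by the length check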

We have managed to cut the number of required checks per character down to a constant two:
- the array dimension check and
- the actual jump table check.

We can now throw any number of arbitrary characters into our filter and the algorithm will still run at the same speed. Performance now depends only on the input length; the characteristics of the string no longer play a role.
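
For example, allowing a few extra special characters (a purely hypothetical set here) only changes the one-time initialization; the per-character lookup cost stays exactly the same:

// inside InitializeAllowedChars(): a longer special-character list costs
// the same at lookup time as a shorter one. Note that all of these stay
// below our 123-element cap; a character above 'z' would require growing the array.
foreach (char c in "@/._-+!#&")
{
    allowedChars[c] = true;
}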

Let's see what BenchmarkDotNet thinks about this algorithm (we only use the "20 x zero" test string because any kind of string will now exhibit the exact same performance characteristics):

| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Jump table - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 46.95 ns | 1.6837 ns | 4.9646 ns | 43.34 ns |
| 'Jump table - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 31.87 ns | 0.0474 ns | 0.0444 ns | 31.87 ns |
| 'Jump table - 20 x zero' | RyuJitX64 | RyuJit | X64 | 40.04 ns | 1.5401 ns | 2.1589 ns | 38.82 ns |

Not too surprisingly, these numbers are largely in line with the results above for the "20 x zero" best case. After all, for this special case the method with the multiple checks also evaluates only two conditions per character...

Now, as usual, there is one more thing we can do. Some like it, others might call the police. "Unsafe" is the magic keyword here. To cut a long story short, the fastest solution I could come up with looks like this:
public static unsafe bool UnsafeMatches(string value)
{
    if (!(value.Length > 0 && value.Length < 255)) return false;

    fixed (char* c = value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            int i2 = *(c + i);
            if (i2 >= _allowedChars.Length || !_allowedChars[i2])
            {
                return false;
            }
        }
    }

    return true;
}
Nothing fancy or surprising here. The exact same algorithm as above - simply rewritten to use a pointer.

BenchmarkDotNet tells us:
| Method | Job | Jit | Platform | Mean | Error | StdDev | Median |
|--------|-----|-----|----------|------|-------|--------|--------|
| 'Unsafe jump table - 20 x zero' | LegacyJitX64 | LegacyJit | X64 | 32.73 ns | 0.0755 ns | 0.0631 ns | 32.70 ns |
| 'Unsafe jump table - 20 x zero' | LegacyJitX86 | LegacyJit | X86 | 26.85 ns | 0.0980 ns | 0.0765 ns | 26.84 ns |
| 'Unsafe jump table - 20 x zero' | RyuJitX64 | RyuJit | X64 | 37.36 ns | 2.0045 ns | 2.0585 ns | 36.22 ns |

Depending on the concrete usage scenario of this method, we could optimize it even further by splitting it into two parts the way Sergey Teplyakov describes on his blog. But that would require real-world sample data and loads of additional test runs.
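
I have not benchmarked that variant here, but if I read the technique correctly, the split would look roughly like this sketch of mine (not Sergey's actual code): keep a tiny fast path that the JIT can inline at the call site and move the loop into a separate method.

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool MatchesSplit(string value)
{
    // cheap early-outs stay in the small, inline-friendly wrapper
    if (value.Length == 0 || value.Length > 254) return false;
    return MatchesCore(value);
}

private static bool MatchesCore(string value)
{
    // the loop lives in its own method so the wrapper stays small enough to inline
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] >= _allowedChars.Length || !_allowedChars[value[i]])
        {
            return false;
        }
    }
    return true;
}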

So we are happy enough for today. Let's conclude:

- Jump tables provide a neat way of speeding up things where we can precalculate check results for a small range of candidates.

- When optimizing for performance, we need to always make sure to include a wide range of test candidates in order to cover the best, average and worst case scenarios.

- Unsafe code generally rocks when it comes to performance. It's not as easily digestible. But once you start micro-optimizing, oh well, you should be able to quickly grasp a short block of unsafe code like the above.

- I wonder if jump tables could potentially be useful in the case of a compiled RegEx where character ranges are known upfront, too. But that's one for the experts...

Private lesson for me:

- I should not blog. That stuff takes time...

Anyhoo, my next post might shed some more light on the topic of jump tables, and there's more in the pipeline, so stay tuned.

Comments

  1. Great article! And good style too. Although I believe it should be 123*sizeof(bool), not sizeof(char).

    1. Well spotted! Thanks! It's now corrected. It was way past midnight when I wrote this and I guess my brain was full of chars only...

  2. You could save more space with a BitArray.

    1. That's certainly correct and could prove valuable in situations where we have to deal with a lot more and more sparsely distributed elements. Have you tried implementing it? My quick and dirty test using the built-in System.Collections.BitArray type says that for this specific case here it would be significantly (~4 times) slower. And looking at the BitArray source code this result does not come as a surprise since the Get() method has several checks and bit operations: https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/collections/bitarray.cs#L194-L201

  3. Hello Daniel, Trying to reach you to see if we can get your input on this stackoverflow question that has been quite challenging. If you think I can connect with a charge for the hour you help with this, I am happy to do that. https://stackoverflow.com/questions/54818445/extract-a-particular-record-from-mongodb-where-the-records-use-object-guid-in-c My email is dmaczoner@gmail.com

  4. REs also suffer from the "non-streamable" issue.

    They work on full strings, so you need concatenation. An FSM gives you a streamable version where you feed in the input character by character.

    On top of that, you can do calculations and anything funny like looking things up in a dictionary, etc.

    Regexps are really nice for quickly implementing something, but they shouldn't be the automatic "go to" tool.

    It's also funny how people are frightened by REs and at the same time abuse them 😅

