The Power of SearchValues in .NET

The SearchValues is a new type introduced in .NET 8. It provides an immutable, read-only set of values optimized for efficient searching.

But how is it better than traditional search? How does it work? Why is it faster? Let’s dive into the details.

Table Of Contents

Intro
Classic way
Spans
SearchValues
Search by multiple strings
Summary

Intro

Text searching is often a task in programming. When we search for some characters or substrings in relatively small strings, it works fast. However, we can expect slower performance when finding a “needle” in the “haystack”.

In programming, “haystack” is often used metaphorically to describe the larger data set in which you search for a specific item, commonly called the “needle.” The phrase “finding a needle in a haystack” frequently illustrates the challenge of searching for a small, specific item within a large dataset.

For example, let’s take the book content of Hackers, Heroes of the Computer Revolution. Chapters 1 and 2.

Imagine we want to count all letters and digits from 0 to 9. That is “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789”.

Classic way

Let’s get a book content.

NOTE:
I use .Result instead of the await keyword because I use Spans later. In .NET 8, you cannot use Spans with async/await even if they are not part of the async block because Spans are ref struct. This issue will be fixed in .NET 9.

string bookContent = new HttpClient()
    .GetStringAsync("https://www.gutenberg.org/cache/epub/729/pg729.txt")
    .Result;

Having the string and search values, we are going to use string.IndexOfAny method.

[Benchmark]
public int StringIndexOfAny()
{
    char[] searchValues = 
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        .ToArray();

    int count = 0, index = 0;
    while ((index = bookContent.IndexOfAny(searchValues, index)) >= 0)
    {
        count++;
        index++;
    }

    return count;
}

The mean execution time for this method is 22,336.9 μs (microseconds). It’s fast, but we can do better.

All benchmarks are at the end of the post.

Spans

Let’s solve the same task using ReadOnlySpan<char>.IndexOfAny method.

[Benchmark]
public int SpanIndexOfAny()
{
    string searchValues 
        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    ReadOnlySpan<char> remaining = bookContent;
    int count = 0, pos;
    while ((pos = remaining.IndexOfAny(searchValues)) >= 0)
    {
        count++;
        remaining = remaining.Slice(pos + 1);
    }

    return count;
}

This time, the mean execution time is 4,055.2 μs. It’s 5.5 times faster!

The thing is that the Spans IndexOfAny method is optimized for search.

It uses vector processing if the search set is a maximum of 5 characters.

Vector processing, also known as SIMD (Single Instruction, Multiple Data), is a technique used in modern CPUs to simultaneously perform operations on multiple data points. It allows for significant performance improvements, particularly in tasks that involve large amounts of data processing.

However, if the search set is bigger than 5 characters (like in our case), it uses the Bloom filter. In .NET, it’s a probabilistic map implementation. You can take a look at the implementation of IndexOfAny.

In simple words, it creates a bitmap. For every needle character, it sets 2 bits in that bitmap. Each character is checked against a bitmap when searching the haystack to see if both corresponding bits are set. If neither bit is set, the character can’t be part of the needle, allowing the search to proceed. However, if both bits are set, it’s probable — though not sure — that the character exists in the needle. At this point, the needle is searched for the character to confirm a match.

It’s a more efficient way to search the values. But can we do even better?

SearchValues

.NET 8 introduced the new type SearchValues. It provides an immutable, read-only set of values optimized for efficient searching.

First, we need to create the instance of our search values.

private static readonly SearchValues<char> s_searchValues 
    = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

Now, we can use them to find all our search values in the book content.

[Benchmark(Baseline = true)]
public int Search_Values()
{
    ReadOnlySpan<char> remaining = bookContent;
    int count = 0, pos;
    while ((pos = remaining.IndexOfAny(s_searchValues)) >= 0)
    {
        count++;
        remaining = remaining.Slice(pos + 1);
    }

    return count;
}

The mean execution time is 582.9 μs!

All benchmarks look the following.

The SearchValues do the same job almost 7 times faster than Span.IndexOfAny and 38 times faster than the classic string.IndexOfAny!

But how?

In the case of Spans IndexOfAny, the searching algorithm first examines if the search can be vectorized; if they’re not, it builds the probabilistic map.

The problem is that these steps are done for every lookup. Over and over again.

The whole magic of SearchValues is to do it once and cache the result for multiple lookups.

Search by multiple strings

In .NET 8, the SearchValues type is limited to searching by characters. However, in .NET 9 (in the preview when this post is published), it will be possible to search by multiple strings.

ReadOnlySpan<string> searchWords = ["dummy", "text", "and"];
SearchValues<string> searchValues =
    SearchValues.Create(searchWords, StringComparison.OrdinalIgnoreCase);

var searchString =
    """
    Lorem Ipsum is simply dummy text of
    the printing and typesetting industry.
    """;

var index = searchString
    .AsSpan()
    .IndexOfAny(searchValues);

Summary

SearchValues in .NET is a powerful structure that improves the efficiency of search operations. Providing a dedicated and optimized method for lookups, helps you write more performant and cleaner code, especially in scenarios where checking for multiple values is frequent.

The source code for benchmarks you can find on my GitHub.

Kamran

September 5, 2024 at 01:26

Great subject, Oleg! However, some inconsistencies exist between the terms used in the code samples and the explanations. This makes the reading experience a bit uneven.
Example 1: you introduce and explain “haystack” and “needle” but it is not obvious in the code which variable is which one. This can especially be a little bit confusing for newbies.
Example 2: about Span.IndexOfAny you say “It [Span’s IndexOfAny] uses vector processing if the search set is a maximum of 5 characters.” Again juniors and less experienced people may find it hard to tell which one, haystack or needle is “search set”.

Oleg Kyrylchuk
September 7, 2024 at 11:34
Thanks for your feedback, Kamran!
I agree that some terms can be difficult and confusing for newbies. I’ll take your feedback into account for future posts!
Cheers,
Oleg