
Thursday, October 4, 2012

In which I learn about multithreading performance

... or an exercise in optimizing IndexOf on List<T> for multicore.

So I had a great plan.
.NET 4 introduced AsParallel() into LINQ (well, 'PLINQ', but try using that word in conversation without giggling uncontrollably). So how about re-implementing some of these methods in a multithreaded way for .NET 3.5?
To get my feet wet, I decided to start off simple: I'd re-implement IndexOf on List<T>. Should be easy:

  • Perfect for splitting up: each thread just takes a range of the list
  • No need for critical sections
So I wrote an extension method IndexOfParallel<T> (a rough sketch of this first version follows the list):
  • it creates as many threads as there are logical cores,
  • passes each one a range of the list and a shared status object,
  • each thread checks its range of the list, sets the status object to found, and returns,
  • the main thread calls .Join() on each created thread.
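
The post doesn't show the code of that first attempt, so here is a minimal sketch of what it presumably looked like: raw Threads plus Join, with a property-based SearchStatus class. The class name, property names and the per-element early-exit check are my assumptions; the real plumbing ends up looking different in the final listing at the bottom.

using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical reconstruction of the first, raw-Thread version (not the post's original code)
internal class SearchStatus
{
    public bool Found { get; set; }
    public int FoundIndex { get; set; }
}

internal static class NaiveVersion
{
    public static int IndexOfParallel<T>(this List<T> source, T item)
    {
        int threadCount = Environment.ProcessorCount;
        SearchStatus status = new SearchStatus { Found = false, FoundIndex = -1 };
        Thread[] threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            // each thread gets its own slice of the list
            int fromIndex = (source.Count * t) / threadCount;
            int toIndex = (source.Count * (t + 1)) / threadCount;

            threads[t] = new Thread(() =>
            {
                for (int i = fromIndex; i < toIndex && !status.Found; i++)
                {
                    if (source[i].Equals(item))
                    {
                        status.FoundIndex = i;
                        status.Found = true;
                        return;
                    }
                }
            });
            threads[t].Start();
        }

        // wait for every worker before reading the result
        foreach (Thread thread in threads)
            thread.Join();

        return status.FoundIndex;
    }
}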
Done! Let's check how much better it performs compared to the regular IndexOf!

Looking for a thousand random ints in a list of a million ints:
  • Regular version: 8565064 ticks
  • Multithreaded version: 542255931 ticks
Whoops! Only sixty times slower!
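
For reference, the post doesn't show its timing harness; something along these lines (my guess at the setup, using Stopwatch.ElapsedTicks) would produce numbers in the same shape, looking up a thousand random values taken from the list itself:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using Tabeoka.Extensions;   // IndexOfParallel from the listing at the end of the post

internal static class BenchmarkSketch
{
    private static void Main()
    {
        Random random = new Random(42);

        List<int> list = new List<int>(1000000);
        for (int i = 0; i < 1000000; i++)
            list.Add(random.Next());

        // pick a thousand values that are guaranteed to be in the list
        int[] targets = new int[1000];
        for (int i = 0; i < targets.Length; i++)
            targets[i] = list[random.Next(list.Count)];

        Stopwatch stopwatch = Stopwatch.StartNew();
        foreach (int target in targets)
            list.IndexOf(target);
        stopwatch.Stop();
        Console.WriteLine("Regular version: {0} ticks", stopwatch.ElapsedTicks);

        stopwatch = Stopwatch.StartNew();
        foreach (int target in targets)
            list.IndexOfParallel(target);
        stopwatch.Stop();
        Console.WriteLine("Multithreaded version: {0} ticks", stopwatch.ElapsedTicks);
    }
}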

Perhaps a million ints is too small to have the proper effect. Let's try ten million:
  • Regular version: 9131999 ticks
  • Multithreaded version: 149267210 ticks
Well, that's only sixteen times slower now. Progress!
I checked the source for IndexOf on List<T>: it uses Array.IndexOf on its internal array, which turns out to use a native method for basic types. Clearly I can't improve on that. Perhaps I should compare against a List<string> (100,000 items) instead:
  • Regular version: 7044903 ticks
  • Multithreaded version: 93284309 ticks
Thirteen times slower. This is starting to piss me off. Why is it slow? What is slow?
Perhaps I shouldn't be trying to reinvent the wheel. What happens if I just use the overload of IndexOf on List<T> that takes a range instead of writing my own loop? Clearly I'll be losing my early exit, but with enough cores maybe it evens out:
  • Regular version: 7563221 ticks
  • Multithreaded version: 140339244 ticks
Nope! Twenty times slower! When I fired this version off against a List<int> of a million, I gave up waiting for it to finish at all. Extremely slow. I suspect the native method locks the List's internal array in memory, and that may act as a critical section.
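
For clarity, this is roughly what that experiment's per-thread body would look like (my reconstruction, not the post's code): instead of a hand-written loop, each worker hands its slice to the built-in range overload IndexOf(item, index, count), which means there is no early exit once another thread has already found the item.

using System.Collections.Generic;

// Hypothetical per-thread body for the "just use the range overload" experiment;
// SearchStatus is still the class-based status object at this point in the post.
internal class SearchStatus
{
    public bool Found { get; set; }
    public int FoundIndex { get; set; }
}

internal static class RangeOverloadVariant
{
    internal static void SearchListRange<T>(List<T> source, T item, int fromIndex, int toIndex, SearchStatus status)
    {
        // List<T>.IndexOf(item, index, count) scans 'count' elements starting at 'index'
        int index = source.IndexOf(item, fromIndex, toIndex - fromIndex);
        if (index >= 0)
        {
            status.FoundIndex = index;
            status.Found = true;
        }
    }
}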
Whatever the reason, let's scrap this, and go back to my own loop. What if we unroll the loop, say, four times? If that speeds it up significantly, we can deduce the loop implementation is slowing everything down:
  • Regular version: 7122743 ticks
  • Multithreaded version: 89569552 ticks
A small gain: twelve times slower. Clearly the loop itself is fine, so thread creation is probably to blame. Let's not create our own threads and use the ThreadPool instead:
  • Regular version: 7034966 ticks
  • Multithreaded version: 5879994 ticks
Whaaa! Success! It's not much, but I finally beat the built-in version. Let's see if we can improve it a bit more. How about turning the status class into a struct, and avoiding the property getters by making them public fields? Obviously we then need to pass the method parameter by ref:
  • Regular version: 6979197 ticks
  • Multithreaded version: 4006500 ticks
Awesome! If we increase the number of strings in the list to a million, the effect increases too:
  • Regular version: 76802790 ticks
  • Multithreaded version: 36655462 ticks
Twice as fast! I'm sure this can still be improved significantly. I still need to figure out at what list size it makes sense to switch to the multithreaded version. If you're working on a List<T> whose implementation of Equals is slow, it should do even better. There are a number of lessons learned already, though:
  • This method is only worth it in a few situations. Mostly it just isn't.
  • Only create new Threads if you will hold onto them for a long time; creating them is expensive.
  • Prefer using the ThreadPool.
  • Measure, measure, measure!
It'll be interesting to see how we manage to reimplement .Where(). The predicate delegate could be pretty expensive. IEnumerable<T> is forward-only; will exporting to a List<T> and splitting that up be faster? Excitement!
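
One way that could go (a purely speculative sketch of mine, not something from the post): buffer the IEnumerable<T> into a List<T> first, split it into ranges, filter each range on the ThreadPool into its own partial list, and stitch the partial results back together in range order.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

internal static class ParallelWhereSketch
{
    public static List<T> WhereParallel<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        // IEnumerable<T> is forward-only, so buffer it before splitting it up
        List<T> buffer = source.ToList();
        if (buffer.Count == 0)
            return buffer;

        int threadCount = Math.Min(buffer.Count, Environment.ProcessorCount);
        List<T>[] partials = new List<T>[threadCount];

        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            int finished = 0;
            for (int t = 0; t < threadCount; t++)
            {
                int slot = t;   // copy the loop variable before capturing it
                int fromIndex = (buffer.Count * t) / threadCount;
                int toIndex = (buffer.Count * (t + 1)) / threadCount;

                ThreadPool.QueueUserWorkItem(delegate(object state)
                {
                    // each worker filters its own slice into a private list: no locking needed
                    List<T> local = new List<T>();
                    for (int i = fromIndex; i < toIndex; i++)
                        if (predicate(buffer[i]))
                            local.Add(buffer[i]);
                    partials[slot] = local;

                    if (Interlocked.Increment(ref finished) == threadCount)
                        done.Set();
                });
            }
            done.WaitOne();
        }

        // concatenating in slot order preserves the original ordering
        List<T> result = new List<T>();
        foreach (List<T> part in partials)
            result.AddRange(part);
        return result;
    }
}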

So there we are. We beat the built-in IndexOf.

Menno


Here's, for now, the final result:

using System;
using System.Collections.Generic;
using System.Threading;

namespace Tabeoka.Extensions
{
    public static class ExtensionMethods
    {
        public static int IndexOfParallel<T>(this List<T> source, T item)
        {
            int threadCount = GetOptimalThreadCount(source.Count);

            // <= 1 also covers an empty list, which would otherwise leave the WaitOne below waiting forever
            if (threadCount <= 1)
                return source.IndexOf(item);

            SearchStatus status = new SearchStatus()
            {
                Found = false,
                FoundIndex = -1
            };

            // Looks like the ThreadPool always keeps at least as many threads
            // alive as there are cores (unless configured otherwise), so we
            // avoid paying the thread-creation cost ourselves
            using (ManualResetEvent resetEvent = new ManualResetEvent(false))
            {
                int threadsFinished = 0;
                for (int i = 0; i < threadCount; i++)
                {
                    int fromIndex = (source.Count * i) / threadCount;
                    int toIndex = (source.Count * (i + 1)) / threadCount;

                    ThreadPool.QueueUserWorkItem(new WaitCallback(delegate(object t)
                    {
                        SearchListRange(source, item, fromIndex, toIndex, ref status);
                        if (Interlocked.Increment(ref threadsFinished) == threadCount)
                            resetEvent.Set();

                    }));
                }

                resetEvent.WaitOne();
            }

            return status.FoundIndex;
        }

        private static int GetOptimalThreadCount(int listCount)
        {
            // needs more sophisticated logic
            return Math.Min(listCount, Environment.ProcessorCount);
        }

        private static void SearchListRange<T>(List<T> source, T item, int fromIndex, int toIndex, ref SearchStatus status)
        {
            int i;

            // unrolled by four: only run the unrolled body while a full block of four fits in the range
            for (i = fromIndex; i + 3 < toIndex; i += 4)
            {
                if (source[i].Equals(item))
                {
                    status.FoundIndex = i;
                    status.Found = true;
                    return;
                }

                if (source[i + 1].Equals(item))
                {
                    status.FoundIndex = i + 1;
                    status.Found = true;
                    return;
                }

                if (source[i + 2].Equals(item))
                {
                    status.FoundIndex = i + 2;
                    status.Found = true;
                    return;
                }

                if (source[i + 3].Equals(item))
                {
                    status.FoundIndex = i + 3;
                    status.Found = true;
                    return;
                }

                // if some other thread found it, quit searching!
                if (status.Found)
                    return;
            }

            // finish up the remaining (up to three) elements that didn't fit a full block of four
            for (; i < toIndex; i++)
            {
                if (source[i].Equals(item))
                {
                    status.FoundIndex = i;
                    status.Found = true;
                    return;
                }
            }
        }
    }

    internal struct SearchStatus
    {
        public bool Found;
        public int FoundIndex;
    }
}
