Harnessing the Power of SIMD/SSE Assembly Instructions… For Good?

SIMD Assembly instructions let you manipulate batches of data in parallel, in a single core. I’ve said it once and I’ll say it again: programming is the closest thing we have to magic.

Shell commands are like small cantrips, Python scripts are little helpful Tulpas. We even have our own Daemons! But whenever we need to squeeze performance down to the last byte, when we know a single misstep will make the program slow down drastically… that's when Assembly, the darkest of black magicks, comes in.

I haven’t been writing too often lately, as you probably didn’t notice (incidentally, if this is your first time reading me, welcome! Nice to meet you!). That’s because I had a pretty harsh exam last week, and I had to prepare a lot. The subject is called ‘Computer’s Organization II’, and it’s been a huge challenge keeping up to date with it.
So I decided I would take one of the exercises I made while I practiced, and turn it into an article. That way I can break two scissors with one stone (killing birds is bad and you should feel bad).

Without delaying this any further, let's cut straight to the chase. As usual, the code is available in this GitHub project.

What are processor instructions?

All the code we write, be it in Python, Java, or C, is eventually interpreted, or compiled, into tiny, atomic (from a programmer’s perspective) instructions for our CPU(s).
These instructions number in the thousands, and each of them does a very small thing, interacting directly with hardware.

As an example, an instruction may write a value into memory (that’s what variable assignment translates into), turn a bit on or off, or do a logical AND.
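In Intel syntax, those examples could look something like this (registers chosen arbitrarily, just for illustration):

```asm
mov  [rbp-8], eax   ; write a register's value into memory (a variable assignment)
or   ecx, 1         ; turn the lowest bit of ecx on
and  eax, ebx       ; logical AND of two registers
```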

My PC has an Intel processor, which is also the architecture we learn about in class, so sorry to all my ARM-using readers: I won't be inclusive enough today.

The language in which these instructions are specified (which is only one translation and linking away from literal binary) is called Assembly.

Translating C to Assembly: Let’s become compilers for a while.

For this article we'll be using a very small C function. Here's its entire code:

This function takes as parameters a pointer to a stream of bytes (a char weighs a single byte), a have char and a want char.
It assumes the stream ends in a 0 (and crashes with a segmentation fault if that's not the case), and iterates it byte by byte, replacing each instance of 'have' with 'want'.
As far as C goes, this is as fast as it gets, and it's faster than Python by a long shot (when I ran some benchmarks, the Python version of this function took two minutes for an input size that took 6 seconds in C).

What does this function look like in Assembly language, after going through the compiler? It will probably be something like this:
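Something along these lines, in NASM syntax (a hand-written sketch, assuming the System V calling convention: the stream pointer arrives in rdi, have in sil and want in dl):

```asm
; void replace(char *stream, char have, char want)
replace:
.loop:
    mov   al, [rdi]       ; load the current byte
    test  al, al          ; a 0 marks the end of the stream
    je    .done
    cmp   al, sil         ; is it `have`?
    jne   .next
    mov   [rdi], dl       ; overwrite it with `want`
.next:
    inc   rdi
    jmp   .loop
.done:
    ret
```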

Running that assembly function instead of the C version shouldn’t increase our performance. It may even lower it, since the compiler knows a few tricks we probably don’t, and does a few optimizations on this kind of code.

There’s an optimization it doesn’t usually use, though, and when it does, it never uses it to its full potential.

SIMD Instructions: Single Instruction, Multiple Data

Whenever we think of parallelism, we think of multicore processes, or even clusters. But what if we made a single core do many things at a time? That’s the idea the people at Intel had a few decades ago, and the world of image processing hasn’t been the same ever since.

You see, normally data are stored in general-purpose registers in our CPU, like the ones we just used. Most of them are 64 bits in size, and thus can store a long, a double, or an int. Well, technically two ints, but that's still not enough parallelism to be worth making instructions for.

However, most modern x86 processors have even bigger registers available: the XMM registers, 128 bits wide. That's enough for 16 whole bytes!

Imagine what we could do if we could process 16 bytes at a time, and make our programs 16 times faster. Reading from memory once and fetching 16 different bytes? Check. Processing them in batches and then writing them all at once into memory again? Check. The possibilities are endless.

Especially in image or signal processing, this opens the possibility of making trivially concurrent computations a lot faster. Like, a whole order of magnitude faster. And you know what's trivially concurrent? Processing a stream of bytes with no dependencies between them.

There’s a catch, though: if regular Intel instructions seem counterintuitive or ugly to you, get ready for the SIMD ones. Their names are batpoop crazy, and most of us can’t write 5 of them in a row without looking them up in the official manual (which is thankfully available for free).

With all those caveats dealt with, let me guide you through the SIMD implementation of the function I just showed you.
Let me warn you: it’s not pretty, but it’s pretty darned fast.

First example: getting a string’s length with SIMD Assembly instructions.

First, I iterate the stream once to size it up. For simplicity's sake, this function assumes the stream's size is divisible by 16 (padded with zeroes at the end, with at least one 0 byte to mark the end of the stream), since otherwise I'd just have to add a new if and fall back to the non-SIMD version of the size function.
Let's assume our users are willing to pad their arrays before passing them to us, in exchange for the performance boost and the saved therapy sessions.
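A sketch of that size function could look like this (my own reconstruction, not the exact size.asm from the repo; pcmpeqb compares all 16 bytes against zero at once, and pmovmskb squeezes the result down to one bit per byte):

```asm
; size_t size(char *stream)  --  stream in rdi, result in rax
; Assumes the stream is zero-padded to a multiple of 16 bytes.
size:
    xor      rax, rax
    pxor     xmm1, xmm1          ; xmm1 = sixteen 0x00 bytes
.loop:
    movdqu   xmm0, [rdi + rax]   ; fetch 16 bytes in a single read
    pcmpeqb  xmm0, xmm1          ; 0xFF in every lane that equals 0
    pmovmskb ecx, xmm0           ; compress the result to one bit per byte
    test     ecx, ecx
    jnz      .tail               ; a 0 byte lives in this block
    add      rax, 16
    jmp      .loop
.tail:
    tzcnt    ecx, ecx            ; position of the first 0 in the block
    add      rax, rcx            ; (bsf also works, since ecx is nonzero)
    ret
```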

If any of you need to throw up, or go have a bath, I’ll understand. Bookmark this article and come back in an hour.
But now we are reading from memory once every 16 bytes, instead of once per byte.
That makes this program up to 16 times faster than the (trivially implemented, without optimizations) C version!

Now, for the second part, let’s actually do what we were asked: Let’s replace some bytes!
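A sketch of the replacement loop (again my own reconstruction; the punpck/pshufd dance at the top just broadcasts one byte to all 16 lanes of an XMM register):

```asm
; void replace16(char *stream, size_t len, char have, char want)
; stream in rdi, len in rsi (a multiple of 16), have in dl, want in cl
replace16:
    movzx    edx, dl
    movd     xmm1, edx
    punpcklbw xmm1, xmm1
    punpcklwd xmm1, xmm1
    pshufd   xmm1, xmm1, 0       ; xmm1 = `have` in all 16 bytes
    movzx    ecx, cl
    movd     xmm2, ecx
    punpcklbw xmm2, xmm2
    punpcklwd xmm2, xmm2
    pshufd   xmm2, xmm2, 0       ; xmm2 = `want` in all 16 bytes
.loop:
    movdqu   xmm0, [rdi]         ; 16 bytes of the stream
    movdqa   xmm3, xmm0
    pcmpeqb  xmm3, xmm1          ; 0xFF wherever a byte equals `have`
    movdqa   xmm4, xmm3
    pand     xmm4, xmm2          ; `want` in the matching lanes
    pandn    xmm3, xmm0          ; the original bytes everywhere else
    por      xmm3, xmm4          ; blend the two halves
    movdqu   [rdi], xmm3         ; write all 16 bytes back at once
    add      rdi, 16
    sub      rsi, 16
    jnz      .loop
    ret
```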

That's it. Since byte-wise boolean operations are concurrent, and so are comparisons, we can actually process each set of 16 bytes completely in parallel.

Disclaimer: To those who really care about performance, note that I’ve done 2 memory reads to fetch the same data every time: once to calculate the length, and then another to do the replacement.

It would be optimal to do everything in one swoop, but the code would be a lot uglier and not that educational. That’s the reason we should expect this program to be only 8 times faster than the C version, instead of 16. It’s still a very nice improvement.

How much faster is using SIMD? Benchmarks!

To run these benchmarks I just used the time.h C library in a small program (available at the Github project). All I did was:

  • Pick an array size (multiple of 16)
  • Initialize an array of bytes with alternating values of the given size
  • Run the C version of the function, and measure the time it takes to complete the cycle.
  • Do the same with the SIMD Assembly version.

I repeated these steps with input sizes of 1.6e7, 1.6e8 and 1.6e9. I stopped there because the next one would’ve taken a bit longer than I wished to wait, but the trend was pretty clear:

INPUT_SIZE | ASM (s) | C (s)
1.6e7      |  0.005  | 0.069
1.6e8      |  0.050  | 0.687
1.6e9      |  0.466  | 6.868

It's even more linear than I expected! And the ratio is around 14, so it's even better than I estimated. I guess it has to do with doing fewer memory reads.

Conclusions

My first conclusion was: learning Assembly is fun.
My second conclusion was: there's no way I'm passing that exam.
For the sake of practicality, my third conclusion is: SIMD programs are very efficient, and whenever we need to speed something up, they're worth trying.

Or, if you’re more into higher level languages, you could look for a framework that already uses SIMD instructions in its implementation (cough, NumPy, Pandas).

In any case, I think learning about low-level stuff and the inner workings of our processors can help us write better code, and get a better idea of how things are working under the hood.

That’s all for today. I hope you’ve found this article entertaining, or even useful. If you wish to read further into this topic, I’d advise you to think of a problem where parallel processing may be useful, try and write it using SIMD, and then consult the Intel manual every couple steps to learn new instructions.

By the way, I've found C and Assembly to be the perfect languages to code without an IDE. I actually wrote all the code for this article using vim, which would have been impossible in, say, Java.

As usual, if you find any bugs in my code, or a way I could have optimized this further, or even a typo, please let me know in the comments! The same goes for any positive feedback as well, as that is always appreciated.

I’ll see you around, keep coding!

Follow me on Twitter or Medium to keep receiving more articles and tutorials. Please share this article in any social media you use.
Maybe you've got a friend who's been meaning to learn more Assembly lately? Hit them up with this!

3 thoughts on “Harnessing the Power of SIMD/SSE Assembly Instructions… For Good?”

  1. Shouldn’t size.asm use the trailing zero count of the complement of the mask? With the popcnt it can act funny if there are different things stored just after the end of the string

  2. “Disclaimer: To those who really care about performance, note that I’ve done 2 memory reads to fetch the same data every time: once to calculate the length, and then another to do the replacement.

    It would be optimal to do everything in one swoop, but the code would be a lot uglier and not that educational. That’s the reason we should expect this program to be only 8 times faster than the C version, instead of 16. It’s still a very nice improvement.”

    Do you have a copy of the 16x faster version of the code available somewhere?
