Have you ever asked anyone whether assembly language might still be useful nowadays? Here’s the short answer: YES. When you know how your computer works (not just the processor itself, but the whole thing - memory organization, the math co-processor and so on), you can optimize your code while writing it. In this short article, I will try to show you some optimizations that low-level programming makes possible.

Recently I was reading through my old posts and found a gap in the article about SSE - the post did not cover some of the implementation caveats. I decided to fill this gap and publish a new version.

## Finding maximum

So, let’s start off by searching for the maximum element in an array. Usually, this is nothing more than iterating through the array and comparing each element with some starting value. For the sake of both optimization and correctness, we set the initial value to the first element of the array. Like this:
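A minimal sketch of such a loop, assuming a non-empty array of `float` values (the function name `find_max_naive` is made up for this illustration):

```cpp
#include <cstddef>

// Returns the largest element of a non-empty array.
float find_max_naive(const float* data, std::size_t size) {
    float max_value = data[0];  // start from the first element, not from an arbitrary constant
    for (std::size_t i = 1; i < size; ++i) {
        if (data[i] > max_value) {
            max_value = data[i];
        }
    }
    return max_value;
}
```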

The first thing we could do is to store not the element found so far, but its index:
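The index-based variant might be sketched like this, again assuming a non-empty `float` array (`find_max_by_index` is a hypothetical name):

```cpp
#include <cstddef>

// Remembers where the current maximum lives instead of copying its value.
float find_max_by_index(const float* data, std::size_t size) {
    std::size_t max_index = 0;
    for (std::size_t i = 1; i < size; ++i) {
        if (data[i] > data[max_index]) {
            max_index = i;
        }
    }
    return data[max_index];
}
```

Whether this is actually faster depends on the compiler and the optimization level, so it is worth measuring on your own machine.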

This naive optimization has its effect (time in seconds; the value found in braces):

## Vector operations

This is quite a universal algorithm, which can be used for any type that supports comparison. But let’s think about how we can speed up that code. First of all, we could split the array into pieces and find the maximum among them.

There is a technology that allows exactly that. It is called SIMD - Single Instruction, Multiple Data. Simply put, it means processing multiple pieces of data (cells, variables, elements) with a single processor instruction. On x86 processors this is available through an instruction set extension called SSE.

Note: your processor or even operating system may not support these operations. So, before you continue reading this article, be sure to check that it does. On Linux you may look for `mmx|sse|sse2|pni|sse4_1|sse4_2|avx` in the `/proc/cpuinfo` file (`pni` is how SSE3 is listed there).
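The same check can also be made from inside a program. The sketch below assumes GCC or Clang on an x86 machine, because `__builtin_cpu_supports` is a compiler builtin and not standard C++; the wrapper names are made up for this example:

```cpp
#include <cstdio>

// Assumption: GCC or Clang targeting x86. These builtins do not exist elsewhere.
bool cpu_has_sse2() {
    __builtin_cpu_init();  // makes sure the CPU feature data is initialized
    return __builtin_cpu_supports("sse2") != 0;
}

bool cpu_has_avx() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
}

// Prints which of the two extensions the current processor reports.
void print_simd_support() {
    std::printf("sse2: %s\n", cpu_has_sse2() ? "yes" : "no");
    std::printf("avx:  %s\n", cpu_has_avx() ? "yes" : "no");
}
```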

The SSE extension provides a set of vector variables. These variables (on the lowest, assembly level they are the registers `XMM0` .. `XMM7`) allow us to store and process 128 bits of data as if they were a set of `16 char`, `8 short`, `4 float/int`, `2 double` or `1 128-bit int` variables.

But wait! There are other versions of SSE, allowing for different registers of a different size! Check this out:

### SIMD extensions

MMX - hot 1997:

• only integer items
• vectors have a length of `64 bits`
• 8 registers, namely `MM0`..`MM7`

SSE highlights:

• only 8 registers
• each register has a size of 128 bits
• 70 operations
• allows floating-point operations on vector elements

SSE2 features:

• in 64-bit mode, 8 more registers are available (so now we have `XMM0` .. `XMM15`)
• adds double-precision floating-point operations and integer operations on 128-bit vectors

SSE3 changes:

• allows for horizontal vector operations

SSE4 additions:

• adds 54 more operations (47 are given by SSE4.1 and 7 more come from SSE4.2)

AVX - brand new version:

• vector size is now `256 bits`
• registers are renamed to `YMMi`, while `XMMi` are the lower 128 bits of `YMMi`
• operations now have three operands - `DEST`, `SRC1`, `SRC2` (`DEST = SRC1 op SRC2`)

### SSE operations

So, I mentioned horizontal vector operations. But let’s take things in order.

There are two SSE operation types: scalar and packed. Scalar operations use only the lowest elements of the vectors. Packed operations deal with every element of the vectors given. Look at the images below and you will see the difference:
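To see the difference in code as well, here is a small sketch that adds the same two packs with the packed `_mm_add_ps` and the scalar `_mm_add_ss`. The function name and the output parameters are invented for this illustration; both inputs are assumed to point at four floats:

```cpp
#include <xmmintrin.h>  // SSE

// Adds a and b twice: once with the packed and once with the scalar instruction.
void add_packed_and_scalar(const float* a, const float* b,
                           float* packed_out, float* scalar_out) {
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    // Packed: all four elements are added.
    _mm_storeu_ps(packed_out, _mm_add_ps(va, vb));
    // Scalar: only the lowest element is added, the other three are copied from va.
    _mm_storeu_ps(scalar_out, _mm_add_ss(va, vb));
}
```

For `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}` (in memory order) the packed result is `{11, 22, 33, 44}`, while the scalar one is `{11, 2, 3, 4}`.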

Horizontal operations work on vectors in a different direction. Instead of operating on elements in the corresponding positions of two vectors, they operate on elements in adjacent positions of the same vector:

So there are six “types” of operations, in three pairs, as described above. They are:

• operations dealing with single-precision or double-precision values
• operations working on all elements of a pack (packed) or only on the lowest element of a pack (scalar)
• operations handling values in corresponding or adjacent positions

To determine the type of an operation, you just need to look at its name; the last two characters tell you whether it is packed or scalar (`P` or `S`) and single- or double-precision (`S` or `D`):

`HADDPS` -> `Horizontal` `ADD` `Packed` `Single-precision`

### Working with SSE

The images above describe how processor instructions (assembly commands) work. To map those onto C++ functions, you only need to replace the assembly operation with the corresponding function from the SSE headers (I’ll cover that in just a second). The main goal of the explanations above was to give you an idea of how the operations themselves work and where they store their results.

To work with SSE we need to follow these three steps:

1. load data into XMM registers
2. perform all the operations needed on those XMM registers
3. store data from XMMs into usual variables
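The three steps above can be sketched on a tiny example that multiplies four floats by a constant. The function name `scale4` is hypothetical, and the unaligned load and store are used so that the sketch makes no assumption about how the arrays are aligned:

```cpp
#include <xmmintrin.h>  // SSE

// Multiplies four consecutive floats by factor and writes the result to out.
void scale4(const float* in, float factor, float* out) {
    const __m128 values = _mm_loadu_ps(in);                          // 1. load data into a register
    const __m128 scaled = _mm_mul_ps(values, _mm_set1_ps(factor));   // 2. operate on registers
    _mm_storeu_ps(out, scaled);                                      // 3. store back to memory
}
```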

To use vector operations, you will need to include some header files in your code and turn on the matching compiler flags:

• `mmintrin.h` - MMX
• `xmmintrin.h` - SSE
• `emmintrin.h` - SSE2
• `pmmintrin.h` - SSE3
• `smmintrin.h` - SSE4.1
• `nmmintrin.h` - SSE4.2
• `immintrin.h` - AVX

You do not have to include the headers of the earlier extensions yourself. The compiler flags are `-mmmx`, `-msse`, `-msse2`, `-msse3`, `-msse4`, `-mavx`, correspondingly. As with the header files, none of these flags requires you to turn on the previous ones.

### Data types

There are three “standard” data types within SSE:

1. `__m128`, which is SSE’s `float[4]`
2. `__m128d` corresponds to `double[2]`
3. `__m128i` represents one of these: `char[16]`, `short int[8]`, `int[4]` or `uint64_t[2]`

Each of them needs to be converted from or to standard C++ types with its own intrinsic (a C++ function wrapping an SSE operation).

### Intrinsics

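SSE operations in C++ are named this way: `_mm_{OPERATION}_{SUFFIX}`. The operation is the vector operation you want to perform. The suffix tells the processor how it should treat the operands (packed/scalar, single-/double-precision, etc.).

For optimization’s sake, it is better if operands for intrinsics are aligned in memory on a 16-byte boundary. Be careful, though: the compiler will not sort this out for you. The aligned load and store intrinsics (`_mm_load_ps`, `_mm_store_ps`) require a 16-byte aligned address and may crash otherwise, so use their unaligned counterparts (`_mm_loadu_ps`, `_mm_storeu_ps`) whenever alignment is not guaranteed.

To build a vector from plain values, there are the set intrinsics (vectors are written here with the highest element first):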

1. `_mm_set_ps(4.0, 3.0, 2.0, 1.0)` -> `[4.0, 3.0, 2.0, 1.0]`
2. `_mm_set1_ps(3.0)` -> `[3.0, 3.0, 3.0, 3.0]`
3. `_mm_set_ss(4.0)` -> `[0.0, 0.0, 0.0, 4.0]`
4. `_mm_setzero_ps()` -> `[0.0, 0.0, 0.0, 0.0]`
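One caveat is easy to miss here: the lists in this article write vectors with the highest element first, while memory holds the lowest element first. A short sketch that dumps each vector to memory makes the order visible (`set_examples` and its output layout are made up for this example):

```cpp
#include <xmmintrin.h>  // SSE

// Stores the result of each "set" intrinsic into one row of out, lowest element first.
void set_examples(float out[4][4]) {
    _mm_storeu_ps(out[0], _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));  // memory: 1, 2, 3, 4
    _mm_storeu_ps(out[1], _mm_set1_ps(3.0f));                    // memory: 3, 3, 3, 3
    _mm_storeu_ps(out[2], _mm_set_ss(4.0f));                     // memory: 4, 0, 0, 0
    _mm_storeu_ps(out[3], _mm_setzero_ps());                     // memory: 0, 0, 0, 0
}
```

So the last argument of `_mm_set_ps` ends up at index 0 of the array, not the first one.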

Similarly, there are intrinsics for storing data from vectors in the usual C++ types (the examples below assume the same `__m128 t = [4.0, 3.0, 2.0, 1.0]`):

1. `_mm_store_ps(float[4], __m128)` -> `[4.0, 3.0, 2.0, 1.0]`
2. `_mm_store_ss(float*, __m128)` -> `1.0`
3. `_mm_store1_ps(float[4], __m128)` -> `[1.0, 1.0, 1.0, 1.0]`
4. `double _mm_cvtsd_f64(__m128d)` -> `1.0`
5. `int _mm_cvtsi128_si32(__m128i)` -> `1` (for given `__m128i [4, 3, 2, 1]`)
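A sketch that exercises the store intrinsics from this list is shown below. The `StoreResults` struct and the function name are invented for the example; the arrays are explicitly aligned to 16 bytes because `_mm_store_ps` and `_mm_store1_ps` require an aligned destination:

```cpp
#include <emmintrin.h>  // SSE2: __m128d, __m128i and their intrinsics
#include <xmmintrin.h>  // SSE

struct StoreResults {
    alignas(16) float all[4];        // filled by _mm_store_ps
    alignas(16) float broadcast[4];  // filled by _mm_store1_ps
    float lowest;                    // filled by _mm_store_ss
    double low_double;               // returned by _mm_cvtsd_f64
    int low_int;                     // returned by _mm_cvtsi128_si32
};

StoreResults store_examples() {
    StoreResults r;
    const __m128 t = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    _mm_store_ps(r.all, t);          // memory order: 1, 2, 3, 4
    _mm_store1_ps(r.broadcast, t);   // lowest element copied to all four slots
    _mm_store_ss(&r.lowest, t);      // only the lowest element
    r.low_double = _mm_cvtsd_f64(_mm_set_pd(2.0, 1.0));
    r.low_int = _mm_cvtsi128_si32(_mm_set_epi32(4, 3, 2, 1));
    return r;
}
```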

### Finding maximum

So, let’s get back to finding the maximum in an array. For this task we will compare the array four elements at a time, keeping the four running maximums in an `XMMi` register:
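This first step might be sketched as follows, assuming the array length is a non-zero multiple of four (`lane_maxima` is a hypothetical name, and the unaligned load is used so that no alignment is assumed):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Returns a vector whose element k is the maximum of data[k], data[k + 4], data[k + 8], ...
// Assumes size is a non-zero multiple of 4.
__m128 lane_maxima(const float* data, std::size_t size) {
    __m128 max_vec = _mm_loadu_ps(data);
    for (std::size_t i = 4; i < size; i += 4) {
        max_vec = _mm_max_ps(max_vec, _mm_loadu_ps(data + i));
    }
    return max_vec;
}
```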

But if you run this code, you may notice that it does not return the maximum in 100% of cases. This is because we end up with four maximums, one for each position across the four-element portions of the array, and only one of those four is the maximum of the whole array. But how can we find the maximum among four numbers? Running a loop seems obvious but not efficient enough…

We may use the `shuffle` intrinsic! That is, we cyclically shift the vector three times, each time taking the maximum of the shifted vector and its previous value. That will give us the maximum in all four positions of our vector.

Here is a better explanation:

If we want to cyclically shift a 4-number array, we use the `_mm_shuffle_ps` intrinsic.

It takes 3 parameters: `m1`, `m2`, and `mask`. The first two are four-word (four-number) packs. The mask consists of four two-bit indices and shows which elements of pack `m2` and which elements of pack `m1` will form the result. This mask can be obtained using the `_MM_SHUFFLE(z, y, x, w)` macro, which forms an integer according to the formula `(z << 6) | (y << 4) | (x << 2) | w`.

Given those definitions, the call `m3 = _mm_shuffle_ps(m1, m2, _MM_SHUFFLE(z, y, x, w))` produces `m3 = [m2[z], m2[y], m1[x], m1[w]]` (highest element first): the two higher words are picked from `m2` and the two lower words from `m1`.

So we want to cyclically shift a pack by one element, like this: `[4, 2, 3, 1] => [2, 3, 1, 4]`. We need to pass the initial pack, `[4, 2, 3, 1]`, twice: `_mm_shuffle_ps([4, 2, 3, 1], [4, 2, 3, 1], mask)`, and form a mask which will use elements `[2, 3]` for the higher words of the result and elements `[1, 4]` for the lower words. These elements can then be indexed as follows:

Indices count from the lowest element: in `[4, 2, 3, 1]` the element `1` has index 0 and the element `4` has index 3. So to get the pair `[2, 3]` we need elements with indices `[2, 1]`. And to get the pair `[1, 4]` we need elements with indices `[0, 3]`.

Given that, we can use the `_MM_SHUFFLE()` macro to generate the mask: `_MM_SHUFFLE(2, 1, 0, 3)`. And the final call looks like this: `_mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2, 1, 0, 3))`.

And our `max` function in pseudo-code looks like this:

Which will be executed like this:

The `_MM_SHUFFLE(2, 1, 0, 3)` call expands to `(2 << 6) | (1 << 4) | (0 << 2) | 3`, which equals `147`, or `0x93` in hex.

And here is the final C++ implementation:
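A complete version could be sketched like this. It is an illustration and not necessarily the exact listing: the name `find_max_sse` is made up, the array is assumed to be non-empty, unaligned loads are used, and a plain loop picks up the elements that do not fill a whole pack of four:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Returns the largest element of a non-empty float array.
float find_max_sse(const float* data, std::size_t size) {
    float result = data[0];
    std::size_t i = 0;

    if (size >= 4) {
        // Four running maximums, one per position inside a pack.
        __m128 max_vec = _mm_loadu_ps(data);
        for (i = 4; i + 4 <= size; i += 4) {
            max_vec = _mm_max_ps(max_vec, _mm_loadu_ps(data + i));
        }
        // Rotate by one element three times; afterwards every element holds the overall maximum.
        for (int step = 0; step < 3; ++step) {
            const __m128 rotated = _mm_shuffle_ps(max_vec, max_vec, _MM_SHUFFLE(2, 1, 0, 3));
            max_vec = _mm_max_ps(max_vec, rotated);
        }
        _mm_store_ss(&result, max_vec);
    }

    // Leftover elements (or the whole array if it has fewer than four elements).
    for (; i < size; ++i) {
        if (data[i] > result) {
            result = data[i];
        }
    }
    return result;
}
```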

The code for finding the maximum in an integer array with SSE is very, very similar to the previous one - you just need to use the integer vector type `__m128i`, intrinsics with a different suffix (for example `_mm_max_epi32`, which requires SSE4.1) and a different store operation:

### Profit?

If we compare the results of all three methods - the usual loop, index-based searching and SSE - we may see something like this (I ran these tests on my laptop’s i7 processor on one million random float/int values):

Here you can see that index-based searching gives some speed-up (around `15%`). But the real speed boost is gained with SSE (almost `4 times`!).

## Calculating the sum

Now let’s try something harder - calculating the sum of an array’s elements. Here we will use the horizontal vector operations. But first, here’s the general algorithm:
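In its plainest form this is just a running total. A minimal sketch for `float` values (`sum_naive` is a hypothetical name):

```cpp
#include <cstddef>

// Adds the elements one by one, left to right.
float sum_naive(const float* data, std::size_t size) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}
```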

Simple enough, huh? Now let’s use SSE’s `_mm_add_ps` intrinsic. Running it on each pack of four elements will give us a vector of four partial sums:
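A sketch of this accumulation step, assuming the array length is a multiple of four (`lane_sums` is a made-up name; unaligned loads keep the sketch free of alignment assumptions):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Returns a vector whose element k is data[k] + data[k + 4] + data[k + 8] + ...
// Assumes size is a multiple of 4.
__m128 lane_sums(const float* data, std::size_t size) {
    __m128 sum_vec = _mm_setzero_ps();
    for (std::size_t i = 0; i < size; i += 4) {
        sum_vec = _mm_add_ps(sum_vec, _mm_loadu_ps(data + i));
    }
    return sum_vec;
}
```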

But if we now add the elements of that vector horizontally to themselves, we get a vector holding two pairwise sums. Adding that one horizontally to itself gives us the final total in a single element:

Nice, isn’t it? But wait! Integers are available too! And they need their own special intrinsics! Have no fear, nothing is that different here, only the suffixes change (for example `_mm_add_epi32` instead of `_mm_add_ps`):

And, just to confirm our assumption about the speed-up, here’s the benchmark (on one million elements):
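The two horizontal additions might be sketched with the SSE3 intrinsic `_mm_hadd_ps`. The function name is hypothetical, and the `target` attribute is a GCC/Clang-specific assumption that lets this single function use SSE3 without passing `-msse3` for the whole file; with other compilers, drop the attribute and enable SSE3 through the build flags instead:

```cpp
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps
#include <xmmintrin.h>  // SSE

// Adds up the four elements of v using two horizontal additions.
__attribute__((target("sse3")))  // assumption: GCC or Clang
float horizontal_sum(__m128 v) {
    // Lowest element first: [v0 + v1, v2 + v3, v0 + v1, v2 + v3]
    const __m128 pairs = _mm_hadd_ps(v, v);
    // Every element now holds v0 + v1 + v2 + v3.
    const __m128 total = _mm_hadd_ps(pairs, pairs);
    return _mm_cvtss_f32(total);  // take the lowest element
}
```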

## Limitations?

Please note the difference between the sum calculated with the naive loop and the one calculated with SSE: they do differ. This is caused by the way computers work nowadays, or, more precisely, by how they store floating-point values. Since computers use the binary system, they cannot simply store all the digits after the point in memory and operate on them efficiently. The SSE version also adds the elements in a different order, so its rounding errors accumulate differently.

Remember how integers are stored in the binary system? Say, 14:

```
+-----------+------------------------+
|     n     | 5   4   3   2   1   0  |
+-----------+------------------------+
| pow(2, n) | 32  16  8   4   2   1  |
+-----------+------------------------+
|   fits?   | N   N   Y   Y   Y   N  |
+-----------+------------------------+
|   14  =   | 0 + 0 + 8 + 4 + 2 + 0  |
+-----------+------------------------+
| bin(14) = | 0   0   1   1   1   0  |
+-----------+------------------------+
```

That is, the binary representation of `14` is `001110`. Leading zeroes can be skipped in the binary system (as there might be as many of those as you wish).

A similar thing happens with floating-point numbers: the difference is that the computer stores the negative powers of two:

```
+------------+-----------------------------------------+
|     n      | 5        4       3      2     1    0    |
+------------+-----------------------------------------+
| pow(2, -n) | 0.03125  0.0625  0.125  0.25  0.5  1.0  |
+------------+-----------------------------------------+
|   fits?    | N        N       Y      Y     Y    N    |
+------------+-----------------------------------------+
|   0.9  =   | 0    +   0  +  0.125 + 0.25 + 0.5 + 0   |
+------------+-----------------------------------------+
| bin(0.9) = | 0        0       1      1     1    0    |
+------------+-----------------------------------------+
```

As you can see, using 5 bits is not enough to represent 0.9, but only `0.875`. Even if we use 32 bits (which is just a `float` data type in C), we will have the bits `011001100110011001100110011001110` (written in the same order as in the table above), which is `0.8999999999068677`, but still, it’s not exactly what we wanted. With 64 bits (the `double` type in C) it is better, `0.8999999999999999`, but, again, not the exact value. (Real `float` and `double` values are laid out as a sign, an exponent and a mantissa rather than as plain binary fractions, but the rounding problem is the same.) And if we add up one million imprecise numbers, we will probably get an imprecise result.
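Both effects can be poked at without any intrinsics. The sketch below uses invented function names: it checks whether `0.9f` is stored exactly, and it sums an array in two orders, sequentially and in four interleaved partial sums the way the SSE version does (the second function assumes the length is a multiple of four). On large inputs of inexact values the two sums will typically disagree in the last digits:

```cpp
#include <cstddef>

// True only if the float nearest to 0.9 is exactly the double nearest to 0.9 (it is not).
bool float_stores_0_9_exactly() {
    return static_cast<double>(0.9f) == 0.9;
}

// Adds the elements one by one, left to right.
float sum_sequential(const float* data, std::size_t size) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}

// Scalar imitation of the SSE order: four partial sums, combined at the end.
// Assumes size is a multiple of 4.
float sum_four_lanes(const float* data, std::size_t size) {
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < size; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            lanes[lane] += data[i + lane];
        }
    }
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```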

Another big limitation of SSE is that the input data should contain a number of elements which is a multiple of either 2 or 4 (depending on the element type you are using - double or single precision); otherwise the leftover elements have to be processed separately.
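One common way to live with this limitation, sketched here under the assumption of `float` data and with a made-up function name, is to run the vector loop over as many full packs as possible and finish the remaining zero to three elements with a plain loop:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Sums an array of any length: full packs of four with SSE, the tail with a scalar loop.
float sum_sse_with_tail(const float* data, std::size_t size) {
    __m128 sum_vec = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        sum_vec = _mm_add_ps(sum_vec, _mm_loadu_ps(data + i));
    }

    // Combine the four partial sums.
    float lanes[4];
    _mm_storeu_ps(lanes, sum_vec);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    // Leftover elements that did not fill a whole pack.
    for (; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}
```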