To get enough processing throughput for Vibratium to render in real time on older, CPU-only laptops, I turned to the SSE and AVX instruction sets. SSELibrary makes SSE and AVX intrinsics simple to use from any .NET or C++ project.

The .NET class handles copying data between managed byte arrays and unmanaged buffers, as well as all of the unmanaged allocations. I saw performance gains of anywhere from 1.4x to 20x over already optimized C-style loops. (And I’m no slouch at optimizations.) Using the normal CPU registers just doesn’t compete.

The Zip file includes all of the code for the C++ dll, including a Visual Studio 2015 project, the x64 release .dll and .lib files (and dependent VC Runtime dll if needed for your system), and the .cs files for the managed class.

SSELibrary also implements CPU feature detection (via cpuid) for SSE3, SSSE3, SSE4A, SSE4.1, SSE4.2, AVX1, and AVX2. These tests are exposed as simple boolean functions.
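For readers who have not used cpuid directly, here is a minimal sketch of how this kind of detection can work. It is not the code from the zip file: the names `CpuFeatures`, `decode_features`, and `detect_features` are made up for illustration, and the sketch only reads the CPU's feature bits. A complete AVX check should also confirm that the operating system saves the wider registers (OSXSAVE plus XGETBV), which is left out here to keep it short.

```cpp
#include <cstdint>

// Pick a cpuid helper for the compiler at hand; on non-x86 targets we report nothing.
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #define SKETCH_HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  #include <cpuid.h>
  #define SKETCH_HAVE_CPUID 1
#else
  #define SKETCH_HAVE_CPUID 0
#endif

// Hypothetical result type: one boolean per instruction set.
struct CpuFeatures {
    bool sse3 = false, ssse3 = false, sse4a = false;
    bool sse41 = false, sse42 = false, avx = false, avx2 = false;
};

struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Run cpuid for one leaf/subleaf; returns zeros where cpuid is unavailable.
static CpuidRegs run_cpuid(std::uint32_t leaf, std::uint32_t subleaf) {
    CpuidRegs r{0, 0, 0, 0};
#if SKETCH_HAVE_CPUID
  #if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<std::uint32_t>(regs[0]);
    r.ebx = static_cast<std::uint32_t>(regs[1]);
    r.ecx = static_cast<std::uint32_t>(regs[2]);
    r.edx = static_cast<std::uint32_t>(regs[3]);
  #else
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
  #endif
#else
    (void)leaf; (void)subleaf;
#endif
    return r;
}

static bool bit(std::uint32_t value, unsigned n) { return ((value >> n) & 1u) != 0; }

// Pure decoding step: maps raw register values to feature flags.
CpuFeatures decode_features(std::uint32_t leaf1_ecx,
                            std::uint32_t leaf7_ebx,
                            std::uint32_t ext1_ecx) {
    CpuFeatures f;
    f.sse3  = bit(leaf1_ecx, 0);   // leaf 1, ECX bit 0
    f.ssse3 = bit(leaf1_ecx, 9);   // leaf 1, ECX bit 9
    f.sse41 = bit(leaf1_ecx, 19);  // leaf 1, ECX bit 19
    f.sse42 = bit(leaf1_ecx, 20);  // leaf 1, ECX bit 20
    f.avx   = bit(leaf1_ecx, 28);  // leaf 1, ECX bit 28 (CPU support only, no OS check)
    f.avx2  = bit(leaf7_ebx, 5);   // leaf 7 subleaf 0, EBX bit 5
    f.sse4a = bit(ext1_ecx, 6);    // leaf 0x80000001, ECX bit 6 (AMD)
    return f;
}

// Query the running CPU, checking that each leaf exists before reading it.
CpuFeatures detect_features() {
    std::uint32_t leaf1_ecx = 0, leaf7_ebx = 0, ext1_ecx = 0;
    const std::uint32_t max_basic = run_cpuid(0, 0).eax;
    if (max_basic >= 1) leaf1_ecx = run_cpuid(1, 0).ecx;
    if (max_basic >= 7) leaf7_ebx = run_cpuid(7, 0).ebx;
    const std::uint32_t max_ext = run_cpuid(0x80000000u, 0).eax;
    if (max_ext >= 0x80000001u) ext1_ecx = run_cpuid(0x80000001u, 0).ecx;
    return decode_features(leaf1_ecx, leaf7_ebx, ext1_ecx);
}
```

Calling `detect_features()` once and caching the result is enough, since the answer cannot change while the process runs. Keeping the bit decoding separate from the cpuid call also makes the decoding easy to test without the matching hardware.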

Currently only 8-bit operations are implemented, but I plan to add 16-bit operations in a future release.

Each operation has multiple functions, one per set of available intrinsics: CPP, SSE2/3, SSSE3, SSE41, SSE42, AVX1, AVX2. Each of these is adorned with a prefix indicating the flavor. Unadorned functions are dispatchers: on first call, a dispatcher determines the fastest flavor that your processor supports and uses that.
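To make the flavor/dispatcher idea concrete, here is a rough sketch of the pattern for one 8-bit operation, a saturating add. The function names (`cpp_AddSaturate8`, `sse2_AddSaturate8`, `avx2_AddSaturate8`, `AddSaturate8`) are invented for this example and are not the library's actual exports. The SIMD flavors here are only built with GCC or Clang on x86-64, where `__builtin_cpu_supports` and the `target` attribute are available; everywhere else the sketch falls back to the plain C++ flavor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && defined(__x86_64__)
  #include <immintrin.h>
  #define SKETCH_X86_SIMD 1
#else
  #define SKETCH_X86_SIMD 0
#endif

// "cpp_" flavor: plain loop, works everywhere, also used for leftover tail bytes.
void cpp_AddSaturate8(std::uint8_t* dst, const std::uint8_t* a,
                      const std::uint8_t* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned sum = unsigned(a[i]) + unsigned(b[i]);
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}

#if SKETCH_X86_SIMD
// "sse2_" flavor: 16 bytes per iteration (SSE2 is baseline on x86-64).
void sse2_AddSaturate8(std::uint8_t* dst, const std::uint8_t* a,
                       const std::uint8_t* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(va, vb));
    }
    cpp_AddSaturate8(dst + i, a + i, b + i, n - i);
}

// "avx2_" flavor: 32 bytes per iteration; the attribute lets this one function
// use AVX2 even when the rest of the file is compiled without -mavx2.
__attribute__((target("avx2")))
void avx2_AddSaturate8(std::uint8_t* dst, const std::uint8_t* a,
                       const std::uint8_t* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), _mm256_adds_epu8(va, vb));
    }
    cpp_AddSaturate8(dst + i, a + i, b + i, n - i);
}
#endif

using AddSaturate8Fn = void (*)(std::uint8_t*, const std::uint8_t*,
                                const std::uint8_t*, std::size_t);

// Runs once: pick the fastest flavor the current processor reports.
static AddSaturate8Fn choose_AddSaturate8() {
#if SKETCH_X86_SIMD
    if (__builtin_cpu_supports("avx2")) return avx2_AddSaturate8;
    if (__builtin_cpu_supports("sse2")) return sse2_AddSaturate8;
#endif
    return cpp_AddSaturate8;
}

// Unadorned dispatcher: the function-local static is initialized on the first
// call (thread-safe since C++11); later calls are just an indirect call.
void AddSaturate8(std::uint8_t* dst, const std::uint8_t* a,
                  const std::uint8_t* b, std::size_t n) {
    static const AddSaturate8Fn impl = choose_AddSaturate8();
    assert(n == 0 || (dst && a && b));  // debug-only "fail fast" check
    impl(dst, a, b, n);
}
```

Callers only ever use `AddSaturate8(dst, a, b, n)`; the flavored functions stay available for benchmarking or for forcing a specific code path. Each SIMD flavor hands its leftover bytes to the plain C++ flavor, so buffer lengths do not need to be a multiple of the register width.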

I don’t have a processor that supports AVX2 yet, so all of the AVX2 functions are implemented but untested.

There is very little in the way of parameter checking. I did not implement this defensively; the code is very straightforward, with the idea of “fail fast” during development rather than bloating every call with unnecessary code. (Checks should be easy to add in the SSELibrary.cs / Operation* managed functions. Next version.)

The code is released under the GPLv3. Any changes you make need to be shared, with credit listed.