Algorithmic Optimizations: How to Leverage SIMD

Recently, I was optimizing an algorithm in C++ for our WebAR engine, which needs to perform smoothly across a variety of devices. The goal was to ensure smooth AR experiences, ideally hitting 30 FPS even on low-end devices. When we talk about optimization, the first thing that usually comes to mind is improving time complexity. But there’s a lot more to optimization than just that. Let’s break down some common approaches:

  1. Algorithmic Optimization: The first thing we tend to look at is improving time complexity, like reducing O(n²) to O(n log n); see the sketch just after this list. There are also space-time trade-offs and considerations around data structure selection.
  2. Domain-Specific Optimization: Some optimizations come from the problem domain itself. In a geometric problem, for instance, you can exploit the object's structure: a 3D rectangular object doesn't need many triangles to represent its surface, so reducing the triangle count without sacrificing rendering quality can be an effective optimization.
  3. System-Level Optimization: This typically involves cache optimization, memory alignment, and loop unrolling. These days, compilers handle much of this, but there's always room for manual tweaking.
  4. Parallel Processing: Threads and async operations can speed things up by processing tasks concurrently. However, for WebAssembly (WASM), threading has its limitations.
  5. SIMD Vectorization: This is where register-level parallelism comes in, allowing the CPU to handle multiple operations at once. SIMD (Single Instruction, Multiple Data) can be incredibly powerful when leveraged correctly.
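
To make the first item concrete, here is a minimal sketch (the function names are mine) of the classic complexity trade: checking an array for duplicates by comparing every pair is O(n²), while sorting first and then scanning neighbors is O(n log n).

#include <algorithm>
#include <cstddef>
#include <vector>

// O(n^2): compare every pair of elements.
bool hasDuplicateQuadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); i++)
        for (std::size_t j = i + 1; j < v.size(); j++)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n log n): after sorting, any duplicates sit next to each other.
bool hasDuplicateSorted(std::vector<int> v) {  // takes a copy so the caller's data stays intact
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}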

In my case, I started with the usual approach that every engineer tries first—optimizing time and space complexity. Once that was sorted, I looked at any geometric optimizations I could make. But even after that, it still wasn’t enough. I was processing streams of pixels from a 2D image, extracting various features and calculating feature scores, then converting them into descriptors on the fly. In computer vision, descriptors are like fingerprints for specific points in an image: they help us identify and match these points across different frames. Creating these descriptors involves complex mathematical calculations on pixel neighborhoods, making it a computationally intensive operation. So, I thought, "Why not use threads?" Threads in WASM are relatively new, available through pthreads, but they come with their own limitations—especially when you’re building an SDK that needs to run on local servers or via CDN.

"The main issue with WASM threads is the requirement for Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) headers. These headers are necessary to enable threading but require control over web server configurations. This makes WASM threads incompatible with popular hosting services, and developers have to coordinate with IT teams to implement the required headers—an extra hassle. As we strive to make AR developers' lives easier, this is a point of concern."

Once I had exhausted the traditional methods, SIMD—Single Instruction, Multiple Data—came into play and made a huge difference. SIMD allows you to leverage the CPU’s registers to handle multiple operations simultaneously, which is perfect for performance-hungry tasks like mine. Let's dive deeper into how it works.

What is SIMD?

SIMD stands for Single Instruction, Multiple Data. It allows a single instruction to be applied to multiple pieces of data simultaneously, making it incredibly useful for repetitive computations, such as vector math, image processing, or AI workloads.

How SIMD Works with Registers

SIMD leverages the CPU’s registers to perform multiple operations at once. A register is a small, fast storage space the CPU uses during operations. In normal scalar code, each register holds a single value; think of it as processing one item at a time. With SIMD, a single register can hold multiple values at once. For example, a 128-bit register can hold four 32-bit floats (or integers) or eight 16-bit values. The processor then executes the same operation on all the values in the register simultaneously, providing a substantial performance boost, particularly for tasks involving large datasets or repetitive operations like matrix multiplications or pixel transformations.
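
To make the register picture concrete, here is a minimal NEON sketch (assuming an ARM machine with the arm_neon.h header; the same idea applies on x86 and WASM). The same 128-bit register width can be treated as four 32-bit float lanes or eight 16-bit integer lanes, and one instruction operates on every lane at once.

#include <arm_neon.h>
#include <cstdio>

int main() {
    // One 128-bit register viewed as four 32-bit float lanes.
    float32x4_t f = vdupq_n_f32(1.5f);   // {1.5, 1.5, 1.5, 1.5}
    f = vaddq_f32(f, f);                 // one instruction adds all four lanes

    // The same 128-bit width viewed as eight 16-bit integer lanes.
    uint16x8_t u = vdupq_n_u16(7);       // {7, 7, 7, 7, 7, 7, 7, 7}
    u = vaddq_u16(u, u);                 // one instruction adds all eight lanes

    std::printf("float lane 0: %f\n", vgetq_lane_f32(f, 0));           // 3.000000
    std::printf("u16 lane 0: %u\n", (unsigned)vgetq_lane_u16(u, 0));   // 14
    return 0;
}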

How to Write SIMD Code

Writing SIMD code requires understanding the target architecture—whether it's ARM, x86, or WASM. Each platform has its own set of intrinsics, which are functions that let you directly interact with SIMD instructions at the hardware level. Here are some examples:

  • ARM: NEON intrinsics
  • x86: SSE, AVX intrinsics
  • WASM: Portable 128-bit SIMD intrinsics (wasm_simd128.h in Emscripten), which the runtime lowers to native SIMD instructions on both ARM and x86 chips

These intrinsics allow you to instruct the CPU to perform the same operation across multiple data points simultaneously.
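
Since this post is ultimately about WASM, here is a minimal sketch of the WASM flavor using the wasm_simd128.h intrinsics that ship with Emscripten (it assumes you compile with emcc and the -msimd128 flag; the function name is mine):

#include <wasm_simd128.h>

// Adds four floats at a time using WASM's portable 128-bit SIMD.
void addFourFloats(const float* a, const float* b, float* out) {
    v128_t va = wasm_v128_load(a);         // load 128 bits: four 32-bit floats
    v128_t vb = wasm_v128_load(b);
    v128_t sum = wasm_f32x4_add(va, vb);   // one add across all four lanes
    wasm_v128_store(out, sum);             // store 128 bits back to memory
}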

SIMD in Action (C++ Example)

Here’s an example of SIMD in action. First, I’ll show a scalar version of a simple vector addition, and then the SIMD-optimized version using ARM NEON intrinsics:

#include <arm_neon.h>
#include <iostream>
#include <chrono>

// Regular scalar addition
void addArraysScalar(const float* arr1, const float* arr2, float* result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

// NEON SIMD addition - processes 4 floats at once
void addArraysNEON(const float* arr1, const float* arr2, float* result, int size) {
    int i;
    for (i = 0; i <= size - 4; i += 4) {
        float32x4_t v1 = vld1q_f32(arr1 + i);
        float32x4_t v2 = vld1q_f32(arr2 + i);
        float32x4_t sum = vaddq_f32(v1, v2);
        vst1q_f32(result + i, sum);
    }
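    // Handle any leftover elements with plain scalar code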
    for (; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

int main() {
    const int SIZE = 1024;  
    float* array1 = new float[SIZE];
    float* array2 = new float[SIZE];
    float* result1 = new float[SIZE];  
    float* result2 = new float[SIZE];  

    for (int i = 0; i < SIZE; i++) {
        array1[i] = i * 1.1f;
        array2[i] = i * 2.2f;
    }

    auto start = std::chrono::high_resolution_clock::now();
    addArraysScalar(array1, array2, result1, SIZE);
    auto end = std::chrono::high_resolution_clock::now();
    auto scalarTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    addArraysNEON(array1, array2, result2, SIZE);
    end = std::chrono::high_resolution_clock::now();
    auto simdTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Scalar time: " << scalarTime.count() << " microseconds\n";
    std::cout << "SIMD time: " << simdTime.count() << " microseconds\n";
    std::cout << "Speedup: " << float(scalarTime.count())/simdTime.count() << "x\n";
    
    delete[] array1;
    delete[] array2;
    delete[] result1;
    delete[] result2;
    
    return 0;
}
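
If you want to try this yourself, a build along these lines should work on an ARM machine (assuming the file is saved as simd_add.cpp; exact timings will vary with hardware and flags):

clang++ -O2 -std=c++17 simd_add.cpp -o simd_add && ./simd_add

One caveat: at -O2 and above, modern compilers often auto-vectorize the scalar loop as well, which can shrink or even erase the measured gap.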

When I ran this on my system (M2 Pro chip, 16 GB), compiled with Apple clang (clang-1600.0.26.3), SIMD delivered a speedup of about 2x over the scalar code. But take that with a pinch of salt: SIMD code is not always faster. Let's look at when you should not vectorize.

When Not to Use SIMD

While SIMD can be powerful, it's not always the right choice. Here are a few cases where SIMD may not provide the expected benefits:

  • Small Data Sets: SIMD has setup overhead. For small arrays (typically less than 1000 elements), the overhead might exceed the performance benefits. In these cases, regular scalar code might be faster.
  • Non-Sequential Memory Access: If your data access pattern is random or scattered, such as processing sparse arrays, SIMD may not be helpful.
  • Complex Conditional Logic: SIMD thrives on repetitive operations. If your code involves complex conditional logic (e.g., many if-else conditions), it might not be vectorized efficiently.
  • Already Optimized Code: If the compiler's auto-vectorization is already doing a good job, or if the code is I/O-bound rather than CPU-bound, manual SIMD adds complexity for little gain.
  • Platform Portability Concerns: When you can't guarantee SIMD support on your target platforms because the code needs to run on many different architectures. A common mitigation is compile-time dispatch with a scalar fallback, as sketched just after this list.
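
That dispatch pattern keys off the macros each toolchain predefines and falls back to scalar code everywhere else. Here is a minimal sketch (assuming __ARM_NEON is set by ARM compilers and __wasm_simd128__ by emcc when -msimd128 is passed):

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

// One entry point; the preprocessor picks the widest path the target supports.
void addArrays(const float* a, const float* b, float* out, int n) {
    int i = 0;
#if defined(__ARM_NEON)
    for (; i <= n - 4; i += 4)
        vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#elif defined(__wasm_simd128__)
    for (; i <= n - 4; i += 4)
        wasm_v128_store(out + i, wasm_f32x4_add(wasm_v128_load(a + i), wasm_v128_load(b + i)));
#endif
    // Scalar tail; on targets without SIMD support this is the whole loop.
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}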

Always profile your code before and after applying SIMD to ensure you're getting the expected performance boost.

Conclusion

In my scenario, switching to SIMD for vectorizing operations led to significant performance gains. Without getting into too many details: when I ran the scalar code compiled with Emscripten (emcc) locally on an M2 Pro chip (16 GB), it took around 7 ms. After vectorization, the time dropped to just 1 ms in my profiler—a massive improvement that helped push the FPS to 30 even on low-end Android devices. SIMD won’t solve every problem, but when used in the right context, it can be a game-changer.

-- written by Amandeep Singh