[Documentation] [TitleIndex] [WordIndex

NOTE: All code is in my pcl_simd repo on kforge.

Representing point data

In PCL currently, points are stored with their fields interleaved. For the simplest PointXYZ type, this looks like:

XYZ_XYZ_XYZ_XYZ_ ...

where _ denotes an extra padding float so that each point is 16-byte aligned. Operating on XYZ_ data efficiently often requires the use of horizontal SSE instructions, which perform computations using multiple elements of the same SSE register.

Instead a vertical representation, aka Structure-of-Arrays (SoA), can be used:

XXXXXXXX ...
YYYYYYYY ...
ZZZZZZZZ ...

This layout fits traditional vertical SSE processing better. Most arithmetic SSE instructions are binary operations on corresponding elements of two SSE registers.

Why does PCL use AoS?

PCL's use of AoS, normally non-optimal, does have its logic. In PCL, frequently we wish to process only some (indexed / valid) subset of a point cloud. Besides dense processing of all points, we then have two other cases.

Indexed subsets

PCL operators routinely provide a setIndices() method, ordering them to use only certain points identified by index. With the AoS representation, each individual point can be used in an SSE register via a simple aligned load. Indexed access therefore does not much complicate an SSE-optimized implementation.

Vertical SSE (in the dense case) processes four adjacent points simultaneously, and indexed access breaks the adjacency requirement. Instead of an aligned load, the implementation must "gather" the data for the next four indexed points (spread out in memory).

Organized point clouds

PCL permits point clouds with missing data. For imager-based 3D sensors, this allows point clouds to retain their 2D structure, making it trivial to identify nearby points. Invalid points have each field set to NaN, so that it is clear when invalid data is accidentally used in a computation.

Handling invalid points in PCL (with AoS) is again rather simple. For each point, check if X is NaN; if so, ignore it.

The SoA situation is much more complicated. Since we operate on four points at a time, we have to check if any of the four points are invalid. If so, it becomes very tricky to use SSE at all without destroying our result. Masking tricks are possible, but imply some overhead over the simple dense code.

Horizontal or vertical?

Both representations have pros and cons:

Horizontal

Vertical

Data structures

For benchmarking, we define two very simple point cloud representations:

   1 // Array-of-Structures
   2 struct AOS
   3 {
   4   float x;
   5   float y;
   6   float z;
   7   float h;
   8 };
   9 
  10 // Structure-of-Arrays
  11 struct SOA
  12 {
  13   float* x;
  14   float* y;
  15   float* z;
  16   size_t size;
  17 };

Computations considered

We benchmark two basic operations:

For both operations, we implemented several versions covering the space of:

Representative examples are listed below.

Dot product

Vertical (SoA), SSE2-optimized:

   1 void dotSSE2 (const SOA& vectors, const AOS& vector,
   2               float* result, unsigned long size)
   3 {
   4   float x = vector.x, y = vector.y, z = vector.z;
   5 
   6   // Broadcast X, Y, Z of constant vector into 3 SSE registers
   7   __m128 vX  = _mm_set_ps1(x);
   8   __m128 vY  = _mm_set_ps1(y);
   9   __m128 vZ  = _mm_set_ps1(z);
  10   __m128 X, Y, Z;
  11 
  12   unsigned i = 0;
  13   for ( ; i < size - 3; i += 4)
  14   {
  15     // Load data for next 4 points
  16     X = _mm_load_ps (vectors.x + i);
  17     Y = _mm_load_ps (vectors.y + i);
  18     Z = _mm_load_ps (vectors.z + i);
  19     
  20     // Compute X*X'+Y*Y'+Z*Z' for each point
  21     X = _mm_mul_ps (X, vX);
  22     Y = _mm_mul_ps (Y, vY);
  23     Z = _mm_mul_ps (Z, vZ);
  24     X = _mm_add_ps (X, Y);
  25     X = _mm_add_ps (X, Z);
  26 
  27     // Store results
  28     _mm_store_ps(result + i, X);
  29   }
  30 
  31   // Handle any leftovers at the end
  32   for ( ; i < size; ++i)
  33   {
  34     result[i] = vectors.x[i]*x + vectors.y[i]*y + vectors.z[i]*z;
  35   }
  36 }

Horizontal (AoS), SSE4.1-optimized (with horizontal DPPS instruction):

   1 void dotSSE4_1 (const AOS* vectors, const AOS& vector,
   2                 float* result, unsigned long size)
   3 {
   4   // Load constant vector into an SSE register
   5   __m128 vec = _mm_load_ps ((const float*) &vector);
   6   __m128 XYZH;
   7   
   8   // Set mask to ignore the padding elements
   9   const int mask = 123;
  10   for (unsigned i = 0; i < size; ++i)
  11   {
  12     // Load next point
  13     XYZH = _mm_load_ps ((const float*)(vectors + i));
  14 
  15     // Dot product from SSE4.1
  16     XYZH = _mm_dp_ps (XYZH, vec, mask);
  17 
  18     // Store single result (the bottom register element)
  19     _mm_store_ss (&(result [i]), XYZH);
  20   }
  21 }

Centroid

Vertical (SoA), SSE2-optimized:

   1 void centroidSSE2 (const SOA& vectors, AOS& result, size_t size)
   2 {
   3   __m128 X_sum = _mm_setzero_ps();
   4   __m128 Y_sum = _mm_setzero_ps();
   5   __m128 Z_sum = _mm_setzero_ps();
   6   __m128 X, Y, Z;
   7 
   8   size_t i = 0;
   9   for ( ; i < size - 3; i += 4)
  10   {
  11     // Load next 4 points
  12     X = _mm_load_ps (vectors.x + i);
  13     Y = _mm_load_ps (vectors.y + i);
  14     Z = _mm_load_ps (vectors.z + i);
  15 
  16     // Accumulate 4 sums in each dimension
  17     X_sum = _mm_add_ps(X_sum, X);
  18     Y_sum = _mm_add_ps(Y_sum, Y);
  19     Z_sum = _mm_add_ps(Z_sum, Z);
  20   }
  21 
  22   // Horizontal adds (HADD from SSE3 could help slightly)
  23   float* pX = reinterpret_cast<float*>(&X_sum);
  24   float* pY = reinterpret_cast<float*>(&Y_sum);
  25   float* pZ = reinterpret_cast<float*>(&Z_sum);
  26   result.x = pX[0] + pX[1] + pX[2] + pX[3];
  27   result.y = pY[0] + pY[1] + pY[2] + pY[3];
  28   result.z = pZ[0] + pZ[1] + pZ[2] + pZ[3];
  29   
  30   // Leftover points
  31   for ( ; i < size; ++i)
  32   {
  33     result.x += vectors.x[i];
  34     result.y += vectors.y[i];
  35     result.z += vectors.z[i];
  36   }
  37 
  38   // Average
  39   float inv_size = 1.0f / size;
  40   result.x *= inv_size;
  41   result.y *= inv_size;
  42   result.z *= inv_size;
  43 }

Horizontal (AoS), SSE2-optimized:

   1 void centroidSSE2 (const AOS* vectors, AOS& result, size_t size)
   2 {
   3   __m128 sum = _mm_setzero_ps();
   4 
   5   for (unsigned i = 0; i < size; ++i)
   6   {
   7     __m128 XYZH = _mm_load_ps ((const float*)(vectors + i));
   8     sum = _mm_add_ps(sum, XYZH);
   9   }
  10   _mm_store_ps((float*)&result, sum);
  11 
  12   float inv_size = 1.0f / size;
  13   result.x *= inv_size;
  14   result.y *= inv_size;
  15   result.z *= inv_size;
  16 }

Indexed

When using point indices, the vertical implementation can no longer use aligned loads. Instead it's best to use the _mm_set_ps intrinsic to gather the next four points.

Vertical (SoA) dot product, SSE2-optimized:

   1 void dotIndexedSSE2 (const SOA& vectors, const AOS& vector, const int* indices, float* result, unsigned long size)
   2 {
   3   float x = vector.x, y = vector.y, z = vector.z;
   4 
   5   __m128 vX  = _mm_set_ps1(x);
   6   __m128 vY  = _mm_set_ps1(y);
   7   __m128 vZ  = _mm_set_ps1(z);
   8   __m128 X, Y, Z;
   9   
  10   unsigned i = 0;
  11   for ( ; i < size - 3; i += 4)
  12   {
  13     int i0 = indices[i + 0];
  14     int i1 = indices[i + 1];
  15     int i2 = indices[i + 2];
  16     int i3 = indices[i + 3];
  17 
  18     // Gather next four indexed points
  19     X = _mm_set_ps(vectors.x[i3], vectors.x[i2], vectors.x[i1], vectors.x[i0]);
  20     Y = _mm_set_ps(vectors.y[i3], vectors.y[i2], vectors.y[i1], vectors.y[i0]);
  21     Z = _mm_set_ps(vectors.z[i3], vectors.z[i2], vectors.z[i1], vectors.z[i0]);
  22     
  23     // Computation
  24     X = _mm_mul_ps (X, vX);
  25     Y = _mm_mul_ps (Y, vY);
  26     Z = _mm_mul_ps (Z, vZ);
  27     X = _mm_add_ps (X, Y);
  28     X = _mm_add_ps (X, Z);
  29 
  30     // Store result
  31     _mm_store_ps(result + i, X);
  32   }
  33 
  34   for ( ; i < size; ++i)
  35   {
  36     int idx = indices[i];
  37     result[i] = vectors.x[idx]*x + vectors.y[idx]*x + vectors.z[idx]*z;
  38   }
  39 }

Benchmarks (random data)

The test point cloud is randomly generated, 640x480, dense. Each operation is repeated 1000 times.

For indexed tests, the indices list every 4th point. More random index patterns would change execution time by affecting caching and prefetching, but I'd expect such effects to be similar for horizontal and vertical code.

"Scalar" code uses no vector instructions, otherwise the instruction set is listed. A trailing u# means the code was unrolled by factor #.

Dot product

Dense

Horizontal (AOS)
  Scalar:   0.621674 seconds 
  SSE2:     0.756300 seconds 
  SSE4.1:   0.532441 seconds 
  SSE4.1u4: 0.476841 seconds 
Vertical (SOA)
  Scalar:   0.519625 seconds 
  SSE2:     0.215499 seconds

The vertical SSE2 code is the clear winner, more than twice as fast as horizontal code even with the special horizontal dot product from SSE4.1.

On the i7 at the ski house, horizontal SSE4.1 was actually the slowest implementation. Unrolling it x4 helped significantly, although it was still much worse than vertical SSE2. I attributed this to the very high latency of the DPPS instruction; it takes 11 cycles before the result can be stored. Unrolling helps hide the latency by providing more computation to do during that time. I don't know why the results from my WG i7 (shown above) are so different.

Indexed

Horizontal (AOS)
  Scalar:   0.271768 seconds
  SSE2:     0.276114 seconds
  SSE4.1:   0.259613 seconds
Vertical (SOA)
  Scalar:   0.193394 seconds
  SSE2:     0.177262 seconds

SSE optimization actually gives meager benefits in both the horizontal and vertical cases. However vertical SSE2 is still the winner.

Centroid

The story for centroid is similar; vertical SSE2 is fastest, significantly so for dense data.

Dense

Horizontal (AOS)
  Scalar:  0.628597 seconds 
  SSE2:    0.326645 seconds 
  SSE2u2:  0.247539 seconds 
  SSE2u4:  0.236474 seconds 
Vertical (SOA)
  Scalar:  0.711040 seconds 
  SSE2:    0.149806 seconds 

Indexed

Horizontal (AOS)
  Scalar:  0.256237 seconds 
  SSE2:    0.195724 seconds 
Vertical (SOA)
  Scalar:  0.194030 seconds 
  SSE2:    0.166639 seconds 

Vertical SSE for organized point clouds

We still need a way to effectively use vertical SSE for organized point clouds (containing NaNs). A promising approach is to compute a run-length encoding (RLE) of the valid points as a preprocessing step. The data structure is very simple:

   1 struct RlePair
   2 {
   3   size_t good;
   4   size_t skip;
   5 };
   6 typedef std::vector<RlePair> RLE;

The RLE counts the length of alternating runs of valid and invalid points. Once computed, it allows us to process only valid points without explicitly checking each one for NaNs. In fact, operations become O(#valid points) instead of O(#total points), which can itself be a win if many points are invalid.

In real scenes, valid points are clustered together (into objects), so valid (and invalid) runs should be lengthy on average. A long run of valid points can be split into <4 beginning points, <4 final points, and a run of aligned, valid point data which can be safely processed with vertical SSE.

Abstracting point iteration

We are still left with three distinct cases for processing point clouds, requiring different methods of iterating over point data:

Writing and maintaining three copies of each PCL algorithm is a huge burden. The RLE for organized data in particular imposes a relatively complicated iteration method. Ideally we should be able to write the computational core of an algorithm only once, and have it work efficiently in each of the three cases.

Currently PCL does not meet this goal. In fact, core algorithms tend to have four near-identical implementations:

I think it's unnecessary to distinguish between "dense indexed" and "organized indexed", if we require that indices point to valid data.

Writing algorithms as computational kernels

As an experiment, I rewrote the vertical centroid as a "kernel" class. This implements only the computation, without worrying about the memory layout of the whole cloud:

   1 struct CentroidKernel
   2 {
   3   // State
   4   float x_sum, y_sum, z_sum;
   5   __m128 X_sum, Y_sum, Z_sum;
   6   size_t count;
   7   AOS result;
   8   
   9   void init()
  10   {
  11     // Initialization
  12     x_sum = y_sum = z_sum = 0.0f;
  13     X_sum = _mm_setzero_ps();
  14     Y_sum = _mm_setzero_ps();
  15     Z_sum = _mm_setzero_ps();
  16     count = 0;
  17   }
  18 
  19   // Scalar operator
  20   inline void operator() (float x, float y, float z)
  21   {
  22     x_sum += x;
  23     y_sum += y;
  24     z_sum += z;
  25     ++count;
  26   }
  27 
  28   // SIMD operator
  29   inline void operator() (__m128 X, __m128 Y, __m128 Z)
  30   {
  31     X_sum = _mm_add_ps(X_sum, X);
  32     Y_sum = _mm_add_ps(Y_sum, Y);
  33     Z_sum = _mm_add_ps(Z_sum, Z);
  34     count += 4;
  35   }
  36 
  37   void reduce()
  38   {
  39     float* pX = reinterpret_cast<float*>(&X_sum);
  40     float* pY = reinterpret_cast<float*>(&Y_sum);
  41     float* pZ = reinterpret_cast<float*>(&Z_sum);
  42     result.x = pX[0] + pX[1] + pX[2] + pX[3] + x_sum;
  43     result.y = pY[0] + pY[1] + pY[2] + pY[3] + y_sum;
  44     result.z = pZ[0] + pZ[1] + pZ[2] + pZ[3] + z_sum;
  45 
  46     float inv_count = 1.0f / count;
  47     result.x *= inv_count;
  48     result.y *= inv_count;
  49     result.z *= inv_count;
  50   }
  51 };

Kernel applicators

We can then define "applicator" functions that apply a kernel to a particular case of point cloud. The dense version simply uses aligned loads:

   1 template <typename Kernel>
   2 void applyDense (Kernel& kernel, const SOA& pts)
   3 {
   4   kernel.init();
   5   
   6   size_t i = 0;
   7   for ( ; i < pts.size - 3; i += 4)
   8   {
   9     __m128 X = _mm_load_ps (pts.x + i);
  10     __m128 Y = _mm_load_ps (pts.y + i);
  11     __m128 Z = _mm_load_ps (pts.z + i);
  12 
  13     kernel(X, Y, Z);
  14   }
  15   for ( ; i < pts.size; ++i)
  16   {
  17     kernel(pts.x[i], pts.y[i], pts.z[i]);
  18   }
  19   
  20   kernel.reduce();
  21 }

The indexed version performs the necessary data gathering:

   1 template <typename Kernel>
   2 void applySparse (Kernel& kernel, const SOA& pts, const std::vector<int>& indices)
   3 {
   4   kernel.init();
   5 
   6   size_t i = 0;
   7   for ( ; i < indices.size() - 3; i += 4)
   8   {
   9     int i0 = indices[i + 0];
  10     int i1 = indices[i + 1];
  11     int i2 = indices[i + 2];
  12     int i3 = indices[i + 3];
  13 
  14     // Gather next four indexed points
  15     __m128 X = _mm_set_ps(pts.x[i3], pts.x[i2], pts.x[i1], pts.x[i0]);
  16     __m128 Y = _mm_set_ps(pts.y[i3], pts.y[i2], pts.y[i1], pts.y[i0]);
  17     __m128 Z = _mm_set_ps(pts.z[i3], pts.z[i2], pts.z[i1], pts.z[i0]);
  18     
  19     kernel(X, Y, Z);
  20   }
  21   for ( ; i < indices.size(); ++i)
  22   {
  23     int idx = indices[i];
  24     kernel(pts.x[idx], pts.y[idx], pts.z[idx]);
  25   }
  26 
  27   kernel.reduce();
  28 }

The organized version is most complicated, and uses the RLE to vectorize as much of the computation as possible:

   1 template <typename Kernel>
   2 void applyOrganized (Kernel& kernel, const SOA& pts, const RLE& rle)
   3 {
   4   kernel.init();
   5 
   6   size_t i = 0;
   7   for (RLE::const_iterator rle_it = rle.begin(); rle_it != rle.end(); ++rle_it)
   8   {
   9     // Process current stretch of good pixels
  10     size_t good = rle_it->good;
  11     size_t skip = rle_it->skip;
  12     size_t good_end = i + good;
  13 
  14     // Any unaligned points at start
  15     size_t unaligned_end = std::min( (i + 3) & ~3, good_end );
  16     for ( ; i < unaligned_end; ++i)
  17       kernel(pts.x[i], pts.y[i], pts.z[i]);
  18     // Aligned SIMD point data
  19     for ( ; i + 4 <= good_end; i += 4)
  20     {
  21       __m128 X = _mm_load_ps (pts.x + i);
  22       __m128 Y = _mm_load_ps (pts.y + i);
  23       __m128 Z = _mm_load_ps (pts.z + i);
  24 
  25       kernel(X, Y, Z);
  26     }
  27     // <4 remaining points
  28     for ( ; i < good_end; ++i)
  29       kernel(pts.x[i], pts.y[i], pts.z[i]);
  30 
  31     // Skip the following stretch of NaNs
  32     i += skip;
  33   }
  34 
  35   kernel.reduce();
  36 }

The kernel + applicator combinations for the dense and indexed cases were added to the centroid benchmark for random point data, and show identical performance to the hand-written vertical SSE2 code.

The above code is written with simplicity in mind. The biggest improvement would be to combine the scalar and SSE operator() (...) functions; this could possibly be achieved by using Eigen::Array as an SSE backend (similar to how Eigen::Matrix maps are currently used), something like:

   1 // N can be 1 or 4
   2 template <int N>
   3 void operator() (const Eigen::Array<float, N, 1>& x,
   4                  const Eigen::Array<float, N, 1>& y,
   5                  const Eigen::Array<float, N, 1>& z);

Benchmarks (real point clouds)

Finally, we compare CentroidKernel + applicator to pcl::compute3DCentroid() for several real organized (and one dense) point clouds.

The point clouds used are:

capture0001.pcd (organized, 640x480, 57553 NaNs)

PCL:    0.926901 seconds

RLE:    0.348173 seconds
Kernel: 0.174194 seconds

capture0002.pcd (organized, 640x480, 57269 NaNs)

PCL:    0.931111 seconds

RLE:    0.345437 seconds
Kernel: 0.171373 seconds

Even if you include the RLE computation time (which could be amortized over several operations, and perhaps optimized) in the total, the vertical kernel beats the current PCL implementation. Discounting RLE, it's more than 5x faster.

table_scene_mug_stereo_textured.pcd (organized, 640x480, 97920 NaNs)

PCL:    3.36001 seconds

RLE:    0.379737 seconds
Kernel: 0.183159 seconds

The very poor performance of PCL on the mug scene is a mystery to me. Perhaps the larger number of NaNs has an effect?

table_scene_lms400.pcd (dense, 460400 pts)

PCL:    0.678805 seconds

RLE:    N/A
Kernel: 0.242546 seconds

Conclusions

For the simple operations considered here, vertical SSE is a huge win. In the best case, this suggests that much of PCL could get at least a 3x speedup by switching to the more SSE-friendly memory layout.

Vertical SSE presents some complications in usage and implementation for PCL, but good solutions (RLE, kernel abstraction) are possible.

Looking at instruction sets, vertical SSE is especially advantageous both on older and very new processors. On older processors, because it makes excellent use of SSE2 instructions, whereas horizontal SSE may require horizontal instructions (introduced in SSE3 and later) for good performance. On new processors, because the latest AVX extensions expand SSE register to 256 bits, allowing 8 floating point operations at a time instead of 4. The vertical SSE techniques shown here trivially extend to AVX, and future instruction sets will likely expand SSE registers even further. The upcoming AVX2 extensions add dedicated "gather" instructions, which should improve performance with indices.

Remaining questions

Are there PCL algorithms that aren't easily implementable in the proposed kernel style?

How to handle nearest neighbor searches? These may be hard to vectorize.


2024-12-07 13:03