Improve speed of transformPointCloud/WithNormals() functions #2247

Merged

taketwo
Member

@taketwo taketwo commented Mar 7, 2018

tl;dr;

This adds an implementation of the point cloud transform functions using SSE2 intrinsics. Depending on the transform scalar precision, compiler flags, compiler version, point type, and point cloud properties, this can make transforms of VGA-sized point clouds up to 35% faster.

The old, non-optimized implementation is retained as a fallback for machines without SSE2 instructions.

Long version

Recently I was wondering whether I could improve the performance of the transformPointCloud() family of functions. This is how they transform every point internally:

//cloud_out.points[i].getVector3fMap () = transform * cloud_in.points[i].getVector3fMap ();
Eigen::Matrix<Scalar, 3, 1> pt (cloud_in[i].x, cloud_in[i].y, cloud_in[i].z);
cloud_out[i].x = static_cast<float> (transform (0, 0) * pt.coeffRef (0) + transform (0, 1) * pt.coeffRef (1) + transform (0, 2) * pt.coeffRef (2) + transform (0, 3));
cloud_out[i].y = static_cast<float> (transform (1, 0) * pt.coeffRef (0) + transform (1, 1) * pt.coeffRef (1) + transform (1, 2) * pt.coeffRef (2) + transform (1, 3));
cloud_out[i].z = static_cast<float> (transform (2, 0) * pt.coeffRef (0) + transform (2, 1) * pt.coeffRef (1) + transform (2, 2) * pt.coeffRef (2) + transform (2, 3));

This manually spelled out vector-matrix product looks strange, especially since Eigen should properly align points and provide vectorized matrix operations. In fact, this already seemed strange to me nearly four years ago. So I asked why and the answer was:

You wouldn't believe it, but we had a look a long time ago at the generated assembly code and it was faster. Things might have change since then.

I decided to check for myself. My development environment is Ubuntu 16.04 with Eigen 3.3 and GCC 5.4. When I switched to Eigen's vector-matrix multiplication operator, I indeed observed a 2x slowdown. Same with GCC 6.3. However, GCC 7.2 gave a nearly 2x speedup! Clearly, older compilers were not vectorizing properly.

Looking at the disassembly, I found that indeed they are not using vectorized SSE2 instructions. Funnily enough, when I compiled with GCC 5.4 and the -msse2 flag (instead of -march=native), I got vectorized code. It turns out that in native mode, in addition to the SSE2 extensions, FMA (fused multiply-add) instructions become enabled and the compiler starts to abuse them.
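One way to see which extensions a given set of flags switches on is to look at the compiler's predefined macros. The sketch below assumes GCC or Clang, which define `__SSE2__`, `__AVX__`, and `__FMA__` when the corresponding extensions are enabled; the helper name `enabled_simd_extensions` is hypothetical and not part of PCL or the benchmark repository.

```cpp
#include <string>

// Hypothetical helper: lists the SIMD extensions the compiler was allowed to
// use for this translation unit, based on GCC/Clang predefined macros.
inline std::string enabled_simd_extensions ()
{
  std::string result;
#ifdef __SSE2__
  result += "sse2 ";
#endif
#ifdef __AVX__
  result += "avx ";
#endif
#ifdef __FMA__
  result += "fma ";
#endif
  if (result.empty ())
    return "none";
  return result;
}
```

Building this once with -msse2 and once with -march=native and comparing the two strings should show whether FMA became enabled on a given machine.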

So simply switching to the vector-matrix product is not an option, because depending on the compiler version and compiler flags it introduces either speedups or slowdowns. An alternative was to implement everything directly with SSE2 intrinsics, so that the compiler cannot screw up and is guaranteed to emit optimal SSE2 assembly.
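To give a rough idea of the intrinsics approach, here is a minimal single-precision sketch. It is not the code from this PR: the function name `transform_point` and the raw-pointer interface are hypothetical, and it assumes a column-major 4x4 matrix (Eigen's default storage order) and points padded to four floats. The scalar branch stands in for the fallback on machines without SSE2.

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical sketch, not the PR's implementation.
// tf:  4x4 transform, column-major, 16 floats.
// src: input point padded to 4 floats (x, y, z, w).
// tgt: output point, 4 floats.
inline void transform_point (const float* tf, const float* src, float* tgt)
{
#if defined(__SSE2__)
  // Each register holds one column of the matrix.
  const __m128 c0 = _mm_loadu_ps (tf + 0);
  const __m128 c1 = _mm_loadu_ps (tf + 4);
  const __m128 c2 = _mm_loadu_ps (tf + 8);
  const __m128 c3 = _mm_loadu_ps (tf + 12);
  // result = c0 * x + c1 * y + c2 * z + c3, four rows at a time.
  __m128 r = _mm_mul_ps (c0, _mm_set1_ps (src[0]));
  r = _mm_add_ps (r, _mm_mul_ps (c1, _mm_set1_ps (src[1])));
  r = _mm_add_ps (r, _mm_mul_ps (c2, _mm_set1_ps (src[2])));
  r = _mm_add_ps (r, c3);
  _mm_storeu_ps (tgt, r);
#else
  // Scalar fallback; coordinates are copied first so src and tgt may alias.
  const float x = src[0], y = src[1], z = src[2];
  for (int row = 0; row < 4; ++row)
    tgt[row] = tf[row] * x + tf[4 + row] * y + tf[8 + row] * z + tf[12 + row];
#endif
}
```

In a real loop over a cloud, the four column registers would be loaded once outside the loop rather than per point.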

Since PCL does not have a built-in benchmarking framework, I have a separate repository with the proposed implementation and benchmarks. Here are my results with VGA-sized point clouds on an i7 from 2015:

[Image: benchmark results table]

For every cell two tests are run: the baseline PCL transform and the proposed transform. Each cell reports the runtime (in microseconds) of the proposed transform and its fraction of the baseline time. Thus fractions less than 1.0 represent improved performance.

I am too lazy to provide a complete discussion of the results, so I'll let anyone interested study this table or even run the tests themselves. Just a couple of points:

  1. The "-no-sse2" people are obviously not affected by this change;
  2. Dense XYZ point clouds are transformed 24% to 35% faster;
  3. For double-precision transform matrices there is a performance gain only with old GCC. There are a couple of red cells also, but I tend to think these are measurement noise;
  4. Transformation of XYZRGBNormal point clouds is nearly 3 times slower than XYZ clouds, although it involves only double the number of operations. I conclude that memory access is the real bottleneck here.
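The memory argument in the last point can be made concrete with a back-of-the-envelope sketch. The two structs below are hypothetical stand-ins, under the assumption that PCL's XYZ point occupies 16 bytes and its XYZRGBNormal point 48 bytes (SSE padding included); they are not the actual PCL definitions.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the memory layout of the two point types
// (assumption: 16 bytes for XYZ, 48 bytes for XYZRGBNormal).
struct alignas (16) XyzLike
{
  float data[4];    // x, y, z, padding
};

struct alignas (16) XyzRgbNormalLike
{
  float data[4];    // x, y, z, padding
  float normal[4];  // nx, ny, nz, padding
  float rgb;        // packed color
  float curvature;
  float pad[2];
};

static_assert (sizeof (XyzLike) == 16, "unexpected XYZ size");
static_assert (sizeof (XyzRgbNormalLike) == 48, "unexpected XYZRGBNormal size");

constexpr std::size_t kVgaPoints = 640 * 480;

// Bytes occupied by one VGA-sized cloud of the given point type.
template <typename PointT>
constexpr std::size_t cloud_bytes ()
{
  return kVgaPoints * sizeof (PointT);
}
```

Under these assumptions a VGA cloud is about 4.9 MB for XYZ and about 14.7 MB for XYZRGBNormal. That is three times the data for twice the arithmetic, which is in line with the roughly 3x runtime.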

Closes #1255.

One last note: double-precision transforms can be made faster with AVX instructions. But I leave this as an exercise for the future generations. (This PR now contains AVX optimizations as well.)

}
}

void so3 (const float* src, float* tgt) const
Member

@SergioRAgostinho SergioRAgostinho Mar 7, 2018

This feels misleading. You're taking single precision float pointers at interface level and then treating them as double in the implementation. Why not simply define them double and make equivalent use of _mm_load_pd1? According to these guys, they're part of the SSE2 intrinsics.

Member Author

But these are point coordinates, e.g. data[] member of PointXYZ. They are always single-precision.

Member

Makes sense 👍

@SergioRAgostinho
Member

I have three machines on which I'd like to give this benchmark a try. I'll report results later.

@SergioRAgostinho
Member

CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
OS: Ubuntu 16.04
Kernel: 4.13.0-36-generic
Compiler: gcc-5-4-0

None

Running benchmarks
Compiler: gcc-5-4-0
Flag: none
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1143.55000 |          874.47 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2458.16000 |          406.81 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1455.81000 |          686.90 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3523.26000 |          283.83 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1217.54000 |          821.33 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2541.50000 |          393.47 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1487.62000 |          672.21 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3589.68000 |          278.58 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.91872 |      1050.60000 |          951.84 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.79790 |      1961.37000 |          509.85 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.91492 |      1331.95000 |          750.78 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.85674 |      3018.53000 |          331.29 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.80564 |       980.90000 |         1019.47 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.70550 |      1793.02000 |          557.72 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.84202 |      1252.61000 |          798.33 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.78879 |      2831.50000 |          353.17 | 
Complete.

Native

Flag: native
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |       888.64000 |         1125.32 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2033.82000 |          491.69 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1074.31000 |          930.83 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      2984.99000 |          335.01 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1046.48000 |          955.58 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2276.48000 |          439.27 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1304.99000 |          766.29 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3348.83000 |          298.61 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.64262 |       571.06000 |         1751.13 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.70225 |      1428.24000 |          700.16 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.78740 |       845.91000 |         1182.16 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.81589 |      2435.42000 |          410.61 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.75009 |       784.95000 |         1273.97 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.63527 |      1446.17000 |          691.48 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.79945 |      1043.27000 |          958.52 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.74284 |      2487.64000 |          401.99 | 
Complete.

SSE2

Flag: sse2
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1134.42000 |          881.51 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2434.32000 |          410.79 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1455.29000 |          687.15 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3526.78000 |          283.54 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1208.06000 |          827.77 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2525.06000 |          396.03 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1492.27000 |          670.12 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3570.31000 |          280.09 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.92557 |      1049.98000 |          952.40 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.80552 |      1960.89000 |          509.97 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.91754 |      1335.29000 |          748.90 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.86764 |      3059.96000 |          326.80 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.82094 |       991.75000 |         1008.32 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.70866 |      1789.40000 |          558.85 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.84115 |      1255.23000 |          796.67 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.79985 |      2855.72000 |          350.17 | 
Complete.

NO-SSE2

Flag: no-sse2
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |       813.59000 |         1229.12 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1799.12000 |          555.83 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1110.37000 |          900.60 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      2892.70000 |          345.70 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1048.35000 |          953.88 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2066.47000 |          483.92 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1322.48000 |          756.16 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3124.52000 |          320.05 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.99837 |       812.26000 |         1231.13 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.08292 |      1948.30000 |          513.27 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.98999 |      1099.26000 |          909.70 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         1.04916 |      3034.91000 |          329.50 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         1.03451 |      1084.53000 |          922.06 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.10091 |      2274.99000 |          439.56 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         1.01860 |      1347.08000 |          742.35 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         1.06407 |      3324.70000 |          300.78 | 
Complete.

@SergioRAgostinho
Member

CPU: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
OS: Mac OS X 10.13.3
Kernel: Darwin Kernel Version 17.4.0
Compiler: Apple LLVM version 9.0.0 (clang-900.0.39.2)

None

Flag: none
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |       806.44000 |         1240.02 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1914.05000 |          522.45 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1318.46000 |          758.46 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3482.73000 |          287.13 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1705.67000 |          586.28 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2937.37000 |          340.44 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2026.10000 |          493.56 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4694.22000 |          213.03 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         1.39912 |      1128.31000 |          886.28 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.21140 |      2318.68000 |          431.28 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         1.17552 |      1549.88000 |          645.21 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         1.04832 |      3651.00000 |          273.90 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.89120 |      1520.09000 |          657.86 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.88390 |      2596.33000 |          385.16 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.92362 |      1871.35000 |          534.37 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.86469 |      4059.03000 |          246.36 | 
Complete.

Native

Flag: native
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |       823.92000 |         1213.71 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1879.89000 |          531.95 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1215.83000 |          822.48 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3748.30000 |          266.79 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1733.37000 |          576.91 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2651.45000 |          377.15 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2124.83000 |          470.63 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4543.82000 |          220.08 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         1.32741 |      1093.68000 |          914.34 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.20823 |      2271.34000 |          440.27 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         1.24942 |      1519.08000 |          658.29 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.93014 |      3486.43000 |          286.83 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.92442 |      1602.37000 |          624.08 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.91641 |      2429.82000 |          411.55 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.88452 |      1879.46000 |          532.07 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.93180 |      4233.91000 |          236.19 | 
Complete.

SSE2

Flag: sse2
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |       801.03000 |         1248.39 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1890.78000 |          528.88 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1204.52000 |          830.21 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3524.30000 |          283.74 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1775.19000 |          563.32 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2774.60000 |          360.41 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2119.85000 |          471.73 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4609.43000 |          216.95 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         1.39942 |      1120.98000 |          892.08 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.26436 |      2390.62000 |          418.30 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         1.24478 |      1499.36000 |          666.95 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         1.10822 |      3905.71000 |          256.04 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.87538 |      1553.96000 |          643.52 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         1.00478 |      2787.87000 |          358.70 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.94332 |      1999.70000 |          500.08 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.91635 |      4223.86000 |          236.75 | 
Complete.

@taketwo
Member Author

taketwo commented Mar 7, 2018

I don't like your Mac results! They make no sense. How can SSE code be slower? Can you check the disassembly (build/bench/bench_transform.s)? Search inside, e.g., for "D_XYZ_NC::begin".

Note that this file is overwritten by compilation of every "bench_transform_xxx" target.

@taketwo
Member Author

taketwo commented Mar 7, 2018

Oh, and by the way: unless you changed line 12 in "bench_transforms.cpp", you are benchmarking double-precision transforms! Apparently I forgot to switch back from double to float after my last tests and before committing to the repo. We would still need to investigate why you are getting these results, but I am very curious to see the results for single precision.

@SergioRAgostinho
Member

The weirdest thing for me is how a 4-year-old laptop is reporting better results than a high-end desktop computer from last year in the D_XYZ_NC test of the original PCL implementation. I'll dig through the assembly tomorrow.

@taketwo
Member Author

taketwo commented Mar 8, 2018

I pushed a few commits to the benchmarking repo:

  • Now the disassembly is saved for each target separately
  • Separate targets are created for float and double precision transforms

@SergioRAgostinho
Member

SergioRAgostinho commented Mar 8, 2018

Files from the desktop:
ubuntu-16-04-desktop.tar.gz

tl;dr it looks good. Eyeballing it, the baseline fraction averages about 0.8 across all flags/precisions. no-sse2 yields no performance improvement/degradation with doubles.

@SergioRAgostinho
Member

SergioRAgostinho commented Mar 8, 2018

Files from the laptop:
macosx-10-13-3-laptop.tar.gz

tl;dr It still reports abnormally fast values for the D_XYZ_NC tests (as a baseline). Everything is looking green in single precision. Eyeballing the baseline average, I would say ~0.85. For double precision, all dense tests are slower. The sparse ones are running green.

@taketwo
Member Author

taketwo commented Mar 14, 2018

I had a look at your benchmarking results for single precision. The improvement is not that great on Mac, but still there is some. The assembly for optimized functions looks similar on your machines, though gcc 5.4 makes use of FMA instructions, while appleclang inserts separate multiplies and adds. The assembly for the baseline also looks similar, and there is definitely nothing extraordinary in what is generated on Mac, so I also don't quite get why D_XYZ_NC is so fast there.

I would say that the float part of this PR is verified and ready. I'll have another look at double later.

@SergioRAgostinho
Member

The mac laptop already has some years, so FMA might not be available.

I would say that the float part of this PR is verified and ready. I'll have another look at double later.

On a different PR or still on this one?

@taketwo
Member Author

taketwo commented Mar 14, 2018

The mac laptop already has some years, so FMA might not be available.

Oh no, FMA is kind of ancient, think 200x.

On a different PR or still on this one?

This one, we should not check in code that worsens performance ;)

@SergioRAgostinho
Member

My comment was derived from the info here.

https://en.wikipedia.org/wiki/FMA_instruction_set#CPUs_with_FMA3

@taketwo
Member Author

taketwo commented Mar 14, 2018

Mine from info here:

https://en.wikipedia.org/wiki/FMA_instruction_set#History

😆

@taketwo
Member Author

taketwo commented Mar 17, 2018

I finally found time to examine the results for double-precision benchmarks. Looking at the disassembly, the worse-than-baseline performance with Appleclang does not make sense, and here is why. Let's look at D_XYZ_NC test in bench_transforms_double_none.s.

The inner transform loop of the optimized version spans from LBB5_13 to LBB5_14:

LBB5_13:                                ## =>This Inner Loop Header: Depth=1
        movq    48(%rbx), %rsi
        movss   (%rsi,%rcx), %xmm0      ## xmm0 = mem[0],zero,zero,zero
        movss   4(%rsi,%rcx), %xmm1     ## xmm1 = mem[0],zero,zero,zero
        shufps  $0, %xmm0, %xmm0        ## xmm0 = xmm0[0,0,0,0]
        cvtps2pd        %xmm0, %xmm0
        movapd  %xmm11, %xmm3
        mulpd   %xmm0, %xmm3
        addpd   %xmm6, %xmm3
        mulpd   %xmm10, %xmm0
        addpd   %xmm7, %xmm0
        shufps  $0, %xmm1, %xmm1        ## xmm1 = xmm1[0,0,0,0]
        cvtps2pd        %xmm1, %xmm1
        movapd  %xmm13, %xmm2
        mulpd   %xmm1, %xmm2
        addpd   %xmm3, %xmm2
        mulpd   %xmm12, %xmm1
        addpd   %xmm0, %xmm1
        movss   8(%rsi,%rcx), %xmm0     ## xmm0 = mem[0],zero,zero,zero
        shufps  $0, %xmm0, %xmm0        ## xmm0 = xmm0[0,0,0,0]
        cvtps2pd        %xmm0, %xmm0
        movapd  %xmm4, %xmm3
        mulpd   %xmm0, %xmm3
        addpd   %xmm2, %xmm3
        mulpd   %xmm14, %xmm0
        addpd   %xmm1, %xmm0
        cvtpd2ps        %xmm3, %xmm1
        cvtpd2ps        %xmm0, %xmm0
        unpcklpd        %xmm0, %xmm1    ## xmm1 = xmm1[0],xmm0[0]
        movapd  %xmm1, (%rax,%rcx)
        incq    %rdx
        movq    56(%r14), %rsi
        movq    (%r15), %rax
        subq    %rax, %rsi
        sarq    $4, %rsi
        addq    $16, %rcx
        cmpq    %rsi, %rdx
        jb      LBB5_13
        jmp     LBB5_24

It follows closely what is written in the intrinsics, making use of packed additions/multiplications (they have the pd suffix), i.e. instructions that act on two doubles at the same time. Including the loop boilerplate code, there is a total of 38 instructions.
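For orientation, intrinsics along the following lines would map onto the listing above. This is a hypothetical sketch rather than the PR's code: the name `transform_point_d` and the raw-pointer interface are assumptions, the matrix is taken to be column-major with 16 doubles, and points are padded to four floats.

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical sketch of a double-precision transform of float points.
// tf:  4x4 transform, column-major, 16 doubles.
// src: input point padded to 4 floats; tgt: output point, 4 floats.
inline void transform_point_d (const double* tf, const float* src, float* tgt)
{
#if defined(__SSE2__)
  // A column of four doubles needs two registers: rows 0-1 (lo), rows 2-3 (hi).
  // The translation column seeds both sums.
  __m128d lo = _mm_loadu_pd (tf + 12);
  __m128d hi = _mm_loadu_pd (tf + 14);
  for (int col = 0; col < 3; ++col)
  {
    // Broadcast one float coordinate and widen it to two doubles
    // (movss + shufps + cvtps2pd in the listing).
    const __m128d p = _mm_cvtps_pd (_mm_set1_ps (src[col]));
    lo = _mm_add_pd (lo, _mm_mul_pd (_mm_loadu_pd (tf + 4 * col), p));
    hi = _mm_add_pd (hi, _mm_mul_pd (_mm_loadu_pd (tf + 4 * col + 2), p));
  }
  // Narrow both halves back to float (cvtpd2ps) and join them into (x, y, z, w).
  const __m128 xy = _mm_cvtpd_ps (lo);
  const __m128 zw = _mm_cvtpd_ps (hi);
  _mm_storeu_ps (tgt, _mm_movelh_ps (xy, zw));
#else
  // Scalar fallback; coordinates are copied first so src and tgt may alias.
  const double x = src[0], y = src[1], z = src[2];
  for (int row = 0; row < 4; ++row)
    tgt[row] = static_cast<float> (tf[row] * x + tf[4 + row] * y + tf[8 + row] * z + tf[12 + row]);
#endif
}
```

Unlike this per-point sketch, the compiled loop in the listing keeps the matrix halves in registers across iterations instead of reloading them.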

The inner transform loop of the baseline version spans from LBB3_15 to LBB3_39:

LBB3_15:                                ## %scalar.ph
                                        ## =>This Inner Loop Header: Depth=1
        movss   -8(%rax), %xmm4         ## xmm4 = mem[0],zero,zero,zero
        movss   -4(%rax), %xmm5         ## xmm5 = mem[0],zero,zero,zero
        xorps   %xmm7, %xmm7
        cvtss2sd        %xmm4, %xmm7
        xorps   %xmm0, %xmm0
        cvtss2sd        %xmm5, %xmm0
        movss   (%rax), %xmm4           ## xmm4 = mem[0],zero,zero,zero
        cvtss2sd        %xmm4, %xmm4
        movapd  %xmm7, %xmm6
        mulsd   %xmm13, %xmm6
        movapd  %xmm0, %xmm5
        mulsd   %xmm3, %xmm5
        addsd   %xmm6, %xmm5
        movapd  %xmm4, %xmm6
        mulsd   %xmm2, %xmm6
        addsd   %xmm5, %xmm6
        addsd   %xmm1, %xmm6
        xorps   %xmm5, %xmm5
        cvtsd2ss        %xmm6, %xmm5
        movss   %xmm5, -8(%rsi)
        movapd  %xmm7, %xmm5
        mulsd   %xmm12, %xmm5
        movapd  %xmm0, %xmm6
        mulsd   %xmm14, %xmm6
        addsd   %xmm5, %xmm6
        movapd  %xmm4, %xmm5
        mulsd   %xmm10, %xmm5
        addsd   %xmm6, %xmm5
        movapd  %xmm11, %xmm6
        addsd   %xmm6, %xmm5
        movapd  %xmm8, %xmm6
        cvtsd2ss        %xmm5, %xmm5
        movss   %xmm5, -4(%rsi)
        movapd  %xmm15, %xmm5
        mulsd   %xmm5, %xmm7
        mulsd   %xmm6, %xmm0
        addsd   %xmm7, %xmm0
        mulsd   -144(%rbp), %xmm4       ## 16-byte Folded Reload
        addsd   %xmm0, %xmm4
        addsd   %xmm9, %xmm4
        xorps   %xmm0, %xmm0
        cvtsd2ss        %xmm4, %xmm0
        movss   %xmm0, (%rsi)
        incq    %rdx
        addq    $16, %rax
        addq    $16, %rsi
        cmpq    %rcx, %rdx
        jb      LBB3_15

This code does not take advantage of packed additions/multiplications (note the sd suffix) and is therefore longer (48 instructions).

sd and pd instructions have the same latency (and that's the point of SIMD). I have no good explanation as to why the longer baseline code can be faster. Perhaps something is wrong with the benchmarking protocol? So, I don't know how to proceed.
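
One way to cross-check a benchmarking protocol is a small independent timing loop. The sketch below is only an assumption of how that could be done, not the harness used in this PR: `minMicrosecondsPerIteration` is a hypothetical helper that runs a callable in several batches and reports the fastest batch, which tends to be less sensitive to scheduling noise than a single run.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: runs f in `samples` batches of `iterations` calls each
// and returns the mean time per call (in microseconds) of the fastest batch.
template <typename F>
double minMicrosecondsPerIteration (F&& f, int samples, int iterations)
{
  using Clock = std::chrono::steady_clock;
  double best = std::numeric_limits<double>::max ();
  for (int s = 0; s < samples; ++s)
  {
    const Clock::time_point start = Clock::now ();
    for (int i = 0; i < iterations; ++i)
      f ();
    const std::chrono::duration<double, std::micro> elapsed = Clock::now () - start;
    best = std::min (best, elapsed.count () / iterations);
  }
  return best;
}
```

When timing a transform this way, the output cloud should be consumed afterwards (e.g. by summing a few coordinates) so that the compiler cannot optimize the work away.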

@taketwo
Member Author

taketwo commented Mar 19, 2018

I've added an optimized implementation for double-precision transforms with AVX intrinsics, here are the updated results:

benchmark

As you can see, the "native" lines in the double-precision section became overwhelmingly green.

@taketwo taketwo force-pushed the optimize-transforms branch from dababe1 to 04ae840 Compare March 19, 2018 09:29
@SergioRAgostinho
Member

That's encouraging 👍 I'm rushing for deadlines till the end of the week so I won't be able to look properly at this and test it till then.

@stale

stale bot commented May 18, 2018

This pull request has been automatically marked as stale because it hasn't had
any activity in the past 60 days. Commenting or adding a new commit to the
pull request will revert this.

Come back whenever you have time. We look forward to your contribution.

@stale stale bot added the status: stale label May 18, 2018
@SergioRAgostinho
Member

I'll give this a go on the Mac over the weekend. It's a good PR that has been sitting here for too long, just pending my tests.

@stale stale bot removed the status: stale label May 30, 2018
@SergioRAgostinho
Member

Bottom line, everything is running green 🎊

$ make benchmarks
[ 40%] Built target celero
[ 50%] Built target bench_transforms_double_native
[ 60%] Built target bench_transforms_double_sse2
[ 70%] Built target bench_transforms_float_none
[ 80%] Built target bench_transforms_float_sse2
[ 90%] Built target bench_transforms_double_none
[100%] Built target bench_transforms_float_native
Running benchmarks
Compiler: appleclang-9-0-0-9000039
Precision: float
Flag: none
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1049.80000 |          952.56 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1999.67000 |          500.08 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1423.63000 |          702.43 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3528.85000 |          283.38 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1378.33000 |          725.52 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2268.97000 |          440.73 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1754.60000 |          569.93 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3772.52000 |          265.07 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.48537 |       509.54000 |         1962.55 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.74278 |      1485.31000 |          673.26 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.61792 |       879.69000 |         1136.76 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.84595 |      2985.24000 |          334.98 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.67064 |       924.37000 |         1081.82 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.67883 |      1540.24000 |          649.25 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.73723 |      1293.55000 |          773.07 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.81266 |      3065.79000 |          326.18 | 
Complete.
awk: unknown option -e ignored

awk: unknown option -e ignored

Compiler: appleclang-9-0-0-9000039
Precision: float
Flag: native
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1066.81000 |          937.37 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1984.49000 |          503.91 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1443.01000 |          693.00 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3483.84000 |          287.04 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1388.22000 |          720.35 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2244.64000 |          445.51 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1776.49000 |          562.91 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3747.09000 |          266.87 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.42310 |       451.37000 |         2215.48 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.74813 |      1484.65000 |          673.56 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.56155 |       810.32000 |         1234.08 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.85073 |      2963.82000 |          337.40 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.63501 |       881.54000 |         1134.38 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.68450 |      1536.46000 |          650.85 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.70598 |      1254.16000 |          797.35 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.80500 |      3016.40000 |          331.52 | 
Complete.
awk: unknown option -e ignored

Compiler: appleclang-9-0-0-9000039
Precision: float
Flag: sse2
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1045.84000 |          956.17 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      1990.59000 |          502.36 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1414.27000 |          707.08 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3504.95000 |          285.31 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1379.44000 |          724.93 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2257.25000 |          443.02 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1726.39000 |          579.24 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      3760.73000 |          265.91 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.49001 |       512.47000 |         1951.33 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.74557 |      1484.12000 |          673.80 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.61577 |       870.87000 |         1148.28 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.85141 |      2984.16000 |          335.10 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.66961 |       923.69000 |         1082.61 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.68734 |      1551.49000 |          644.54 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.74924 |      1293.48000 |          773.11 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.81388 |      3060.80000 |          326.71 | 
Complete.
awk: unknown option -e ignored

Compiler: appleclang-9-0-0-9000039
Precision: double
Flag: none
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1497.42000 |          667.82 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2708.99000 |          369.14 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1852.63000 |          539.77 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4220.65000 |          236.93 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1911.79000 |          523.07 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      3153.30000 |          317.13 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2270.51000 |          440.43 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4666.94000 |          214.27 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.71064 |      1064.12000 |          939.74 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.77171 |      2090.56000 |          478.34 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.77468 |      1435.20000 |          696.77 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.85170 |      3594.73000 |          278.19 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.77892 |      1489.14000 |          671.53 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.74109 |      2336.89000 |          427.92 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.80788 |      1834.29000 |          545.17 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.82615 |      3855.61000 |          259.36 | 
Complete.
awk: unknown option -e ignored

Compiler: appleclang-9-0-0-9000039
Precision: double
Flag: native
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1478.03000 |          676.58 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2721.18000 |          367.49 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1826.73000 |          547.43 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4230.04000 |          236.40 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1893.17000 |          528.21 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      3107.36000 |          321.82 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2246.25000 |          445.19 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4673.45000 |          213.97 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.55272 |       816.93000 |         1224.10 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.61566 |      1675.31000 |          596.90 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.65648 |      1199.22000 |          833.88 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.79310 |      3354.86000 |          298.08 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.70549 |      1335.61000 |          748.72 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.66745 |      2074.01000 |          482.16 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.77434 |      1739.37000 |          574.92 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.82069 |      3835.44000 |          260.73 | 
Complete.
awk: unknown option -e ignored

Compiler: appleclang-9-0-0-9000039
Precision: double
Flag: sse2
Celero
Timer resolution: 0.001000 us
Writing results to: out
-----------------------------------------------------------------------------------------------------------------------------------------------
     Group      |   Experiment    |   Prob. Space   |     Samples     |   Iterations    |    Baseline     |  us/Iteration   | Iterations/sec  | 
-----------------------------------------------------------------------------------------------------------------------------------------------
D_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1486.00000 |          672.95 | 
D_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      2703.78000 |          369.85 | 
D_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      1877.70000 |          532.57 | 
D_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4214.93000 |          237.25 | 
S_XYZ_NC        | PCL             |            Null |              50 |             100 |         1.00000 |      1946.52000 |          513.74 | 
S_XYZRGBN_NC    | PCL             |            Null |              50 |             100 |         1.00000 |      3152.13000 |          317.25 | 
S_XYZ_C         | PCL             |            Null |              50 |             100 |         1.00000 |      2270.91000 |          440.35 | 
S_XYZRGBN_C     | PCL             |            Null |              50 |             100 |         1.00000 |      4663.37000 |          214.44 | 
D_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.71159 |      1057.42000 |          945.70 | 
D_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.76752 |      2075.21000 |          481.88 | 
D_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.75317 |      1414.23000 |          707.10 | 
D_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.85230 |      3592.40000 |          278.37 | 
S_XYZ_NC        | Proposed        |            Null |              50 |             100 |         0.75352 |      1466.75000 |          681.78 | 
S_XYZRGBN_NC    | Proposed        |            Null |              50 |             100 |         0.74048 |      2334.09000 |          428.43 | 
S_XYZ_C         | Proposed        |            Null |              50 |             100 |         0.80901 |      1837.18000 |          544.31 | 
S_XYZRGBN_C     | Proposed        |            Null |              50 |             100 |         0.82524 |      3848.41000 |          259.85 | 
Complete.
awk: unknown option -e ignored

@SergioRAgostinho
Member

Just had another quick look at the PR and if it's ok from your side I'll just merge it.

@taketwo
Member Author

taketwo commented Jun 4, 2018

Thanks for giving it a try. From my side it's ready.

@SergioRAgostinho SergioRAgostinho merged commit cf5667d into PointCloudLibrary:master Jun 4, 2018
@frozar
Contributor

frozar commented Jun 4, 2018

In this PR, two tests fail on AppVeyor under PLATFORM=x86:

  • a_octree_test
  • surface_concave

I have not checked yet whether this is directly related to the modifications in this PR (AppVeyor has been failing regularly lately...), but since these failures don't seem to be the usual ones, I thought I should mention it here.

I hope to find time to investigate these failures soon.

Cheers guys.

@taketwo taketwo deleted the optimize-transforms branch June 5, 2018 12:39
@taketwo
Member Author

taketwo commented Jun 5, 2018

Thanks for letting us know. Since the dedicated tests for transform functions are running green, I'd expect that these failures are precision-related (i.e. the epsilons are too tight).

@frozar
Contributor

frozar commented Jun 5, 2018

For me, it doesn't seem like an epsilon error, because the message is something like this:
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.

It seems to be a Windows/Visual Studio specific error according to this link.

I tried to reproduce the error on my Ubuntu laptop, but without success. Also, when I check the different AppVeyor logs, I don't always get the same errors, so there seems to be some randomness involved.

Without a Windows machine (ARCHITECTURE=x86) with Visual Studio that compiles PCL in Debug mode, these bugs seem intractable for me 😥 ...

@SergioRAgostinho
Member

@UnaNancyOwen can you reproduce the results from these tests?

@UnaNancyOwen
Member

@SergioRAgostinho Yes, these tests seem to fail.

@frozar
Contributor

frozar commented Jun 9, 2018

Maybe these tests should be fixed in a dedicated PR.

@frozar
Contributor

frozar commented Jun 20, 2018

My comment is not directly related to this PR, but I am regularly facing issues with the AppVeyor checker. I cannot reproduce the tests locally because I don't have a Windows machine.

My question is: do you know a way to run Windows tests under Ubuntu? I was thinking about a virtual machine or Docker. Do you think that's possible (easily)?

@taketwo
Member Author

taketwo commented Jun 20, 2018

I think running Windows in a virtual machine is the easiest way. Microsoft even provides images with a developer environment already set up: https://developer.microsoft.com/en-us/windows/downloads/virtual-machines.

@frozar
Contributor

frozar commented Jun 20, 2018

This is valuable information, thank you. I'll give it a try as soon as possible 😉

@SergioRAgostinho SergioRAgostinho changed the title Improve speed of transformPointCloud/WithNormals() functions Improve speed of transformPointCloud/WithNormals() functions Aug 26, 2018