
Perf improvements for floating point math #852

Merged: 3 commits merged into uber:master on Jul 15, 2024

Conversation

@heshpdx (Contributor) commented Jul 13, 2024

This completes the work from #790, where we started the removal of "long double" types.

Additionally, there is an easy performance-improvement opportunity in changing some FDIVs into FMULs. On modern CPUs, a divide usually takes 3 to 4 times as long to complete as a multiply, so we can convert the high-impact divide operations by defining literals where the inverse is pre-computed (a sketch of the pattern follows the commit summary below). Removing divides from loops has a big impact: I measured a 30% speedup in cellToLatLng and cellToBoundary on my machine. Please see what you can achieve on yours. Thank you!

- Convert all the remaining "long double" literals to "double".
- Define new literals for some inverse values, and use them to change
  divide operations into multiply operations, since that is generally
  faster for most CPUs.
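
A minimal sketch of the pattern in C, illustrative only: the functions and loop are hypothetical, and the literal values assume M_SIN60 = sqrt(3)/2 as used in h3lib, with M_RSIN60 being the reciprocal literal this PR introduces.

    #include <stddef.h>

    /* sin(60 deg) and its precomputed inverse, 1/sin(60 deg) = 2/sqrt(3). */
    #define M_SIN60 0.8660254037844386467637231707529362
    #define M_RSIN60 1.1547005383792515290182975610039149

    /* Before: one FDIV per iteration. */
    static void scale_div(double *out, const double *in, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = in[i] / M_SIN60;
    }

    /* After: one FMUL per iteration, via the precomputed inverse. */
    static void scale_mul(double *out, const double *in, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = in[i] * M_RSIN60;
    }
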
@CLAassistant commented Jul 13, 2024

CLA assistant check: all committers have signed the CLA.

@dfellis (Collaborator) left a comment

I am very surprised that modern C compilers aren't making these optimizations by default, given the performance impact you mentioned, but I'm very excited about improving the performance of key functions in H3. :)
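
For context on why compilers leave these divides alone by default: under strict IEEE-754 semantics, x / c and x * (1.0 / c) can round differently unless c is a power of two, so GCC and Clang only make the substitution under -freciprocal-math (implied by -ffast-math). A minimal sketch that counts the disagreements (compile with e.g. gcc -O2, no -ffast-math, so the divide stays a divide):

    #include <stdio.h>

    int main(void) {
        /* Compare the correctly rounded divide against the double-rounded
         * reciprocal multiply; a nonzero count is why the compiler cannot
         * rewrite the divide on its own. */
        int mismatches = 0;
        for (int i = 1; i <= 1000000; i++) {
            double x = (double)i;
            if (x / 3.0 != x * (1.0 / 3.0)) mismatches++;
        }
        printf("mismatches: %d of 1000000\n", mismatches); /* typically nonzero */
        return 0;
    }
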

@coveralls commented Jul 13, 2024

Coverage Status: coverage 98.826%, remained the same when pulling e570b03 on heshpdx:master into ecc0d25 on uber:master.

@grim7reaper (Contributor) commented Jul 13, 2024

I've ported the use-mul-instead-of-div changes to h3o because the 30% speedup was very attractive, but I haven't observed any noticeable performance improvement.
Maybe the M1 CPU already has fast division, or LLVM is already doing this optimization under the hood for Rust.

Edit: I cannot repro with the benchmarks of this repo either. It must be HW dependent, then.

@isaacbrodsky (Collaborator) commented Jul 14, 2024

I wasn't able to quite reproduce the reported performance improvements on Linux x64 with GCC, but I'm happy to retest on ARM later.

Edit: I see performance improving by more like 10-15%.

Before

build-master-jul14$ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.165765 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.832082 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.128193 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 1.945774 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 2.400742 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 1.018848 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 5.000979 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 30.648170 microseconds per iteration (10000 iterations)
	-- gridDisk20: 116.188511 microseconds per iteration (10000 iterations)
	-- gridDisk30: 274.647540 microseconds per iteration (10000 iterations)
	-- gridDisk40: 441.203441 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 613.105132 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 5084.334198 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 17323.867540 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 40797.638900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 58.487380 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 2616.719411 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 14.005060 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 10.162646 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.217632 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 157.010829 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 154.470410 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 7074.462316 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 8923.350511 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 5023.494634 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 8218.255006 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 8942.472348 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 108.960790 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 38.634417 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 158.458785 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.241202 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.332053 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 7.849704 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 52.471268 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 369.739713 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 4029.634296 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 6255.191586 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 188593.924100 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 2265.643132 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 7476.944652 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 8589.903528 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 5523.648154 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 15981.319740 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 20323.545974 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 116890.366500 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 379016.690500 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 590245.006200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

After

build-branch-jul14$ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.174684 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.706215 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.113044 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 1.853511 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 2.095765 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 1.015881 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 4.406268 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 31.002723 microseconds per iteration (10000 iterations)
	-- gridDisk20: 115.963878 microseconds per iteration (10000 iterations)
	-- gridDisk30: 255.184783 microseconds per iteration (10000 iterations)
	-- gridDisk40: 446.646353 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 620.174954 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 5127.692764 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 17360.673460 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 41154.405900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 59.351578 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 2677.547189 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 14.106074 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 9.734607 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.215882 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 160.913600 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 156.779922 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 7027.019166 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 8806.731603 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 4965.449012 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 8126.078029 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 8706.736355 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 110.695771 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 39.187226 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 160.627655 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.211110 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.388388 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 8.871911 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 56.922808 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 391.073105 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 3899.409916 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 6277.127410 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 188710.784900 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 2175.312946 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 7408.483802 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 8448.251498 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 5296.558980 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 15343.415832 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 19566.347054 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 113208.269200 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 363013.989700 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 559297.645200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

@dfellis (Collaborator) commented Jul 14, 2024

@isaacbrodsky your benchmark does show an improvement on latLngToCell from 2.4 µs to 2.1 µs. Assuming that's significant and reproducible, it's a 14% perf boost.

@isaacbrodsky (Collaborator)

> @isaacbrodsky your benchmark does show an improvement on latLngToCell from 2.4 µs to 2.1 µs. Assuming that's significant and reproducible, it's a 14% perf boost.

Sorry, I was imprecise. I did see performance improvements in many benchmarks, but more on the order of 10-15% rather than the 30% reported.

@heshpdx (Contributor, Author) commented Jul 14, 2024

The benefit is definitely microarchitecture-specific, based on how the FPU is implemented and on the latency and throughput of the individual operations. Also, most CPUs implement "early-out" divides, so if the computation looks like {N/1, 0/N, N/N, N<<2, etc.} it doesn't incur the full latency (e.g. if unit tests have a zero dividend, there will be no perf benefit). I just ran "make benchmarks" and pulled a few results which looked significant:

old  -- latLngToCell: 2.366658 microseconds per iteration (10000 iterations)
new  -- latLngToCell: 1.635445 microseconds per iteration (10000 iterations)
    
old  -- cellToChildren1: 0.404193 microseconds per iteration (10000 iterations)
new  -- cellToChildren1: 0.147156 microseconds per iteration (10000 iterations)

old  -- cellToChildren2: 1.099871 microseconds per iteration (10000 iterations)
new  -- cellToChildren2: 0.750266 microseconds per iteration (10000 iterations)

That's {1.4x, 2.7x, 1.5x}, as measured on my Ampere AltraMax. The 1.3x I cited was from our SPEC CPU input. Thanks for considering this PR.
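
A self-contained microbenchmark sketch of the latency effect being described (illustrative only: the accumulator carries a loop dependency, so each loop measures FDIV or FMUL latency rather than throughput, and the non-trivial operands avoid any early-out path; compile with e.g. gcc -O2, no -ffast-math):

    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    static double elapsed(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        double acc = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) acc = acc / 1.0000001; /* dependent FDIVs */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double div_s = elapsed(t0, t1);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) acc = acc * 0.9999999; /* dependent FMULs */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double mul_s = elapsed(t0, t1);

        /* Print acc so the loops cannot be optimized away. */
        printf("div: %.3f s  mul: %.3f s  (acc=%g)\n", div_s, mul_s, acc);
        return 0;
    }
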

@isaacbrodsky (Collaborator)

I get similar or even better (40% on cellToLatLng) performance improvements when I test on Linux ARM:

Before

~/oss/h3/build $ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.237791 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.953805 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.221790 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 2.608292 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 6.158289 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 3.538159 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 16.000204 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 46.712590 microseconds per iteration (10000 iterations)
	-- gridDisk20: 172.776119 microseconds per iteration (10000 iterations)
	-- gridDisk30: 379.537284 microseconds per iteration (10000 iterations)
	-- gridDisk40: 665.536855 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 974.917548 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 7932.902812 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 27031.574120 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 65397.877600 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 67.016416 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 3043.141366 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 40.614495 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 13.928412 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.383176 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 216.126529 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 224.302782 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 13482.154379 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 13888.525799 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 7786.916335 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 12766.925168 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 13777.683675 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 423.303284 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 157.237177 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 625.338030 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.244395 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.357393 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 9.080074 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 63.147554 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 441.493719 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 10539.029034 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 14892.152532 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 455600.007400 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 7021.455078 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 26996.973598 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 28265.139666 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 13734.053836 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 51138.265554 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 58866.366632 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 304419.850000 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 1275601.226200 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 1790633.328600 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

After

~/oss/h3-copy/build $ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.242731 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.989570 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.223000 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 2.658519 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 3.780628 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 2.141569 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 10.879162 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 46.536392 microseconds per iteration (10000 iterations)
	-- gridDisk20: 173.230969 microseconds per iteration (10000 iterations)
	-- gridDisk30: 380.076526 microseconds per iteration (10000 iterations)
	-- gridDisk40: 666.374863 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 980.303592 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 7948.988960 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 27231.112900 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 66191.866500 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 67.183286 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 3054.412760 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 30.176533 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 13.611636 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.385624 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 212.934427 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 224.648723 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 13472.980062 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 13887.771011 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 7781.522597 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 12761.149156 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 13773.922437 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 320.794363 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 124.114011 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 492.473339 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.255307 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.386753 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 9.292348 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 64.225439 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 443.989882 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 7519.098276 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 11145.530170 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 351837.750500 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 4643.820966 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 17948.688888 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 18913.791116 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 9732.431998 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 34826.282658 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 40562.522346 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 209794.639100 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 855543.199300 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 1222980.075300 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

@isaacbrodsky (Collaborator)

@heshpdx Thanks for improving the performance here!

@isaacbrodsky merged commit a7845a7 into uber:master on Jul 15, 2024 (34 checks passed).
isaacbrodsky added a commit to isaacbrodsky/h3 that referenced this pull request Jul 15, 2024
grim7reaper added a commit to HydroniumLabs/h3o that referenced this pull request Jul 15, 2024
@@ -66,7 +66,7 @@ void _hex2dToCoordIJK(const Vec2d *v, CoordIJK *h) {
     a2 = fabsl(v->y);
 
     // first do a reverse conversion
-    x2 = a2 / M_SIN60;
+    x2 = a2 * M_RSIN60;
     x1 = a1 + x2 / 2.0;
@heshpdx (Contributor, Author)
I just spotted this. I'm not sure it matters, since powers of two are quick anyway, but I figured I would document that we could change it to x1 = a1 + x2 * 0.5;
or, since this is the only usage of M_RSIN60, just craft an M_RSIN60_DIV_BY_2.

@heshpdx (Contributor, Author)
I confirmed that the same assembly is produced.

Collaborator
If the same assembly is produced, it sounds like it's fine to leave as-is, because compiler optimizations take care of it for us?

@heshpdx (Contributor, Author)
Yes, any compiler at -O1 or higher figures it out.
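
A quick way to confirm: division by a power of two is exact in binary floating point, so the compiler may legally rewrite it even under strict IEEE-754 rules. Compiling a pair of functions like the following (names are illustrative) with gcc -O1 -S should show identical bodies, each multiplying by 0.5:

    double half_div(double a) { return a / 2.0; }
    double half_mul(double a) { return a * 0.5; }
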

isaacbrodsky added a commit that referenced this pull request Aug 25, 2024
* add #852 to changelog

* others
isaacbrodsky pushed a commit that referenced this pull request Sep 19, 2024
* Further performance improvements for FP math

More FDIV->FMUL opportunities unlocked, following in the
spirit of #852

* Formatting fix

* Update src/h3lib/lib/localij.c

Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com>

* Add #905 to CHANGELOG.md

* Save one fdiv and maybe a cosine

---------

Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com>