Optimization

After the initial port of the JavaScript to Rust, the code was quite slow: about ten times slower.

After much messing around with the C++ port of earcut, it was clear the C++ version was extremely fast. The C++ benchmarks even had compiler optimization turned off by default; turning it on sometimes gave a 10x speedup or more.

Through profiling and some optimization work, improvements were made to the Rust code so that it was comparable to optimized C++.

Here are some highlights of that process, along with the benchmarks of the water* shapes as the code slowly gets faster and faster.

C++ vs javascript

The C++ version made a few decisions that by default made it faster than JavaScript:

  • C++ only accepts two dimensions

This eliminates several "div" (division) instructions from the main hot loop, which is executed millions of times for a good-sized shape.

Rust vs C++

  • is_ear_hashed() is hot: Profilers reveal that on bigger shapes the vast majority of time is spent inside is_ear_hashed(), which determines whether an ear is "really an ear" so that it may be cut without damaging the polygon.

  • inline is important: Most of the time in C++ you can assume the compiler figures out inlining. Here, however, the point_in_triangle and area functions take up a lot of time. Since this port calls them from a single earcheck function, that function was marked #[inline], resulting in a good speed boost. A sketch of the shape of that change follows.
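
A minimal sketch of the idea, using a simplified Node type and the standard earcut-style geometry helpers; the real port's signatures differ:

#[derive(Clone, Copy)]
struct Node {
    x: f64,
    y: f64,
}

// twice the signed area of the triangle (p, q, r)
fn area(p: &Node, q: &Node, r: &Node) -> f64 {
    (q.y - p.y) * (r.x - q.x) - (q.x - p.x) * (r.y - q.y)
}

// standard earcut-style point-in-triangle test
fn point_in_triangle(a: &Node, b: &Node, c: &Node, p: &Node) -> bool {
    (c.x - p.x) * (a.y - p.y) - (a.x - p.x) * (c.y - p.y) >= 0.0
        && (a.x - p.x) * (b.y - p.y) - (b.x - p.x) * (a.y - p.y) >= 0.0
        && (b.x - p.x) * (c.y - p.y) - (c.x - p.x) * (b.y - p.y) >= 0.0
}

// the single combined check, marked for inlining so the optimizer
// can fuse it into the caller's hot loop
#[inline(always)]
fn earcheck(a: &Node, b: &Node, c: &Node, prev: &Node, p: &Node, next: &Node) -> bool {
    point_in_triangle(a, b, c, p) && area(prev, p, next) >= 0.0
}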

  • zorder() is also hot: The second major speed boost came from profiling with Callgrind/KCachegrind, which revealed that the zorder() function was a source of some consternation. In particular, the conversion from the 64-bit floating point input arguments to 32-bit integers can be tricky, since that conversion can be optimized in various unusual ways.

  • Floating point to integer is hot:

By transforming

let mut x: i32 = 32767 * ((xf - minx) * invsize).round() as i32;
let mut y: i32 = 32767 * ((yf - miny) * invsize).round() as i32;

Into this:

let mut x: i32 = (32767.0 * ((xf - minx) * invsize)) as i32;
let mut y: i32 = (32767.0 * ((yf - miny) * invsize)) as i32;

A 13x speedup was achieved for the 'water' benchmark. (Dropping .round() means truncation instead of rounding, which is acceptable here since the result is only used as a z-order hash key.)

  • dimensions: C++ earcut assumes 2 dimensions, which can save a few percent. For example, the main earcut loop, which is called for every single ear, has three 'div' instructions inside it if you support generic dimensions. By specializing for the case of 2 dimensions and writing point.i / 2, the compiler can change that division into a bit shift (>> 1), as the sketch after this item shows.
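
A minimal illustration of that difference; the function names are mine:

// Division by a runtime dimension count needs a real `div` instruction;
// division by the constant 2 compiles down to a shift.
fn vertex_of_generic(i: usize, dim: usize) -> usize {
    i / dim // idiv: expensive inside a hot loop
}

fn vertex_of_2d(i: usize) -> usize {
    i / 2 // the compiler emits this as i >> 1
}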

  • linked list vs nodes in a vector: C++ earcut uses 'real' linked lists, whereas this code fakes a linked list using bounds-checked indexes into a vector of nodes. Some of the indexing can be replaced with get_unchecked, but experiments showed only a tiny speed boost within the margin of error, except for small shapes like water3b (about 500 ns), and it is not worth an unsafe{} block. During development the program crashed many times due to out-of-bounds indexes, and thanks to bounds checking those problems were extremely easy to diagnose and fix. unsafe would remove that ability.

  • NULL index vs Option<Index>: Since we use a vector of nodes instead of a linked list, there needs to be a translation of the NULL concept the JavaScript code uses for its linked list node pointers. We could use Option, but then we would have to unwrap the Option every single time. And what risk is it preventing? A bad index into a vector, which will be caught by the bounds checker if it is too low or too high, or in the worst case produce garbled output by visiting invalid nodes. When I tried converting NULL to Option::None, everything was about twice as slow in benchmarks. A sketch of the two approaches follows.
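
A rough sketch of the trade-off, assuming an illustrative minimal node type rather than the port's actual definitions:

const NULL: usize = usize::MAX; // sentinel index standing in for a null pointer

struct SNode {
    next_idx: usize, // NULL means "no next node"
}

fn count_sentinel(nodes: &[SNode], mut i: usize) -> usize {
    let mut n = 0;
    while i != NULL {
        n += 1;
        i = nodes[i].next_idx; // still bounds-checked; a bad index panics cleanly
    }
    n
}

struct ONode {
    next_idx: Option<usize>, // None means "no next node"
}

fn count_option(nodes: &[ONode], mut i: Option<usize>) -> usize {
    let mut n = 0;
    while let Some(idx) = i {
        n += 1;
        i = nodes[idx].next_idx; // every single step unwraps an Option
    }
    n
}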

  • linked list 'next' vs iterator: This version of the Rust code uses an iterator to cycle through the nodes, which is slightly slower than a real linked list since the custom iterator checks four conditions per step instead of one (is it null?).

  • small shapes: For water3b in particular, the Rust code spends more time in earcut_linked than in is_ear, which is the opposite of C++.

  • maths: There is a small possibility that C++ earcut optimizes the point_in_triangle math differently than Rust.

  • Subtraction that is unnoticeable: What about eliminating the subtraction in point_in_triangle by translating the entire polygon so minx and miny are 0? Tried it; the difference is within the margin of measurement error. In other words, even though it removes millions of instructions from the run, the way things fit through the CPU's floating point pipeline and cache means it didn't matter. Amazing.

  • Vector indexing and bounds checking in Rust: You can test this by replacing [] inside the node!, next!, and prev! macros with .get_unchecked() and/or .get_unchecked_mut(); the answer is a tiny speedup, almost too small to measure.

  • Iteration vs loops: This code has converted several JavaScript for loops into Rust iteration. For example, the entire leftmost() function was replaced by a single line of iterator adaptor code. However, each iteration involves wrapping the item in Some() and checking against None, which may be slower than a straight, ordinary loop. How much slower?

A few iter_range() calls replaced with loop{}:
test bench_water                ... bench:   2,044,420 ns/iter (+/- 23,388)
test bench_water2               ... bench:   1,865,605 ns/iter (+/- 13,123)
test bench_water3               ... bench:      77,158 ns/iter (+/- 483)
test bench_water3b              ... bench:       6,845 ns/iter (+/- 35)
test bench_water4               ... bench:     530,067 ns/iter (+/- 14,930)
test bench_water_huge           ... bench:  30,693,084 ns/iter (+/- 578,848)
test bench_water_huge2          ... bench:  62,481,656 ns/iter (+/- 1,934,261)

Ahh. Well. For testing, a few iterations were replaced with loop{}. For the small shape water3b, loop{} is several hundred nanoseconds faster per iteration. For larger shapes, however, the difference is within the margin of measurement error.

Furthermore, consider maintainability, legibility, likelihood of bugs, and elegance. Let's compare is_ear() which is used heavily by small shapes (since larger shapes only use is_ear_hashed()). Compare the loop{} version with the iteration version:

Loop: 10 lines of code. Normal C-style maintainability and bug issues (off by one in the loop? breaking too soon or too late? not cleaning up properly before return? not a big deal, but can we do better?)

    let mut p = c.next_idx;
    loop {
        if point_in_triangle(&a, &b, &c, &node!(ll, p))
            && area(&prev!(ll, p), &node!(ll, p), &next!(ll, p)) >= 0.0
        {
            return false;
        }
        p = node!(ll, p).next_idx;
        if p == a.idx {
            break;
        };
    }
    return true;

Iteration: 4 lines. Easier to read (once you learn the basics of functional languages), easier to maintain, easier to debug, easier to write; "off by one" only applies to the iteration inputs c.next and a, and we don't have to manually break or return at all.

    !ll.iter_range(c.next_idx..a.idx).any(|p| {
        point_in_triangle(&a, &b, &c, &p)
            && (area(&prev!(ll, p.idx), &p, &next!(ll, p.idx)) >= 0.0)
    }),

Taking into consideration the balance and tradeoff between speed and maintainability, legibility, and elegance, the iterators have been left in. The speed difference is unnoticeable on the vast majority of benchmarks and only a tiny percentage on the others.

Note that part of this is because there is no place for an iterator in the 'hot code' where most time is spent, is_ear_hashed().

Profiling results 2/2 - special sauce

Profiling above showed the zorder() function is hot. Can we cool it down a bit?

It begins like this:

fn zorder(xf: f64, yf: f64, minx: f64, miny: f64, invsize: f64) -> i32 {
    // coords are transformed into non-negative 15-bit integer range
    let mut x: i32 = (32767.0 * ((xf - minx) * invsize)) as i32;
    let mut y: i32 = (32767.0 * ((yf - miny) * invsize)) as i32;

Now, I find it's easier to read the assembly by using perf than by making Rust dump assembly, so let's try this:

Make sure Cargo.toml has:

[profile.release] 
debug = true

then

don@serebryanya:~/src/earcutr$ cargo bench water2
   Compiling earcutr v0.1.0 (/home/don/src/earcutr)                             
    Finished release [optimized + debuginfo] target(s) in 9.11s                 
     Running target/release/deps/earcutr-7c59a70327baf37c
     Running target/release/deps/speedtest-c6de42429f2eb5ef
don@serebryanya:~/src/earcutr$ sudo perf stat target/release/deps/speedtest-c6de42429f2eb5ef water2
don@serebryanya:~/src/earcutr$ sudo perf record target/release/deps/speedtest-c6de42429f2eb5ef water2
don@serebryanya:~/src/earcutr$ sudo perf report

Hit 'h' to learn about keyboard commands, drill down into is_ear_hashed, scroll to the top, then hit '/' to search and type 'invsize':

_ZN7earcutr13is_ear_hashed17haea87f38a9e0c9d3E():
  0.02        movsd  0x61056(%rip),%xmm8        # 82ea0
_ZN7earcutr6zorder17hab1d711e708e67cdE():
                let mut x: i32 = (32767.0 * ((xf - minx) * invsize)) as i32;
              subsd  %xmm0,%xmm6
              mulsd  %xmm2,%xmm6
  0.69        mulsd  %xmm8,%xmm6
  0.63        cvttsd2si %xmm6,%eax
                let mut y: i32 = (32767.0 * ((yf - miny) * invsize)) as i32;
  0.02        subsd  %xmm1,%xmm5
  0.54        mulsd  %xmm2,%xmm5
  0.31        mulsd  %xmm8,%xmm5
  0.27        cvttsd2si %xmm5,%esi

What is this saying? Googling subsd, mulsd, and cvttsd2si shows this:

Basically it is subtracting one float from another, then multiplying it, then multiplying it again. Then cvttsd2si truncates it to an integer.

But what is 32767.0 all about? invsize is multiplied by it... twice... couldn't we pre-compute invsize * 32767.0 once instead of twice?

Well, maybe the optimizer has some reason. But... what if we never had to multiply it at all? invsize is hardly ever used. In fact, pulling out trusty nano and searching ('Ctrl-W') for invsize reveals that invsize is never used except inside zorder, right here in the hot zone I'm looking at!

Now... where is invsize really set up? It's in calc_invsize(). Note that I heavily modified that function while porting the JavaScript, to try to understand better what was going on.

// into integers for z-order calculation
fn calc_invsize(minx: f64, miny: f64, maxx: f64, maxy: f64) -> f64 {
    let invsize = f64::max(maxx - minx, maxy - miny);
    match invsize == 0.0 {
        true => 0.0,
        false => 1.0 / invsize,
    }
}

So invsize is actually 1 over the bounding box size. But OK. What if we pre-multiply by 32767.0 right here?

// into integers for z-order calculation
fn calc_invsize(minx: f64, miny: f64, maxx: f64, maxy: f64) -> f64 {
    let invsize = f64::max(maxx - minx, maxy - miny);
    match invsize == 0.0 {
        true => 0.0,
        false => 32767.0 / invsize,
    }
}

Now what does the ASM become?

_ZN7earcutr6zorder17hab1d711e708e67cdE():
        let mut x: i32 = ( ((xf - minx) * invsize)) as i32;
        subsd  %xmm0,%xmm6
        mulsd  %xmm2,%xmm6
  0.71  cvttsd2si %xmm6,%eax
        let mut y: i32 = ( ((yf - miny) * invsize)) as i32;
  0.26  subsd  %xmm1,%xmm5
  0.19  mulsd  %xmm2,%xmm5
  0.97  cvttsd2si %xmm5,%esi

Wow... I got rid of a whole mulsd instruction inside a hot zone. What are the results? First, cargo test showed there were no test failures. Then, cargo bench water:

test bench_water                ... bench:   2,054,921 ns/iter (+/- 31,695)
test bench_water2               ... bench:   1,854,223 ns/iter (+/- 227,359)
test bench_water3               ... bench:      76,377 ns/iter (+/- 595)
test bench_water3b              ... bench:       7,053 ns/iter (+/- 17)
test bench_water4               ... bench:     530,306 ns/iter (+/- 24,262)
test bench_water_huge           ... bench:  30,679,819 ns/iter (+/- 588,042)
test bench_water_huge2          ... bench:  62,861,730 ns/iter (+/- 1,542,146)

Well, not great; it's well within the margin of error. But I'll take it. Unlike the optimizations above, which reduced safety or took up more code, this one actually reduces the code by a few symbols. I like that type of optimization better than removing safety.

  • translate to zero

The zorder code also has 'xf - minx'. What is minx? It's the left-most x coordinate of the polygon input data. But why do we need it? Can we translate the whole polygon so minx is 0? Then we don't need this subtraction.

A bit of work and, again, perf reveals another instruction can be eliminated from zorder; the first two lines are just a single multiplication now:

_ZN7earcutr6zorder17hf6b08d8ab5a82e32E():
        let mut x: i32 = ( ((xf ) * invsize)) as i32;
  0.35  mulsd  %xmm0,%xmm3
  0.09  cvttsd2si %xmm3,%eax
_ZN4core3f6421_$LT$impl$u20$f64$GT$3max17h0592a962afedea39E():
  0.24  orpd   %xmm4,%xmm2
_ZN7earcutr6zorder17hf6b08d8ab5a82e32E():
        let mut y: i32 = ( ((yf ) * invsize)) as i32;
  0.50  mulsd  %xmm0,%xmm2
  0.39  cvttsd2si %xmm2,%esi

Not even sure what 'orpd' is doing in there (it belongs to f64::max); apparently the optimizer felt it could slip it in without a problem.

Benchmark results? To be honest, no visible change. It proves that eliminating instructions is not always going to save running time.

But can we go further?

Look at the zorder function... it's using 32-bit numbers and shifting:

    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;

    y = (y | (y << 8)) & 0x00FF00FF;
    y = (y | (y << 4)) & 0x0F0F0F0F;
    y = (y | (y << 2)) & 0x33333333;
    y = (y | (y << 1)) & 0x55555555;

What if I combined these?

    let mut x: i64 = ( ((xf ) * invsize)) as i64;
    let mut y: i64 = ( ((yf ) * invsize)) as i64;
    let mut xy: i64 = x << 32 | y;

    xy = (xy | (xy << 8)) & 0x00FF00FF00FF00FF;
    xy = (xy | (xy << 4)) & 0x0F0F0F0F0F0F0F0F;
    xy = (xy | (xy << 2)) & 0x3333333333333333;
    xy = (xy | (xy << 1)) & 0x5555555555555555;

    let x32: i32 = (xy >> 32) as i32;
    let y32: i32 = (xy & 0x00000000FFFFFFFF) as i32;
The assembly becomes:

  0.09  mov    %rax,%rsi
  0.46  shl    $0x8,%rsi
  0.70  or     %rax,%rsi
  0.29  and    %rbx,%rsi
            xy = (xy | (xy << 4)) & 0x0F0F0F0F0F0F0F0F;
  0.04  mov    %rsi,%rax
  0.26  shl    $0x4,%rax
  0.70  or     %rsi,%rax
  0.13  and    %r14,%rax
            xy = (xy | (xy << 2)) & 0x3333333333333333;
  0.31  lea    0x0(,%rax,4),%rsi
  0.40  or     %rax,%rsi
  0.29  and    %r11,%rsi
            xy = (xy | (xy << 1)) & 0x5555555555555555;
  0.33  lea    (%rsi,%rsi,1),%rax
  0.24  or     %rsi,%rax
            let x32: i32 = (xy >> 32) as i32;
  0.02  mov    %rax,%rsi
  0.24  shr    $0x20,%rsi
  0.57  and    $0x55555555,%esi
            x32 | (y32 << 1)
        and    $0x55555555,%eax

And the benchmarks:
test bench_water                ... bench:   2,056,546 ns/iter (+/- 31,190)
test bench_water2               ... bench:   1,789,347 ns/iter (+/- 11,558)
test bench_water3               ... bench:      78,216 ns/iter (+/- 637)
test bench_water3b              ... bench:       7,170 ns/iter (+/- 59)
test bench_water4               ... bench:     522,485 ns/iter (+/- 5,929)
test bench_water_huge           ... bench:  30,443,739 ns/iter (+/- 730,965)
test bench_water_huge2          ... bench:  61,979,392 ns/iter (+/- 1,704,953)

Again, not great, but I'll take it.
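
As a sanity check, the combined 64-bit spread can be compared against spreading one coordinate at a time. This is my own little harness, not code from the port:

fn spread(mut v: i64) -> i64 {
    v = (v | (v << 8)) & 0x00FF00FF00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F0F0F0F0F;
    v = (v | (v << 2)) & 0x3333333333333333;
    v = (v | (v << 1)) & 0x5555555555555555;
    v
}

fn main() {
    let (x, y) = (12345i64, 23456i64); // any 15-bit values
    let xy = spread(x << 32 | y);      // both coordinates in one pass
    let x32 = (xy >> 32) as i32;
    let y32 = (xy & 0x00000000FFFFFFFF) as i32;
    // the two halves must match the one-at-a-time spread
    assert_eq!(x32 as i64, spread(x));
    assert_eq!(y32 as i64, spread(y));
    println!("combined spread ok: {:08x} {:08x}", x32, y32);
}

This works because a 15-bit input only ever spreads into 30 bits, so the x half (starting at bit 32) and the y half never bleed into each other.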

  • freelist

I thought I would be clever and, during remove_node, keep a list of removed nodes. It turns out that even though we are faking a linked list inside a vector, we don't need a freelist of removed nodes at all, except for debugging and development, when dumping the internal structure of the nodes to a printout. Removing it saved a few percent.

  • NULL

For the NULL pointer, well, I made the index 0x7774917... that's "magic" in l33tspeak. However, CPUs internally sometimes prefer numbers like 0 or 0xFFFFFFFF for bit comparisons. Changing this to 0xFFFFFFFF... brought another few percent:

test bench_water                ... bench:   2,033,630 ns/iter (+/- 40,087)
test bench_water2               ... bench:   1,778,138 ns/iter (+/- 133,746)
test bench_water3               ... bench:      74,565 ns/iter (+/- 695)
test bench_water3b              ... bench:       6,588 ns/iter (+/- 68)
test bench_water4               ... bench:     514,063 ns/iter (+/- 2,866)
test bench_water_huge           ... bench:  30,305,755 ns/iter (+/- 749,417)
test bench_water_huge2          ... bench:  61,937,131 ns/iter (+/- 2,259,247)

  • the main earcut loop

This is the next hottest section after is_ear_hashed and zorder. Why is it so long, though? It has numerous ifs, thens, and elses. Can it be simplified?

It turns out, yes. One optimization in C++ earcut is that the hashed ear check is used only when there are more than 80 points; below that, the plain version is used. This results in an if-then-else. What if we just ignore that and always use the hashed version? For tiny shapes like water3b we lose a little speed, but for big shapes... we gain.

test bench_water                ... bench:   1,979,939 ns/iter (+/- 43,625)
test bench_water2               ... bench:   1,686,064 ns/iter (+/- 11,793)
test bench_water3               ... bench:      88,216 ns/iter (+/- 14,474)
test bench_water3b              ... bench:       6,393 ns/iter (+/- 27)
test bench_water4               ... bench:     497,426 ns/iter (+/- 17,692)
test bench_water_huge           ... bench:  30,083,359 ns/iter (+/- 746,826)
test bench_water_huge2          ... bench:  61,794,920 ns/iter (+/- 2,600,562)

For some shapes we are now within a percent of C++.

  • NULL Node

In which we eliminate the "is this a pointer to NULL" branch conditional in several pieces of linked list code.

Instead of NULL being 0xffffffff..., what if it was 0x00000000... and what if dereferencing a null pointer to a node was... well... perfectly valid? What if there was, in fact, an object at address 0? In our case we are using a vector, not RAM itself, so we can actually put an object at index 0. We can put a node there. We can call it the 'null node'. It is just like any other node, except that it is never part of a ring.

NULL is typically 'bad' in that it's encoded as a pointer to memory address 0. Computers don't like people poking around at address 0, so they typically stop the program (segfault / crash) if it tries.

Here, however, we don't have pointers to nodes. We have indexes into a vector. The 'null pointer' is still 0, but we can actually stick an extra node in at index 0 so that the program won't crash when dereferencing it.

Why does this help? Well, a normal C++ or JavaScript program will have a lot of conditional branches and if statements to make sure a pointer isn't null before accessing whatever it points to, to avoid crashing.

But those branches make the program slower; CPUs run fastest on branch-free code. So the null node allows us to eliminate branches. We don't care if remove_node sets the null node's next_idx to some number, so we don't need a conditional branch inside remove_node.

What does this do? Well, it eliminates two branch conditions from remove_node()!

    let prz = node!(self, p).prevz_idx;
    let nxz = node!(self, p).nextz_idx;
    if prz != NULL { nodemut!(self, prz).nextz_idx = nxz; }
    if nxz != NULL { nodemut!(self, nxz).prevz_idx = prz; }

becomes

    let prz = node!(self, p).prevz_idx;
    let nxz = node!(self, p).nextz_idx;
    nodemut!(self, prz).nextz_idx = nxz;
    nodemut!(self, nxz).prevz_idx = prz;

So if someone removes a node adjacent to the null node, the null node's nextz_idx and prevz_idx are modified... but it doesn't matter, because they are never used.

  • NULL node optimization in is_ear_hashed

    while p != NULL && node!(ll, p).z >= min_z {
        if earcheck(&a, &b, &c, prevref!(ll, p), &ll.nodes[p], nextref!(ll, p)) {
            return false;
        }
        p = node!(ll, p).prevz_idx;
    }

can be changed, eliminating a conditional check that is computed millions of times for a polygon with a few thousand points. Here is how:

    nodemut!(ll, NULL).z = min_z - 1;
    while node!(ll, p).z >= min_z {
        if earcheck(&a, &b, &c, prevref!(ll, p), &ll.nodes[p], nextref!(ll, p)) {
            return false;
        }
        p = node!(ll, p).prevz_idx;
    }

What we did was allow the code to dereference the NULL 'pointer': NULL is not a pointer, it's an index into a vector, and we have put a 'null node' at vector position 0, so vec[NULL] becomes vec[0], which is totally valid. Setting the null node's z to min_z - 1 guarantees the loop still stops when it walks off the end of the z-ordered list. This eliminates a branch, which lets the CPU go faster, as it only has one thing to check per iteration instead of two.

  • iterator optimization

In which we change complicated for loops into chained iteration with filters, folds, and other functional language features.

My old iterator, a first attempt, did several conditional checks per step: whether it was done, whether it had reached a null node, whether it had reached the end, whether it was a 'single node' chain, etc.

I changed it to preload itself with data, called 'pending_result'; on each iteration it returns the pending result and advances the 'current node' to the next node. Now we only have one conditional: if current is the end, make the pending result None. A sketch follows.
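
A sketch of that shape, assuming a minimal node layout; the real iter_range() carries more state than this:

struct Node {
    next_idx: usize,
    // x, y, etc. omitted
}

struct NodeIter<'a> {
    nodes: &'a [Node],
    pending_result: Option<usize>, // preloaded: the node to hand out next
    end: usize,                    // index at which the cycle stops
}

impl<'a> Iterator for NodeIter<'a> {
    type Item = &'a Node;
    fn next(&mut self) -> Option<&'a Node> {
        let cur = self.pending_result?;
        let node = &self.nodes[cur];
        // the single conditional: when the next node is the end,
        // the pending result becomes None and iteration stops
        self.pending_result = match node.next_idx == self.end {
            true => None,
            false => Some(node.next_idx),
        };
        Some(node)
    }
}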

  • loop into iterator

After optimizing my iterator as above, I revisited some of the loop vs. iterator tests from before:

fn intersects_polygon(ll: &LinkedLists, a: &Node, b: &Node) -> bool {
    let mut p = a.idx;
    loop {
        if ll.nodes[p].i != a.i
            && next!(ll, p).i != a.i
            && ll.nodes[p].i != b.i
            && next!(ll, p).i != b.i
            && pseudo_intersects(&ll.nodes[p], nextref!(ll, p), a, b)
        {
            return true;
        }
        p = next!(ll, p).idx;
        if p == a.idx {
            break;
        };
    }
    false
}

running 7 tests
test bench_water                ... bench:   1,990,314 ns/iter (+/- 44,007)
test bench_water2               ... bench:   1,503,751 ns/iter (+/- 180,270)
test bench_water3               ... bench:      70,347 ns/iter (+/- 469)
test bench_water3b              ... bench:       7,712 ns/iter (+/- 27)
test bench_water4               ... bench:     499,915 ns/iter (+/- 17,212)
test bench_water_huge           ... bench:  30,275,308 ns/iter (+/- 659,648)
test bench_water_huge2          ... bench:  62,049,947 ns/iter (+/- 1,785,038

My iterator version is now basically the same speed, but with a bit less code, and it's easier to read (if you know functional programming basics):

    ll.iter_range(a.idx..a.idx).any(|p| {
        p.i != a.i
            && next!(ll, p.idx).i != a.i
            && p.i != b.i
            && next!(ll, p.idx).i != b.i
            && pseudo_intersects(&p, nextref!(ll, p.idx), a, b)
    })

running 7 tests
test bench_water                ... bench:   1,972,265 ns/iter (+/- 63,654)
test bench_water2               ... bench:   1,496,676 ns/iter (+/- 19,040)
test bench_water3               ... bench:      69,862 ns/iter (+/- 524)
test bench_water3b              ... bench:       7,666 ns/iter (+/- 23)
test bench_water4               ... bench:     495,992 ns/iter (+/- 5,783)
test bench_water_huge           ... bench:  30,045,589 ns/iter (+/- 539,955)
test bench_water_huge2          ... bench:  60,316,824 ns/iter (+/- 2,158,995)

  • Iterating through the vector of nodes instead of through the linked list indexes

As this 'simulates' a linked list, we can and do iterate through using the next / previous pointers (which are indexes into a vector).

However, once in a while we need to do something that works better by just iterating through the vector instead. For example, when we need to touch every node to translate the polygon so minx and miny become 0, we do that over the vector of nodes rather than traversing the linked list; see the sketch below.

As you may know, a vector in RAM is basically a contiguous array, and most CPU and memory circuitry is designed for doing fast work on arrays of data, for various reasons. I suppose in theory you could design a computer that was faster at traversing linked lists, but most machines really have optimization built in for arrays.
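
A sketch of that translate-to-origin pass, with illustrative field names rather than the port's exact Node:

struct Node {
    x: f64,
    y: f64,
    next_idx: usize, // linked-list link, unused during this pass
}

fn translate_to_origin(nodes: &mut [Node], minx: f64, miny: f64) {
    // one sequential sweep over contiguous memory: cache- and
    // prefetcher-friendly, no pointer chasing
    for n in nodes.iter_mut() {
        n.x -= minx;
        n.y -= miny;
    }
}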

  • Bounding Box

In which we compute the bounding box directly from the vector holding the node information, instead of iterating through the linked list.

During initial linked list setup we have to create a bounding box, find the minimum x and y, and the maximum x and y, for all the points. We can do this by iterating through the original data, or by iterating through nodes.

rustfmt expands the first option a huge amount, taking up many lines of space, because we have to skip(1) to hit the y data and clone() to deal with the way the data is stored... or...

    let maxx = ll.nodes.iter().fold(std::f64::MIN, |m, n| f64::max(m, n.x));
    let maxy = ll.nodes.iter().fold(std::f64::MIN, |m, n| f64::max(m, n.y));
    let minx = ll.nodes.iter().fold(std::f64::MAX, |m, n| f64::min(m, n.x));
    let miny = ll.nodes.iter().fold(std::f64::MAX, |m, n| f64::min(m, n.y));
    let invsize = calc_invsize(minx, miny, maxx, maxy);

This is much shorter than the C++, the previous Rust, the JavaScript, etc., and it runs in basically the same time.

Again we are using the vector to our advantage: instead of iterating through the data as a linked list, we iterate over it as a vector, which CPU and RAM circuitry handle well.

test bench_water                ... bench:   1,965,894 ns/iter (+/- 47,469)
test bench_water2               ... bench:   1,493,086 ns/iter (+/- 13,284)
test bench_water3               ... bench:      67,764 ns/iter (+/- 398)
test bench_water3b              ... bench:       7,581 ns/iter (+/- 27)
test bench_water4               ... bench:     492,737 ns/iter (+/- 7,801)
test bench_water_huge           ... bench:  29,906,773 ns/iter (+/- 525,834)
test bench_water_huge2          ... bench:  60,744,602 ns/iter (+/- 2,544,614)

  • Vector::with_capacity

In which we pre-compute vector capacity, saving a few 'resize'/reallocate operations.

We store two big chunks of data in Vectors, the nodes and the resulting triangles.

The nodes, well, they are nodes. They have point coordinates, and links to other nodes, and the z-order hash value.

The triangles? Well, they are just indexes into the original data.

But we construct both of these vectors at runtime. We have to grow the vectors as the data is added to them, first during construction of the nodes, then as the nodes are cut off by the earcutter, we have to grow the triangles vector.

Or do we? Rust vectors, like vectors in most computer languages, have 'capacity'. You can reserve a big chunk of RAM for your vector even if you don't use it all right away. Then subsequent 'push' operations are really just filling in RAM you had already reserved for yourself.

We can do this with our node data... OK, I did that a while back and forgot to even mention it.

But... we can also do it for our triangles. And we get a tiny little boost, strangely enough on the smaller shapes (a capacity sketch follows the numbers):

running 7 tests
test bench_water                ... bench:   1,972,024 ns/iter (+/- 53,100)
test bench_water2               ... bench:   1,483,504 ns/iter (+/- 11,269)
test bench_water3               ... bench:      66,898 ns/iter (+/- 518)
test bench_water3b              ... bench:       7,276 ns/iter (+/- 6)
test bench_water4               ... bench:     489,625 ns/iter (+/- 2,844)
test bench_water_huge           ... bench:  29,874,170 ns/iter (+/- 494,521)
test bench_water_huge2          ... bench:  60,463,148 ns/iter (+/- 2,343,911)
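
A sketch of the pre-sizing idea; the capacity arithmetic is mine (a simple polygon with n vertices triangulates into at most n - 2 triangles), not a quote from the port:

fn reserve_buffers(n_vertices: usize) -> (Vec<f64>, Vec<usize>) {
    // node coordinates: x and y per vertex, reserved up front
    let nodes: Vec<f64> = Vec::with_capacity(n_vertices * 2);
    // triangles: 3 indexes per triangle, at most n - 2 triangles,
    // so later pushes never need to reallocate
    let triangles: Vec<usize> = Vec::with_capacity(3 * n_vertices.saturating_sub(2));
    (nodes, triangles)
}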

  • index_curve

In which we try to speed up the space filling curve code, and fail.

Tried to optimize it with vector iteration; failed.

index_curve iterates through a ring, z-indexing each point.

The weird thing is that as the algorithm progresses, the number of nodes still in use shrinks.

The issue is that if you use vector iteration, you wind up iterating over a bunch of nodes that have already been removed from the polygon, already cut into ears. So overall it's slower.

Even if you 'mark' removed nodes, it winds up being slower.

Not much slower, and it only shows on the huge shapes, but still: not worth it.

  • find_hole_bridge into iterator... uhm wow

In which we replace another for loop with iterators + functional code.

This is a very complicated function that does something conceptually simple: find a line between the left-most point of a hole and the outer polygon, so the two can be connected.

The original code is something like this:

    loop {
        let (px, py) = (&ll.nodes[p].x, &ll.nodes[p].y);
        if (hy <= py) && (hy >= next!(ll, p).y) && (next!(ll, p).y != py) {
            let x = px + (hy - py) * (next!(ll, p).x - px) / (next!(ll, p).y - py);

            if (x <= hx) && (x > qx) {
                qx = x;

Into an iterator, after a few hours experimentation and work with iterators:

    let calcx = |p: &Node| p.x + (hy - p.y) * (next!(ll, p.idx).x - p.x) / (next!(ll, p.idx).y - p.y);
    for p in
        ll.iter_range(p..outer_node)
        .filter(|p| hy <= p.y)
        .filter(|p| hy >= next!(ll, p.idx).y)
        .filter(|p| next!(ll, p.idx).y != p.y)
        .filter(|p| calcx(p) <= hx )
    {

the speedup on huge and huge2 is ... apparently significant.

test bench_water                ... bench:   1,941,360 ns/iter (+/- 87,985)
test bench_water2               ... bench:   1,469,714 ns/iter (+/- 9,538)
test bench_water3               ... bench:      68,554 ns/iter (+/- 534)
test bench_water3b              ... bench:       7,269 ns/iter (+/- 34)
test bench_water4               ... bench:     487,715 ns/iter (+/- 7,659)
test bench_water_huge           ... bench:  28,610,478 ns/iter (+/- 803,598)
test bench_water_huge2          ... bench:  57,362,865 ns/iter (+/- 2,190,260)

Pulling the same trick with the bottom of the same function... uhm.. yeah.

before:

    while p != stop {
        let (px, py) = (&ll.nodes[p].x, &ll.nodes[p].y);
        if (hx >= px) && (px >= mx) && (hx != px) && point_in_triangle(&n1, &mp, &n2, &ll.nodes[p])
        {
            tan = (hy - py).abs() / (hx - px); // tangential
            if ((tan < tan_min) || ((tan == tan_min) && (px > &ll.nodes[m].x)))
                && locally_inside(ll, &ll.nodes[p], &hole)
            {
                m = p;
                tan_min = tan;
            }
        }
        p = next!(ll, p).idx;
    }
    return m;

after:

    let calctan = |p: &Node| (hy - p.y).abs() / (hx - p.x); // tangential
    ll.iter_range(p..m)
        .filter(|p| hx > p.x && p.x >= mp.x)
        .filter(|p| point_in_triangle(&n1, &mp, &n2, &p))
        .fold((m, std::f64::INFINITY), |(m, tan_min), p| {
            if ((calctan(p) < tan_min) || (calctan(p) == tan_min && p.x > ll.nodes[m].x))
                && locally_inside(ll, &p, &ll.nodes[hole])
            {
                (p.idx, calctan(p))
            } else {
                (m, tan_min)
            }
        })
        .0

.... errr well.

running 7 tests
test bench_water                ... bench:   1,886,095 ns/iter (+/- 48,910)
test bench_water2               ... bench:   1,455,139 ns/iter (+/- 8,578)
test bench_water3               ... bench:      65,432 ns/iter (+/- 851)
test bench_water3b              ... bench:       7,236 ns/iter (+/- 30)
test bench_water4               ... bench:     504,135 ns/iter (+/- 29,033)
test bench_water_huge           ... bench:  26,974,404 ns/iter (+/- 569,252)
test bench_water_huge2          ... bench:  53,630,310 ns/iter (+/- 1,818,475)

We are now within 50,000 ns of optimized C++ for bench_water, or about 0.05 ms (5%); within 600,000 ns (0.6 ms) for water_huge; and, um, we beat C++ on water_huge2.

For the full, final benchmark comparison to C++, see README.md.

  • Thanks

Thanks for reading.

Inspired by Michael Abrash's book from the 1990s about optimization.