Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote dealloc refactor. #138

Merged
merged 7 commits into from
Mar 10, 2020
Merged

Remote dealloc refactor. #138

merged 7 commits into from
Mar 10, 2020

Conversation

mjp41
Copy link
Member

@mjp41 mjp41 commented Mar 9, 2020

This performs a small refactor on the remote dealloc path to split the common case out.

@mjp41
Copy link
Member Author

mjp41 commented Mar 9, 2020

@darach, @Licenser, this might improve your performance. The perf trace you sent me showed 2.68% on this path, so hopefully you might see a difference.

@Licenser
Copy link

Licenser commented Mar 9, 2020

Nice! We’ll bench this tomorrow:) thanks!

@mjp41 mjp41 requested a review from davidchisnall March 9, 2020 19:10
mjp41 added 2 commits March 9, 2020 20:13

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Change remote to count down 0, so fast path does not need a constant.

Use signed value so that branch does not depend on addition.
@mjp41
Copy link
Member Author

mjp41 commented Mar 9, 2020

On x64 Linux this is now, 15 instructions to the branch to decide it is a remote dealloc, and then a further 13 instructions (including one branch, not taken in common case) to complete the remote deallocation. If it actually posts it is a lot more instructions, but that is very infrequent.

Previously, it was a tangled mess of assembly, and probably twice as many instructions.

src/mem/alloc.h Outdated Show resolved Hide resolved
@mjp41 mjp41 merged commit 76eaf1a into master Mar 10, 2020
@Licenser
Copy link

Licenser commented Mar 10, 2020

I know this is merged already but I wanted to add some results from our side:

old

#[Mean       =    109568.44, StdDeviation   =     96808.94]
#[Max        =       430079, Total count    =     68218199]
#[Buckets    =           30, SubBuckets     =         3968]


Throughput: 409.3 MB/s

pre

new

#[Mean       =    109794.14, StdDeviation   =     92536.16]
#[Max        =       374783, Total count    =     69517595]
#[Buckets    =           30, SubBuckets     =         3968]


Throughput: 417.1 MB/s

post

@mjp41
Copy link
Member Author

mjp41 commented Mar 10, 2020

Interesting, so the perf numbers show it is spending much less time in remote dealloc, 2.68% -> 0.92%, but this hasn't translated into a throughput win for you? How much noise is there in the "Throughput" number?

@Licenser
Copy link

Sorry I did a copy & paste oopsie, when pasting the numbers the throughput for old and new were swapped

@Licenser
Copy link

the new code is ~2% faster

@mjp41
Copy link
Member Author

mjp41 commented Mar 10, 2020

Phew, you had me questioning my ability to guess asm speed (which obviously is pretty questionable ;-) ). 2% that is pretty good. More than I expected.

@Licenser
Copy link

Ja 2% is really nice! I think snmalloc is a really really good fit for tremor :D

@mjp41 mjp41 deleted the optimise_remote_dealloc branch March 10, 2020 11:03
@Licenser
Copy link

Licenser commented Mar 10, 2020

adding a latency plot:

image

(edit: fixed the units)

@darach
Copy link

darach commented Mar 10, 2020

The profile is good, but the Latency axis should read nanos. We use HDR Histogram https://hdrhistogram.github.io/HdrHistogram/plotFiles.html to plot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants