Speed-up profile management. #729

Merged 1 commit into google:main on Nov 2, 2022
ghemawat (Contributor) commented:

Time taken for "top" listing for a large (34MB) profile drops by 15%:

```
name    old time/op  new time/op  delta
Top-12   13.2s ± 3%   11.2s ± 2%  -14.72%  (p=0.008 n=5+5)
```

Furthermore, the time taken to merge/diff 34MB profiles drops by 53%:

```
name         old time/op  new time/op  delta
Merge/2-12   7.74s ± 2%   3.63s ± 2%  -53.09%  (p=0.008 n=5+5)
```

Details follow:

The cost of a trivial merge was very high (4s for 34MB profile).
We now just skip such a merge and save the 4s.
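The trivial-merge shortcut amounts to an early return before the general merge path. A minimal sketch, where `profile`, `merge`, and `combine` are stand-ins for illustration, not pprof's actual API:

```go
package main

import "fmt"

// profile is a stand-in for the real Profile type.
type profile struct{ samples int }

// merge is a placeholder for the expensive general merge machinery.
func merge(ps []*profile) *profile {
	total := 0
	for _, p := range ps {
		total += p.samples
	}
	return &profile{samples: total}
}

// combine skips the merge entirely when there is only one profile,
// mirroring the trivial-merge shortcut described above.
func combine(ps []*profile) *profile {
	if len(ps) == 1 {
		return ps[0] // trivial case: no merge work, no copy
	}
	return merge(ps)
}

func main() {
	p := &profile{samples: 5}
	fmt.Println(combine([]*profile{p}) == p) // same object returned
}
```

The shortcut pays off precisely because a "merge" of a single 34MB profile previously rebuilt every sample, location, and key from scratch.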

* Only create a Sample the first time a sample key is seen.
* Faster ID to *Location mapping by creating a dense array that handles
  small IDs (the common case).
* Faster sampleKey generation during merging by emitting binary encoding
  of numbers and using a strings.Builder instead of repeated fmt.Sprintf.
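The key-building bullet can be sketched as follows. The function name and exact encoding here are assumptions (the real key also covers labels and values), but the strings.Builder-plus-varint pattern is the point: one growing buffer instead of an allocation per fmt.Sprintf call, and varints are self-delimiting so concatenated IDs cannot collide.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// sampleKey builds a map key from location IDs by appending each ID's
// varint encoding to a strings.Builder. This avoids the per-field
// allocations of repeated fmt.Sprintf calls.
func sampleKey(locIDs []uint64) string {
	var sb strings.Builder
	var buf [binary.MaxVarintLen64]byte
	for _, id := range locIDs {
		n := binary.PutUvarint(buf[:], id)
		sb.Write(buf[:n])
	}
	return sb.String()
}

func main() {
	k1 := sampleKey([]uint64{1, 2, 3})
	k2 := sampleKey([]uint64{1, 2, 3})
	k3 := sampleKey([]uint64{1, 2, 4})
	fmt.Println(k1 == k2, k1 == k3) // true false
}
```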

The preceding changes drop the cost of merging two copies of the same 34MB
profile by 53%:

```
name        old time/op  new time/op  delta
Merge/2-12   7.74s ± 2%   3.63s ± 2%  -53.09%  (p=0.008 n=5+5)
```

* Use temporary storage when decoding to reduce allocations.
* Pre-allocate space for all locations in one shot when creating a Profile.
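Pre-allocating in one shot can look like the sketch below (`location` is a hypothetical stand-in for the real type): one backing array for all the values and one slice for the pointers, rather than a heap allocation per location.

```go
package main

import "fmt"

// location is a stand-in for the real Location type.
type location struct{ id uint64 }

// decodeLocations allocates all locations in a single backing array
// and hands out pointers into it, instead of one allocation per item.
func decodeLocations(n int) []*location {
	backing := make([]location, n) // single allocation for the values
	locs := make([]*location, n)   // single allocation for the pointers
	for i := range backing {
		backing[i].id = uint64(i + 1)
		locs[i] = &backing[i]
	}
	return locs
}

func main() {
	locs := decodeLocations(3)
	fmt.Println(len(locs), locs[2].id) // 3 3
}
```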

The preceding changes speed up decoding by 13% and encoding by 7%:

```
name      old time/op  new time/op  delta
Parse-12   2.00s ± 4%   1.74s ± 3%  -12.99%  (p=0.008 n=5+5)
Write-12   679ms ± 2%   629ms ± 1%   -7.44%  (p=0.008 n=5+5)
```

When used in interactive mode, each command needs to make a fresh copy
of the profile since a command may mutate the profile. This used to be
done by serializing/compressing/decompressing/deserializing the
profile per command.  We now store the original data in serialized
uncompressed form so that we just need to deserialize the profile per
command. This change can be seen in the improvement in the time needed
to generate the "top" output:

```
name    old time/op  new time/op  delta
Top-12   13.2s ± 3%   12.4s ± 0%  -5.84%  (p=0.008 n=5+5)
```
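The store-serialized, decode-per-command pattern can be sketched as below, using encoding/gob as a stand-in for the pprof protobuf codec; the type and method names are assumptions for illustration. Each command decodes its own mutable copy from the cached uncompressed bytes, so the compress/decompress round trip disappears.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// profile is a stand-in type; the real code deserializes the pprof
// protobuf, but the pattern is the same.
type profile struct {
	Samples []int64
}

// session keeps the profile in serialized *uncompressed* form, so each
// command pays only a decode, not compress+encode+decompress+decode.
type session struct {
	raw []byte
}

func newSession(p *profile) (*session, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(p); err != nil {
		return nil, err
	}
	return &session{raw: buf.Bytes()}, nil
}

// freshCopy gives each command its own mutable copy of the profile.
func (s *session) freshCopy() (*profile, error) {
	var p profile
	err := gob.NewDecoder(bytes.NewReader(s.raw)).Decode(&p)
	return &p, err
}

func main() {
	s, _ := newSession(&profile{Samples: []int64{1, 2, 3}})
	a, _ := s.freshCopy()
	b, _ := s.freshCopy()
	a.Samples[0] = 99          // mutating one command's copy...
	fmt.Println(b.Samples[0])  // ...leaves the other untouched: 1
}
```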

* Avoid filtering cost when there are no filters to apply.
* Avoid location munging when there are no tag roots or leaves to add.
* Faster stack entry pruning by caching the result of demangling and
  regexp matching for a given function name.

```
name    old time/op  new time/op  delta
Top-12   13.2s ± 3%   12.3s ± 2%  -6.33%  (p=0.008 n=5+5)
```
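The caching bullet above amounts to a per-name memo table in front of the regexp match (illustrative sketch only; the real code also caches demangling results, and these names are assumptions):

```go
package main

import (
	"fmt"
	"regexp"
)

// matchCache memoizes the regexp-match result per function name, so a
// name that appears on many stack frames is tested only once.
type matchCache struct {
	re    *regexp.Regexp
	cache map[string]bool
}

func newMatchCache(pattern string) *matchCache {
	return &matchCache{
		re:    regexp.MustCompile(pattern),
		cache: map[string]bool{},
	}
}

func (c *matchCache) matches(name string) bool {
	if hit, ok := c.cache[name]; ok {
		return hit // cached: skip the regexp entirely
	}
	m := c.re.MatchString(name)
	c.cache[name] = m
	return m
}

func main() {
	c := newMatchCache(`^runtime\.`)
	fmt.Println(c.matches("runtime.mallocgc"), c.matches("main.main")) // true false
	fmt.Println(len(c.cache))                                          // 2 cached results
}
```

This trades a small map for repeated regexp evaluation, which wins because the set of distinct function names is far smaller than the number of stack entries.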

* Added benchmarks for profile parsing, serializing, merging.
* Added benchmarks for a few web interface endpoints.
* Added a large profile (1.2MB) to proftest/testdata. This profile is from
  a synthetic program that contains ~23K functions that are exercised
  by a combination of stack traces so that we end up with a larger
  profile than typical. Note that the benchmarks above are from an
  even larger profile (34MB) from a real system, but that profile is
  too big to be added to the repository.
aalexand merged commit 76f304f into google:main on Nov 2, 2022.
ghemawat deleted the speedup branch on Nov 2, 2022.