Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the number of memory allocations in def-use #4904

Merged
merged 16 commits into from
Sep 11, 2024
Merged

Conversation

asl
Copy link
Contributor

@asl asl commented Sep 10, 2024

As stated in #4872 we are seeing high peak memory usage during def-use run. def-use creates lots of temporary objects and expects GC to clean them. However, it seems this does not happen reliably and therefore the peak memory usage could be very high. Especially given that def-use internal state objects (AllDefinitions) could be quite large. This PR tries to improve this situation in many aspects:

  • Shrink the lifetime of transient objects (this seems to be enough to reduce the memory consumption in the downstream testcase from 16+ Gb down to 2 Gb)
  • Reduce the amount of leaks - less work for GC and in many cases less allocation as we are allocating transient things on stack
  • Some improvements in internal structures as well

As a result:

  • The peak memory allocation in Elevated memory usage in def-use #4872 dropped down to 2 Gb (due to AllDefinitions being cleared)
  • We are seeing nice improvements in gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512 testcase:
  1. 2% runtime improvements with GC both on and off:
Command Mean [s] Min [s] Max [s] Relative
gtestp4c-gc-main --gtest_filter=P4CParserUnroll.switch_20160512 4.506 ± 0.088 4.365 4.659 1.02 ± 0.03
gtestp4c-gc --gtest_filter=P4CParserUnroll.switch_20160512 4.418 ± 0.086 4.194 4.502 1.00
Command Mean [s] Min [s] Max [s] Relative
gtestp4c-nogc-main --gtest_filter=P4CParserUnroll.switch_20160512 3.041 ± 0.024 3.004 3.099 1.01 ± 0.01
gtestp4c-nogc --gtest_filter=P4CParserUnroll.switch_20160512 2.996 ± 0.032 2.964 3.095 1.00

But notice that GC takes 1.5 seconds out of overall 4.5 seconds runtime (!)

  1. We reduced the amount of memory allocations by 10% as well.
    Before:
    image

After:
image

Notice that # of allocations reduced from 27.6M down to 25.7M, the peak memory consumption also reduced from 3.22 GB down to 3.15 Gb

@asl asl added core Topics concerning the core segments of the compiler (frontend, midend, parser) run-validation Use this tag to trigger a Validation CI run. run-sanitizer Use this tag to run a Clang+Sanitzers CI run. labels Sep 10, 2024
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
locations are immutable otherwise

Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
…field

Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
@asl
Copy link
Contributor Author

asl commented Sep 10, 2024

Overall, the pass manager is pretty hostile to GC as it forces roots to be held for very long time. Here is why:

  • Consider a pass with large internal state (say def-use)
  • If the internal state is not explicitly cleared (and clear must be "deep" with memory release and zeroing) then it will only be cleared when the corresponding pass is destructed...
  • This only happens when PassManager itself is deallocated. Essentially – at the very end of the frontend.
  • The situation is more severe with midends / backends as these pass managers are there till the end of the p4c run.

@asl asl requested a review from grg September 10, 2024 00:07
Signed-off-by: Anton Korobeynikov <anton@korobeynikov.info>
@@ -0,0 +1,273 @@
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can push #4713 forward if flat_map is required for this PR. What about other, currently available implementations?

Copy link
Contributor Author

@asl asl Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I used flat_map with inlined storage for WithFieldsLocation (https://github.com/p4lang/p4c/pull/4904/files#diff-f9cda7b4f8eff14d76f4290655d1678d9a25e18f55c1b24cded8239cab74eea6R124) Here we were able to have space for 4 map entries inline (w/o additional memory allocation) while keeping sizeof(WithFieldsLocation) == 112). WithFieldsLocation are used essentially for structs and usually structs do not contain lots of entries (headers might contain much more, true, but still they are tens not thousands). I do not see any performance implication from this change, but we reduced memory allocations here for quite some amount.

I am not aware about viable alternatives currently.

Copy link
Contributor

@ChrisDodd ChrisDodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall

friend class StorageFactory;
WithFieldsLocation(const IR::Type *type, cstring name) : StorageLocation(type, name) {}
flat_map<cstring, const StorageLocation *, std::less<>,
absl::InlinedVector<std::pair<cstring, const StorageLocation *>, 4>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very surprised that a flat_map is faster than an hvec_map here -- for that to be true, almost all the structs/headers in the program have no more than 2-3 fields?

Copy link
Contributor Author

@asl asl Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about "fast". It is about memory consumption. Here we do not allocate memory at all for any structs with 4 or less fields. And we resort to memory allocation for anything more. But yes, normally small flat maps have comparable speed to hash maps and, as far as I can see, most structs are small. There are certainly large ones but they are much rare. And we normally do not have struct larger than, say, 16 fields.

@asl asl added this pull request to the merge queue Sep 11, 2024
Merged via the queue into p4lang:main with commit fcc23a9 Sep 11, 2024
18 checks passed
@asl asl deleted the def-use-mem branch September 11, 2024 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Topics concerning the core segments of the compiler (frontend, midend, parser) run-sanitizer Use this tag to run a Clang+Sanitzers CI run. run-validation Use this tag to trigger a Validation CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants