Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counting VM-allocated pages into heap size. #866

Merged
merged 8 commits into from
Jul 20, 2023
Merged

Conversation

wks
Copy link
Collaborator

@wks wks commented Jul 14, 2023

Some VMs allocate memory outside the MMTk heap, using malloc or other allocation methods. Those memory can usually be reclaiming by the finalizers of dead object in the MMTk heap. This commit allows the VM to report the amount of such off-heap memory so that MMTk can trigger GC more promptly to reclaim such memory.

to-do list:

  • Ensure it works with dynamic heap size.

Related work

The "malloc_counted_size" feature: Introduced in #608, that feature was intended for implementing malloc/free in MMTk's spaces, although the current implementation simply wraps the malloc function from libc. This PR, on the other hand, is agnostic of how the VM allocates memory outside the MMTk heap. This PR is useful if the VM already does similar accounting. For example, Ruby accounts for memory allocated by xmalloc, a wrapper of malloc.

wks added 2 commits July 14, 2023 17:35
Some VMs allocate memory outside the MMTk heap, using `malloc` or other
allocation methods.  Those memory can usually be reclaiming by the
finalizers of dead object in the MMTk heap.  This commit allows the VM
to report the amount of such off-heap memory so that MMTk can trigger GC
more promptly to reclaim such memory.
@wks wks marked this pull request as ready for review July 17, 2023 08:42
@wks
Copy link
Collaborator Author

wks commented Jul 19, 2023

MSRV-related tests are failing, but we already worked around that issue previously.

@wks wks requested a review from qinsoon July 19, 2023 05:33
Comment on lines 115 to 116
/// Return the amount of memory (in bytes) the VM allocated (but not released yet) outside the
/// MMTk heap which can be released by doing GC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Return the amount of memory (in bytes) the VM allocated (but not released yet) outside the
/// MMTk heap which can be released by doing GC.
/// Return the amount of memory (in bytes) the VM allocated outside the
/// MMTk heap and would like to include in the MMTk heap size. Usually
/// it is memory that is kept alive by in-heap objects.

The words about 'not released yet' and 'can be released' are confusing. I would rather use 'kept alive by heap objects'. Free feel to change the wording.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. "Kept alive by heap objects" sounds better.

/// elements using `malloc`, then those buffers should be included in this amount. When the GC
/// finds such an array dead, its finalizer shall `free` the buffer and reduce this amount.
///
/// This function is a hint for the MMTk core to trigger GC, therefore does not have to be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not merely a hint. We use it to decide our actual heap size, and metrics like min heap for benchmarks. I think we can just remove this paragraph.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the "min heap size" is an empirical value, too. It depends on the plan, the number of CPUs (which determines the number of GC threads) and other specifics of the machine the program is running on. I think it is OK to say it is a hint, given that the exact set of "objects kept alive by heap objects" is not precisely defined. Another important point is that the VM should not spend too much time computing the exact value because this function is called very often (by gc_poll). An approximate value should be good enough for MMTk to trigger GC promptly and keep the memory footprint under control.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. The value does not need to be precise for the sake of performance. But I woudn't say it is a 'hint'. A 'hint' means something like an suggestion. Like Java's System.gc() is a hint, we may do or may not do a GC. Julia's --heap-size-hint is a hint, the actual heap size can be larger than the given value. But in our case, the number returned by the method is not a hint, and we do count it into our heap size and we will always count it -- that is the contract of this method, the memory is counted into our heap size.

You have another paragraph talking about performance. If you would like to emphasis further on the tradeoff between performance and being precise, you can expand that section.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll remove the "hint" part and merge it with the paragraph about performance.

/// VM uses `malloc` and the implementation of `malloc` is opaque to the VM), the VM binding
/// can simply return the total number of bytes of those off-heap objects as an approximation.
///
/// The default implementation which returns 0 should also work if the VM only allocates a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph can be removed as well. You just need some comments in the default impl, to say that by default we assume VM does not allocate off-heap memory, or does not desire to include the off-heap memory in MMTk heap size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's reasonable.

/// MMTk heap which can be released by doing GC.
///
/// MMTk core will add this amount to the amount of memory allocated or reserved in MMTk
/// spaces, and will trigger a GC if the sum exceeds a certain limit.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to trigger a GC depends on the internal GC trigger code. You could simply say that MMTk core will consider the reported memory as part of MMTk heap for the purpose of heap size accounting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is the GC trigger that determines when to trigger GC, but I want the VM binding developer to know that the return value of this function is additive (rather than multiplicative or having other non-linear relation) to MMTk core's "used pages" when considered to trigger GC, so that the developer can have a basic idea of what effect the return value of this function will have on triggering GC. At least the current implementation is additive for both fixed heap size and the MemBalancer. The MemBalancer changes the heap limit dynamically, but the way to compute occupied pages is still adding the used pages of MMTk spaces and the result of this function. I can emphasise that being additive is only a detail of the current implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want the VM binding developer to know that the return value of this function is additive (rather than multiplicative or having other non-linear relation) to MMTk core's "used pages" when considered to trigger GC

I think this is incorrect. GC triggering does not necessarily relate to heap sizes. We could implement a GC trigger that uses allocation volume -- in which case, we do not directly use the vm allocated pages returned by the method, we may use a diff of value or simply ignore it. Sticky immix does not trigger nursery GC based on heap sizes, LXR has similar behaviors as well.

self.immix.immix_space.get_pages_allocated() > self.options().get_max_nursery_pages();

This is what I suggest about the contract of this method.

MMTk core will consider the reported memory as part of MMTk heap for the purpose of heap size accounting.

But we do not guarantee that GC triggering is related with heap size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the second paragraph and actually give recommendations for VMs about what to include instead of saying "MMTk doesn't specify what to include except that number is considered by the GC trigger". In this way, newer GC triggers can decide whether to take that reported number into consideration when triggering GC.

Copy link
Member

@qinsoon qinsoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/// elements using `malloc`, then those buffers should be included in this amount. When the GC
/// finds such an array dead, its finalizer shall `free` the buffer and reduce this amount.
///
/// If possible, the VM should account off-heap memory by pages. That is, count the number of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// If possible, the VM should account off-heap memory by pages. That is, count the number of
/// If possible, the VM should account off-heap memory in pages. That is, count the number of

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe "in terms of pages used" is more appropriate. It is more verbose though. Maybe "as pages used" instead? Same for comment below

/// page granularity, the occupied pages (instead of individual objects) determine the memory
/// footprint of a process, and how much memory MMTk spaces can obtain from the OS.
///
/// However, if the VM is incapable of accouting off-heap memory by pages (for example, if the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// However, if the VM is incapable of accouting off-heap memory by pages (for example, if the
/// However, if the VM is incapable of accounting off-heap memory in pages (for example, if the

/// # Performance note
///
/// This function will be called when MMTk polls for GC. It happens every time the MMTk
/// allocators allocated a certain amount of memory, usually one or a few blocks. Because this
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// allocators allocated a certain amount of memory, usually one or a few blocks. Because this
/// allocators have allocated a certain amount of memory, usually one or a few blocks. Because this

let used_pages = self.get_used_pages();
let collection_reserve = self.get_collection_reserved_pages();
let vm_live_bytes = <Self::VM as VMBinding>::VMCollection::vm_live_bytes();
let vm_live_pages = conversions::bytes_to_pages_up(vm_live_bytes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a one line comment here just saying that the live bytes may not be actual pages?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's reasonable because the live bytes may be occupied bytes instead of actual pages, and may sometimes be an approximate value.

@wks
Copy link
Collaborator Author

wks commented Jul 19, 2023

@k-sareen I have made changes according to your suggestions.

/// Get the number of pages that are reserved, including used pages and pages that will
/// be used (e.g. for copying).
/// Get the number of pages that are reserved, including pages used by MMTk spaces, pages that
/// will be used (e.g. for copying), and live pages allocated by the VM as reported by the VM
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not live pages allocated by the VM necessarily. Maybe slightly rephrase it to say something like "We also take into account the live bytes as reported by the VM. For more details, refer to the [vm_live_bytes] function."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased it a bit.

@wks wks merged commit 47ee126 into mmtk:master Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants