
Re-architect Compiler Memory Infrastructure #4783

Open
dsouzai opened this issue Jan 28, 2020 · 9 comments

@dsouzai
Contributor

dsouzai commented Jan 28, 2020

The current architecture of the memory manager in OMR is quite complex and non-trivial to understand [1]. When you look at an extending project, the story can get even more complicated [2]. As a prerequisite to implementing eclipse-openj9/openj9#5836, @mstoodle went about refactoring the memory infrastructure. His efforts led to a discussion and an eventual design by @jdmpapin, @Leonardo2718, and (to a much lesser extent) myself. Attached is a PDF [3] with a visual depiction of the design; the rest of this issue describes it. It is important to note that this redesign only involves TR::RawAllocator and the various segment providers; it does not involve TR::Region or TR::PersistentAllocator. The attached PDF shows how OpenJ9 fits into this new design as a real example, but it also shows how other projects might fit in.

The design starts with two API classes: TR::RawAllocator and TR::SegmentAllocator. These are abstract classes that other classes will implement. The attached PDF shows how the various classes implement this API.

Regarding TR::RawAllocator:

  • Malloc RA will be used to wrap malloc/free - this is what TR::RawAllocator does currently
  • Debug RA will be used to wrap mmap/munmap
  • J9 RA is an example of how OpenJ9 would implement its own TR::RawAllocator.
  • Custom RA is an example of how some other project could create their own TR::RawAllocator by providing their own function pointers.

Regarding TR::SegmentAllocator:

  • Trivial SA just uses TR::RawAllocator
  • Limiting SA enforces a limit, if one is specified, on how much memory the TR::SegmentAllocator is allowed to allocate
  • J9 SA is an example of how OpenJ9 would use its own segment allocation APIs
  • FreeList SA would be used to keep a free list and allocate from it if possible
  • Stashed SA would be used to maintain N segments - i.e., those N segments are not returned to wherever they were allocated from.

All of these might seem like an odd set of implementers. However, the way all of these fit in is via nested composition. The full visual depiction of this is in the attached PDF. However, to give a concrete example in code of how the "OMR Default Segment Allocator" might look:

namespace TR {

/* Raw Allocators */
class RawAllocator
   {
   public:
   virtual void *allocate(size_t size) = 0;
   virtual void  deallocate(void *ptr) = 0;
   };

class MallocRA : public RawAllocator
   {
   public:
   void *allocate(size_t size) { return malloc(size); }
   void  deallocate(void *ptr) { free(ptr); }
   };

/* Segment Allocators */
class SegmentAllocator
   {
   public:
   virtual MemorySegment &allocate(size_t size) = 0;
   virtual void  deallocate(MemorySegment &segment) = 0;
   };

class TrivialSA : public SegmentAllocator
   {
   public:
   TrivialSA(TR::RawAllocator &rawAllocator) : _rawAllocator(rawAllocator) {}
   MemorySegment &allocate(size_t size) { /* allocate using _rawAllocator */ }
   void  deallocate(MemorySegment &segment) { /* deallocate using _rawAllocator */ }
   private:
   RawAllocator &_rawAllocator;
   };

class LimitingSA : public SegmentAllocator
   {
   public:
   LimitingSA(TR::SegmentAllocator &nextSA, size_t limit) : _nextSA(nextSA), _limit(limit), _allocated(0) {}
   MemorySegment &allocate(size_t size)
      {
      _allocated += size;
      if (_allocated > _limit)
         throw std::bad_alloc();
      return _nextSA.allocate(size);
      }
   void deallocate(MemorySegment &segment)
      {
      _allocated -= segment.size();
      _nextSA.deallocate(segment);
      }
   private:
   SegmentAllocator &_nextSA;
   size_t _limit;
   size_t _allocated;
   };

class DefaultSegmentAllocator : public SegmentAllocator
   {
   public:
   DefaultSegmentAllocator(TR::SegmentAllocator &nextSA) : _nextSA(nextSA) {}
   MemorySegment &allocate(size_t size) { return _nextSA.allocate(size); }
   void deallocate(MemorySegment &segment) { _nextSA.deallocate(segment); }
   private:
   SegmentAllocator &_nextSA;
   };

/* Region Allocator */
class Region
   {
   public:
   Region(SegmentAllocator &segmentAllocator, RawAllocator &rawAllocator) {...}
   ...
   };
}

Then in compileMethodFromDetails:

/* Initialize Raw Allocator */
TR::MallocRA mallocRA;

/* Initialize Segment Allocator */
TR::TrivialSA trivialSA(mallocRA);
TR::LimitingSA limitingSA(trivialSA, limit);
TR::DefaultSegmentAllocator defaultSegmentAllocator(limitingSA);

/* Initialize Region */
TR::Region compilationRegion(defaultSegmentAllocator, mallocRA);

/* do compilation */

Note, the code above glosses over some details like what the default segment size should be.

This design doesn't deviate from the philosophy of the existing memory infrastructure, namely "Try to fulfill the request yourself or delegate". In the example above, LimitingSA will try to fulfill the request itself; however, since it doesn't have the means to do so, it delegates to _nextSA.

The benefit of this design, though, is that every implementer of the APIs has a very specific purpose and is well named. The reason for composing the various TR::SegmentAllocator implementers is to maintain the invariant of "Try to fulfill the request yourself or delegate". A given class does not need to know the exact type of the class it delegates to; it only knows it is a TR::SegmentAllocator and so has a well-defined API. This also makes the code much easier to modify.

For example, if we wanted to add something between LimitingSA and TrivialSA, say PaintingSA, then all we'd have to do is add the PaintingSA class definition and then in compileMethodFromDetails:

...
/* Initialize Segment Allocator */
TR::TrivialSA trivialSA(mallocRA);
TR::PaintingSA paintingSA(trivialSA);
TR::LimitingSA limitingSA(paintingSA, limit);
TR::DefaultSegmentAllocator defaultSegmentAllocator(limitingSA);
...

The majority of the classes described in the attached PDF can live in OMR, including implementers like the "Stashed SA" whose functionality currently exists only in OpenJ9 (as the J9::SegmentCache).

We're opening this issue as a place where this design can be discussed.

FYI @mstoodle @lmaisons.


[1] https://github.com/eclipse/omr/blob/master/doc/compiler/memory/MemoryManager.md
[2] https://github.com/eclipse/openj9/blob/master/doc/compiler/memory/MemoryManager.md
[3] https://github.com/eclipse/omr/files/4124860/CompilerMemoryInfrastructureDesign.pdf

@jdmpapin
Contributor

Note that the CustomRA shouldn't really be needed, since it would be preferable simply to subclass RawAllocator. Something like it could be useful for allowing custom allocators to be implemented via an API wrapper that does not directly allow subclassing, e.g. for extern "C", but any such wrapper can straightforwardly implement its own CustomRA-like type if/when desired.

@0xdaryl
Contributor

0xdaryl commented Feb 5, 2020

To spur some discussion on this topic I would like to add this to the agenda of the Feb. 13 OMR Architecture Meeting. Would one of @dsouzai @Leonardo2718 @jdmpapin like to lead the discussion on it?

@dsouzai
Contributor Author

dsouzai commented Feb 5, 2020

I'm fine leading the discussion.

@mstoodle
Contributor

A few points from my earlier travels through this topic, most of which I mentioned on the call, though due to our audio troubles maybe not all of it came across well:

  1. I had trouble creating a virtual hierarchy of RawAllocator due to interactions with typed_allocator and hence STL. The discussion at the architecture meeting suggests I may not have found the right reference wrapper approach to make that work, so possibly this is not a "real" concern, but I remember starting with a virtual hierarchy of RawAllocator and having to switch to add the raw allocator class as a template parameter on SegmentAllocator and up which I found distasteful and painful. Hopefully I just didn't see the right way forward at the time.

  2. I was also attracted by the simple nature of the delegating model, but I did not find it mapped well onto the existing OpenJ9 set of implementation constraints. In particular, I remember not being able to construct a simple linear chain for at least one scenario, and the data to make e.g. a "limit" decision was not naturally represented at the point in the chain where you needed to make the decision (requiring duplicating information in the chain, which strongly eroded my appreciation for the model). Overall, I came to the conclusion that the degree of flexibility this delegation model offered was not worth the complexity required to model all the niggly details in the OpenJ9 implementation. I switched approaches to try to make OMR's memory allocation architecture take on the characteristics of the OpenJ9 architecture, rather than trying to build a nice abstract model that didn't fit well with how OpenJ9 worked. The real world is messy, I guess, is the point here.

  3. I did not realize until late that PersistentAllocator was entangled in this part of the memory allocator. In retrospect, that should have been obvious, but I remember needing to change my mental model after that realization. Whoever takes a crack at this should definitely make sure they take the peculiarities of PersistentAllocator into account.

  4. OpenJ9's memory allocation implementation is a performance sensitive system, not only from a "how much memory does the compiler use" standpoint but also from a "how much time does a compile take". That puts an increased onus on performance testing before switching (as @andrewcraik mentioned on the call) as well as increased risk if you want to make anything in OpenJ9 work differently than it does today. See 2, above.

  5. I think it would be unrealistic to expect the memory architecture to change atomically via a PR merge. So there needs to be a model for how the new and old schemes can co-exist while we evaluate any new scheme. I don't think flipping it on/off dynamically is required (and it would be hardest to manage), but being able to build one or the other fairly easily should be considered a requirement.

  6. I strongly suggest whoever takes on this task to allocate a solid block of time to work on it without too many distractions. When I made my last attempt here, I was able to put maybe 2-3 hours a day into it and I spent substantial time paging the details in because there are a LOT of touchpoints for this stuff across two different projects and some very subtle details to maintain as you operate in this code.

  7. (since it was mentioned and I believe it's a small example of a larger issue with OMR as a whole)
    CustomAllocator is an example of a very simple thing we can provide to make it easier for a consuming project to hook up to the OMR compiler without having to learn how to write new OMR classes. True, writing a class that implements a couple of functions to allocate and free memory isn't a lot of work. But if we consistently take the position that "it's not hard to write a class" in 100 different places, then OMR consumers have to do a lot of work before they can even try out OMR. For the cost of one simple class that nobody is forced to use, I think it's worth providing this "easier consumption" path. And I hope we'll consider making this choice in more places so that we overtly try not to make the OMR compiler any harder to integrate with other systems.

@0xdaryl
Contributor

0xdaryl commented Feb 14, 2020

Thanks for summarizing your feedback in the issue @mstoodle. What we decided during the meeting was that @dsouzai will spend some timeboxed amount of time exploring some of the issues @mstoodle raised to see if they are actually problems in the solution proposed. If not, the work sizing that was discussed during the meeting was on the order of 2 weeks, but that may be refined depending on the outcome of the investigations.

Unfortunately, we only have about 10 minutes of usable audio at the beginning of the call and then a few minutes at the end. I'm not sure if this was a phone issue in the main meeting room or a Webex call-in issue. Regardless, I regret that we were unable to fix the audio issues satisfactorily during the call and I apologize for the inconvenience for the remote (or absent) participants. I may still edit the recording to include just the first 10 minutes of @dsouzai's presentation and post that. It is incomplete but may fill in some of the background for those that don't have it.

@dsouzai
Contributor Author

dsouzai commented Feb 25, 2020

I managed to get the POC working both with OpenJ9 and OMR. You can find the branches here:

https://github.com/dsouzai/omr/commits/memory
https://github.com/dsouzai/openj9/commits/memory

All of the new code is guarded with the macro NEW_MEMORY.

After going through this process, here are some smaller tasks that should be done as a precursor to this work:

  1. Pass TR::RawAllocator by reference instead of by value
  2. Pass TR::Allocator by reference instead of by value (see Re-architect Compiler Memory Infrastructure #4783 (comment))
  3. Update typed_allocator and getTypedAllocator to use references (see Re-architect Compiler Memory Infrastructure #4783 (comment))

These will still allow the old memory infrastructure to run, but it will make life a lot easier should the switch to the new infrastructure happen. Additionally, using references does provide the nice guarantee of non-NULLness.

Some unfortunate decisions:

  1. [1] TestAlloc::TRJ9MemorySegment. In order to maintain the invariant that all TestAlloc::SegmentAllocators return a TR::MemorySegment &, and because TestAlloc::J9SA gets back a J9MemorySegment * from the VM, I had to wrap the J9MemorySegment * in a TestAlloc::TRJ9MemorySegment. The former does have fields to track segment memory usage, but in order to simplify my life, I just set the J9MemorySegment's alloc field to top and used the wrapper TestAlloc::TRJ9MemorySegment to keep track of segment memory usage. The consequence of this is that in the javacore, the segment will be reported as fully used.
  2. [2] Even though TestAlloc::J9SA::deallocate is passed in a TR::MemorySegment &, it has to just "know" that the segment is really of type TestAlloc::TRJ9MemorySegment. While this isn't terrible (since the allocation is also done in the same file, and this knowledge isn't leaked anywhere else), it's not as pretty.
  3. [3] Even though one of the constructors of TestAlloc::J9RA takes in a const TestAlloc::RawAllocator &, it has to just "know" that the allocator is really of the type TestAlloc::J9RA &.
  4. [4] Because of the way the existing J9SegmentCache does its thing and because with the switch to references, checking for NULL doesn't make sense, I had to add a new bool to tell the destructor of TestAlloc::J9SegmentCache not to free the cached segment if it was going to donate that segment to another TestAlloc::J9SegmentCache.
  5. [5] PersistentAllocator needs to check if an existing segment it holds can fulfill a request. The check to do so would essentially return a J9MemorySegment * (which could be NULL if there were none). In order to deal with this in the proposed world, the check returns an iterator. This seems less than ideal because, while the allocation method is locked, we still have to think about whether the iterator could ever be invalidated; in this case it couldn't, but if findUsableSegment were called elsewhere in a manner that wasn't safe, the list could be modified and the iterator returned could be invalidated.

These decisions were made because the point of my efforts was to show that the new memory infrastructure can work; if we go forward with making this effort real, then we need to address these decisions. As it turns out though, all of these complexities are found in OpenJ9; the OMR side of things was surprisingly simple (except for changing the existing allocators to not be passed by value...).

fyi @mstoodle @jdmpapin @Leonardo2718 @aviansie-ben


[1] dsouzai/openj9@dea02bd#diff-c4d0c52b4c192efe63c8d8bc683d6dc1R80
[2] dsouzai/openj9@dea02bd#diff-c4d0c52b4c192efe63c8d8bc683d6dc1R90
[3] dsouzai/openj9@dea02bd#diff-aa37060c41f0a1a097488393378d0bfdR25
[4] dsouzai/openj9@dea02bd#diff-c4d0c52b4c192efe63c8d8bc683d6dc1R185-R186
[5] dsouzai/openj9@e38af6e#diff-6dc3fe6ee8cbeec4cfa9ce832f610ff8R139

@dsouzai
Contributor Author

dsouzai commented Mar 5, 2020

@Leonardo2718, @aviansie-ben, and I met up and discussed the roadmap @0xdaryl requested in the architecture meeting. The general plan should go as follows:

  1. In the current code base, change TR::RawAllocator to be passed by reference rather than by value. When it comes to providing STL with TR::RawAllocator, use TR::ReferenceWrapper<TR::RawAllocator>. This will prevent the need to change TR::Allocator to be passed by reference, as well as having to change TR::typed_allocator. This is going to have to be a coordinated OMR/OpenJ9 set of PRs.
  2. Implement the API classes TestAlloc::RawAllocator and TestAlloc::SegmentAllocator. The namespace used for these classes doesn't matter, but it should not be the TR namespace.
  3. Implement the various implementers of the two API classes; this is relatively simple, and can, for the most part, be copied from my prototype. The names should be changed so they're more descriptive.
  4. Guard declarations of existing allocators in order to allow switching to the new memory infrastructure via a build flag¹.
  5. Implement the necessary changes in TR::Region; this will also have to be guarded with the same ifdef as the previous point. (dsouzai@c526847)
  6. Implement the necessary changes in TR::PersistentAllocator (both OpenJ9 and OMR); these will also have to be guarded with the same ifdef as the previous point. (dsouzai@ccee066 & dsouzai/openj9@e38af6e)
  7. Add the necessary changes to use the new memory infrastructure (dsouzai@f008265 & dsouzai/openj9@634851c) - JITServer will have to be taken into consideration as well. This code might also need to be guarded with the same ifdef as the previous point.

Once these are implemented, both memory infrastructures can co-exist. Once the new infrastructure is sufficiently tested, we can make the switch to use the new infrastructure and the old one can be deleted.


  1. @Leonardo2718 or @aviansie-ben will provide more detail on this, as well as on how the switch would actually take place, as it will be a little non-trivial because of the dependence of OpenJ9 on OMR.

@0xdaryl
Contributor

0xdaryl commented Apr 26, 2020

Could someone articulate the scenario that motivates the "stashed segment allocator" please? I realize this was inspired from the J9SegmentCache in OpenJ9 and that cache was introduced to work around a performance problem, so I suppose my request is to articulate what that problem is and why the segment cache solves it. Thanks.

@dsouzai
Contributor Author

dsouzai commented Apr 27, 2020

In OpenJ9, this is what the lifetimes look like:

// Compilation Thread Process Entries Scope
{
   J9SegmentCache;

   // Compilation Scope
   {
      TR::SystemSegmentProvider;
      {
         TR::Compilation::compile();
      }
   }

}

You can think of the Compilation Scope as being a loop, where the comp threads pull requests off the compilation queue and work on them. TR::SystemSegmentProvider uses J9SegmentCache as its backing provider. As you can see, TR::SystemSegmentProvider goes out of scope before J9SegmentCache; when the former goes out of scope, it releases all of its memory to its backing provider (the J9SegmentCache), and the latter releases that memory back to the OS - except for one segment that is not released until the J9SegmentCache itself goes out of scope.

This one segment that is held on to is the cache. Originally, when the current memory infra was put into place to replace the previous one, there was no J9SegmentCache. However, this caused regressions in compilation time, and therefore startup, because in the previous infra, the comp thread would hold on to some memory for some period of time across compilations. The J9SegmentCache was introduced to provide similar behaviour and prevent the comp time / startup regressions.

The reason not caching some memory causes a problem is that most compilations during startup in OpenJ9 don't take more than ~30 MB of scratch memory (and often less), so the increased number of system calls to acquire memory at the start of every compilation added up.
