Add persistence to the JITServer AOT cache #15848

Closed
cjjdespres opened this issue Sep 8, 2022 · 31 comments · Fixed by #16075, #16228 or #15949
Labels
comp:jitserver Artifacts related to JIT-as-a-Service project

Comments

@cjjdespres
Contributor

Unlike the main AOT cache, the current JITServer AOT cache lacks any persistence; only an in-memory cache is supported. An option should be added to allow for JITServer AOT cache persistence.

@cjjdespres
Contributor Author

Attn @mpirvu

@mpirvu mpirvu added the comp:jitserver Artifacts related to JIT-as-a-Service project label Sep 11, 2022
@mpirvu
Contributor

mpirvu commented Sep 11, 2022

In a cluster there could be several JITServer instances, each with its own in-memory AOT cache. Ideally, there would be a single AOT cache shared by all the server instances. Unfortunately, synchronization would be a very expensive proposition, especially considering that the servers could be running on different nodes. You would need something like a distributed shared memory mechanism, which is difficult to build and maintain, not to mention that the overhead could be prohibitive.

The second-best idea is to implement AOT cache persistence with a snapshot-restore mechanism. The snapshot operation serializes the AOT cache and writes it to a file. The restore operation re-instantiates an AOT cache from a snapshot file.
We should keep in mind that a JITServer deployment can have multiple JITServer pods, each with its own version of the cache.
Things to consider:

  1. Do we allow merging/updates for a given AOT cache snapshot?
  2. Who initiates the AOT cache snapshot?

Do we allow merging/updates for a given AOT cache snapshot?
Complications can arise if a server instance tries to merge its data with a snapshot that may not be 100% compatible (same key but different values). We could implement a versioning scheme where a new server instance remembers the version it initiated its own cache from and is allowed to update the AOT cache snapshot only if the snapshot still has that version. This scheme might prevent other server instances from saving their own information, even though they may have accumulated a lot of it.
A simpler idea is to overwrite an existing snapshot with a new one. There is no guarantee that the new snapshot contains more data (or better data), but we can use heuristics that strive for this goal.

Who initiates the AOT cache snapshot?
One idea is for the server itself to do it periodically and at shutdown. Snapshotting only at shutdown is not enough, because additional server instances launched by the autoscaler would not benefit from the data accumulated in the running JITServer instances.
Another idea is to push all these operations onto the JITServer operator. JITServer could open another HTTP port for snapshot requests from the operator. The operator could also determine where the snapshot is to be stored (likely in a volume that the JITServer deployment defines). We should allow the deletion of AOT caches based on their names. The operator needs to query the N JITServer pods from a deployment, find which one has the most information and select that pod to serialize the AOT cache to the persistent volume (assuming we overwrite snapshots completely, i.e. no merges). Need to make sure that a new instance does not try to read a snapshot while it is being overwritten by another instance (can the JITServer operator guarantee that?).

@AlexeyKhrabrov
Contributor

AlexeyKhrabrov commented Sep 12, 2022

I think we should start with the simple case of immutable snapshot files. Later we can consider whether it's worth it to implement a merge operation.

It's easy to merge (append) new data into an existing snapshot that hasn't been modified since the current JITServer instance loaded it. But this is equivalent to simply overwriting the snapshot with the new one, and the merge operation is only an optimization, which may or may not make much of a difference to overall performance.

Merging data created concurrently by separate JITServer instances is more challenging since it requires re-assigning record IDs and choosing which compiled methods to keep if there are multiple versions. I think all these things are doable, but will probably need quite a bit of effort.

It should be possible to atomically overwrite an existing snapshot file in place by writing the new snapshot into a separate, uniquely named file and then renaming it. Rename operations are supposed to be atomic on local file systems on Linux; hopefully that also applies to container volumes.
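For illustration, a minimal POSIX sketch of this write-then-rename approach (the function names are hypothetical, not part of the existing code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Hypothetical helper: writes the serialized cache via writeCacheData(),
// then atomically publishes it under finalPath.
bool
writeSnapshotAtomically(const std::string &finalPath, bool (*writeCacheData)(FILE *))
   {
   // The temporary file lives in the same directory as the final snapshot
   // so that rename() never crosses file systems.
   std::string tmpPath = finalPath + ".XXXXXX";
   int fd = mkstemp(&tmpPath[0]); // creates a uniquely named temporary file
   if (fd < 0)
      return false;
   FILE *f = fdopen(fd, "wb");
   if (!f)
      {
      close(fd);
      unlink(tmpPath.c_str());
      return false;
      }
   bool ok = writeCacheData(f) && (fflush(f) == 0) && (fsync(fd) == 0);
   fclose(f);
   // rename() atomically replaces any existing snapshot; a reader that has
   // already opened the old file keeps reading the old inode.
   if (ok && rename(tmpPath.c_str(), finalPath.c_str()) == 0)
      return true;
   unlink(tmpPath.c_str());
   return false;
   }
```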

I think we need to support taking a snapshot while the JITServer is running (periodically or for external requests), without suspending compilation threads. This will require some careful synchronization, but hopefully nothing too complicated.

@AlexeyKhrabrov
Contributor

AlexeyKhrabrov commented Sep 13, 2022

Here is an outline of how storing and loading a snapshot can be implemented (since I've already sketched out the design in the past but never implemented it).

Store:

  1. Write the snapshot header. The exact information that goes in there depends on the extent to which we want to verify the consistency of the data when we load it, but at a minimum it needs to store the numbers of records of each type.
  2. Iterate through all AOTCacheRecords in (partial) dependency order, e.g.: class loader records, class records, method records, class chain records, well-known classes records, AOT header records, serialized methods. For each record, write out the AOTSerializationRecord portion of it (as is, assuming we don't care about endianness, no other data serialization is needed).
  3. Synchronization with concurrent cache usage. Once the snapshot starts, any newly created records could be marked as "do not persist" so that we can skip them as we write out the records; we can keep a list of the records added during a snapshot so that we can clear their "do not persist" flags at the end. Alternatively, we can simply remember how many records of each type exist at the start of the snapshot, and as we iterate each list, stop after that number of records (any records after that were added after the start of the snapshot). If the snapshot is initiated in the middle of a compilation, we'll have some "orphaned" records in the snapshot, but I don't think that's an issue. We should only allow one snapshot operation on each AOT cache at a time. Note that the current implementation uses std::unordered_map (PersistentUnorderedMap), which cannot be safely traversed while other threads can modify it without locking the whole map. We need an alternative way to traverse the records, e.g. a non-blocking linked list (assuming we don't delete records; otherwise we'll need to figure out something else). A sketch of the store loop follows this list.
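A hedged sketch of the store loop described above, assuming the per-type intrusive record lists discussed below; all type and member names beyond AOTCacheRecord/AOTSerializationRecord are illustrative:

```cpp
#include <cstddef>
#include <cstdio>

enum { NUM_RECORD_TYPES = 7 }; // class loader, class, method, class chain,
                               // well-known classes, AOT header, serialized method

struct SnapshotHeader { size_t recordCounts[NUM_RECORD_TYPES]; /* etc. */ };

// Assumed minimal interface of the record classes named above.
struct AOTSerializationRecord { size_t size; /* followed by payload bytes */ };
struct AOTCacheRecord
   {
   const AOTCacheRecord *nextRecord() const;       // intrusive traversal link
   const AOTSerializationRecord *dataAddr() const; // serialized portion
   };

bool
writeSnapshot(FILE *f, const SnapshotHeader &header,
              const AOTCacheRecord *const heads[NUM_RECORD_TYPES])
   {
   if (fwrite(&header, sizeof(header), 1, f) != 1)
      return false;
   for (size_t type = 0; type < NUM_RECORD_TYPES; ++type)
      {
      // Write only the records that existed when the snapshot started; any
      // record past that count was added concurrently and is skipped.
      size_t remaining = header.recordCounts[type];
      for (const AOTCacheRecord *r = heads[type]; r && remaining; r = r->nextRecord(), --remaining)
         {
         const AOTSerializationRecord *data = r->dataAddr();
         if (fwrite(data, data->size, 1, f) != 1) // raw bytes; same endianness assumed
            return false;
         }
      }
   return true;
   }
```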

Load:

  1. Read the header. Reserve space in the maps based on the numbers of records of each type. In addition to the key-record maps, create and reserve space for temporary (scratch) vectors that will map record IDs to AOTCacheRecord pointers.
  2. For each record type (which we now read back in dependency order), for each record, read the AOTSerializationRecord from the file, create the corresponding AOTCacheRecord, and add it to its key-record map and its id-record vector (if necessary). Obtain any needed sub-record pointers (which are guaranteed to have already been read) by ID from the corresponding vectors. A sketch of the load loop follows this list.
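And a matching hedged sketch of the load loop, reusing the assumed SnapshotHeader and NUM_RECORD_TYPES from the store sketch; the helper functions are hypothetical:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helpers: readRecord() reads one AOTSerializationRecord of the
// given type, creates the AOTCacheRecord, and resolves its sub-record IDs
// through idMap (sections for those types were read earlier).
AOTCacheRecord *readRecord(FILE *f, size_t type,
                           const std::vector<std::vector<AOTCacheRecord *>> &idMap);
size_t recordId(const AOTCacheRecord *r);
void reserveKeyRecordMap(size_t type, size_t count);
void addToKeyRecordMap(size_t type, AOTCacheRecord *r);

bool
readSnapshot(FILE *f)
   {
   SnapshotHeader header;
   if (fread(&header, sizeof(header), 1, f) != 1)
      return false; // also verify eye-catcher, versions, checksum here

   // Scratch id -> record vectors, one per type, discarded after the load.
   std::vector<std::vector<AOTCacheRecord *>> idMap(NUM_RECORD_TYPES);
   for (size_t type = 0; type < NUM_RECORD_TYPES; ++type)
      {
      idMap[type].resize(header.recordCounts[type]);
      reserveKeyRecordMap(type, header.recordCounts[type]); // pre-size the map

      for (size_t i = 0; i < header.recordCounts[type]; ++i)
         {
         AOTCacheRecord *r = readRecord(f, type, idMap);
         if (!r)
            return false;
         idMap[type][recordId(r)] = r;
         addToKeyRecordMap(type, r);
         }
      }
   return true;
   }
```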

One of the remaining questions is when to load the snapshots. Some of the options are:

  1. At JITServer start. We need to pass it the list of AOT cache names/paths to load, e.g. as command line arguments. The concern here is JITServer start time for a large number/size of caches.
  2. On demand when the cache with a given name is first requested by a client. The concern here is the latency of the first AOT compilation request. One way to avoid it is to load the cache asynchronously while serving compilation requests without the cache.

@cjjdespres
Contributor Author

At the moment individual fresh AOT cache instances are created and held separately for each client. How would that interact with the persistence mechanism?

@AlexeyKhrabrov
Contributor

At the moment individual fresh AOT cache instances are created and held separately for each client. How would that interact with the persistence mechanism?

Depends on when/how we load the cache snapshots. If eagerly at JITServer start, then each pre-loaded AOT cache instance will already be created and loaded by the time the first client requests it, so there is no interaction. If we load them on demand (which I think makes more sense), when an AOT cache with a given name is first requested by a client, the JITServer will check if a snapshot with this name exists and will initiate the load. To avoid delaying compilation requests while the snapshot is being loaded, we need to serve compilation requests without the cache until the load is complete.

One more thing to consider is whether we want to support replacing an already active AOT cache instance at runtime with a "better" (e.g. bigger) snapshot, in response to an operator request. That can be done relatively easily for future clients that connect after the snapshot is loaded. This is mostly equivalent to directing new clients to a new JITServer instance that uses the new snapshot, but more efficient. Replacing an AOT cache instance for live clients would be more challenging since that requires clearing some caches on both the server and the client.

@mpirvu
Contributor

mpirvu commented Sep 20, 2022

A JITServer can have several AOT caches, each with its own name. It's not a good idea to load all those caches on start-up, because the server may never use many of them. It's much better to load a cache only when a client asks for it.
So, the server receives the location of cache snapshots through a command line option -XX:JITServerAOTCacheDir=...
When the server receives a client request for a cache named "foo", if that cache does not already exist in memory, the server will search the cache directory for a file with that name and load it if available.
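A hedged sketch of that lookup flow (all helper names are hypothetical; getOrCreateAOTCache() is the entry point mentioned later in this issue):

```cpp
#include <string>

// Hypothetical helpers around the existing in-memory cache map.
class JITServerAOTCache;
JITServerAOTCache *findInMemoryCache(const std::string &name);
JITServerAOTCache *loadCacheFromSnapshot(const std::string &name, const std::string &path);
JITServerAOTCache *createEmptyCache(const std::string &name);
std::string snapshotFileName(const std::string &name); // e.g. derived from the cache name
bool fileExists(const std::string &path);

JITServerAOTCache *
getOrCreateAOTCache(const std::string &name, const std::string &cacheDir)
   {
   if (JITServerAOTCache *cache = findInMemoryCache(name))
      return cache;
   std::string path = cacheDir + "/" + snapshotFileName(name);
   if (fileExists(path))
      return loadCacheFromSnapshot(name, path); // possibly done asynchronously
   return createEmptyCache(name); // no snapshot: start with a fresh cache
   }
```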
I was concerned about one server instance overwriting a cache file while another instance reads it from disk. From what I am reading, this is not a problem on Linux: once a process opens a file, the file cannot be truly deleted from disk (only the association between the filename and the inode gets deleted). So, the server instance that wants to create a new snapshot (overwriting an old one) can simply write to a temporary file and, when the snapshot is done, rename the temporary file (this renaming has no effect on another instance that has already started reading the old version of the snapshot).

As Alexey mentioned, replacing the in-memory cache of a server with a better one introduces some complexity, which we should avoid in the first version of this feature.

@cjjdespres
Contributor Author

We need to have an alternative way to traverse them, e.g. a non-blocking linked list

When new records are added to a map, all their dependencies should already exist in the cache, right? So we could just keep that list around and append new records to it as they are created and that list will remain sorted by that partial order.

@AlexeyKhrabrov
Contributor

When new records are added to a map, all their dependencies should already exist in the cache, right? So we could just keep that list around and append new records to it as they are created and that list will remain sorted by that partial order.

That would work, but I think it might be better to keep a separate list for each record type. That way the snapshot file will be more structured (divided into sections containing records of one type) enabling more integrity checks at load time. It might also make the loads a bit faster because of better locality (populating one map at a time instead of all at once).

@AlexeyKhrabrov
Contributor

This also made me think of a simpler way to synchronize taking a snapshot with concurrent additions of new records without marking the newly added records. We can simply remember how many records of each type exist at the start of the snapshot, and as we iterate each list, stop after that number of records (any records after that were added after the start of the snapshot).

@mpirvu
Contributor

mpirvu commented Sep 21, 2022

Is the plan to add a linked list for each hashtable we have today, and to insert a pointer to each record both in the hashtable and in the linked list?

@AlexeyKhrabrov
Contributor

Is the plan to add a linked list for each hashtable we have today, and to insert a pointer to each record both in the hashtable and in the linked list?

Yes. New records will be added to the tail of the list, and the snapshot writer will traverse it from head to tail.
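A hedged sketch of such an intrusive per-type list (names are illustrative); appends happen at the tail under the same lock that guards the corresponding map, so a snapshot writer traversing head to tail sees records in insertion (dependency) order:

```cpp
// Each record gains a single "next" pointer (the 8 bytes discussed below).
class AOTCacheRecord
   {
public:
   AOTCacheRecord *nextRecord() const { return _nextRecord; }
   void setNextRecord(AOTCacheRecord *next) { _nextRecord = next; }
private:
   AOTCacheRecord *_nextRecord = nullptr;
   };

struct RecordList
   {
   AOTCacheRecord *head = nullptr;
   AOTCacheRecord *tail = nullptr;

   void append(AOTCacheRecord *r) // called with the map's lock held
      {
      if (tail)
         tail->setNextRecord(r);
      else
         head = r;
      tail = r;
      }
   };
```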

@cjjdespres
Contributor Author

I believe so

@mpirvu
Contributor

mpirvu commented Sep 21, 2022

I am ok with it, but we need to quantify the memory increase this change brings.

@AlexeyKhrabrov
Contributor

I am ok with it, but we need to quantify the memory increase this change brings.

8 bytes per record, which amounts to ~220 KB for AcmeAir with a total cache memory usage of ~50 MB, or less than 0.5%.

@mpirvu
Contributor

mpirvu commented Sep 21, 2022

We need to add the overhead of the list infrastructure (two extra pointers per node), so ~660 KB. It's manageable though.

@AlexeyKhrabrov
Contributor

We need to add the overhead of the list infrastructure (two extra pointers per node), so ~660 KB. It's manageable though.

One next pointer in each record (intrusive linked list) is all we need at this point. We don't support deleting records, which would be the only reason I can think of for a doubly-linked list (but even then a singly linked list might be sufficient depending on how exactly we delete records).

@mpirvu
Contributor

mpirvu commented Sep 21, 2022

One next pointer in each record (intrusive linked list) is all we need at this point.

ok. I was thinking of using std::list to store pointers to records, rather than modifying the records themselves.

@mpirvu
Contributor

mpirvu commented Sep 21, 2022

For the snapshot header I propose we include the following (a sketch follows the list):

  • Some eye-catcher to indicate that this is indeed a snapshot
  • Version number of the snapshot structure (protect against future changes in record structures)
  • Version number of JITServer (don't know yet if this is absolutely necessary, but it is nice to have)
  • UID of the server that wrote the snapshot
  • Checksum for the entire file to verify integrity?
  • Number of records for each type
  • Ideally we would have offsets to the beginning of sections where a new record type starts
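One possible layout, purely for illustration (field names and widths are assumptions; NUM_RECORD_TYPES assumes the seven record types listed earlier in this issue):

```cpp
#include <cstdint>

enum { NUM_RECORD_TYPES = 7 }; // one section per record type

struct JITServerAOTCacheSnapshotHeader
   {
   char     eyeCatcher[8];      // e.g. "AOTCACHE": marks the file as a snapshot
   uint32_t snapshotVersion;    // version of the snapshot/record structures
   uint32_t jitserverVersion;   // version of the JITServer that wrote it
   uint64_t serverUID;          // UID of the writing server instance
   uint64_t checksum;           // integrity check over the rest of the file
   uint64_t numRecords[NUM_RECORD_TYPES];     // record count per type
   uint64_t sectionOffsets[NUM_RECORD_TYPES]; // file offset of each type's section
   };
```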

@cjjdespres
Contributor Author

The records can depend on other records of different types. Since the record types will be kept in their own sections, can the record types be ordered so that the resulting snapshot file will still be sorted?

@AlexeyKhrabrov
Contributor

Since the record types will be kept in their own sections, can the record types be ordered so that the resulting snapshot file will still be sorted?

Yes, see #15848 (comment).

@cjjdespres
Contributor Author

Sorry, missed that. Thanks.

@mpirvu
Contributor

mpirvu commented Sep 23, 2022

Some thoughts about JITServer instances periodically overwriting an AOT snapshot.

Saving a snapshot should be attempted when enough AOT compilations have been added to the in-memory cache since the last snapshot attempt. What constitutes "enough" is debatable; it could be anywhere between 100 and 500 methods. I am also thinking of imposing a time restriction; we shouldn't attempt to write snapshots every second even if enough AOT methods have been added.

Saving the snapshot: search the snapshot directory for a file named AOTSnapshot.<name>.bin (or something like that). If the file exists, open it and read the header. From the header we can read the number of AOT methods and decide whether we want to overwrite (we should do it only if our snapshot has "significantly" more methods).
There is a question of what to do if the AOTHeader does not match: should we overwrite the snapshot or not? A similar question applies if the snapshot version does not match. The conservative answer is to do nothing, though I can see some problems with leftover/stale snapshots. We could also create several variations: AOTSnapshot.<name>.<version>.bin. However, if we try to do the same for the AOTHeader, there could be just too many combinations.

How do we avoid two JITServer instances writing their own snapshots at about the same time? The danger is that the last one to save a snapshot may write one with fewer AOT methods. Maybe the servers should remember the timestamp of the existing snapshot, write their own snapshot to a temporary file, and just before doing the rename, check the timestamp again. If it changed, we need to read the header again and, if we have fewer entries, abort the snapshot operation by deleting the temporary file.
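A hedged sketch of that final timestamp check (helper names are assumptions; note a small race window between the check and the rename remains):

```cpp
#include <cstddef>
#include <cstdio>
#include <ctime>
#include <sys/stat.h>

size_t readNumMethodsFromHeader(const char *path); // hypothetical helper

bool
tryPublishSnapshot(const char *tmpPath, const char *finalPath,
                   time_t rememberedMtime, size_t ourNumMethods)
   {
   struct stat st;
   if (stat(finalPath, &st) == 0 && st.st_mtime != rememberedMtime)
      {
      // Another instance replaced the snapshot while we were writing ours;
      // re-read its header and abort if it has at least as many methods.
      if (readNumMethodsFromHeader(finalPath) >= ourNumMethods)
         {
         remove(tmpPath); // abort: discard our temporary snapshot
         return false;
         }
      }
   // A small race window remains between the check and the rename.
   return rename(tmpPath, finalPath) == 0; // atomic replacement on POSIX
   }
```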

@AlexeyKhrabrov
Contributor

Once we have the snapshot mechanism implemented, we'll be able to measure the overhead (in particular, the wall-clock and CPU time to write one), which will help guide some of the policy decisions (e.g. how often to take snapshots).

There is a question of what to do if the AOTHeader does not match: ...

It's even a bit more complicated than that. Currently an AOT cache instance can have multiple AOT headers and separate sets of compiled methods for each AOT header. The motivation for that was to avoid duplicating all the metadata (class records etc.) when multiple clients run the same application on diverse hardware or with a diverse set of JVM settings (e.g. heap size ranges). Overwriting a snapshot potentially means losing the cache for multiple AOT header versions. This might actually be a worthwhile use case for merging snapshots. We need to think about this more.

Similar question if the snapshot version does not match. ...

As far as I understand, the main use case for sharing AOT cache snapshots is efficiently auto scaling a JITServer deployment by launching new instances with a warm AOT cache and supporting scaling down to zero without losing the AOT cache. I think we can reasonably expect all JITServer instances in a single deployment to run the same version (in most cases). Maybe this can even be enforced at the JITServer operator level. Or yes, we can keep multiple versions around with different file names, and eventually purge stale ones.

Also for long running JITServer deployments we might want to think about purging stale (not reused in a long time) methods (and their metadata dependencies) from the cache to avoid growing the snapshot sizes. One simple way is to skip such entries when writing a snapshot.

@AlexeyKhrabrov
Contributor

After giving it a bit more thought, merging snapshots doesn't seem too complicated, at least algorithmically.

Here is a high level sketch of the algorithm to merge two snapshots. It can be easily generalized to N snapshots if that ever becomes useful. For simplicity let's assume that both caches C1 and C2 are in memory, but not concurrently used or modified, and we're merging a "smaller" new cache C2 into a "larger" base cache C1.

Iterate through all records in C2 in dependency order: all class loader records, then all class records, etc. For each record R, perform a lookup in the corresponding key-record map in C1. If the lookup is successful, then R is a duplicate of an existing record in C1; otherwise it's a new unique record. In the case of a duplicate, replace the ID of the record with the C1 version. In the case of a new record, assign it a new ID (using a counter starting at the number of records of this type in C1). Then for each sub-record that R refers to, read its ID (which is guaranteed to already be updated) and replace the corresponding ID field in R with it.

In practice though, only one of the caches is in memory, and the other one is in an existing snapshot file. The in-memory cache also needs to stay active, i.e. support concurrent lookup requests and modifications. As a result, we actually have to merge the existing snapshot file into the in-memory cache (regardless of which one is larger), and write the result into a new file.

During a "merging" snapshot, after writing out the records of a given type stored in the in-memory cache (as in regular snapshot operation), we then read through the records of the same type in the existing snapshot file, and look them up in the key-record map of the in-memory cache, skipping duplicates and writing the new unique records (with reassigned IDs) into the output file. To be able to update the sub-record IDs, we need to maintain the maps from the "old" IDs (used in the existing snapshot file) to the "new" IDs for each record type. These maps are temporary and only needed during the merge operation.

Since the in-memory cache can gain new records concurrently, we need to remember the numbers of records (i.e. maximum IDs) of each type at the start of the snapshot/merge, and treat record key matches that map to in-memory records created after the start of the snapshot (ones with IDs higher than the remembered maximum) as new unique records rather than duplicates. Also each lookup needs to be done while holding the corresponding lock that protects the map.

All of the above applies only to metadata records; the main remaining question is what to do with the actual serialized methods in case of conflicts (duplicate keys). One simple policy could be to keep the more recent version, either always the in-memory one (which is more likely to be fresh), or based on a compilation timestamp that we can store in each serialized method.
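A hedged sketch of merging one section of an existing snapshot file into the output, per the scheme above; all helper names are hypothetical:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helpers for this sketch:
struct AOTSerializationRecord;
struct AOTCacheRecord;
AOTSerializationRecord *readNextRecord(FILE *in, size_t type); // NULL at end of section
void remapSubRecordIds(AOTSerializationRecord *rec,
                       const std::vector<std::vector<size_t>> &oldToNewId);
const AOTCacheRecord *lookupByKey(size_t type, const AOTSerializationRecord *rec);
size_t recordId(const AOTCacheRecord *r);
size_t serializedRecordId(const AOTSerializationRecord *rec);
void setSerializedRecordId(AOTSerializationRecord *rec, size_t id);
void writeRecord(FILE *out, const AOTSerializationRecord *rec);

void
mergeSection(FILE *in, FILE *out, size_t type,
             std::vector<std::vector<size_t>> &oldToNewId, // old file ID -> new ID, per type
             const std::vector<size_t> &maxIdAtStart,      // in-memory counts at merge start
             size_t &nextId)                               // next free ID for this type
   {
   while (AOTSerializationRecord *rec = readNextRecord(in, type))
      {
      // Sub-record types were merged earlier, so their ID maps are complete.
      remapSubRecordIds(rec, oldToNewId);
      // The lookup must hold the lock protecting the in-memory key-record map.
      const AOTCacheRecord *existing = lookupByKey(type, rec);
      if (existing && recordId(existing) < maxIdAtStart[type])
         {
         // Duplicate of a record already written from the in-memory cache.
         oldToNewId[type][serializedRecordId(rec)] = recordId(existing);
         }
      else
         {
         // Unique record, or a key match against an in-memory record created
         // after the merge started (which must be treated as new).
         oldToNewId[type][serializedRecordId(rec)] = nextId;
         setSerializedRecordId(rec, nextId++);
         writeRecord(out, rec);
         }
      }
   }
```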

@mpirvu
Contributor

mpirvu commented Sep 28, 2022

Thanks for the idea @AlexeyKhrabrov. It's something we should definitely pursue after we get the simpler implementation ready.

@mpirvu
Contributor

mpirvu commented Oct 5, 2022

Regarding loading the snapshot in the background while other compilations that might want to use it are in progress, we have two options:

  1. Create a dedicated thread that handles all the snapshot file reads
  2. Reuse one of the compilation threads

I am leaning more toward the second approach. I looked at how this can be implemented and sketched a high-level design in which the compilation thread that calls getOrCreateAOTCache() queues a special high-priority fake compilation request with a NULL _stream (or any other value that can be distinguished from a real _stream value). At the same time, we add the name of the cache we want to load to a queue. When a compilation thread picks up this fake compilation request, it will extract the name of the AOT cache to load from the queue, perform the load, and then go back to sleep waiting for another compilation request (if there are several entries in the queue, it will process them all). We need to avoid several threads trying to trigger such fake compilation requests, so we need to keep some state around.
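A hedged sketch of the dispatch side of this design (all names hypothetical):

```cpp
#include <cstddef>
#include <string>

// Hypothetical types for this sketch:
struct CompilationEntry { void *_stream; /* ... */ }; // NULL marks a cache-load entry
struct LoadNameQueue { bool tryPop(std::string &name); };
extern LoadNameQueue pendingCacheLoads;
void loadAOTCacheSnapshot(const std::string &name);
void compileMethodForClient(CompilationEntry *entry);

void
processEntry(CompilationEntry *entry)
   {
   if (entry->_stream == NULL) // fake entry: an AOT cache load request
      {
      std::string cacheName;
      while (pendingCacheLoads.tryPop(cacheName)) // process all queued loads
         loadAOTCacheSnapshot(cacheName);
      return; // go back to waiting for the next compilation request
      }
   compileMethodForClient(entry); // a regular remote compilation request
   }
```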

@AlexeyKhrabrov
Contributor

Regarding loading the snapshot in the background while other compilations that might want to use it are progressing we have two options:

1. Create a dedicated thread that handles all the snapshot file reads

2. Reuse one of the compilation threads

I am leaning more toward the second approach. ...

Agreed. Since loading a snapshot can be I/O bound (if the snapshot file is not in the OS page cache), we don't want a single loader thread to become a bottleneck if the JITServer needs to load multiple caches at the same time.

... . We need to avoid several threads trying to trigger such fake compilation requests, so we need to keep some state around.

Just to clarify, do you mean multiple requests for the same cache name from multiple compilation threads? To handle that, we can store the requests in a PersistentUnorderedSet keyed by the cache name, and link the request structs into a queue to maintain FIFO order.
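A hedged sketch of such a deduplicating FIFO (using std::mutex/std::unordered_set as stand-ins for the persistent-allocator variants; names are hypothetical):

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <unordered_set>

class AOTCacheLoadQueue
   {
public:
   // Returns true if this call queued a new load request; false if a load
   // for this cache name is already queued.
   bool enqueue(const std::string &cacheName)
      {
      std::lock_guard<std::mutex> guard(_lock);
      if (!_pending.insert(cacheName).second)
         return false; // deduplicated by cache name
      _fifo.push_back(cacheName);
      return true;
      }

   // Called by the compilation thread servicing the fake request.
   bool tryPop(std::string &cacheName)
      {
      std::lock_guard<std::mutex> guard(_lock);
      if (_fifo.empty())
         return false;
      cacheName = _fifo.front();
      _fifo.pop_front();
      _pending.erase(cacheName);
      return true;
      }

private:
   std::mutex _lock;
   std::unordered_set<std::string> _pending; // dedup set keyed by cache name
   std::deque<std::string> _fifo;            // preserves request order
   };
```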

@mpirvu
Contributor

mpirvu commented Oct 5, 2022

do you mean multiple requests for the same cache name from multiple compilation threads?

Yes, that's why I said that we need to keep some state around.

@mpirvu
Contributor

mpirvu commented Oct 13, 2022

Currently, if the client does not ask for an AOT cache by name, a nameless AOT cache will be created at the server.
I propose we use a default cache name (something like default_aotcache) rather than having a nameless cache.

@AlexeyKhrabrov
Contributor

Currently, if the client does not ask for an AOT cache by name, a nameless AOT cache will be created at the server. I propose we use a default cache name (something like default_aotcache) rather than having a nameless cache.

Agreed, but I think simply default would be a better default name. This will also need a documentation update.

Since we include cache names in their snapshot file names, we should reject names that contain any characters invalid in file names (/ on Linux, more characters on other platforms like Windows if we ever support them). Another potential concern is case-insensitive file systems.
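For illustration, a conservative validation sketch (the exact allowed character set and length limit are assumptions; restricting to lowercase also sidesteps case-insensitive file systems):

```cpp
#include <string>

bool
isValidAOTCacheName(const std::string &name)
   {
   if (name.empty() || name.size() > 100) // length limit is an assumption
      return false;
   for (char c : name)
      {
      bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                || c == '_' || c == '-';
      if (!ok)
         return false; // rejects '/' on Linux and Windows-invalid characters
      }
   return true;
   }
```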
