HDDS-4427. Avoid ContainerCache in ContainerReader at Datanode startup #1549
Conversation
+1, thanks @sodonnel for the patch and the clear description.
Based on my understanding it's a safe (and useful ;-) ) modification: the cache is bypassed only during startup.
((I wouldn't like to commit it yet; let's give other timezones an opportunity to comment on it.))
Thanks @sodonnel for the optimization. This looks good to me pending CI.
This change is proving slightly tricky in the tests. I have fixed the unit test problems, but the integration tests are more difficult to fix. In the problems I have looked at, the issue is driven by ContainerCache being a singleton/static: the same cache is shared by all DNs in a mini-cluster, so when a DN is restarted, its RocksDB instances are still open in the cache. When the DN starts up again, it fails to open them "uncached" as they are already open in the cache. The solution is either:
Just added a minor comment and question inline after the test changes.
.../test/java/org/apache/hadoop/ozone/container/common/TestSchemaOneBackwardsCompatibility.java
...ontainer-service/src/main/java/org/apache/hadoop/ozone/container/metadata/DatanodeStore.java
The fact that ContainerCache is static creates some difficult or impossible problems in any tests which use the mini-cluster. With a mini-cluster there is only one ContainerCache, shared by all DN instances. If you restart a DN instance, it will reuse the same cached RocksDB instances it put into the cache earlier. In the earlier version of this patch, this caused the DN to fail to start up correctly, as it found open RocksDBs where it did not expect them.

To fix this, I changed the code slightly. Now it tries to get a new "uncached" RocksDB instance at startup time; if getting that instance throws an exception, it falls back to getting an instance from the cache. The second scenario should only be seen in the tests: at startup time the DN's ContainerCache must be empty, so the RocksDBs should never fail to open due to already being open.

@elek Could you have another look at the changes please?
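As a rough sketch of the logic described above (the type and method names here are approximations for illustration, not the exact code in the patch):

```java
import java.io.Closeable;
import java.io.IOException;

// Simplified stand-ins for the datanode store and the reference-counted
// handle returned by the cache; the real types live in the Ozone code base.
interface DbStore extends Closeable { }

interface RefCountedDb extends Closeable {
  DbStore getStore();
}

final class ContainerReaderSketch {

  // Hypothetical hooks; the real patch uses different names and signatures.
  private DbStore openUncached(String dbPath) throws IOException {
    return () -> { };                 // dummy store standing in for RocksDB
  }

  private RefCountedDb getFromCache(String dbPath) {
    DbStore store = () -> { };
    return new RefCountedDb() {
      @Override public DbStore getStore() { return store; }
      @Override public void close() { }  // only drops the reference count
    };
  }

  void readContainer(String dbPath) throws IOException {
    DbStore store = null;
    RefCountedDb cached = null;
    try {
      try {
        // Preferred path at startup: open the DB directly, bypassing the cache.
        store = openUncached(dbPath);
      } catch (IOException e) {
        // Expected only in the tests: the static cache already holds this DB
        // because a mini-cluster datanode was restarted in the same JVM, so
        // fall back to the cached handle.
        cached = getFromCache(dbPath);
        store = cached.getStore();
      }
      // ... read container metadata from 'store' ...
    } finally {
      if (cached != null) {
        cached.close();               // cached handle: only drop the reference
      } else if (store != null) {
        store.close();                // uncached handle: close it outright
      }
    }
  }
}
```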
…e uncached instance fails to open
@elek Just checking again that you are happy for me to merge this change? There were some changes since you approved it earlier, but the overall approach remains the same. Thanks.
Thanks for the update @sodonnel. I am not sure I fully understand the new patch, as you will see from my comments:
Can we use the uncached db in both cases? In that case it would be possible to always create uncached data. If not, we can add …
@elek thanks for taking a look. Are you concerned about the exception handler:
Or that we may end up with either an uncached DatanodeStore instance or a cached store, which comes out as a ReferenceCountedDB? The code messiness is all driven by the static container cache.
Yes, the intention is to always use the uncached DB in both these cases, and provided the Datanode is started from a cold JVM, or the container has never been loaded yet (the import case), all will be fine. The original version did this and it was much cleaner, but lots of tests failed due to the static ContainerCache. If a RocksDB is already open in the cache, you cannot open another handle to it or it will throw an exception. So in the test case with MiniClusters, the code tries to get an uncached DB but may fail because the DB is already open; with the current logic, when it fails it gets the instance from the cache instead. This makes the cleanup in the finally block messy, as we have to handle both cases.

I did have another attempt at this change, where I tried to always return a ReferenceCountedDB from getUncached and handle the exception within that method. However, at the end you need to decrement the reference and close the DB, but only if it did not come from the cache; if the handle came from the cache, you should just decrement and not close. At that point you don't know which is which and cannot make the decision.
That probably won't make it any cleaner. We always want to pass an uncached DB, but in the test case we will sometimes need to fall back to the cache; then we need to increment and decrement the reference count and maybe call close. It just pushes the same messy logic elsewhere.

Another thing I thought of was to check whether the RocksDB is already cached. However, that involves taking a read lock on the ContainerCache, and there is no existing method in the interface to make this check (calling get(...) returns or creates a new entry in the cache). All this logic would be needed only for the tests and would introduce a small contention on the lock, and avoiding that contention is the point of this change. In other words, we would be checking for existence at runtime and would never find anything unless we are in a mini-cluster.
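For completeness, here is a sketch of that rejected alternative: a hypothetical presence check on the cache (no such method exists on ContainerCache today), which would still need the cache's lock and so reintroduce the contention this patch is trying to remove:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only; ContainerCache has no such containsDb() method.
final class CacheWithPresenceCheck {
  private final Map<String, Object> openDbs = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  boolean containsDb(String containerDbPath) {
    // Even a read-only presence check must take the read lock, so every
    // container loaded at startup would briefly contend on the cache again;
    // in production the check would never find anything anyway.
    lock.readLock().lock();
    try {
      return openDbs.containsKey(containerDbPath);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```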
@elek Any further thoughts here? I feel it is hard to make this change much cleaner due to the static ContainerCache.
Thanks for the detailed explanation @sodonnel, I appreciate it.
That was the key sentence for understanding the problem; now I get it. If there is another open db instance somewhere in the system (=ContainerCache), we have no possible way to open it again. Now I hate the static property of ContainerCache even more, and I have no better idea for here. I just hope that we will have time to fix ContainerCache in general ;-)

I have a bad feeling that this code also changes the code path of the container import during closed container replication, but I tried to think hard and couldn't find any edge cases where it can really fail, as the container is not visible until added to the …

So let me repeat my +1 and merge it.
* HDDS-3698-upgrade: (46 commits)
  HDDS-4468. Fix Goofys listBucket large than 1000 objects will stuck forever (apache#1595)
  HDDS-4417. Simplify Ozone client code with configuration object -- addendum (apache#1581)
  HDDS-4476. Improve the ZH translation of the HA.md in doc. (apache#1597)
  HDDS-4432. Update Ratis version to latest snapshot. (apache#1586)
  HDDS-4488. Open RocksDB read only when loading containers at Datanode startup (apache#1605)
  HDDS-4478. Large deletedKeyset slows down OM via listStatus. (apache#1598)
  HDDS-4452. findbugs.sh couldn't be executed after a full build (apache#1576)
  HDDS-4427. Avoid ContainerCache in ContainerReader at Datanode startup (apache#1549)
  HDDS-4448. Duplicate refreshPipeline in listStatus (apache#1569)
  HDDS-4450. Cannot run ozone if HADOOP_HOME points to Hadoop install (apache#1572)
  HDDS-4346. Ozone specific Trash Policy (apache#1535)
  HDDS-4426. SCM should create transactions using all blocks received from OM (apache#1561)
  HDDS-4399. Safe mode rule for piplelines should only consider open pipelines. (apache#1526)
  HDDS-4367. Configuration for deletion service intervals should be different for OM, SCM and datanodes (apache#1573)
  HDDS-4462. Add --frozen-lockfile to pnpm install to prevent ozone-recon-web/pnpm-lock.yaml from being updated automatically (apache#1589)
  HDDS-4082. Create ZH translation of HA.md in doc. (apache#1591)
  HDDS-4464. Upgrade httpclient version due to CVE-2020-13956. (apache#1590)
  HDDS-4467. Acceptance test fails due to new Hadoop 3 image (apache#1594)
  HDDS-4466. Update url in .asf.yaml to use TLP project (apache#1592)
  HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585)
  ...
What changes were proposed in this pull request?
Testing on a dense datanode (200k containers, 45 disks) I see contention around the ContainerCache. Most of the time most threads are running in parallel, but we see slowdowns where most threads get blocked waiting on the ContainerCache lock.
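Conceptually, the blocking pattern looks like the sketch below: a cache whose lookups and evictions are serialized on a single lock, with the evicted RocksDB closed while that lock is still held. This is a hypothetical illustration, not the actual ContainerCache implementation:

```java
import java.util.LinkedHashMap;

// Hypothetical LRU-style cache; lookups, inserts and evictions all run
// under the same monitor, and the evicted DB is closed inside it.
final class SerializedDbCacheSketch {
  private final int maxOpenDbs;
  private final LinkedHashMap<String, AutoCloseable> openDbs =
      new LinkedHashMap<>(16, 0.75f, true);   // access order => LRU

  SerializedDbCacheSketch(int maxOpenDbs) {
    this.maxOpenDbs = maxOpenDbs;
  }

  synchronized AutoCloseable get(String dbPath) throws Exception {
    AutoCloseable db = openDbs.get(dbPath);
    if (db == null) {
      db = openDb(dbPath);
      openDbs.put(dbPath, db);
      if (openDbs.size() > maxOpenDbs) {
        // Evict and close the least recently used DB while still holding
        // the lock; every other thread that needs the cache waits here.
        String oldest = openDbs.keySet().iterator().next();
        openDbs.remove(oldest).close();
      }
    }
    return db;
  }

  private AutoCloseable openDb(String dbPath) {
    return () -> { };   // stand-in for opening a RocksDB instance
  }
}
```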
Examining jstack output, we can see that the runnable thread blocking the others is typically evicting a RocksDB instance from the cache:
The slowness seems to be driven by the RocksDB close call. It is generally fast, but is often around 1ms. For example, here are some timings from that call after adding instrumentation to the code:
The timer was only at millisecond precision, which is why many of the values are zero. Even at 1ms per close, we can only close 1000 databases per second, and this point of the code is serialized.
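The millisecond-precision measurement would look something like the following sketch (a hypothetical helper, not the exact instrumentation that was used):

```java
import org.rocksdb.RocksDB;

final class CloseTimingSketch {
  // Wrap the close call with a millisecond-precision timer; durations under
  // 1ms round down to 0, which is why many recorded values were zero.
  static void timedClose(RocksDB db) {
    long start = System.currentTimeMillis();
    db.close();
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("RocksDB close took " + elapsed + " ms");
  }
}
```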
At startup time there is no value in caching the open containers. All containers on the node need to be read in parallel; therefore we should simply open and close each container's DB without caching the instance.
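Conceptually, the startup path then becomes open, read, close per container, with no shared cache on the hot path. A rough sketch under those assumptions (hypothetical names, not the actual ContainerReader code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of reading all container DBs in parallel at startup
// without going through the shared ContainerCache.
final class StartupScanSketch {

  interface ContainerDb extends AutoCloseable {
    void readMetadata();
  }

  // Stand-in for opening a container's RocksDB directly (uncached).
  private ContainerDb openDirectly(String dbPath) {
    return new ContainerDb() {
      @Override public void readMetadata() { }
      @Override public void close() { }
    };
  }

  void scan(List<String> containerDbPaths, int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (String path : containerDbPaths) {
      pool.submit(() -> {
        // Each task opens its own handle, reads what it needs and closes it,
        // so threads never block on a shared cache lock during startup.
        try (ContainerDb db = openDirectly(path)) {
          db.readMetadata();
        } catch (Exception e) {
          // In the real code a failure here would be reported for the container.
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
```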
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4427
How was this patch tested?
Existing tests and small changes to TestContainerReader.java.