HDDS-4427. Avoid ContainerCache in ContainerReader at Datanode startup #1549

Merged
merged 6 commits into apache:master on Nov 19, 2020

Conversation

@sodonnel (Contributor) commented Nov 3, 2020

What changes were proposed in this pull request?

Testing on a dense datanode (200k containers, 45 disks), I see contention around the ContainerCache. Most of the time the threads run in parallel, but we see slowdowns where most threads become blocked waiting on the ContainerCache lock.

Examining JStacks, we can see the runnable thread blocking others is typically evicting a RocksDB instance from the cache:

"Thread-37" #131 prio=5 os_prio=0 tid=0x00007f8f49219800 nid=0x1c5e9 runnable [0x00007f86f7e78000]
   java.lang.Thread.State: RUNNABLE
        at org.rocksdb.RocksDB.closeDatabase(Native Method)
        at org.rocksdb.RocksDB.close(RocksDB.java:468)
        at org.apache.hadoop.hdds.utils.RocksDBStore.close(RocksDBStore.java:389)
        at org.apache.hadoop.ozone.container.common.utils.ReferenceCountedDB.cleanup(ReferenceCountedDB.java:79)
        at org.apache.hadoop.ozone.container.common.utils.ContainerCache.removeLRU(ContainerCache.java:106)
        at org.apache.commons.collections.map.LRUMap.addMapping(LRUMap.java:242)
        at org.apache.commons.collections.map.AbstractHashedMap.put(AbstractHashedMap.java:284)
        at org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(ContainerCache.java:167)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getDB(BlockUtils.java:63)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.parseKVContainerData(KeyValueContainerUtil.java:165)
        at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.verifyAndFixupContainerData(ContainerReader.java:183)
        at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.verifyContainerFile(ContainerReader.java:160)
        at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.readVolume(ContainerReader.java:137)
        at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.run(ContainerReader.java:91)
        at java.lang.Thread.run(Thread.java:748)

The slowness seems to be driven by the RocksDB close call. It is generally fast, but often takes around 1ms. For example, here are some timings from that call after adding instrumentation to the code:

grep -a "metric: closing DB took" ozone-datanode.log | cut -d ":" -f 6 | sort -n | uniq -c
 count   ms
 61940    0
128155    1
  2786    2
   236    3
    53    4
    42    5
    17    6
    10    7
     8    8
    15    9
The timer only had millisecond precision, which is why many readings are zero. Even at 1ms per close, we can only close about 1,000 databases per second, and this point of the code is serialized; with roughly 200k containers that is on the order of 200 seconds spent just closing DBs.

At startup time, there is no value in caching the open containers. All containers on the node need to be read in parallel, so we should simply open and close each container's DB without caching the instance.
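
As a rough sketch of the idea (illustrative only; the Store type and helper methods below are invented for this sketch, not the actual patch code), the startup path becomes an open / read / close per container with nothing inserted into the cache:

import java.io.File;
import java.io.IOException;

// Hypothetical sketch only: the Store type and helper methods are invented
// for illustration. The point is the open / read / close pattern with no
// ContainerCache involvement, so there are no LRU evictions and no cache
// lock for the per-volume reader threads to serialize on.
final class StartupLoadSketch {

  interface Store extends AutoCloseable {
    void close() throws IOException;   // fully closes the RocksDB handle
  }

  void loadVolume(File[] containerDirs) throws IOException {
    for (File dir : containerDirs) {
      // Open the container DB directly, bypassing the shared ContainerCache,
      // read the metadata needed at startup, then close it straight away.
      try (Store store = openDbDirectly(dir)) {
        readContainerMetadata(store);
      }
    }
  }

  private Store openDbDirectly(File dir) throws IOException {
    throw new UnsupportedOperationException("illustrative only");
  }

  private void readContainerMetadata(Store store) {
    // parse the container metadata (block counts, bytes used, etc.)
  }
}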

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4427

How was this patch tested?

Existing tests and small changes to TestContainerReader.java.

@elek (Member) left a comment

+1, thanks @sodonnel for the patch and the clear description

Based on my understanding it's a safe (and useful ;-) ) modification; the cache is turned off only during startup.

((I wouldn't like to commit it yet; let's give other timezones an opportunity to comment on it))

@errose28 (Contributor) left a comment

Thanks @sodonnel for the optimization. This looks good to me pending CI.

@sodonnel (Contributor, Author) commented Nov 4, 2020

This change is proving slightly tricky in the tests.

I have fixed the unit test problems, but the integration tests are more difficult to fix. In the problems I have looked at, the issue is driven by ContainerCache being a singleton / static. This means the same cache is shared by all DNs in a mini-cluster and when a DN is restarted in the mini-cluster, the RocksDB instances are still open in the cache. When the DN is starting up, it fails to open them "uncached" as they are already open in the cache.

The solution is either:

  1. Make ContainerCache not be static and create a new instance somewhere in the DN which is reused (roughly sketched after this list).

  2. On DN shutdown / startup, the cache should be purged, but as the cache is shared by all DNs in the JVM, that impacts the other running DNs.
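
For illustration, option 1 might look roughly like this (hypothetical names, not a real patch):

// Hypothetical sketch of option 1 only; the class names are invented.
final class ContainerCacheSketch {
  ContainerCacheSketch(int maxSize) {
    // An LRU map of container ID -> reference-counted DB handle would live here.
  }
}

final class OzoneContainerSketch {
  // One cache per datanode instance instead of a JVM-wide static singleton.
  // Restarting one DN inside a MiniOzoneCluster could then discard that DN's
  // cache without touching the caches of the other DNs sharing the JVM.
  private final ContainerCacheSketch containerCache = new ContainerCacheSketch(1024);

  ContainerCacheSketch getContainerCache() {
    return containerCache;
  }
}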

@errose28 (Contributor) left a comment

Just added a minor comment and question inline after the test changes.

@sodonnel (Contributor, Author) commented Nov 9, 2020

The fact that ContainerCache is static creates some difficult or impossible problems to resolve in any tests that use the mini-cluster. With a mini-cluster, as there is only one ContainerCache, it is shared by all DN instances. If you restart a DN instance, it will reuse the same cached RocksDB instance it put into the cache earlier. In the earlier version of this patch, this caused the DN to fail to start up correctly, as it found open RocksDBs where it did not expect them.

To fix this, I changed the code slightly. Now it will try to get a new "uncached" RocksDB instance at startup time. If getting that instance throws an exception, it falls back to getting an instance from the cache. The second scenario should only be seen in the tests: at startup time the DN ContainerCache must be empty, so the RocksDBs should never fail to open due to already being open.
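
Roughly, the logic is as follows (a simplified sketch with stand-in types and invented method names, not the exact patch code):

import java.io.IOException;

// Simplified stand-in types: the real code uses DatanodeStore and
// ReferenceCountedDB, but the method names here are invented for the sketch.
final class UncachedWithFallbackSketch {

  interface Store { }                                       // an open RocksDB handle
  interface CachedDB { Store getStore(); void close(); }    // close() only drops a reference

  void openForStartup(long containerId) throws IOException {
    Store store;
    CachedDB cachedDB = null;
    try {
      // Preferred path at startup: open the DB directly, bypassing the cache.
      store = openUncached(containerId);
    } catch (IOException e) {
      // Expected only in tests: the static ContainerCache shared by every DN
      // in a MiniOzoneCluster may already hold this DB open, so a second
      // direct open fails and we fall back to the cached handle instead.
      cachedDB = getFromCache(containerId);
      store = cachedDB.getStore();
    }
    readContainerMetadata(store);
    // Cleanup then differs: a cached handle just drops its reference, while
    // an uncached store must be fully closed (a finally block in the real code).
    if (cachedDB != null) {
      cachedDB.close();
    } else {
      closeUncached(store);
    }
  }

  private Store openUncached(long containerId) throws IOException {
    throw new UnsupportedOperationException("illustrative only");
  }

  private CachedDB getFromCache(long containerId) {
    throw new UnsupportedOperationException("illustrative only");
  }

  private void readContainerMetadata(Store store) { }

  private void closeUncached(Store store) { }
}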

@elek Could you have another look at the changes please?

@sodonnel (Contributor, Author) commented:

@elek Just checking again that you are happy for me to merge this change? There were some changes since you approved it earlier, but the overall approach remains the same. Thanks.

@elek (Member) commented Nov 16, 2020

Thanks for the update @sodonnel. I am not sure if I fully understand the new patch, as you will see from my comments:

  1. I agree: static is evil. Especially together with MiniOzoneCluster
  2. I have a strange feeling about the exception-throwing part in KeyValueContainerUtil.parseKVContainerData. It doesn't look very nice and I would like to understand if it's really needed.

KeyValueContainerUtil.parseKVContainerData is called from only two places: 1. the initial import after boot, 2. the import after closed-container replication.

Can we use an uncached db in both cases? In that case it would be possible to always create it uncached.

If not, we can add DatanodeStore as a parameter of KeyValueContainerUtil.parseKVContainerData so the caller can always decide which store is used (cached / uncached).
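
Roughly what I mean, as a hypothetical shape only (simplified stand-in types, not the real signature):

// Hypothetical shape of the suggestion; the nested types are simplified
// stand-ins, not Ozone's real classes.
final class ParseSignatureSketch {

  interface DatanodeStore { }                  // an already-open DB handle, cached or uncached
  static final class KeyValueContainerData { }

  // The caller decides where 'store' came from (ContainerCache or a direct
  // uncached open), so this method needs no exception-driven fallback logic.
  static void parseKVContainerData(KeyValueContainerData data, DatanodeStore store) {
    // read block count, bytes used, etc. from the supplied store
  }
}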

@sodonnel (Contributor, Author) commented:

@elek thanks for taking a look.

Are you concerned about the exception handler:

} finally {
      if (cachedDB != null) {
        // If we get a cached instance, calling close simply decrements the
        // reference count.
        cachedDB.close();
      } else if (store != null) {
        // We only stop the store if cacheDB is null, as otherwise we would
        // close the rocksDB handle in the cache and the next reader would fail
        try {
          store.stop();
        } catch (IOException e) {
          throw e;
        } catch (Exception e) {
          throw new RuntimeException("Unexpected exception closing the " +
              "RocksDB when loading containers", e);
        }
      }
}

Or that we may have either an uncached DatanodeStore instance or a cached store, which comes out as a ReferenceCountedDB?

The code messiness is all driven by the static ContainerCache.

Can we use an uncached db in both cases? In that case it would be possible to always create it uncached.

Yes, the intention is to always use an uncached DB in both these cases, and provided the Datanode is started from a cold JVM or the container has never been loaded yet (the import case), all will be fine. The original version did this and it was much cleaner, but lots of tests failed due to the static ContainerCache. If the RocksDB is already open in the cache, you cannot open another handle to it or it will throw an exception. So in the test case with MiniClusters, the code will try to get an uncached DB, but it may fail because the DB is already open. With the current logic, when it fails it will get the instance from the cache. This makes the cleanup in the finally block messy, as we have to handle both cases.

I did have another attempt at this change, where I tried to always return a ReferenceCountedDB from getUncached and handle the exception within that method. However, at the end you need to decrement the reference and also close the DB, but only if it did not come from the cache; if the handle came from the cache, you should just decrement and not close. At that point you don't know which is which and cannot make the decision.

If not, we can add DatanodeStore as a parameter of KeyValueContainerUtil.parseKVContainerData so the caller can always decide which store is used (cached / uncached).

That probably won't make it any cleaner. We always want to pass an uncached DB, but in the test case we will sometimes need to fall back to the cache. Then we need to increment and decrement the reference count and then maybe call close. It just pushes the same messy logic elsewhere.

Another thing I thought of was to check whether the RocksDB is already cached. However, that involves taking a read lock on the ContainerCache. The cache also has no existing method in its interface to make this check, and calling get(...) returns or creates a new entry in the cache. All this logic would be needed only for the tests and would introduce some contention on the lock, and avoiding that contention is the point of this change. In other words, we would be checking for existence at runtime and never finding anything, unless we are in a mini-cluster.

@sodonnel (Contributor, Author) commented:

@elek Any further thoughts here? I feel it is hard to make this change much cleaner due to the static ContainerCache.

@elek (Member) commented Nov 19, 2020

Thanks for the detailed explanation @sodonnel, I appreciate it.

If the RocksDB is already open in the cache, you cannot open another handle to it or it will throw an exception.

That was the key sentence for understanding the problem; now I get it. If there is another open db instance somewhere in the system (i.e. the ContainerCache) we have no way to open it again.

Now I hate the static nature of ContainerCache even more, and I have no better idea for this case. I just hope that we will have time to fix ContainerCache in general ;-)

I have a slightly bad feeling that this change also affects the code path of the container import during closed-container replication. But I thought hard and couldn't find any edge case where it can really fail. As the container is not visible until it is added to the containerSet AFTER the import, it should work.

So let me repeat my +1 and merge it.

@elek elek merged commit 58ec7f4 into apache:master Nov 19, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Nov 24, 2020
* HDDS-3698-upgrade: (46 commits)
  HDDS-4468. Fix Goofys listBucket large than 1000 objects will stuck forever (apache#1595)
  HDDS-4417. Simplify Ozone client code with configuration object -- addendum (apache#1581)
  HDDS-4476. Improve the ZH translation of the HA.md in doc. (apache#1597)
  HDDS-4432. Update Ratis version to latest snapshot. (apache#1586)
  HDDS-4488. Open RocksDB read only when loading containers at Datanode startup (apache#1605)
  HDDS-4478. Large deletedKeyset slows down OM via listStatus. (apache#1598)
  HDDS-4452. findbugs.sh couldn't be executed after a full build (apache#1576)
  HDDS-4427. Avoid ContainerCache in ContainerReader at Datanode startup (apache#1549)
  HDDS-4448. Duplicate refreshPipeline in listStatus (apache#1569)
  HDDS-4450. Cannot run ozone if HADOOP_HOME points to Hadoop install (apache#1572)
  HDDS-4346.Ozone specific Trash Policy (apache#1535)
  HDDS-4426. SCM should create transactions using all blocks received from OM (apache#1561)
  HDDS-4399. Safe mode rule for piplelines should only consider open pipelines. (apache#1526)
  HDDS-4367. Configuration for deletion service intervals should be different for OM, SCM and datanodes (apache#1573)
  HDDS-4462. Add --frozen-lockfile to pnpm install to prevent ozone-recon-web/pnpm-lock.yaml from being updated automatically (apache#1589)
  HDDS-4082. Create ZH translation of HA.md in doc. (apache#1591)
  HDDS-4464. Upgrade httpclient version due to CVE-2020-13956. (apache#1590)
  HDDS-4467. Acceptance test fails due to new Hadoop 3 image (apache#1594)
  HDDS-4466. Update url in .asf.yaml to use TLP project (apache#1592)
  HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585)
  ...
errose28 added a commit to errose28/ozone that referenced this pull request Nov 25, 2020
* HDDS-3698-upgrade: (47 commits)