
HDDS-10246. Remove KeyValueHandler.checkContainerIsHealthy to improve read performance #6127

Merged: 2 commits into apache:master on Feb 1, 2024

Conversation

@whbing (Contributor) commented Jan 30, 2024:

What changes were proposed in this pull request?

As described in HDDS-10146, all of the ChunkReader threads are blocked in the KeyValueHandler#checkContainerIsHealthy method, waiting for the KeyValueContainer#readLock.

Remove the readLock from KeyValueHandler.checkContainerIsHealthy to improve performance.
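
For context, here is a minimal sketch of the pattern in question (names taken from the diff and discussion below; the real method body in KeyValueHandler may differ, and the log text is hypothetical): every read request first takes the container read lock just to check and log the UNHEALTHY state.

    // Sketch only, not the exact Ozone source: the lock-then-log pattern this PR targets.
    private void checkContainerIsHealthy(KeyValueContainer kvContainer,
        BlockID blockID, Type cmd) {
      kvContainer.readLock();   // queues behind any writer, e.g. BlockDeletingTask
      try {
        if (kvContainer.getContainerData().getState() == State.UNHEALTHY) {
          LOG.warn("{} request {} on UNHEALTHY container {}", cmd, blockID,
              kvContainer.getContainerData().getContainerID());
        }
      } finally {
        kvContainer.readUnlock();
      }
    }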

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10246

How was this patch tested?

  • No new tests; passes existing unit tests
  • Works well in our cluster

@whbing (Contributor, Author) commented Jan 30, 2024:

@ChenSammi PTAL, thanks!

Review thread on the changed lines in checkContainerIsHealthy (the removed locking and the new unlocked state check):

    } finally {
      kvContainer.readUnlock();

    // No kvContainer.readLock() for performance optimization
    if (kvContainer.getContainerData().getState() == State.UNHEALTHY) {

A contributor commented:
Are you seeing performance problems when the container is unhealthy, along with a lot of log messages?

This lock is on a container, not "all containers at the same time", so do you also have a very high volume of requests for the same block?

At the very least, we should not log under a lock, as logging can be slow, but I would like to understand more about the problem before just removing the lock. On first look, it does not seem like it should be a problem.

whbing (Contributor, Author) replied:

  private ContainerBackgroundTaskResult handleDeleteTask() throws Exception {
    ContainerBackgroundTaskResult crr;
    final Container container = ozoneContainer.getContainerSet()
        .getContainer(containerData.getContainerID());
    container.writeLock();

BlockDeletingTask#handleDeleteTask holds KeyValueContainer.writeLock(), and sometimes it holds it for a long time, as the following logs show:

2024-01-04 16:30:16,752 [BlockDeletingService#8] INFO org.apache.hadoop.ozone.container.keyvalue.statemachine.background.BlockDeletingTask: Max lock hold time (1000 ms) reached after 1390 ms. Completed 1 transactions, deleted 0 blocks. In container: 5455053. Releasing lock and resuming deletion later.
2024-01-04 17:01:16,897 [BlockDeletingService#1] INFO org.apache.hadoop.ozone.container.keyvalue.statemachine.background.BlockDeletingTask: Max lock hold time (1000 ms) reached after 1292 ms. Completed 1 transactions, deleted 0 blocks. In container: 5458979. Releasing lock and resuming deletion later.

This results in long waits when data in the same container is read.
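
A toy illustration (plain Java, not Ozone code; class and variable names are made up) of why those log lines translate into slow reads: while one thread holds the container write lock for about 1.3 seconds, every reader that asks for the read lock on the same container stalls for the rest of that window.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class WriteLockStall {
      public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock containerLock = new ReentrantReadWriteLock();

        // Stand-in for BlockDeletingTask#handleDeleteTask holding the write lock.
        Thread deleter = new Thread(() -> {
          containerLock.writeLock().lock();
          try {
            TimeUnit.MILLISECONDS.sleep(1300);   // simulated long delete transaction
          } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
          } finally {
            containerLock.writeLock().unlock();
          }
        });

        // Stand-in for a ChunkReader thread calling checkContainerIsHealthy.
        Thread reader = new Thread(() -> {
          long start = System.nanoTime();
          containerLock.readLock().lock();
          containerLock.readLock().unlock();
          System.out.printf("read lock acquired after %d ms%n",
              (System.nanoTime() - start) / 1_000_000);
        });

        deleter.start();
        TimeUnit.MILLISECONDS.sleep(50);         // let the writer grab the lock first
        reader.start();
        deleter.join();
        reader.join();
      }
    }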

A contributor commented:
@sodonnel, you can find the detailed info under HDDS-10146. The scenario is that when a lot of block deletions are happening, reads of other blocks in the same container as the blocks being deleted are blocked because they cannot acquire the container's read lock. The lock is held by the thread doing the block deletion.

A contributor commented:
I don't think removing this lock will help much because:

isContainerHealthy() calls kvContainer.getContainerData().getState() which looks like:

  public synchronized ContainerDataProto.State getState() {
    return state;
  }

That is actually worse than the read lock, as the synchronized method prevents even concurrent readers. Even after removing this read lock, you will still need to deal with the synchronized method. Quite a few methods on that class are synchronized.

All this method is doing is logging a message if the container is unhealthy - do we need to do that? The method is called in only 4 places, always in this sort of pattern:

    try {
      BlockID blockID = BlockID.getFromProtobuf(
          request.getGetBlock().getBlockID());
      checkContainerIsHealthy(kvContainer, blockID, Type.GetBlock);
      responseData = blockManager.getBlock(kvContainer, blockID)
          .getProtoBufMessage();
      final long numBytes = responseData.getSerializedSize();
      metrics.incContainerBytesStats(Type.GetBlock, numBytes);

    } catch (StorageContainerException ex) {
      return ContainerUtils.logAndReturnError(LOG, ex, request);
    } catch (IOException ex) {
      return ContainerUtils.logAndReturnError(LOG,
          new StorageContainerException("Get Key failed", ex, IO_EXCEPTION),
          request);
    }

    return getBlockDataResponse(request, responseData);
  }

If the container is unhealthy but we don't get an error on the read, do we really care that it is unhealthy? Perhaps checkContainerIsHealthy could be moved into the exception block so we only log when there are problems?

Another question is - does ContainerData.getState() even need to be synchronized? As I understand it, updating a reference in Java is atomic - and the state is just a reference which can be changed by updateState() - perhaps it is fine for getState() to not be synchronized.
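
A minimal sketch of that alternative (not part of this PR; a plain enum stands in for ContainerDataProto.State, and it assumes the state field is not updated together with other fields under the same monitor, which would need checking in ContainerData): making the field volatile keeps writes visible to readers while letting getState() proceed without a monitor.

    // Minimal sketch, not the actual Ozone class.
    public class ContainerStateHolder {
      public enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY, DELETED }

      // volatile gives cross-thread visibility without a monitor,
      // so concurrent readers never serialize on getState().
      private volatile State state = State.OPEN;

      public State getState() {
        return state;          // plain volatile read
      }

      public void setState(State newState) {
        state = newState;      // writing a reference/enum is atomic in Java
      }
    }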

whbing (Contributor, Author) replied:

@sodonnel Thanks for the advice.
Here is a jstack in which the ChunkReader threads wait on the readLock in WAITING state for a long time: dn.chunkreader-waiting.jstack.txt
Removing this read lock solves the problem directly; from that point of view, it is effective.

All this method is doing is logging a message if the container is unhealthy - do we need to do that?

I did find some of these unhealthy logs, but they did not affect reading. It seems some optimization can indeed be done here.

@ChenSammi (Contributor) commented Jan 31, 2024:

Agree with @sodonnel that we can actually remove the checkContainerIsHealthy call, as it just logs one message, which is not very useful. @whbing, could you help update the patch?

Regarding the synchronized functions in ContainerData, all of them are simple actions/checks against container state. I feel the performance overhead is not that significant, but we need some data to judge that. Anyway, it's a new and separate topic.
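
If someone does want data on that, a rough JMH sketch along these lines could compare a synchronized getter with a volatile read under reader contention (hypothetical class and method names; it measures only getter overhead, not a real container workload):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.Threads;

    @State(Scope.Benchmark)
    public class GetStateBenchmark {
      public enum ContainerState { OPEN, CLOSED, UNHEALTHY }

      private volatile ContainerState volatileState = ContainerState.OPEN;
      private ContainerState syncState = ContainerState.OPEN;

      private synchronized ContainerState getSyncState() {
        return syncState;
      }

      @Benchmark
      @Threads(8)                       // 8 reader threads contend on the monitor
      public ContainerState synchronizedGetter() {
        return getSyncState();
      }

      @Benchmark
      @Threads(8)                       // 8 reader threads, no monitor involved
      public ContainerState volatileGetter() {
        return volatileState;
      }
    }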

A contributor commented:

I realized after I wrote the above that the synchronization on the getter is probably not an issue. The root cause of the problem is the write lock being held for a long time by the delete thread. It would still be best to avoid the synchronization too if we can just remove the call entirely; the log message doesn't add much value compared to the cost of serializing readers through it.

whbing (Contributor, Author) replied:

@ChenSammi @sodonnel Updated with a commit that simply removes the checkContainerIsHealthy method and its calls.

@devmadhuu (Contributor) left a comment:

Thanks @whbing for working on this patch. I had a quick look at the PR and somewhat agree with @sodonnel: org.apache.hadoop.ozone.container.common.impl.ContainerData#setState is synchronized and is called from two callers inside ContainerData, quasiCloseContainer and closeContainer, which are themselves called under the container's write lock, so container state changes are already inside the write lock of the container object. So I agree that taking kvContainer.readLock inside checkContainerIsHealthy may be excessive, since it waits for a lock on the whole container for GetBlock or GetCommittedBlockLength calls. But replacing it with kvContainer.getContainerData().getState() may not be the right solution either, as getState is a fully synchronized method. We need to see whether this method really needs to be synchronized; I feel it does not.

@whbing (Contributor, Author) commented Jan 31, 2024:

as getState is a fully synchronized method. We need to see whether this method really needs to be synchronized; I feel it does not.

Not sure to what extent it affects performance. At least in practice, we haven't encountered any bottlenecks caused by the synchronized getState yet. Maybe we can revisit it in the future if the problem occurs?

@sodonnel (Contributor) left a comment:

LGTM

@ChenSammi (Contributor) left a comment:

LGTM +1.
Thanks @whbing for reporting the issue and providing the fix, and @sodonnel @devmadhuu for the review.

@ChenSammi ChenSammi changed the title HDDS-10246. Remove readLock from KeyValueHandler.checkContainerIsHealthy HDDS-10246. Remove KeyValueHandler.checkContainerIsHealthy to improve read performance Feb 1, 2024
@ChenSammi ChenSammi merged commit 528fc92 into apache:master Feb 1, 2024
36 checks passed
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Jun 10, 2024
…y to improve read performance (apache#6127)

(cherry picked from commit 528fc92)
Change-Id: Ie4438d8e95a1148e7050f16201b4c890ddbee400
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024