Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-11284. refactor quota repair non-blocking while upgrade #7035

Merged
merged 7 commits into from
Aug 21, 2024

Conversation

sumitagrawl
Copy link
Contributor

What changes were proposed in this pull request?

Changes:

  1. refactor quota repair for running non-blocking
  • DB checkpoint is created before starting quota usages calculation
  • Diff quota usages is identified from checkpoint with original and new quota
  • For diff, quota update is send via ratis to correct quota usages
  • Status is added to be integrated for last run with CLI
  1. while upgrade case, change handling as non-blocking
  • quota repair trigger on FiNALIZE trigger for upgrade (as ratis dependency, its initialized during this phase only)
  • Just trigger, but actual run will happen async

The target for this refactor is to integrate with tool CLI option to trigger again and verify for correctness. This is tracked with Parent ID: HDDS-8824

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11284

How was this patch tested?

  • Unit test is updated to cover scenario

@sumitagrawl sumitagrawl marked this pull request as ready for review August 6, 2024 13:54
@sumitagrawl sumitagrawl requested a review from swamirishi August 7, 2024 10:10
try {
for (int i = 0; i < quotaRepairRequest.getBucketCountCount(); ++i) {
OzoneManagerProtocolProtos.BucketQuotaCount bucketCountInfo = quotaRepairRequest.getBucketCount(i);
updateBucketInfo(omMetadataManager, bucketCountInfo, transactionLogIndex, bucketMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need permission check. Suggest only allow admin to perform this operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is internal request, and hence do not need permission check. For exposed interface, its required. This is similar to other internal request like OMOpenKeysDeleteRequest, OMDirectoriesPurgeRequestWithFSO.

Once have CLI request, need have that validation.

Copy link
Contributor

@ChenSammi ChenSammi Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OM exposes the API publicly. It cannot prohibit requests from external users, that's why permission check is required for every public APIs.

We also should have permission check for OMOpenKeysDeleteRequest and OMDirectoriesPurgeRequestWithFSO, but it's out side of this jira.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

arg.checkLeaderStatus();
QuotaRepairTask quotaRepairTask = new QuotaRepairTask(arg);
quotaRepairTask.repair();
} catch (OMNotLeaderException | OMLeaderNotReadyException ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMLeaderNotReadyException means this OM is leader but not ready. We can have some wait and retry logic here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be required as available for upgrade to 7.1.9 as one time activity, present as was earlier there, need go with CLI tool instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to keep this. But please change the LOG wording, it's misleading when exception is OMLeaderNotReadyException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated logs.

BUCKET_LOCK, e.getVolumeName(), e.getBucketName()));
repairCount();
// thread pool with 3 Table type * (1 task each + 3 thread each)
executor = Executors.newFixedThreadPool(12);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the TASK_THREAD_CNT to form the expression instead of "12".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.setClientId(clientId.toString())
.build();
OzoneManagerProtocolProtos.OMResponse response = submitRequest(omRequest, clientId);
if (response != null && response.getSuccess()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.getSuccess() - > !response.getSuccess()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// update volume to support quota
builder.setSupportVolumeOldQuota(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's always true, then suggest to remove this flag in the proto definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a plan to trigger at bucket level, in that case, this is required. In this case, may not have usecase, but planning to have control with CLI option.

OzoneManagerProtocolProtos.OMResponse response = submitRequest(omRequest, clientId);
if (response != null && response.getSuccess()) {
LOG.error("update quota repair count response failed");
REPAIR_STATUS.updateStatus("Response for update DB is failed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return false here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@sumitagrawl sumitagrawl requested a review from ChenSammi August 14, 2024 11:05
@sumitagrawl sumitagrawl merged commit 86ccc48 into apache:master Aug 21, 2024
39 checks passed
errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
…an-on-error

* HDDS-10239-container-reconciliation: (428 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants