-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-11284. refactor quota repair non-blocking while upgrade #7035
Conversation
try { | ||
for (int i = 0; i < quotaRepairRequest.getBucketCountCount(); ++i) { | ||
OzoneManagerProtocolProtos.BucketQuotaCount bucketCountInfo = quotaRepairRequest.getBucketCount(i); | ||
updateBucketInfo(omMetadataManager, bucketCountInfo, transactionLogIndex, bucketMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need permission check. Suggest only allow admin to perform this operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is internal request, and hence do not need permission check. For exposed interface, its required. This is similar to other internal request like OMOpenKeysDeleteRequest, OMDirectoriesPurgeRequestWithFSO.
Once have CLI request, need have that validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OM exposes the API publicly. It cannot prohibit requests from external users, that's why permission check is required for every public APIs.
We also should have permission check for OMOpenKeysDeleteRequest and OMDirectoriesPurgeRequestWithFSO, but it's out side of this jira.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
arg.checkLeaderStatus(); | ||
QuotaRepairTask quotaRepairTask = new QuotaRepairTask(arg); | ||
quotaRepairTask.repair(); | ||
} catch (OMNotLeaderException | OMLeaderNotReadyException ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMLeaderNotReadyException means this OM is leader but not ready. We can have some wait and retry logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be required as available for upgrade to 7.1.9 as one time activity, present as was earlier there, need go with CLI tool instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to keep this. But please change the LOG wording, it's misleading when exception is OMLeaderNotReadyException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated logs.
BUCKET_LOCK, e.getVolumeName(), e.getBucketName())); | ||
repairCount(); | ||
// thread pool with 3 Table type * (1 task each + 3 thread each) | ||
executor = Executors.newFixedThreadPool(12); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the TASK_THREAD_CNT to form the expression instead of "12".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.setClientId(clientId.toString()) | ||
.build(); | ||
OzoneManagerProtocolProtos.OMResponse response = submitRequest(omRequest, clientId); | ||
if (response != null && response.getSuccess()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
response.getSuccess() - > !response.getSuccess()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
// update volume to support quota | ||
builder.setSupportVolumeOldQuota(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's always true, then suggest to remove this flag in the proto definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a plan to trigger at bucket level, in that case, this is required. In this case, may not have usecase, but planning to have control with CLI option.
OzoneManagerProtocolProtos.OMResponse response = submitRequest(omRequest, clientId); | ||
if (response != null && response.getSuccess()) { | ||
LOG.error("update quota repair count response failed"); | ||
REPAIR_STATUS.updateStatus("Response for update DB is failed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
4d9df84
to
2c749fe
Compare
* master: (50 commits) HDDS-11331. Fix Datanode unable to report for a long time (apache#7090) HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102) HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103) HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974) HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035) HDDS-9790. Add tests for Overview page (apache#6983) HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074) HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098) HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099) HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094) HDDS-11155. Improve Volumes page UI (apache#7048) HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093) HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037) HDDS-11323. Mark TestLeaseRecovery as flaky HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085) HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084) HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079) HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky HDDS-11336. Bump slf4j to 2.0.16 (apache#7086) HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087) ... Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
* master: (50 commits) HDDS-11331. Fix Datanode unable to report for a long time (apache#7090) HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102) HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103) HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974) HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035) HDDS-9790. Add tests for Overview page (apache#6983) HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074) HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098) HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099) HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094) HDDS-11155. Improve Volumes page UI (apache#7048) HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093) HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037) HDDS-11323. Mark TestLeaseRecovery as flaky HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085) HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084) HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079) HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky HDDS-11336. Bump slf4j to 2.0.16 (apache#7086) HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087) ... Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
…an-on-error * HDDS-10239-container-reconciliation: (428 commits) HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091) HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089) Mark TestPipelineManagerMXBean#testPipelineInfo as flaky Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081) HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075) HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047) HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076) HDDS-11331. Fix Datanode unable to report for a long time (apache#7090) HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102) HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103) HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974) HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035) HDDS-9790. Add tests for Overview page (apache#6983) HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074) HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098) HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099) ... Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
…rrupt-files * HDDS-10239-container-reconciliation: (61 commits) HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091) HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089) Mark TestPipelineManagerMXBean#testPipelineInfo as flaky Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081) HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075) HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047) HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076) HDDS-11331. Fix Datanode unable to report for a long time (apache#7090) HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102) HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103) HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974) HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035) HDDS-9790. Add tests for Overview page (apache#6983) HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074) HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098) HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099) ... Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
What changes were proposed in this pull request?
Changes:
The target for this refactor is to
integrate with tool CLI
option to trigger again and verify for correctness. This is tracked withParent ID:
HDDS-8824What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11284
How was this patch tested?