HDDS-10937. Ozone Recon - Handle startup failure and log reasons as error due to SCM non-HA scenario #6752

Merged
merged 6 commits into from
Jun 4, 2024

devmadhuu
Contributor

What changes were proposed in this pull request?

This PR handles the case where Recon fails to start because of an unexpected runtime error and shuts down silently without logging the actual cause of the error.

Ozone Recon - Handle startup failure and log reasons as error due to SCM non-HA scenario

When the ReconServer.start() method is called, any runtime errors it throws are not logged, so Recon startup fails silently without giving any reason for the failure. For example, in the non-HA SCM case, a few configurations cause Recon to fail to start. This scenario, and any other runtime exception, needs to be handled so that the cause is logged properly.
In the non-HA SCM case, SCM peer roles are not returned correctly while fetching the SCM DB snapshot, so the snapshot logic needs to handle that.
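
To make the intended behaviour concrete, here is a minimal sketch of the "catch, log as error, then fail" pattern described above. It is not the code from this patch: `StartupGuard` and `runStartup` are hypothetical names, and `java.util.logging` stands in for whatever logger Recon actually uses so the example stays self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Ozone: wraps a startup action so that
// a runtime failure is always logged before the process gives up.
final class StartupGuard {
  private static final Logger LOG =
      Logger.getLogger(StartupGuard.class.getName());

  private StartupGuard() {
  }

  /**
   * Runs the startup action.
   * Returns 0 on success, or 1 after logging the cause of a runtime failure.
   */
  static int runStartup(Runnable startAction) {
    try {
      startAction.run();
      return 0;
    } catch (RuntimeException e) {
      // Log at error level with the full stack trace so the failure
      // reason is visible instead of being swallowed.
      LOG.log(Level.SEVERE, "Startup failed", e);
      return 1;
    }
  }
}
```

A caller could pass the returned value to `System.exit` so that a failed startup still produces a non-zero exit code alongside the logged cause.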

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10937

How was this patch tested?

Tested manually and added integration tests in Recon for the non-HA SCM case.

@devmadhuu
Contributor Author

@nandakumar131 @sumitagrawl kindly review.

@devmadhuu devmadhuu marked this pull request as draft May 30, 2024 14:55
@devmadhuu devmadhuu marked this pull request as ready for review May 31, 2024 06:21
Contributor

@nandakumar131 nandakumar131 left a comment

Thanks @devmadhuu for working on this. Overall the fix looks good.
Added some minor review comments.


@Test
public void testScmNonHASnapshot() throws Exception {
//ozoneCluster.getReconServer().getStorageContainerServiceProvider().
Contributor

NIT: This line can be removed.

// Start all services
start();
isStarted = true;
LOG.debug("Start of all services of Recon completed successfully !!!");
Contributor

NIT: Do we need this log statement?

Comment on lines 189 to 192
if (!SCMHAUtils.isSCMHAEnabled(configuration)) {
SecurityUtil.doAsLoginUser(() -> {
try (InputStream inputStream = reconUtils.makeHttpCall(
connectionFactory, getScmDBSnapshotUrl(),
isOmSpnegoEnabled()).getInputStream()) {
FileUtils.copyInputStreamToFile(inputStream, targetFile);
}
return null;
});
fetchSCMDBSnapshotUsingHttpClient(targetFile);
LOG.info("Downloaded SCM Snapshot from SCM");
} else {
Contributor

We can get rid of the isSCMHAEnabled check and rely completely on the output of getScmInfo.

Contributor

@nandakumar131
Recon is derived from SCM, so a lot of SCM functionality is implicitly available in Recon. We may need to fix Recon so that it knows before startup whether SCM is in HA mode, for those other code paths as well.

Also, this check is used in Freon, the datanode, and the CLI tool via HAUtils for getting CA certificates. As per the code, a change here can impact them.

Getting rid of this check in all places needs a different solution.

Contributor

@sumitagrawl I didn't mean to remove the isSCMHAEnabled method completely.
We just don't need to call isSCMHAEnabled here in this case.

Comment on lines 200 to 201
if (role.length <= 2) {
fetchSCMDBSnapshotUsingHttpClient(targetFile);
Contributor

Instead of the role.length check, we can check ratisRoles.size and enter the for loop only if the size is greater than 1.
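
A rough sketch of that suggestion, for illustration only. It assumes each entry in `ratisRoles` is a colon-separated string whose first field is a host and whose third field is the role; that format, the `LeaderLookup` class, and the `findLeaderHost` method are assumptions made for this example, not the actual Recon code.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Ozone: decides from the number of
// reported roles whether a leader lookup is worth attempting.
final class LeaderLookup {
  private LeaderLookup() {
  }

  /**
   * Returns the leader host if more than one role is reported.
   * An empty result means the caller should fall back to the
   * HTTP-based snapshot download.
   */
  static Optional<String> findLeaderHost(List<String> ratisRoles) {
    // Enter the loop only when there is more than one role entry.
    if (ratisRoles == null || ratisRoles.size() <= 1) {
      return Optional.empty();
    }
    for (String entry : ratisRoles) {
      // Assumed entry format: host:port:ROLE
      String[] role = entry.split(":");
      if (role.length > 2 && "LEADER".equals(role[2])) {
        return Optional.of(role[0]);
      }
    }
    return Optional.empty();
  }
}
```

With this shape, a single-entry or empty role list never reaches the per-entry parsing, so the separate length check on each split entry is only a safety net rather than the branch that selects the HTTP path.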

Contributor

When Ratis is enabled for a single-node SCM, it will have roles and the same logic can be used. The only problem is an old SCM where Ratis is not enabled (identified by dynamically overriding this property).
We need to consider the impact in all places in Recon, since it uses the same SCM code but this property is not overridden there. We may face the problem whenever that functionality is hit. We need to find a better solution.

Contributor Author

As discussed, created a separate JIRA HDDS-10957 to track.

Contributor

@sumitagrawl @devmadhuu
The goal is to have Ratis enabled for all the cases whether it's HA or Non-HA.

Going forward, we should deprecate and remove the property ozone.scm.ratis.enable and use Ratis by default.
The reason for introducing the property in the first place was to support upgrading old clusters and to stay backward compatible.

We will keep this property for as long as we support the old Ozone releases; we can deprecate it after that.
Even now, we cannot disable the property once it's enabled.
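
To illustrate the "cannot disable once enabled" rule, here is a small sketch. Only the property name comes from this thread; the `RatisEnableCheck` class, the `previouslyEnabled` flag, and the explicit `assumedDefault` parameter are hypothetical, and `java.util.Properties` stands in for the real Ozone configuration object.

```java
import java.util.Properties;

// Hypothetical validation, not part of Ozone: rejects a configuration
// that tries to turn Ratis off after it has already been enabled.
final class RatisEnableCheck {
  static final String KEY = "ozone.scm.ratis.enable";

  private RatisEnableCheck() {
  }

  /**
   * Resolves the effective value of the property.
   * previouslyEnabled: whether Ratis was already enabled (assumed to be
   * known from persisted state).
   * assumedDefault: value used when the property is unset; passed in
   * explicitly because this sketch does not claim what the real default is.
   */
  static boolean resolve(Properties conf, boolean previouslyEnabled,
      boolean assumedDefault) {
    boolean configured = Boolean.parseBoolean(
        conf.getProperty(KEY, String.valueOf(assumedDefault)));
    if (previouslyEnabled && !configured) {
      throw new IllegalStateException(
          KEY + " cannot be disabled once it has been enabled");
    }
    return configured;
  }
}
```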

Contributor

@nandakumar131 nandakumar131 left a comment

+1

@nandakumar131 nandakumar131 merged commit f3a0dbd into apache:master Jun 4, 2024
39 checks passed
@nandakumar131
Contributor

Thanks @devmadhuu for the fix and thanks to @sumitagrawl for the review!

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jun 15, 2024
…rror due to SCM non-HA scenario (apache#6752)

(cherry picked from commit f3a0dbd)
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jun 17, 2024
4 participants