HDDS-10937. Ozone Recon - Handle startup failure and log reasons as error due to SCM non-HA scenario #6752
…ror due to SCM non-HA scenario.
@nandakumar131 @sumitagrawl kindly review. |
Thanks @devmadhuu for working on this. Overall the fix looks good.
Added some minor review comments.
@Test
public void testScmNonHASnapshot() throws Exception {
  //ozoneCluster.getReconServer().getStorageContainerServiceProvider().
NIT: This line can be removed.
// Start all services
start();
isStarted = true;
LOG.debug("Start of all services of Recon completed successfully !!!");
NIT: Do we need this log statement?
if (!SCMHAUtils.isSCMHAEnabled(configuration)) {
  SecurityUtil.doAsLoginUser(() -> {
    try (InputStream inputStream = reconUtils.makeHttpCall(
        connectionFactory, getScmDBSnapshotUrl(),
        isOmSpnegoEnabled()).getInputStream()) {
      FileUtils.copyInputStreamToFile(inputStream, targetFile);
    }
    return null;
  });
  fetchSCMDBSnapshotUsingHttpClient(targetFile);
  LOG.info("Downloaded SCM Snapshot from SCM");
} else {
We can get rid of the isSCMHAEnabled check and rely completely on the output of getScmInfo.
@nandakumar131
Recon is derived from SCM, so a lot of SCM code is implicitly available in Recon. We may need to fix Recon so that it knows before startup whether SCM is in HA mode, for the other code paths as well.
Also, this is used in freon, datanode, and the CLI tool via HAUtils for getting CA certificates, so a change here can impact them.
This needs a different solution: getting rid of this check in all places.
@sumitagrawl I didn't mean to remove the isSCMHAEnabled method completely.
We just don't need to call the isSCMHAEnabled check here, in this case.
if (role.length <= 2) {
  fetchSCMDBSnapshotUsingHttpClient(targetFile);
Instead of the role.length check, we can check ratisRoles.size and enter the for loop only if the size is greater than 1.
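A minimal sketch of that suggestion, assuming ratisRoles is a list of peer-role strings taken from the getScmInfo output. The class, enum, and method names below are hypothetical, and what the for loop does with each entry is not shown in this thread, so the second branch is only a placeholder.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not the actual Recon code. */
class SnapshotSourceSketch {

  /** Which path to take when fetching the SCM DB snapshot. */
  enum Source { HTTP_CLIENT, PEER_LOOP }

  /**
   * Decides on the list size instead of on the length of a split role entry.
   * With zero or one peer there is nothing to iterate over, so the HTTP
   * client path is used; the loop over peers is entered only for size > 1.
   */
  static Source chooseSource(List<String> ratisRoles) {
    if (ratisRoles == null || ratisRoles.size() <= 1) {
      return Source.HTTP_CLIENT;
    }
    return Source.PEER_LOOP;
  }

  public static void main(String[] args) {
    // The role strings here are made-up placeholders, not a real SCM response.
    System.out.println(chooseSource(Collections.emptyList()));
    System.out.println(chooseSource(Arrays.asList("peer-a")));
    System.out.println(chooseSource(Arrays.asList("peer-a", "peer-b", "peer-c")));
  }
}
```

Keying the decision on the list size keeps the non-HA fallback independent of how an individual role entry happens to be formatted.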
When Ratis is enabled for a single-node SCM, it will have roles and the same can be used. The only problem is an old SCM where Ratis is not enabled (identified dynamically by overriding this property).
We need to consider the impact on Recon in all places, since it uses the same SCM code but this property is not overridden there. We may hit the problem whenever that functionality is exercised. We need to find a better solution.
As discussed, created a separate JIRA HDDS-10957 to track.
@sumitagrawl @devmadhuu
The goal is to have Ratis enabled in all cases, whether HA or non-HA.
Going forward, we should deprecate and remove the property ozone.scm.ratis.enable and use Ratis by default.
The property was introduced in the first place to support upgrading old clusters and to stay backward compatible.
We will keep this property for as long as we support the old Ozone releases, and we can deprecate it after that.
Even now, the property cannot be disabled once it has been enabled.
+1
Thanks @devmadhuu for the fix and thanks to @sumitagrawl for the review!
…rror due to SCM non-HA scenario (apache#6752) (cherry picked from commit f3a0dbd)
…rror due to SCM non-HA scenario (apache#6752)
What changes were proposed in this pull request?
This PR handles the case where Recon startup fails due to an unexpected runtime error and Recon shuts down silently without logging the actual cause of the error.
Ozone Recon - Handle startup failure and log reasons as error due to SCM non-HA scenario
When the ReconServer.start() method is called, any runtime errors that are thrown are not logged, so Recon startup fails silently without giving any reason for the failure. For example, in the non-HA SCM case, a few configurations cause Recon to fail to start. This scenario, like any other runtime exception, needs to be handled with proper logging.
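As a rough illustration of the intended behaviour (not the actual patch), the sketch below wraps a startup action so that any runtime exception is logged at error level with its stack trace before a non-zero exit code is returned. The class and method names are hypothetical, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for a server entry point; not the actual ReconServer class. */
class StartupSketch {
  private static final Logger LOG =
      Logger.getLogger(StartupSketch.class.getName());

  /**
   * Runs the given startup action and returns an exit code.
   * A runtime exception is logged with its cause instead of being lost.
   */
  static int startAndReport(Runnable startAction) {
    try {
      startAction.run();
      return 0;
    } catch (RuntimeException e) {
      // Log the failure reason and stack trace so the shutdown is not silent.
      LOG.log(Level.SEVERE, "Failed to start services", e);
      return 1;
    }
  }

  public static void main(String[] args) {
    int ok = startAndReport(() -> { });
    int failed = startAndReport(() -> {
      throw new IllegalStateException("simulated startup failure");
    });
    System.out.println(ok + " " + failed);
  }
}
```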
In the non-HA SCM case, SCM peer roles are not returned correctly while fetching the SCM DB snapshot, so the logic needs to handle this.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10937
How was this patch tested?
Tested manually and added integration tests in Recon for the non-HA SCM case.