HDDS-11187. Fix Event Handling in Recon OMDBUpdatesHandler to Prevent ClassCastException #6950
…event ClassCastException in Recon Server.
@sumitagrawl Can you please take a look?
The root problem on the OM side persists and requires fixing
Is there a Jira to track this? If not, we should create one. Also, I'm not clear on how OM sends a wrong event update to Recon. Are we sure OM is at fault here?
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/OMDBUpdatesHandler.java
LGTM
OM is sending the correct set of events. The problem is in the Recon-side logic: the map that keeps track of the previous event (used to identify a PUT -> UPDATE transition) can return an entry whose value has a different type.
Since this is a KeyTable event, the expected value type is KeyInfo, but the stored value has type RepeatedKeyInfo, so the lookup fails with a ClassCastException. As a solution, previous events are now kept per table ("table" -> "key" -> "value") to avoid this conflict.
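The failure mode described in this comment can be sketched in isolation. This is a minimal illustration, not the Recon code: `KeyInfo` and `RepeatedKeyInfo` are empty stand-in classes named after the types mentioned above, and the key string is made up. It only shows that a single flat map shared by all tables can hand back a value of the wrong type for the same key.

```java
import java.util.HashMap;
import java.util.Map;

class FlatMapCollision {
  // Hypothetical stand-ins for the two value types named in the comment.
  static class KeyInfo { }
  static class RepeatedKeyInfo { }

  /**
   * Returns true if reading the "previous value" for a key fails with a
   * ClassCastException because another table stored a different type
   * under the same key in the shared flat map.
   */
  static boolean collides() {
    // One flat map for all tables, indexed by key only (assumed layout).
    Map<Object, Object> latestByKey = new HashMap<>();

    // An event from some other table stores a RepeatedKeyInfo under this key.
    latestByKey.put("/vol/bucket/a", new RepeatedKeyInfo());

    try {
      // A key-table event with the same key expects a KeyInfo here.
      KeyInfo previous = (KeyInfo) latestByKey.get("/vol/bucket/a");
      return previous == null; // not reached: the cast above throws
    } catch (ClassCastException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("collision: " + collides());
  }
}
```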
Then this statement in the description is contradictory: "The root problem on the OM side persists and requires fixing". What still needs to be fixed apart from this patch?
@sadanand48 I will add the UT for the changes now.
How to Replicate the ClassCastException
To understand the unit test, it's crucial to understand how the [...]
The updated value is fetched from the OM side, while the oldValue is fetched from an existing map inside the [...]
For example, consider the [...]
Here, using the same name for a file and a directory will cause an error. If we execute the following commands, we will encounter a ClassCastException because the file name and directory name are the same:
Breakdown of the Error:
To prevent such issues, we implemented a safeguard (HDDS-8310) that checks for value mismatches and ignores such events. However, ignoring these events is not ideal, as it can lead to data inconsistency. For example, Recon would never know about the directory.
cc: @sadanand48
Thanks @ArafatKhan2198 for the explanation. I think the root cause for the mismatch should be fixed by this patch.
LGTM.
… ClassCastException. (apache#6950) (cherry picked from commit 86c4339)
…-delete
* HDDS-10239-container-reconciliation: (184 commits)
  HDDS-10373. Implement framework for capturing Merkle Tree Metrics. (apache#6864)
  HDDS-11188. Initial setup for new UI layout and enable users to switch to new UI (apache#6953)
  HDDS-11120. Rich rebalancing status info (apache#6911)
  HDDS-11187. Fix Event Handling in Recon OMDBUpdatesHandler to Prevent ClassCastException. (apache#6950)
  HDDS-11213. Bump commons-daemon to 1.4.0 (apache#6971)
  HDDS-11212. Bump commons-net to 3.11.1 (apache#6973)
  HDDS-11211. Bump assertj-core to 3.26.3 (apache#6972)
  HDDS-11210. Bump log4j2 to 2.23.1 (apache#6970)
  HDDS-11150. Recon Overview page crashes due to failed API Calls (apache#6944)
  HDDS-11183. Keys from DeletedTable and DeletedDirTable of AOS should be deleted on batch operation while creating a snapshot (apache#6946)
  HDDS-11198. Fix Typescript configs for Recon (apache#6961)
  HDDS-11180. Simplify HttpServer2#inferMimeType return statement (apache#6963)
  HDDS-11194. OM missing audit log for upgrade (apache#6958)
  HDDS-10389. Implement a search feature for users to locate open keys within the Open Keys Insights section. (apache#6231)
  HDDS-10561. Dashboard for delete key metrics (apache#6948)
  HDDS-11192. Increase SPNEGO URL test coverage (apache#6956)
  HDDS-11179. DBConfigFromFile#readFromFile result of toIOException not thrown (apache#6957)
  HDDS-11186. First container log missing from bundle (apache#6952)
  HDDS-10844. Clarify snapshot create error message. (apache#6955)
  HDDS-11166. Switch to Rocky Linux-based ozone-runner (apache#6942)
  ...
What changes were proposed in this pull request?
Explanation of the Changes:
Map Structure Change: The omdbLatestUpdateEvents map has been changed from Map<Object, OMDBUpdateEvent> to Map<String, Map<Object, OMDBUpdateEvent>>. This nested map structure ensures that each table's events are stored separately, avoiding key collisions between different tables.
Event Processing Logic: The processEvent method now uses the table name as the first-level key in the map. If a key already exists for a table, it retrieves the nested map and updates or adds the event. This change ensures that events from different tables with the same key structure are isolated and correctly processed.
How it Fixes the Problem
This change ensures that events from different tables with the same key structure do not overwrite each other, preventing the ClassCastException caused by fetching incorrect event types. By segregating events by table name, we avoid the corruption caused by key collisions, leading to more reliable event processing in Recon.
Logs and Future Fixes
We have previously added logs to capture events that could lead to a ClassCastException. These logs help identify corrupted events generated from the OM side, which need to be fixed. Our changes in this patch address possible corruption in event creation on the Recon side. The root problem on the OM side persists and requires fixing. The logs added in the previous patch (HDDS-8310) will help report these issues, and we should wait for these logs to guide further fixes.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11187
How was this patch tested?
Existing UTs for TestOMDBUpdatesHandler passed successfully.
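As a rough illustration of the per-table isolation this patch relies on, the sketch below keeps the latest event per table and per key in a nested map. It is not the OMDBUpdatesHandler code: `Event` is a simplified, hypothetical stand-in for OMDBUpdateEvent, the `record` method is invented for the example, and the table names and key are made up.

```java
import java.util.HashMap;
import java.util.Map;

class PerTableEventCache {
  /** Hypothetical, simplified stand-in for OMDBUpdateEvent. */
  static final class Event {
    final String table;
    final Object key;
    final Object value;

    Event(String table, Object key, Object value) {
      this.table = table;
      this.key = key;
      this.value = value;
    }
  }

  // table name -> (key -> latest event seen for that key in that table)
  private final Map<String, Map<Object, Event>> latestByTable = new HashMap<>();

  /**
   * Stores the event and returns the previous event recorded for the
   * same table and key, or null if there was none.
   */
  Event record(Event event) {
    return latestByTable
        .computeIfAbsent(event.table, t -> new HashMap<>())
        .put(event.key, event);
  }

  public static void main(String[] args) {
    PerTableEventCache cache = new PerTableEventCache();
    cache.record(new Event("keyTable", "/vol/bucket/a", "key-info"));

    // Same key, different table: no previous event is returned,
    // so the key-table entry is neither read nor overwritten.
    Event previous =
        cache.record(new Event("deletedTable", "/vol/bucket/a", "repeated-key-info"));
    System.out.println("previous from other table: " + previous);
  }
}
```

With this layout, a lookup for a given table can only return events that were recorded for that same table, which is the property the description above depends on.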