Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-11187. Fix Event Handling in Recon OMDBUpdatesHandler to Prevent ClassCastException #6950

Merged
merged 3 commits into from
Jul 22, 2024

Conversation

ArafatKhan2198
Copy link
Contributor

@ArafatKhan2198 ArafatKhan2198 commented Jul 16, 2024

What changes were proposed in this pull request?

Explanation of the Changes :-

Map Structure Change: The omdbLatestUpdateEvents map has been changed from Map<Object, OMDBUpdateEvent> to Map<String, Map<Object, OMDBUpdateEvent>>. This nested map structure ensures that each table's events are stored separately, avoiding key collisions between different tables.

Event Processing Logic: The processEvent method now uses the table name as the first-level key in the map. If a key already exists for a table, it retrieves the nested map and updates or adds the event. This change ensures that events from different tables with the same key structure are isolated and correctly processed.

How it Fixes the Problem

This change ensures that events from different tables with the same key structure do not overwrite each other, preventing the ClassCastException caused by fetching incorrect event types. By segregating events by table name, we avoid the corruption caused by key collisions, leading to more reliable event processing in Recon.

Logs and Future Fixes

We have previously added logs to capture events that could lead to ClassCastException. These logs help identify corrupted events generated from the OM side, which need to be fixed. Our changes in this patch address possible corruption in event creation on the Recon side. The root problem on the OM side persists and requires fixing. The logs added in the previous patch (HDDS-8310) will help report these issues, and we should wait for these logs to guide further fixes.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11187

How was this patch tested?

Existing UT's for TestOMDBUpdatesHandler passed successfully.

@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review July 16, 2024 08:32
@adoroszlai adoroszlai changed the title HDDS-11187. Fix Event Handling Corruption in OMDBUpdatesHandler to Prevent ClassCastException in Recon Server. HDDS-11187. Fix Event Handling in Recon OMDBUpdatesHandler to Prevent ClassCastException Jul 16, 2024
@ArafatKhan2198
Copy link
Contributor Author

@sumitagrawl Can you please take a look.

Copy link
Contributor

@sadanand48 sadanand48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root problem on the OM side persists and requires fixing

Is there a jira to track this? If not we should create one . Also I'm not clear as to how OM sends a wrong event update to Recon? Are we sure OM is at fault here?

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sumitagrawl
Copy link
Contributor

sumitagrawl commented Jul 17, 2024

The root problem on the OM side persists and requires fixing

Is there a jira to track this? If not we should create one . Also I'm not clear as to how OM sends a wrong event update to Recon? Are we sure OM is at fault here?

OM is sending correct set of events, but there is a problem in recon side logic where map keeping track of previous event for PUT -> UPDATE event identification,

  • event 1: deleteKeyTable - key1 -- RepeatedKeyInfo data
  • event 2: keyTable - key1 -- KeyInfo
    for deletedKeyTable, event 1 is present in omdbLatestUpdateEvents as "key1 : RepeatedkeyInfo"
    when processing put event for keyTable, it check previous event from map, and found "key1 : RepeatedKeyInfo"

Since this is KeyTable event and expected type is KeyInfo, but since having RepeatedKeyInfo value type, it fails with ClassCastException.

As solution, previous events are now kept wrt to "table" vs "key vs value" to avoid this conflict.

@sadanand48
Copy link
Contributor

OM is sending correct set of events.

Then this statement in the description is contradicting - "The root problem on the OM side persists and requires fixing". what is still required to fix apart from this patch?

@ArafatKhan2198
Copy link
Contributor Author

@sadanand48
It has not yet been proven that OM is sending incorrect events. To investigate this possibility, we have added logs to help us determine if this is the case. If this issue arises in the future, the OmUpdateEventValidator should be able to identify and report it in the logs. However, I agree that it is too early to blame OM for the incorrect events.

I will add the UT for the changes now.

@ArafatKhan2198
Copy link
Contributor Author

How to Replicate the ClassCastException ➖

To understand the unit test, it's crucial to understand how the ClassCastException occurs and what bug in Recon led to it. When an event from OM comes to Recon, OMDBUpdatesHandler is responsible for packaging this information into an event object class called OMDBUpdateEvent, which has two essential fields: updatedValue and oldValue.

oldValue: This is the previous state of a database entry before an update (PUT) or delete (DELETE) operation. It represents what was stored in the database before the current operation.

newValue: This is the new state of a database entry after an update (UPDATE) or create (PUT) operation. It represents the current state being written to the database.

The updated value is fetched from the OM side, while the oldValue is fetched from an existing map inside the OMDBUpdatesHandler called omdbLatestUpdateEvents. The old map, omdbLatestUpdateEvents, was a simple Map<Object, OMDBUpdateEvent> that stored the latest database update events without distinguishing between different tables. This led to conflicts and potential corruption when different tables had the same key structure, causing issues like ClassCastException during event processing in Recon.

For example, consider the FileTable and the DirectoryTable. Both tables have the same key structure in RocksDB:

  • directoryTable: /volumeId/bucketId/parentId/dirName -> DirInfo
  • fileTable: /volumeId/bucketId/parentId/fileName -> KeyInfo

Here, using the same name for a file and a directory will cause an error. If we execute the following commands, we will encounter a ClassCastException because the file name and directory name are the same:

ozone sh key put s3v/fso-bucket/dir6 NOTICE.txt
ozone sh key delete s3v/fso-bucket/dir6
ozone fs -mkdir -p ofs://om/s3v/fso-bucket/dir6

Breakdown of the Error:

  1. First Command: The first command will be recorded as a PUT operation on the FileTable in OMDBUpdateEvent and stored in the omdbLatestUpdateEvents map as:
    • Key: /volumeId/bucketId/parentId/dirName
    • Value: OMDBUpdateEvent with updatedValue as OmKeyInfo for the file named dir6 and oldValue as null.
  2. Second Command: The second command will be recorded as a DELETE operation on the FileTable in OMDBUpdateEvent. It will first check the omdbLatestUpdateEvents for a previous mention of this key (dir6). It finds the previous PUT operation, so the record in the map changes to a DELETE operation with updatedValue as OmKeyInfo (the value to be deleted) and oldValue as null.
  3. Third Command: When creating a directory named dir6, it results in a new OMDBUpdateEvent for the DirectoryTable. The newValue will be an OmDirectoryInfo object. However, when it checks omdbLatestUpdateEvents, it finds the old value associated with the previous DELETE operation, which is OmKeyInfo (the newValue of the delete event). This mismatch (newValue as OmDirectoryInfo and oldValue as OmKeyInfo) leads to a ClassCastException.

To prevent such issues, we implemented a safeguard (HDDS-8310) that checks for value mismatches and ignores such events. However, ignoring these events is not ideal, as it can lead to data inconsistency. For example, Recon would never know about the directory dir6, leading to data inconsistency. To fix this, we need to implement a map that distinguishes between different tables.

cc: @sadanand48

Copy link
Contributor

@sadanand48 sadanand48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for the explanation . I think the root cause for the mismatch should be fixed by this patch.

Copy link
Contributor

@sadanand48 sadanand48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@sadanand48 sadanand48 merged commit 86c4339 into apache:master Jul 22, 2024
39 checks passed
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 26, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Jul 30, 2024
…-delete

* HDDS-10239-container-reconciliation: (184 commits)
  HDDS-10373. Implement framework for capturing Merkle Tree Metrics. (apache#6864)
  HDDS-11188. Initial setup for new UI layout and enable users to switch to new UI (apache#6953)
  HDDS-11120. Rich rebalancing status info (apache#6911)
  HDDS-11187. Fix Event Handling in Recon OMDBUpdatesHandler to Prevent ClassCastException. (apache#6950)
  HDDS-11213. Bump commons-daemon to 1.4.0 (apache#6971)
  HDDS-11212. Bump commons-net to 3.11.1 (apache#6973)
  HDDS-11211. Bump assertj-core to 3.26.3 (apache#6972)
  HDDS-11210. Bump log4j2 to 2.23.1 (apache#6970)
  HDDS-11150. Recon Overview page crashes due to failed API Calls (apache#6944)
  HDDS-11183. Keys from DeletedTable and DeletedDirTable of AOS should be deleted on batch operation while creating a snapshot (apache#6946)
  HDDS-11198. Fix Typescript configs for Recon (apache#6961)
  HDDS-11180. Simplify HttpServer2#inferMimeType return statement (apache#6963)
  HDDS-11194. OM missing audit log for upgrade (apache#6958)
  HDDS-10389. Implement a search feature for users to locate open keys within the Open Keys Insights section. (apache#6231)
  HDDS-10561. Dashboard for delete key metrics (apache#6948)
  HDDS-11192. Increase SPNEGO URL test coverage (apache#6956)
  HDDS-11179. DBConfigFromFile#readFromFile result of toIOException not thrown (apache#6957)
  HDDS-11186. First container log missing from bundle (apache#6952)
  HDDS-10844. Clarify snapshot create error message. (apache#6955)
  HDDS-11166. Switch to Rocky Linux-based ozone-runner (apache#6942)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants