Fix for #2814 - [BUG] Flitering with where returns wrong results#2815
Fix for #2814 - [BUG] Flitering with where returns wrong results#2815robfrank merged 10 commits intoArcadeData:mainfrom
Conversation
… results )
## Bug:
When using non-unique indexes in ArcadeDB, queries failed to find records after certain update operations:
- Records were correctly updated in the database (verified by queries without WHERE clause)
- Queries using indexed WHERE clauses returned incorrect/incomplete results (0 instead of expected values)
- The bug occurred after updating a record's indexed field value multiple times
**Test Case Scenario:**
```javascript
// Create 3 children with status='synced'
Child c1, c2, c3 (all status='synced')
// Update c1 and c2 to 'pending'
UPDATE c1, c2 SET status='pending'
// Query for pending - WORKS
SELECT WHERE status='pending' → [c1, c2] ✓
// Update c1 back to 'synced'
UPDATE c1 SET status='synced'
// BUG: Query for pending - FAILED (before fix)
SELECT WHERE status='pending' → [] ✗ (expected [c2])
// BUG: Query for synced - FAILED (before fix)
SELECT WHERE status='synced' → [c1] ✗ (expected [c1, c3])
// AFTER FIX: Both queries work correctly
SELECT WHERE status='pending' → [c2] ✓
SELECT WHERE status='synced' → [c1, c3] ✓
```
## Root Cause
**File**: `engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java`
**Method**: `lookupInPageAndAddInResultset()` (lines 568-627)
### The Problem
The original code tracked deleted **KEYS** instead of deleted **RIDs**:
```java
// ORIGINAL CODE (BUGGY):
if (rid.getBucketId() < 0) {
removedKeys.add(keys); // ❌ Marks entire KEY as deleted
continue;
}
if (removedKeys.contains(keys)) // ❌ Skips ALL RIDs with this key
continue;
```
**Impact**: For non-unique indexes where a key maps to multiple RIDs:
1. When a deletion marker `#-4:0` was found (meaning "RID ArcadeData#2:0 was deleted")
2. The code added the entire KEY (e.g., "pending") to `removedKeys`
3. Then ALL subsequent RIDs with that same key were skipped
4. Example: `("pending", [ArcadeData#2:0, ArcadeData#2:1])` + deletion marker `#-4:0` → **skipped both ArcadeData#2:0 AND ArcadeData#2:1**
### Why This Happened
LSM-tree indexes use **tombstone deletion**:
- Deleting `ArcadeData#2:0` creates a deletion marker `#-4:0` (negative bucketId)
- The marker is appended as a NEW entry (doesn't modify existing entries)
- During queries, the code reads backwards through pages and filters out deleted RIDs
- The filtering logic was incorrect for non-unique indexes
### The Fix
Track deleted **RIDs** instead of just deleted **KEYS**:
```java
// NEW CODE (FIXED):
final Set<RID> deletedRIDs = new HashSet<>();
for (int i = allValues.size() - 1; i > -1; --i) {
final RID rid = allValues.get(i);
if (rid.getBucketId() < 0) {
// Convert deletion marker to original RID
final RID originalRID = getOriginalRID(rid);
deletedRIDs.add(originalRID); // ✅ Track the SPECIFIC deleted RID
// For unique indexes, ALSO mark the entire key as removed
if (mainIndex.isUnique()) {
removedKeys.add(keys);
}
continue;
}
// For unique indexes, check if the entire key has been removed
if (mainIndex.isUnique() && removedKeys.contains(keys)) {
continue;
}
// For all indexes, check if THIS SPECIFIC RID has been deleted
if (deletedRIDs.contains(rid)) { // ✅ Only skip the specific deleted RID
continue;
}
validRIDs.add(rid);
set.add(new IndexCursorEntry(originalKeys, rid, 1));
}
```
### Key Changes
1. **Track Individual RIDs**: Added `Set<RID> deletedRIDs` to track which specific RIDs have been deleted
2. **Convert Deletion Markers**: Use `getOriginalRID(rid)` to convert deletion marker (e.g., `#-4:0`) back to original RID (e.g., `ArcadeData#2:0`)
3. **Differentiate Index Types**:
- **Unique indexes**: Continue to use `removedKeys` (only one RID per key, so marking the key as deleted is correct)
- **Non-unique indexes**: Use `deletedRIDs` to only skip the specific deleted RIDs, not all RIDs with that key
## Test Results
### Before Fix
```
Pending (WHERE): 0 → [] ❌ Expected [c2]
Synced (WHERE): 1 → [ "c1" ] ❌ Expected [c1, c3]
0 pass
1 fail
```
### After Fix
```
Pending (WHERE): 1 → [ "c2" ] ✅
Synced (WHERE): 2 → [ "c3", "c1" ] ✅
1 pass
0 fail
3 expect() calls
```
Summary of ChangesHello @mindreframer, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical bug in ArcadeDB's LSM-tree indexing mechanism that caused Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request provides a well-reasoned fix for a critical bug in index lookups for non-unique indexes. The core logic change in LSMTreeIndexAbstract.java correctly differentiates between unique and non-unique index deletion handling by tracking deleted RIDs instead of just keys. The addition of a comprehensive e2e test using Bun is excellent, as it clearly reproduces the bug and validates the fix. My review includes a few suggestions to improve the new test code's maintainability and security, such as removing hardcoded credentials and reducing code duplication.
engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java
Outdated
Show resolved
Hide resolved
🧪 CI InsightsHere's what we observed from your CI run for 9a15cb1. 🟢 All jobs passed!But CI Insights is watching 👀 |
|
@mindreframer could you please install and run pre-commit and then push again? in the project dir: thanks |
|
Hi @mindreframer, first of all thank you very much for the contribution . I translated the test to Java, so now it's part of the test suite and we can avoid regressions. Now, the bad news. I'll remove the JS part, even if it's very useful, but we can't maintain an additional module just for a single test. It is better for us to have the test in the Java test suite, "near" to the index implementation. To keep your commit, I'll get rid of the js module on this pr and then I'll merge, so your name will be part of the commit and you will be added to the list of contributor. Is it ok for you? |
|
Yes, it's fine with me. |
Feat: fixes #2814 ([BUG] Flitering with where returns wrong results )
Bug:
When using non-unique indexes in ArcadeDB, queries failed to find records after certain update operations:
Test Case Scenario:
Root Cause
File:
engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.javaMethod:
lookupInPageAndAddInResultset()(lines 568-627)The Problem
The original code tracked deleted KEYS instead of deleted RIDs:
Impact: For non-unique indexes where a key maps to multiple RIDs:
#-4:0was found (meaning "RID Bump gremlin.version from 3.4.10 to 3.5.1 #2:0 was deleted")removedKeys("pending", [#2:0, #2:1])+ deletion marker#-4:0→ skipped both Bump gremlin.version from 3.4.10 to 3.5.1 #2:0 AND Bump gremlin.version from 3.4.10 to 3.5.1 #2:1Why This Happened
LSM-tree indexes use tombstone deletion:
#2:0creates a deletion marker#-4:0(negative bucketId)The Fix
Track deleted RIDs instead of just deleted KEYS:
Key Changes
Set<RID> deletedRIDsto track which specific RIDs have been deletedgetOriginalRID(rid)to convert deletion marker (e.g.,#-4:0) back to original RID (e.g.,#2:0)removedKeys(only one RID per key, so marking the key as deleted is correct)deletedRIDsto only skip the specific deleted RIDs, not all RIDs with that keyTest Results
Before Fix
After Fix
Checklist
mvn clean packagecommand