Skip to content

Fix for #2814 - [BUG] Flitering with where returns wrong results#2815

Merged
robfrank merged 10 commits intoArcadeData:mainfrom
mindreframer:fix/indexing
Nov 22, 2025
Merged

Fix for #2814 - [BUG] Flitering with where returns wrong results#2815
robfrank merged 10 commits intoArcadeData:mainfrom
mindreframer:fix/indexing

Conversation

@mindreframer
Copy link
Contributor

Feat: fixes #2814 ([BUG] Flitering with where returns wrong results )

Bug:

When using non-unique indexes in ArcadeDB, queries failed to find records after certain update operations:

  • Records were correctly updated in the database (verified by queries without WHERE clause)
  • Queries using indexed WHERE clauses returned incorrect/incomplete results (0 instead of expected values)
  • The bug occurred after updating a record's indexed field value multiple times

Test Case Scenario:

// Create 3 children with status='synced'
Child c1, c2, c3 (all status='synced')

// Update c1 and c2 to 'pending'
UPDATE c1, c2 SET status='pending'

// Query for pending - WORKS
SELECT WHERE status='pending'  [c1, c2] 

// Update c1 back to 'synced'
UPDATE c1 SET status='synced'

// BUG: Query for pending - FAILED (before fix)
SELECT WHERE status='pending'  []  (expected [c2])

// BUG: Query for synced - FAILED (before fix)
SELECT WHERE status='synced'  [c1]  (expected [c1, c3])

// AFTER FIX: Both queries work correctly
SELECT WHERE status='pending'  [c2] 
SELECT WHERE status='synced'  [c1, c3] 

Root Cause

File: engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java
Method: lookupInPageAndAddInResultset() (lines 568-627)

The Problem

The original code tracked deleted KEYS instead of deleted RIDs:

// ORIGINAL CODE (BUGGY):
if (rid.getBucketId() < 0) {
    removedKeys.add(keys);  // ❌ Marks entire KEY as deleted
    continue;
}

if (removedKeys.contains(keys))  // ❌ Skips ALL RIDs with this key
    continue;

Impact: For non-unique indexes where a key maps to multiple RIDs:

  1. When a deletion marker #-4:0 was found (meaning "RID Bump gremlin.version from 3.4.10 to 3.5.1 #2:0 was deleted")
  2. The code added the entire KEY (e.g., "pending") to removedKeys
  3. Then ALL subsequent RIDs with that same key were skipped
  4. Example: ("pending", [#2:0, #2:1]) + deletion marker #-4:0skipped both Bump gremlin.version from 3.4.10 to 3.5.1 #2:0 AND Bump gremlin.version from 3.4.10 to 3.5.1 #2:1

Why This Happened

LSM-tree indexes use tombstone deletion:

  • Deleting #2:0 creates a deletion marker #-4:0 (negative bucketId)
  • The marker is appended as a NEW entry (doesn't modify existing entries)
  • During queries, the code reads backwards through pages and filters out deleted RIDs
  • The filtering logic was incorrect for non-unique indexes

The Fix

Track deleted RIDs instead of just deleted KEYS:

// NEW CODE (FIXED):
final Set<RID> deletedRIDs = new HashSet<>();

for (int i = allValues.size() - 1; i > -1; --i) {
    final RID rid = allValues.get(i);

    if (rid.getBucketId() < 0) {
        // Convert deletion marker to original RID
        final RID originalRID = getOriginalRID(rid);
        deletedRIDs.add(originalRID);  // ✅ Track the SPECIFIC deleted RID

        // For unique indexes, ALSO mark the entire key as removed
        if (mainIndex.isUnique()) {
            removedKeys.add(keys);
        }
        continue;
    }

    // For unique indexes, check if the entire key has been removed
    if (mainIndex.isUnique() && removedKeys.contains(keys)) {
        continue;
    }

    // For all indexes, check if THIS SPECIFIC RID has been deleted
    if (deletedRIDs.contains(rid)) {  // ✅ Only skip the specific deleted RID
        continue;
    }

    validRIDs.add(rid);
    set.add(new IndexCursorEntry(originalKeys, rid, 1));
}

Key Changes

  1. Track Individual RIDs: Added Set<RID> deletedRIDs to track which specific RIDs have been deleted
  2. Convert Deletion Markers: Use getOriginalRID(rid) to convert deletion marker (e.g., #-4:0) back to original RID (e.g., #2:0)
  3. Differentiate Index Types:
    • Unique indexes: Continue to use removedKeys (only one RID per key, so marking the key as deleted is correct)
    • Non-unique indexes: Use deletedRIDs to only skip the specific deleted RIDs, not all RIDs with that key

Test Results

Before Fix

Pending (WHERE): 0 → []  ❌ Expected [c2]
Synced (WHERE): 1 → [ "c1" ]  ❌ Expected [c1, c3]

 0 pass
 1 fail

After Fix

Pending (WHERE): 1 → [ "c2" ]  ✅
Synced (WHERE): 2 → [ "c3", "c1" ]  ✅

 1 pass
 0 fail
 3 expect() calls

Checklist

  • I have run the build using mvn clean package command
  • My unit tests cover both failure and success scenarios (AS Bun.js tests, which are simpler to maintain / contribute for such situations)

… results )

## Bug: 

When using non-unique indexes in ArcadeDB, queries failed to find records after certain update operations:
- Records were correctly updated in the database (verified by queries without WHERE clause)
- Queries using indexed WHERE clauses returned incorrect/incomplete results (0 instead of expected values)
- The bug occurred after updating a record's indexed field value multiple times

**Test Case Scenario:**
```javascript
// Create 3 children with status='synced'
Child c1, c2, c3 (all status='synced')

// Update c1 and c2 to 'pending'
UPDATE c1, c2 SET status='pending'

// Query for pending - WORKS
SELECT WHERE status='pending' → [c1, c2] ✓

// Update c1 back to 'synced'
UPDATE c1 SET status='synced'

// BUG: Query for pending - FAILED (before fix)
SELECT WHERE status='pending' → [] ✗ (expected [c2])

// BUG: Query for synced - FAILED (before fix)
SELECT WHERE status='synced' → [c1] ✗ (expected [c1, c3])

// AFTER FIX: Both queries work correctly
SELECT WHERE status='pending' → [c2] ✓
SELECT WHERE status='synced' → [c1, c3] ✓
```

## Root Cause

**File**: `engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java`
**Method**: `lookupInPageAndAddInResultset()` (lines 568-627)

### The Problem

The original code tracked deleted **KEYS** instead of deleted **RIDs**:

```java
// ORIGINAL CODE (BUGGY):
if (rid.getBucketId() < 0) {
    removedKeys.add(keys);  // ❌ Marks entire KEY as deleted
    continue;
}

if (removedKeys.contains(keys))  // ❌ Skips ALL RIDs with this key
    continue;
```

**Impact**: For non-unique indexes where a key maps to multiple RIDs:
1. When a deletion marker `#-4:0` was found (meaning "RID ArcadeData#2:0 was deleted")
2. The code added the entire KEY (e.g., "pending") to `removedKeys`
3. Then ALL subsequent RIDs with that same key were skipped
4. Example: `("pending", [ArcadeData#2:0, ArcadeData#2:1])` + deletion marker `#-4:0` → **skipped both ArcadeData#2:0 AND ArcadeData#2:1**

### Why This Happened

LSM-tree indexes use **tombstone deletion**:
- Deleting `ArcadeData#2:0` creates a deletion marker `#-4:0` (negative bucketId)
- The marker is appended as a NEW entry (doesn't modify existing entries)
- During queries, the code reads backwards through pages and filters out deleted RIDs
- The filtering logic was incorrect for non-unique indexes

### The Fix

Track deleted **RIDs** instead of just deleted **KEYS**:

```java
// NEW CODE (FIXED):
final Set<RID> deletedRIDs = new HashSet<>();

for (int i = allValues.size() - 1; i > -1; --i) {
    final RID rid = allValues.get(i);

    if (rid.getBucketId() < 0) {
        // Convert deletion marker to original RID
        final RID originalRID = getOriginalRID(rid);
        deletedRIDs.add(originalRID);  // ✅ Track the SPECIFIC deleted RID

        // For unique indexes, ALSO mark the entire key as removed
        if (mainIndex.isUnique()) {
            removedKeys.add(keys);
        }
        continue;
    }

    // For unique indexes, check if the entire key has been removed
    if (mainIndex.isUnique() && removedKeys.contains(keys)) {
        continue;
    }

    // For all indexes, check if THIS SPECIFIC RID has been deleted
    if (deletedRIDs.contains(rid)) {  // ✅ Only skip the specific deleted RID
        continue;
    }

    validRIDs.add(rid);
    set.add(new IndexCursorEntry(originalKeys, rid, 1));
}
```

### Key Changes

1. **Track Individual RIDs**: Added `Set<RID> deletedRIDs` to track which specific RIDs have been deleted
2. **Convert Deletion Markers**: Use `getOriginalRID(rid)` to convert deletion marker (e.g., `#-4:0`) back to original RID (e.g., `ArcadeData#2:0`)
3. **Differentiate Index Types**:
   - **Unique indexes**: Continue to use `removedKeys` (only one RID per key, so marking the key as deleted is correct)
   - **Non-unique indexes**: Use `deletedRIDs` to only skip the specific deleted RIDs, not all RIDs with that key

## Test Results

### Before Fix
```
Pending (WHERE): 0 → []  ❌ Expected [c2]
Synced (WHERE): 1 → [ "c1" ]  ❌ Expected [c1, c3]

 0 pass
 1 fail
```

### After Fix
```
Pending (WHERE): 1 → [ "c2" ]  ✅
Synced (WHERE): 2 → [ "c3", "c1" ]  ✅

 1 pass
 0 fail
 3 expect() calls
```
@mindreframer mindreframer changed the title Fix for https://github.com/ArcadeData/arcadedb/issues/2814 - [BUG] Flitering with where returns wrong results Fix for #2814 - [BUG] Flitering with where returns wrong results Nov 21, 2025
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @mindreframer, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in ArcadeDB's LSM-tree indexing mechanism that caused WHERE clause queries to return incomplete or incorrect results, particularly with non-unique indexes after multiple update operations. The fix refines how deletion markers are processed, ensuring that only specific deleted records are filtered out, thereby restoring the accuracy of indexed queries. The changes are validated by a new, minimal end-to-end test case.

Highlights

  • Bug Fix for Index Filtering: Addresses a critical issue where queries using non-unique indexes returned incorrect or incomplete results after update operations, specifically when an indexed field was modified multiple times.
  • Corrected Deletion Tracking Logic: The core fix involves refining how deletion markers are processed in LSM-tree indexes. Instead of marking an entire key as deleted, the system now accurately tracks individual deleted Record IDs (RIDs), preventing unintended filtering of other valid records associated with the same key.
  • Differentiated Index Type Handling: Implemented distinct logic for unique and non-unique indexes when processing deletion markers. Unique indexes continue to mark the entire key as removed, while non-unique indexes now use a dedicated set to track only specific deleted RIDs.
  • New E2E Test Suite and Case: Introduced a new end-to-end test environment using Bun.js, complete with helper functions and a specific test case (indexing-bug.test.js) that precisely reproduces the bug and verifies the effectiveness of the implemented fix.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a well-reasoned fix for a critical bug in index lookups for non-unique indexes. The core logic change in LSMTreeIndexAbstract.java correctly differentiates between unique and non-unique index deletion handling by tracking deleted RIDs instead of just keys. The addition of a comprehensive e2e test using Bun is excellent, as it clearly reproduces the bug and validates the fix. My review includes a few suggestions to improve the new test code's maintainability and security, such as removing hardcoded credentials and reducing code duplication.

@mergify
Copy link
Contributor

mergify bot commented Nov 21, 2025

🧪 CI Insights

Here's what we observed from your CI run for 9a15cb1.

🟢 All jobs passed!

But CI Insights is watching 👀

@robfrank
Copy link
Collaborator

@mindreframer could you please install and run pre-commit and then push again?

in the project dir:

pre-commit install
pre-commit run --all-files

thanks

@robfrank
Copy link
Collaborator

Hi @mindreframer, first of all thank you very much for the contribution .

I translated the test to Java, so now it's part of the test suite and we can avoid regressions.

Now, the bad news. I'll remove the JS part, even if it's very useful, but we can't maintain an additional module just for a single test. It is better for us to have the test in the Java test suite, "near" to the index implementation.

To keep your commit, I'll get rid of the js module on this pr and then I'll merge, so your name will be part of the commit and you will be added to the list of contributor.

Is it ok for you?

@mindreframer
Copy link
Contributor Author

Yes, it's fine with me.
Thanks and sorry for the late reply

@robfrank robfrank merged commit 57ba7b5 into ArcadeData:main Nov 22, 2025
11 of 13 checks passed
@mindreframer mindreframer deleted the fix/indexing branch November 22, 2025 16:33
@lvca lvca requested a review from robfrank January 22, 2026 22:05
@lvca lvca added this to the 25.12.1 milestone Jan 22, 2026
robfrank added a commit that referenced this pull request Feb 11, 2026
Co-authored-by: Roberto Franchini <ro.franchini@gmail.com>
(cherry picked from commit 57ba7b5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Filtering with where returns wrong results

3 participants