Fix for #2814 - [BUG] Flitering with where returns wrong results by mindreframer · Pull Request #2815 · ArcadeData/arcadedb

mindreframer · 2025-11-21T04:35:10Z

Feat: fixes #2814 ([BUG] Flitering with where returns wrong results )

Bug:

When using non-unique indexes in ArcadeDB, queries failed to find records after certain update operations:

Records were correctly updated in the database (verified by queries without WHERE clause)
Queries using indexed WHERE clauses returned incorrect/incomplete results (0 instead of expected values)
The bug occurred after updating a record's indexed field value multiple times

Test Case Scenario:

// Create 3 children with status='synced'
Child c1, c2, c3 (all status='synced')

// Update c1 and c2 to 'pending'
UPDATE c1, c2 SET status='pending'

// Query for pending - WORKS
SELECT WHERE status='pending' → [c1, c2] ✓

// Update c1 back to 'synced'
UPDATE c1 SET status='synced'

// BUG: Query for pending - FAILED (before fix)
SELECT WHERE status='pending' → [] ✗ (expected [c2])

// BUG: Query for synced - FAILED (before fix)
SELECT WHERE status='synced' → [c1] ✗ (expected [c1, c3])

// AFTER FIX: Both queries work correctly
SELECT WHERE status='pending' → [c2] ✓
SELECT WHERE status='synced' → [c1, c3] ✓

Root Cause

File: engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java
Method: lookupInPageAndAddInResultset() (lines 568-627)

The Problem

The original code tracked deleted KEYS instead of deleted RIDs:

// ORIGINAL CODE (BUGGY):
if (rid.getBucketId() < 0) {
    removedKeys.add(keys);  // ❌ Marks entire KEY as deleted
    continue;
}

if (removedKeys.contains(keys))  // ❌ Skips ALL RIDs with this key
    continue;

Impact: For non-unique indexes where a key maps to multiple RIDs:

When a deletion marker #-4:0 was found (meaning "RID #2:0 was deleted")
The code added the entire KEY (e.g., "pending") to removedKeys
Then ALL subsequent RIDs with that same key were skipped
Example: ("pending", [#2:0, #2:1]) + deletion marker #-4:0 → skipped both #2:0 AND #2:1
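
The example above can be reproduced with a small in-memory model. This is only a sketch: `KeyVsRidFiltering`, `Entry`, `filterByKey` and `filterByRid` are hypothetical names, and a boolean flag stands in for the negative-bucketId marker. It is not the ArcadeDB code, it only contrasts the two filtering strategies on one key's value list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of one key's value list; not ArcadeDB's real classes.
final class KeyVsRidFiltering {

  // One stored value: a RID plus a flag standing in for the deletion marker (assumption).
  static final class Entry {
    final String rid;
    final boolean tombstone;

    Entry(String rid, boolean tombstone) {
      this.rid = rid;
      this.tombstone = tombstone;
    }
  }

  // Key-level filtering: once a tombstone is seen (reading newest to oldest),
  // every older RID stored under the key is hidden.
  static List<String> filterByKey(List<Entry> values) {
    List<String> result = new ArrayList<>();
    boolean keyRemoved = false;
    for (int i = values.size() - 1; i > -1; --i) {
      Entry e = values.get(i);
      if (e.tombstone) {
        keyRemoved = true;
        continue;
      }
      if (keyRemoved)
        continue;
      result.add(e.rid);
    }
    return result;
  }

  // RID-level filtering: a tombstone hides only the RID it names.
  static List<String> filterByRid(List<Entry> values) {
    List<String> result = new ArrayList<>();
    Set<String> deleted = new HashSet<>();
    for (int i = values.size() - 1; i > -1; --i) {
      Entry e = values.get(i);
      if (e.tombstone) {
        deleted.add(e.rid);
        continue;
      }
      if (deleted.contains(e.rid))
        continue;
      result.add(e.rid);
    }
    return result;
  }

  // Values for key "pending": #2:0 and #2:1 were added, then #2:0 was deleted.
  static List<Entry> pendingValues() {
    return Arrays.asList(
        new Entry("#2:0", false),
        new Entry("#2:1", false),
        new Entry("#2:0", true));
  }

  public static void main(String[] args) {
    System.out.println("by key: " + filterByKey(pendingValues())); // by key: []
    System.out.println("by RID: " + filterByRid(pendingValues())); // by RID: [#2:1]
  }
}
```

With key-level filtering the single marker hides both entries; with RID-level filtering only #2:0 is hidden and #2:1 survives.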

Why This Happened

LSM-tree indexes use tombstone deletion:

Deleting #2:0 creates a deletion marker #-4:0 (negative bucketId)
The marker is appended as a NEW entry (doesn't modify existing entries)
During queries, the code reads backwards through pages and filters out deleted RIDs
The filtering logic was incorrect for non-unique indexes
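
To make the append-only behaviour concrete, here is a minimal sketch of a tombstone-based index, assuming an update of an indexed field is recorded as a removal under the old key followed by an insert under the new key. `AppendOnlyIndexSketch` and its methods are hypothetical and the storage is a plain list rather than LSM pages. The read path uses RID-level filtering, and the scenario replays the test case from this description.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical append-only index model; names and layout are illustrative, not ArcadeDB's.
final class AppendOnlyIndexSketch {

  static final class Entry {
    final String key;
    final String rid;
    final boolean tombstone; // stands in for the negative-bucketId marker (assumption)

    Entry(String key, String rid, boolean tombstone) {
      this.key = key;
      this.rid = rid;
      this.tombstone = tombstone;
    }
  }

  private final List<Entry> log = new ArrayList<>();

  void put(String key, String rid) {
    log.add(new Entry(key, rid, false));
  }

  // Deleting never modifies older entries; it only appends a marker.
  void remove(String key, String rid) {
    log.add(new Entry(key, rid, true));
  }

  // Assumption: changing the indexed value is a remove under the old key plus a put under the new one.
  void update(String rid, String oldKey, String newKey) {
    remove(oldKey, rid);
    put(newKey, rid);
  }

  // Reads newest to oldest, so a marker is seen before the entries it cancels.
  List<String> get(String key) {
    List<String> result = new ArrayList<>();
    Set<String> deleted = new HashSet<>();
    for (int i = log.size() - 1; i > -1; --i) {
      Entry e = log.get(i);
      if (!e.key.equals(key))
        continue;
      if (e.tombstone) {
        deleted.add(e.rid);
        continue;
      }
      if (deleted.contains(e.rid))
        continue;
      result.add(e.rid);
    }
    return result;
  }

  // Replays the scenario: three synced children, c1 and c2 go to pending, c1 goes back to synced.
  static AppendOnlyIndexSketch replayScenario() {
    AppendOnlyIndexSketch index = new AppendOnlyIndexSketch();
    index.put("synced", "c1");
    index.put("synced", "c2");
    index.put("synced", "c3");
    index.update("c1", "synced", "pending");
    index.update("c2", "synced", "pending");
    index.update("c1", "pending", "synced");
    return index;
  }

  public static void main(String[] args) {
    AppendOnlyIndexSketch index = replayScenario();
    System.out.println("pending: " + index.get("pending")); // pending: [c2]
    System.out.println("synced:  " + index.get("synced"));  // synced:  [c1, c3]
  }
}
```

In this model the newest "synced" entry for c1 is read before the older marker that cancelled its first entry, which is why c1 is still returned exactly once.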

The Fix

Track deleted RIDs instead of just deleted KEYS:

// NEW CODE (FIXED):
final Set<RID> deletedRIDs = new HashSet<>();

for (int i = allValues.size() - 1; i > -1; --i) {
    final RID rid = allValues.get(i);

    if (rid.getBucketId() < 0) {
        // Convert deletion marker to original RID
        final RID originalRID = getOriginalRID(rid);
        deletedRIDs.add(originalRID);  // ✅ Track the SPECIFIC deleted RID

        // For unique indexes, ALSO mark the entire key as removed
        if (mainIndex.isUnique()) {
            removedKeys.add(keys);
        }
        continue;
    }

    // For unique indexes, check if the entire key has been removed
    if (mainIndex.isUnique() && removedKeys.contains(keys)) {
        continue;
    }

    // For all indexes, check if THIS SPECIFIC RID has been deleted
    if (deletedRIDs.contains(rid)) {  // ✅ Only skip the specific deleted RID
        continue;
    }

    validRIDs.add(rid);
    set.add(new IndexCursorEntry(originalKeys, rid, 1));
}

Key Changes

Track Individual RIDs: Added Set<RID> deletedRIDs to track which specific RIDs have been deleted
Convert Deletion Markers: Use getOriginalRID(rid) to convert deletion marker (e.g., #-4:0) back to original RID (e.g., #2:0)
Differentiate Index Types:
- Unique indexes: Continue to use removedKeys (only one RID per key, so marking the key as deleted is correct)
- Non-unique indexes: Use deletedRIDs to only skip the specific deleted RIDs, not all RIDs with that key
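
The marker conversion can be illustrated with a short round trip. The sketch below is an assumption inferred only from the example in this description (marker #-4:0 for RID #2:0): it maps bucketId b to -(b + 2) and back. `DeletionMarkerSketch` and its methods are hypothetical, and the real `getOriginalRID()` may be implemented differently.

```java
// Hypothetical helper; the offset of 2 is inferred from the "#-4:0 marks #2:0" example (assumption).
final class DeletionMarkerSketch {

  // Encodes a bucketId as a negative marker id. Plain negation would leave bucket 0 at 0,
  // so some offset is needed to keep every marker strictly negative.
  static int toMarkerBucketId(int bucketId) {
    return -(bucketId + 2);
  }

  // Decodes a marker id back to the bucketId it refers to.
  static int toOriginalBucketId(int markerBucketId) {
    return -markerBucketId - 2;
  }

  // Mirrors the check used in the lookup loop: a negative bucketId means "deletion marker".
  static boolean isMarker(int bucketId) {
    return bucketId < 0;
  }

  public static void main(String[] args) {
    int marker = toMarkerBucketId(2);
    System.out.println("#2:0 is marked as #" + marker + ":0");
    System.out.println("#" + marker + ":0 refers to #" + toOriginalBucketId(marker) + ":0");
  }
}
```

Only the bucketId changes in this sketch; the position part of the RID is carried over unchanged, as in the #-4:0 / #2:0 example.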

Test Results

Before Fix

Pending (WHERE): 0 → []  ❌ Expected [c2]
Synced (WHERE): 1 → [ "c1" ]  ❌ Expected [c1, c3]

 0 pass
 1 fail

After Fix

Pending (WHERE): 1 → [ "c2" ]  ✅
Synced (WHERE): 2 → [ "c3", "c1" ]  ✅

 1 pass
 0 fail
 3 expect() calls

Checklist

I have run the build using mvn clean package command
My unit tests cover both failure and success scenarios (as Bun.js tests, which are simpler to maintain and contribute in situations like this)


gemini-code-assist · 2025-11-21T04:35:31Z

Summary of Changes

Hello @mindreframer, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in ArcadeDB's LSM-tree indexing mechanism that caused WHERE clause queries to return incomplete or incorrect results, particularly with non-unique indexes after multiple update operations. The fix refines how deletion markers are processed, ensuring that only specific deleted records are filtered out, thereby restoring the accuracy of indexed queries. The changes are validated by a new, minimal end-to-end test case.

Highlights

Bug Fix for Index Filtering: Addresses a critical issue where queries using non-unique indexes returned incorrect or incomplete results after update operations, specifically when an indexed field was modified multiple times.
Corrected Deletion Tracking Logic: The core fix involves refining how deletion markers are processed in LSM-tree indexes. Instead of marking an entire key as deleted, the system now accurately tracks individual deleted Record IDs (RIDs), preventing unintended filtering of other valid records associated with the same key.
Differentiated Index Type Handling: Implemented distinct logic for unique and non-unique indexes when processing deletion markers. Unique indexes continue to mark the entire key as removed, while non-unique indexes now use a dedicated set to track only specific deleted RIDs.
New E2E Test Suite and Case: Introduced a new end-to-end test environment using Bun.js, complete with helper functions and a specific test case (indexing-bug.test.js) that precisely reproduces the bug and verifies the effectiveness of the implemented fix.


gemini-code-assist

Code Review

This pull request provides a well-reasoned fix for a critical bug in index lookups for non-unique indexes. The core logic change in LSMTreeIndexAbstract.java correctly differentiates between unique and non-unique index deletion handling by tracking deleted RIDs instead of just keys. The addition of a comprehensive e2e test using Bun is excellent, as it clearly reproduces the bug and validates the fix. My review includes a few suggestions to improve the new test code's maintainability and security, such as removing hardcoded credentials and reducing code duplication.

e2e-bun/tests/helpers.js

e2e-bun/tests/indexing-bug.test.js

engine/src/main/java/com/arcadedb/index/lsm/LSMTreeIndexAbstract.java

mergify · 2025-11-21T13:10:15Z

🧪 CI Insights

Here's what we observed from your CI run for 9a15cb1.

🟢 All jobs passed!

But CI Insights is watching 👀

robfrank · 2025-11-21T13:29:40Z

@mindreframer could you please install and run pre-commit and then push again?

in the project dir:

pre-commit install
pre-commit run --all-files

thanks

robfrank · 2025-11-21T18:31:38Z

Hi @mindreframer, first of all thank you very much for the contribution.

I translated the test to Java, so now it's part of the test suite and we can avoid regressions.

Now, the bad news. I'll remove the JS part, even though it's very useful, because we can't maintain an additional module just for a single test. It is better for us to have the test in the Java test suite, "near" the index implementation.

To keep your commit, I'll get rid of the JS module on this PR and then merge, so your name will be part of the commit and you will be added to the list of contributors.

Is that OK with you?

mindreframer · 2025-11-22T09:29:35Z

Yes, it's fine with me.
Thanks and sorry for the late reply

Co-authored-by: Roberto Franchini <ro.franchini@gmail.com> (cherry picked from commit 57ba7b5)

mindreframer added 2 commits November 21, 2025 02:52

Chore: e2e bun testing scripts

f48f121

mindreframer changed the title ~~Fix for https://github.com/ArcadeData/arcadedb/issues/2814 - [BUG] Flitering with where returns wrong results~~ Fix for #2814 - [BUG] Flitering with where returns wrong results Nov 21, 2025

gemini-code-assist bot reviewed Nov 21, 2025

View reviewed changes

mindreframer added 3 commits November 21, 2025 05:42

Chore: some minor cleanup

4b12061

Chore: verbose logging only on-demand for bun tests

245622c

Chore: better naming for the e2e test

3797192

mindreframer mentioned this pull request Nov 21, 2025

Corrupt composite indexes after bulk insert #1531

Closed

mindreframer and others added 4 commits November 21, 2025 15:16

Chore: fixes by pre-commit run --all-files

984419e

Chore: ignore files/folders starting with "@"

fc3e3ae

test: add unit tests for filtering with index in parameterized updates

51a73f3

fix pre-commit

3258362

remove e2e-bun

9a15cb1

robfrank merged commit 57ba7b5 into ArcadeData:main Nov 22, 2025
11 of 13 checks passed

mindreframer deleted the fix/indexing branch November 22, 2025 16:33

lvca assigned mindreframer Jan 22, 2026

lvca requested a review from robfrank January 22, 2026 22:05

lvca added this to the 25.12.1 milestone Jan 22, 2026

robfrank added a commit that referenced this pull request Feb 11, 2026

Fix for #2814 - [BUG] Flitering with where returns wrong results (#2815)

588f710

Co-authored-by: Roberto Franchini <ro.franchini@gmail.com> (cherry picked from commit 57ba7b5)

robfrank merged 10 commits into ArcadeData:main from mindreframer:fix/indexing
