
Conversation


@karthikps97 karthikps97 commented Nov 27, 2025

Describe the Problem

The Mongo-to-Postgres query converter uses the IN operator for arrays. In PostgreSQL, using the ANY operator improves performance: the query text stays small regardless of array size, and the query plan can be cached. The sort is also applied to the '_id' JSON field of the data column, even though a dedicated _id column (the primary key) already exists.

Explain the Changes

  1. Refactored the find_chunks_by_dedup_key method to generate its own SQL queries. Improvements:
  • Using ANY instead of IN reduces query planning time for larger array sizes.
  • Sorting on the _id column instead of the data._id JSON field reduces query execution time by utilizing the primary-key index.
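
The IN-vs-ANY point can be sketched as follows (a minimal illustration assuming node-postgres-style $n placeholders; the table name here is a placeholder, not the actual schema):

```javascript
// With IN, the SQL text grows with the array (one placeholder per key),
// so every array size produces a different statement and defeats plan caching.
// With ANY, a single array parameter is bound and the statement text is stable.
function buildInQuery(keys) {
    const placeholders = keys.map((_, i) => `$${i + 1}`).join(', ');
    return {
        text: `SELECT _id, data FROM datachunks WHERE data->>'dedup_key' IN (${placeholders})`,
        values: keys,
    };
}

function buildAnyQuery(keys) {
    return {
        text: `SELECT _id, data FROM datachunks WHERE data->>'dedup_key' = ANY($1)`,
        values: [keys], // one bound parameter: the whole array
    };
}

const keys = ['a1==', 'b2==', 'c3=='];
console.log(buildInQuery(keys).text);   // statement text depends on keys.length
console.log(buildAnyQuery(keys).text);  // statement text is constant
```

Note how buildAnyQuery always produces the same statement text, which is what allows PostgreSQL to reuse a cached plan across calls.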

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • Improvements

    • Improved chunk deduplication retrieval with broader database backend support and more resilient error handling.
    • Collection objects now expose schema metadata for easier inspection.
  • Tests

    • Added integration tests covering deduplicated chunk lookup and empty-key behavior on supported databases.



coderabbitai bot commented Nov 27, 2025

Walkthrough

Switched chunk deduplication lookup from MongoDB to PostgreSQL in MDStore, changed MapServer to pass base64 dedup keys (strings) instead of Buffers, added a schema field to the DBCollection TypeScript interface, and added PostgreSQL-gated integration tests for dedup lookup.

Changes

Cohort / File(s): Summary

  • Type definitions (src/sdk/nb.d.ts): Added a public schema: any property to the DBCollection interface.
  • Deduplication key handling (src/server/object_services/map_server.js): Modified GetMapping.find_dups to collect chunk.digest_b64 values (base64 strings) into dedup_keys instead of creating Buffer instances.
  • Database implementation migration (src/server/object_services/md_store.js): Replaced the MongoDB-based find_chunks_by_dedup_key with a PostgreSQL implementation: adds a decode_json import, builds and executes SQL with JSONB filtering, decodes rows to chunks, calls load_blocks_for_chunks, and returns the results; logs and returns an empty array on SQL errors.
  • PostgreSQL integration tests (src/test/integration_tests/db/test_md_store.js): Added two PostgreSQL-gated tests: one verifying that find_chunks_by_dedup_key returns an inserted chunk given its dedup key, and one verifying an empty result for an empty dedup_key array.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Map as MapServer (GetMapping)
    participant MD as MDStore
    participant PG as PostgreSQL
    participant Blocks as Block Loader
    Map->>MD: find_chunks_by_dedup_key(bucket, [dedup_key_b64,...])
    MD->>PG: execute SQL query (system, bucket, JSONB dedup_key filter)
    PG-->>MD: rows (data JSONB)
    MD->>MD: decode_json(row.data) -> chunk objects
    MD->>Blocks: load_blocks_for_chunks(chunks)
    Blocks-->>MD: chunks with blocks
    MD-->>Map: return chunks array
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review focus:
    • src/server/object_services/md_store.js: SQL construction, JSONB condition correctness, SQL injection risk, decoding logic, and parity with previous MongoDB behavior.
    • src/server/object_services/map_server.js: Ensure callers accept base64 string dedup keys (type compatibility).
    • src/test/integration_tests/db/test_md_store.js: Test assumptions for PostgreSQL-only gating and correctness of inserted test fixtures.


Suggested reviewers

  • dannyzaken
  • jackyalbo

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: refactoring find_chunks_by_dedup_key to use optimized SQL queries for better performance, which is the central objective of this PR.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; skipping the docstring coverage check.


@karthikps97 karthikps97 marked this pull request as ready for review December 2, 2025 06:37

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/server/object_services/md_store.js (1)

1540-1543: Update JSDoc type to match actual parameter type.

The JSDoc indicates @param {nb.DBBuffer[]} dedup_keys but the implementation now expects base64 strings (as passed from map_server.js). Update the type to reflect the actual usage:

     /**
      * @param {nb.Bucket} bucket
-     * @param {nb.DBBuffer[]} dedup_keys
+     * @param {string[]} dedup_keys - base64 encoded dedup keys
      * @returns {Promise<nb.ChunkSchemaDB[]>}
      */
🧹 Nitpick comments (2)
src/server/object_services/map_server.js (1)

88-93: Consider using filter + map for a more functional approach.

The implementation is correct and aligns with the PostgreSQL path that expects base64 strings. The optional chaining (chunk?.digest_b64) provides null safety.

A more concise alternative using functional patterns:

-            const dedup_keys = [];
-            chunks.forEach(chunk => {
-                if (chunk?.digest_b64) {
-                    dedup_keys.push(chunk.digest_b64);
-                }
-            });
+            const dedup_keys = chunks
+                .map(chunk => chunk?.digest_b64)
+                .filter(Boolean);

This is optional and the current implementation works correctly.

src/test/integration_tests/db/test_md_store.js (1)

402-418: Good edge case coverage for empty dedup_key array.

This test ensures that passing an empty array returns an empty result without errors, which validates the FALSE AND data ? 'dedup_key' branch in the SQL query.

Consider adding a test case for chunks that don't have a dedup_key field to ensure they're properly excluded from results.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1382037 and 20920f6.

📒 Files selected for processing (4)
  • src/sdk/nb.d.ts (1 hunks)
  • src/server/object_services/map_server.js (1 hunks)
  • src/server/object_services/md_store.js (2 hunks)
  • src/test/integration_tests/db/test_md_store.js (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/test/**/*.*

⚙️ CodeRabbit configuration file

src/test/**/*.*: Ensure that the PR includes tests for the changes.

Files:

  • src/test/integration_tests/db/test_md_store.js
🧠 Learnings (2)
📓 Common learnings
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:9-17
Timestamp: 2025-08-08T13:12:46.728Z
Learning: In upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js for noobaa-core, rely on structural detection (e.g., pool.mongo_info, and resource_type === 'INTERNAL') with name-prefix fallback for removing legacy mongo/internal pools, instead of depending solely on config.INTERNAL_STORAGE_POOL_NAME or config.DEFAULT_POOL_NAME. Handle multi-system stores and remove all matching pools in one change.
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:6-22
Timestamp: 2025-08-11T06:12:12.318Z
Learning: In the noobaa-core upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js, bucket migration from the internal mongo pool to a new default pool is planned to be handled in separate future PRs with comprehensive testing, rather than being included directly in the pool removal script.
📚 Learning: 2025-08-08T13:12:46.728Z
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:9-17
Timestamp: 2025-08-08T13:12:46.728Z
Learning: In upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js for noobaa-core, rely on structural detection (e.g., pool.mongo_info, and resource_type === 'INTERNAL') with name-prefix fallback for removing legacy mongo/internal pools, instead of depending solely on config.INTERNAL_STORAGE_POOL_NAME or config.DEFAULT_POOL_NAME. Handle multi-system stores and remove all matching pools in one change.

Applied to files:

  • src/server/object_services/md_store.js
🧬 Code graph analysis (3)
src/server/object_services/md_store.js (1)
src/util/postgres_client.js (6)
  • require (13-13)
  • require (14-14)
  • require (26-26)
  • query (326-326)
  • query (331-331)
  • res (258-258)
src/test/integration_tests/db/test_md_store.js (1)
src/server/object_services/md_store.js (2)
  • config (29-29)
  • assert (8-8)
src/server/object_services/map_server.js (1)
src/test/integration_tests/db/test_md_store.js (2)
  • chunk (386-394)
  • chunk (405-413)
🔇 Additional comments (3)
src/sdk/nb.d.ts (1)

771-771: LGTM!

The addition of the schema property to the DBCollection interface is necessary to support the decode_json usage in find_chunks_by_dedup_key. The any type is consistent with the existing patterns in this interface.

src/test/integration_tests/db/test_md_store.js (1)

383-400: Test coverage looks good for the PostgreSQL path.

The test correctly validates that:

  1. The result is an array
  2. At least one chunk is returned
  3. The frag ID matches the inserted chunk

The bucket mock structure { _id, system: { _id } } aligns with the expected parameters in find_chunks_by_dedup_key.

src/server/object_services/md_store.js (1)

1545-1567: SQL query implementation looks correct with parameterized queries.

The implementation:

  • Uses ANY($3) for array membership, which is more efficient than IN for PostgreSQL
  • Uses parameterized queries preventing SQL injection
  • Properly handles the empty dedup_keys array case with FALSE AND data ? 'dedup_key'
  • Sorts by _id DESC to utilize the primary key index

Two observations:

  1. Silent error handling: Returning an empty array on error could mask legitimate issues. Consider logging at a higher severity or re-throwing certain errors:
} catch (err) {
    dbg.error('Error while finding chunks by dedup_key. error is ', err);
    // Consider: throw err; or at least return based on error type
    return [];
}
  2. Null handling: Line 1556 checks data->'deleted' IS NULL OR data->'deleted' = 'null'::jsonb. This handles both missing keys and JSON null values, which is appropriate for JSONB columns.
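
The two-sided null check can be mirrored in plain JavaScript (a sketch of the semantics only, not an actual codebase helper):

```javascript
// Mirrors the JSONB predicate:
//   data->'deleted' IS NULL           -> the 'deleted' key is absent
//   data->'deleted' = 'null'::jsonb   -> the key exists but holds JSON null
// A chunk counts as "not deleted" in either case.
function isNotDeleted(data) {
    return !('deleted' in data) || data.deleted === null;
}

console.log(isNotDeleted({}));                        // → true  (key absent)
console.log(isNotDeleted({ deleted: null }));         // → true  (JSON null)
console.log(isNotDeleted({ deleted: '2025-01-01' })); // → false (tombstoned)
```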

Signed-off-by: Karthik P S <karthikperla2000@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/server/object_services/md_store.js (1)

1539-1543: Update JSDoc to reflect the changed parameter type.

The JSDoc declares dedup_keys as nb.DBBuffer[], but the PR changes this to string[] (base64-encoded). The implementation at line 1551 passes dedup_keys directly to a SQL query expecting string values for the ANY operator.

Update the JSDoc to match the actual signature:

 /**
  * @param {nb.Bucket} bucket
- * @param {nb.DBBuffer[]} dedup_keys
+ * @param {string[]} dedup_keys - Base64-encoded deduplication keys
  * @returns {Promise<nb.ChunkSchemaDB[]>}
  */
♻️ Duplicate comments (1)
src/server/object_services/md_store.js (1)

16-16: Import path convention.

The import path includes the .js extension. Per Node.js conventions, the extension can be omitted:

-const { decode_json } = require('../../util/postgres_client.js');
+const { decode_json } = require('../../util/postgres_client');
🧹 Nitpick comments (1)
src/sdk/nb.d.ts (1)

773-773: Schema field addition looks good.

The new schema: any field enables PostgreSQL JSON decoding in md_store.js. While any provides no type safety, it's a pragmatic choice for now.

For future improvement, consider typing this field more specifically, perhaps as a generic parameter based on the collection's schema type.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20920f6 and 3eaf530.

📒 Files selected for processing (4)
  • src/sdk/nb.d.ts (1 hunks)
  • src/server/object_services/map_server.js (1 hunks)
  • src/server/object_services/md_store.js (2 hunks)
  • src/test/integration_tests/db/test_md_store.js (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/test/integration_tests/db/test_md_store.js
  • src/server/object_services/map_server.js
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:9-17
Timestamp: 2025-08-08T13:12:46.728Z
Learning: In upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js for noobaa-core, rely on structural detection (e.g., pool.mongo_info, and resource_type === 'INTERNAL') with name-prefix fallback for removing legacy mongo/internal pools, instead of depending solely on config.INTERNAL_STORAGE_POOL_NAME or config.DEFAULT_POOL_NAME. Handle multi-system stores and remove all matching pools in one change.
📚 Learning: 2025-08-08T13:12:46.728Z
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:9-17
Timestamp: 2025-08-08T13:12:46.728Z
Learning: In upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js for noobaa-core, rely on structural detection (e.g., pool.mongo_info, and resource_type === 'INTERNAL') with name-prefix fallback for removing legacy mongo/internal pools, instead of depending solely on config.INTERNAL_STORAGE_POOL_NAME or config.DEFAULT_POOL_NAME. Handle multi-system stores and remove all matching pools in one change.

Applied to files:

  • src/server/object_services/md_store.js
🧬 Code graph analysis (1)
src/server/object_services/md_store.js (1)
src/util/postgres_client.js (3)
  • query (326-326)
  • query (331-331)
  • res (258-258)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
🔇 Additional comments (2)
src/server/object_services/md_store.js (2)

1545-1556: Query construction is well-implemented.

The SQL query correctly:

  • Uses parameterized queries ($1, $2, $3) to prevent SQL injection
  • Employs the ANY operator for array comparison (per PR objectives)
  • Sorts by the _id primary key column to leverage indexing (per PR objectives)
  • Follows patterns consistent with other queries in the codebase

These changes align with the stated performance goals of reducing query size and improving execution time.


1549-1554: Verify the dedup_keys parameter is properly formatted as an array for the ANY() operator.

The code uses PostgreSQL's ANY($3) operator, which requires the bound parameter to be a proper array type. When using node-postgres, the parameter must be passed as an array object. Ensure that dedup_keys is consistently an array before being pushed to values at line 1551—this includes edge cases where dedup_keys might be an empty array (which the code already handles) or single-item arrays. The JSONB operators (data ? 'dedup_key' and data ->>) are standard PostgreSQL syntax and should work correctly.
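
A defensive normalization along these lines (the helper name is hypothetical) would make that guarantee explicit before the value is pushed into the bound parameters:

```javascript
// Ensures the value bound to ANY($n) is always a proper JS array,
// since node-postgres serializes JS arrays to PostgreSQL arrays.
// Covers undefined/null, a stray scalar, and the empty-array edge case.
function toKeyArray(dedup_keys) {
    if (dedup_keys === undefined || dedup_keys === null) return [];
    return Array.isArray(dedup_keys) ? dedup_keys : [dedup_keys];
}

console.log(JSON.stringify(toKeyArray(undefined)));    // → []
console.log(JSON.stringify(toKeyArray('abc==')));      // → ["abc=="]
console.log(JSON.stringify(toKeyArray(['a', 'b'])));   // → ["a","b"]
```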

Comment on lines +1558 to +1566
try {
const res = await this._chunks.executeSQL(query, values);
const chunks = res?.rows.map(row => decode_json(this._chunks.schema, row.data));
await this.load_blocks_for_chunks(chunks);
return chunks;
} catch (err) {
dbg.error('Error while finding chunks by dedup_key. error is ', err);
return [];
}

⚠️ Potential issue | 🟡 Minor

Improve optional chaining for safety.

Line 1560 uses incomplete optional chaining that could still throw if res is undefined:

const chunks = res?.rows.map(row => decode_json(this._chunks.schema, row.data));

If res is undefined, res?.rows returns undefined, and calling .map() on undefined throws an error. Fix with:

-const chunks = res?.rows.map(row => decode_json(this._chunks.schema, row.data));
+const chunks = res?.rows?.map(row => decode_json(this._chunks.schema, row.data)) || [];

This ensures chunks is always an array, even if res or res.rows is undefined.

🤖 Prompt for AI Agents
In src/server/object_services/md_store.js around lines 1558 to 1566, the mapping
uses incomplete optional chaining which can throw if res or res.rows is
undefined; replace the mapping with a safe expression that always produces an
array (e.g. const chunks = (res?.rows?.map(row =>
decode_json(this._chunks.schema, row.data))) ?? [];), then call await
this.load_blocks_for_chunks(chunks); and return chunks so chunks is guaranteed
to be an array even when the query returns no result.
