chore: enhance import script, add qdrant's docker compose to release artifacts #28
Conversation
Walkthrough

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ImportCLI as Import CLI
    participant Qdrant
    User->>ImportCLI: start import
    ImportCLI->>Qdrant: create collections (HNSW M=0, vectors OnDisk)
    ImportCLI->>Qdrant: bulk upload vectors
    Note over ImportCLI,Qdrant: Import completes with reduced-RAM indexing
    rect rgb(220,240,220)
        ImportCLI->>Qdrant: enableProductionIndexing(collection)
        Qdrant->>Qdrant: set HNSW M=48, HNSW OnDisk=false
        Qdrant->>Qdrant: set quantization AlwaysRam=true
    end
    Qdrant->>Qdrant: optimizer builds HNSW in background
    Qdrant->>ImportCLI: confirmation / warnings
    ImportCLI->>User: import complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Poem
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
Actionable comments posted: 2
🧹 Nitpick comments (1)
cmd/import/main.go (1)
241-253: Consider failing fast if production indexing cannot be enabled.

Currently, failures in `enableProductionIndexing` only log warnings (line 247), allowing the import to complete successfully even if collections remain in the import-optimized state (M=0, vectors on disk). This could severely degrade production query performance. Consider this approach:

```diff
 // Re-enable production HNSW indexing for all collections
 log.Println("\n=== Enabling Production HNSW Indexing ===")
 log.Println("Re-enabling HNSW indexing (M=48) for production queries...")
+var productionIndexingErrors []string
 for _, collectionName := range collections {
 	if err := enableProductionIndexing(ctx, client, collectionName); err != nil {
-		log.Printf("WARNING: Failed to enable production indexing for %s: %v", collectionName, err)
+		errMsg := fmt.Sprintf("Failed to enable production indexing for %s: %v", collectionName, err)
+		log.Printf("ERROR: %s", errMsg)
+		productionIndexingErrors = append(productionIndexingErrors, errMsg)
 	}
 }
+if len(productionIndexingErrors) > 0 {
+	log.Fatalf("FATAL: Cannot enable production indexing. Collections are not ready for production queries:\n%s",
+		strings.Join(productionIndexingErrors, "\n"))
+}
 log.Println("\n✓ Production indexing enabled for all collections.")
 log.Println("The Qdrant optimizer will build HNSW indexes in the background.")
 log.Println("Monitor collection stats to track indexing progress.")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `.github/workflows/release.yml` (1 hunks)
- `cmd/import/main.go` (3 hunks)
- `docker-compose.qdrant.yml` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
`VectorDim` (35-35)
🔇 Additional comments (2)
.github/workflows/release.yml (1)
48-48: LGTM! Docker Compose configuration added to release artifacts.

The addition correctly includes the Qdrant Docker Compose configuration in release artifacts, making it easy for users to deploy the optimized Qdrant setup alongside the import tool.
docker-compose.qdrant.yml (1)
10-15: Good alignment with import optimization strategy.The environment variables are well-chosen for bulk import scenarios:
- `MAX_SEGMENT_SIZE=500000` matches the collection configuration in `cmd/import/main.go` (line 304)
- `MAX_OPTIMIZATION_THREADS=4` provides reasonable resource limits
- `WAL_CAPACITY_MB=32` enables durability during bulk operations

The configuration complements the import-time optimizations in the Go code.
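For orientation, a compose file carrying these settings might look roughly like the sketch below. Only the three values come from the review; the image tag, ports, volume path, and the exact `QDRANT__…` variable paths are assumptions (Qdrant maps double-underscore-prefixed environment variables onto its config tree, but the precise keys should be checked against the actual `docker-compose.qdrant.yml`):

```yaml
# Hypothetical sketch of docker-compose.qdrant.yml — not the actual file.
services:
  qdrant:
    image: qdrant/qdrant:latest   # assumed tag
    ports:
      - "6333:6333"   # REST
      - "6334:6334"   # gRPC
    environment:
      # Variable paths below are a guess at where the review's settings live:
      - QDRANT__STORAGE__OPTIMIZERS__MAX_SEGMENT_SIZE=500000
      - QDRANT__STORAGE__OPTIMIZERS__MAX_OPTIMIZATION_THREADS=4
      - QDRANT__STORAGE__WAL__WAL_CAPACITY_MB=32
    volumes:
      - ./qdrant_storage:/qdrant/storage   # assumed path
```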
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `cmd/import/main.go` (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
`VectorDim` (35-35)
🪛 GitHub Actions: Golang CI Lint
cmd/import/main.go
[error] 389-389: File is not properly formatted (gofumpt).
🪛 GitHub Check: build
cmd/import/main.go
[failure] 389-389:
File is not properly formatted (gofumpt)
🔇 Additional comments (3)
cmd/import/main.go (3)
241-253: Good post-import optimization flow.

The approach of re-enabling production HNSW indexing after bulk import completes is sound. Logging failures as warnings (rather than failing the entire import) is acceptable since the data has already been successfully imported, and operators can manually address indexing issues if needed.

Note that if `enableProductionIndexing` fails for a collection, that collection will remain in the import-optimized state (M=0, no HNSW), which will result in slower queries. Consider monitoring collection stats to verify that indexing completes successfully.
260-293: Excellent bulk import optimization strategy.

The configuration correctly optimizes for bulk import by:

- Disabling HNSW indexing (M=0) to eliminate index build overhead
- Storing vectors on disk (OnDisk=true) to minimize RAM usage
- Allowing quantized vectors on disk (AlwaysRam=false)

This minimizes resource usage during the import phase, with production settings restored by `enableProductionIndexing` afterward. The inline comments clearly document the intent.
351-396: Excellent fix for the previous critical bug.

The implementation correctly addresses the bug flagged in previous reviews by using `VectorsConfigDiff_ParamsMap` to target each named vector individually. The map-based approach ensures that "dirs", "names", and "contents" each receive their own HNSW configuration update.

The production settings are well-chosen:
- M=48 for production-quality HNSW graphs
- HNSW in RAM (OnDisk=false) for fast queries
- Vectors on disk (OnDisk=true) to conserve RAM
- Quantization in RAM (AlwaysRam=true) for performance
- IndexingThreshold=0 to trigger immediate background optimization
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/import/main.go (1)
189-205: Fix: invalid range over int prevents build
`for workerID := range optimalWorkers.NumWorkers` won't compile on Go toolchains before 1.22 (which added `range` over an integer). Use a counted for-loop:

```diff
-	for workerID := range optimalWorkers.NumWorkers {
+	for workerID := 0; workerID < optimalWorkers.NumWorkers; workerID++ {
```
♻️ Duplicate comments (2)
cmd/import/main.go (2)
351-395: Named-vector re-enable is now correct (ParamsMap). Please verify client API and run a quick config check.

This fixes the earlier bug by targeting each named vector via ParamsMap — nice.
- Confirm go-client v1.14.0 exposes VectorsConfigDiff_ParamsMap and VectorParamsDiffMap as used here.
- After running, verify each vector’s HNSW config is applied:
```bash
#!/bin/bash
set -euo pipefail

QHOST="${QDRANT_HOST:-localhost}"
QPORT="${QDRANT_PORT:-6333}" # HTTP port for REST

for c in java_collection python_collection javascript_collection cpp_collection go_collection misc_collection; do
  echo "Collection: $c"
  curl -s "http://$QHOST:$QPORT/collections/$c" \
    | jq '{name: .result.name, vectors: .result.config.params.vectors}'
done
```
1-824: Ensure gofumpt passes (previous CI failure)

Run `gofumpt -w cmd/import/main.go` before pushing to unblock CI.
🧹 Nitpick comments (4)
cmd/import/main.go (4)
241-251: Post-import production re-enable: good; consider parallelizing

The sequencing and logging are solid. To reduce wall time across many collections, fan out with a small worker pool and collect errors.
I can provide a short diff to parallelize if you want.
260-296: Import-time config looks correct; reduce duplication and confirm Distance/Quantization combo
- Settings align with the two-phase plan (vectors OnDisk, HNSW M=0).
- There’s repeated vector params across dirs/names/contents; extract a helper to avoid drift.
- Sanity-check that Distance_Manhattan + Binary quantization matches your retrieval expectations; keep as-is if this was validated.
Example DRY helper (add once in this file):
```go
func importVectorParams() *qdrant.VectorParams {
	return &qdrant.VectorParams{
		Size:     VectorDim,
		Distance: qdrant.Distance_Manhattan,
		OnDisk:   qdrant.PtrOf(true),
		HnswConfig: &qdrant.HnswConfigDiff{
			M:                 qdrant.PtrOf(uint64(0)),
			EfConstruct:       qdrant.PtrOf(uint64(500)),
			FullScanThreshold: qdrant.PtrOf(uint64(100000)),
			OnDisk:            qdrant.PtrOf(true),
		},
	}
}
```

Then:

```diff
-	"dirs": { ... repeated ... },
+	"dirs": importVectorParams(),
-	"names": { ... repeated ... },
+	"names": importVectorParams(),
-	"contents": { ... repeated ... },
+	"contents": importVectorParams(),
```

Also applies to: 307-314
56-56: Avoid duplicate VectorDim definitions across packages

You also define VectorDim=64 in internal/repository. Prefer a single source (e.g., import from internal/repository) to prevent drift.
564-566: Micro-alloc: hoist constant map out of the hot loop
`categoryToRank` is rebuilt per record. Move it to a package-level or enclosing scope to reduce allocs in large imports.
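The hoist is a one-line change in spirit: declare the map once at package scope instead of inside the per-record loop. A sketch with a hypothetical `categoryToRank` mapping (the actual categories and rank values in the import tool may differ):

```go
package main

import "fmt"

// categoryToRank is allocated once at package init rather than on every
// record; the concrete categories here are illustrative, not the tool's.
var categoryToRank = map[string]uint64{
	"critical": 0,
	"high":     1,
	"medium":   2,
	"low":      3,
}

// rankFor looks up a record's category without re-building the map per call;
// unknown categories fall back to a rank below all known ones.
func rankFor(category string) uint64 {
	if r, ok := categoryToRank[category]; ok {
		return r
	}
	return uint64(len(categoryToRank))
}

func main() {
	fmt.Println(rankFor("high"), rankFor("unknown"))
}
```

Since the map is read-only after initialization, sharing it across goroutines in the import workers is safe without a lock.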
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `cmd/import/main.go` (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (2)
internal/repository/scan_repository_qdrant_impl.go (1)
`VectorDim` (35-35)

internal/config/config.go (1)

`Config` (38-72)
Summary by CodeRabbit
New Features
Chores