Skip to content

Conversation

@matiasdaloia
Copy link
Contributor

@matiasdaloia matiasdaloia commented Oct 14, 2025

Summary by CodeRabbit

  • New Features

    • Lazy, thread-safe creation of Qdrant collections, automatically created on insert/import.
    • Overwrite option now drops existing collections individually with clearer feedback.
  • Refactor

    • Updated collection defaults (fewer shards/segments), enabled quantization, and added payload indexing.
    • Maintains post-import HNSW indexing with the new lazy creation flow.
    • Enhanced logging for collection lifecycle events.
  • Chores

    • Removed external Qdrant config file and its Docker bind mount to simplify setup.

@matiasdaloia matiasdaloia self-assigned this Oct 14, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 14, 2025

Walkthrough

Implements lazy, thread-safe Qdrant collection creation in the import tool, revises overwrite handling to drop existing collections conditionally, ensures collections before upserts, updates collection config (shards, segments, quantization, payload indexing), and removes external Qdrant config by deleting the compose volume bind and the qdrant-config.yaml file.

Changes

Cohort / File(s) Summary of changes
Import tool: lazy collection management and overwrite flow
cmd/import/main.go
Adds ensureCollectionExists with double-checked locking and in-memory cache; switches insert/import paths to lazy creation; updates overwrite flag behavior to drop existing collections iteratively; adjusts createCollection settings (reduced shards/segments, quantization, payload indexing); logging updates.
Container orchestration: config bind removal
docker-compose.qdrant.yml
Removes read-only bind mount of host Qdrant config (./qdrant-config.yaml:/qdrant/config/production.yaml:ro) from the qdrant service.
Qdrant config: file removal
qdrant-config.yaml
Deletes the configuration file entirely; no replacement content.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as Import CLI
  participant QC as Qdrant Client
  participant QS as Qdrant Server

  User->>CLI: Run import
  rect rgba(230,247,255,0.5)
    note over CLI: For each target collection during insert
    CLI->>CLI: ensureCollectionExists(name)
    alt Not cached
      CLI->>QS: Check collection exists
      opt Missing
        CLI->>QS: Create collection (shards, segments, quantization, payload indexes)
      end
      CLI->>CLI: Cache created/exists
    else Cached
      CLI->>CLI: Skip check/create
    end
  end
  CLI->>QC: Upsert points
  QC->>QS: Upsert request
  QS-->>QC: Ack
  QC-->>CLI: Result
Loading
sequenceDiagram
  autonumber
  actor User
  participant CLI as Import CLI
  participant QS as Qdrant Server

  User->>CLI: Run import -overwrite
  note over CLI: For each collection in scope
  loop Collections
    CLI->>QS: Check exists
    alt Exists
      CLI->>QS: Drop collection
      QS-->>CLI: Drop result
    else Not found
      CLI->>CLI: No action (lazy create later)
    end
  end
  note over CLI: Proceed with lazy creation during inserts
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hop through shards with gentle grace,
Creating homes only when in place.
If overwrite’s moon is bright, I sweep—
Old nests gone, new roots run deep.
No config scrolls, just burrow lore—
Lazily built, then indexed for more.
Thump-thump, imports soar! 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly and accurately summarizes the main changes—introducing lazy creation of collections, removing the Qdrant config file, and updating the Docker Compose setup—in a concise, descriptive sentence.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/mdaloia/import-script

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05320e7 and 27ce608.

📒 Files selected for processing (3)
  • cmd/import/main.go (5 hunks)
  • docker-compose.qdrant.yml (0 hunks)
  • qdrant-config.yaml (0 hunks)
💤 Files with no reviewable changes (2)
  • docker-compose.qdrant.yml
  • qdrant-config.yaml
🔇 Additional comments (5)
cmd/import/main.go (5)

58-61: LGTM! Proper concurrent state management.

The global map and mutex are correctly initialized for tracking created collections in a thread-safe manner. This is an appropriate pattern for lazy initialization in a CLI tool with concurrent workers.


349-352: LGTM! Optimized shard and segment configuration.

Reducing ShardNumber to 2 and DefaultSegmentNumber to 4 prevents memory spikes during bulk import, aligning with the PR's goal to optimize resource usage.


357-363: LGTM! Binary quantization for memory efficiency.

Adding binary quantization with AlwaysRam keeps quantized vectors in memory for fast access while reducing memory footprint. This is an appropriate optimization for vector search workloads.


370-398: LGTM! Payload indexing improves query performance.

Creating indexes on purl, version, url, category, and rank fields enables efficient filtering and grouping operations. The error handling with warning logs is appropriate for index creation.


665-672: LGTM! Lazy creation correctly implemented.

Collections are created on-demand before the first upsert, preventing unnecessary memory allocation for unused language collections. The error handling properly propagates failures. This approach aligns with the PR objective to prevent memory spikes.

Comment on lines +127 to 144
if *overwrite {
for _, collectionName := range collections {
collectionExists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error dropping collection %s: %v", collectionName, err)
log.Printf("Error checking if collection %s exists: %v", collectionName, err)
return
}
log.Printf("Collection '%s' dropped successfully", collectionName)
collectionExists = false
}

// Create collection if it doesn't exist
if !collectionExists {
createCollection(ctx, client, collectionName)
} else {
log.Printf("Using existing collection: %s", collectionName)
if collectionExists {
log.Printf("Collection %s exists and overwrite flag is set. Dropping collection...", collectionName)
err = client.DeleteCollection(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error dropping collection %s: %v", collectionName, err)
}
log.Printf("Collection '%s' dropped successfully", collectionName)
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix incomplete error handling in overwrite loop.

Line 132 uses a bare return when checking collection existence fails, which exits the entire overwrite loop early. This leaves the system in an inconsistent state where some collections may be dropped but others are never checked.

Apply this diff to ensure consistent error handling:

 		for _, collectionName := range collections {
 			collectionExists, err := client.CollectionExists(ctx, collectionName)
 			if err != nil {
-				log.Printf("Error checking if collection %s exists: %v", collectionName, err)
-				return
+				cleanupAndExit(client, "Error checking if collection %s exists: %v", collectionName, err)
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if *overwrite {
for _, collectionName := range collections {
collectionExists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error dropping collection %s: %v", collectionName, err)
log.Printf("Error checking if collection %s exists: %v", collectionName, err)
return
}
log.Printf("Collection '%s' dropped successfully", collectionName)
collectionExists = false
}
// Create collection if it doesn't exist
if !collectionExists {
createCollection(ctx, client, collectionName)
} else {
log.Printf("Using existing collection: %s", collectionName)
if collectionExists {
log.Printf("Collection %s exists and overwrite flag is set. Dropping collection...", collectionName)
err = client.DeleteCollection(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error dropping collection %s: %v", collectionName, err)
}
log.Printf("Collection '%s' dropped successfully", collectionName)
}
}
}
if *overwrite {
for _, collectionName := range collections {
collectionExists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error checking if collection %s exists: %v", collectionName, err)
}
if collectionExists {
log.Printf("Collection %s exists and overwrite flag is set. Dropping collection...", collectionName)
err = client.DeleteCollection(ctx, collectionName)
if err != nil {
cleanupAndExit(client, "Error dropping collection %s: %v", collectionName, err)
}
log.Printf("Collection '%s' dropped successfully", collectionName)
}
}
}
🤖 Prompt for AI Agents
In cmd/import/main.go around lines 127 to 144, the overwrite loop currently uses
a bare `return` when CollectionExists returns an error which aborts the loop
without consistent cleanup; replace that bare `return` with a call to
`cleanupAndExit(client, "Error checking if collection %s exists: %v",
collectionName, err)` so the error is handled consistently (matching the other
error paths), ensuring proper cleanup and process exit when the existence check
fails.

Comment on lines +265 to +305
// ensureCollectionExists checks if a collection exists and creates it lazily if not.
func ensureCollectionExists(ctx context.Context, client *qdrant.Client, collectionName string) error {
// Check if we've already created this collection
collectionCreateLock.Lock()
if createdCollections[collectionName] {
collectionCreateLock.Unlock()
return nil
}
collectionCreateLock.Unlock()

// Check if collection exists in Qdrant
exists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
return fmt.Errorf("error checking collection existence: %w", err)
}

if exists {
// Mark as created to avoid checking again
collectionCreateLock.Lock()
createdCollections[collectionName] = true
collectionCreateLock.Unlock()
log.Printf("Collection %s already exists, using it", collectionName)
return nil
}

// Collection doesn't exist, create it (thread-safe)
collectionCreateLock.Lock()
defer collectionCreateLock.Unlock()

// Double-check pattern: another goroutine might have created it
if createdCollections[collectionName] {
return nil
}

log.Printf("Creating collection: %s", collectionName)
createCollection(ctx, client, collectionName)
createdCollections[collectionName] = true
log.Printf("Collection %s created and marked as ready", collectionName)

return nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: createCollection terminates program in concurrent context.

Line 300 calls createCollection which uses cleanupAndExit on errors, terminating the entire program. In this concurrent context with multiple worker goroutines, a collection creation failure will abruptly kill all workers mid-processing, potentially losing in-flight data.

Refactor to return errors gracefully:

Step 1: Modify createCollection to return an error instead of calling cleanupAndExit:

-func createCollection(ctx context.Context, client *qdrant.Client, collectionName string) {
+func createCollection(ctx context.Context, client *qdrant.Client, collectionName string) error {
 	log.Printf("Creating language-based collection with named vectors: %s", collectionName)
 	// ... rest of function ...
 	})
 	if err != nil {
-		cleanupAndExit(client, "Error creating collection %s: %v", collectionName, err)
+		return fmt.Errorf("error creating collection %s: %w", collectionName, err)
 	}
 	log.Printf("Collection '%s' with named vectors created successfully", collectionName)
+	return nil
 }

Step 2: Update Line 300 to handle the error:

 	log.Printf("Creating collection: %s", collectionName)
-	createCollection(ctx, client, collectionName)
+	if err := createCollection(ctx, client, collectionName); err != nil {
+		return err
+	}
 	createdCollections[collectionName] = true

This allows worker goroutines to handle creation failures gracefully and report errors through the errorsChan mechanism already in place (lines 171, 187).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// ensureCollectionExists checks if a collection exists and creates it lazily if not.
func ensureCollectionExists(ctx context.Context, client *qdrant.Client, collectionName string) error {
// Check if we've already created this collection
collectionCreateLock.Lock()
if createdCollections[collectionName] {
collectionCreateLock.Unlock()
return nil
}
collectionCreateLock.Unlock()
// Check if collection exists in Qdrant
exists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
return fmt.Errorf("error checking collection existence: %w", err)
}
if exists {
// Mark as created to avoid checking again
collectionCreateLock.Lock()
createdCollections[collectionName] = true
collectionCreateLock.Unlock()
log.Printf("Collection %s already exists, using it", collectionName)
return nil
}
// Collection doesn't exist, create it (thread-safe)
collectionCreateLock.Lock()
defer collectionCreateLock.Unlock()
// Double-check pattern: another goroutine might have created it
if createdCollections[collectionName] {
return nil
}
log.Printf("Creating collection: %s", collectionName)
createCollection(ctx, client, collectionName)
createdCollections[collectionName] = true
log.Printf("Collection %s created and marked as ready", collectionName)
return nil
}
// ensureCollectionExists checks if a collection exists and creates it lazily if not.
func ensureCollectionExists(ctx context.Context, client *qdrant.Client, collectionName string) error {
// Check if we've already created this collection
collectionCreateLock.Lock()
if createdCollections[collectionName] {
collectionCreateLock.Unlock()
return nil
}
collectionCreateLock.Unlock()
// Check if collection exists in Qdrant
exists, err := client.CollectionExists(ctx, collectionName)
if err != nil {
return fmt.Errorf("error checking collection existence: %w", err)
}
if exists {
// Mark as created to avoid checking again
collectionCreateLock.Lock()
createdCollections[collectionName] = true
collectionCreateLock.Unlock()
log.Printf("Collection %s already exists, using it", collectionName)
return nil
}
// Collection doesn't exist, create it (thread-safe)
collectionCreateLock.Lock()
defer collectionCreateLock.Unlock()
// Double-check pattern: another goroutine might have created it
if createdCollections[collectionName] {
return nil
}
log.Printf("Creating collection: %s", collectionName)
if err := createCollection(ctx, client, collectionName); err != nil {
return err
}
createdCollections[collectionName] = true
log.Printf("Collection %s created and marked as ready", collectionName)
return nil
}
🤖 Prompt for AI Agents
In cmd/import/main.go around lines 265 to 305, the call to createCollection at
line ~300 can terminate the whole process via cleanupAndExit which is unsafe in
a concurrent worker context; change createCollection to return an error instead
of exiting on failure, remove any direct calls to cleanupAndExit from it, and
update this function to call createCollection(ctx, client, collectionName) and
check the returned error; if non-nil, release locks as appropriate and forward
the error to the worker error channel (errorsChan) or return the error up to the
caller so the worker goroutine can handle/report it gracefully rather than
killing the process.

@matiasdaloia matiasdaloia merged commit 0c8d1ad into main Oct 14, 2025
3 checks passed
@matiasdaloia matiasdaloia deleted the fix/mdaloia/import-script branch October 14, 2025 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants