@icecrasher321 icecrasher321 commented Aug 8, 2025

Summary

The OpenAI embeddings API imposes no minimum chunk size in characters, so this PR removes the corresponding restriction here as well and sets the default minimum chunk size to 1 character.

Type of Change

  • Bug fix

Testing

Tested manually by creating a document with fewer than 10 characters.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel bot commented Aug 8, 2025

The latest updates on your projects.

Name | Status | Preview | Comments | Updated (UTC)
sim | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Aug 8, 2025 7:41pm

1 Skipped Deployment

Name | Status | Preview | Comments | Updated (UTC)
docs | ⬜️ Skipped (Inspect) | | | Aug 8, 2025 7:41pm

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR systematically reduces the minimum chunk size constraint across Sim's document processing and knowledge base system from 50-100 characters to 1 character. The changes span the entire document processing pipeline, including:

  • Frontend validation: Upload modals and create modals now allow chunks as small as 1 character
  • API validation: Knowledge base creation and document processing APIs updated to accept minimum chunk sizes of 1
  • Core processing: The TextChunker class, DocsChunker, and document processor functions all now default to a minimum chunk size of 1
  • Tool integration: Knowledge tools updated to process documents with 1-character minimum chunks

The changes maintain backward compatibility by preserving existing parameter structures while changing default values. This modification affects how documents are chunked for RAG (Retrieval Augmented Generation) applications, allowing the system to preserve very small text fragments like single words, abbreviations, code snippets, or structured data elements that were previously filtered out.
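As a rough illustration of what "preserving parameter structures while changing default values" could look like, here is a minimal sketch. Every name in it (`ChunkingConfig`, `DEFAULT_CHUNKING`, `resolveChunkingConfig`) is hypothetical and not taken from the Sim codebase, and the `maxSize` and `overlap` defaults are placeholders; only the minimum of 1 reflects this PR.

```typescript
// Hypothetical config shape; field names are assumptions for illustration.
interface ChunkingConfig {
  maxSize: number
  minSize: number
  overlap: number
}

// Placeholder defaults. Only minSize = 1 mirrors the change described in this PR.
const DEFAULT_CHUNKING: ChunkingConfig = { maxSize: 1024, minSize: 1, overlap: 200 }

// Merge caller overrides onto the defaults and validate the lower bound.
function resolveChunkingConfig(input: Partial<ChunkingConfig> = {}): ChunkingConfig {
  const config: ChunkingConfig = { ...DEFAULT_CHUNKING, ...input }
  if (!Number.isInteger(config.minSize) || config.minSize < 1) {
    throw new RangeError('minSize must be an integer >= 1')
  }
  if (config.minSize > config.maxSize) {
    throw new RangeError('minSize must not exceed maxSize')
  }
  return config
}
```

Callers that already pass an explicit minimum (for example `resolveChunkingConfig({ minSize: 50 })`) keep their behavior under this sketch; only callers relying on the default see the new lower bound.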

The chunking system uses hierarchical text splitting that attempts to respect semantic boundaries (sentences, paragraphs, etc.) but falls back to character-level splitting when necessary. With this change, even single-character chunks will be preserved if larger semantic chunks cannot be formed, ensuring maximum content retention during document processing.
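The splitting behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual TextChunker: the separator list, the function names, and the absence of overlap or merging of small pieces are all assumptions made to keep the example short.

```typescript
// Hypothetical separator hierarchy, coarsest first: paragraph, line, sentence, word.
const SEPARATORS: string[] = ['\n\n', '\n', '. ', ' ']

// Split on progressively finer separators until every piece fits in maxSize.
function splitRecursively(text: string, maxSize: number, level: number = 0): string[] {
  if (maxSize < 1) throw new RangeError('maxSize must be >= 1')
  if (text.length <= maxSize) return [text]

  const separator = SEPARATORS[level]
  if (separator === undefined) {
    // No separators left: fall back to fixed-width character-level splitting.
    const pieces: string[] = []
    for (let i = 0; i < text.length; i += maxSize) {
      pieces.push(text.slice(i, i + maxSize))
    }
    return pieces
  }

  const result: string[] = []
  for (const part of text.split(separator)) {
    result.push(...splitRecursively(part, maxSize, level + 1))
  }
  return result
}

// Trim each piece and drop those shorter than minSize (default 1 keeps any non-empty piece).
function chunkText(text: string, maxSize: number, minSize: number = 1): string[] {
  return splitRecursively(text, maxSize)
    .map((piece) => piece.trim())
    .filter((piece) => piece.length >= minSize)
}
```

In this sketch, `chunkText('Hi\n\nThis is a longer paragraph.', 30)` keeps the two-character paragraph `Hi` as its own chunk, whereas passing a minimum of 10 as the third argument would drop it. That difference is the content-retention effect the summary describes.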

PR Description Notes:

  • The PR description is incomplete - all sections remain as template placeholders without actual content describing the changes or testing approach

Confidence score: 2/5

  • This PR introduces significant risk due to potential performance and quality impacts from creating many small chunks
  • Score reflects concerns about storage costs, retrieval quality degradation, and inconsistent implementation across the codebase
  • Pay close attention to chunking logic files and API validation schemas for potential inconsistencies

11 files reviewed, 1 comment


@vercel vercel bot temporarily deployed to Preview – docs August 8, 2025 19:36 Inactive
@icecrasher321 icecrasher321 merged commit 0ec91f9 into staging Aug 8, 2025
3 of 4 checks passed
waleedlatif1 pushed a commit that referenced this pull request Aug 8, 2025
* fix(min-chunk): remove minsize for chunk

* fix tests
waleedlatif1 added a commit that referenced this pull request Aug 9, 2025
…ypes (#919)

* feat(execution-filesystem): system to pass files between blocks  (#866)

* feat(files): pass files between blocks

* presigned URL for downloads

* Remove latest migration before merge

* starter block file upload wasn't getting logged

* checkpoint in human readable form

* checkpoint files / file type outputs

* file downloads working for block outputs

* checkpoint file download

* fix type issues

* remove filereference interface with simpler user file interface

* show files in the tag dropdown for start block

* more migration to simple url object, reduce presigned time to 5 min

* Remove migration 0065_parallel_nightmare and related files

- Deleted apps/sim/db/migrations/0065_parallel_nightmare.sql
- Deleted apps/sim/db/migrations/meta/0065_snapshot.json
- Removed 0065 entry from apps/sim/db/migrations/meta/_journal.json

Preparing for merge with origin/staging and migration regeneration

* add migration files

* fix tests

* Update apps/sim/lib/uploads/setup.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update apps/sim/lib/workflows/execution-file-storage.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update apps/sim/lib/workflows/execution-file-storage.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* cleanup types

* fix lint

* fix logs typing for file refs

* open download in new tab

* fixed

* Update apps/sim/tools/index.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix file block

* cleanup unused code

* fix bugs

* remove hacky file id logic

* fix drag and drop

* fix tests

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* feat(trigger-mode): added trigger-mode to workflow_blocks table (#902)

* fix(schedules-perms): use regular perm system to view/edit schedule info (#901)

* fix(schedules-perms): use regular perm system to view schedule info

* fix perms

* improve logging

* feat(webhooks): deprecate singular webhook block + add trigger mode to blocks (#903)

* feat(triggers): added new trigger mode for blocks, added socket event, ran migrations

* Rename old trigger/ directory to background/

* cleaned up, ensured that we display active webhook at the block-level

* fix submenu in tag dropdown

* keyboard nav on tag dropdown submenu

* feat(triggers): add outlook to new triggers system

* cleanup

* add types to tag dropdown, type all outputs for tools and use that over block outputs

* update doc generator to truly reflect outputs

* fix docs

* add trigger handler

* fix active webhook tag

* tag dropdown fix for triggers

* remove trigger mode schema change

* cleanup

* prevent tooltip showing up on modal open

* updated trigger config

* fix type issues

---------

Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai>

* fix(helm): fix helm charts migrations using wrong image (#907)

* fix(helm): fix helm charts migrations using wrong image

* fixed migrations

* feat(whitelist): add email & domain-based whitelisting for signups (#908)

* improvement(helm): fix duplicate SOCKET_SERVER_URL and add additional envvars to template (#909)

* improvement(helm): fix duplicate SOCKET_SERVER_URL and add additional envvars to template

* rm serper & freestyle

* improvement(tag-dropdown): typed tag dropdown values (#910)

* fix(min-chunk): remove minsize for chunk (#911)

* fix(min-chunk): remove minsize for chunk

* fix tests

* improvement(chunk-config): migrate unused default for consistency (#913)

* fix(mailer): update mailer to use the EMAIL_DOMAIN (#914)

* fix(mailer): update mailer to use the EMAIL_DOMAIN

* add more

* Improvement(cc): added cc to gmail and outlook (#900)

* changed just gmail

* bun run lint

* fixed bcc

* updated docs

---------

Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net>
Co-authored-by: waleedlatif1 <walif6@gmail.com>

* fix(email-validation): add email validation to prevent bouncing, fixed OTP validation (#916)

* feat(email-validation): add email validation to prevent bouncing

* removed suspicious patterns

* fix(verification): fixed OTP verification

* fix failing tests, cleanup

* fix(otp): fix email not sending (#917)

* fix(email): manual OTP instead of better-auth (#921)

* fix(email): manual OTP instead of better-auth

* lint

---------

Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai>
Co-authored-by: Adam Gough <77861281+aadamgough@users.noreply.github.com>
Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net>
@waleedlatif1 waleedlatif1 deleted the fix/min-size-chunk branch August 11, 2025 00:21
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025
* fix(min-chunk): remove minsize for chunk

* fix tests
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025
…ypes (simstudioai#919)
