fix(min-chunk): remove minsize for chunk #911
Greptile Summary
This PR systematically reduces the minimum chunk size constraint across Sim's document processing and knowledge base system from 50-100 characters to 1 character. The changes span the entire document processing pipeline, including:
- Frontend validation: Upload modals and create modals now allow chunks as small as 1 character
- API validation: Knowledge base creation and document processing APIs updated to accept minimum chunk sizes of 1
- Core processing: The TextChunker class, DocsChunker, and document processor functions all now default to a minimum chunk size of 1
- Tool integration: Knowledge tools updated to process documents with 1-character minimum chunks
The changes maintain backward compatibility by preserving existing parameter structures while changing default values. This modification affects how documents are chunked for RAG (Retrieval Augmented Generation) applications, allowing the system to preserve very small text fragments like single words, abbreviations, code snippets, or structured data elements that were previously filtered out.
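As an illustration of keeping the parameter structure while changing only a default, here is a minimal sketch. The names `ChunkingConfig`, `DEFAULT_CHUNKING`, and `resolveChunkingConfig`, as well as the `maxSize` and `overlap` values, are hypothetical and not taken from the PR; only the minimum of 1 comes from the summary above.

```typescript
// Hypothetical config shape; field names are assumptions for illustration.
interface ChunkingConfig {
  maxSize: number // maximum chunk size (illustrative value below)
  minSize: number // minimum chunk size in characters
  overlap: number // overlap between adjacent chunks (illustrative value below)
}

// The shape is unchanged; only the minSize default drops to 1.
const DEFAULT_CHUNKING: ChunkingConfig = { maxSize: 1024, minSize: 1, overlap: 200 }

function resolveChunkingConfig(input: Partial<ChunkingConfig> = {}): ChunkingConfig {
  // Caller-supplied values win over defaults, so existing callers keep working.
  const config: ChunkingConfig = { ...DEFAULT_CHUNKING, ...input }

  // Validation floor: any positive integer is accepted as a minimum.
  if (!Number.isInteger(config.minSize) || config.minSize < 1) {
    throw new RangeError('minSize must be an integer >= 1')
  }
  if (config.minSize > config.maxSize) {
    throw new RangeError('minSize must not exceed maxSize')
  }
  return config
}
```

A caller that still passes an explicit `minSize` of 100 gets the old behavior, while a caller that passes nothing now gets a minimum of 1.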
The chunking system uses hierarchical text splitting that attempts to respect semantic boundaries (sentences, paragraphs, etc.) but falls back to character-level splitting when necessary. With this change, even single-character chunks will be preserved if larger semantic chunks cannot be formed, ensuring maximum content retention during document processing.
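The behavior described above can be sketched roughly as follows. This is not the project's TextChunker: the separator list, the function names `splitRecursive` and `chunkText`, and the option names are assumptions, and the sketch omits the merging of small pieces and overlap that a real chunker would likely perform.

```typescript
// Hypothetical options; names are assumptions for illustration.
interface ChunkOptions {
  chunkSize: number // maximum characters per chunk
  minChunkSize: number // chunks shorter than this are dropped
}

// Separators ordered from coarsest to finest semantic boundary.
const SEPARATORS: string[] = ['\n\n', '\n', '. ', ' ']

function splitRecursive(text: string, chunkSize: number, level = 0): string[] {
  if (text.length <= chunkSize) return [text]

  if (level >= SEPARATORS.length) {
    // Fallback: no semantic separator left, cut at fixed character offsets.
    const pieces: string[] = []
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize))
    }
    return pieces
  }

  // Split on the current separator and recurse into each part with a finer one.
  const result: string[] = []
  for (const part of text.split(SEPARATORS[level])) {
    for (const piece of splitRecursive(part, chunkSize, level + 1)) {
      result.push(piece)
    }
  }
  return result
}

function chunkText(text: string, options: ChunkOptions): string[] {
  return splitRecursive(text, options.chunkSize)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length >= options.minChunkSize)
}
```

In this sketch, a short fragment such as `ID: 7` is discarded when `minChunkSize` is 50 and survives when it is 1; with a minimum of 1, only empty chunks are filtered out.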
PR Description Notes:
- The PR description is incomplete - all sections remain as template placeholders without actual content describing the changes or testing approach
Confidence score: 2/5
- This PR introduces significant risk due to potential performance and quality impacts from creating many small chunks
- Score reflects concerns about storage costs, retrieval quality degradation, and inconsistent implementation across the codebase
- Pay close attention to chunking logic files and API validation schemas for potential inconsistencies
11 files reviewed, 1 comment
* fix(min-chunk): remove minsize for chunk * fix tests
…ypes (#919) * feat(execution-filesystem): system to pass files between blocks (#866) * feat(files): pass files between blocks * presigned URL for downloads * Remove latest migration before merge * starter block file upload wasn't getting logged * checkpoint in human readable form * checkpoint files / file type outputs * file downloads working for block outputs * checkpoint file download * fix type issues * remove filereference interface with simpler user file interface * show files in the tag dropdown for start block * more migration to simple url object, reduce presigned time to 5 min * Remove migration 0065_parallel_nightmare and related files - Deleted apps/sim/db/migrations/0065_parallel_nightmare.sql - Deleted apps/sim/db/migrations/meta/0065_snapshot.json - Removed 0065 entry from apps/sim/db/migrations/meta/_journal.json Preparing for merge with origin/staging and migration regeneration * add migration files * fix tests * Update apps/sim/lib/uploads/setup.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update apps/sim/lib/workflows/execution-file-storage.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update apps/sim/lib/workflows/execution-file-storage.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * cleanup types * fix lint * fix logs typing for file refs * open download in new tab * fixed * Update apps/sim/tools/index.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix file block * cleanup unused code * fix bugs * remove hacky file id logic * fix drag and drop * fix tests --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * feat(trigger-mode): added trigger-mode to workflow_blocks table (#902) * fix(schedules-perms): use regular perm system to view/edit schedule info (#901) * fix(schedules-perms): use regular perm 
system to view schedule info * fix perms * improve logging * feat(webhooks): deprecate singular webhook block + add trigger mode to blocks (#903) * feat(triggers): added new trigger mode for blocks, added socket event, ran migrations * Rename old trigger/ directory to background/ * cleaned up, ensured that we display active webhook at the block-level * fix submenu in tag dropdown * keyboard nav on tag dropdown submenu * feat(triggers): add outlook to new triggers system * cleanup * add types to tag dropdown, type all outputs for tools and use that over block outputs * update doc generator to truly reflect outputs * fix docs * add trigger handler * fix active webhook tag * tag dropdown fix for triggers * remove trigger mode schema change * feat(execution-filesystem): system to pass files between blocks (#866) * feat(files): pass files between blocks * presigned URL for downloads * Remove latest migration before merge * starter block file upload wasn't getting logged * checkpoint in human readable form * checkpoint files / file type outputs * file downloads working for block outputs * checkpoint file download * fix type issues * remove filereference interface with simpler user file interface * show files in the tag dropdown for start block * more migration to simple url object, reduce presigned time to 5 min * Remove migration 0065_parallel_nightmare and related files - Deleted apps/sim/db/migrations/0065_parallel_nightmare.sql - Deleted apps/sim/db/migrations/meta/0065_snapshot.json - Removed 0065 entry from apps/sim/db/migrations/meta/_journal.json Preparing for merge with origin/staging and migration regeneration * add migration files * fix tests * Update apps/sim/lib/uploads/setup.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update apps/sim/lib/workflows/execution-file-storage.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update 
apps/sim/lib/workflows/execution-file-storage.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * cleanup types * fix lint * fix logs typing for file refs * open download in new tab * fixed * Update apps/sim/tools/index.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix file block * cleanup unused code * fix bugs * remove hacky file id logic * fix drag and drop * fix tests --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * feat(trigger-mode): added trigger-mode to workflow_blocks table (#902) * fix(schedules-perms): use regular perm system to view/edit schedule info (#901) * fix(schedules-perms): use regular perm system to view schedule info * fix perms * improve logging * cleanup * prevent tooltip showing up on modal open * updated trigger config * fix type issues --------- Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai> * fix(helm): fix helm charts migrations using wrong image (#907) * fix(helm): fix helm charts migrations using wrong image * fixed migrations * feat(whitelist): add email & domain-based whitelisting for signups (#908) * improvement(helm): fix duplicate SOCKET_SERVER_URL and add additional envvars to template (#909) * improvement(helm): fix duplicate SOCKET_SERVER_URL and add additional envvars to template * rm serper & freestyle * improvement(tag-dropdown): typed tag dropdown values (#910) * fix(min-chunk): remove minsize for chunk (#911) * fix(min-chunk): remove minsize for chunk * fix tests * improvement(chunk-config): migrate unused default for consistency (#913) * fix(mailer): update mailer to use the EMAIL_DOMAIN (#914) * fix(mailer): update mailer to use the EMAIL_DOMAIn * add more * Improvement(cc): added cc to gmail and outlook (#900) * 
changed just gmail * bun run lint * fixed bcc * updated docs --------- Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net> Co-authored-by: waleedlatif1 <walif6@gmail.com> * fix(email-validation): add email validation to prevent bouncing, fixed OTP validation (#916) * feat(email-validation): add email validation to prevent bouncing * removed suspicious patterns * fix(verification): fixed OTP verification * fix failing tests, cleanup * fix(otp): fix email not sending (#917) * fix(email): manual OTP instead of better-auth (#921) * fix(email): manual OTP instead of better-auth * lint --------- Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai> Co-authored-by: Adam Gough <77861281+aadamgough@users.noreply.github.com> Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net>
Summary
The OpenAI embeddings API has no minimum chunk size requirement in characters, so this PR removes the constraint here too and sets the default minimum to 1 character.
Type of Change
Testing
Tested manually by creating a document with fewer than 10 characters.
Checklist