Skip to content

Add chunking support to vectorize.table() #162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 43 commits into from
Jan 21, 2025
Merged

Conversation

asr2003
Copy link
Contributor

@asr2003 asr2003 commented Oct 17, 2024

Closes #142
/claim #142

@asr2003
Copy link
Contributor Author

asr2003 commented Oct 17, 2024

@ChuckHend Have a look at your free time and kindly approve workflows, if any adjustments needed i am happy to incorporate them

@asr2003
Copy link
Contributor Author

asr2003 commented Oct 18, 2024

Waiting for a quick review on this too - @ChuckHend

@asr2003
Copy link
Contributor Author

asr2003 commented Oct 19, 2024

@ChuckHend Any changes needed here?

@ChuckHend
Copy link
Owner

I might need to provide further clarification in the issue. What is intended is that when calling vectorize.table() on an existing table, each row is split into multiple 'chunks' (one new row for each chunk). I think these new rows will need to live on a separate table.

@asr2003
Copy link
Contributor Author

asr2003 commented Oct 23, 2024

@ChuckHend Updated the changes with intended approach of chunking vectorie.table by retreiving rows and lives into new table,

Have a review on it and If any changes or modifications required I am ready to incorporate them

@asr2003
Copy link
Contributor Author

asr2003 commented Oct 26, 2024

@ChuckHend Have a review on it in your free time

@ChuckHend
Copy link
Owner

Can you add a test or two that shows how the functionality will work and assert that it functions as expected?

@asr2003
Copy link
Contributor Author

asr2003 commented Nov 6, 2024

Sure! I will update the tests

@asr2003
Copy link
Contributor Author

asr2003 commented Dec 13, 2024 via email

@ChuckHend
Copy link
Owner

Yes, let's leave the integration with vectorize.table() out to make this simpler / easier. Just keep chunk_text() a sql function that we can call directly, and add a couple tests that validate its functionality and shows how to use it.

@asr2003
Copy link
Contributor Author

asr2003 commented Dec 20, 2024

Okay!

@asr2003
Copy link
Contributor Author

asr2003 commented Jan 14, 2025

@ChuckHend Any reviews on it

@ChuckHend
Copy link
Owner

Ill have this reviewed by end of week

@asr2003
Copy link
Contributor Author

asr2003 commented Jan 18, 2025

I haven't pushed my latest changes here which are in my local :)

@asr2003 asr2003 requested a review from ChuckHend January 18, 2025 10:02
@asr2003
Copy link
Contributor Author

asr2003 commented Jan 19, 2025

@ChuckHend 👋
I have updated the chunk_text function to use the text_splitter crate instead of the previous manual approach. This change makes the chunking more accurate especially for larger documents and ensures we preserve word boundaries while still enforcing a character limit. The previous logic sometimes caused inconsistencies with expected vs actual outputs. Using text_splitter brings more stability and aligns with the best practices seen in LangChain’s recursive text splitter. I have also updated docs of it's usage as standalone sql function and added couple of tests that validate its functionality and tests passes locally.

test test_chunk_text ... URL: ***localhost:28817/postgres
ok

With this I think it's ready to merge

@@ -212,3 +212,17 @@ pub async fn ready(conn: &Pool<Postgres>) -> bool {
.await
.expect("failed")
}

/// Fetch rows from a given table and schema
pub async fn fetch_table_rows(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this function being used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! It used in previous approach, will remove it

@asr2003 asr2003 requested a review from ChuckHend January 21, 2025 02:47
@ChuckHend ChuckHend merged commit d874b36 into ChuckHend:main Jan 21, 2025
2 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chunking support to vectorize.table()
2 participants