Summarize the data store on every document summarization #69

conradocloudera · 2024-12-11T21:15:08Z

This makes the data store summary consistent with the documents and repeatable (vs. generated every time), plus it's faster to fetch because it's saved. We might want to change this later so the summary is created when the Java BE wants it (e.g. once all new documents have been indexed/summarized, so it makes sense to regenerate).
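
A minimal sketch of the idea, not the code in this PR: `SummaryStore`, `summarize_document`, `get_data_store_summary`, and the `summarize` callable standing in for the LLM are all hypothetical names. It shows the data store summary being regenerated and saved whenever a document is summarized, so that fetching it later returns the saved value without calling the model again.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical stand-in for the summarization LLM: takes text, returns a summary.
Summarizer = Callable[[str], str]


@dataclass
class SummaryStore:
    """Illustrative only; not the project's real summary index."""

    summarize: Summarizer
    doc_summaries: Dict[str, str] = field(default_factory=dict)
    data_store_summary: str = ""

    def summarize_document(self, doc_id: str, text: str) -> str:
        # Summarize the document, then refresh the saved data store summary
        # from all document summaries (sorted so the input order is stable).
        self.doc_summaries[doc_id] = self.summarize(text)
        combined = "\n".join(
            self.doc_summaries[key] for key in sorted(self.doc_summaries)
        )
        self.data_store_summary = self.summarize(combined)
        return self.doc_summaries[doc_id]

    def get_data_store_summary(self) -> str:
        # No model call here: return what was saved at summarization time.
        return self.data_store_summary
```

With this shape, repeated reads return the same text until another document is summarized, which is the consistency and speed benefit described above.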

jkwatson · 2024-12-11T21:37:51Z

how difficult would it be to add a test to make sure this works, at least with the llama-index pieces?

conradocloudera · 2024-12-11T21:41:44Z

Good question. I think the trickiest bit would be to update the noop model to return something meaningful, but not exactly the same every time. We could check whether returning the same thing would be fine 🤔

jkwatson · 2024-12-11T22:00:11Z

I think the same content would be fine for starters.

conradocloudera · 2024-12-11T22:38:14Z

Seems like we were already testing this! The test verifies that the answer is the same. It just happens to be generated on summary creation vs on demand now.
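
The kind of check described above can be sketched as follows. This is an illustration under assumptions, not the repository's actual test: `noop_summarize` and `build_data_store_summary` are hypothetical names, and the no-op model simply returns fixed text, which is the "same content" case discussed earlier.

```python
from typing import Callable, List


def noop_summarize(text: str) -> str:
    # Hypothetical no-op model: always returns the same content.
    return "this is a summary"


def build_data_store_summary(
    doc_texts: List[str], summarize: Callable[[str], str]
) -> str:
    # Summarize each document, then summarize the joined document summaries.
    doc_summaries = [summarize(text) for text in doc_texts]
    return summarize("\n".join(doc_summaries))


def test_data_store_summary_is_repeatable() -> None:
    docs = ["first document", "second document"]
    # One value plays the role of the summary saved at creation time,
    # the other the summary generated on demand.
    saved = build_data_store_summary(docs, noop_summarize)
    on_demand = build_data_store_summary(docs, noop_summarize)
    assert saved == on_demand


test_data_store_summary_is_repeatable()
```

Because the no-op model is deterministic, the assertion holds whether the summary is produced at creation time or at request time.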

* upgrade everything
* small refactor for params, update loading
* add bedrock converse
* fix loading
* Clean up Cohere suggested questions
* Add property-based test for process_response() (#56)
* Add hypothesis
* Add property-based test for process_response()
* Shorten variable
* Formatting
* Add type annotations
* Fix type annotation
* hacking on startup scripts
* hacking on startup scripts, moar
* fix wrong dir
* try having the java side restart itself if it dies
* see output from java startup
* add debug info
* add the executable bit
* change the flags
* Add docstrings for tests
* refactor datasourceId
* update to exclude 405b model and default to 8b
* update readme for new cohere
* fix broken tests monkeypatching
* "wip on creating with models and response chunks"
* wip on modal updates
* commit java updates
* wip on populating the chat setting modal
* set up ui for updating a session
* add update method
* use updated session for chat
* remove query configuration from the chat context
* refactoring fe and fixing bug with empty model
* remove the datasource id from the context and use the active session instead
* Update release version to 1.4.0-beta
* Support multiple embedding models (#59)
* add embedding model to the data source in the java API
* embedding model used from the datasource while indexing
* replace the rest of the embedding model defaults
* "test & fix bugs with embedding variability"
* small refactoring to make embedding & llm caii methods look the same
* fix linting issues
* add a todo for a failing property test case
* remove unused import

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>

* Provide CAII batch embedding for better performance (#35)
* CAII endpoint discovery (#60)
* "wip on endpoint listing"
* "wip on list_endpoints typing"
* "refactoring to endpoint object"
* "wip filtering"
* "endpoints queried!"
* "refactoring"
* "wip on cleaning up types"
* "type cleanup complete"
* "moving files"
* "use a dummy embedding model for deletes"
* fix some bits from merge, get evals working again with CAII, tests passing
* formatting
* clean up ruff stuff
* use the chat llm for evals
* fix mypy for reformatting
* "wip on java reconciler"
* "reconciler don't do no model; start python work"
* "python - updating for summarization model"
* "comment out batch embeddings to get it working again"
* add handling for no summarization in the files table
* finish up ui and python for summarization
* make sure to update the time-updated fields on data sources and chat sessions
* use no-op models when we don't need real ones for summary functionality
* Update release version to dev-testing
* use the summarization llm when summarizing summaries

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>
Co-authored-by: actions-user <actions@github.com>

* Update release version to 1.4.0
* pass the original filename from java-> python so we don't need s3 metadata to store it
* don't read the whole directory when summarizing docs
* "refactor java to use RagFileService"
* remove seaweedfs experiment
* Make mypy happy (#62)
* Refactor summary index to isolate the logic (#63)
* Refactor summary index to isolate the logic
* fix tests
* handle race condition
* handle mypy
* ignore errors if the directory doesn't exist

---------

Co-authored-by: jwatson <jkwatson@gmail.com>

* image
* Update catalog entry to match the official one (#66)
* Update local catalog with official info
* add the git-ref back
* add the html long description (#67)
* Shuffle API for data sources for easier human consumption (#68)
* Shuffle API for data sources for easier human consumption
* make mypy happy
* remove prints
* wip o fs rag file uploader
* "now we're thinking with overtime"
* Revert ""now we're thinking with overtime"" This reverts commit 3c93206.
* get the databases directory from the environment (in local_dev) python file storage abstraction python tests currently broken real AMP startup script needs new env var
* add a todo
* merge from main
* properly override the configuration in pytest configure to point at a temp directory
* get the tests passing with filesystem file handoff
* update project metadata to support new local filesystem storage
* Update release version to dev-testing
* fix java
* cleanup after switching tests to use the local filesystem
* Remove unused settings (#70)
* remove unused dep
* fix circular dep and refactor doc storage
* Update release version to 1.4.0
* Summarize the data store on every document summarization (#69)
* fix bug with s3 path when the prefix is not provided (#72)
* add --reload to the fastapi startup_app script
* Avoid global variables and use ephemeral folder for tests (#71)
* Avoid global variables and use ephemeral folder for tests
* fix with merge to main
* Remove print
* lint
* refetch knowledge base summary on doc summary change
* Bump @eslint/plugin-kit (#16) Bumps the npm_and_yarn group with 1 update in the /ui directory: [@eslint/plugin-kit](https://github.com/eslint/rewrite). Updates `@eslint/plugin-kit` from 0.2.2 to 0.2.3
  - [Release notes](https://github.com/eslint/rewrite/releases)
  - [Changelog](https://github.com/eslint/rewrite/blob/main/release-please-config.json)
  - [Commits](eslint/rewrite@plugin-kit-v0.2.2...plugin-kit-v0.2.3)

  ---
  updated-dependencies:
  - dependency-name: "@eslint/plugin-kit"
    dependency-type: indirect
    dependency-group: npm_and_yarn
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jwatson <jkwatson@gmail.com>
Co-authored-by: Michael Liu <mliu@cloudera.com>
Co-authored-by: actions-user <actions@github.com>
Co-authored-by: conradocloudera <csilvamiranda@cloudera.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Summarize the data store on every document summarization

ec51a35

conradocloudera requested review from jkwatson, ewilliams-cloudera and mliu-cloudera December 11, 2024 21:15

Merge branch 'main' into cm/data_source_index

5707fd5

ewilliams-cloudera approved these changes Dec 11, 2024

conradocloudera merged commit 2a29fdd into main Dec 11, 2024
3 checks passed

conradocloudera deleted the cm/data_source_index branch December 11, 2024 23:00
