Ray based document parsing of more file types #94

ellisonbg · 2023-04-19T05:07:39Z

This is a general improvement to the file indexing capabilities:

* Use Ray to parallelize the loading and splitting of documents. This can now read the entire JupyterLab source tree in 800 ms.
* Better error-handling logic in the indexer.
* More file types supported.
* Added exclude patterns for files and directories that should not be read (.ipynb_checkpoints, node_modules, build, lib).
* File-extension-specific text splitting, including notebooks.
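
To illustrate the last point, here is a minimal sketch of extension-specific splitting applied to many files in parallel. It is not the code from this PR: the PR uses Ray, while this sketch swaps in the standard library's `ThreadPoolExecutor` so it runs without extra dependencies. All function names (`split_fixed`, `split_markdown`, `split_notebook`, `load_and_split`, `split_all`) and the 1000-character chunk size are assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fixed(text, size=1000):
    """Fallback splitter: cut text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_markdown(text):
    """Split Markdown on blank lines so each paragraph is one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_notebook(text):
    """Split a notebook (JSON) into one chunk per non-empty cell."""
    chunks = []
    for cell in json.loads(text).get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        if source.strip():
            chunks.append(source)
    return chunks


# Hypothetical mapping from file extension to splitter.
SPLITTERS = {".md": split_markdown, ".ipynb": split_notebook}


def load_and_split(path):
    """Read one file and split it with the splitter for its extension."""
    path = Path(path)
    text = path.read_text(encoding="utf-8", errors="replace")
    return SPLITTERS.get(path.suffix, split_fixed)(text)


def split_all(paths, max_workers=4):
    """Load and split every file concurrently; returns {path: chunks}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(load_and_split, paths))
    return {str(p): chunks for p, chunks in zip(paths, results)}
```

With Ray, `load_and_split` would presumably become a remote task and the results would be gathered from object references instead of `pool.map`; the dispatch-by-extension idea stays the same.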

3coins · 2023-04-19T15:31:58Z

@ellisonbg
Amazing work!

ellisonbg · 2023-04-19T19:13:20Z

OK, this is ready for review.

JasonWeill · 2023-04-19T19:22:28Z

packages/jupyter-ai/jupyter_ai/document_loaders/directory.py

-        loader_kwargs: Optional[Dict] = None,
-        recursive: bool = False,
+        path,
+        extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},


Should we add .tsx and .txt here?

Yeah, and .jsx as well.
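
For readers following the thread, the sketch below shows one way an `extensions` allowlist and an `exclude` set like the ones under review could drive a directory walk. It is an illustration under assumptions, not the implementation in `directory.py`: the function name `iter_source_files` is hypothetical, and the two sets simply copy the values quoted in the diff and the PR description.

```python
import os
from pathlib import Path

# Values copied from the diff and description above; the names are illustrative.
EXTENSIONS = {".py", ".md", ".R", ".Rmd", ".jl", ".sh", ".ipynb", ".js", ".ts"}
EXCLUDE = {".ipynb_checkpoints", "node_modules", "lib", "build"}


def iter_source_files(root, extensions=EXTENSIONS, exclude=EXCLUDE):
    """Yield files under root whose suffix is allowed, skipping excluded names."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in exclude)
        for name in sorted(filenames):
            if name in exclude:
                continue
            if Path(name).suffix in extensions:
                yield Path(dirpath) / name
```

Adding `.tsx`, `.txt`, or `.jsx` as suggested above would then be a one-line change to the set passed as `extensions`.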

JasonWeill · 2023-04-19T19:23:02Z

packages/jupyter-ai/jupyter_ai/document_loaders/directory.py

-        recursive: bool = False,
+        path,
+        extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
+        exclude={'.ipynb_checkpoints', 'node_modules', 'lib', 'build'}


Might be good to ignore .git and .DS_Store as well

Could we parse .gitignore and use that for our denylist by default? This can be addressed in a follow-up PR, since the problem of "over-indexing" is generally not immediately obvious to end users.

Yep, let's explore this in a follow up PR.
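
As a rough idea of what that follow-up could look like, here is a deliberately simplified sketch that turns `.gitignore` text into a denylist. It is an assumption-laden illustration, not part of this PR: `parse_gitignore` and `is_ignored` are hypothetical names, negation rules (`!pattern`) are skipped rather than honored, and patterns containing a slash in the middle are not supported, so this does not reproduce full gitignore semantics.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def parse_gitignore(text):
    """Return simple glob patterns, dropping blanks, comments, and negations."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        # Drop anchoring and directory markers: "/build" and "build/" -> "build".
        patterns.append(line.strip("/"))
    return patterns


def is_ignored(rel_path, patterns):
    """True if any component of the POSIX-style relative path matches a pattern."""
    parts = PurePosixPath(rel_path).parts
    return any(fnmatch(part, pat) for part in parts for pat in patterns)
```

A production version would more likely rely on a library that implements the full gitignore rules than on component-wise `fnmatch`.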

dlqqq · 2023-04-19T22:56:37Z

packages/jupyter-ai/jupyter_ai/document_loaders/directory.py

-        recursive: bool = False,
+        path,
+        extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
+        exclude={'.ipynb_checkpoints', 'node_modules', 'lib', 'build'}


Could we parse .gitignore and use that for our denylist by default? This can be addressed in a follow-up PR, since the problem of "over-indexing" is generally not immediately obvious to end users.

packages/jupyter-ai/jupyter_ai/document_loaders/directory.py

All points addressed or will iterate in further PRs.

* Ray based document parsing of more file types.
* Renaming to learn/ask to make for human centered.
* Improvements to the learn/ask commands.
* fix typo
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* improve grammar
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* improve wording
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* Adding new extensions and excludes.
* Update langchain to version 0.144.

---------

Co-authored-by: david qiu <david@qiu.dev>
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

Ray based document parsing of more file types.

4f8a2a8

ellisonbg changed the title ~~Ray based document parsing of more file types~~ [WIP] Ray based document parsing of more file types Apr 19, 2023

Renaming to learn/ask to make for human centered.

26e2887

3coins added the "enhancement" label (New feature or request) Apr 19, 2023

Improvements to the learn/ask commands.

54b965a

ellisonbg added the "scope:chat-ux" label (Issues concerning the chat user experience) Apr 19, 2023

ellisonbg changed the title ~~[WIP] Ray based document parsing of more file types~~ Ray based document parsing of more file types Apr 19, 2023

JasonWeill reviewed Apr 19, 2023

packages/jupyter-ai/jupyter_ai/actors/learn.py (outdated, resolved)

JasonWeill reviewed Apr 19, 2023

dlqqq and others added 3 commits April 19, 2023 14:46

fix typo

4cb2257

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

improve grammar

b0956a6

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

improve wording

ad0a5d1

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

dlqqq requested changes Apr 19, 2023

packages/jupyter-ai/jupyter_ai/document_loaders/directory.py (resolved)

JasonWeill mentioned this pull request Apr 19, 2023

Document /generate, /ask, /learn commands #92

Closed

ellisonbg added 2 commits April 19, 2023 21:15

Adding new extensions and excludes.

316b47c

Update langchain to version 0.144.

8d75c79

ellisonbg merged commit e6a78ad into jupyterlab:main Apr 20, 2023

krassowski mentioned this pull request Jul 9, 2024

/learn has unused -v/--verbose option #885

Open
