Ray based document parsing of more file types #94

Merged
merged 8 commits into from
Apr 20, 2023

Conversation

ellisonbg
Contributor

This is a general improvement to the file indexing capabilities:

  • Use Ray to parallelize the loading and splitting of documents. The indexer can now read the entire JupyterLab source tree in 800 ms.
  • Better error-handling logic in the indexer.
  • More file types supported.
  • Added exclude patterns for files and directories that should not be read (.ipynb_checkpoints, node_modules, build, lib).
  • File-extension-specific text splitting, including for notebooks.
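
For readers who want a concrete picture of the load-and-split flow in the list above, here is a rough sketch. It is not the code from this PR: it uses the standard library's `ThreadPoolExecutor` as a stand-in for Ray tasks, and every name in it (`split_text`, `split_notebook`, `SPLITTERS`, `load_and_split`, `index_files`) is hypothetical. The fixed-size character splitter is also an assumption; the PR does not say which splitters it uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def split_text(text: str, chunk_size: int = 200) -> List[str]:
    """Fallback splitter (assumed): fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_notebook(text: str) -> List[str]:
    """Notebook splitter (assumed): one chunk per non-empty cell source."""
    data = json.loads(text)
    cells = data.get("cells", []) if isinstance(data, dict) else []
    chunks = []
    for cell in cells:
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source)
    return chunks


# Hypothetical dispatch table: extensions without an entry use split_text.
SPLITTERS: Dict[str, Callable[[str], List[str]]] = {".ipynb": split_notebook}


def load_and_split(path: Path) -> List[str]:
    """Read one file and split it; a bad file yields no chunks, not a crash."""
    try:
        text = path.read_text(encoding="utf-8")
        splitter = SPLITTERS.get(path.suffix, split_text)
        return splitter(text)
    except (OSError, UnicodeDecodeError, json.JSONDecodeError):
        return []


def index_files(paths: List[Path], max_workers: int = 4) -> List[str]:
    """Process files in parallel and flatten the chunks in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(load_and_split, paths)
        return [chunk for chunks in results for chunk in chunks]
```

With Ray, `load_and_split` would presumably become a remote task and the results would be gathered once all tasks finish; the per-file error handling is what keeps a single unreadable file from failing the whole run.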

ellisonbg changed the title from "Ray based document parsing of more file types" to "[WIP] Ray based document parsing of more file types" on Apr 19, 2023
3coins added the "enhancement" (New feature or request) label on Apr 19, 2023
@3coins
Collaborator

3coins commented Apr 19, 2023

@ellisonbg
Amazing work!

ellisonbg changed the title from "[WIP] Ray based document parsing of more file types" to "Ray based document parsing of more file types" on Apr 19, 2023
@ellisonbg
Contributor Author

OK, this is ready for review.

packages/jupyter-ai/jupyter_ai/actors/learn.py
loader_kwargs: Optional[Dict] = None,
recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
Collaborator

Should we add .tsx and .txt here?

Contributor Author

Yeah, and .jsx as well.

recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
exclude={'.ipynb_checkpoints', 'node_modules', 'lib', 'build'}
Collaborator

Might be good to ignore .git and .DS_Store as well

Contributor Author

Probably
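
To make the two parameters under review concrete, here is a minimal sketch of how an `extensions` allowlist and an `exclude` set might be applied while walking a directory. The function name `iter_source_files` is hypothetical, and the sets below simply add the entries suggested in this thread (`.tsx`, `.txt`, `.jsx`, `.git`, `.DS_Store`) to the defaults shown in the diff; this is not the implementation in `learn.py`.

```python
import os
from pathlib import Path
from typing import FrozenSet, Iterator

# Defaults from the diff plus the additions suggested in review (assumed).
EXTENSIONS: FrozenSet[str] = frozenset({
    ".py", ".md", ".R", ".Rmd", ".jl", ".sh", ".ipynb",
    ".js", ".ts", ".jsx", ".tsx", ".txt",
})
EXCLUDE: FrozenSet[str] = frozenset({
    ".ipynb_checkpoints", "node_modules", "lib", "build", ".git", ".DS_Store",
})


def iter_source_files(
    root: str,
    extensions: FrozenSet[str] = EXTENSIONS,
    exclude: FrozenSet[str] = EXCLUDE,
) -> Iterator[Path]:
    """Yield files under root whose suffix is allowed, skipping excluded names."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in exclude)
        for name in sorted(filenames):
            if name in exclude:
                continue
            path = Path(dirpath) / name
            # Suffix comparison is case-sensitive, so ".R" and ".r" differ.
            if path.suffix in extensions:
                yield path
```

Pruning `dirnames` in place matters for speed as much as correctness: a tree like `node_modules` is never listed at all, rather than being walked and filtered afterwards.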

Member

Could we parse .gitignore and use that for our denylist by default? This can be addressed in a follow-up PR, since the problem of "over-indexing" is generally not immediately obvious to end users.

Contributor Author

Yep, let's explore this in a follow up PR.
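
As a starting point for that follow-up, a `.gitignore`-derived denylist could be sketched as below. This is an illustration only and deliberately handles a small subset of the gitignore format: it skips comments and `!` negations, ignores anchoring, and matches each pattern against individual path components, so patterns containing an inner `/` or `**` never match. The names `read_ignore_patterns` and `is_ignored` are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List


def read_ignore_patterns(gitignore_text: str) -> List[str]:
    """Extract simple name patterns from the text of a .gitignore file."""
    patterns = []
    for line in gitignore_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and negations (unsupported in this sketch).
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        # Drop leading/trailing slashes; anchoring is not modeled here.
        patterns.append(line.strip("/"))
    return patterns


def is_ignored(relative_path: str, patterns: List[str]) -> bool:
    """Return True if any component of a POSIX-style relative path matches."""
    parts = PurePosixPath(relative_path).parts
    return any(fnmatchcase(part, pattern) for part in parts for pattern in patterns)


patterns = read_ignore_patterns("# deps\nnode_modules/\n*.pyc\n")
print(is_ignored("node_modules/pkg/index.js", patterns))
print(is_ignored("src/app.py", patterns))
```

A real implementation would more likely rely on a library that implements the full gitignore matching rules than extend this by hand.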

dlqqq and others added 3 commits April 19, 2023 14:46
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

@ellisonbg ellisonbg dismissed dlqqq’s stale review April 20, 2023 04:33

All points addressed or will iterate in further PRs.

@ellisonbg ellisonbg merged commit e6a78ad into jupyterlab:main Apr 20, 2023
dbelgrod pushed a commit to dbelgrod/jupyter-ai that referenced this pull request Jun 10, 2024
* Ray based document parsing of more file types.

* Renaming to learn/ask to make it more human centered.

* Improvements to the learn/ask commands.

* fix typo

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

* improve grammar

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

* improve wording

Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>

* Adding new extensions and excludes.

* Update langchain to version 0.144.

---------

Co-authored-by: david qiu <david@qiu.dev>
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
Marchlak pushed a commit to Marchlak/jupyter-ai that referenced this pull request Oct 28, 2024
Labels
enhancement New feature or request @jupyter-ai/chatui
4 participants