Ray based document parsing of more file types #94
@ellisonbg
OK, this is ready for review.
loader_kwargs: Optional[Dict] = None,
recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
Should we add `.tsx` and `.txt` here?
Yeah, and `.jsx` as well.
recursive: bool = False,
path,
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
exclude={'.ipynb_checkpoints', 'node_modules', 'lib', 'build'}
Might be good to ignore `.git` and `.DS_Store` as well.
Probably
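For reference, here is a minimal sketch of how the `extensions` and `exclude` sets from the diff could drive a directory walk once the suggested additions (`.tsx`, `.txt`, `.jsx`, `.git`, `.DS_Store`) are included. The function name `iter_source_files` and the use of `os.walk` are assumptions for illustration only; the PR's actual traversal code is not shown in this thread and may differ.

```python
import os
from typing import Iterator, Set

# Sets from the diff plus the additions proposed in review (illustrative only).
EXTENSIONS = {'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb',
              '.js', '.ts', '.jsx', '.tsx', '.txt'}
EXCLUDE = {'.ipynb_checkpoints', 'node_modules', 'lib', 'build',
           '.git', '.DS_Store'}


def iter_source_files(path: str,
                      extensions: Set[str] = EXTENSIONS,
                      exclude: Set[str] = EXCLUDE) -> Iterator[str]:
    """Hypothetical helper: yield files under `path` with an allowed suffix."""
    for root, dirs, files in os.walk(path):
        # Prune excluded directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in exclude]
        for name in files:
            # Skip excluded file names such as .DS_Store.
            if name in exclude:
                continue
            if os.path.splitext(name)[1] in extensions:
                yield os.path.join(root, name)
```

Pruning `dirs` in place keeps the walk from ever entering large directories such as `node_modules` or `.git`, which is cheaper than filtering their files afterwards.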
Could we parse `.gitignore` and use that for our denylist by default? This can be addressed in a follow-up PR, since the problem of "over-indexing" is generally not immediately obvious to end users.
Yep, let's explore this in a follow-up PR.
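As a starting point for that follow-up, a rough sketch of deriving a denylist from `.gitignore` text might look like the following. This is an assumption-laden simplification, not the project's implementation: `read_gitignore_patterns` and `is_denied` are hypothetical names, negated (`!`) patterns are dropped, anchoring and directory-only rules are ignored, and patterns that contain an inner slash never match because each pattern is compared against single path components.

```python
import fnmatch
from pathlib import Path
from typing import List


def read_gitignore_patterns(text: str) -> List[str]:
    """Hypothetical helper: extract simple glob patterns from .gitignore text.

    Simplification: comments, blank lines, and negated ('!') patterns are
    dropped, and leading/trailing slashes are stripped.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or line.startswith('!'):
            continue
        pattern = line.strip('/')
        if pattern:
            patterns.append(pattern)
    return patterns


def is_denied(relative_path: str, patterns: List[str]) -> bool:
    """True if any single component of `relative_path` matches a pattern."""
    parts = Path(relative_path).parts
    return any(fnmatch.fnmatch(part, pat) for part in parts for pat in patterns)
```

A full implementation would need to follow the complete gitignore pattern rules (negation, anchoring, `**`, nested `.gitignore` files), so this sketch is likely to over-exclude or under-exclude in some repositories.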
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
All points addressed or will iterate in further PRs.
* Ray based document parsing of more file types.
* Renaming to learn/ask to make for human centered.
* Improvements to the learn/ask commands.
* fix typo
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* improve grammar
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* improve wording
  Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
* Adding new extensions and excludes.
* Update langchain to version 0.144.

---------

Co-authored-by: david qiu <david@qiu.dev>
Co-authored-by: Jason Weill <93281816+JasonWeill@users.noreply.github.com>
This is a general improvement to the file indexing capabilities: