add character text splitter #6
Merged
43 commits
be5ec7b add character text splitter (AlisoSouza)
192cb49 add embbeddings (AlisoSouza)
11c0209 temp code (AlisoSouza)
78fdc2d add celery to docker-compose (AlisoSouza)
ea12e1e add celery and redis dependencies (AlisoSouza)
b03e32e fix root_validator deprecated warning (AlisoSouza)
719acdc add S3 file downloader (AlisoSouza)
f82c1d8 add celery and content bases api (AlisoSouza)
0d73358 add IndexerFileManager (AlisoSouza)
0b2bf89 add content base search endpoint (AlisoSouza)
2b56939 add token verification to content bases handler (AlisoSouza)
36f064a Add: NexusRESTClient; call nexus endpoint after indexing document (AlisoSouza)
f15a4bb fix: NexusRESTClient circular import (AlisoSouza)
ac87404 add: delete content base file endpoint (AlisoSouza)
43fe32d add index_file_url (AlisoSouza)
a5db3e8 add a line at the end of files (AlisoSouza)
56b2099 return full page at content base search, add PDFLoader and DataLoader… (AlisoSouza)
8bdbf8e add txt and docx class loaders (AlisoSouza)
288d176 ajust txt loader to save file temp (AlisoSouza)
2a24bc7 send file type in the request (AlisoSouza)
4b019c5 send text of file in file_type (AlisoSouza)
b4953ae fix INDEX_CONTENTBASES_NAME env var (AlisoSouza)
adc476b add file_uuid to metadata (AlisoSouza)
5f7cc9c update elasticsearch vectors, search by file_uuid (AlisoSouza)
6d3ab64 delete file by uuid (AlisoSouza)
257cdf5 index as environment variable (AlisoSouza)
f980c53 change docx loader (AlisoSouza)
6683157 add: xlsx and xls support (AlisoSouza)
b575f23 xlsx: save temp file (AlisoSouza)
c860aa4 Merge pull request #24 from weni-ai/fix/xlsx (AlisoSouza)
14283b3 Merge pull request #23 from weni-ai/fix/DocxLoader (AlisoSouza)
b5b622c Merge pull request #22 from weni-ai/feature/delete-file-by-uuid (AlisoSouza)
85c4dec Merge pull request #21 from weni-ai/feature/add_file_uuid_metadata (AlisoSouza)
72c1a56 Merge pull request #20 from weni-ai/fix/index_succedded (AlisoSouza)
a21e2a1 Merge pull request #19 from weni-ai/feature/document-loader-cls (AlisoSouza)
2d2526f Merge pull request #18 from weni-ai/feature/full_page_index (AlisoSouza)
9c6d328 Merge pull request #17 from weni-ai/feature/load-file-url (AlisoSouza)
7b8a6ba Merge pull request #14 from weni-ai/feature/delete-content-base (AlisoSouza)
c7501af Merge pull request #13 from weni-ai/feature/nexus-rest (AlisoSouza)
0c939f3 Merge pull request #12 from weni-ai/feature/token-verification (AlisoSouza)
4113913 Merge pull request #11 from weni-ai/feature/search-endpoint (AlisoSouza)
a18c561 Merge pull request #9 from weni-ai/feature/content-base-api (AlisoSouza)
7d7f75e Merge branch 'main' into feature/text-splitter (AlisoSouza)
@@ -0,0 +1,16 @@
import unittest
from app.text_splitters.text_splitters import (
    TextSplitter, character_text_splitter
)
from lorem_text import lorem


class TestProductsHandler(unittest.TestCase):
    def setUp(self):
        self.text = lorem.paragraphs(5)

    def test_character_text_splitter(self):
        splitter = TextSplitter(character_text_splitter, self.text)
        chunks = splitter.split_text()
        self.assertIsInstance(chunks, list)
        self.assertGreaterEqual(len(chunks), 1)  # non-empty text yields at least one chunk
Empty file.
@@ -0,0 +1,34 @@
from typing import Callable, List
from langchain.text_splitter import CharacterTextSplitter
from app.util import count_words
import os

DEFAULT_CHUNK_SIZE = int(os.environ.get("DEFAULT_CHUNK_SIZE", 75))  # env values are strings
DEFAULT_CHUNK_OVERLAP = int(os.environ.get("DEFAULT_CHUNK_OVERLAP", 75))
DEFAULT_SEPARATOR = os.environ.get("DEFAULT_SEPARATOR", "\n")


class TextSplitter:
    def __init__(self, splitter: Callable, content: str) -> None:
        self.splitter = splitter
        self.content = content

    def split_text(self) -> List[str]:
        return self.splitter(self.content)


def character_text_splitter(
        content: str,
        chunk_size: int = DEFAULT_CHUNK_SIZE,
        chunk_overlap: int = DEFAULT_CHUNK_OVERLAP,
        length_function: Callable = count_words,
        separator: str = DEFAULT_SEPARATOR) -> List[str]:

    text_splitter = CharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=length_function,
        separator=separator,
    )
    pages = text_splitter.split_text(content)
    return pages
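
Since the diff measures chunks with count_words rather than characters, a rough sketch of what "at most chunk_size words per chunk" means may help when reviewing the defaults. This is not LangChain's algorithm: split_by_words is a hypothetical helper, count_words here is an assumed stand-in for app.util.count_words (whose implementation is not shown in this PR), and overlap is left out on purpose.

```python
from typing import List


def count_words(text: str) -> int:
    """Assumed stand-in for app.util.count_words: whitespace-delimited word count."""
    return len(text.split())


def split_by_words(content: str, chunk_size: int = 75, separator: str = "\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size words.

    A single piece longer than chunk_size is kept whole. Chunk overlap is not modelled.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for piece in content.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        n = count_words(piece)
        # Close the current chunk if adding this piece would exceed the word budget.
        if current and current_words + n > chunk_size:
            chunks.append(separator.join(current))
            current, current_words = [], 0
        current.append(piece)
        current_words += n
    if current:
        chunks.append(separator.join(current))
    return chunks


example = split_by_words("a b c\nd e\nf g h i\nj", chunk_size=5)
```

With a budget of five words, the first two lines fit in one chunk and the last two in another, so the size limit counts words across lines, not characters.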
This function is defined outside the TextSplitter class. Should it really be outside?
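
To make the trade-off behind this question concrete, here is a minimal sketch of both shapes. It assumes a toy newline_splitter in place of character_text_splitter so it runs without LangChain; newline_splitter and NewlineTextSplitter are hypothetical names that are not part of this PR.

```python
from typing import Callable, List


def newline_splitter(content: str) -> List[str]:
    """Hypothetical stand-in for character_text_splitter: split on newlines."""
    return [line for line in content.split("\n") if line]


class TextSplitter:
    """Option 1 (the shape used in this PR): the strategy is injected as a callable."""

    def __init__(self, splitter: Callable[[str], List[str]], content: str) -> None:
        self.splitter = splitter
        self.content = content

    def split_text(self) -> List[str]:
        return self.splitter(self.content)


class NewlineTextSplitter:
    """Option 2: the splitting logic lives inside the class as a method."""

    def __init__(self, content: str) -> None:
        self.content = content

    def split_text(self) -> List[str]:
        return [line for line in self.content.split("\n") if line]


text = "first\nsecond\n\nthird"
injected = TextSplitter(newline_splitter, text).split_text()
method_based = NewlineTextSplitter(text).split_text()
```

Both produce the same chunks. Keeping the function outside lets new strategies be added without touching TextSplitter, while moving it inside keeps each strategy and its parameters in one place.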