add character text splitter #6

Merged
merged 43 commits into from
Jan 31, 2024
Commits
be5ec7b
add character text splitter
AlisoSouza Nov 10, 2023
192cb49
add embbeddings
AlisoSouza Nov 10, 2023
11c0209
temp code
AlisoSouza Nov 16, 2023
78fdc2d
add celery to docker-compose
AlisoSouza Dec 1, 2023
ea12e1e
add celery and redis dependencies
AlisoSouza Dec 1, 2023
b03e32e
fix root_validator deprecated warning
AlisoSouza Dec 1, 2023
719acdc
add S3 file downloader
AlisoSouza Dec 1, 2023
f82c1d8
add celery and content bases api
AlisoSouza Dec 1, 2023
0d73358
add IndexerFileManager
AlisoSouza Dec 12, 2023
0b2bf89
add content base search endpoint
AlisoSouza Dec 14, 2023
2b56939
add token verification to content bases handler
AlisoSouza Dec 15, 2023
36f064a
Add: NexusRESTClient; call nexus endpoint after indexing document
AlisoSouza Dec 19, 2023
f15a4bb
fix: NexusRESTClient circular import
AlisoSouza Dec 27, 2023
ac87404
add: delete content base file endpoint
AlisoSouza Dec 27, 2023
43fe32d
add index_file_url
AlisoSouza Jan 10, 2024
a5db3e8
add a line at the end of files
AlisoSouza Jan 10, 2024
56b2099
return full page at content base search, add PDFLoader and DataLoader…
AlisoSouza Jan 12, 2024
8bdbf8e
add txt and docx class loaders
AlisoSouza Jan 18, 2024
288d176
ajust txt loader to save file temp
AlisoSouza Jan 19, 2024
2a24bc7
send file type in the request
AlisoSouza Jan 22, 2024
4b019c5
send text of file in file_type
AlisoSouza Jan 22, 2024
b4953ae
fix INDEX_CONTENTBASES_NAME env var
AlisoSouza Jan 25, 2024
adc476b
add file_uuid to metadata
AlisoSouza Jan 29, 2024
5f7cc9c
update elasticsearch vectors, search by file_uuid
AlisoSouza Jan 29, 2024
6d3ab64
delete file by uuid
AlisoSouza Jan 29, 2024
257cdf5
index as environment variable
AlisoSouza Jan 30, 2024
f980c53
change docx loader
AlisoSouza Jan 30, 2024
6683157
add: xlsx and xls support
AlisoSouza Jan 30, 2024
b575f23
xlsx: save temp file
AlisoSouza Jan 30, 2024
c860aa4
Merge pull request #24 from weni-ai/fix/xlsx
AlisoSouza Jan 31, 2024
14283b3
Merge pull request #23 from weni-ai/fix/DocxLoader
AlisoSouza Jan 31, 2024
b5b622c
Merge pull request #22 from weni-ai/feature/delete-file-by-uuid
AlisoSouza Jan 31, 2024
85c4dec
Merge pull request #21 from weni-ai/feature/add_file_uuid_metadata
AlisoSouza Jan 31, 2024
72c1a56
Merge pull request #20 from weni-ai/fix/index_succedded
AlisoSouza Jan 31, 2024
a21e2a1
Merge pull request #19 from weni-ai/feature/document-loader-cls
AlisoSouza Jan 31, 2024
2d2526f
Merge pull request #18 from weni-ai/feature/full_page_index
AlisoSouza Jan 31, 2024
9c6d328
Merge pull request #17 from weni-ai/feature/load-file-url
AlisoSouza Jan 31, 2024
7b8a6ba
Merge pull request #14 from weni-ai/feature/delete-content-base
AlisoSouza Jan 31, 2024
c7501af
Merge pull request #13 from weni-ai/feature/nexus-rest
AlisoSouza Jan 31, 2024
0c939f3
Merge pull request #12 from weni-ai/feature/token-verification
AlisoSouza Jan 31, 2024
4113913
Merge pull request #11 from weni-ai/feature/search-endpoint
AlisoSouza Jan 31, 2024
a18c561
Merge pull request #9 from weni-ai/feature/content-base-api
AlisoSouza Jan 31, 2024
7d7f75e
Merge branch 'main' into feature/text-splitter
AlisoSouza Jan 31, 2024
16 changes: 16 additions & 0 deletions app/tests/test_text_splitter.py
@@ -0,0 +1,16 @@
import unittest
from app.text_splitters.text_splitters import (
    TextSplitter, character_text_splitter
)
from lorem_text import lorem


class TestTextSplitter(unittest.TestCase):
    def setUp(self):
        self.text = lorem.paragraphs(5)

    def test_character_text_splitter(self):
        splitter = TextSplitter(character_text_splitter, self.text)
        chunks = splitter.split_text()
        self.assertIsInstance(chunks, list)
        self.assertGreaterEqual(len(chunks), 1)  # at least one chunk
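The test in this file exercises the full LangChain path. As a complement, here is a minimal sketch of a test that checks only the delegation done by `TextSplitter`, using a stub in place of `character_text_splitter`. The `stub_splitter` function and the test class name are hypothetical and not part of this PR; the `TextSplitter` body is repeated from `text_splitters.py` so the snippet runs on its own.

```python
import unittest
from typing import Callable, List


class TextSplitter:
    """Wrapper from this PR, repeated here to keep the sketch self-contained."""

    def __init__(self, splitter: Callable, content: str) -> None:
        self.splitter = splitter
        self.content = content

    def split_text(self) -> List[str]:
        return self.splitter(self.content)


def stub_splitter(content: str) -> List[str]:
    """Hypothetical stand-in splitter: one chunk per non-empty line."""
    return [line for line in content.split("\n") if line]


class TestTextSplitterDelegation(unittest.TestCase):
    def test_split_text_delegates_to_splitter(self):
        splitter = TextSplitter(stub_splitter, "first\n\nsecond\nthird")
        # The wrapper should return exactly what the injected callable returns.
        self.assertEqual(splitter.split_text(), ["first", "second", "third"])
```

Because the splitter is injected as a callable, a test like this needs neither `langchain` nor `lorem_text` installed.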
Empty file added app/text_splitters/__init__.py
Empty file.
34 changes: 34 additions & 0 deletions app/text_splitters/text_splitters.py
@@ -0,0 +1,34 @@
from typing import Callable, List
from langchain.text_splitter import CharacterTextSplitter
from app.util import count_words
import os

DEFAULT_CHUNK_SIZE = int(os.environ.get("DEFAULT_CHUNK_SIZE", 75))  # env values are str
DEFAULT_CHUNK_OVERLAP = int(os.environ.get("DEFAULT_CHUNK_OVERLAP", 75))
DEFAULT_SEPARATOR = os.environ.get("DEFAULT_SEPARATOR", "\n")


class TextSplitter:
    def __init__(self, splitter: Callable, content: str) -> None:
        self.splitter = splitter
        self.content = content

    def split_text(self) -> List[str]:
        return self.splitter(self.content)


def character_text_splitter(

This function is outside the TextSplitter class; should it really be outside?

    content: str,
    chunk_size: int = DEFAULT_CHUNK_SIZE,
    chunk_overlap: int = DEFAULT_CHUNK_OVERLAP,
    length_function: Callable = count_words,
    separator: str = DEFAULT_SEPARATOR) -> List:

    text_splitter = CharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=length_function,
        separator=separator,
    )
    pages = text_splitter.split_text(content)
    return pages
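A note on the module-level defaults in this file: `os.environ.get` returns a string whenever the variable is set, and only returns the fallback (here an `int`) when it is missing. The sketch below shows the difference and one way to normalize it. `env_int` is a hypothetical helper, not part of this PR.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment (hypothetical helper).

    Falls back to `default` when the variable is not set.
    """
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


# Simulate a deployment that overrides the chunk size.
os.environ["DEFAULT_CHUNK_SIZE"] = "100"

raw_value = os.environ.get("DEFAULT_CHUNK_SIZE", 75)
print(type(raw_value).__name__)           # str, not int
print(env_int("DEFAULT_CHUNK_SIZE", 75))  # 100
```

Casting at read time keeps the `chunk_size: int` and `chunk_overlap: int` annotations truthful regardless of how the service is configured.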
4 changes: 4 additions & 0 deletions app/util.py
@@ -14,3 +14,7 @@ def transform_input(self, inputs: list[str], model_kwargs: dict) -> bytes:
    def transform_output(self, output: bytes) -> list[list[float]]:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["vectors"]


def count_words(string: str):
    return len(string.split())
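Since `count_words` is passed as the `length_function`, chunk sizes in this PR are measured in words rather than characters. The sketch below illustrates that idea with a simplified greedy merge. It is not LangChain's algorithm and has no overlap handling; `greedy_chunks` is a hypothetical function written only for this illustration.

```python
from typing import Callable, List


def count_words(string: str) -> int:
    # Same logic as app/util.py: split on any run of whitespace.
    return len(string.split())


def greedy_chunks(
    text: str,
    chunk_size: int,
    separator: str = "\n",
    length_function: Callable[[str], int] = count_words,
) -> List[str]:
    """Merge separator-delimited pieces until adding one would exceed chunk_size.

    Hypothetical, simplified illustration; no chunk overlap is applied.
    """
    chunks: List[str] = []
    current: List[str] = []
    for piece in (p for p in text.split(separator) if p.strip()):
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two three\nfour five\nsix seven eight nine"
print(count_words(text))  # 9
print(greedy_chunks(text, chunk_size=5))
```

With a limit of five words, the first two lines fit into one chunk and the third line starts a new one.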
16 changes: 15 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -29,6 +29,7 @@ black = "^23.9.1"
reportlab = "^4.0.7"
xlsxwriter = "^3.1.9"
flake8 = "^6.1.0"
lorem-text = "^2.1"

[build-system]
requires = ["poetry-core"]