Scalable data ingestion architecture with microservices #53

Open
teocns opened this issue Feb 20, 2024 · 11 comments
Labels: brainstorming, documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@teocns

teocns commented Feb 20, 2024

While building the Docker image, I noticed that it weighed in at 7GB and that the build took 5 minutes to complete.

REPOSITORY      TAG      IMAGE ID       CREATED      SIZE
super-rag-api   latest   1c9aa42e2450   2 days ago   7.09GB

It turns out PyTorch is responsible: torch[gpu] pulls in its massive NVIDIA driver libraries, while its sibling torch[cpu] bottoms out at 56 MB.

pip list sorted by size
$ pip list   | tail -n +3   | awk '{print $1}'   | xargs pip show   | grep -E 'Location:|Name:'   | cut -d ' ' -f 2   | paste -d ' ' - -   | awk '{print $2 "/" tolower($1)}'   | xargs du -sh 2> /dev/null   | sort -hr

1.5G	/app/.venv/lib/python3.11/site-packages/torch
419M	/app/.venv/lib/python3.11/site-packages/triton
72M	/app/.venv/lib/python3.11/site-packages/cmake
62M	/app/.venv/lib/python3.11/site-packages/onnx
46M	/app/.venv/lib/python3.11/site-packages/pandas
40M	/app/.venv/lib/python3.11/site-packages/transformers
33M	/app/.venv/lib/python3.11/site-packages/matplotlib
27M	/app/.venv/lib/python3.11/site-packages/sympy
23M	/app/.venv/lib/python3.11/site-packages/layoutparser
21M	/app/.venv/lib/python3.11/site-packages/lxml
17M	/app/.venv/lib/python3.11/site-packages/onnxruntime
15M	/app/.venv/lib/python3.11/site-packages/pip
14M	/app/.venv/lib/python3.11/site-packages/rapidfuzz
14M	/app/.venv/lib/python3.11/site-packages/cryptography
13M	/app/.venv/lib/python3.11/site-packages/sqlalchemy
13M	/app/.venv/lib/python3.11/site-packages/debugpy
12M	/app/.venv/lib/python3.11/site-packages/tokenizers
12M	/app/.venv/lib/python3.11/site-packages/fastavro
11M	/app/.venv/lib/python3.11/site-packages/torchvision
11M	/app/.venv/lib/python3.11/site-packages/jedi
7.6M	/app/.venv/lib/python3.11/site-packages/timm
7.1M	/app/.venv/lib/python3.11/site-packages/networkx
6.5M	/app/.venv/lib/python3.11/site-packages/unstructured
6.5M	/app/.venv/lib/python3.11/site-packages/nltk
6.4M	/app/.venv/lib/python3.11/site-packages/tiktoken
5.5M	/app/.venv/lib/python3.11/site-packages/pydantic_core
5.3M	/app/.venv/lib/python3.11/site-packages/kiwisolver
4.9M	/app/.venv/lib/python3.11/site-packages/safetensors
4.8M	/app/.venv/lib/python3.11/site-packages/pygments
4.8M	/app/.venv/lib/python3.11/site-packages/aiohttp
3.1M	/app/.venv/lib/python3.11/site-packages/emoji
2.9M	/app/.venv/lib/python3.11/site-packages/regex
2.7M	/app/.venv/lib/python3.11/site-packages/tzdata
2.7M	/app/.venv/lib/python3.11/site-packages/pytz
2.7M	/app/.venv/lib/python3.11/site-packages/pikepdf
2.6M	/app/.venv/lib/python3.11/site-packages/setuptools
2.4M	/app/.venv/lib/python3.11/site-packages/langdetect
2.2M	/app/.venv/lib/python3.11/site-packages/greenlet
2.1M	/app/.venv/lib/python3.11/site-packages/mpmath
1.8M	/app/.venv/lib/python3.11/site-packages/tornado
1.7M	/app/.venv/lib/python3.11/site-packages/pydantic
1.6M	/app/.venv/lib/python3.11/site-packages/pycocotools
1.5M	/app/.venv/lib/python3.11/site-packages/openai
1.4M	/app/.venv/lib/python3.11/site-packages/openpyxl
1.3M	/app/.venv/lib/python3.11/site-packages/pypdf
1.3M	/app/.venv/lib/python3.11/site-packages/joblib
1.3M	/app/.venv/lib/python3.11/site-packages/authlib
1.2M	/app/.venv/lib/python3.11/site-packages/chardet
1.1M	/app/.venv/lib/python3.11/site-packages/yarl
1.1M	/app/.venv/lib/python3.11/site-packages/psutil
984K	/app/.venv/lib/python3.11/site-packages/contourpy
916K	/app/.venv/lib/python3.11/site-packages/frozenlist
812K	/app/.venv/lib/python3.11/site-packages/xlsxwriter
760K	/app/.venv/lib/python3.11/site-packages/fastapi
724K	/app/.venv/lib/python3.11/site-packages/fsspec
660K	/app/.venv/lib/python3.11/site-packages/black
640K	/app/.venv/lib/python3.11/site-packages/pycparser
528K	/app/.venv/lib/python3.11/site-packages/jinja2
488K	/app/.venv/lib/python3.11/site-packages/wcwidth
488K	/app/.venv/lib/python3.11/site-packages/effdet
464K	/app/.venv/lib/python3.11/site-packages/urllib3
452K	/app/.venv/lib/python3.11/site-packages/multidict
452K	/app/.venv/lib/python3.11/site-packages/ipykernel
436K	/app/.venv/lib/python3.11/site-packages/jupyter_client
412K	/app/.venv/lib/python3.11/site-packages/pyparsing
412K	/app/.venv/lib/python3.11/site-packages/cffi
412K	/app/.venv/lib/python3.11/site-packages/anyio
396K	/app/.venv/lib/python3.11/site-packages/olefile
388K	/app/.venv/lib/python3.11/site-packages/omegaconf
384K	/app/.venv/lib/python3.11/site-packages/xlrd
376K	/app/.venv/lib/python3.11/site-packages/parso
372K	/app/.venv/lib/python3.11/site-packages/markdown
368K	/app/.venv/lib/python3.11/site-packages/traitlets
364K	/app/.venv/lib/python3.11/site-packages/click
344K	/app/.venv/lib/python3.11/site-packages/httpx
336K	/app/.venv/lib/python3.11/site-packages/pyflakes
328K	/app/.venv/lib/python3.11/site-packages/starlette
328K	/app/.venv/lib/python3.11/site-packages/httpcore
312K	/app/.venv/lib/python3.11/site-packages/humanfriendly
308K	/app/.venv/lib/python3.11/site-packages/certifi
292K	/app/.venv/lib/python3.11/site-packages/uvicorn
288K	/app/.venv/lib/python3.11/site-packages/idna
280K	/app/.venv/lib/python3.11/site-packages/wrapt
252K	/app/.venv/lib/python3.11/site-packages/flake8
252K	/app/.venv/lib/python3.11/site-packages/cassio
248K	/app/.venv/lib/python3.11/site-packages/tqdm
248K	/app/.venv/lib/python3.11/site-packages/pypdfium2
248K	/app/.venv/lib/python3.11/site-packages/h11
244K	/app/.venv/lib/python3.11/site-packages/h2
244K	/app/.venv/lib/python3.11/site-packages/cohere
232K	/app/.venv/lib/python3.11/site-packages/hpack
224K	/app/.venv/lib/python3.11/site-packages/pipdeptree
224K	/app/.venv/lib/python3.11/site-packages/pexpect
220K	/app/.venv/lib/python3.11/site-packages/requests
204K	/app/.venv/lib/python3.11/site-packages/marshmallow
184K	/app/.venv/lib/python3.11/site-packages/pdfplumber
184K	/app/.venv/lib/python3.11/site-packages/packaging
156K	/app/.venv/lib/python3.11/site-packages/astrapy
152K	/app/.venv/lib/python3.11/site-packages/coloredlogs
148K	/app/.venv/lib/python3.11/site-packages/soupsieve
148K	/app/.venv/lib/python3.11/site-packages/iopath
116K	/app/.venv/lib/python3.11/site-packages/vulture
112K	/app/.venv/lib/python3.11/site-packages/flatbuffers
108K	/app/.venv/lib/python3.11/site-packages/validators
108K	/app/.venv/lib/python3.11/site-packages/jupyter_core
108K	/app/.venv/lib/python3.11/site-packages/filetype
104K	/app/.venv/lib/python3.11/site-packages/tabulate
96K	/app/.venv/lib/python3.11/site-packages/pillow_heif
96K	/app/.venv/lib/python3.11/site-packages/pathspec
96K	/app/.venv/lib/python3.11/site-packages/dirtyjson
88K	/app/.venv/lib/python3.11/site-packages/platformdirs
88K	/app/.venv/lib/python3.11/site-packages/markupsafe
88K	/app/.venv/lib/python3.11/site-packages/executing
84K	/app/.venv/lib/python3.11/site-packages/asttokens
76K	/app/.venv/lib/python3.11/site-packages/tenacity
68K	/app/.venv/lib/python3.11/site-packages/toml
68K	/app/.venv/lib/python3.11/site-packages/geomet
64K	/app/.venv/lib/python3.11/site-packages/distro
60K	/app/.venv/lib/python3.11/site-packages/pypandoc
60K	/app/.venv/lib/python3.11/site-packages/portalocker
56K	/app/.venv/lib/python3.11/site-packages/backoff
48K	/app/.venv/lib/python3.11/site-packages/ptyprocess
48K	/app/.venv/lib/python3.11/site-packages/pdf2image
48K	/app/.venv/lib/python3.11/site-packages/hyperframe
44K	/app/.venv/lib/python3.11/site-packages/filelock
32K	/app/.venv/lib/python3.11/site-packages/deprecated
32K	/app/.venv/lib/python3.11/site-packages/attrs
24K	/app/.venv/lib/python3.11/site-packages/zipp
24K	/app/.venv/lib/python3.11/site-packages/termcolor
24K	/app/.venv/lib/python3.11/site-packages/sniffio
24K	/app/.venv/lib/python3.11/site-packages/pytesseract
24K	/app/.venv/lib/python3.11/site-packages/cycler
24K	/app/.venv/lib/python3.11/site-packages/colorlog
20K	/app/.venv/lib/python3.11/site-packages/comm
12K	/app/.venv/lib/python3.11/site-packages/docx2txt
12K	/app/.venv/lib/python3.11/site-packages/aiosignal
8.0K	/app/.venv/lib/python3.11/site-packages/ruff

I'm new to Poetry, and I've been through several community discussions covering the same kind of issue:

While the solution could've been as simple as adding extras = [ "cpu" ] or setting the /cpu branch as torch wheels source URL, that wasn't possible.

PyTorch does not implement a specific PEP-standard protocol that consumers (such as Poetry) look for when enumerating the wheels in a package index.

The last resort is to bundle a list of .whl URLs, passed as PyTorch's dependency source, built by combining every value of:

Architecture   Python Version   Platform
arm64          3.9              darwin
x86            3.10             windows
aarch64        3.11             linux

I stuck to the Python range 3.9 <> 3.12, as it was initially defined in pyproject.toml.
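
To give an idea of why the list grows so quickly, here is a minimal sketch of the enumeration, assuming every value in one column is combined with every value in the others. The constants are copied from the table; the function name is hypothetical, and no real wheel URLs are built here because not every combination necessarily has a published wheel.

```python
from itertools import product

# Values taken from the table above.
ARCHITECTURES = ["arm64", "x86", "aarch64"]
PYTHON_VERSIONS = ["3.9", "3.10", "3.11"]
PLATFORMS = ["darwin", "windows", "linux"]


def wheel_triplets():
    """Return every (architecture, python_version, platform) combination."""
    return list(product(ARCHITECTURES, PYTHON_VERSIONS, PLATFORMS))


if __name__ == "__main__":
    triplets = wheel_triplets()
    print(len(triplets))  # 3 * 3 * 3 combinations
    for arch, py, platform in triplets[:3]:
        print(arch, py, platform)
```

Each triplet would need its own wheel URL in the dependency source, which is where the long list comes from.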

The list got quite long, and the build took longer (3x compared to the original).

The XY problem?

Super-rag's vision is a highly available and scalable API backed by workers, which points to a microservice-oriented architecture.

Torch is a heavy-lifting CPU/GPU-bound toolkit meant to be decoupled from the IO-bound API. It is one use case, and as the project grows, other "strategies" will probably be implemented, each with its own use cases.

It is essential for workers' images to be minimal. The image size directly impacts launch-time [availability]: you want a worker's image to be pulled, loaded into memory, and started as quickly as possible.
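
As a rough back-of-the-envelope illustration of the pull cost alone, here is a small sketch. The 1 Gbit/s link speed is purely an assumption of mine, and the estimate ignores layer caching, decompression and registry throttling.

```python
def pull_seconds(image_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Estimate the transfer time of an image over a link of the given speed."""
    return image_bytes * 8 / bandwidth_bits_per_s


if __name__ == "__main__":
    # 7.09 GB image from the docker listing, assumed 1 Gbit/s link.
    print(f"{pull_seconds(7.09e9, 1e9):.1f} s")  # prints "56.7 s"
```

Under that assumption every cold start of a worker pays close to a minute before the container can even begin loading.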

Therefore, analyzing and understanding the use cases for dependencies helps identify common libraries or services that can be defined as reusable, granular image layers.

For now, my proposed solution is to have individual images (or layers) that only ship with what is strictly necessary for a given triplet (platform, architecture, python version).

Feel free to brainstorm with me on this subject; ideas are always welcome!

@teocns added the documentation and brainstorming labels Feb 20, 2024
@teocns assigned teocns and unassigned teocns Feb 20, 2024
@teocns added the enhancement label Feb 20, 2024
@homanp
Contributor

homanp commented Feb 20, 2024

> As for now, my proposed solution is to have individual images (or layers) that only ships with what is strictly necessary for a given triplet (platform, architecture, python version).

I agree with this. I'm not sure that bundling in the individual libs/packages, as we currently do (specifically for encoders), is the best route forward.

How would one accomplish the layering part?!

@teocns
Author

teocns commented Feb 21, 2024

> @homanp How would one accomplish the layering part?!

Good question!

A one-size-fits-all solution is rare; nevertheless, I find inspiration in the systems that drive modern technology (Elasticsearch, Netflix, Kubernetes).

To have a clearer idea I'd need to understand the expected super-rag workflow in detail (maybe a flow chart), but essentially we are looking at two fundamental design patterns:

a) Producer-Consumer

Traditional Queue-Broker/Exchange-Celery pub/sub as you know it

b) Control Plane - Data plane

The pipeline is entirely driven by the topic message [payload]. It is comparable to LangGraph's "Graph": nodes and edges that implement an upstream/downstream communication system.

From my experience, the traditional design (a) might not fit the kind of scalability modern technology demands (and super-rag might be such a case), unless you want a DevOps team shooting themselves in the foot as queues start overflowing overnight.
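
To make pattern (b) a bit more concrete, here is a minimal in-process sketch of a payload-driven pipeline, in which the message itself carries the remaining route and each node only knows how to do its own step. Everything here (the Message fields, the node names, the toy transformations) is hypothetical and only illustrates the idea; a real deployment would put a broker or topic between the nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    data: str
    route: List[str]                      # remaining node names, in order
    trace: List[str] = field(default_factory=list)  # nodes already visited


# Hypothetical nodes: each one transforms the payload data and nothing else.
NODES: Dict[str, Callable[[str], str]] = {
    "parse": lambda text: text.strip(),
    "chunk": lambda text: " | ".join(text.split(". ")),
    "embed": lambda text: f"<vector for {len(text)} chars>",
}


def run(message: Message) -> Message:
    """Forward the message from node to node until its route is exhausted."""
    while message.route:
        step = message.route.pop(0)
        message.data = NODES[step](message.data)
        message.trace.append(step)
    return message


if __name__ == "__main__":
    result = run(Message("  First sentence. Second sentence  ", ["parse", "chunk", "embed"]))
    print(result.trace, result.data)
```

Because the route lives in the payload, adding a new "strategy" means publishing messages with a different route, without changing the nodes that already exist.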

You may find detailed context in these articles:

@teocns changed the title from "Docker and microservices" to "Scalable data ingestion architecture with microservices" Feb 21, 2024
@teocns teocns pinned this issue Feb 21, 2024
@elisalimli
Copy link
Contributor

> As for now, my proposed solution is to have individual images (or layers) that only ships with what is strictly necessary for a given triplet (platform, architecture, python version).

@teocns I don't think we should worry about this, since we prefer horizontal scaling over serverless functions at the moment. So that's not a big deal, I guess.

@teocns
Author

teocns commented Feb 21, 2024

@elisalimli Are we looking at a monolithic multi-threaded worker application sharing the same process runtime?

@homanp
Contributor

homanp commented Feb 22, 2024

I added some optimisations in there now to allow for some concurrency without hitting rate limits.
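
The thread does not show the actual change, so the following is only a sketch of one common way to get bounded concurrency on a single event loop: an asyncio.Semaphore that caps the number of in-flight calls. The function names, the fake API call and the limit are all assumptions for illustration.

```python
import asyncio


async def bounded_gather(factories, limit):
    """Run coroutine factories concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))


async def demo(limit, n):
    """Call a fake rate-limited API n times and report the peak concurrency."""
    state = {"active": 0, "peak": 0}

    async def fake_api_call(i):
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for network latency
        state["active"] -= 1
        return i * 2

    results = await bounded_gather(
        [lambda i=i: fake_api_call(i) for i in range(n)], limit
    )
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(demo(limit=2, n=6))
    print(results, "peak concurrency:", peak)
```

A semaphore only caps concurrency; a provider that limits requests per minute would additionally need some form of pacing or retry with backoff.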

@teocns
Author

teocns commented Feb 22, 2024

@homanp perfect solution for not blocking I/O, though let's keep in mind that it operates on one CPU core.

Depending on how we plan to deploy this in the future, we might want to use multiprocessing pools and let the user specify a parallelism factor n, falling back to the host's number of CPUs.
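
A minimal sketch of what I mean, using the standard multiprocessing module. The function names and the toy workload are hypothetical; the only point is the fallback from a user-supplied factor to the host's CPU count.

```python
import multiprocessing
import os
from typing import Iterable, List, Optional


def resolve_parallelism(requested: Optional[int] = None) -> int:
    """Use the requested factor if it is positive, else the host CPU count."""
    if requested is not None and requested > 0:
        return requested
    return os.cpu_count() or 1  # cpu_count() may return None


def cpu_bound(n: int) -> int:
    """Toy CPU-bound task standing in for real work such as encoding."""
    return sum(i * i for i in range(n))


def run_parallel(inputs: Iterable[int], parallelism: Optional[int] = None) -> List[int]:
    """Spread the inputs over a pool of worker processes."""
    with multiprocessing.Pool(processes=resolve_parallelism(parallelism)) as pool:
        return pool.map(cpu_bound, list(inputs))


if __name__ == "__main__":
    print(run_parallel([10_000, 20_000, 30_000], parallelism=2))
```

Unlike the asyncio approach, each worker here is a separate process, so CPU-bound work is spread over several cores.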

@homanp
Contributor

homanp commented Feb 22, 2024

> @homanp perfect solution for not blocking I/O, though let's keep in mind that it operates on one CPU core.
>
> Based on how we think of deploying this in the future, we might want to use multiprocessing pools and let the user specify n parallelism factor or fall back to the host number of CPUs

Makes sense

@homanp
Contributor

homanp commented Feb 22, 2024

I see two ways forward:

  1. Keep using lightweight SDK wrappers around the unstructured library. This would require users to spin up an instance of unstructured on their own and pass in the config.

  2. Decouple unstructured as a micro-service, similar to @teocns's idea.

Not sure which approach is best atm.

@elisalimli
Contributor

> @elisalimli Are we looking at a monolithic multi-threaded worker application sharing the same process runtime?

For the moment, yes.

@homanp
Contributor

homanp commented Feb 23, 2024

I have now decoupled the unstructured package and am only utilising the client SDK. This makes the API much more lightweight but also gives the user the option to run it locally.

@homanp homanp unpinned this issue Mar 2, 2024
@elisalimli elisalimli mentioned this issue Mar 11, 2024
@elisalimli
Contributor

@teocns we have implemented this in #91
