spike (protocol-api): database and infra upgrades for scaling tweet index volume #321
Comments
One of the bullets is documented here: https://github.com/orgs/masa-finance/projects/7/views/1?pane=issue&itemId=87858771&issue=masa-finance%7Cmasa-bittensor%7C309
Deleted the previous "proposal": due to actual high load the DB filled up, so we had to increase the storage limit and also increase the DB instance size. Rethought the proposal and built a lambda that archives tweets to S3 in Parquet format (for querying via DuckDB). The remaining questions are whether we want to keep managing (or needing) Postgres long term, and how to deal smartly with what we already have in the immediate / short term. Link to the Miro board with the architectural diagram of the proposed tweet ingest, index, and archive pipeline:
@5u6r054 Let me again suggest DuckDB for analytics as opposed to Athena. If the data is properly stored in S3 (Parquet with Hive partitioning), DuckDB is the better V1 option given that it's free and can work across this stack.
@theMultitude: Valid suggestion. DuckDB is free vs. Athena's pay-per-query. I'll make an updated diagram swapping out Athena / Glue for DuckDB.
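To make the DuckDB option concrete, here is a minimal sketch of querying the archive straight from S3. The bucket name (`tweet-archive`), the date-style partition keys, and the region are illustrative assumptions, not an agreed layout.

```python
# Sketch only: bucket, prefix, partition keys, and region are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # assumed region; S3 credentials configured similarly

# Count archived tweets per day, reading hive-partitioned Parquet directly from S3.
rows = con.execute("""
    SELECT year, month, day, COUNT(*) AS tweets
    FROM read_parquet(
        's3://tweet-archive/tweets/year=*/month=*/day=*/*.parquet',
        hive_partitioning = true
    )
    GROUP BY year, month, day
    ORDER BY year, month, day
""").fetchall()
print(rows)
```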
Yeah, this knocks out the Athena and Glue buildout until we need it. We can talk about the best way to partition the data given its structure; I haven't thought about it yet. Also, @5u6r054, why not use something like Firehose initially?
@theMultitude we don't need Firehose yet. We receive tweets as POSTs of batches from the validators, so the limiting factor is the speed of those POSTs. The tweets go into a Postgres table that uses the tweet id (numeric and chronological) as the primary key, which ensures we are not storing duplicate tweets. We also store metadata about which validator sent which tweet in the tweets_metadata table.

From there we can query Postgres and archive tweets to S3, and at that stage we figure out the indexing / file-partitioning scheme optimized for ingestion and querying by DuckDB. Hive partitioning makes sense, and partitioning only on the tweet id, which again is chronological, makes sense to me.

But we have problems: we are adding about 20M rows per day right now, and the Postgres DB, due to some suboptimal aspects of its schema, is rapidly getting overwhelmed.
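For illustration, a hedged sketch of what the dedup-on-primary-key insert described above could look like; the table and column names (`tweets`, `tweets_metadata`, `tweet_id`, `validator_id`, `payload`) are assumptions, not the actual schema.

```python
# Hypothetical sketch of the dedup insert; schema names are illustrative.
import json
import psycopg2

def store_batch(conn, validator_id, tweets):
    with conn.cursor() as cur:
        for t in tweets:
            # The numeric, chronological tweet id is the primary key, so a
            # duplicate sent by another validator is skipped rather than re-stored.
            cur.execute(
                "INSERT INTO tweets (tweet_id, payload) VALUES (%s, %s) "
                "ON CONFLICT (tweet_id) DO NOTHING",
                (t["id"], json.dumps(t)),
            )
            # Record which validator sent which tweet, and when.
            cur.execute(
                "INSERT INTO tweets_metadata (tweet_id, validator_id, received_at) "
                "VALUES (%s, %s, now())",
                (t["id"], validator_id),
            )
    conn.commit()
```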
We don't need realtime indexing of tweets; that's what the validators' APIs provide. So instead of the Postgres DB, our Go app could just do this: receive the POST from a validator and write it to ingest/*.jsonl; then a separate worker process, triggered to run periodically (with the period tuned to the tweets per minute we take in versus the number of tweets we want per Parquet file), batches those ingest/*.jsonl files into larger Parquet files of optimal size for ingest by DuckDB, storing them in the Hive-partitioned path structure. Re: tuning the periodicity, I mean that if we take in 10k tweets per minute and each Parquet file holds 100k tweets, we would set the batching script to run about every hour, at which point it would process about six batches per run (10k/min × 60 min ≈ 600k tweets ≈ six 100k-tweet files).
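A rough sketch of that periodic batching worker, assuming a hypothetical `tweet-archive` bucket, a 100k-row target per file, and date-style partition keys (the thread leans toward partitioning on the chronological tweet id instead):

```python
# Sketch only: bucket, prefixes, and batch size are assumptions, not settled values.
import datetime
import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

BUCKET = "tweet-archive"   # hypothetical bucket
BATCH_SIZE = 100_000       # target tweets per Parquet file

s3 = boto3.client("s3")

def run_batcher():
    # Collect the small ingest/*.jsonl objects written by the POST handler
    # (first page of results only; a real worker would paginate).
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="ingest/")
    keys = [o["Key"] for o in listing.get("Contents", []) if o["Key"].endswith(".jsonl")]

    rows = []
    for key in keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode("utf-8")
        rows.extend(json.loads(line) for line in body.splitlines() if line.strip())

    # Only flush full batches, so every Parquet file is close to the optimal size.
    while len(rows) >= BATCH_SIZE:
        chunk, rows = rows[:BATCH_SIZE], rows[BATCH_SIZE:]
        table = pa.Table.from_pylist(chunk)

        today = datetime.date.today()
        # Hive-partitioned output path, keyed by date here purely for illustration.
        out_key = (
            f"tweets/year={today.year}/month={today.month:02d}/day={today.day:02d}/"
            f"batch-{chunk[0]['id']}.parquet"
        )
        buf = pa.BufferOutputStream()
        pq.write_table(table, buf)
        s3.put_object(Bucket=BUCKET, Key=out_key, Body=buf.getvalue().to_pybytes())
```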
The evolving Miro board diagram is now reflected in the edited comments above. tl;dr: we need to cut Postgres out of the loop and just write to S3. A bit longer: "just write to S3" is still somewhat complex. We can either modify the Go app to do it, running via ECS, or, since it's not that many requests in the grand scheme of things (just 15 validators sending the tweet data), it could be a Python lambda instead of the ECS Go app. Either way, we still need an archiver process that records metadata (which validator tweets were harvested from, and when) and batches the smaller incoming batches into large but DuckDB-optimal Parquet files, stored with Hive partitioning, whenever the threshold of incoming small tweet batches is met.
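And a hedged sketch of the Python-lambda flavour of "just write to S3"; the event shape (API Gateway proxy body containing a JSON array of tweets), the header name, the bucket, and the key layout are all assumptions for illustration.

```python
# Sketch only: bucket, key layout, header name, and event shape are hypothetical.
import json
import time
import uuid

import boto3

BUCKET = "tweet-archive"  # hypothetical bucket
s3 = boto3.client("s3")

def handler(event, context):
    # A validator POSTs a batch of tweets; drop it straight into ingest/ as JSONL.
    tweets = json.loads(event["body"])
    validator = event.get("headers", {}).get("x-validator-id", "unknown")

    key = f"ingest/{int(time.time())}-{validator}-{uuid.uuid4().hex}.jsonl"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body="\n".join(json.dumps(t) for t in tweets).encode("utf-8"))

    # The periodic archiver (sketched above) later rolls these small files into
    # large, hive-partitioned Parquet files sized for DuckDB, and can record the
    # validator / timestamp metadata alongside them.
    return {"statusCode": 200,
            "body": json.dumps({"stored": len(tweets), "key": key})}
```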
Problem:
We are seeing a large volume of tweets flowing into the indexing API and have tens of millions of tweets in the PostgreSQL database.
Acceptance criteria & questions arising:
The outcome of this ticket is one or more tickets that define a stable system that scales to billions of tweets.
Added a comment below proposing a path forward that enables us to capture, index, and archive the scale of tweets we envision (billions) in an efficient manner.