
spike (protocol-api): database and infra upgrades for scaling tweet index volume #321

Open · 1 of 8 tasks
teslashibe opened this issue Dec 9, 2024 · 8 comments
teslashibe commented Dec 9, 2024

Problem:

We are seeing a large volume of tweets flowing into the indexing API and have tens of millions of tweets in the PostgreSQL database.

Acceptance criteria & questions arising:

  • Autoscale the containers that run the app writing to the DB
  • Add alerting on the DB service - there is currently none, and it would help track status
  • Monitor and track the DB - there is currently no way to see what is happening with vertical scaling of the DB
  • I/O and network usage are increasing as tweet volume scales exponentially
  • How big will this get, and where do we end up?
  • Do we save the tweet data to flat files in S3? From there it can be loaded into any DB
  • Architecture diagram - defer to Ettore on this
  • Whitelist validator and internal IP addresses for access so the API is secure

The outcome of this ticket is one or more tickets that define a stable system that scales to billions of tweets.

A comment below proposes a path forward that enables us to capture, index, and archive tweets at the scale we envision (billions) in an efficient manner.

@teslashibe teslashibe changed the title feat (protocol-api): database and infra upgrades for scaling tweet index voume spike (protocol-api): database and infra upgrades for scaling tweet index voume Dec 9, 2024
5u6r054 commented Dec 9, 2024

Deleted the previous "proposal": due to the actual high load, the DB filled up and I had to increase both the storage limit and the DB size. I rethought the proposal and built a Lambda that archives tweets to S3 in Parquet format (for querying via DuckDB). The open questions are whether we want to keep managing Postgres for this long term, and how to deal smartly with what we already have in the immediate / short term.

Link to Miro Board with Architectural Diagram of proposed tweet ingest, index, archive pipeline:
https://miro.com/welcomeonboard/TTBvK2hkRzVycEJGcnV2VFFtN05RS3k1L1B0SStWQzJicGx2TzhUc0JzclRWbGovTWpFcXdpRE9GL0NaZnVKV25xMU1kUTRNZXhKazhXbVRaSzZnMXFkQTFqditYSFpUaGhHZWlEUm0vN0NTc0pmeWZUckM0ak9rTmRpL2FIb0IhZQ==?share_link_id=38695797925
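
For illustration, a minimal sketch of the archive step such a Lambda might perform, assuming pyarrow and boto3 in the runtime; the bucket name, key scheme, and tweet fields are placeholders, not the actual implementation:

```python
# Sketch only: write one batch of tweet dicts to S3 as a single Parquet object.
import io
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def archive_batch(tweets: list[dict], bucket: str = "tweet-archive") -> str:
    """Convert a batch of tweets to Parquet and upload it to S3."""
    table = pa.Table.from_pylist(tweets)              # rows -> columnar table
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="zstd")    # DuckDB-friendly Parquet
    key = f"archive/{tweets[0]['id']}_{tweets[-1]['id']}.parquet"
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
    return key
```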

theMultitude commented Dec 9, 2024

@5u6r054 Let me again suggest DuckDB for analytics as opposed to Athena.

If the data is properly stored in S3 (Parquet with Hive partitioning), DuckDB is the better V1 option given that it's free and can work across this stack.
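
A rough sketch of what "DuckDB straight over S3" could look like, assuming a hypothetical day=YYYY-MM-DD Hive partition and a text column; the bucket name and schema are illustrative only:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")   # enables reading s3:// paths

rows = con.execute("""
    SELECT day, count(*) AS tweets
    FROM read_parquet('s3://tweet-archive/archive/day=*/*.parquet',
                      hive_partitioning = true)
    WHERE day BETWEEN '2024-12-01' AND '2024-12-09'
      AND text ILIKE '%keyword%'
    GROUP BY day
    ORDER BY day
""").fetchall()
print(rows)
```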

5u6r054 commented Dec 9, 2024

@theMultitude: Valid suggestion. DuckDB:

  • Free vs Athena's pay-per-query
  • Direct S3 integration
  • Fast Parquet handling
  • Local compute flexibility
  • Works well with partitioned data

I'll make an updated diagram swapping out Athena / Glue with DuckDB

5u6r054 commented Dec 9, 2024

What's currently running:
[image: diagram of the current architecture]

What I propose, given the scale of the data and our needs for querying this data after:
[image: diagram of the proposed architecture]

@theMultitude

> @theMultitude: Valid suggestion. DuckDB:
>
> • Free vs Athena's pay-per-query
> • Direct S3 integration
> • Fast Parquet handling
> • Local compute flexibility
> • Works well with partitioned data
>
> I'll make an updated diagram swapping out Athena / Glue with DuckDB

Yeah, this knocks out the Athena and Glue buildout until we need it. We can talk about the best way to partition the data given its structure; I haven't thought about it yet.

Also, @5u6r054, why not use something like Firehose initially?

5u6r054 commented Dec 10, 2024

@theMultitude we don't need Firehose yet. We are getting tweets as POSTs of batches from the validators, so the limiting factor here is the speed of those POSTs. The tweets then go into a Postgres table that uses the tweet ID (which is numeric and chronological) as the primary key, which ensures we are not storing duplicate tweets. We also store metadata about which validator sent which tweet in the tweets_metadata table.
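
A minimal sketch of that dedup write path, assuming psycopg2 and illustrative column names; only the tweets_metadata table and the id-as-primary-key behaviour come from the description above:

```python
import psycopg2

def store_batch(conn, tweets: list[dict], validator_id: str) -> None:
    with conn.cursor() as cur:
        for t in tweets:
            # The primary key on id makes re-submitted tweets a no-op.
            cur.execute(
                "INSERT INTO tweets (id, text, created_at) "
                "VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
                (t["id"], t["text"], t["created_at"]),
            )
            # Record which validator sent this tweet.
            cur.execute(
                "INSERT INTO tweets_metadata (tweet_id, validator_id) "
                "VALUES (%s, %s) ON CONFLICT DO NOTHING",
                (t["id"], validator_id),
            )
    conn.commit()
```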

From here, we can query the Postgres DB and archive tweets to S3. At this stage we figure out the indexing / file-partitioning scheme that is optimized for ingestion and querying by DuckDB. Hive partitioning makes sense, and keying the partitions only on the tweet ID, which again is chronological, makes sense to me.
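
One way to map the chronological tweet ID onto a day partition, assuming the IDs are standard Twitter snowflake IDs (epoch 1288834974657 ms); the "day=" prefix and the function itself are illustrative, not decided:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, 2010-11-04 UTC

def partition_for(tweet_id: int) -> str:
    """Derive the UTC day Hive partition from a snowflake tweet id."""
    ts_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
    return f"day={day.isoformat()}"

print(partition_for(1866000000000000000))  # a late-2024 id -> its day= partition
```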

But we have problems. We are adding 20M rows per day right now, and the Postgres DB, due to some suboptimal aspects of its schema, is rapidly getting overwhelmed:

  • The primary key, the tweet ID, while chronological and numeric, is stored as a VARCHAR instead of a BIGINT, and each ID carries a ton of 0000 padding (see the migration sketch after this list).
  • Apart from sorting out dupes and grabbing metadata, do we actually need Postgres at all, if we're only going to be providing three things:
  1. storing only new, unique / unstored tweets
  2. recording metadata about the validator that submitted the tweets
  3. answering queries for tweets that require keyword indexing, so text searches can be narrowed by time/date range
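
A hedged sketch of the schema fix from the first bullet: cast the padded VARCHAR ID to BIGINT (Postgres drops the leading zeros in the cast). The connection string and exact table/column names are assumptions:

```python
import psycopg2

MIGRATION = """
ALTER TABLE tweets
    ALTER COLUMN id TYPE BIGINT USING id::BIGINT;
"""

conn = psycopg2.connect("dbname=protocol_api")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(MIGRATION)  # rewrites the table; run in a maintenance window
conn.close()
```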

We don't need realtime indexing of tweets; that's what the validators' APIs provide.

So if instead of the Postgres DB we just had our Go app do this:

  • receive the POST from a validator
  • write the POST to S3 as a JSONL file, named by the first tweet ID _ last tweet ID, under a tweet_bucket/ingest/ path (sketched below)
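
A sketch of that ingest write in Python (the real component would be the Go app or a Lambda): dump the validator's POST body to S3 as JSONL, named by the ID range it covers. The bucket and path come from the comment above; the handler wiring around it is assumed:

```python
import json
import boto3

s3 = boto3.client("s3")

def write_ingest_batch(tweets: list[dict], bucket: str = "tweet_bucket") -> str:
    """Write one incoming validator batch to the ingest/ prefix as JSONL."""
    body = "\n".join(json.dumps(t) for t in tweets)
    key = f"ingest/{tweets[0]['id']}_{tweets[-1]['id']}.jsonl"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```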

Then a separate worker process, triggered to run periodically (with the period tuned to the tweets per minute we are taking in versus the number of tweets we want per Parquet file), batches the ingest/*.jsonl files into larger Parquet files of a size optimal for ingestion by DuckDB, storing them in the Hive-partitioned path structure. Re: tuning the periodicity, I mean that if we take in 10k tweets per minute and our Parquet batch size holds 100k tweets, we would set the batching script to run about every hour, at which point it would process about six batches in a run.
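
A sketch of that periodic batching worker, rolling ingest/*.jsonl up into Hive-partitioned Parquet with DuckDB. The bucket, the numeric id column, and the snowflake-epoch id-to-day mapping are assumptions carried over from the earlier sketches:

```python
import duckdb

INGEST_GLOB = "s3://tweet_bucket/ingest/*.jsonl"
ARCHIVE_PREFIX = "s3://tweet_bucket/archive"

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Read every small JSONL batch, derive the UTC day from the chronological id,
# and write Hive-partitioned Parquet in one pass.
con.execute(f"""
    COPY (
        SELECT *,
               strftime(make_timestamp(((id >> 22) + 1288834974657) * 1000),
                        '%Y-%m-%d') AS day
        FROM read_json_auto('{INGEST_GLOB}')
    )
    TO '{ARCHIVE_PREFIX}' (FORMAT parquet, PARTITION_BY (day))
""")

# Sizing per the numbers above: 10k tweets/min into ~100k-tweet files means an
# hourly run emits roughly six files.
tweets_per_min, tweets_per_file = 10_000, 100_000
print(60 * tweets_per_min / tweets_per_file)  # -> 6.0
```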

5u6r054 commented Dec 10, 2024

The evolving Miro board diagram is now reflected in the edited comments above.

tl;dr:

We need to cut Postgres out of the loop and just write to S3.

a bit longer:

"just write to s3" is still somewhat complex, can either modify the go app to do it, running via ecs, or since it's not that many requests in the grand scheme of things (just 15 validators sending the tweet data), it could be a python lambda instead of ECS go app.

Either way, we still need an archiver process to record metadata (which validator the tweets were harvested from, and when) and to batch the smaller incoming batches into large, DuckDB-optimal Parquet files, storing them with Hive partitioning whenever the threshold of incoming small batches is met.

@5u6r054 5u6r054 changed the title spike (protocol-api): database and infra upgrades for scaling tweet index voume spike (protocol-api): database and infra upgrades for scaling tweet index volume Dec 10, 2024