
spike (protocol-api): database and infra upgrades for scaling tweet index volume #321

Open · 1 of 8 tasks
teslashibe opened this issue Dec 9, 2024 · 8 comments
teslashibe commented Dec 9, 2024

Problem:

We are seeing a large volume of tweets flowing into the indexing API and have tens of millions of tweets in the PostgreSQL database.

Acceptance criteria & questions arising:

  • Autoscale the containers that run the app writing to the DB
  • Add alerting on the DB service - there is currently none, and it would help track status
  • Monitor and track the DB - there is currently no way to see what is happening with vertical scaling of the DB
  • I/O and network usage are increasing as tweet volume scales exponentially
  • How big will this get, and where do we end up?
  • Do we save the tweet data to flat files in S3? From there it can be loaded into any DB
  • Architecture diagram - defer to Ettore on this
  • Whitelist validator and internal IP addresses for access so the API is secure

The outcome of this ticket is one or more tickets that define a stable system that scales to billions of tweets.

A comment below proposes a path forward that enables us to capture, index, and archive tweets at the scale we envision (billions) in an efficient manner.

@teslashibe teslashibe changed the title feat (protocol-api): database and infra upgrades for scaling tweet index voume spike (protocol-api): database and infra upgrades for scaling tweet index voume Dec 9, 2024
5u6r054 commented Dec 9, 2024

Deleted the previous "proposal": due to the actual high load, the DB filled up and I had to increase both the storage limit and the DB size. I rethought the proposal and built a Lambda that archives tweets to S3 in Parquet format (for querying via DuckDB). The open questions are whether we want to keep managing Postgres for this long term, and how to deal smartly with what we already have in the immediate / short term.

Link to Miro Board with Architectural Diagram of proposed tweet ingest, index, archive pipeline:
https://miro.com/welcomeonboard/TTBvK2hkRzVycEJGcnV2VFFtN05RS3k1L1B0SStWQzJicGx2TzhUc0JzclRWbGovTWpFcXdpRE9GL0NaZnVKV25xMU1kUTRNZXhKazhXbVRaSzZnMXFkQTFqditYSFpUaGhHZWlEUm0vN0NTc0pmeWZUckM0ak9rTmRpL2FIb0IhZQ==?share_link_id=38695797925
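
For illustration, a minimal sketch of the archive step such a Lambda might perform, assuming pyarrow and boto3 in the runtime; the bucket name, key scheme, and tweet fields are placeholders, not the actual implementation:

```python
# Sketch only: write one batch of tweet dicts to S3 as a single Parquet object.
import io
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def archive_batch(tweets: list[dict], bucket: str = "tweet-archive") -> str:
    """Convert a batch of tweets to Parquet and upload it to S3."""
    table = pa.Table.from_pylist(tweets)              # rows -> columnar table
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="zstd")    # DuckDB-friendly Parquet
    key = f"archive/{tweets[0]['id']}_{tweets[-1]['id']}.parquet"
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
    return key
```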

theMultitude commented Dec 9, 2024

@5u6r054 Let me again suggest DuckDB for analytics as opposed to Athena.

If the data is properly stored in S3 (Parquet with Hive partitioning), DuckDB is the better V1 option given that it's free and can work across this stack.
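
A rough sketch of what "DuckDB straight over S3" could look like, assuming a hypothetical day=YYYY-MM-DD Hive partition and a text column; the bucket name and schema are illustrative only:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")   # enables reading s3:// paths

rows = con.execute("""
    SELECT day, count(*) AS tweets
    FROM read_parquet('s3://tweet-archive/archive/day=*/*.parquet',
                      hive_partitioning = true)
    WHERE day BETWEEN '2024-12-01' AND '2024-12-09'
      AND text ILIKE '%keyword%'
    GROUP BY day
    ORDER BY day
""").fetchall()
print(rows)
```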

5u6r054 commented Dec 9, 2024

@theMultitude: Valid suggestion. DuckDB:

  • Free vs Athena's pay-per-query
  • Direct S3 integration
  • Fast Parquet handling
  • Local compute flexibility
  • Works well with partitioned data

I'll make an updated diagram swapping out Athena / Glue with DuckDB

5u6r054 commented Dec 9, 2024

What's currently running:
[image: diagram of the current architecture]

What I propose, given the scale of the data and our needs for querying this data after:
[image: diagram of the proposed architecture]

@theMultitude

> @theMultitude: Valid suggestion. DuckDB:
>
> • Free vs Athena's pay-per-query
> • Direct S3 integration
> • Fast Parquet handling
> • Local compute flexibility
> • Works well with partitioned data
>
> I'll make an updated diagram swapping out Athena / Glue with DuckDB

Yeah, this knocks out the Athena and Glue buildout until we need it. We can talk about the best way to partition the data given its structure; I haven't thought about it yet.

Also, @5u6r054, why not use something like Firehose initially?

5u6r054 commented Dec 10, 2024

@theMultitude we don't need Firehose yet. We are getting tweets as POSTs of batches from the validators, so the limiting factor here is the speed of those POSTs. The tweets then go into a Postgres table that uses the tweet ID (which is numeric and chronological) as the primary key, which ensures we are not storing duplicate tweets. We also store metadata about which validator sent which tweet in the tweets_metadata table.
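
A minimal sketch of that dedup write path, assuming psycopg2 and illustrative column names; only the tweets_metadata table and the id-as-primary-key behaviour come from the description above:

```python
import psycopg2

def store_batch(conn, tweets: list[dict], validator_id: str) -> None:
    with conn.cursor() as cur:
        for t in tweets:
            # The primary key on id makes re-submitted tweets a no-op.
            cur.execute(
                "INSERT INTO tweets (id, text, created_at) "
                "VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
                (t["id"], t["text"], t["created_at"]),
            )
            # Record which validator sent this tweet.
            cur.execute(
                "INSERT INTO tweets_metadata (tweet_id, validator_id) "
                "VALUES (%s, %s) ON CONFLICT DO NOTHING",
                (t["id"], validator_id),
            )
    conn.commit()
```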

From here, we can query the Postgres DB and archive tweets to S3. At this stage we figure out the indexing / file-partitioning scheme that is optimized for ingestion and querying by DuckDB. Hive partitioning makes sense, and keying the partitions only on the tweet ID, which again is chronological, makes sense to me.
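
One way to map the chronological tweet ID onto a day partition, assuming the IDs are standard Twitter snowflake IDs (epoch 1288834974657 ms); the "day=" prefix and the function itself are illustrative, not decided:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, 2010-11-04 UTC

def partition_for(tweet_id: int) -> str:
    """Derive the UTC day Hive partition from a snowflake tweet id."""
    ts_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
    return f"day={day.isoformat()}"

print(partition_for(1866000000000000000))  # a late-2024 id -> its day= partition
```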

But we have problems. We are adding 20M rows per day right now, and the Postgres DB, due to some suboptimal aspects of its schema, is rapidly getting overwhelmed:

  • The primary key, the tweet ID, while chronological and numeric, is stored as a VARCHAR instead of a BIGINT, and each ID carries a ton of 0000 padding (see the migration sketch after this list).
  • Apart from sorting out dupes and grabbing metadata, do we actually need Postgres at all, if we're only going to be providing three things:
  1. storing only new, unique / unstored tweets
  2. recording metadata about the validator that submitted the tweets
  3. answering queries for tweets that require keyword indexing, so text searches can be narrowed by time/date range
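
A hedged sketch of the schema fix from the first bullet: cast the padded VARCHAR ID to BIGINT (Postgres drops the leading zeros in the cast). The connection string and exact table/column names are assumptions:

```python
import psycopg2

MIGRATION = """
ALTER TABLE tweets
    ALTER COLUMN id TYPE BIGINT USING id::BIGINT;
"""

conn = psycopg2.connect("dbname=protocol_api")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(MIGRATION)  # rewrites the table; run in a maintenance window
conn.close()
```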

We don't need realtime indexing of tweets; that's what the validators' APIs provide.

So if instead of the Postgres DB we just had our Go app do this:

  • receive the POST from a validator
  • write the POST to S3 as a JSONL file, named by the first tweet ID _ last tweet ID, under a tweet_bucket/ingest/ path (sketched below)
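
A sketch of that ingest write in Python (the real component would be the Go app or a Lambda): dump the validator's POST body to S3 as JSONL, named by the ID range it covers. The bucket and path come from the comment above; the handler wiring around it is assumed:

```python
import json
import boto3

s3 = boto3.client("s3")

def write_ingest_batch(tweets: list[dict], bucket: str = "tweet_bucket") -> str:
    """Write one incoming validator batch to the ingest/ prefix as JSONL."""
    body = "\n".join(json.dumps(t) for t in tweets)
    key = f"ingest/{tweets[0]['id']}_{tweets[-1]['id']}.jsonl"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```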

Then a separate worker process, triggered to run periodically (with the period tuned to the tweets per minute we are taking in versus the number of tweets we want per Parquet file), batches the ingest/*.jsonl files into larger Parquet files of a size optimal for ingestion by DuckDB, storing them in the Hive-partitioned path structure. Re: tuning the periodicity, I mean that if we take in 10k tweets per minute and our Parquet batch size holds 100k tweets, we would set the batching script to run about every hour, at which point it would process about six batches in a run.
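
A sketch of that periodic batching worker, rolling ingest/*.jsonl up into Hive-partitioned Parquet with DuckDB. The bucket, the numeric id column, and the snowflake-epoch id-to-day mapping are assumptions carried over from the earlier sketches:

```python
import duckdb

INGEST_GLOB = "s3://tweet_bucket/ingest/*.jsonl"
ARCHIVE_PREFIX = "s3://tweet_bucket/archive"

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Read every small JSONL batch, derive the UTC day from the chronological id,
# and write Hive-partitioned Parquet in one pass.
con.execute(f"""
    COPY (
        SELECT *,
               strftime(make_timestamp(((id >> 22) + 1288834974657) * 1000),
                        '%Y-%m-%d') AS day
        FROM read_json_auto('{INGEST_GLOB}')
    )
    TO '{ARCHIVE_PREFIX}' (FORMAT parquet, PARTITION_BY (day))
""")

# Sizing per the numbers above: 10k tweets/min into ~100k-tweet files means an
# hourly run emits roughly six files.
tweets_per_min, tweets_per_file = 10_000, 100_000
print(60 * tweets_per_min / tweets_per_file)  # -> 6.0
```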

5u6r054 commented Dec 10, 2024

The evolving Miro board diagram is now reflected in the edited comments above.

tl;dr:

We need to cut Postgres out of the loop and just write to S3.

a bit longer:

"just write to s3" is still somewhat complex, can either modify the go app to do it, running via ecs, or since it's not that many requests in the grand scheme of things (just 15 validators sending the tweet data), it could be a python lambda instead of ECS go app.

Either way, we still need an archiver process to record metadata (which validator the tweets were harvested from, and when) and to batch the smaller incoming batches into large, DuckDB-optimal Parquet files, storing them with Hive partitioning whenever the threshold of incoming small batches is met.

@5u6r054 5u6r054 changed the title spike (protocol-api): database and infra upgrades for scaling tweet index voume spike (protocol-api): database and infra upgrades for scaling tweet index volume Dec 10, 2024