Team Ingestion Q4 2022 Planning #11749

Closed · yakkomajuri opened this issue Sep 8, 2022 · 3 comments

yakkomajuri (Contributor) commented Sep 8, 2022

Edits

  • 09/09/2022: Added "3.3 Ingestion scaling for self-hosted" and listed as priority 3

Dropping here a few initial thoughts about what we could pick up in the upcoming quarter.

Proposed priorities for Q4

In order, written by someone who's currently exhausted. Opinions might change in the near future.

  1. Guaranteed job execution (1.1)
  2. Plugin metrics (1.2)
  3. Ingestion scaling for self-hosted (3.3)
  4. Automated testing framework (2.3)
  5. Dead letter queue (3.1)

Options

1. PostHog Pipeline as a Product

1.1 Guaranteed job execution

Jobs have become a massive part of PostHog. They power a good chunk of apps, particularly the more important ones (export apps).

We recently added retries to enqueueing jobs (#11561), but the system needs to be more robust. We need at least another fallback queue, with retries on top.

Something like Aurora + our main Postgres instance. We'd also need to think about how we'd like this to work for self-hosted users, where jobs already go to the main Postgres instance.

The goal here would be to never lose a job.
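
To make the "never lose a job" goal concrete, here's a minimal sketch of what enqueueing with a fallback queue could look like, assuming a hypothetical JobQueue interface and an ordered list of queues (e.g. the primary queue first, the main Postgres instance as fallback):

```typescript
// Minimal sketch of the "never lose a job" idea: try the primary queue with
// retries, then fall back to a secondary queue (e.g. the main Postgres
// instance) before giving up. The JobQueue interface is hypothetical.
interface JobQueue {
    enqueue(job: Record<string, unknown>): Promise<void>
}

async function enqueueWithFallback(
    job: Record<string, unknown>,
    queues: JobQueue[], // ordered by preference, e.g. [primary, postgresFallback]
    retriesPerQueue = 3
): Promise<void> {
    for (const queue of queues) {
        for (let attempt = 1; attempt <= retriesPerQueue; attempt++) {
            try {
                await queue.enqueue(job)
                return
            } catch (error) {
                // simple exponential backoff between attempts
                await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100))
            }
        }
    }
    throw new Error('All job queues failed - job would be lost')
}
```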

1.2 Plugin metrics (& more?)

A year ago, we added metrics to plugins. This was pretty cool. Devs could add their own metrics, and we automagically added metrics to export plugins. However, some frontend changes to insights broke them, and given we didn't fix it for a long time, we recently yeeted the feature, promising to revisit it in the future.

Metrics are important because:

  1. They help plugin devs debug their plugins
  2. They help plugin users debug their config
  3. They provide transparency to users about plugin performance
  4. They make us accountable
  5. They help us identify if issues are our problem or a user configuration problem

Segment can work as inspiration here.

Beyond plugin metrics, we might also want to offer ingestion metrics, surfacing issues with event payloads to users so they can fix their implementations and we can be free to write a dumb (and thus scalable) events service that does no validation on events and just drops them in Kafka. For that to work, users need to know which events were dropped for payload issues, why, and how many.
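
As a rough illustration of what a "dumb" events service could look like, here's a sketch using Express and kafkajs. The endpoint path, port, and topic name are placeholders, and a real service would still need batching, acks configuration, and error handling:

```typescript
// Hedged sketch of a "dumb" events service: accept the payload and drop it
// straight into Kafka with no validation, leaving payload issues to be
// surfaced later as ingestion metrics. Topic name, path, and port are assumptions.
import express from 'express'
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ clientId: 'events-service', brokers: ['localhost:9092'] })
const producer = kafka.producer()
const app = express()
app.use(express.text({ type: '*/*' })) // no JSON parsing, no validation

app.post('/e', async (req, res) => {
    await producer.send({
        topic: 'events_plugin_ingestion', // assumed topic name
        messages: [{ value: req.body }],
    })
    res.status(200).send('ok')
})

producer.connect().then(() => app.listen(8000))
```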

1.3 Export plugins

We've recently improved a lot of our export plugins, but there's more work to be done. There are plugins we should refactor, we can write more unit tests, etc.

1.4 More power to jobs

Essentially #10816.

3 key things:

  1. Cancelling a job that hasn't run yet
  2. Debounce keys (so trying to trigger the same job multiple times only triggers once)
  3. Monitoring job status

We might also want to consider a "long-running jobs" implementation, which we discussed a year or so ago. Essentially better native support for things like historical exports.
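
For the debounce keys idea specifically, here's a minimal sketch of the intended semantics. This is in-process only; the real thing would need shared state (e.g. Redis or Postgres) so it works across plugin-server instances:

```typescript
// Rough sketch of the "debounce keys" idea: triggering the same job several
// times within a window only enqueues it once. A real implementation would
// need shared state (e.g. Redis or Postgres), not an in-process Map.
const pendingJobs = new Map<string, NodeJS.Timeout>()

function triggerJobDebounced(
    debounceKey: string,
    enqueue: () => Promise<void>,
    windowMs = 60_000
): void {
    const existing = pendingJobs.get(debounceKey)
    if (existing) {
        clearTimeout(existing) // collapse repeated triggers into one
    }
    pendingJobs.set(
        debounceKey,
        setTimeout(() => {
            pendingJobs.delete(debounceKey)
            void enqueue()
        }, windowMs)
    )
}
```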

2. Housekeeping

2.1 Evaluate getting rid of Piscina

An idea has floated around to get rid of Piscina, and I do quite like it at a high level.

If we can get rid of this whole thread pool, our codebase will be significantly simpler, which is a big win. We'll be able to understand things better, debug things better, etc.

Trying this out is not too hard - I've written a little PoC for this locally as I was curious. Essentially, the idea would be to replace Piscina with horizontal scaling - spinning up more instances rather than more threads per instance.

If getting rid of Piscina turns out to be slightly less performant and slightly more costly, I'd still think it'd be worth it, for the complexity it gets rid of.

I don't necessarily think Piscina helps us with performance a lot, as the stuff we do in workers is very async-heavy and not too CPU intensive, meaning "single-threaded vanilla" Node should handle it well.

However, the two key things to watch out for are:

  1. Background processing management
  2. Event loop blockages

I'm particularly worried about 2. Today, if a malicious plugin hogs the event loop for 30s (our current max timeout) - or even 5s - we're fine, because the main thread is protected. If we get rid of worker threads, we suddenly need to be a lot more careful about this.

Honestly, I think it's hard for us to come up with a great solution for this, so we'll probably stick with Piscina for now. But always worth considering the proposal.
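
If we did go down the no-Piscina road, one cheap mitigation for the event loop concern is Node's built-in event loop delay histogram. It only detects blockage after the fact rather than preventing it the way a separate worker thread does, but it would at least let us alert on misbehaving plugins. Rough sketch:

```typescript
// One way to keep an eye on the "event loop blockage" risk if we ever drop
// the worker pool: Node's built-in event loop delay histogram. This only
// detects blockage after the fact - it doesn't protect the main thread the
// way a separate worker thread does. Thresholds here are placeholders.
import { monitorEventLoopDelay } from 'perf_hooks'

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
    const p99Ms = histogram.percentile(99) / 1e6 // nanoseconds -> milliseconds
    if (p99Ms > 1000) {
        console.warn(`Event loop blocked: p99 delay ${p99Ms.toFixed(0)}ms`)
    }
    histogram.reset()
}, 10_000)
```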

2.2 Renaming plugins --> apps

This is perhaps an even more annoying task than Yeetcode, but one we need to do at some point.

We moved too fast with renaming plugins to apps on the website and UI, and now we're left with a lot of confusion when talking to customers, naming models, writing frontend code, etc.

If we pick this up, it might be worth renaming the plugin-server to ingestion-server too.

2.3 Automated testing framework

We've discussed writing a little system to run automated tests on plugins so we can make sure that plugins are working at all times.

This would involve e.g. spying on network calls and running snapshot tests on plugin server changes.

@macobo even mentioned already having a good sense of how he'd build this.
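
For illustration only, a test in such a framework might look something like the following Jest sketch. The plugin import path and the onEvent signature are assumptions, not the actual framework design:

```typescript
// Very rough sketch: stub the network, run the plugin against a fixed event,
// and snapshot what it tried to send. Plugin path and meta shape are hypothetical.
import { onEvent } from './some-export-plugin' // hypothetical plugin under test

test('export plugin sends the expected payload', async () => {
    const calls: Array<{ url: string; body: unknown }> = []

    // spy on outgoing network calls instead of hitting the real destination
    ;(global as any).fetch = jest.fn(async (url: string, init?: { body?: string }) => {
        calls.push({ url, body: init?.body ? JSON.parse(init.body) : undefined })
        return { ok: true, status: 200 }
    })

    await onEvent(
        { event: 'pageview', distinct_id: 'user-1', properties: {} },
        { config: {} } as any // hypothetical plugin meta
    )

    // any change in what the plugin sends shows up as a snapshot diff
    expect(calls).toMatchSnapshot()
})
```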

3. Core pipeline

3.1 Finally make the dead letter queue useful

Some would argue the dead letter queue should only hold events that are malformed and that we really cannot process. Whatever we want to call it though, what we currently have is effectively an "events retry queue", and we don't retry ingestion from it.

Tasks:

3.2 Queueing system V2

Creating a better queue system, whatever that looks like: queues per team, queues per group of teams, queues per <team, app>, queues per speed, etc.

Doing so would allow us to be more performant and robust.

3.3 Ingestion scaling for self-hosted

We still never got this in: PostHog/charts-clickhouse#243

And it's understandable - it seems simple but comes with a good chunk of complicated configuration.

However, the truth is that currently ingestion on self-hosted instances doesn't scale very well, nor is it very robust.

Thus we need to at least:

  • Start creating topics with more partitions
  • Devise a way for people to move to Kafka topics with more partitions without losing data
  • Write docs for scaling ingestion
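
For the first bullet, adding partitions to an existing topic is straightforward with the kafkajs admin client; the hard part is the migration story, since existing messages stay on their old partitions and the key-to-partition mapping changes for new events, which is part of why the docs matter. A hedged sketch (topic name and partition count are placeholders):

```typescript
// Sketch of the "more partitions" piece using the kafkajs admin client.
// This only adds partitions to an existing topic - it does not move data,
// and new events will hash to different partitions than before.
import { Kafka } from 'kafkajs'

async function expandTopic(topic: string, partitions: number): Promise<void> {
    const kafka = new Kafka({ clientId: 'admin', brokers: ['localhost:9092'] })
    const admin = kafka.admin()
    await admin.connect()
    try {
        await admin.createPartitions({
            topicPartitions: [{ topic, count: partitions }],
        })
    } finally {
        await admin.disconnect()
    }
}

void expandTopic('events_plugin_ingestion', 64) // placeholder topic and count
```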

4. Miscellaneous ideas

Some random stuff that I have in mind, just dropping it here for now. A lot of it is thinking way ahead and probably doesn't make much sense at the moment; I'm just anticipating problems we might have a long way down the line.

  • Dumb events service
  • Ingester service (rather than using Kafka tables)
  • Move everything to Protobuf (or some other serialization framework)
  • ...
tiina303 (Contributor) commented Sep 19, 2022

Something more to consider:

  1. Ingest events backwards #6834 - this is slightly more important after persons on events, where a bad person property value can't be fixed easily later. Also relevant if we want to consume from DLQ.
  2. Our person merging (identify / aliasing) code could use some love
  3. Person creation failure prolonged by cache #10463 (fits under 3.1?)

Consuming from the DLQ is a bit tricky because of person properties: their updates can cause bad overwrites, and the values of some person properties might have changed between the time the event went to the DLQ and re-ingestion. Alternative: fix the sources of those problems and set up alerting to make sure we keep addressing them in a timely manner.

tiina303 (Contributor) commented:

#7710 - might also be something to plan for fixing in Q4

macobo (Contributor) commented Sep 21, 2022

Q: Do we want to be planning a roadmap or do we care about setting objectives?

From the limited information here and in PostHog/posthog.com#4217, it feels to me like we're getting too stuck on setting a roadmap/listing important things without a clear guiding principle behind them.

Rant / background - roadmaps vs objectives

A roadmap defines what we build quite strictly - it's either under-scoped or over-scoped and needs constant iteration to keep up to date. If it's under-scoped, it doesn't give us the tooling to think beyond the tasks listed - how do we prioritize what to take on? If it's over-scoped, it's a recipe for disaster: people in the team get frustrated due to a lack of recognition beyond a checklist, people outside the team get frustrated due to a lack of apparent progress, we have no tools to tackle unexpected changes in direction, and we end up with poor product experiences from working off of a checklist. In the worst cases I've seen, bad roadmaps alone lead to brain drain.

Examples of roadmap items: parallelizing exports, plugin metrics, let's rebrand as acronym-of-the-week.

Objectives, on the other hand, are meant to help guide decisions: e.g. "Let's build the best data export system", "Nail Funnels", etc. This might mean making plugin jobs more reliable, or it might mean plugin metrics. It allows the team to iterate sprint-to-sprint and project-to-project on how to move the needle closer. It makes for better results because it allows for flexibility and from-the-trenches thinking, it is a much more motivating environment to work in because professionals get to make decisions, it allows for listening to client feedback and iterating with them towards better decisions, and it makes reprioritization easier since the guiding lines are already set.

Fundamentally, an objectives-based approach is closer to the ideal we want to be selling to our customers: experiment, look at the data, listen to your clients, investigate, and make informed decisions based on the situation.

My suggestions for objectives

Note these are not final - just jotting down my thoughts based on everything above.

Events are ingested, processed, and exported quickly, reliably and accurately

This is from the current set of objectives.

Reasoning: This is the backbone of the whole company; it is not a solved problem and will need iteration as new issues occur.

The major thing is keeping this rolling and scaling as we onboard larger and larger customers.

Some other work that might fall under this:

  • The housekeeping that yakko listed above
  • Scaling for self-hosted
  • Dead letter queue improvements

Nail Data Exports

Reasoning: This aligns with some ongoing work, (I think) with some ideas outlined in PostHog/posthog.com#4217, and with feedback we've received.

Work that might fall under this:

  • Parallelizing exports
  • Jobs improvements
  • Plugin jobs reliability
  • Plugin metrics
  • Moving apps to be a CDP (customer data platform) and what that entails
  • Exposing persons for plugins
  • Improving the UX of the apps application area

However, because the last ~6 months have been invested in some very technical subjects, I don't think we've invested nearly enough in speaking with customers and understanding their experience and needs on this topic. With @lharries and @annikaschmid on board, I think we should reverse that direction and build more confidence in which areas here we can find asymmetric wins.


We might be able to squeeze a third priority in here, but for a small team it might be too much.
