Team Ingestion Q4 2022 Planning #11749

Closed
yakkomajuri opened this issue Sep 8, 2022 · 3 comments

Comments

@yakkomajuri
Contributor

yakkomajuri commented Sep 8, 2022

Edits

  • 09/09/2022: Added "3.3 Ingestion scaling for self-hosted" and listed as priority 3

Dropping here a few initial thoughts about what we could pick up in the upcoming quarter.

Proposed priorities for Q4

In order, written by someone who's currently exhausted. Opinions might change in the near future.

  1. Guaranteed job execution (1.1)
  2. Plugin metrics (1.2)
  3. Ingestion scaling for self-hosted (3.3)
  4. Automated testing framework (2.3)
  5. Dead letter queue (3.1)

Options

1. PostHog Pipeline as a Product

1.1 Guaranteed job execution

Jobs have become a massive part of PostHog. They power a good chunk of apps, particularly the more important ones (export apps).

We recently added retries to enqueueing jobs (#11561), but the system needs to be more robust. We need at least another fallback queue, with retries on top.

Something like Aurora + our main Postgres instance. We'd also need to think about how we'd like this to work for self-hosted users, where jobs already go to the main Postgres instance.

The goal here would be to never lose a job.
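A rough sketch of the "fallback queue, with retries on top" idea. Everything here is an assumption for illustration: `Job`, `JobQueue`, and `enqueueWithFallback` are made-up names, not existing plugin-server code, and the retry counts and delays are arbitrary.

```typescript
// Hypothetical types: none of these names come from the PostHog codebase.
interface Job {
    id: string
    payload: unknown
}

interface JobQueue {
    name: string
    enqueue(job: Job): Promise<void>
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms))

// Try each queue in order; retry each one with exponential backoff before falling through.
async function enqueueWithFallback(
    job: Job,
    queues: JobQueue[],
    maxAttempts = 3,
    baseDelayMs = 100
): Promise<string> {
    for (const queue of queues) {
        for (let attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                await queue.enqueue(job)
                return queue.name // report which queue accepted the job
            } catch {
                // Back off before the next attempt on the same queue.
                if (attempt < maxAttempts - 1) {
                    await sleep(baseDelayMs * 2 ** attempt)
                }
            }
        }
    }
    // Every queue exhausted its retries: fail loudly rather than dropping the job silently.
    throw new Error(`Job ${job.id} could not be enqueued on any queue`)
}
```

The final `throw` is the important part for a "never lose a job" goal: if every queue is down, the caller has to find out so it can alert or persist the job elsewhere.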

1.2 Plugin metrics (& more?)

A year ago, we added metrics to plugins. This was pretty cool. Devs could add their own metrics, and we automagically added metrics to export plugins. However, some frontend changes to insights broke them, and given we didn't fix it for a long time, we recently yeeted the feature, promising to revisit it in the future.

Metrics are important because:

  1. They help plugin devs debug their plugins
  2. They help plugin users debug their config
  3. They provide transparency to users about plugin performance
  4. They make us accountable
  5. They help us identify if issues are our problem or a user configuration problem

Segment can work as inspiration here:

Beyond plugin metrics, we might also want to offer ingestion metrics that surface issues with event payloads to users, so they can fix their implementations. That would free us to write a dumb (and thus scalable) events service, which does no validation on events and just drops them in Kafka. For this to work, users need to know which events were dropped for a payload issue, why, and how many.
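To make the "what, why, and how many" part concrete, here is a minimal sketch of tallying dropped events per team and reason somewhere downstream of the dumb events service. The event shape, the two drop reasons, and the function names are all assumptions for illustration, not the real validation rules.

```typescript
// Hypothetical event shape and drop reasons, for illustration only.
interface RawEvent {
    teamId: number
    event?: string
    distinct_id?: string
}

type DropReason = 'missing_event_name' | 'missing_distinct_id'

// Returns the reason an event would be dropped, or null if it is acceptable.
function dropReason(e: RawEvent): DropReason | null {
    if (!e.event) return 'missing_event_name'
    if (!e.distinct_id) return 'missing_distinct_id'
    return null
}

// Tally drops per "<teamId>:<reason>" so each team can see what was dropped, why, and how many.
function tallyDrops(events: RawEvent[]): Map<string, number> {
    const counts = new Map<string, number>()
    for (const e of events) {
        const reason = dropReason(e)
        if (reason === null) continue
        const key = `${e.teamId}:${reason}`
        counts.set(key, (counts.get(key) ?? 0) + 1)
    }
    return counts
}
```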

1.3 Export plugins

We've recently improved a lot of our export plugins, but there's more work to be done. There are plugins we should refactor, we can write more unit tests, etc.

More power to jobs

Essentially #10816.

3 key things:

  1. Cancelling a job that hasn't run yet
  2. Debounce keys (so trying to trigger the same job multiple times only triggers once)
  3. Monitoring job status
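The three items above could fit together roughly like this. It's an in-memory sketch only: `JobRegistry` and its methods are hypothetical names, and a real version would live on top of whatever queue backs jobs.

```typescript
type JobStatus = 'scheduled' | 'cancelled' | 'completed'

interface ScheduledJob {
    id: number
    type: string
    debounceKey: string | undefined
    status: JobStatus
}

// Hypothetical in-memory registry; not an existing plugin-server class.
class JobRegistry {
    private jobs = new Map<number, ScheduledJob>()
    private pendingByKey = new Map<string, number>()
    private nextId = 1

    // Schedule a job. If a pending job already holds the debounce key, return that job instead.
    schedule(type: string, debounceKey?: string): ScheduledJob {
        if (debounceKey !== undefined) {
            const existingId = this.pendingByKey.get(debounceKey)
            if (existingId !== undefined) return this.jobs.get(existingId)!
        }
        const job: ScheduledJob = { id: this.nextId++, type, debounceKey, status: 'scheduled' }
        this.jobs.set(job.id, job)
        if (debounceKey !== undefined) this.pendingByKey.set(debounceKey, job.id)
        return job
    }

    // Cancelling only succeeds while the job has not run yet.
    cancel(id: number): boolean {
        return this.finish(id, 'cancelled')
    }

    markCompleted(id: number): boolean {
        return this.finish(id, 'completed')
    }

    // Monitoring: look up the current status of a job.
    status(id: number): JobStatus | undefined {
        return this.jobs.get(id)?.status
    }

    private finish(id: number, status: JobStatus): boolean {
        const job = this.jobs.get(id)
        if (!job || job.status !== 'scheduled') return false
        job.status = status
        // Release the debounce key so the same job can be triggered again later.
        if (job.debounceKey !== undefined) this.pendingByKey.delete(job.debounceKey)
        return true
    }
}
```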

We might also want to consider a "long-running jobs" implementation, which we discussed a year or so ago. Essentially better native support for things like historical exports.

2. Housekeeping

2.1 Evaluate getting rid of Piscina

An idea has floated around to get rid of Piscina, and I do quite like it at a high level.

If we can get rid of this whole thread pool, our codebase will be significantly simpler, which is a big win. We'll be able to understand things better, debug things better, etc.

Trying this out is not too hard - I've written a little PoC for this locally as I was curious. Essentially, the idea would be to replace Piscina with horizontal scaling - spinning up more instances rather than more threads per instance.

If getting rid of Piscina turns out to be slightly less performant and slightly more costly, I'd still think it'd be worth it, for the complexity it gets rid of.

I don't necessarily think Piscina helps us with performance a lot, as the stuff we do in workers is very async-heavy and not too CPU intensive, meaning "single-threaded vanilla" Node should handle it well.

However, the two key things to watch out for are:

  1. Background processing management
  2. Event loop blockages

I'm particularly worried about 2. Today, if a malicious plugin hogs the event loop for 30s (our current max timeout) - or even 5s - we're fine, because the main thread is protected. If we get rid of worker threads, suddenly we need to be a lot more careful with this.
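To illustrate why this is hard without worker threads: in a single-threaded process, the best a timer can do is notice a blockage after it has ended, since the timer itself cannot fire while the loop is hogged. A minimal sketch of that kind of after-the-fact detection (the function names and the 100ms/1s defaults are made up for illustration):

```typescript
// Given the times (in ms) at which a repeating timer actually fired, return how late
// each tick was relative to the expected interval, keeping only lags above the threshold.
function findStalls(tickTimesMs: number[], intervalMs: number, thresholdMs: number): number[] {
    const stalls: number[] = []
    for (let i = 1; i < tickTimesMs.length; i++) {
        const lag = tickTimesMs[i] - tickTimesMs[i - 1] - intervalMs
        if (lag > thresholdMs) stalls.push(lag)
    }
    return stalls
}

// Usage sketch: sample the loop on an interval and warn when a tick arrives very late.
// Returns a function that stops the watcher.
function watchEventLoop(intervalMs = 100, thresholdMs = 1000): () => void {
    let last = Date.now()
    const timer = setInterval(() => {
        const now = Date.now()
        const stalls = findStalls([last, now], intervalMs, thresholdMs)
        if (stalls.length > 0) console.warn(`Event loop was blocked for ~${stalls[0]}ms`)
        last = now
    }, intervalMs)
    return () => clearInterval(timer)
}
```

This can tell us that a plugin blocked the loop, but it can't interrupt the plugin while it's happening, which is the protection the separate main thread gives us today.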

Honestly, I think it's hard for us to come up with a great solution for this, so we'll probably stick with Piscina for now. But always worth considering the proposal.

2.2 Renaming plugins --> apps

This is perhaps even more annoying of a task than Yeetcode, but one we need to do at some point.

We moved too fast with renaming plugins to apps on the website and UI, and now we're left with a lot of confusion when talking to customers, naming models, writing frontend code, etc.

If we pick this up, might be worth renaming the plugin-server to ingestion-server too.

2.3 Automated testing framework

We've discussed writing a little system to run automated tests on plugins so we can make sure that plugins are working at all times.

This would involve e.g. spying on network calls and running snapshot tests on plugin server changes.

@macobo even mentioned having a good sense of how he'd build this already.
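A minimal sketch of the "spy on network calls + snapshot" idea, not a description of what @macobo has in mind. It assumes a made-up plugin contract where the plugin receives its HTTP client as an argument; `HttpPost`, `ExportPlugin`, `recordCalls`, and `matchesSnapshot` are all hypothetical names.

```typescript
// Hypothetical plugin contract for this sketch: the HTTP client is injected.
interface RecordedCall {
    url: string
    body: string
}
type HttpPost = (url: string, body: string) => Promise<void>
type ExportPlugin = (events: { event: string }[], post: HttpPost) => Promise<void>

// Run the plugin against a spy instead of the network and return what it tried to send.
async function recordCalls(plugin: ExportPlugin, events: { event: string }[]): Promise<RecordedCall[]> {
    const calls: RecordedCall[] = []
    const spy: HttpPost = async (url, body) => {
        calls.push({ url, body })
    }
    await plugin(events, spy)
    return calls
}

// Snapshot check: the serialized calls must equal what was stored from a known-good run.
function matchesSnapshot(calls: RecordedCall[], snapshot: string): boolean {
    return JSON.stringify(calls) === snapshot
}
```

Running this on every plugin-server change would flag any change that alters what a plugin sends over the wire.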

3. Core pipeline

3.1 Finally make the dead letter queue useful

Some would argue the dead letter queue should be only for events that are malformed and we really cannot process. Whatever we want to call it though, we currently have an "events retry queue" and we don't retry ingestion from it.

Tasks:

3.2 Queueing system V2

Creating a better queue system, whatever that looks like: queues per team, queues per groups of teams, queues per <team, app>, queues per speed, etc.

Doing so would allow us to be more performant and robust.
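For a sense of how the different options change routing, here's a small sketch that maps a unit of work to a queue name under three of the strategies above. The strategy names, the `WorkItem` shape, and the modulo-based grouping are assumptions for illustration only.

```typescript
type QueueStrategy = 'per-team' | 'per-team-group' | 'per-team-app'

// Hypothetical unit of work.
interface WorkItem {
    teamId: number
    appId: number
}

// Map a work item to a queue name. `groupCount` is only used by 'per-team-group'.
function queueNameFor(item: WorkItem, strategy: QueueStrategy, groupCount = 8): string {
    if (strategy === 'per-team') {
        return `team-${item.teamId}`
    }
    if (strategy === 'per-team-group') {
        // Teams are bucketed into a fixed number of groups.
        return `group-${item.teamId % groupCount}`
    }
    // 'per-team-app': one slow app only backs up its own queue.
    return `team-${item.teamId}-app-${item.appId}`
}
```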

3.3 Ingestion scaling for self-hosted

We still never got this in: PostHog/charts-clickhouse#243

And it's understandable - it seems simple but comes with a good chunk of complicated configuration.

However, the truth is that currently ingestion on self-hosted instances doesn't scale very well, nor is it very robust.

Thus we need to at least:

  • Start creating topics with more partitions
  • Devise a way for people to move to Kafka topics with more partitions without losing data
  • Write docs for scaling ingestion
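One thing the migration path would need to account for, assuming messages are keyed: with hash-based partitioning, changing the partition count changes which partition a given key lands on. The sketch below uses a simple FNV-1a-style hash as a stand-in; it is NOT Kafka's actual partitioner, and the function names are made up.

```typescript
// Simple deterministic string hash (FNV-1a style, 32-bit). A stand-in, NOT Kafka's partitioner.
function hash(key: string): number {
    let h = 0x811c9dc5
    for (let i = 0; i < key.length; i++) {
        h ^= key.charCodeAt(i)
        h = Math.imul(h, 0x01000193) >>> 0 // keep the value as an unsigned 32-bit integer
    }
    return h
}

// Hash-mod partitioning: the result depends on the partition count.
function partitionFor(key: string, partitionCount: number): number {
    return hash(key) % partitionCount
}

// Count how many keys land on a different partition after the partition count changes.
function countRemapped(keys: string[], oldCount: number, newCount: number): number {
    return keys.filter((k) => partitionFor(k, oldCount) !== partitionFor(k, newCount)).length
}
```

Calling something like `countRemapped(keys, 1, 8)` on a sample of real keys would show how much of the keyspace moves, which matters if anything downstream relies on per-key ordering.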

4. Miscellaneous ideas

Some random stuff that I have in mind, just dropping here for now. A lot of it is thinking way ahead and probably doesn't make a lot of sense at the moment; I'm just anticipating problems we might have a long way down the line.

  • Dumb events service
  • Ingester service (rather than using Kafka tables)
  • Move everything to Protobuf (or some other serialization framework)
  • ...
@tiina303
Contributor

tiina303 commented Sep 19, 2022

Something more to consider:

  1. Ingest events backwards #6834 - this is slightly more important after persons on events, where a bad person property value can't be fixed easily later. Also relevant if we want to consume from DLQ.
  2. Our merging (identify / aliasing) persons code could use some love
  3. Person creation failure prolonged by cache #10463 (fits under 3.1?)

Consuming from DLQ: this is a bit tricky because of person properties. Their updates can cause bad overwrites, and the values of some person properties might have changed between the time the event went to the DLQ and re-ingestion. Alternative: fix the sources of those problems and set up alerting to make sure we keep addressing them in a timely manner.
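A toy illustration of the overwrite problem, assuming property updates are applied "last write wins" in processing order (a simplification, and all names here are made up): replaying an old event from the DLQ after newer events have been ingested puts a stale value back.

```typescript
// Hypothetical, simplified event: a timestamp plus the person properties it sets.
interface PersonEvent {
    timestamp: number
    set: Record<string, string>
}

// Naive model: whichever event is processed last wins, regardless of its timestamp.
function applyInProcessingOrder(events: PersonEvent[]): Record<string, string> {
    const props: Record<string, string> = {}
    for (const e of events) Object.assign(props, e.set)
    return props
}

// What we'd actually want the result to be: as if events had been applied in timestamp order.
function applyInEventOrder(events: PersonEvent[]): Record<string, string> {
    const sorted = events.slice().sort((a, b) => a.timestamp - b.timestamp)
    return applyInProcessingOrder(sorted)
}

const live: PersonEvent = { timestamp: 2, set: { plan: 'paid' } }
const fromDlq: PersonEvent = { timestamp: 1, set: { plan: 'free' } }

// The DLQ event is older but processed later, so the naive model ends up with plan = 'free'.
const afterReplay = applyInProcessingOrder([live, fromDlq])
const expected = applyInEventOrder([live, fromDlq])
```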

@tiina303
Contributor

#7710 - might also be something to plan for fixing in Q4

@macobo
Contributor

macobo commented Sep 21, 2022

Q: Do we want to be planning a roadmap or do we care about setting objectives?

From the limited information here and in PostHog/posthog.com#4217 it feels to me we're getting too stuck setting a roadmap/listing important things without having a clear guiding principle behind them.

Rant / background - roadmaps vs objectives

A roadmap defines what we build quite strictly - it's either under-scoped or over-scoped, and it needs constant iteration to stay up to date. If it's under-scoped, it doesn't give us the tools to think beyond the tasks listed - how do we prioritize what to take on? If it's over-scoped, it's a recipe for disaster: people in the team get frustrated due to lack of recognition beyond a checklist, people outside the team get frustrated due to a lack of apparent progress, we have no tools to tackle unexpected changes in direction, and working off of a checklist leads to poor product experiences. In the worst cases, I've seen bad roadmaps alone lead to brain drain.

Examples of roadmap items: parallelizing exports, plugin metrics, let's rebrand as acronym-of-the-week.

Objectives, on the other hand, are meant to help guide decisions, e.g. "Let's build the best data export system", "Nail Funnels", etc. This might mean making plugin jobs more reliable, or it might mean plugin metrics. It allows the team to iterate sprint-to-sprint and project-to-project on how to move the needle closer. It makes for better results because it allows for flexibility and from-the-trenches thinking, it is a much more motivating environment to work in because it lets professionals make decisions, it allows for listening to client feedback and iterating with clients towards better decisions, and it allows for easier reprioritization since the guiding lines are set.

Fundamentally, an objectives-based approach is closer to the ideal we want to be selling to our customers: experiment, look at the data, listen to your clients, investigate, and make informed decisions based on the situation.

My suggestions for objectives

Note these are not final - just jotting down my thoughts based on everything above.

Events are ingested, processed, and exported quickly, reliably and accurately

This is from the current set of objectives.

Reasoning: This is the backbone of the whole company and is not a solved problem and will need iteration as new issues occur.

The major thing is keeping this rolling and scaling as we onboard larger and larger customers.

Some other work that might fall under this:

  • there's housekeeping that yakko listed above
  • scaling for self-hosted
  • Dead letter queue improvements

Nail Data Exports

Reasoning: This aligns with some ongoing work and (I think) with some ideas outlined in PostHog/posthog.com#4217 and feedback we've received.

Work that might fall under this:

  • Parallelizing exports
  • Jobs improvements
  • Plugin jobs reliability
  • Plugin metrics
  • Moving apps to be a CDP (customer data platform) and what that entails
  • Exposing persons for plugins
  • Improving the UX of the apps application area

However, because the last ~6 months have been invested in some very technical subjects, I don't think we've invested nearly enough in speaking with customers and understanding their experience and needs on this topic. With @lharries and @annikaschmid on board, I think we should reverse that direction, which should give us more confidence about where in this area we can find asymmetric wins.


We might be able to squeeze a priority 3 in here but for a small team it might be too much.
