Team Ingestion Q4 2022 Planning #11749
Comments
Something more to consider:
Consuming from DLQ: this is a bit tricky because of person properties. Re-ingested updates can cause bad overwrites, and the values of some person properties might have changed between the time the event went to the DLQ and re-ingestion. Alternative: fix the sources of those problems and set up alerting to make sure we keep addressing them in a timely manner.
#7710 might also be something to plan on fixing in Q4.
Q: Do we want to be planning a roadmap, or do we care about setting objectives? From the limited information here and in PostHog/posthog.com#4217, it feels to me we're getting too stuck on setting a roadmap and listing important things without having a clear guiding principle behind them.
Rant / background - roadmaps vs objectives
A roadmap defines what we build quite strictly - it's either under-scoped or over-scoped and needs constant iteration to stay up to date. If it's under-scoped, it doesn't give us the tools to think beyond the tasks listed: how do we prioritize what to take on? If it's over-scoped, it's a recipe for disaster: people in the team get frustrated due to a lack of recognition beyond a checklist, people outside the team get frustrated due to a lack of apparent progress, we won't have the tools to tackle unexpected changes in direction, and we end up with poor product experiences from working off a checklist. In the worst cases, I've seen bad roadmaps alone lead to brain drain.
Examples of roadmap items: parallelizing exports, plugin metrics, let's rebrand as acronym-of-the-week.
Objectives, on the other hand, are meant to help guide decisions, e.g. "Let's build the best data export system", "Nail Funnels", etc. This might mean making plugin jobs more reliable, or it might mean plugin metrics. It allows the team to iterate sprint-to-sprint and project-to-project on how to move the needle closer. It makes for better results because it allows for flexibility and from-the-trenches thinking, it is a much more motivating environment to work in because it lets professionals make decisions, it allows for listening to client feedback and iterating with clients towards better decisions, and it allows for easier reprioritization since the guiding lines are set.
Fundamentally, an objectives-based approach is closer to the ideal we want to be selling to our customers: experiment, look at the data, listen to your clients, investigate, and make informed decisions based on the situation.
My suggestions for objectives
Note these are not final - just jotting down my thoughts based on everything above.
Events are ingested, processed, and exported quickly, reliably and accurately
This is from the current set of objectives. Reasoning: This is the backbone of the whole company, is not a solved problem, and will need iteration as new issues occur. The major thing is keeping this rolling and scaling as we onboard larger and larger customers. Some other work that might fall under this:
Nail Data Exports
Reasoning: This aligns with some ongoing work and (I think) with some ideas outlined in PostHog/posthog.com#4217 and feedback we've received. Work that might fall under this:
However, because the last ~6 months have been invested in some very technical subjects, I don't think we've invested nearly enough in speaking with customers and understanding their experience and needs on this topic. With @lharries and @annikaschmid on board, I think we should reverse that direction, which should give us more confidence about where in this area we can find asymmetric wins. We might be able to squeeze a priority 3 in here, but for a small team it might be too much.
Dropping here a few initial thoughts about what we could pick up in the upcoming quarter.
Proposed priorities for Q4
In order, written by someone who's currently exhausted. Opinions might change in the near future.
Options
1. PostHog Pipeline as a Product
1.1 Guaranteed job execution
Jobs have become a massive part of PostHog. They power a good chunk of apps, particularly the more important ones (export apps).
We recently added retries to enqueueing jobs (#11561), but the system needs to be more robust. We need at least another fallback queue, with retries on top.
Something like Aurora + our main Postgres instance. We'd also need to think about how we'd like this to work for self-hosted users, where jobs already go to the main Postgres instance.
The goal here would be to never lose a job.
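To make the "fallback queue, with retries on top" idea concrete, here is a minimal sketch. The `Job` and `JobQueue` types and the `enqueueWithFallback` function are hypothetical names for illustration only, not the plugin server's actual API, and the backoff between attempts is left out.

```typescript
// Hypothetical types for illustration; none of these names come from the PostHog codebase.
interface Job {
    type: string
    payload: Record<string, unknown>
}

interface JobQueue {
    name: string
    enqueue(job: Job): Promise<void>
}

// Try each queue in order, retrying each one a few times before falling back to the next.
// Resolves with the name of the queue that accepted the job, or throws if every queue failed.
async function enqueueWithFallback(job: Job, queues: JobQueue[], retriesPerQueue = 3): Promise<string> {
    const errors: string[] = []
    for (const queue of queues) {
        for (let attempt = 1; attempt <= retriesPerQueue; attempt++) {
            try {
                await queue.enqueue(job)
                return queue.name
            } catch (error) {
                // Assumption: a real implementation would back off here before retrying.
                errors.push(`${queue.name} attempt ${attempt}: ${String(error)}`)
            }
        }
    }
    throw new Error(`Job ${job.type} could not be enqueued anywhere:\n${errors.join('\n')}`)
}
```

With two queues configured, a job only fails once every attempt on both has failed, and at that point the error is raised loudly rather than the job being dropped silently.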
1.2 Plugin metrics (& more?)
A year ago, we added metrics to plugins. This was pretty cool. Devs could add their own metrics, and we automagically added metrics to export plugins. However, some frontend changes to insights broke them, and given we didn't fix it for a long time, we recently yeeted the feature, promising to revisit it in the future.
Metrics are important because:
Segment can work as inspiration here:
Beyond plugin metrics, we might also want to offer ingestion metrics that surface issues with event payloads to users so they can fix their implementations. That would free us to write a dumb (and thus scalable) events service, which does no validation on events and just drops them in Kafka. In order to do this, users need to know which events were dropped because of a payload issue, why, and how many.
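As a rough sketch of what "what, why, and how many" could look like in code: the class below counts dropped events per team and per reason and keeps one sample event name for each. The `DropReason` values and the `DroppedEventMetrics` class are assumptions made up for this example, not existing PostHog code.

```typescript
// Hypothetical drop reasons; the real set would depend on what validation the pipeline performs.
type DropReason = 'missing_distinct_id' | 'invalid_timestamp' | 'payload_too_large'

interface DropSummary {
    reason: DropReason
    count: number
    sampleEvent: string // name of the first event dropped for this reason
}

class DroppedEventMetrics {
    private byTeam = new Map<number, Map<DropReason, DropSummary>>()

    // Record one dropped event for a team, keyed by the reason it was dropped.
    record(teamId: number, reason: DropReason, eventName: string): void {
        const teamMetrics = this.byTeam.get(teamId) ?? new Map<DropReason, DropSummary>()
        const summary = teamMetrics.get(reason) ?? { reason, count: 0, sampleEvent: eventName }
        summary.count += 1
        teamMetrics.set(reason, summary)
        this.byTeam.set(teamId, teamMetrics)
    }

    // What a user-facing "ingestion warnings" view could read from.
    summaryFor(teamId: number): DropSummary[] {
        const teamMetrics = this.byTeam.get(teamId)
        return teamMetrics ? Array.from(teamMetrics.values()) : []
    }
}
```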
1.3 Export plugins
We've recently improved a lot of our export plugins, but there's more work to be done. There are plugins we should refactor, we can write more unit tests, etc.
1.4 More power to jobs
Essentially #10816.
3 key things:
We might also want to consider a "long-running jobs" implementation, which we discussed a year or so ago. Essentially better native support for things like historical exports.
2. Housekeeping
2.1 Evaluate getting rid of Piscina
An idea has floated around to get rid of Piscina, and I do quite like it at a high level.
If we can get rid of this whole thread pool, our codebase will be significantly simpler, which is a big win. We'll be able to understand things better, debug things better, etc.
Trying this out is not too hard - I've written a little PoC for this locally as I was curious. Essentially, the idea would be to replace Piscina with horizontal scaling - spinning up more instances rather than more threads per instance.
If getting rid of Piscina turns out to be slightly less performant and slightly more costly, I'd still think it'd be worth it, for the complexity it gets rid of.
I don't necessarily think Piscina helps us with performance a lot, as the stuff we do in workers is very async-heavy and not too CPU intensive, meaning "single-threaded vanilla" Node should handle it well.
However, the two key things to watch out for are:
I'm particularly worried about 2. Today, if a malicious plugin hogs the event loop for 30s (our current max timeout), or even 5s, we're fine, because the main thread is protected. If we get rid of worker threads, we suddenly need to be a lot more careful with this.
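A small sketch of why this matters without worker threads: synchronous work on the main thread delays everything else scheduled on the event loop, and a timer-based timeout cannot interrupt it. `busyWait` and `measureTimerDelay` are hypothetical helpers written for this illustration only.

```typescript
// Simulates a plugin that hogs the event loop with synchronous work.
function busyWait(ms: number): void {
    const end = Date.now() + ms
    while (Date.now() < end) {
        // spin; nothing else on this thread can run in the meantime
    }
}

// Schedules a timer, then blocks the thread synchronously.
// Resolves with how many milliseconds late the timer fired.
function measureTimerDelay(timerMs: number, blockMs: number): Promise<number> {
    return new Promise((resolve) => {
        const scheduledAt = Date.now()
        setTimeout(() => resolve(Date.now() - scheduledAt - timerMs), timerMs)
        busyWait(blockMs)
    })
}
```

For example, `measureTimerDelay(10, 50)` cannot resolve until the 50ms of blocking work has finished, so the 10ms timer fires roughly 40ms late. With a 30s hog, every other piece of work sharing that thread would stall for the same 30s.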
Honestly, I think it's hard for us to come up with a great solution for this, so we'll probably stick with Piscina for now. But always worth considering the proposal.
2.2 Renaming plugins --> apps
This is perhaps even more annoying of a task than Yeetcode, but one we need to do at some point.
We moved too fast with renaming plugins to apps on the website and UI, and now we're left with a lot of confusion when talking to customers, naming models, writing frontend code, etc.
If we pick this up, might be worth renaming the plugin-server to ingestion-server too.
2.3 Automated testing framework
We've discussed writing a little system to run automated tests on plugins so we can make sure that plugins are working at all times.
This would involve e.g. spying on network calls and running snapshot tests on plugin server changes.
@macobo even mentioned having a good sense of how he'd build this already.
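For illustration, a minimal sketch of the "spy on network calls, then snapshot" idea. Everything here is hypothetical: `createFetchSpy`, `exportEvent`, and the `https://example.com/ingest` URL are stand-ins, and a real framework would likely lean on an existing test runner's snapshot support rather than a hand-rolled comparison.

```typescript
interface RecordedCall {
    url: string
    method: string
    body: string
}

type FetchLike = (url: string, init: { method: string; body: string }) => Promise<{ status: number }>

// Returns a fake fetch that records every call instead of hitting the network.
function createFetchSpy(): { fakeFetch: FetchLike; calls: RecordedCall[] } {
    const calls: RecordedCall[] = []
    const fakeFetch: FetchLike = async (url, init) => {
        calls.push({ url, method: init.method, body: init.body })
        return { status: 200 }
    }
    return { fakeFetch, calls }
}

// Hypothetical export plugin under test: it posts each event to an external endpoint.
async function exportEvent(event: { event: string; distinct_id: string }, fetchImpl: FetchLike): Promise<void> {
    await fetchImpl('https://example.com/ingest', { method: 'POST', body: JSON.stringify(event) })
}

// Compares the recorded calls against a stored snapshot string.
function matchesSnapshot(calls: RecordedCall[], snapshot: string): boolean {
    return JSON.stringify(calls) === snapshot
}
```

If a plugin server change alters the requests a plugin sends, the recorded calls stop matching the stored snapshot and the test fails.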
3. Core pipeline
3.1 Finally make the dead letter queue useful
Some would argue the dead letter queue should be only for events that are malformed and we really cannot process. Whatever we want to call it though, we currently have an "events retry queue" and we don't retry ingestion from it.
Tasks:
3.2 Queueing system V2
Creating a better queue system, whatever that looks like: queues per team, queues per groups of teams, queues per <team, app>, queues per speed, etc.
Doing so would allow us to be more performant and robust.
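To illustrate a few of the partitioning options listed above, here is a sketch of how a queue name could be derived per strategy. The strategy names, the `QueuedWork` shape, and the queue naming scheme are all assumptions for the sake of the example, not a proposal for the actual design.

```typescript
// Hypothetical strategies mirroring some of the options above.
type QueueStrategy = 'per_team' | 'per_team_group' | 'per_team_app'

interface QueuedWork {
    teamId: number
    appId: number
}

// Maps a unit of work to a queue name for the given strategy.
// groupCount only applies to 'per_team_group', where teams are bucketed by teamId modulo groupCount.
function queueNameFor(work: QueuedWork, strategy: QueueStrategy, groupCount = 4): string {
    switch (strategy) {
        case 'per_team':
            return `events_team_${work.teamId}`
        case 'per_team_group':
            return `events_group_${work.teamId % groupCount}`
        case 'per_team_app':
            return `events_team_${work.teamId}_app_${work.appId}`
    }
}
```

The trade-off is isolation versus the number of queues: `per_team_app` keeps one slow app from delaying anything else, but creates far more queues than grouping teams into a fixed number of buckets.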
3.3 Ingestion scaling for self-hosted
We still never got this in: PostHog/charts-clickhouse#243
And it's understandable - it seems simple but comes with a good chunk of complicated configuration.
However, the truth is that currently ingestion on self-hosted instances doesn't scale very well, nor is it very robust.
Thus we need to at least:
4. Miscellaneous ideas
Some random stuff that I have in mind, just dropping here for now. A lot of it is thinking way ahead and probably doesn't make a lot of sense at the moment; it's just anticipating problems we might have a long way down the line.