
bug: slower response times when events api are enabled #203

Open · subzero10 opened this issue Oct 24, 2024 · 8 comments

@subzero10 (Member) commented Oct 24, 2024

A user reported they noticed 300ms slower response times when they enabled Insights.

The user suggested sending the notice after the response is returned.

subzero10 added the bug label Oct 24, 2024
subzero10 self-assigned this Oct 24, 2024
subzero10 changed the title from "Slower response times when events api are enabled" to "bug: slower response times when events api are enabled" Oct 24, 2024
@subzero10 (Member Author) commented:

Update: Unless I'm missing something here, I don't see an issue with our original approach of using a shutdown hook:

  • Events are queued
  • The shutdown function is called after the response is sent.

There are two cases where the events will be sent before the response is sent to the client:

  1. The number of events in the queue exceeds the threshold (default is 50).
  2. Some time has passed since the last time events were sent (defaults to 2 seconds). The timer starts as soon as the class is instantiated, which means that if a request keeps queuing events for more than 2 seconds and has not finished yet, a request will be made to the Events API even though the response has not yet been sent to the client.

If the user is consistently seeing significantly slower response times when enabling Insights, it is possible that they are queueing a lot of events before responding to the client, hence triggering the first case above. In that case, they can simply increase the threshold.
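For reference, here is a bare-bones illustration of the shutdown-hook part in plain PHP (the queue and the sendToEventsApi() helper are placeholders, not the SDK's actual API):

```php
<?php

// Bare-bones illustration of the shutdown-hook approach described above.
// The queue is just an array here; sendToEventsApi() is a placeholder for
// the real HTTP call made by the SDK.
$queuedEvents = [];

register_shutdown_function(function () use (&$queuedEvents) {
    // Per the notes above, this runs after the response has been sent,
    // so flushing the remaining events here does not block the client.
    if ($queuedEvents !== []) {
        sendToEventsApi($queuedEvents);
    }
});

// During the request, events are only appended to the in-memory queue
// (unless one of the two early-flush cases above kicks in).
$queuedEvents[] = ['event_type' => 'checkout.completed', 'ts' => time()];
```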

Sending events using a terminable middleware:
[screenshot: Slack 2024-11-15 09 57 47]

Sending events to Insights using the shutdown hook:
[screenshot: Arc 2024-11-15 09 57 11]
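For anyone comparing the two, the terminable-middleware variant measured above would look roughly like this. This is only a sketch: the middleware name, the container binding, and the flushEvents() call are assumptions, not the shipped implementation.

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

// Sketch of a terminable middleware that defers flushing until Laravel's
// terminate phase, which runs after the response has been sent to the client.
class FlushInsightsEvents
{
    public function handle(Request $request, Closure $next)
    {
        // Nothing to do on the way in; events are queued elsewhere during the request.
        return $next($request);
    }

    public function terminate(Request $request, $response): void
    {
        // Hypothetical binding and method; the real SDK exposes its own flush mechanism.
        app('honeybadger.events')->flushEvents();
    }
}
```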

@joshuap (Member) commented Nov 15, 2024

@subzero10 so that I'm clear:

In cases 1 and 2 above, we're blocking the response while the events are being sent? Or are we sending the events asynchronously, i.e. in a separate thread? We should never be sending events synchronously imo. Am I missing something?

@subzero10 (Member Author) commented Nov 16, 2024

In cases 1 and 2 above, we're blocking the response while the events are being sent?

Yes, that’s correct. PHP applications do not use multi-threading. Each request is handled by a separate process, typically managed by tools like PHP-FPM.

To achieve non-blocking processing, PHP applications typically hand work off to a worker process (e.g., a job queue service). However, not all PHP applications use such queues, and detecting their presence can be non-trivial. That said, for Laravel apps specifically, detecting a queue setup should be straightforward.

We should never be sending events synchronously imo.

Agreed, in most cases involving HTTP requests, synchronous event sending is suboptimal because it can delay the response. However, synchronous processing is sometimes acceptable, such as when running PHP outside the context of an HTTP request. In those scenarios (e.g., CLI processes or worker services), response time is less critical, as each job runs independently in its own process.


Our current approach provides a general-purpose solution for PHP and Laravel applications:

  1. PHP code executed in the context of a web server:
  • Events are queued in memory and sent in batches before the process shuts down (after the response is sent).
  • For cases where a request generates many events over an extended period, sending batches synchronously during request processing is an option. While this may delay the response slightly (a few hundred milliseconds), it’s often acceptable when response time is not a strict requirement.
  2. PHP code executed in the CLI (e.g., artisan) or worker services:
  • In these cases, processes are long-running and independent. Synchronous or asynchronous event sending has minimal impact on the overall operation.

Trade-off:
The current approach may become inefficient when a large number of events are generated in a short time, as sending them synchronously could introduce bottlenecks.

Potential Improvements:
To better address varying use cases, we can adapt our implementation based on the application context:

  1. HTTP Requests:
  • Queue events in memory and send them in batches after the response is sent.
  • For Laravel apps with a queue system:
    • Dispatch a HoneybadgerSendEventsJob to handle event sending asynchronously (see the sketch below).
  2. Non-HTTP Requests:
  • Send events in batches as the in-memory queue fills.
  • For Laravel apps with a queue system:
    • Dispatch a HoneybadgerSendEventsJob for asynchronous processing.
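If we go the queue route, the job could look something like this sketch. HoneybadgerSendEventsJob is only a proposal at this point; the constructor shape, the omitted namespace, and the sendBatchToEventsApi() call are all assumptions:

```php
<?php

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

// Proposal-only sketch; nothing here is final.
class HoneybadgerSendEventsJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    /**
     * @param array<int, array> $events Batch of events captured during the request.
     */
    public function __construct(private array $events)
    {
    }

    public function handle(): void
    {
        // Stand-in for the actual Events API client call.
        sendBatchToEventsApi($this->events);
    }
}
```

The dispatcher would then do something like HoneybadgerSendEventsJob::dispatch($events)->onConnection($connection)->onQueue($queue); instead of sending the batch inline.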

If you agree with the above, I can create issues to start working on them. Let me know if there’s anything else you’d like clarified or adjusted.

@joshuap (Member) commented Nov 21, 2024

To better address varying use cases, we can adapt our implementation based on the application context:

  1. HTTP Requests:
  • Queue events in memory and send them in batches after the response is sent.
  • For Laravel apps with a queue system:
    • Dispatch a HoneybadgerSendEventsJob to handle event sending asynchronously.

The big change here from how it behaves currently (and the solution to this user's problem) would be to use the queue system, right? It already queues the events in memory and sends them in batches after the response is sent (unless the batch fills up)?

@subzero10 (Member Author) commented Nov 21, 2024

The big change here from how it behaves currently would be to use the queue system, right?

Correct. Basically, I'm thinking that we would implement more than one mechanism for sending events and choose the best one based on configuration. I don't know yet if this can be done entirely automatically without any further configuration (current research suggests it can't), because even though Laravel's configuration may point to a queue, the queue worker may not actually be running (e.g. in a local/dev/staging/testing environment). This could cause confusion, so we will have to make sure it is well documented.

(and the solution to this user's problem)

We don't know for sure that this user's problem was that they were generating too many events in a short amount of time. If that was indeed the problem, they could configure the batch threshold to be higher (the default is 50) and their problem would go away even with the current implementation.
🤔 On a similar note, should we raise the default threshold to something bigger (e.g. 100)?

It already queues the events in memory and sends them in batches after the response is sent (unless the batch fills up)?

Yes, correct. This is the job of BulkEventDispatcher.php:

  • It pushes events to an array, and
  • checks whether the array has exceeded the threshold (defaults to 50) or whether the time since the last dispatch is greater than the interval (defaults to 2 seconds).
  • When the process is ready to shut down (after the HTTP response has been sent), flushEvents() is called to send any remaining events.
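Paraphrasing that behavior as code, since it keeps coming up (this is a simplified sketch of the logic described above, not the actual contents of BulkEventDispatcher.php; sendBatch() stands in for the Events API request):

```php
<?php

// Simplified sketch of the dispatch logic described above; not the real class.
class BulkEventDispatcherSketch
{
    private array $events = [];
    private float $lastDispatchedAt;

    public function __construct(
        private int $threshold = 50,              // max queued events before an early flush
        private int $dispatchIntervalSeconds = 2  // max seconds between flushes
    ) {
        $this->lastDispatchedAt = microtime(true);
    }

    public function addEvent(array $event): void
    {
        $this->events[] = $event;

        $overThreshold = count($this->events) >= $this->threshold;
        $intervalElapsed = (microtime(true) - $this->lastDispatchedAt) >= $this->dispatchIntervalSeconds;

        // Cases 1 and 2 from the discussion: these flushes happen during the
        // request and therefore block the response while the batch is sent.
        if ($overThreshold || $intervalElapsed) {
            $this->flushEvents();
        }
    }

    // Called on shutdown (after the HTTP response has been sent) to send leftovers.
    public function flushEvents(): void
    {
        if ($this->events === []) {
            return;
        }

        $this->sendBatch($this->events);
        $this->events = [];
        $this->lastDispatchedAt = microtime(true);
    }

    private function sendBatch(array $events): void
    {
        // The real implementation posts the batch to the Honeybadger Events API.
    }
}
```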

@stympy (Member) commented Nov 21, 2024

I'm fine with bumping the default threshold to 100 and recommending that the impacted customer try adjusting that setting.

Instead of trying to auto-detect (and use) a queueing system, how about we make it an option in the config to enable a queueing system (from a list of supported systems) for sending payloads? Then the customer can choose to use/enable the plugin. This would be analogous to a Ruby user opting to use Sidekiq jobs for sending events.

We could also point customers to using Vector as a proxy: configure the PHP app to deliver payloads to a locally-running Vector instance that is configured to send events on to our API. This would avoid potential network delays, which might have contributed to the reported slowdown.

@subzero10 (Member Author) commented:

I'm fine with bumping the default threshold to 100 and recommending that the impacted customer try adjusting that setting.

👍 OK I will submit a PR for this.

Instead of trying to auto-detect (and use) a queueing system, how about we make it an option in the config to enable a queueing system (from a list of supported systems) for sending payloads? Then the customer can choose to use/enable the plugin. This would be analogous to a Ruby user opting to use Sidekiq jobs for sending events.

After some more thinking, I'm leaning towards this approach as well. Auto-detection will not always work - in that case we will have to fall back to the default implementation. But this will introduce another debugging step to find out which implementation was used. It will probably create more problems along the way, such as queuing Honeybadger events to the wrong queue. We could create a new implementation and allow people to enable it instead of the default approach and allow configuring it as well (i.e. queue connection and queue name). @joshuap, if you are OK with this, we can create an issue and start working on it.
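As a strawman for what the opt-in configuration could look like in config/honeybadger.php (every key below is hypothetical; none of these options exist today):

```php
<?php

// Hypothetical additions to config/honeybadger.php; none of these keys exist today.
return [
    // ... existing Honeybadger configuration ...

    'events' => [
        // 'sync' keeps the current in-memory batching; 'queue' dispatches a job instead.
        'transport' => env('HONEYBADGER_EVENTS_TRANSPORT', 'sync'),

        // Only used when 'transport' is 'queue'.
        'queue_connection' => env('HONEYBADGER_EVENTS_QUEUE_CONNECTION', null),
        'queue_name' => env('HONEYBADGER_EVENTS_QUEUE', 'default'),

        // Batch threshold for the in-memory dispatcher (proposed new default: 100).
        'batch_threshold' => 100,
    ],
];
```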

We could also point customers to using Vector as a proxy: configure the PHP app to deliver payloads to a locally-running Vector instance that is configured to send events on to our API. This would avoid potential network delays, which might have contributed to the reported slowdown.

Correct, that's also an option. I'm thinking it would be a solution for people who have already seen the value of Honeybadger Insights (or are familiar with concepts such as wide events) and would go to the extra effort of setting up Vector in their infrastructure.

@joshuap (Member) commented Dec 13, 2024

After some more thinking, I'm leaning towards this approach as well. Auto-detection will not always work - in that case we will have to fall back to the default implementation. But this will introduce another debugging step to find out which implementation was used. It will probably create more problems along the way, such as queuing Honeybadger events to the wrong queue. We could create a new implementation and allow people to enable it instead of the default approach and allow configuring it as well (i.e. queue connection and queue name). @joshuap, if you are OK with this, we can create an issue and start working on it.

I'm good with this. I agree that it should be user-configured instead of automatically detected.
