Processing delays when scaling out Event Hub triggered function #1601
Comments
Ran the test again today with the Event Hub auto-inflate feature enabled. That didn't seem to make a difference. Another interesting view of the data is below, which seems to indicate that on each new instance there is an initial timeout before processing even starts. The timeout is 5 minutes, which I believe comes from the default max execution time. The above screenshot is from today, from a function named … Off-topic to the original issue: I checked with a query …
@vjrantal -- you mentioned that you tried this without your DocumentDB output binding. Were the findings the same in that case? As for the App Insights output, we turn on Adaptive Sampling by default, which means that not all records were likely sent. If you'd like to turn it off, you can do that with the host.json settings here: https://github.com/Azure/Azure-Functions/wiki/App-Insights-(Preview)#hostjson-settings (see …)
@brettsam The issue was also reproducible when I removed the Document DB output. In that case, the function's Node.js code was basically a no-op that just calls context.done(). The Application Insights request count is now clear, and the adaptive sampling explains the delta in the count - thanks for the information!
@paulbatum has been doing a lot of Event Hubs performance investigations -- he may have some insights into what you're seeing.
@vjrantal It's worth noting that if you want to keep adaptive sampling on, you can still get execution counts using the aggregated metrics we emit. For example:
If you notice any discrepancies with these metrics, then do let us know, because they should be accurate. For the processing delays, how long are they? Judging from your graph it looks like they are about 5 minutes in length -- is that correct?
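The example query above was shown as a screenshot in the original thread. A hedged sketch of what such an Application Insights Analytics query over the emitted aggregated metrics could look like (the `customMetrics` table exists in App Insights Analytics, but the exact metric name filter here is an assumption):

```
customMetrics
| where name endswith "Count"
| summarize total = sum(value) by name, bin(timestamp, 1m)
```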
@paulbatum Yes, the delay seems to be 5 minutes, which I believe comes from the default max execution time.
@vjrantal Interesting observation. You can change the max execution time in the host.json file (see https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json). It would be great if you could lower that to something like 3 minutes and see if the latency spikes correlate.
@paulbatum I now re-tried the scenario by having the following as my host.json:
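The exact configuration was shown as an image in the original thread. Given the suggestion above to lower the max execution time to 3 minutes, a plausible host.json sketch (the timeout value is an assumption) would be:

```json
{
  "functionTimeout": "00:03:00"
}
```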
It looks like the spikes don't directly correlate, since I can see that the longest delays were longer than 1 minute. In this run, I had a bit over a million messages in an Event Hub, and a function like this https://gist.github.com/vjrantal/49803ec907f78e15b7f5669d013a6262 that processes those messages. Here is the Application Insights overview timeline from the test run:
@paulbatum I re-tried now by toggling the cardinality value, and it looks like the specific issue I refer to may happen only with cardinality one.

As a related finding, when I had cardinality many, what I observed was that during the beginning of the processing, Event Hubs metrics were showing some amount of "Other Errors". I tried to hunt down where in the code these errors are handled, and I might be wrong, but could it be that they are silently ignored by these lines: https://github.com/Azure/azure-webjobs-sdk/blob/901764c1aa610685ce850748103c8554aef3d250/src/Microsoft.Azure.WebJobs.ServiceBus/EventHubs/EventHubListener.cs#L117-L121? Also, I was wondering how checkpointing works when this kind of error occurs, i.e., could I lose messages or will they get re-processed? If useful, I can re-run the scenario once there is an environment where the "Other Errors" are logged and see what the output is.
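For context, the cardinality setting being toggled lives in the function's trigger binding. A sketch of such a function.json, assuming the standard Event Hub trigger binding schema (the path, connection, and parameter names here are placeholders, not the exact values from the test):

```json
{
  "bindings": [
    {
      "type": "eventHubTrigger",
      "direction": "in",
      "name": "eventHubMessages",
      "path": "myhub",
      "connection": "EventHubConnectionString",
      "consumerGroup": "$Default",
      "cardinality": "many"
    }
  ]
}
```

With `"cardinality": "many"`, the runtime passes a batch of events to each function execution; with `"one"`, each event gets its own execution.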
Ahh, that makes lots of sense -- I feel stupid for not considering cardinality earlier. We have determined that the single-threaded nature of JavaScript makes batched execution (i.e., multiple messages passed to a single function execution) even more critical from a performance perspective. We are considering removing 'single' cardinality for event hubs in our next major version release.

Thank you for the pointer on the silent suppression of event hubs errors. I filed the following bug to track this: Azure/azure-webjobs-sdk#1357. I agree it would be good to investigate what those errors are once we have instrumentation for them.

Checkpointing continues if errors occur when processing messages. There is no functionality built in to store failed messages; you need to do it yourself. If you don't, yes, you might have dropped messages. The best way to address this is with appropriate error handling. Unfortunately, the experience for this is not consistent across languages. Here is an example of catching errors when flushing output bindings in C#:

```csharp
public static async Task Run(
    EventData sensorEvent,
    IAsyncCollector<EventData> outputWeatherData,
    TraceWriter log)
{
    var output = ProcessSensorEvent(sensorEvent);
    try
    {
        await outputWeatherData.AddAsync(output);
        await outputWeatherData.FlushAsync();
    }
    catch (Exception e)
    {
        // handle error, log sensorEvent for reprocessing
    }
}
```
Repro steps
1. Create a Cosmos DB with the Document DB API (and a collection with enough throughput so that the DB is not the bottleneck)
2. Create an Event Hub and a consumer group the function will use
3. Create a Node.js function that uses the created Event Hub as the trigger and the Cosmos DB as the output (see https://gist.github.com/vjrantal/1d5d3361bf6f76a2c698c3104b4262ec for the function.json and index.js files I used in my test)
4. Start bombarding the Event Hub with messages and observe how Azure Functions scales out the processing (I used Apache Benchmark to post 1 million events to the hub, using a round-robin model to distribute events evenly across the allocated Event Hub partitions)
Expected behavior
When the runtime determines scaling out is required, it happens without significant delays in event processing.
Actual behavior
As seen in the screenshot below, there seems to be a gap in event processing when the initial scale-out occurs. It also seems like the per-request duration slows down significantly after the scale-out (e.g., the 50th percentile duration before scale-out is something like ~18 ms, but afterwards something like ~2167 ms).
Related information
I ran the test multiple times and tested with cardinality many instead of one. I also tested without the Document DB output. Furthermore, the number of Event Hub partitions did not seem to matter (tried with 4 and 32). In all 5 trials, I was able to reproduce similar behavior.
During the gap in event processing, I can see that some errors occur, which I think I could not catch in my Node.js "user code":
Timestamps in the screenshots are in UTC, and the Function App is called documentdb-function-1496656123610. All resources I created were in the North Europe region. I had not touched host.json, so everything there had the default values.