
Performance enhancements #8

Open
StephenWakely opened this issue Feb 22, 2021 · 3 comments

Comments

@StephenWakely
Owner

StephenWakely commented Feb 22, 2021

Following from the issues discussed in #7 it would be worth looking into ways to improve the performance of the parser.

Reducing allocations

In particular with parsing the structured data it was noted:

That sub-parser allocs some little Vec instances, which is where the hunch about allocations generally comes from. An alternate approach to remove any overhead associated with allocation would be to wire in a bump allocator as GlobalAlloc, drive down that cost as low as possible.

I don't know much about allocators so can't really comment, but would be very interested in exploring it. Is that something that would need to be done on Vector's end, or could syslog_loose set up its own allocator?
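One cheap way to test the allocation hunch before committing to a bump allocator would be a counting wrapper around the system allocator, installed as the GlobalAlloc. This is only a sketch; the `Vec` in `main` is a stand-in for a real `syslog_loose` parse call:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts allocations, to confirm the
/// "allocations are the bottleneck" hunch before wiring in bump allocation.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    // Hypothetical call site; substitute a real syslog_loose parse here.
    let v: Vec<u32> = (0..100).collect();
    let after = ALLOCATIONS.load(Ordering::Relaxed);
    println!("allocations during parse: {}", after - before);
    assert!(after > before);
    drop(v);
}
```

Comparing the count per line between the current code and an iterator-based version would show whether chasing allocations is worth it at all.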

There isn't really a strict reason beyond convenience why the structured data has to be stored in a Vec. We could provide an iterator that allows you to iterate over items. There would still need to be an allocation for the iterator and we would need to parse the structured data twice, the first time just to move past it to get to the message and the second time for the iterator to extract each item. So this might not provide a win.

Unless,

Message could implement Iterator itself. The initial parse would just parse up to the start of the structured data and store a pointer to that point. Iterating would return the next structured data pair, and advance the pointer. Once all the structured data is extracted, a final call could be made to return the rest of the message.

It does not make for a pleasant API, but it could be an acceptable trade off if the performance gains were significant. There are also some caveats around error handling and determining if the message is an RFC 5424 or an RFC 3164 message.
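A rough sketch of what that could look like, with a toy `[key="value"]` grammar standing in for real RFC 5424 structured data. None of these names are the current API, and the parsing is deliberately simplistic:

```rust
/// Sketch of "Message implements Iterator": hold a pointer to the start of
/// the structured data and advance it one pair at a time, no Vec needed.
struct Message<'a> {
    /// Unparsed remainder, positioned at the start of the structured data.
    rest: &'a str,
}

impl<'a> Message<'a> {
    fn new(input: &'a str) -> Self {
        Message { rest: input }
    }

    /// Once the iterator is exhausted, whatever is left is the free-form message.
    fn message(&self) -> &'a str {
        self.rest.trim_start()
    }
}

impl<'a> Iterator for Message<'a> {
    type Item = (&'a str, &'a str);

    /// Parse one key="value" pair and advance the pointer.
    fn next(&mut self) -> Option<Self::Item> {
        let rest = self.rest.trim_start();
        let rest = rest.strip_prefix('[').unwrap_or(rest); // tolerate leading bracket
        let eq = rest.find("=\"")?;
        let (key, tail) = rest.split_at(eq);
        let tail = &tail[2..]; // skip `="`
        let close = tail.find('"')?;
        let value = &tail[..close];
        self.rest = tail[close + 1..].trim_start_matches(']');
        Some((key, value))
    }
}

fn main() {
    let mut msg = Message::new(r#"[id="1" level="info"] the actual message"#);
    assert_eq!(msg.next(), Some(("id", "1")));
    assert_eq!(msg.next(), Some(("level", "info")));
    assert_eq!(msg.next(), None);
    assert_eq!(msg.message(), "the actual message");
}
```

The awkward part shows up in the usage: the caller must drain the iterator before `message()` returns anything sensible, which is exactly the unpleasant-API caveat above.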

Parsing multiple lines at once

Another interesting avenue for exploration would be to see if we can't expose something to parse multiple lines at once. We already have a hint from this PR that the larger the line the cheaper the per-byte cost. Does the same hold true if we're parsing multiple lines in one pass?

I can't think of a reason why this would make much of a difference. nom is fairly linear as it parses the text. There may be some caching things going on that could be worth exploring. Performance issues are never predictable!
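For reference, a batch interface could be as thin as mapping the parser over `lines()` so callers drive a single pass over a whole buffer. The toy `priority: message` format and all names here are illustrative only:

```rust
/// Stand-in for syslog_loose's real per-line parser.
fn parse_line(line: &str) -> Option<(&str, &str)> {
    // Toy "priority: message" format for illustration only.
    let (pri, msg) = line.split_once(':')?;
    Some((pri.trim(), msg.trim()))
}

/// Lazy iterator over parsed lines: one pass over the buffer, no
/// intermediate Vec per line, which is what would let the CPU cache
/// question be measured at all.
fn parse_lines<'a>(buf: &'a str) -> impl Iterator<Item = Option<(&'a str, &'a str)>> + 'a {
    buf.lines().map(parse_line)
}

fn main() {
    let buf = "13: first\n14: second\nnot a message";
    let parsed: Vec<_> = parse_lines(buf).collect();
    assert_eq!(parsed[0], Some(("13", "first")));
    assert_eq!(parsed[1], Some(("14", "second")));
    assert_eq!(parsed[2], None);
}
```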

Optimizing the parser

The way syslog_loose currently works is it attempts to parse the message as a 5424 message. If at any point it can't do this, it will restart the whole parse and attempt to parse it as a 3164. So theoretically, if there was an error at the very end of the structured data, the whole message would be parsed almost twice. This is probably unnecessary as it is generally known in advance what the type of the message is. Simply allowing the caller to indicate the type of the message would potentially almost double the throughput for 3164 messages.
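Sketching the idea with a hypothetical `Variant` enum; the two toy parsers stand in for the real 5424/3164 ones, and the `<`-prefix check stands in for real protocol detection:

```rust
/// Lets the caller pin the protocol up front so a failed RFC 5424 parse
/// never triggers a full RFC 3164 re-parse. Hypothetical API, not
/// syslog_loose's current one.
#[derive(Clone, Copy)]
enum Variant {
    Rfc3164,
    Rfc5424,
    /// Current behaviour: try 5424 first, fall back to 3164.
    Either,
}

fn parse_5424(line: &str) -> Option<String> {
    // Real detection is more involved; a leading '<' PRI stands in for it.
    line.starts_with('<').then(|| format!("5424: {}", line))
}

fn parse_3164(line: &str) -> String {
    format!("3164: {}", line)
}

fn parse(line: &str, variant: Variant) -> String {
    match variant {
        Variant::Rfc5424 => parse_5424(line).expect("caller promised 5424"),
        Variant::Rfc3164 => parse_3164(line),
        // Only the unpinned path pays for the potential double parse.
        Variant::Either => parse_5424(line).unwrap_or_else(|| parse_3164(line)),
    }
}

fn main() {
    assert_eq!(parse("plain old line", Variant::Rfc3164), "3164: plain old line");
    assert_eq!(parse("<13>1 ...", Variant::Either), "5424: <13>1 ...");
    assert_eq!(parse("plain old line", Variant::Either), "3164: plain old line");
}
```

Callers who know their fleet only emits 3164 would take the `Rfc3164` branch and skip the failed 5424 attempt entirely.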

It would be worth creating some benchmarks for 3164 messages to track this.

@blt
Collaborator

blt commented Feb 22, 2021

I don't know much about allocators so can't really comment, but would be very interested in exploring it. Is that something that would need to be done on Vector's end, or could syslog_loose set up its own allocator?

Today this would be Vector's job. Until the allocator-wg work wraps up, the only transparent mechanism is for boxed types -- explicit calls to Box, anything doing heap allocation implicitly like Vec -- to interact with whatever is linked in as the GlobalAlloc. The exact nature of the allocator is less interesting to me, for this discussion, than exploring whether removing allocations would make the library faster. That is, per byte of line, how many cycles are we spending in our library versus in the allocator? Open question today.

There isn't really a strict reason beyond convenience why the structured data has to be stored in a Vec. We could provide an iterator that allows you to iterate over items. There would still need to be an allocation for the iterator and we would need to parse the structured data twice, the first time just to move past it to get to the message and the second time for the iterator to extract each item. So this might not provide a win.

This general strategy is one that crossed my mind, but with regard to the timestamps. How often is the caller actually using the timestamp? Its parse time shows up in the flamegraph pretty clearly. If we were lazy in parsing fields only when the caller requests them, that could be a win. The downside of this approach is that if the caller always uses all fields we don't really gain anything, but it does mean something for callers that use a subset.
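A sketch of the lazy-field idea for timestamps, memoising the parse with `std::cell::OnceCell`. The digit-only `parse_ts` is a toy stand-in for a real chrono parse, and the type names are hypothetical:

```rust
use std::cell::OnceCell;

/// Keeps the raw timestamp slice around and only pays the parse cost
/// if the caller actually asks for the timestamp.
struct LazyMessage<'a> {
    raw_timestamp: &'a str,
    parsed: OnceCell<Option<u64>>,
}

/// Toy stand-in: treat the timestamp as seconds-since-epoch digits.
fn parse_ts(raw: &str) -> Option<u64> {
    raw.parse().ok()
}

impl<'a> LazyMessage<'a> {
    fn new(raw_timestamp: &'a str) -> Self {
        LazyMessage { raw_timestamp, parsed: OnceCell::new() }
    }

    /// First call parses and memoises; later calls are free.
    fn timestamp(&self) -> Option<u64> {
        *self.parsed.get_or_init(|| parse_ts(self.raw_timestamp))
    }
}

fn main() {
    let msg = LazyMessage::new("1613952000");
    // Nothing has been parsed yet; the cost is only paid here:
    assert_eq!(msg.timestamp(), Some(1613952000));
    assert_eq!(msg.timestamp(), Some(1613952000)); // memoised, no re-parse
    let bad = LazyMessage::new("not-a-timestamp");
    assert_eq!(bad.timestamp(), None);
}
```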

It does not make for a pleasant API, but it could be an acceptable trade off if it the performance gains were significant.

Agreed. My general notion around questions like this is to err on the side of having straightforward code and only go wacky once there's a clear need to do so. Fast but wrong is useless and the more complicated the program the likelier it is to go wrong, is my thought. I do, however, see how such an approach would allow us to implement a "lazy" API that we could then use to back the current "strict" API.

I can't think of a reason why this would make much of a difference. nom is fairly linear as it parses the text. There may be some caching things going on that could be worth exploring. Performance issues are never predictable!

Cache coherency is exactly what I'm thinking about here. I didn't measure it -- though it'd be straightforward to do with perf -- but I bet the more we buffer and parse in a batch the better we exploit CPU cache. Above all the other things we've discussed so far this would be my next area of investigation. Strikes me that there'd be very little code change needed to expose a batch interface.

It would be worth creating some benchmarks for 3164 messages to track this..

Excellent call out.

@StephenWakely
Owner Author

How often is the caller actually using the timestamp?

It isn't really Vector's call on which fields get used. It will pull out all the fields, a transform down the road could choose to drop fields if configured to do so. So there isn't really much scope for laziness. We could allow some form of configuration to specify which fields are wanted, but I strongly suspect there would be little advantage gained from it. It is highly likely that the timestamp field would always be wanted. It looks like most of the timestamp parsing time is spent in chrono, which has likely had many eyes on it.

I'll have a bash at the other suggestions above shortly.

@blt
Collaborator

blt commented Feb 23, 2021

It isn't really Vector's call on which fields get used. It will pull out all the fields, a transform down the road could choose to drop fields if configured to do so. So there isn't really much scope for laziness. We could allow some form of configuration to specify which fields are wanted, but I strongly suspect there would be little advantage gained from it.

I buy it.

It is highly likely that the timestamp field would always be wanted. It looks like most of the timestamp parsing time is spent in chrono, which has likely had many eyes on it.

Likely. If timestamp parsing were dominant it'd be worth confirming, but it's not, so that seems like a lower-priority path to explore.

I'll have a bash at the other suggestions above shortly.

Cool. I'll be curious to see what turns up.
