This repository has been archived by the owner on May 21, 2024. It is now read-only.

RFC5424: error on non UTF-8 free-form message #21

Closed
redrampage opened this issue Jun 1, 2019 · 4 comments · Fixed by #35

@redrampage

Hi,
There seems to be a problem parsing RFC5424 messages that contain non-UTF-8 bytes or sequences in the free-form message field (MSG). The parser returns the following error:

expecting a free-form optional message in UTF-8 (starting with or without BOM)

But according to RFC5424, this field may contain data in any encoding.
Could you please make the parser more relaxed about this?

Thanks!

@leodido
Collaborator

leodido commented Jun 17, 2019

Hello @redrampage, at the moment the parser simply implements what the grammar mandates:

MSG             = MSG-ANY / MSG-UTF8
MSG-ANY         = *OCTET ; not starting with BOM
MSG-UTF8        = BOM UTF-8-STRING
BOM             = %xEF.BB.BF
UTF-8-STRING    = *OCTET ; UTF-8 string as specified
                         ; in RFC 3629
OCTET           = %d00-255
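To make the two alternatives in the grammar concrete, here is a minimal Go sketch that classifies a MSG payload by checking for the BOM first. The helper name `classifyMsg` and its return values are assumptions made for illustration; they are not part of the library, whose parser is a generated state machine rather than hand-written checks like these.

```go
package main

import (
	"bytes"
	"unicode/utf8"
)

// bom is the UTF-8 byte order mark from the grammar: BOM = %xEF.BB.BF.
var bom = []byte{0xEF, 0xBB, 0xBF}

// classifyMsg is a hypothetical helper (not the library's API) that reports
// which MSG alternative of the RFC 5424 grammar a payload matches.
func classifyMsg(msg []byte) string {
	if !bytes.HasPrefix(msg, bom) {
		// MSG-ANY = *OCTET ; not starting with BOM
		// Any octets are allowed here, valid UTF-8 or not.
		return "MSG-ANY"
	}
	// MSG-UTF8 = BOM UTF-8-STRING
	// After the BOM, the rest must be valid UTF-8 (RFC 3629).
	if utf8.Valid(msg[len(bom):]) {
		return "MSG-UTF8"
	}
	return "invalid"
}
```

Under this reading, a payload such as `[]byte{0xFF, 0xFE}` without a BOM falls under MSG-ANY, while the same bytes after a BOM match neither alternative.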

Anyway, an option that disables the rejection of invalid UTF-8 sequences is a good idea, imho.

@goller WDYT?

@goller
Contributor

goller commented Aug 8, 2019

Hey @leodido, I go back and forth on whether we should accept invalid UTF-8 sequences.

On one hand, there are many, many loggers that do not get the format right, so it would be nice to help library users; on the other hand, I worry that allowing invalid UTF-8 sequences would decrease performance.

Would allowing invalid UTF-8 decrease performance substantially?

@leodido
Collaborator

leodido commented Aug 9, 2019

Hey @goller, first of all, let's clarify that this feature, if implemented, would be a parsing option (off by default).

Then, my reasoning about performance.

My intuition is that with this option on, performance would not decrease at all.

This is because, in that case, the generated FSA has fewer states and transitions than when we check for valid/accepted UTF-8 sequences. Thus, to be conservative, I expect parsing to perform at least as well in this case.

Anyway, it's worth a try, also to check whether my intuition is right :)
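As a rough illustration of an off-by-default parsing option, here is a minimal Go sketch using the functional options pattern. All names (`msgChecker`, `option`, `withNonUTF8Msg`, `newMsgChecker`) are hypothetical and chosen for this example only; this is not the library's API, and the real parser is a generated state machine rather than a call to `utf8.Valid`.

```go
package main

import (
	"errors"
	"unicode/utf8"
)

// errMsgNotUTF8 reuses the error text quoted in this issue.
var errMsgNotUTF8 = errors.New("expecting a free-form optional message in UTF-8 (starting with or without BOM)")

// msgChecker is a hypothetical stand-in for the parser's MSG handling.
type msgChecker struct {
	allowNonUTF8 bool // false by default, so the strict behavior is unchanged
}

// option configures a msgChecker (hypothetical functional option type).
type option func(*msgChecker)

// withNonUTF8Msg disables the rejection of invalid UTF-8 sequences in MSG.
func withNonUTF8Msg() option {
	return func(c *msgChecker) { c.allowNonUTF8 = true }
}

// newMsgChecker builds a checker and applies the given options in order.
func newMsgChecker(opts ...option) *msgChecker {
	c := &msgChecker{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// check accepts any octets when the option is on; otherwise it requires
// the message to be valid UTF-8.
func (c *msgChecker) check(msg []byte) error {
	if c.allowNonUTF8 || utf8.Valid(msg) {
		return nil
	}
	return errMsgNotUTF8
}
```

With this shape, `newMsgChecker().check(msg)` keeps the current strict behavior, and callers opt in explicitly with `newMsgChecker(withNonUTF8Msg())`. Note that the relaxed path skips the validation step entirely, which is in line with the intuition that it should not be slower.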

@leodido
Collaborator

leodido commented Feb 18, 2020

/assign

(when I have some spare time) :D


3 participants