RFC5424: error on non UTF-8 free-form message #21

redrampage · 2019-06-01T19:49:05Z

Hi,
There seems to be a problem parsing RFC5424 messages that contain non-UTF-8 bytes/sequences in the free-form message field (MSG). The parser returns the following error:

expecting a free-form optional message in UTF-8 (starting with or without BOM)

But according to RFC5424, this field may contain data in any encoding.
Could you please make the parser more relaxed about this?

Thanks!

leodido · 2019-06-17T05:45:29Z

Hello @redrampage, at the moment the parser simply implements what the grammar mandates.

MSG             = MSG-ANY / MSG-UTF8
MSG-ANY         = *OCTET ; not starting with BOM
MSG-UTF8        = BOM UTF-8-STRING
BOM             = %xEF.BB.BF
UTF-8-STRING    = *OCTET ; UTF-8 string as specified
                         ; in RFC 3629
OCTET           = %d00-255

Anyway, implementing an option that disables the rejection of invalid UTF-8 sequences is a good idea, imho.
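To make the grammar and the proposed option concrete, here is a minimal Go sketch that uses only the standard library. It is not the parser's actual code: `checkMsg`, its `requireUTF8` flag, and `errInvalidUTF8` are hypothetical names introduced for illustration. With the flag on, it mimics the rejection reported in this issue. With the flag off, it follows the ABNF literally and accepts any octets that do not start with a BOM.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"unicode/utf8"
)

// bom is the UTF-8 byte order mark from the grammar (BOM = %xEF.BB.BF).
var bom = []byte{0xEF, 0xBB, 0xBF}

// errInvalidUTF8 is a hypothetical error for this sketch, not the library's own.
var errInvalidUTF8 = errors.New("MSG is not valid UTF-8")

// checkMsg classifies a MSG field according to the ABNF above.
// A BOM-prefixed message must always be a valid UTF-8 string (MSG-UTF8).
// When requireUTF8 is true, messages without a BOM must be valid UTF-8 too,
// which is stricter than MSG-ANY = *OCTET. When it is false, any octets pass.
func checkMsg(msg []byte, requireUTF8 bool) (string, error) {
	if bytes.HasPrefix(msg, bom) {
		if !utf8.Valid(msg[len(bom):]) {
			return "", errInvalidUTF8
		}
		return "MSG-UTF8", nil
	}
	if requireUTF8 && !utf8.Valid(msg) {
		return "", errInvalidUTF8
	}
	return "MSG-ANY", nil
}

func main() {
	// "café" encoded as ISO-8859-1: the byte 0xE9 is not valid UTF-8 here.
	latin1 := []byte("caf\xe9")
	for _, requireUTF8 := range []bool{true, false} {
		kind, err := checkMsg(latin1, requireUTF8)
		fmt.Printf("requireUTF8=%v kind=%q err=%v\n", requireUTF8, kind, err)
	}
}
```

Keeping the strict check behind a flag means existing callers see no change unless they opt in to the relaxed behaviour.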

@goller WDYT?

goller · 2019-08-08T21:51:16Z

Hey @leodido, I go back and forth on whether we should accept invalid UTF-8 sequences.

On one hand there are many, many loggers that do not get the format correct so it would be nice to help library users; on the other hand I worry that allowing invalid UTF-8 sequences would decrease performance.

Would allowing invalid UTF-8 decrease performance substantially?

leodido · 2019-08-09T16:01:01Z

Hey @goller, first of all let's clarify that this feature will eventually be a parsing option (off by default).

Then, my reasoning about performance.

My intuition is that with this option on, performance would not decrease at all.

This is because, in that case, the generated FSA has fewer edges and arcs than when we check for valid/accepted UTF-8 sequences. Thus, to be conservative, I expect parsing to perform at least as well in this case.

Anyway, it's worth a try, also to verify whether my intuition is right or not :)
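One rough way to test that intuition is a micro-benchmark. The Go sketch below uses only the standard library and is a stand-in, not the generated FSA itself: `scanAny` and `scanUTF8` are hypothetical hand-written scanners, so the numbers it prints only hint at the relative cost of checking UTF-8 versus accepting every octet.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
	"unicode/utf8"
)

// sink keeps the compiler from discarding the benchmarked calls.
var sink int

// scanAny consumes every octet without inspecting it, like MSG-ANY = *OCTET.
// It returns the number of bytes consumed.
func scanAny(msg []byte) int {
	n := 0
	for range msg {
		n++
	}
	return n
}

// scanUTF8 consumes bytes while they form valid UTF-8 sequences and stops at
// the first invalid one. It returns the number of bytes consumed.
func scanUTF8(msg []byte) int {
	n := 0
	for n < len(msg) {
		r, size := utf8.DecodeRune(msg[n:])
		if r == utf8.RuneError && size == 1 {
			return n // invalid encoding at offset n
		}
		n += size
	}
	return n
}

func main() {
	// A valid UTF-8 payload, so both scanners walk the whole message.
	msg := bytes.Repeat([]byte("h\xc3\xa9llo w\xc3\xb6rld "), 100)

	resAny := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = scanAny(msg)
		}
	})
	resUTF8 := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = scanUTF8(msg)
		}
	})

	fmt.Println("any octet:  ", resAny)
	fmt.Println("UTF-8 check:", resUTF8)
}
```

A benchmark against the real parser, with the option on and off, would be the meaningful measurement; this only shows the shape of the comparison.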

leodido · 2020-02-18T13:49:16Z

/assign

(when I have some spare time) :D

leodido self-assigned this Feb 18, 2020

This was referenced May 1, 2020

Allow non UTF8 characters in message #35

Merged

promtail syslog receiver aborts on non-UTF8 logs grafana/loki#1783

Closed

leodido closed this as completed in #35 May 10, 2020
