Skip to content
This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Add an option to reject all non-utf8 data #1728

Closed
replay opened this issue Mar 25, 2020 · 4 comments · Fixed by #1831
Closed

Add an option to reject all non-utf8 data #1728

replay opened this issue Mar 25, 2020 · 4 comments · Fixed by #1831
Assignees
Labels
Milestone

Comments

@replay
Copy link
Contributor

replay commented Mar 25, 2020

We want to be able to reject all data where the metric name or associated tags / values contain non-ASCII chars. Some backends such as Cassandra don't accept all byte values as string chars, which can lead to MT "hanging" because Cassandra keeps rejecting the insert query while MT keeps retrying it.

There should be a flag which makes the input validation stricter and which makes it reject all non-ascii chars.

@replay replay added the feature label Mar 25, 2020
@replay
Copy link
Contributor Author

replay commented Mar 25, 2020

Another option would be to only reject data which does not pass the validation by unicode/utf8.Valid(): https://golang.org/pkg/unicode/utf8/#Valid
This would be more permissive and it would likely solve the issue of cassandra rejecting certain byte values, although I'm not 100% sure about that.

@tehlers320
Copy link

@replay i have a question on how this works, more specifically why the archive code block looks like it has a limiter on the amount of retries but the main block does not.

Would something like this be better?
tehlers320@abd4d0e

I think its just spinning forever on failures i put a test in just for my sanity with how the code block is today and it goes a long long time on bad data.
tehlers320@5b0f949

attempt 914220
attempt 914240
attempt 914260
attempt 914280
attempt 968440
attempt 914300

i took out the sleep but this would block the entire write channel if it made it to 914300 attempts.

@tehlers320
Copy link

Yeah, i think this confirms it. As soon as the index is spinning its wheels on a bad chunk of data everything else stalls. scaled the error to highlight when the rest of metrictank stops operating
Screenshot from 2020-04-13 16-09-24

@Dieterbe Dieterbe added this to the sprint-10 milestone Apr 14, 2020
@Dieterbe Dieterbe changed the title Add an option to reject all non-ascii data Add an option to reject all non-utf8 data Apr 27, 2020
@Dieterbe
Copy link
Contributor

We think that we can most likely address this by simply always rejecting non-utf8 identifiers.
Maybe make it optional, but seems like that should always be sane.
Ascii is suposed to be a subset of utf-8 anyway.

Thoughts anyone? @shanson7 maybe?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants