Field type conflict blocks the output buffer #2245

Closed
kostasb opened this issue Jan 10, 2017 · 9 comments · Fixed by #2311

kostasb commented Jan 10, 2017

High level description: A type mismatch for a single point totally blocks the output buffer for outputs.influxdb.

Version tested: v1.1.2

input plugins: any
output plugins: outputs.influxdb in HTTP mode

Conditions:

A field key for a measurement already exists with a defined data type (e.g. float) in the backend InfluxDB.
A metric arrives at Telegraf for that measurement and field with a different data type (e.g. int).

Issue:

The output queue gets stuck because of the type mismatch. Telegraf retries writing the mismatched point indefinitely and never flushes the output buffer. All subsequent points pile up in the buffer and are not written to InfluxDB.

e.g., using the Telegraf inputs.http_listener:

curl -v "http://localhost:8186/write?db=telegraf" --data-binary "test value=1"
curl -v "http://localhost:8186/write?db=telegraf" --data-binary "test value=1i"

2017/01/10 13:46:46 I! Output [influxdb] buffer fullness: 1 / 10000 metrics. Total gathered metrics: 2. Total dropped metrics: 0.
2017/01/10 13:46:46 E! InfluxDB Output Error: {"error":"field type conflict: input field \"value\" on measurement \"test\" is type integer, already exists as type float"}

curl -v "http://localhost:8186/write?db=telegraf" --data-binary "test value=1"

2017/01/10 13:47:08 I! Output [influxdb] buffer fullness: 2 / 10000 metrics. Total gathered metrics: 3. Total dropped metrics: 0.
2017/01/10 13:47:08 E! InfluxDB Output Error: {"error":"field type conflict: input field \"value\" on measurement \"test\" is type integer, already exists as type float"}

Result: Output buffer fullness increases indefinitely.
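
To make the failure mode concrete, here is a toy model of the behavior reported above. It is only a sketch under stated assumptions: `FakeInfluxDB` and `RetryForeverBuffer` are hypothetical names, not Telegraf or InfluxDB code, and the model assumes that the server rejects a whole batch when any point conflicts and that the client resends everything it has buffered on each flush.

```python
# Toy model of the stuck buffer; class and method names are hypothetical.

class FakeInfluxDB:
    """Rejects a whole batch if any point conflicts with a known field type."""

    def __init__(self):
        self.field_types = {}  # (measurement, field) -> Python type
        self.points = []

    def write(self, batch):
        for measurement, field, value in batch:
            known = self.field_types.get((measurement, field))
            if known is not None and known is not type(value):
                return False  # conflict: nothing from this batch is stored
        for measurement, field, value in batch:
            self.field_types.setdefault((measurement, field), type(value))
            self.points.append((measurement, field, value))
        return True


class RetryForeverBuffer:
    """Keeps every unwritten metric and resends all of them on each flush."""

    def __init__(self, db):
        self.db = db
        self.pending = []

    def add_and_flush(self, point):
        self.pending.append(point)
        if self.db.write(self.pending):
            self.pending = []
        return len(self.pending)  # buffer fullness after the flush


db = FakeInfluxDB()
buf = RetryForeverBuffer(db)
fullness = [
    buf.add_and_flush(("test", "value", 1.0)),  # float: accepted
    buf.add_and_flush(("test", "value", 1)),    # int: conflicts with float
    buf.add_and_flush(("test", "value", 1.0)),  # valid, but stuck behind the int
]
print(fullness)
```

In this model the fullness goes 0, 1, 2, mirroring the "1 / 10000" and "2 / 10000" log lines above: the third point is valid but is never written because it shares a batch with the conflicting one.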

Sample config used for the above test:

[[inputs.http_listener]]
  service_address = ":8186"
  read_timeout = "10s"
  write_timeout = "10s"
  max_body_size = 0
  max_line_size = 0

[[outputs.influxdb]]
  urls = ["http://localhost:8086"] # required
  database = "telegraf" # required
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

Proposal: add a max-attempts parameter for type conflicts to avoid indefinitely blocking the buffer.
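
A minimal sketch of what such a cap could look like, assuming a buffer that retries its pending batch on every flush. The `max_attempts` name and the `CappedRetryBuffer` class are hypothetical and do not correspond to any existing Telegraf option. Note that this sketch counts every failed write, whereas the proposal is specific to type conflicts, so a real implementation would need to tell the two apart.

```python
# Sketch of the proposed cap; `max_attempts` is a hypothetical parameter name.

class CappedRetryBuffer:
    def __init__(self, write, max_attempts=3):
        self.write = write            # callable(batch) -> True on success
        self.max_attempts = max_attempts
        self.pending = []
        self.attempts = 0             # failed attempts for the pending batch
        self.dropped = 0

    def add(self, point):
        self.pending.append(point)

    def flush(self):
        if not self.pending:
            return
        if self.write(list(self.pending)):
            self.pending, self.attempts = [], 0
            return
        self.attempts += 1
        if self.attempts >= self.max_attempts:
            # Give up: discard the batch so later metrics are not blocked.
            self.dropped += len(self.pending)
            self.pending, self.attempts = [], 0


def write(batch):
    # Stand-in output that rejects any batch containing the "bad" point.
    return "bad" not in batch


buf = CappedRetryBuffer(write, max_attempts=3)
buf.add("bad")
for _ in range(3):
    buf.flush()          # three failed attempts, then the batch is dropped
buf.add("good")
buf.flush()              # no longer blocked by the bad point
print(buf.dropped, buf.pending)
```

The trade-off is that any metrics that happen to share the batch with the bad point are discarded with it once the cap is reached.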

wiebeytec commented Jan 10, 2017

I guess it would go too much against the design philosophy of Telegraf to do bookkeeping of the (type of) measurements as they come in? InfluxDB creates the measurement with a type when it first sees it. Telegraf could do the same, so it would be able to do type checking.

Author

kostasb commented Jan 10, 2017

@wiebeytec As of the current version, Telegraf has no persistence, so upon restart it would have no way to know what schema already exists in InfluxDB unless it performs schema exploration.
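
The bookkeeping idea and the restart limitation can both be illustrated with a short sketch. `FieldTypeTracker` is a hypothetical helper written for this example, not something that exists in Telegraf, and it uses Python types as a stand-in for InfluxDB field types.

```python
class FieldTypeTracker:
    """Remembers the first type seen for each (measurement, field) pair."""

    def __init__(self):
        self.seen = {}

    def check(self, measurement, field, value):
        """Return True if the value matches the first type seen for the field."""
        first = self.seen.setdefault((measurement, field), type(value))
        return first is type(value)


tracker = FieldTypeTracker()
results = [
    tracker.check("test", "value", 1.0),  # first sighting: float is recorded
    tracker.check("test", "value", 1),    # int vs. recorded float: mismatch
]

# A restart loses the in-memory state, so the int is now accepted as "first",
# even though the database may still hold the field as a float.
restarted = FieldTypeTracker()
after_restart = restarted.check("test", "value", 1)
print(results, after_restart)
```

The in-memory tracker catches the mismatch only while it is running. After a restart it has no record of the existing schema, which is the gap described in the reply above.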

Contributor

sparrc commented Jan 10, 2017

Telegraf can't track the types of points as they flow through the system, as this doesn't provide any guarantees with data that comes from a separate source anyways.

I'm not sure of the best way to handle this. Maybe if InfluxDB fails the write, we should discard that batch (Influx does write the well-formed points even when it returns an error code).

Author

kostasb commented Jan 10, 2017

@sparrc we need to test this further. In my repro case, once the buffer gets blocked due to the type mismatch, no new points make it into InfluxDB.

Contributor

phemmer commented Jan 10, 2017

> Influx does write the well-formed points even when it returns an error code

Actually, this is false. See influxdata/influxdb#4856 (comment)

Contributor

sparrc commented Jan 10, 2017

@phemmer, you're right. I thought that InfluxDB handled mismatched types the same way as malformed points (it doesn't).

I created an issue but it might be a dupe: influxdata/influxdb#7814

Contributor

sparrc commented Jan 10, 2017

Unfortunately, I don't think there is anything Telegraf can do to fix this at the moment.

I'd like to hear what other users think, but it might be best to simply drop batches when receiving a 400. This could lead to dropped metrics in the case of mismatched types, but the only alternative is to let the mismatched point get stuck in the buffers, which can only be recovered from by restarting Telegraf (and thus dropping even more points).

As @phemmer mentioned, the only sure-fire workaround for now will be to use 1-metric batch sizes until influxdata/influxdb#7814 is fixed.
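
A rough sketch of the "drop the batch on a 400" idea, and of why the batch size matters while the server rejects whole batches. Everything here is an assumption made for illustration: `flush` and `send` are hypothetical names, and the stand-in server simply treats any integer value as a type conflict and rejects the entire batch that contains it.

```python
def flush(points, batch_size, send):
    """Send points in batches; drop a batch on HTTP 400, keep it otherwise.

    `send(batch)` is a hypothetical callable returning an HTTP status code.
    Returns (written, dropped, kept_for_retry).
    """
    written, dropped, kept = [], [], []
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        status = send(batch)
        if 200 <= status < 300:
            written.extend(batch)
        elif status == 400:
            dropped.extend(batch)   # retrying cannot succeed; discard
        else:
            kept.extend(batch)      # e.g. a 5xx response: retry later
    return written, dropped, kept


def send(batch):
    # Stand-in server: rejects the whole batch if it contains an int value.
    return 400 if any(isinstance(v, int) for v in batch) else 204


points = [1.0, 2.0, 3, 4.0]
_, dropped_big, _ = flush(points, batch_size=4, send=send)
_, dropped_one, _ = flush(points, batch_size=1, send=send)
print(len(dropped_big), len(dropped_one))
```

Under these assumptions, a single conflicting point costs the whole four-point batch, while with one-point batches only the conflicting point itself is lost.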

Author

kostasb commented Jan 10, 2017

+1 on waiting for InfluxDB issue #7814 to be fixed and then dropping the points in a batch that receives a 400.

@sparrc sparrc added this to the 1.2.0 milestone Jan 11, 2017
sparrc added a commit that referenced this issue Jan 24, 2017
If we write a batch of points and get a "field type conflict" error
message in return, we should drop the entire batch of points because
this indicates that one or more points have a type that doesn't match the
database.

These errors will never go away on their own, and InfluxDB will
successfully write the points that don't have a conflict.

closes #2245
Contributor

sparrc commented Jan 24, 2017

I have a fix for this at #2311.

One caveat is that this fix will only work in combination with InfluxDB version 1.2+.
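
The behavior described in the commit message can be sketched roughly as follows. This is an illustration only and not the code from #2311: `is_field_type_conflict` and `handle_write_error` are hypothetical names, and the only thing taken from the thread is the shape of the error body shown in the logs above.

```python
import json

CONFLICT_MARKER = "field type conflict"


def is_field_type_conflict(response_body):
    """Return True if an InfluxDB error body reports a field type conflict."""
    try:
        message = json.loads(response_body).get("error", "")
    except (ValueError, AttributeError):
        return False  # not JSON, or not a JSON object
    return isinstance(message, str) and CONFLICT_MARKER in message


def handle_write_error(batch, response_body):
    """Return the points to keep for retry after a failed write."""
    if is_field_type_conflict(response_body):
        return []        # drop the batch: the error will never clear on its own
    return list(batch)   # other errors: keep the batch and retry later


body = (
    '{"error":"field type conflict: input field \\"value\\" on measurement '
    '\\"test\\" is type integer, already exists as type float"}'
)
print(handle_write_error(["test value=1i"], body))
print(handle_write_error(["test value=1"], '{"error":"timeout"}'))
```

Dropping the whole batch loses little only if the server has already stored the non-conflicting points, which, according to the commit message and the caveat above, is the case from InfluxDB 1.2 onward.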

njwhite pushed a commit to njwhite/telegraf that referenced this issue Jan 31, 2017
maxunt pushed a commit that referenced this issue Jun 26, 2018