
union(): Inconsistent results between runs #3053

Closed
SGStino opened this issue Jul 30, 2020 · 10 comments
Labels: kind/bug (Something isn't working), team/query

@SGStino commented Jul 30, 2020

measure = "Tags.PLC_252.TANK1.STAPNR"
windowPeriod = 1s

// Last point before the start of the dashboard's time range.
previous = from(bucket: "Bucket")
  |> range(start: 0, stop: v.timeRangeStart)
  |> filter(fn: (r) => r["_measurement"] == measure)
  |> filter(fn: (r) => r["_field"] == "value")
  |> last()

// First point after the end of the dashboard's time range.
next = from(bucket: "Bucket")
  |> range(start: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == measure)
  |> filter(fn: (r) => r["_field"] == "value")
  |> first()

// Combine both single-row streams; the result should always contain both points.
union(tables: [previous, next])
  |> yield()

Every time I execute the query with a fixed time range, I get one of several different results:

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,0,1970-01-01T00:00:00Z,2020-07-29T04:53:55Z,2020-07-29T04:52:52.044Z,10,value,Tags.PLC_252.TANK1.STAPNR

or

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,0,1970-01-01T00:00:00Z,2020-07-29T04:53:55Z,2020-07-29T04:52:52.044Z,10,value,Tags.PLC_252.TANK1.STAPNR
,,1,2020-07-30T10:53:55.983Z,2020-07-30T11:04:28.372395611Z,2020-07-30T10:54:00.779Z,31,value,Tags.PLC_252.TANK1.STAPNR

or

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,0,2020-07-30T10:53:55.983Z,2020-07-30T11:05:45.208092895Z,2020-07-30T10:54:00.779Z,31,value,Tags.PLC_252.TANK1.STAPNR

This is on the currently live version of Influx Cloud.

Executing the individual from() queries on their own returns exactly one value every single time.

@SGStino SGStino changed the title Inconsistent results between runs union(): Inconsistent results between runs Jul 30, 2020
@SGStino (Author) commented Aug 1, 2020

I've extracted the data for that measure as CSV with the data explorer, in case it helps with reproducing the issue.
2020-08-01-19-25_chronograf_data.zip

@wolffcm wolffcm added team/query kind/bug Something isn't working labels Jan 8, 2021
@nathanielc nathanielc added this to the Sprint 21-Q1-1 milestone Jan 11, 2021
@jsternberg jsternberg self-assigned this Jan 13, 2021
@jsternberg (Contributor) commented

Thank you for reporting this bug. I'll be taking a look at it.

This might be related to a bug in the planner. I thought it was fixed, but I'll double-check. It seems like this could happen if the transformation received the signals in different orders, which is very plausible:

Process -> Process -> Finish -> Finish
Process -> Finish -> Process -> Finish

The table is supposed to be sent when the final finish signal is received, regardless of the order, but it seems like the transformation determines which parents have finished by their dataset IDs. If there's a conflict where both dataset IDs are the same, then we can end up in a situation where the first finish signal marks the current transformation as finished.
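
To make that concrete, here is a minimal sketch of that failure mode in Go, with invented names rather than the actual flux execute types: if the transformation tracks its parents in a map keyed by dataset ID, two parents sharing an ID collapse into a single entry, and the first finish signal looks like the last one.

package main

import "fmt"

type DatasetID string

// transformation tracks which parents still have to finish, keyed by dataset ID.
type transformation struct {
	pending map[DatasetID]bool
}

func newTransformation(parents []DatasetID) *transformation {
	t := &transformation{pending: make(map[DatasetID]bool)}
	for _, id := range parents {
		t.pending[id] = true // duplicate IDs silently overwrite each other here
	}
	return t
}

// Finish marks one parent as done and reports whether all parents are done,
// which is when the merged table would be sent downstream.
func (t *transformation) Finish(id DatasetID) bool {
	delete(t.pending, id)
	return len(t.pending) == 0
}

func main() {
	// Two parents that (incorrectly) share a dataset ID: the map holds one
	// entry, so the union fires after only one of its inputs has finished.
	t := newTransformation([]DatasetID{"dup", "dup"})
	fmt.Println(t.Finish("dup")) // prints true after a single finish signal
}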

These dataset IDs are intended to be UUIDs, and I expect that to be the case. But looking at the code, conflicting dataset IDs are the easiest explanation, so I'm going to check that assumption.

If this is the case, it may only happen in cloud when a pushdown happens. That may mean this won't occur when attempting to replicate it with csv.from().

@jsternberg (Contributor) commented

I think the above is likely correct, but I need to figure out why the plan node IDs are the same, since I thought we had previously fixed that.

This link shows where the UUIDs are generated:

ec.parents[i] = DatasetIDFromNodeID(pred.ID())

This link shows how those UUIDs are generated:

return DatasetID(uuid.NewV5(uuid.UUID{}, string(id)))

Based on the documentation, UUID v5 doesn't seem to have any random component. Testing that: https://play.golang.org/p/24I88SIF5y5
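
The same check inline, assuming the gofrs/uuid package (which has this NewV5 signature); the node ID string here is made up:

package main

import (
	"fmt"

	"github.com/gofrs/uuid"
)

func main() {
	// A v5 UUID is a SHA-1 hash of namespace + name, so identical inputs
	// produce identical UUIDs on every run: there is no random component.
	a := uuid.NewV5(uuid.UUID{}, "merged_ReadRange_last")
	b := uuid.NewV5(uuid.UUID{}, "merged_ReadRange_last")
	fmt.Println(a == b) // always true
}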

While the plan node IDs are supposed to be different in any case, it seems like a problem that we are using UUID v5 at all.

My proposal to fix this is:

  1. Fix the node ids to be different instead of the same.
  2. Stop using UUID v5 and switch to something like UUID v4 so the UUIDs are actually random, as is expected (see the sketch after this list).
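
For the second point, a rough sketch of the change, assuming the gofrs/uuid package; the function name and placement are hypothetical:

package execute // hypothetical placement

import "github.com/gofrs/uuid"

type DatasetID uuid.UUID

// RandomDatasetID would replace DatasetIDFromNodeID: it ignores the plan
// node ID and draws a fresh random (v4) UUID, so two datasets can never
// share an ID even if their plan node IDs collide.
func RandomDatasetID() DatasetID {
	return DatasetID(uuid.Must(uuid.NewV4()))
}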

@nathanielc @wolffcm does that sound like a good idea? I'm not aware of a specific reason why UUID v5 was chosen; it may just have been a misunderstanding of what that function does.

@jsternberg (Contributor) commented

@SGStino looking into this, the commit that fixes it landed on November 2. If you have time, can you retest this? I suspect it will work now.

@nathanielc @wolffcm I still think we should consider using UUID v4 instead of v5, but I believe this bug shouldn't exist anymore.

@wolffcm commented Jan 14, 2021

@jsternberg I think I’m missing some context on that issue. Is it that the two inputs to the union have the same dataset ID because they both had the same plan node ID?

The plan node IDs should definitely be different; it could cause a lot of problems otherwise, so that should be enforced reliably. It could be put somewhere around here, where we validate the finished plan in validatePhysicalPlan:
https://github.com/influxdata/flux/blob/master/plan/physical.go#L87
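
Something like this, say; a minimal sketch with hypothetical stand-ins for the plan package types, not the real API:

package plan

import "fmt"

type NodeID string

// Node is a stand-in for the plan node interface; only its ID matters here.
type Node interface {
	ID() NodeID
}

// validateUniqueIDs would run as part of validating the finished plan and
// reject any plan whose nodes share an ID, instead of letting the collision
// surface later as duplicate dataset IDs.
func validateUniqueIDs(nodes []Node) error {
	seen := make(map[NodeID]struct{}, len(nodes))
	for _, n := range nodes {
		if _, dup := seen[n.ID()]; dup {
			return fmt.Errorf("duplicate plan node id %q", n.ID())
		}
		seen[n.ID()] = struct{}{}
	}
	return nil
}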

I see that this PR was merged, but it looks like it only fixes one instance of the problem:
#3297

I don't really think changing to a random UUID for the dataset ID is needed if the uniqueness of plan node IDs is enforced. I think it's nice that the mapping from plan node ID to dataset ID is deterministic, but I don't feel that strongly about it.

@jsternberg (Contributor) commented

I think we then need to enforce that plan node IDs are unique. I'm concerned, though, that we'll introduce another piece of code that accidentally generates a non-unique plan node, and it will then fail with an opaque error message, since plan nodes aren't something that's exposed to the end user. But at least that's better than potentially breaking behavior.

@wolffcm commented Jan 14, 2021

We could keep a hash set of IDs in the Plan itself and have a helper function to generate unique IDs. It's been a while since I've been in that code, but it seems feasible?
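
Something along these lines, with invented names (the real Plan type may look quite different):

package plan

import "fmt"

type NodeID string

// Plan remembers every node ID it has handed out so far.
type Plan struct {
	usedIDs map[NodeID]bool
}

// UniqueID returns base if it is unused, otherwise base with a numeric
// suffix appended, and records the result so later calls can't reuse it.
func (p *Plan) UniqueID(base NodeID) NodeID {
	if p.usedIDs == nil {
		p.usedIDs = make(map[NodeID]bool)
	}
	id := base
	for i := 1; p.usedIDs[id]; i++ {
		id = NodeID(fmt.Sprintf("%s_%d", base, i))
	}
	p.usedIDs[id] = true
	return id
}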

@jsternberg (Contributor) commented

I think we could do that. I'll dive into it and see what I can manage.

@jsternberg (Contributor) commented

I'm not sure I'm familiar enough with the code to identify the proper place for that. It seems like the node names are kept with the plan node itself rather than as part of the plan spec, and they're hidden behind an interface. The plan spec would be the one to know whether node names conflict with each other, but it doesn't really seem to have a mechanism for renaming nodes.

I'm going to close this issue and open a new one since the bug itself is fixed and we are just talking about potentially refactoring the code or adding an error message.

@jsternberg (Contributor) commented

Fixed and will be followed up on in #3438.
