
DISTINCT does not return correct values #6615

Closed
madshall opened this issue May 12, 2016 · 8 comments
@madshall

madshall commented May 12, 2016

Bug report

System info: 0.12.2, Linux 2.6.32-504.16.2.el6.x86_64 SMP Tue Mar 10 17:01:00 EDT 2015 x86_64 GNU/Linux

Steps to reproduce:

Case 1

  1. Create a measurement and write some mixed string/numeric data into it, where "activeClient" is a string field
  2. Run SELECT DISTINCT(activeClient) FROM "test"

Expected behavior: unique values for the "activeClient" field are returned, along with the correct timestamps of their first occurrence

Actual behavior: Nothing is returned

Case 2

  1. Create a measurement and write some mixed string/numeric data into it, where "activeClient" is a string field
  2. Run SELECT DISTINCT(activeClient) FROM "test" WHERE time > now() - 1h

Expected behavior: unique values for the "activeClient" field are returned, along with the correct timestamps of their first occurrence

Actual behavior: Unique values for the "activeClient" field are returned, but they all carry the same timestamp, exactly now() - 1h, which is wrong because the values appear in the database one after another

@jsternberg
Contributor

This looks like a duplicate of #6495, which should be fixed in 0.13. Please reopen if this is still a problem after upgrading. Thanks!

@stevenviola

@madshall was #6495 related to your issue? I'm trying to get the timestamp of the first occurrence of each distinct value on Influx 1.2.0, and I'm seeing the same thing you did: the timestamps all equal exactly now() - 1h. The distinct function docs say:

Note: Aggregation functions return epoch 0 (1970-01-01T00:00:00Z) as the timestamp unless you specify a lower bound on the time range. Then they return the lower bound as the timestamp.

So by the docs, what you and I are seeing is the expected behavior, but I'm wondering whether there's a workaround to make the timestamps reflect the first occurrence of each value when using distinct.
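To make the documented behavior concrete (a sketch using the field from the original report):

SELECT DISTINCT(activeClient) FROM "test"

stamps every distinct value with epoch 0 (1970-01-01T00:00:00Z), while

SELECT DISTINCT(activeClient) FROM "test" WHERE time > now() - 1h

stamps them all with the lower bound, now() - 1h, exactly as described above.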

@madshall
Author

madshall commented Feb 1, 2017

@stevenviola Looks like DISTINCT in InfluxDB doesn't do what SQL-minded people would expect. I believe I stored the value as both a field and a tag, ran SELECT FIRST(activeClient) FROM "test" GROUP BY activeClient, and then parsed the output. But I'm not 100% sure; it was a long time ago.
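A minimal sketch of that workaround, assuming line protocol writes and a separately named tag so the tag and field don't collide (all names and values illustrative):

test,activeClientTag=alice activeClient="alice" 1462060800000000000
test,activeClientTag=bob activeClient="bob" 1462060860000000000

SELECT FIRST(activeClient) FROM "test" GROUP BY activeClientTag

Since FIRST() is a selector, each group keeps the timestamp of its earliest point, giving one series per distinct value with its first-occurrence time.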

@stevenviola

Yeah, it feels like it might be possible to get the distinct values with a subquery and then query for the first time each of those values appeared, but I'm not sure how elegant or resource-intensive that would be. Running your command returns the timestamp of the first row of the oldest value, but not the timestamps of the other values for activeClient.
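For reference, the shape of that subquery idea would be something like this (a sketch only; as the later comments note, it doesn't actually do the job):

SELECT first(distinct) FROM (SELECT DISTINCT(activeClient) FROM "test")

The inner query has already collapsed the timestamps to epoch 0 (or the lower time bound), so the outer first() has no real first-occurrence times left to select.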

@clarkj

clarkj commented Feb 3, 2017

I'm also looking for an elegant way to grab the first time a distinct value appears. The documentation seems to suggest that in a previous version, DISTINCT returned the first timestamp. Does anyone know why this was changed? @stevenviola how effective was the subquery workaround? I was also considering using INTO to downsample, but I feel like that runs into most of the same problems as a subquery and takes up unnecessary space.

@stevenviola

@clarkj, I think I was incorrect in saying that a subquery would do the job, but @madshall had a good hack: store the value in both a field and a tag, group by the tag, and then take the first value in each group to get the first timestamp for each distinct value.

If you're able to get an INTO statement which does the trick, I'd be interested in seeing it, as that might be easier for me to implement than altering the tags that are set when writing data.

@clarkj

clarkj commented Feb 6, 2017

@stevenviola I've been using a version of @madshall's workaround. It's turned into a rather complex implementation.

For context:
I'm building a database for a resort that wants to better understand daily customer usage in real time. Each day, somewhere between 1,000 and 10,000 distinct customers show up at the resort, out of a pool of 250,000 total customers. Every time a customer uses one of the resort's facilities, they must use their RFID pass to scan in. Each scan is verified and saved, which is what I'm pulling into Influx.

Each scan has tags passType and facilityLocation, and field customerId.

I want to use this to determine the first time a distinct customerId scans (i.e. when they arrived at the resort). I then want to aggregate the results over time, so I can visualize the rate of customer arrivals over the day in Grafana.

If DISTINCT, or an alternative DISTINCT function, returned the first timestamp instead of epoch 0, this would be trivial (and if you wanted the last distinct value instead, you'd have to follow the process below anyway):

select count(distinct) from (select distinct(customerId) from resortScans) where time < $timeMax group by time(15m)

Unfortunately, it does not, so we have to group by customerId. This is problematic because customerId is a field.

Because of the high cardinality of total customers, I can't store customerId as a tag on entry. Instead, I've created a new database with a 24-hour retention policy that clears at 2am every morning (the resort is only open 8am-9pm, so this is not a problem).

I don't think INTO allows transformations, so I'm choosing between using Kapacitor to query the main database and write into the daily one with customerId added as a tag, or (temporarily) having Telegraf write twice: to the long-term database with the id as a field only, and to the daily database with the id as both a field and a tag. The daily database stays below InfluxDB's default cardinality limit, since there will never be more than 10,000 distinct customer visits, and therefore ids, in a day.

In this new daily database, our data now looks like:

scanDaily,custIdTag=1,passLocation=golf,passType=season custIdField=1
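For contrast, a guess at the corresponding point in the long-term database, where the id stays a field only (same illustrative values):

resortScans,passLocation=golf,passType=season customerId=1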

At this point, I used @madshall's workaround:

select first(custIdField) from scanDaily group by custIdTag
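For each tag, the result carries the timestamp of that customer's first scan, roughly like (illustrative values):

name: scanDaily
tags: custIdTag=1
time                  first
----                  -----
2017-02-06T08:12:00Z  1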

This returns the correct results. I would think I could use this in a subquery to calculate the total number of first scans per time interval, but even the simpler query

select count(custIdField) from (select first(custIdField),custIdField from scanDaily group by custIdTag)

returns a completely incorrect count (may be related to #7885). So next, I have a continuous query that does the equivalent of:

select first(custIdField), custIdField INTO scanFirstDistinct from scanDaily group by custIdTag

Finally, I can query the count of first distinct customerIds (i.e. the arrival times of customers):

select count(custIdField) from scanFirstDistinct where time = $timeRange group by time($interval)

So I'm performing 1 transformation (customerId as a field in the main db to a daily copy where customerId is a tag) and 1 continuous query to get my data into a form that can be queried to build a chart in Grafana. All this for something that could be replaced by a DISTINCT function that returns the first timestamp instead of epoch 0. This is even more important given the inability to GROUP BY fields (#7200).

In short, a couple of different issues make this particular use case extremely convoluted to implement. If anyone sees a simpler way of going about this, let me know, but I think the combination of having to turn a high-cardinality unique ID into a tag AND the current bug with group-by-tag subqueries is to blame for a lot of the difficulty.

@srivassid

COUNT(DISTINCT("field")) works, at least in Grafana.
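For anyone who only needs counts rather than first-occurrence timestamps, a sketch using the schema from this thread ($timeFilter is Grafana's time-range macro):

SELECT COUNT(DISTINCT("customerId")) FROM resortScans WHERE $timeFilter GROUP BY time(15m)

This gives per-interval distinct counts, but still not the timestamp of each value's first occurrence.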
