Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTINCT does not operate on string fields #6495

Closed
beckettsean opened this issue Apr 28, 2016 · 10 comments
Closed

DISTINCT does not operate on string fields #6495

beckettsean opened this issue Apr 28, 2016 · 10 comments
Assignees
Milestone

Comments

@beckettsean
Copy link
Contributor

Bug report

System info: [Include InfluxDB version, operating system name, and other relevant details]

InfluxDB 0.12.2 on Ubuntu 14.04, fresh install

Steps to reproduce:

  1. insert string fields
  2. query string field with DISTINCT

Expected behavior: [What you expected to happen]

Return from DISTINCT(<string>) would be the number of distinct strings.

Actual behavior: [What actually happened]

Return was null.

Additional info: [Include gist of relevant config, logs, etc.]

> show field keys from a
name: a
-------
fieldKey
string
value

> select * from a
name: a
-------
time            string  value
1461695896546713026 12  12
1461856121749763139 11  11
1461856127357529714 10  10
1461856128790092843 10  10
1461856129686020379 10  10
1461856131452838575 11  11
1461856136245585327 13  13

> select distinct(value) from a
name: a
-------
time    distinct
0   12
0   11
0   10
0   13

> select distinct(string) from a
> select count(string) from a
name: a
-------
time    count
0   7

> format json
> select distinct(string) from a
{"results":[{}]}
@beckettsean
Copy link
Contributor Author

@jsternberg ^^

@jsternberg jsternberg self-assigned this Apr 28, 2016
@beckettsean
Copy link
Contributor Author

@jsternberg I created the data with the following:

> insert a value=12,string="12"
> insert a value=11,string="11"
> insert a value=10,string="10"
> insert a value=10,string="10"
> insert a value=10,string="10"
> insert a value=11,string="11"
> insert a value=13,string="13"

@jsternberg
Copy link
Contributor

@beckettsean I'm unable to reproduce this with 0.12.2. I'm using one of the pending Docker images to ensure I have a clean slate.

> create database mydb
> use mydb
Using database mydb
> insert a value=12,string="12"
> insert a value=11,string="11"
> insert a value=10,string="10"
> insert a value=10,string="10"
> insert a value=10,string="10"
> insert a value=11,string="11"
> insert a value=13,string="13"
> select * from a
name: a
-------
time                    string  value
1461857260244815974     12      12
1461857267066315206     11      11
1461857271181150496     10      10
1461857275369303495     10      10
1461857279493054542     10      10
1461857287331499229     11      11
1461857292976276325     13      13

> select distinct(value) from a
name: a
-------
time    distinct
0       12
0       11
0       10
0       13

> select distinct(string) from a
name: a
-------
time    distinct
0       12
0       11
0       10
0       13

> select count(string) from a
name: a
-------
time    count
0       7

> format json
> select distinct(string) from a
{"results":[{"series":[{"name":"a","columns":["time","distinct"],"values":[[0,"12"],[0,"11"],[0,"10"],[0,"13"]]}]}]}

Can you confirm you're using the correct version?

@beckettsean
Copy link
Contributor Author

On checking, my system is actually 0.12.1. Is there reason to believe this is fixed on 0.12.2?

# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.12.1
InfluxDB shell 0.12.1
> 

@beckettsean
Copy link
Contributor Author

Still happens for me on 0.12.2:

# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.12.2
InfluxDB shell 0.12.2
> use mydb
Using database mydb
> select distinct(value) from a
name: a
-------
time    distinct
0   12
0   11
0   10
0   13

> select distinct(string) from a
> 

@beckettsean
Copy link
Contributor Author

Replicated on new data:

> insert b value=1,string="a"
> insert b value=2,string="b"
> insert b value=2,string="b"
> insert b value=1,string="a"
> insert b value=3,string="c"
> select * from b
name: b
-------
time            string  value
1461859961560359056 a   1
1461859965408355832 b   2
1461859966311601860 b   2
1461859967615202434 a   1
1461859972269952068 c   3

> select distinct(value) from b
name: b
-------
time    distinct
0   1
0   2
0   3

> select distinct(string) from b
> 

@beckettsean
Copy link
Contributor Author

Not a syntax issue as far as I can tell:

> select distinct("string") from b
> select distinct('string') from b
ERR: error parsing query: expected field argument in distinct()
> 

@beckettsean
Copy link
Contributor Author

I spun up a brand new instance, installed 0.12.2, and DISTINCT operates as expected:

# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.12.2
InfluxDB shell 0.12.2
> create database mydb
> user mydb
ERR: error parsing query: found USER, expected SELECT, DELETE, SHOW, CREATE, DROP, GRANT, REVOKE, ALTER, SET, KILL at line 1, char 1
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
> use mydb
Using database mydb
> insert a number=1,string="a"
> insert a number=1,string="a"
> insert a number=2,string="b"
> insert a number=2,string="b"
> insert a number=3,string="c"
> select * from a
name: a
-------
time            number  string
1461860834329907105 1   a
1461860836969349116 1   a
1461860841620218718 2   b
1461860842520594123 2   b
1461860848010369577 3   c

> select distinct(number) from a
name: a
-------
time    distinct
0   1
0   2
0   3

> select distinct(string) from a
name: a
-------
time    distinct
0   a
0   b
0   c

I'm not sure what it is about my other installations that prevents this, but it appears that it cannot be reproduced on a brand new box.

@jsternberg
Copy link
Contributor

I looked at the underlying data set and I think it's caused by a side-effect of a bad decision in the underlying query engine. If a shard returns no iterators for a query, it returns a fake float iterator. When casting is done, floats are given priority and all non-float iterators are discarded, making it appear like there's no data.

When selecting as a raw field, this doesn't happen since the work is done through auxiliary iterators which don't have this problem. I'll start working on a fix.

@jsternberg
Copy link
Contributor

To be clear, this should happen with any aggregation function for anything lower than a float that has differing output types (it should not affect count()).

jsternberg added a commit that referenced this issue May 3, 2016
If a shard is empty for a specific field and the field type is something
other than a float, a nil iterator would get returned from one of the
empty shards and cause the combined iterators to be cast to the float
type and all other iterator types to be discarded (or for integers, to
be cast).

This is rare since most aggregates don't accept strings or booleans, but
for queries like:

    SELECT distinct(string) FROM mydata

It would result in nothing getting returned if one of the shards didn't
have a value for `string`.

This change modifies the query engine to return nil for the shards
instead of a fake iterator and then to only use the fake iterator if the
final aggregate iterator is nil (meaning that no iterators could be
constructed for the field from any shard).

Fixes #6495.
@timhallinflux timhallinflux added this to the 0.13.0 milestone Dec 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants