Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect query result in vector runtime aggregation on multi-step generated CSUP #5679

Open
philrz opened this issue Feb 26, 2025 · 2 comments · Fixed by #5690 · May be fixed by #5693
Open

Incorrect query result in vector runtime aggregation on multi-step generated CSUP #5679

philrz opened this issue Feb 26, 2025 · 2 comments · Fixed by #5690 · May be fixed by #5693
Assignees
Labels
bug Something isn't working

Comments

@philrz
Copy link
Contributor

philrz commented Feb 26, 2025

tl;dr

Given the WHERE clause, the top line of output in the following aggregation result is incorrect.

$ SUPER_VAM=1 super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"",c:4016547(uint64)}
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}

Details

Repro is with super commit 2f1a964. This is the ClickBench q12 query.

For reasons unrelated to this particular issue (but that ended up surfacing this issue anyway) I happened to be generating my CSUP test data from the original hits CSV in multiple steps:

  1. First loaded it into DuckDB as a table created with DDL ddl_duckdb.sql
  2. Then dumped it out of the DuckDB table as a Parquet file
  3. Then turned that Parquet file into CSUP.

Here I'll show those steps while issuing a query along the way to see the presumed correct result when querying the DuckDB table.

$ duckdb --version
v1.1.3 19864453f7

$ duckdb q12.db < ddl_duckdb.sql 

$ duckdb q12.db -c "COPY bench0 FROM 'hits.csv'"
100% ▕████████████████████████████████████████████████████████████▏ 

$ duckdb q12.db -c "SELECT SearchPhrase, COUNT(*) AS c FROM bench0 WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
┌───────────────────────────┬───────┐
│       SearchPhrase        │   c   │
│          varchar          │ int64 │
├───────────────────────────┼───────┤
│ карелки                   │ 70263 │
│ албатрутдин               │ 34675 │
│ смотреть онлайн           │ 24580 │
│ смотреть онлайн бесплатно │ 21647 │
│ смотреть                  │ 19707 │
│ мангу в зарабей грама     │ 19195 │
│ дружке помещение          │ 17284 │
│ galaxy table              │ 16746 │
│ экзоидные                 │ 16620 │
│ сколько мытищи            │ 12317 │
├───────────────────────────┴───────┤
│ 10 rows                 2 columns │
└───────────────────────────────────┘

$ duckdb q12.db -c "COPY (FROM bench0) TO 'bench0.parquet'"
100% ▕████████████████████████████████████████████████████████████▏ 

$ super -version
Version: v1.18.0-302-g2f1a9643

$ super -f csup -o bench0.csup bench0.parquet

At this point if we issue the same query we did in DuckDB in sequential runtime against the CSUP file, our result matches the one from DuckDB.

$ super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}
{SearchPhrase:"сколько мытищи",c:12317(uint64)}

However, once we repeat it in vector runtime, now we get that first row that includes a count for SearchPhrase:"" which should have been excluded per the WHERE clause.

$ SUPER_VAM=1 super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"",c:4016547(uint64)}
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}
@philrz philrz added the bug Something isn't working label Feb 26, 2025
@philrz
Copy link
Contributor Author

philrz commented Feb 26, 2025

Curiously, the problem doesn't show up if I start from the original hits Parquet (instead of the CSV) and convert that directly to CSUP and query that in vector runtime.

$ super -f csup -o hits.csup hits.parquet

$ SUPER_VAM=1 super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'hits.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}
{SearchPhrase:"сколько мытищи",c:12317(uint64)}

Comparing the types in the two Parquet files, we do see some differences, but only in the time-related fields. Not any of the fields referenced in the query.

$ diff -y bench0.type hits.type
<{								<{
    WatchID: int64,						    WatchID: int64,
    JavaEnable: int16,						    JavaEnable: int16,
    Title: string,						    Title: string,
    GoodEvent: int16,						    GoodEvent: int16,
    EventTime: arrow_timestamp_us=time,			      |	    EventTime: int64,
    EventDate: arrow_date32=time,			      |	    EventDate: uint16,
    CounterID: int32,						    CounterID: int32,
    ClientIP: int32,						    ClientIP: int32,
    RegionID: int32,						    RegionID: int32,
    UserID: int64,						    UserID: int64,
    CounterClass: int16,					    CounterClass: int16,
    OS: int16,							    OS: int16,
    UserAgent: int16,						    UserAgent: int16,
    URL: string,						    URL: string,
    Referer: string,						    Referer: string,
    IsRefresh: int16,						    IsRefresh: int16,
    RefererCategoryID: int16,					    RefererCategoryID: int16,
    RefererRegionID: int32,					    RefererRegionID: int32,
    URLCategoryID: int16,					    URLCategoryID: int16,
    URLRegionID: int32,						    URLRegionID: int32,
    ResolutionWidth: int16,					    ResolutionWidth: int16,
    ResolutionHeight: int16,					    ResolutionHeight: int16,
    ResolutionDepth: int16,					    ResolutionDepth: int16,
    FlashMajor: int16,						    FlashMajor: int16,
    FlashMinor: int16,						    FlashMinor: int16,
    FlashMinor2: string,					    FlashMinor2: string,
    NetMajor: int16,						    NetMajor: int16,
    NetMinor: int16,						    NetMinor: int16,
    UserAgentMajor: int16,					    UserAgentMajor: int16,
    UserAgentMinor: string,					    UserAgentMinor: string,
    CookieEnable: int16,					    CookieEnable: int16,
    JavascriptEnable: int16,					    JavascriptEnable: int16,
    IsMobile: int16,						    IsMobile: int16,
    MobilePhone: int16,						    MobilePhone: int16,
    MobilePhoneModel: string,					    MobilePhoneModel: string,
    Params: string,						    Params: string,
    IPNetworkID: int32,						    IPNetworkID: int32,
    TraficSourceID: int16,					    TraficSourceID: int16,
    SearchEngineID: int16,					    SearchEngineID: int16,
    SearchPhrase: string,					    SearchPhrase: string,
    AdvEngineID: int16,						    AdvEngineID: int16,
    IsArtifical: int16,						    IsArtifical: int16,
    WindowClientWidth: int16,					    WindowClientWidth: int16,
    WindowClientHeight: int16,					    WindowClientHeight: int16,
    ClientTimeZone: int16,					    ClientTimeZone: int16,
    ClientEventTime: arrow_timestamp_us,		      |	    ClientEventTime: int64,
    SilverlightVersion1: int16,					    SilverlightVersion1: int16,
    SilverlightVersion2: int16,					    SilverlightVersion2: int16,
    SilverlightVersion3: int32,					    SilverlightVersion3: int32,
    SilverlightVersion4: int16,					    SilverlightVersion4: int16,
    PageCharset: string,					    PageCharset: string,
    CodeVersion: int32,						    CodeVersion: int32,
    IsLink: int16,						    IsLink: int16,
    IsDownload: int16,						    IsDownload: int16,
    IsNotBounce: int16,						    IsNotBounce: int16,
    FUniqID: int64,						    FUniqID: int64,
    OriginalURL: string,					    OriginalURL: string,
    HID: int32,							    HID: int32,
    IsOldCounter: int16,					    IsOldCounter: int16,
    IsEvent: int16,						    IsEvent: int16,
    IsParameter: int16,						    IsParameter: int16,
    DontCountHits: int16,					    DontCountHits: int16,
    WithHash: int16,						    WithHash: int16,
    HitColor: string,						    HitColor: string,
    LocalEventTime: arrow_timestamp_us,			      |	    LocalEventTime: int64,
    Age: int16,							    Age: int16,
    Sex: int16,							    Sex: int16,
    Income: int16,						    Income: int16,
    Interests: int16,						    Interests: int16,
    Robotness: int16,						    Robotness: int16,
    RemoteIP: int32,						    RemoteIP: int32,
    WindowName: int32,						    WindowName: int32,
    OpenerName: int32,						    OpenerName: int32,
    HistoryLength: int16,					    HistoryLength: int16,
    BrowserLanguage: string,					    BrowserLanguage: string,
    BrowserCountry: string,					    BrowserCountry: string,
    SocialNetwork: string,					    SocialNetwork: string,
    SocialAction: string,					    SocialAction: string,
    HTTPError: int16,						    HTTPError: int16,
    SendTiming: int32,						    SendTiming: int32,
    DNSTiming: int32,						    DNSTiming: int32,
    ConnectTiming: int32,					    ConnectTiming: int32,
    ResponseStartTiming: int32,					    ResponseStartTiming: int32,
    ResponseEndTiming: int32,					    ResponseEndTiming: int32,
    FetchTiming: int32,						    FetchTiming: int32,
    SocialSourceNetworkID: int16,				    SocialSourceNetworkID: int16,
    SocialSourcePage: string,					    SocialSourcePage: string,
    ParamPrice: int64,						    ParamPrice: int64,
    ParamOrderID: string,					    ParamOrderID: string,
    ParamCurrency: string,					    ParamCurrency: string,
    ParamCurrencyID: int16,					    ParamCurrencyID: int16,
    OpenstatServiceName: string,				    OpenstatServiceName: string,
    OpenstatCampaignID: string,					    OpenstatCampaignID: string,
    OpenstatAdID: string,					    OpenstatAdID: string,
    OpenstatSourceID: string,					    OpenstatSourceID: string,
    UTMSource: string,						    UTMSource: string,
    UTMMedium: string,						    UTMMedium: string,
    UTMCampaign: string,					    UTMCampaign: string,
    UTMContent: string,						    UTMContent: string,
    UTMTerm: string,						    UTMTerm: string,
    FromTag: string,						    FromTag: string,
    HasGCLID: int16,						    HasGCLID: int16,
    RefererHash: int64,						    RefererHash: int64,
    URLHash: int64,						    URLHash: int64,
    CLID: int32							    CLID: int32
}>								}>

@philrz
Copy link
Contributor Author

philrz commented Mar 5, 2025

In attempting to verify the fix merged from linked PR #5690, I see we've changed the result when this query is executed in vector runtime against CSUP to show a row of null values rather than empty strings. But those null values are not present when the query is executed in sequential runtime against CSUP nor in vector runtime against Parquet. So I'm reopening this issue.

$ super -version
Version: v1.18.0-310-g7f901d3b

$ SUPER_VAM=1 super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:null(string),c:4016547(uint64)}
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}

$ super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.csup' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}
{SearchPhrase:"сколько мытищи",c:12317(uint64)}

$ SUPER_VAM=1 super -c "SELECT SearchPhrase, COUNT(*) AS c FROM 'bench0.parquet' WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;"
{SearchPhrase:"карелки",c:70263(uint64)}
{SearchPhrase:"албатрутдин",c:34675(uint64)}
{SearchPhrase:"смотреть онлайн",c:24580(uint64)}
{SearchPhrase:"смотреть онлайн бесплатно",c:21647(uint64)}
{SearchPhrase:"смотреть",c:19707(uint64)}
{SearchPhrase:"мангу в зарабей грама",c:19195(uint64)}
{SearchPhrase:"дружке помещение",c:17284(uint64)}
{SearchPhrase:"galaxy table",c:16746(uint64)}
{SearchPhrase:"экзоидные",c:16620(uint64)}
{SearchPhrase:"сколько мытищи",c:12317(uint64)}

@philrz philrz reopened this Mar 5, 2025
mattnibs added a commit that referenced this issue Mar 5, 2025
This commit fixes an issue with filters in the vector runtime where
boolean vectors with true values that were also null were not getting
filtered.

Closes #5679
mattnibs added a commit that referenced this issue Mar 5, 2025
This commit fixes an issue with filters in the vector runtime where
boolean vectors where null values that were also set to true were not
getting filtered as they should.

Closes #5679
@mattnibs mattnibs linked a pull request Mar 5, 2025 that will close this issue
mattnibs added a commit that referenced this issue Mar 5, 2025
This commit fixes an issue with filters in the vector runtime where
boolean vectors where null values that were also set to true were not
getting filtered as they should.

Closes #5679
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants