Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPC-H Query 15 #166

Closed
Tracked by #158
alamb opened this issue Apr 26, 2021 · 7 comments · Fixed by #3393
Closed
Tracked by #158

TPC-H Query 15 #166

alamb opened this issue Apr 26, 2021 · 7 comments · Fixed by #3393
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11528

CREATE VIEW/multiple statement support: "The context currently only supports a single SQL statement"

{{Error: NotImplemented("The context currently only supports a single SQL statement")}}

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@andygrove
Copy link
Member

Latest failure:

Error: NotImplemented("The context currently only supports a single SQL statement")

@DaltonModlin
Copy link
Contributor

I'm looking into this issue and seeing a new error message:

Error: NotImplemented("Unsupported SQL statement: Some(\"CREATE VIEW revenue0 (supplier_no, total_revenue) AS SELECT l_suppkey, sum(l_extendedprice * (1 - l_discount)) FROM lineitem WHERE l_shipdate >= DATE '1996-01-01' AND l_shipdate < DATE '1996-01-01' + INTERVAL '3' MONTH GROUP BY l_suppkey\")")

This comes from missing a match case's if statement at datafusion/sql/src/planner.rs: line 192. Query 15's defined aliases (supplier_no, total_revenue) are populating the columns, thus, columns.is_empty() returns false. Removing the check for empty columns allows the view to be generated, but new issues arise when the second SQL statement in Q15 attempts to query the view via:

select
	s_suppkey,
	s_name,
	s_address,
	s_phone,
	total_revenue
from
	supplier,
	revenue0

where 'total_revenue' is a column in 'revenue0' which is the view.

The table default_catalog.default_schema.revenue0 can't be found in session context via the code path:
datafusion/sql/src/planner.rs: line 684 -> datafusion/core/src/execution/context.rs: line 1539

This doesn't currently generate useful error output, but debugging led me here.

Before moving forward too much further, I'd like context on the columns.is_empty() check if possible. I feel I may be missing something here.

@alamb
Copy link
Contributor Author

alamb commented Aug 9, 2022

git blame https://github.com/apache/arrow-datafusion/blame/master/datafusion/sql/src/planner.rs#L185-L198

seems to show it came in via the initial support for views in #2279 by @matthewmturner

@avantgardnerio
Copy link
Contributor

Removing the check for empty columns allows the view to be generated

@DaltonModlin if the check is removed, do the other tests still pass? If so, perhaps a new test (with aliases) and a new PR targeted at this one isolated change is a good next step. Folks like small PRs around here ;)

@DaltonModlin
Copy link
Contributor

As mentioned in #3266 I believe there are currently 3 more issues to be solved for a successful Q15 run.

  1. The current failure in which the view created in the first part doesn't populate the context's table names hashmap before the select portion's logical plan is being built. I'll create a new issue and mention this issue for easier reference.
  2. Add support for non-correlated subqueries #3266
  3. implement drop view #3267

@kmitchener
Copy link
Contributor

This runs now, but returns 0 results most of the time. The way views are executed still looks a little funny to me, so to exclude the possibility of some bug in the view code, I converted it to a with statement like below:

with revenue as (
	select
		l_suppkey as supplier_no,
		sum(l_extendedprice * (1 - l_discount)) as total_revenue
	from
		lineitem
	where
		l_shipdate >= date '1996-01-01'
		and l_shipdate < date '1996-01-01' + interval '3' month
	group by
		l_suppkey)
select
	s_suppkey,
	s_name,
	s_address,
	s_phone,
	total_revenue
from
	supplier,
	revenue
where
	s_suppkey = supplier_no
	and total_revenue = (
		select
			max(total_revenue)
		from
			revenue
	)
order by
	s_suppkey;

Running this query back to back multiple times usually returns 0 results, but sometimes it correctly returns the top supplier as it's supposed to -- 1 result:

+-----------+--------------------+-------------------+-----------------+--------------------+
| s_suppkey | s_name             | s_address         | s_phone         | total_revenue      |
+-----------+--------------------+-------------------+-----------------+--------------------+
| 8449      | Supplier#000008449 | Wp34zim9qYFbVctdW | 20-469-856-8873 | 1772627.2086999998 |
+-----------+--------------------+-------------------+-----------------+--------------------+
1 row in set. Query took 3.363 seconds.

I extracted just the WITH section to see what it's returning. And 2 back to back runs of this query shows different results for total_revenue:

with revenue as (
        select
                l_suppkey as supplier_no,
                sum(l_extendedprice * (1 - l_discount)) as total_revenue
        from
                lineitem
        where
                l_shipdate >= date '1996-01-01'
                and l_shipdate < date '1996-01-01' + interval '3' month
        group by
                l_suppkey) select * from revenue order by 2 desc limit 1;
+-------------+---------------+
| supplier_no | total_revenue |
+-------------+---------------+
| 8449        | 1772627.2087  |
+-------------+---------------+
1 row in set. Query took 2.959 seconds.

then

+-------------+--------------------+
| supplier_no | total_revenue      |
+-------------+--------------------+
| 8449        | 1772627.2086999998 |
+-------------+--------------------+
1 row in set. Query took 2.554 seconds.

I understand floating point results are uncertain. Is that what's going on here?

@Dandandan
Copy link
Contributor

I think for this to be correct we should switch to using a decimal type, as using floating point will result in small rounding errors, because parallel execution does not have deterministic ordering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants