Add DataFrame reference to the user guide #3067

andygrove · 2022-08-07T18:03:40Z

Which issue does this PR close?

Closes #3066

Rationale for this change

We need to document what is available; otherwise, how will users know where to start?

What changes are included in this PR?

New section in user guide (see rendered version)

This is just a starting point and we'll need to continue improving and extending once this is merged.

Are there any user-facing changes?

No

andygrove · 2022-08-07T18:21:13Z

@kmitchener WDYT?

codecov-commenter · 2022-08-07T19:12:19Z

Codecov Report

Merging #3067 (83feb9a) into master (0e0931d) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3067      +/-   ##
==========================================
+ Coverage   85.85%   85.93%   +0.08%     
==========================================
  Files         289      289              
  Lines       51890    52118     +228     
==========================================
+ Hits        44548    44790     +242     
+ Misses       7342     7328      -14

| Impacted Files | Coverage Δ | |
|---|---|---|
| datafusion/core/src/dataframe.rs | `89.05% <ø> (ø)` | |
| datafusion/expr/src/expr_fn.rs | `90.85% <100.00%> (+0.15%)` | ⬆️ |
| datafusion/core/tests/sql/mod.rs | `97.79% <0.00%> (-0.31%)` | ⬇️ |
| datafusion/proto/src/from_proto.rs | `35.32% <0.00%> (-0.22%)` | ⬇️ |
| datafusion/proto/src/to_proto.rs | `52.94% <0.00%> (-0.10%)` | ⬇️ |
| datafusion/expr/src/columnar_value.rs | `100.00% <0.00%> (ø)` | |
| datafusion/core/tests/sql/timestamp.rs | `100.00% <0.00%> (ø)` | |
| datafusion/expr/src/built_in_function.rs | `100.00% <0.00%> (ø)` | |
| datafusion/core/tests/dataframe_functions.rs | `100.00% <0.00%> (ø)` | |
| ...afusion/core/src/physical_plan/file_format/avro.rs | `0.00% <0.00%> (ø)` | |

... and 13 more

docs/source/user-guide/dataframe.md

loic-sharma · 2022-08-07T19:57:37Z

Looks great!

Co-authored-by: Loïc Sharma <737941+loic-sharma@users.noreply.github.com>

kmitchener · 2022-08-07T22:20:08Z

> @kmitchener WDYT?

Yeah, this is nice and was completely missing from the docs before.

Re the function listing, should we combine the dataframe functions/conditional expressions, etc with the SQL version? So one definition of the function with maybe 2 examples -- one for SQL, one for Dataframe. Although that consolidation can come later in a follow-on.

andygrove · 2022-08-07T23:49:49Z

datafusion/expr/src/expr_fn.rs

-// TODO(kszucs): this seems buggy, unary_scalar_expr! is used for many
-// varying arity functions


Filed #3069 to track this

andygrove · 2022-08-08T00:17:15Z

> Re the function listing, should we combine the dataframe functions/conditional expressions, etc with the SQL version? So one definition of the function with maybe 2 examples -- one for SQL, one for Dataframe. Although that consolidation can come later in a follow-on.

I'm not sure if we should combine them or not. Combining them is easier for the maintainers for sure, but having separate references for DataFrame and SQL might be easier for users?

andygrove · 2022-08-09T14:15:17Z

Here is the related PR to add SQL function documentation: #3090

alamb

❤️

This looks like a really nice improvement, @andygrove. Thank you!

alamb · 2022-08-09T14:15:02Z

datafusion/expr/src/expr_fn.rs

-unary_scalar_expr!(Log10, log10);
-unary_scalar_expr!(Ln, ln);
-unary_scalar_expr!(NullIf, nullif);
+unary_scalar_expr!(Sqrt, sqrt, "square root of a number");


Love the names
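
For readers skimming the diff above, here is a minimal, self-contained sketch of how a macro can take a description string and attach it to the generated function as rustdoc. The `Expr` and `ScalarFn` types and this simplified `unary_scalar_expr!` are hypothetical stand-ins written for illustration, not DataFusion's actual definitions, and the `ln` description is made up for the example.

```rust
// Toy stand-ins for illustration only; NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarFn {
    Sqrt,
    Ln,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    ScalarFunction { fun: ScalarFn, args: Vec<Expr> },
}

// Generates a one-argument builder function and forwards the string
// literal into a `#[doc = ...]` attribute so it shows up in rustdoc.
macro_rules! unary_scalar_expr {
    ($ENUM:ident, $FUNC:ident, $DOC:literal) => {
        #[doc = $DOC]
        pub fn $FUNC(arg: Expr) -> Expr {
            Expr::ScalarFunction {
                fun: ScalarFn::$ENUM,
                args: vec![arg],
            }
        }
    };
}

unary_scalar_expr!(Sqrt, sqrt, "square root of a number");
// The description below is illustrative, not taken from the PR.
unary_scalar_expr!(Ln, ln, "natural logarithm of a number");

fn main() {
    let a = Expr::Column("a".to_string());
    println!("{:?}", sqrt(a.clone()));
    println!("{:?}", ln(a));
}
```

Keeping the description next to the macro invocation means the function list and its documentation cannot drift apart, which is presumably the appeal of the three-argument form.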

alamb · 2022-08-09T14:19:41Z

docs/source/user-guide/dataframe.md

+  under the License.
+-->
+
+# DataFrame API


I wonder if it is important to mention somewhere that the computations are deferred until collect() is called? Maybe that is common across other dataframe implementations and can be assumed.
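
To make the deferred-execution point concrete, the sketch below models it with a toy type in plain Rust. `LazyFrame` and its methods are hypothetical and exist only for this illustration; they are not DataFusion's `DataFrame` API. Each transformation only records a step, and the work happens when `collect()` is called.

```rust
// Toy model of deferred execution; NOT DataFusion's DataFrame.
struct LazyFrame {
    rows: Vec<i64>,
    // Recorded transformations; none of them run until `collect`.
    steps: Vec<Box<dyn Fn(Vec<i64>) -> Vec<i64>>>,
}

impl LazyFrame {
    fn new(rows: Vec<i64>) -> Self {
        LazyFrame { rows, steps: Vec::new() }
    }

    // Records a filter step without touching the data.
    fn filter(mut self, pred: fn(&i64) -> bool) -> Self {
        self.steps.push(Box::new(move |rows: Vec<i64>| -> Vec<i64> {
            rows.into_iter().filter(pred).collect()
        }));
        self
    }

    // Records a limit step without touching the data.
    fn limit(mut self, n: usize) -> Self {
        self.steps.push(Box::new(move |rows: Vec<i64>| -> Vec<i64> {
            rows.into_iter().take(n).collect()
        }));
        self
    }

    // Number of steps recorded so far.
    fn pending_steps(&self) -> usize {
        self.steps.len()
    }

    // Runs every recorded step, in order, and returns the result.
    fn collect(self) -> Vec<i64> {
        self.steps
            .into_iter()
            .fold(self.rows, |rows, step| step(rows))
    }
}

fn main() {
    let df = LazyFrame::new(vec![1, 8, 3, 9, 10])
        .filter(|v| *v > 5)
        .limit(2);
    // Two steps are queued, but no rows have been processed yet.
    println!("pending steps: {}", df.pending_steps());
    println!("result: {:?}", df.collect());
}
```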

alamb · 2022-08-09T14:22:15Z

docs/source/user-guide/dataframe.md

+col("a").gt(lit(5)).and(col("b").lt(lit(7)))
+```
+
+## Identifiers


Since these are not data frame specific, maybe we should put them into a different guide -- and reference that here

Something like docs/source/user-guide/expressions.md perhaps

This is just a suggestion, I think it would be fine to do in a follow on PR or never

I will move into a separate page as a follow-on. How would you see users using expressions outside the context of DataFrame usage?

We use them all the time in IOx when we need to make Exprs -- for example to create LogicalPlans / use the LogicalPlanBuilder. I also think they are useful for writing tests when people are writing extensions for DataFusion
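
As a rough illustration of building expressions without any DataFrame involved, the sketch below reproduces the fluent style of the `col("a").gt(lit(5)).and(col("b").lt(lit(7)))` snippet using a toy expression tree. The `Expr` enum, `col`, `lit`, and `to_sql` here are hypothetical and far simpler than DataFusion's real types; the point is only that the builder calls produce a plain value that can be inspected, compared in tests, or handed to a plan builder.

```rust
// Toy expression tree for illustration; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Helper constructors mirroring the names used in the snippet.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: i64) -> Expr {
    Expr::Literal(value)
}

impl Expr {
    fn gt(self, other: Expr) -> Expr {
        Expr::Gt(Box::new(self), Box::new(other))
    }

    fn lt(self, other: Expr) -> Expr {
        Expr::Lt(Box::new(self), Box::new(other))
    }

    fn and(self, other: Expr) -> Expr {
        Expr::And(Box::new(self), Box::new(other))
    }

    // Renders the tree as a SQL-like string (hypothetical helper).
    fn to_sql(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Literal(value) => value.to_string(),
            Expr::Gt(l, r) => format!("{} > {}", l.to_sql(), r.to_sql()),
            Expr::Lt(l, r) => format!("{} < {}", l.to_sql(), r.to_sql()),
            Expr::And(l, r) => format!("({}) AND ({})", l.to_sql(), r.to_sql()),
        }
    }
}

fn main() {
    // The expression is just a value; no DataFrame is needed to build it.
    let predicate = col("a").gt(lit(5)).and(col("b").lt(lit(7)));
    println!("{}", predicate.to_sql());
}
```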

alamb · 2022-08-09T14:24:05Z

docs/source/user-guide/dataframe.md

+| case     | CASE expression. Example: `case(expr).when(expr, expr).when(expr, expr).otherwise(expr).end()`.                                                                                                          |
+| nullif   | Returns a null value if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the `coalesce` expression.                                               |
+
+## String Expressions


I think it is ok to leave the Notes column blank for now and fill it out going forward

Or maybe we can link to the content in docs/source/user-guide/sql/scalar_functions.md

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

ursabot · 2022-08-09T16:22:15Z

Benchmark runs are scheduled for baseline = 098f0b0 and contender = 1e44417. 1e44417 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

DataFrame reference for user guide

ad52722

andygrove added the documentation Improvements or additions to documentation label Aug 7, 2022

andygrove changed the title ~~[WIP] DataFrame reference to user guide~~ [WIP] Add DataFrame reference to the user guide Aug 7, 2022

andygrove added 2 commits August 7, 2022 12:15

Add DataFrame methods

b3b6602

improvements

e50811b

github-actions bot added the core Core DataFusion crate label Aug 7, 2022

andygrove added 2 commits August 7, 2022 12:36

improvements

b46cb41

improvements

0ac274e

andygrove changed the title ~~[WIP] Add DataFrame reference to the user guide~~ Add DataFrame reference to the user guide Aug 7, 2022

andygrove marked this pull request as ready for review August 7, 2022 18:38

formatting

397eefe

loic-sharma reviewed Aug 7, 2022

docs/source/user-guide/dataframe.md (outdated)

loic-sharma reviewed Aug 7, 2022

docs/source/user-guide/dataframe.md (outdated)

andygrove and others added 2 commits August 7, 2022 15:47

Update docs/source/user-guide/dataframe.md

c8f91d1

Co-authored-by: Loïc Sharma <737941+loic-sharma@users.noreply.github.com>

Update docs/source/user-guide/dataframe.md

6ceb6cc

Co-authored-by: Loïc Sharma <737941+loic-sharma@users.noreply.github.com>

andygrove added 2 commits August 7, 2022 17:48

more docs

81a12bd

Merge branch 'user-guide-dataframe' of github.com:andygrove/arrow-dat…

eb20e80

…afusion into user-guide-dataframe

github-actions bot added the logical-expr Logical plan and expressions label Aug 7, 2022

andygrove commented Aug 7, 2022

more docs

9ae1428

andygrove requested a review from alamb August 9, 2022 13:13

alamb approved these changes Aug 9, 2022

andygrove and others added 2 commits August 9, 2022 08:53

Update docs/source/user-guide/dataframe.md

0c1534d

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update docs/source/user-guide/dataframe.md

8187fe9

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

andygrove and others added 3 commits August 9, 2022 08:54

Update docs/source/user-guide/dataframe.md

9b769c2

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update docs/source/user-guide/dataframe.md

25d086d

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

add note on lazy evaluation

83feb9a

andygrove merged commit 1e44417 into apache:master Aug 9, 2022

andygrove deleted the user-guide-dataframe branch August 9, 2022 16:20

andygrove mentioned this pull request Aug 15, 2022

Release DataFusion 11.0.0 #3012

Closed

17 tasks
