Add DataFrame reference to the user guide #3067

Merged: andygrove merged 16 commits into apache:master from the user-guide-dataframe branch on Aug 9, 2022.

andygrove
Member

@andygrove andygrove commented Aug 7, 2022

Which issue does this PR close?

Closes #3066

Rationale for this change

We need to document what is available; otherwise, how will users know where to start?

What changes are included in this PR?

New section in user guide (see rendered version)

This is just a starting point and we'll need to continue improving and extending once this is merged.

Are there any user-facing changes?

No

@andygrove andygrove added the documentation Improvements or additions to documentation label Aug 7, 2022
@andygrove andygrove changed the title [WIP] DataFrame reference to user guide [WIP] Add DataFrame reference to the user guide Aug 7, 2022
@github-actions github-actions bot added the core Core DataFusion crate label Aug 7, 2022
@andygrove
Member Author

@kmitchener WDYT?

@andygrove andygrove changed the title [WIP] Add DataFrame reference to the user guide Add DataFrame reference to the user guide Aug 7, 2022
@andygrove andygrove marked this pull request as ready for review August 7, 2022 18:38
@codecov-commenter

codecov-commenter commented Aug 7, 2022

Codecov Report

Merging #3067 (83feb9a) into master (0e0931d) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3067      +/-   ##
==========================================
+ Coverage   85.85%   85.93%   +0.08%     
==========================================
  Files         289      289              
  Lines       51890    52118     +228     
==========================================
+ Hits        44548    44790     +242     
+ Misses       7342     7328      -14     
| Impacted Files | Coverage Δ |
| --- | --- |
| datafusion/core/src/dataframe.rs | 89.05% <ø> (ø) |
| datafusion/expr/src/expr_fn.rs | 90.85% <100.00%> (+0.15%) ⬆️ |
| datafusion/core/tests/sql/mod.rs | 97.79% <0.00%> (-0.31%) ⬇️ |
| datafusion/proto/src/from_proto.rs | 35.32% <0.00%> (-0.22%) ⬇️ |
| datafusion/proto/src/to_proto.rs | 52.94% <0.00%> (-0.10%) ⬇️ |
| datafusion/expr/src/columnar_value.rs | 100.00% <0.00%> (ø) |
| datafusion/core/tests/sql/timestamp.rs | 100.00% <0.00%> (ø) |
| datafusion/expr/src/built_in_function.rs | 100.00% <0.00%> (ø) |
| datafusion/core/tests/dataframe_functions.rs | 100.00% <0.00%> (ø) |
| ...afusion/core/src/physical_plan/file_format/avro.rs | 0.00% <0.00%> (ø) |
| ... and 13 more | |

@loic-sharma
Contributor

Looks great!

andygrove and others added 2 commits August 7, 2022 15:47
Co-authored-by: Loïc Sharma <737941+loic-sharma@users.noreply.github.com>
Co-authored-by: Loïc Sharma <737941+loic-sharma@users.noreply.github.com>
@kmitchener
Contributor

> @kmitchener WDYT?

Yeah, this is nice and was completely missing from the docs before.

Re the function listing, should we combine the dataframe functions/conditional expressions, etc with the SQL version? So one definition of the function with maybe 2 examples -- one for SQL, one for Dataframe. Although that consolidation can come later in a follow-on.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Aug 7, 2022
Comment on lines -245 to -246
// TODO(kszucs): this seems buggy, unary_scalar_expr! is used for many
// varying arity functions
Member Author

Filed #3069 to track this

@andygrove
Member Author

> Re the function listing, should we combine the dataframe functions/conditional expressions, etc with the SQL version? So one definition of the function with maybe 2 examples -- one for SQL, one for Dataframe. Although that consolidation can come later in a follow-on.

I'm not sure if we should combine them or not. Combining them is easier for the maintainers for sure, but having separate references for DataFrame and SQL might be easier for users?

@andygrove andygrove requested a review from alamb August 9, 2022 13:13
@andygrove
Member Author

Here is the related PR to add SQL function documentation: #3090

Contributor

@alamb alamb left a comment

❤️

This looks like a really nice improvement, @andygrove. Thank you!

unary_scalar_expr!(Log10, log10);
unary_scalar_expr!(Ln, ln);
unary_scalar_expr!(NullIf, nullif);
unary_scalar_expr!(Sqrt, sqrt, "square root of a number");
Contributor

Love the names
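
The macro calls quoted above come in two shapes: with and without a trailing description string. As a rough illustration of how a `macro_rules!` macro can support both, here is a minimal sketch. The macro name `unary_fn!` and its `f64` signature are assumptions made for this example; it is not DataFusion's `unary_scalar_expr!`, which builds expressions rather than computing numbers.

```rust
// Hypothetical stand-in for illustration only; NOT DataFusion's
// `unary_scalar_expr!`. It generates one-argument functions over f64 and
// accepts an optional description, mirroring the two call shapes in the diff.
macro_rules! unary_fn {
    // Arm 1: no description given; delegate with a generic doc string.
    ($name:ident, $body:expr) => {
        unary_fn!($name, $body, "unary scalar function");
    };
    // Arm 2: description given; attach it to the function as its doc comment.
    ($name:ident, $body:expr, $doc:expr) => {
        #[doc = $doc]
        fn $name(x: f64) -> f64 {
            let f: fn(f64) -> f64 = $body;
            f(x)
        }
    };
}

unary_fn!(log10, f64::log10);
unary_fn!(ln, f64::ln);
unary_fn!(sqrt, f64::sqrt, "square root of a number");
```

With the description arm, the string ends up in the generated rustdoc for `sqrt`, which is one way such a macro can feed a function reference.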

under the License.
-->

# DataFrame API
Contributor

I wonder if it is important to mention somewhere that the computations are deferred until collect() is called? Maybe that is common across other dataframe implementations and can be assumed.

Member Author

Added
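
To make the point about deferred computation concrete, here is a toy model using only the Rust standard library. The `LazyFrame` type, its `Step` enum, and the `executions` counter are hypothetical stand-ins invented for this sketch, not DataFusion's `DataFrame`; the only thing carried over from the discussion is that transformations merely record a plan and nothing runs until `collect()` is called.

```rust
use std::cell::Cell;

// One recorded transformation in the (toy) plan.
enum Step {
    Filter(fn(&i64) -> bool),
    Limit(usize),
}

// Hypothetical lazy frame over a single integer column.
struct LazyFrame {
    source: Vec<i64>,
    plan: Vec<Step>,
    executions: Cell<usize>, // how many times the plan has actually run
}

impl LazyFrame {
    fn new(source: Vec<i64>) -> Self {
        LazyFrame { source, plan: Vec::new(), executions: Cell::new(0) }
    }

    // Records a filter step; no rows are touched here.
    fn filter(mut self, predicate: fn(&i64) -> bool) -> Self {
        self.plan.push(Step::Filter(predicate));
        self
    }

    // Records a limit step; no rows are touched here.
    fn limit(mut self, n: usize) -> Self {
        self.plan.push(Step::Limit(n));
        self
    }

    // Executes every recorded step, in order, and returns the result.
    fn collect(&self) -> Vec<i64> {
        self.executions.set(self.executions.get() + 1);
        let mut rows = self.source.clone();
        for step in &self.plan {
            rows = match step {
                Step::Filter(p) => rows.into_iter().filter(|v| p(v)).collect(),
                Step::Limit(n) => rows.into_iter().take(*n).collect(),
            };
        }
        rows
    }
}
```

Building `LazyFrame::new(vec![1, 8, 3, 9, 7]).filter(|v| *v > 5).limit(2)` leaves `executions` at zero; only the later `collect()` call walks the plan and produces rows.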

docs/source/user-guide/dataframe.md
```rust
col("a").gt(lit(5)).and(col("b").lt(lit(7)))
```
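
The fluent expression quoted above builds a tree rather than evaluating anything. The following sketch shows that idea with a deliberately tiny, hypothetical `Expr` type; the enum, its variants, and the `Display` output format are assumptions for illustration and are much simpler than DataFusion's real `Expr`. Only the function and method names (`col`, `lit`, `gt`, `lt`, `and`) are taken from the quoted line.

```rust
use std::fmt;

// Toy expression tree; a simplified stand-in, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary { left: Box<Expr>, op: Op, right: Box<Expr> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Gt,
    Lt,
    And,
}

// Leaf constructors, named after the helpers used in the quoted snippet.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: i64) -> Expr {
    Expr::Literal(value)
}

impl Expr {
    // Each builder method consumes `self` and wraps it in a new parent node.
    fn binary(self, op: Op, right: Expr) -> Expr {
        Expr::Binary { left: Box::new(self), op, right: Box::new(right) }
    }
    fn gt(self, other: Expr) -> Expr {
        self.binary(Op::Gt, other)
    }
    fn lt(self, other: Expr) -> Expr {
        self.binary(Op::Lt, other)
    }
    fn and(self, other: Expr) -> Expr {
        self.binary(Op::And, other)
    }
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Literal(value) => write!(f, "{}", value),
            Expr::Binary { left, op, right } => {
                let symbol = match op {
                    Op::Gt => ">",
                    Op::Lt => "<",
                    Op::And => "AND",
                };
                write!(f, "({} {} {})", left, symbol, right)
            }
        }
    }
}
```

In this toy model, `col("a").gt(lit(5)).and(col("b").lt(lit(7)))` produces a three-level tree whose `Display` form is `((a > 5) AND (b < 7))`.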

## Identifiers
Contributor

Since these are not data frame specific, maybe we should put them into a different guide -- and reference that here

Something like docs/source/user-guide/expressions.md perhaps

This is just a suggestion, I think it would be fine to do in a follow on PR or never

Member Author

I will move into a separate page as a follow-on. How would you see users using expressions outside the context of DataFrame usage?

Contributor

We use them all the time in IOx when we need to make Exprs -- for example to create LogicalPlans / use the LogicalPlanBuilder. I also think they are useful for writing tests when people are writing extensions for DataFusion

docs/source/user-guide/dataframe.md (outdated)
| case | CASE expression. Example: `case(expr).when(expr, expr).when(expr, expr).otherwise(expr).end()`. |
| nullif | Returns a null value if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the `coalesce` expression. |
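
The `nullif` description above, and its relationship to `coalesce`, can be sketched with `Option` standing in for SQL NULL. These two free functions are illustrative assumptions written for this example; they are not DataFusion's expression builders, which operate on `Expr` values rather than on Rust options.

```rust
// NULLIF(value1, value2): returns None (SQL NULL) when the two values are
// equal, otherwise returns value1 unchanged.
fn nullif<T: PartialEq>(value1: Option<T>, value2: Option<T>) -> Option<T> {
    if value1 == value2 {
        None
    } else {
        value1
    }
}

// COALESCE(v1, v2, ...): returns the first non-null value, or None if every
// value is null.
fn coalesce<T>(values: Vec<Option<T>>) -> Option<T> {
    values.into_iter().flatten().next()
}
```

For instance, `nullif(Some(0), Some(0))` turns the sentinel `0` into a null, and `coalesce(vec![None, Some(0)])` turns a null back into the default `0`, which is the sense in which the two are inverses.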

## String Expressions
Contributor

I think it is ok to leave the Notes column blank for now and fill it out going forward

Contributor

Or maybe we can link to the content in docs/source/user-guide/sql/scalar_functions.md

andygrove and others added 2 commits August 9, 2022 08:53
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
andygrove and others added 3 commits August 9, 2022 08:54
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@andygrove andygrove merged commit 1e44417 into apache:master Aug 9, 2022
@andygrove andygrove deleted the user-guide-dataframe branch August 9, 2022 16:20
@ursabot

ursabot commented Aug 9, 2022

Benchmark runs are scheduled for baseline = 098f0b0 and contender = 1e44417. 1e44417 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@andygrove andygrove mentioned this pull request Aug 15, 2022
17 tasks
Labels: core (Core DataFusion crate), documentation (Improvements or additions to documentation), logical-expr (Logical plan and expressions)
Linked issue: Add DataFrame section to user guide
6 participants