Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update documentation #1252

Merged
merged 3 commits into from
Nov 15, 2017
Merged

Update documentation #1252

merged 3 commits into from
Nov 15, 2017

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Oct 11, 2017

Goals:

  • Add missing exported functions and types to online documentation
  • Move documentation examples to doctests
    • Make sure the broken examples get fixed before merging

@coveralls
Copy link

Coverage Status

Coverage remained the same at 72.552% when pulling a7136b9 on cjp/doctoberfest into 8fd0851 on master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage remained the same at 72.552% when pulling a7136b9 on cjp/doctoberfest into 8fd0851 on master.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I think there are already enough changes to justify one PR. Better finish this one, and open another one for the remaining checkboxes.

README.md Outdated
@@ -2,7 +2,7 @@ DataFrames.jl
=============

[![0.6](http://pkg.julialang.org/badges/DataFrames_0.6.svg)](http://pkg.julialang.org/?pkg=DataFrames)
[![0.7](http://pkg.julialang.org/badges/DataFrames_0.7.svg)](http://pkg.julialang.org/?pkg=DataFrames)
<!-- [![0.7](http://pkg.julialang.org/badges/DataFrames_0.7.svg)](http://pkg.julialang.org/?pkg=DataFrames) -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't render a badge on the README. I think because v0.7 doesn't have tests on pkg.julialang.org link. I'd also like to change the CI tests to allow failures on nightly until a release candidate comes out for v0.7. 👍 or 👎 ? These can all be moved to another PR too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. Just remove it then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm going to remove the change here in favor of a badge-specific PR in #1253

to know to get up and running with tabular data manipulation using the DataFrames.jl package
and the Julia language. If there is something you expect DataFrames to be capable of, but
cannot figure out how to do, please reach out with questions in Domains/Data on
[Discourse](https://discourse.julialang.org/new-topic?title=[DataFrames%20Question]:%20&body=%23%20Question:%0A%0A%23%20Dataset%20(if%20applicable):%0A%0A%23%20MWE%20(if%20applicable):%0A&category=Domains/Data&tags=question).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"MWE" could be spelled in full.

and the Julia language. If there is something you expect DataFrames to be capable of, but
cannot figure out how to do, please reach out with questions in Domains/Data on
[Discourse](https://discourse.julialang.org/new-topic?title=[DataFrames%20Question]:%20&body=%23%20Question:%0A%0A%23%20Dataset%20(if%20applicable):%0A%0A%23%20MWE%20(if%20applicable):%0A&category=Domains/Data&tags=question).
You can also give feedback and suggest features or improvements by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"give feedback" is too general, better say "report bugs". Also "suggest features" is likely to attract many little useful posts which we will have to close, creating frustration on both sides. Better leave people discuss their problems on Discourse.

CurrentModule = DataFrames
```

# Functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't everything listed here a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, everything on that page is a function, I'm not sure this H1 header adds anything. We could remove it and only show the H2 headers of the function-groups (and bump them up to being H1 headers). But before I act on any of your other comments on where/how functions should be grouped, what are your thoughts on just listing all of the functions alphabetically in a single section? It might make navigation difficult since the functions wouldn't be grouped by concept, but the conceptual groupings are pretty arbitrary and no matter how we group them I'm sure we won't be able to make the groups intuitive to everyone. It will probably be easier to keep these docstrings up to date if we just list the functions in the same way they are exported in the DataFrames.jl source file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, due to the presence of the isolated meltdf, I had misread this as a H2 heading. I'm OK with keeping it as H1, meltdf should simply be moved.

I'd say grouping is a better approach, but we don't need a lot of groups. Maybe one for basics, and one for grouping/joining & co.?

meltdf
```

## Data Manipulation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "Split-Apply-Combine"? Though colwise is different.


All other columns are assumed to be measured variables (they are stacked).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove these two sentences?

Copy link
Contributor Author

@cjprybol cjprybol Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the example they describe because it didn't work

julia> iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"));

julia> d = stackdf(iris);
ERROR: StackOverflowError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-)

│ 5 │ SepalLength │ 5.0 │ setosa │
│ 6 │ SepalLength │ 5.4 │ setosa │

julia> x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(parse.(Float64, df[:value]))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a line break. To work around the CSV.jl parsing bug, probably better go back to latest version for now.

"""
gennames(n::Integer)

Generate standardized names for columns of a DataFrame. The first name will be :x1, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`:x1`

"""
countnull(a::AbstractArray)

Count the number of null values in an array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`null`

@cjprybol
Copy link
Contributor Author

I agree that the other goals can be changed in different PRs, I'll go with that approach rather than turning this into a monolithic behemoth of code churn

@cjprybol
Copy link
Contributor Author

Ok I think that's most of it. Right now the docstrings aren't rendering for the types and functions API man pages, but I think that's because not every type/function in those sections have docstrings. That'll be a followup PR

@@ -49,7 +49,6 @@ export AbstractDataFrame,
nrow,
nullable!,
order,
printtable,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will stop exporting the printtable function as suggested. If we go with that then we need to fix the tests to call DataFrames.printtable() https://travis-ci.org/JuliaData/DataFrames.jl/jobs/287109648#L606

"Data manipulation" => "lib/manipulation.md",
],
"About" => Any[
"Release Notes" => "NEWS.md",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should point to the releases GitHub page somewhere on the homepage?

```julia
cv = categorical(v)
```jldoctest categorical
julia> cv = categorical(v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That (existing) example doesn't make a lot of sense as it just repeats the previous ones. Better show how to use categorical!(df, :col), and mention that this is in-place.


```julia
mean(Nulls.skip(df[1]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was useful to show how to skip nulls again. Maybe put a null in one of the columns, and show both?

@where i.age > 40
@select {number_of_children=i.children, i.name}
end
Query.EnumerableSelect{NamedTuples._NT_number__of__children_name{Int64,String},Query.EnumerableWhere{NamedTuples._NT_name_age_children{String,Float64,Int64},Query.EnumerableIterable{NamedTuples._NT_name_age_children{String,Float64,Int64},IterableTables.DataFrameIterator{NamedTuples._NT_name_age_children{String,Float64,Int64},Tuple{Array{String,1},Array{Float64,1},Array{Int64,1}}}},##5#7},##6#8}(Query.EnumerableWhere{NamedTuples._NT_name_age_children{String,Float64,Int64},Query.EnumerableIterable{NamedTuples._NT_name_age_children{String,Float64,Int64},IterableTables.DataFrameIterator{NamedTuples._NT_name_age_children{String,Float64,Int64},Tuple{Array{String,1},Array{Float64,1},Array{Int64,1}}}},##5#7}(Query.EnumerableIterable{NamedTuples._NT_name_age_children{String,Float64,Int64},IterableTables.DataFrameIterator{NamedTuples._NT_name_age_children{String,Float64,Int64},Tuple{Array{String,1},Array{Float64,1},Array{Int64,1}}}}(IterableTables.DataFrameIterator{NamedTuples._NT_name_age_children{String,Float64,Int64},Tuple{Array{String,1},Array{Float64,1},Array{Int64,1}}}(3×3 DataFrames.DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Looks like Query should provide a simplified printing for that type! :-)

Would [...] work here to avoid copying the full output?

3: Array{Float64}((150,)) [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5 … 5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1]
4: Array{Float64}((150,)) [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1 … 2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8] Species: DataFrames.RepeatedVector{CategoricalArrays.CategoricalValue{String,UInt32}}
parent: CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}
refs: Array{UInt32}((150,)) UInt32[0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001 … 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003, 0x00000003]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe replace everything related to CategoricalArray with [...] if that works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't pass the doctests, but if we just want to truncate the output then that's fine. To be honest I'm not sure I see the value of showing dump in the documentation (unless we add an advanced section to the documentation, since this is very useful but not something I'd show to a newcomer)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's just remove this.

│ 1 │ setosa │ 1.462 │ 0.0301592 │
│ 2 │ versicolor │ 4.26 │ 0.220816 │
│ 3 │ virginica │ 5.552 │ 0.304588 │

```

A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column, that was not used to split the DataFrame, creating new columns of the form `$name_$function` e.g. `SepalLength_mean`. Anonymous functions and expressions that do not have a name will be called `λ1`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently "will be called λ1" is no longer true.

│ 2 │ versicolor │ 50 │ 50 │ 50 │ 50 │
│ 3 │ virginica │ 50 │ 50 │ 50 │ 50 │

julia> aggregate(iris, :Species, [sum, x->mean(x)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without Nulls.skip, it's confusing that you use x -> mean(x) rather than just mean. Maybe keep Nulls.skip even if that's not needed, given that it's a common pattern that's useful to show.

@coveralls
Copy link

coveralls commented Oct 18, 2017

Coverage Status

Coverage increased (+0.2%) to 72.722% when pulling a916059 on cjp/doctoberfest into 8fd0851 on master.

julia> df = DataFrame(A = [1, 1, 1, 2, 2, 2],
B = ["X", "X", "X", "Y", "Y", "Y"])
julia> df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
B = ["X", "X", "X", "Y", "Y", "Y"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug?

│ 5 │ D │ Y │
│ 6 │ A │ Y │

julia> allcols = deepcopy(df); bothcols = deepcopy(df); onecol = deepcopy(df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's confusing to do all that setup. Maybe just show first that you can specify a single column, and then call the function without an argument and show that the other column was also converted? Then people can read the docstring to find out more.

@coveralls
Copy link

coveralls commented Oct 18, 2017

Coverage Status

Coverage increased (+0.02%) to 72.568% when pulling de766cb on cjp/doctoberfest into 8fd0851 on master.

@nalimilan
Copy link
Member

Anything left to do here?

@cjprybol
Copy link
Contributor Author

Anything left to do here?

Not anymore!

@nalimilan
Copy link
Member

Doctests are failing on CI (this is hidden by default, need to unroll the section). I think you'll have to install CSV in the same line that installs Documenter.

@cjprybol
Copy link
Contributor Author

Unsatisfiable requirements detected for package WeakRefStrings 🙁. Not sure if we can do anything about that until we tag the 11.0 release here since CSV, WeakRefStrings, and DataStreams are upstream. Thoughts on merging and cleaning up any remaining issues post version tag? Everything looks good locally, only issues I think we might still have are on the variability of the # of columns omitted during DataFrame printing and the ordering of union printing (Union{Null, T} and Union{T, Null}) being different on 0.6 and 0.7

@coveralls
Copy link

coveralls commented Nov 14, 2017

Coverage Status

Coverage remained the same at 72.956% when pulling 84e7da2 on cjp/doctoberfest into 9d0e065 on master.

@coveralls
Copy link

coveralls commented Nov 15, 2017

Coverage Status

Coverage remained the same at 72.956% when pulling 95ad4ca on cjp/doctoberfest into 9d0e065 on master.

@nalimilan
Copy link
Member

Yeah, we need to tag releases in order to be able to run our own examples...

@nalimilan nalimilan changed the title WIP: Update documentation Update documentation Nov 15, 2017
@nalimilan nalimilan merged commit 4315cdd into master Nov 15, 2017
@nalimilan nalimilan deleted the cjp/doctoberfest branch November 15, 2017 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants