Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.6/0.7 compat & various minor fixes #143

Merged
merged 8 commits into from
Jan 8, 2018
Merged

0.6/0.7 compat & various minor fixes #143

merged 8 commits into from
Jan 8, 2018

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jan 6, 2018

No description provided.

src/Source.jl Outdated
@@ -86,7 +84,7 @@ function Source(;fullpath::Union{AbstractString,IO}="",
datarow = datarow == -1 ? (isa(header, Vector) ? 0 : last(header)) + 1 : datarow # by default, data starts on line after header
rows = fs == 0 ? -1 : max(-1, rows - datarow + 1 - footerskip) # rows now equals the actual number of rows in the dataset

# figure out # of columns and header, either an Integer, Range, or Vector{String}
# figure out # of columns and header, either an Integer, AbstractRange, or Vector{String}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra ws

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

src/Source.jl Outdated
@@ -239,7 +239,7 @@ Keyword Arguments:
* `decimal::Union{Char,UInt8}`: character to recognize as the decimal point in a float number, e.g. `3.14` or `3,14`; default `'.'`
* `truestring`: string to represent `true::Bool` values in a csv file; default `"true"`. Note that `truestring` and `falsestring` cannot start with the same character.
* `falsestring`: string to represent `false::Bool` values in a csv file; default `"false"`
* `header`: column names can be provided manually as a complete Vector{String}, or as an Int/Range which indicates the row/rows that contain the column names
* `header`: column names can be provided manually as a complete Vector{String}, or as an Int/ AbstractRange which indicates the row/rows that contain the column names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ws

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@@ -233,6 +234,9 @@ function TransposedSource(;fullpath::Union{AbstractString,IO}="",
columntypes[c] = typ
end
end
if !weakrefstrings
columntypes = [T <: WeakRefString ? String : T for T in columntypes]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since T could be e.g. Union{Missing, WeakRefString{UInt8}} (and WeakRefStringArray now supports that) the code here should do the similar thing as a little bit above when converting to categorical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing should be done for Source.
There's also #138 to consolidate column type detection code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

src/Source.jl Outdated
Note by default, "string" or text columns will be parsed as the [`WeakRefString`](https://github.com/quinnj/WeakRefStrings.jl) type. This is a custom type that only stores a pointer to the actual byte data + the number of bytes.
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)`, this also works on an entire column.
Oftentimes, however, it can be convenient to work with `WeakRefStrings` depending on the ultimate use, such as transfering the data directly to another system and avoiding all the intermediate copying.
* `weakrefstrings::Bool=true`: whether WeakRefStrings should be used internally to speed up file parsing; can only be `=true` for Sinks that support WeakRefStringArrays; note that regular Strings are returned from WeakRefStringArray; WeakRefStrings are only used internally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"WeakRefStringArrays" is not a proper type name, but maybe for clarity it's possible to adjust the sentence a bit, so that it uses "WeakRefString", "String" etc, e.g

whether to use [`WeakRefStrings`](https://github.com/quinnj/WeakRefStrings.jl) package to speed up file parsing; can only be `=true` for the `Sink` objects that support `WeakRefStringArray` columns. Note that `WeakRefStringArray` still returns regular `String` elements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@codecov
Copy link

codecov bot commented Jan 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@8fcce46). Click here to learn what that means.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #143   +/-   ##
=========================================
  Coverage          ?   86.76%           
=========================================
  Files             ?        8           
  Lines             ?      869           
  Branches          ?        0           
=========================================
  Hits              ?      754           
  Misses            ?      115           
  Partials          ?        0
Impacted Files Coverage Δ
src/validate.jl 85.71% <ø> (ø)
src/float.jl 85.27% <ø> (ø)
src/CSV.jl 100% <100%> (ø)
src/parsefields.jl 90.68% <100%> (ø)
src/Sink.jl 85.36% <100%> (ø)
src/io.jl 91.76% <100%> (ø)
src/TransposedSource.jl 67.74% <71.42%> (ø)
src/Source.jl 96.69% <83.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fcce46...ebfe12d. Read the comment docs.

@@ -38,6 +38,9 @@ end
return byte
end

substitute(::Type{Union{T, Missing}}, ::Type{T1}) where {T, T1} = Union{T1, Missing}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that's the kind of function that I wanted to have in Missings/0.7 Base, because it's implicitly used in several places at JuliaData (cc @nalimilan).
Maybe it should be called substitute_type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Base the convention seems to be type*, e.g. typeintersect and typejoin (we have discussed adding typesubtract). So why not typesubstitute (but of course it's much less general than the others).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true, for the Base it should be more specific. substitute_nonmissing_type? nonmissingtype_substitute? Don't have a good variant that starts with type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we'll be able to include all of these helper functions in Base, so for now it could be just Missings.typesubstitute or something like that.

@@ -38,6 +38,9 @@ end
return byte
end

substitute(::Type{Union{T, Missing}}, ::Type{T1}) where {T, T1} = Union{T1, Missing}
substitute(::Type{T}, ::Type{T1}) where {T, T1} = T1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there also should be

substitute(::Type{Missing}, ::Type{T1}) where {T1} = Missing

rule.

src/Source.jl Outdated
@@ -179,7 +175,7 @@ function Source(;fullpath::Union{AbstractString,IO}="",
end
end
if !weakrefstrings
columntypes = [T <: WeakRefString ? String : T for T in columntypes]
columntypes = [Missings.T(T) <: WeakRefString ? substitute(T, String) : T for T in columntypes]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's where substitute(Missing, T) rule is required, because Missings.T(Missing) == Union{} <: WeakRefString.

src/Source.jl Outdated
if T >: Missing
columntypes[i] = Union{columntypes[i], Missing}
end
if length(levels[i]) / sum(values(levels[i])) < .67 && Missings.T(T) <: WeakRefString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T !== Missing check was here to exclude all-missing columns from the conversion to categorical.

@quinnj quinnj merged commit 127041e into master Jan 8, 2018
@quinnj quinnj deleted the jq/fixes branch January 8, 2018 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants