-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sortedcontainerupdates #787
Sortedcontainerupdates #787
Conversation
Address issue 239: Dict constructors on typed data but without explicit type declarations should no longer return {Any,Any} items because collect is now used when explicit types are omitted Address issue 479: Deprecate insert!, pop!(s), introduce popfirst!, poplast!, ss_push!, sd_push!, smd_push! Address issue 528: push! documentation improved Address issue 667: Docs extensively cleaned up and much greater use of documenter.jl Address issue 669: Expanded use of tokens including a:b notation for tokens Address issue 671: Add reverse iteration Address issue 742: New union, merge, symdiff, setdiff, intersect algorithms for sorted containers to improve performance Address issue 726: Clarify docs on insert! (now deprecated) Address issue 767: Implement firstindex, lastindex, reverse for sorted container iteration Address issue 781: Use collect on an intermediate step to infer types in untyped construction of sorted containers Also, this commit cleans up the constructors and adds two new iteration-helper functions inclusive_key and exclusive_key that were useful in my own work.
The following test from the new testset, which is meant to exercise a new my_assert(remove_spaces(repr(MIME{Symbol("text/plain")}(), factors5)) ==
"SortedMultiDict{Char,Int64,ForwardOrdering}" *
"(ForwardOrdering(),['a'=>1,'b'=>2,'c'=>3])") Here, function remove_spaces(s::String)
b = Vector{UInt8}()
for c in s
if !isspace(c)
push!(b,UInt8(c))
end
end
String(b)
end |
c9bffaf
to
4b8c108
Compare
The reason the test is failing is: In Julia 1.0, printing the vector |
I replaced the failing unit test for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I have not finished reviewing this.
This broadly speaking looks good and if all the existing tests still pass then it should be good
|
||
const SDMContainer = Union{SortedDict, SortedMultiDict} | ||
const SortedContainer = Union{SDMContainer, SortedSet} | ||
const Token = Tuple{SortedContainer, IntSemiToken} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a surprising definition for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you asking why Token
is implemented as a tuple rather than a struct
? This is a legacy issue. In pre-1.0 Julia, any struct that contained an immutable object had itself to be heap allocated. However, in pre-1.0 Julia, operations on tuples containing an immutable object didn't always allocate. In particular, deref((mysorteddict, semitok))
did not heap-allocate its 2-tuple argument, whereas a hypothetical deref(Token(mysorteddict, semitok))
, where Token
is the constructor of a hypothetical struct, would have allocated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More my surprise was that a token included both the container and the indexing object.
I would have expected the token to just be the indexing object (but that is defined as a semitoken)
is the idea that the token is enough to fully retrieve the stored object?
But semitoken is only half of what is expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think full tokens are useful, e.g., deref(token_firstindex(s))
as opposed to deref((s,firstindex(s)))
where the user needs to be consistent with the argument s
. So if it is OK with you, I would like to leave these.
token_firstindex(m::SortedContainer) | ||
|
||
|
||
Return the token of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this definition of token feels weird to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to leave these. See other comment.
src/sorted_container_iteration.jl
Outdated
function deref(ii::SortedDictToken) | ||
has_data(ii) | ||
deref_nocheck(ii) | ||
end | ||
|
||
@inline function deref_nocheck(ii::SortedMultiDictToken) | ||
@inbounds kdrec = ii[1].bt.data[ii[2].address] | ||
return kdrec.k => kdrec.d | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should be implementing these deref
and deref_nocheck
pairs, using Base.@propagate_inbounds
and Base.@boundscheck
.
Then when we want deref_nocheck(k)
we could just write @inbounds deref(k)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar for the other foo
/foo_nocheck
pairs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to implement this, but it is not working properly, so perhaps I am misunderstanding the rules. Here is the new definition of deref
:
Base.@propagate_inbounds function deref(ii::SortedDictToken)
Base.@boundscheck has_data(ii)
@inbounds kdrec = ii[1].bt.data[ii[2].address]
return Pair(kdrec.k, kdrec.d)
end
My intention is that if a deref
operation is prefixed by @inbounds
, then has_data
will not be invoked. But this is not happening:
julia> s = SortedDict{Int,String}()
SortedDict{Int64, String, Base.Order.ForwardOrdering}()
julia> deref(token_firstindex(s))
ERROR: BoundsError
Stacktrace:
[1] has_data
@ c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:442 [inlined]
[2] deref(ii::Tuple{SortedDict{Int64, String, Base.Order.ForwardOrdering}, DataStructures.Tokens.IntSemiToken})
@ DataStructures c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:242
[3] top-level scope
@ REPL[24]:1
julia> @inbounds deref(token_firstindex(s))
ERROR: BoundsError
Stacktrace:
[1] has_data
@ c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:442 [inlined]
[2] deref(ii::Tuple{SortedDict{Int64, String, Base.Order.ForwardOrdering}, DataStructures.Tokens.IntSemiToken})
@ DataStructures c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:242
[3] top-level scope
@ REPL[25]:1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, nvm, I figured this out too. Apparently @inbounds
is not necessarily active if it is called directly from the REPL as opposed to its being called from a compiled function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some further progress, sorry this is taking so long
docs/Project.toml
Outdated
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" | ||
|
||
[compat] | ||
Documenter = "0.23" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should keep this line, so when/if Documenter makes a new breaking release we don't get it automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I put it back. I'm not sure how it got deleted.
docs/src/sorted_containers.md
Outdated
is similar to SortedDict except that each key can be associated with | ||
multiple values. The key=>value pairs in a SortedMultiDict are stored | ||
according to the sorted order for keys, and key=>value pairs with the | ||
same key are stored in order of insertion. | ||
|
||
The containers internally use a 2-3 tree, which is a kind of balanced | ||
tree and is described in many elementary data structure textbooks. | ||
tree and is described in data structure textbooks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should mention that it is backed by a single array, rather than linked-list style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/src/sorted_containers.md
Outdated
is not a bits-type. | ||
If code profiling indicates that statements using tokens are allocating memory, | ||
then it may be advisable to rewrite the application code using semitokens | ||
more than tokens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more than tokens. | |
rather than tokens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/src/sorted_containers.md
Outdated
searchsortedfirst(sc,k) | ||
```@docs | ||
Base.first(sc::SortedContainer) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason there are in seperate @docs
blocks?
rather than
```@docs
deref_key(token::Token)
...
Base.first(sc::SortedContainer)
```
I think that works doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know about the other syntax for documenter, but now that you told me, I have fixed it.
docs/src/sorted_containers.md
Outdated
ss_push!(ss::SortedSet, k) | ||
``` | ||
|
||
```@docs | ||
pop!(ss::SortedSet, k) | ||
sd_push!(sd::SortedDict, p::Pair) | ||
``` | ||
|
||
```@docs | ||
pop!(ss::SortedSet) | ||
smd_push!(smd::SortedMultiDict, p::Pair) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to put the Base
functions before our special new ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/DataStructures.jl
Outdated
import Base.+ | ||
import Base.- | ||
import Base.isequal | ||
export sd_push!, ss_push!, smd_push!, poplast! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than all 3 sd_push!, ss_push!, smd_push!
is there a single name we can overload for these that refers to what it returns, rather than hungarian notation for it's first argument?
Like push_and_get_token!
or something? (is that what they are doing?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. The new function is called push_return_token!
. Since it is a single function name, I unified the APIs by making the return variables Bool,IntSemiToken
. The Bool
return variable is redundant for SortedMultiDict because push always creates a new record for SortedMultiDict (i.e., the first return variable is always true
for a SortedMultiDict).
|
||
const SDMContainer = Union{SortedDict, SortedMultiDict} | ||
const SortedContainer = Union{SDMContainer, SortedSet} | ||
const Token = Tuple{SortedContainer, IntSemiToken} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More my surprise was that a token included both the container and the indexing object.
I would have expected the token to just be the indexing object (but that is defined as a semitoken)
is the idea that the token is enough to fully retrieve the stored object?
But semitoken is only half of what is expected?
the first entry of the sorted container `m`, or the past-end token | ||
if the container is empty. Time: O(log *n*) | ||
""" | ||
token_firstindex(m::SortedContainer) = (m, firstindex(m)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this to be exported?
It seems like if someone needs it is is a trivial and obvious definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment about token_firstindex
.
src/sorted_container_iteration.jl
Outdated
if the container is empty. Time: O(log *n*) | ||
""" | ||
Base.lastindex(m::SortedContainer) = endof(m) | ||
endof(m::SortedContainer) = IntSemiToken(endloc(m.bt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as per startof
do we need this? or can we just use lastindex
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have deprecated startof
and endof
in favor of firstindex
and lastindex
.
last entry of the sorted container `m`, or the before-start semitoken | ||
if the container is empty. Time: O(log *n*) | ||
""" | ||
token_lastindex(m::SortedContainer) = (m, lastindex(m)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as per token_firstindex
do we need to export this?
It is a trivial and obvious definition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment about token_firstindex
|
||
These five invocations of `in` use the index structure of the sorted | ||
container and test equality based on the order object of the keys | ||
rather than `isequal`. Therefore, these five are all faster than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we comment here that if isequal
is defined consistently with isless
etc this should have same behavour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is not whether isless
and isequal
are consistent. Rather, the issue is whether the user has defined an ordering (such as case-insensitive for strings) for the container different from isequal
and isless
.
src/sorted_container_iteration.jl
Outdated
|
||
(:)(t1::Token, t2::Token) = _colon(t1,t2) | ||
|
||
function _colon(t1::Token, t2::Token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not define this as :
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition function Base.(:)(a,b)
gives a syntax error (noted above). However, function (:)(a,b)
works, so I have fixed this (but I still have the issue with import
noted above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I figured this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
I assume that the logic is at least as correct as before as all previous tests still pass.
Some mostly stylistic comments to resolve
src/sorted_container_iteration.jl
Outdated
+(t1::Token, numstep::Integer) = | ||
numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this easily fits on 1 line, no?
+(t1::Token, numstep::Integer) = | |
numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep) | |
+(t1::Token, numstep::Integer) = numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/sorted_container_iteration.jl
Outdated
-(t1::Token, numstep::Integer) = | ||
numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this easily fits on 1 line, no?
-(t1::Token, numstep::Integer) = | |
numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep) | |
-(t1::Token, numstep::Integer) = numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
function stepforward(t1::Token, numstep::Integer) | ||
m = t1[1] | ||
j = t1[2].address | ||
!(j in m.bt.useddatacells) && throw(BoundsError()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think of
!(j in m.bt.useddatacells) && throw(BoundsError()) | |
(j in m.bt.useddatacells) || throw(BoundsError()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this change, but I have used the idiom !(correct) && throw(...)
(rather than correct || throw(...)
) throughout my code. Do you recommend that I change them all.
src/sorted_dict.jl
Outdated
|
||
""" | ||
in(p, sc) | ||
@deprecate insert!(m::SortedDict, k, d) sd_push!(m::SortedDict, k=>d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this into src/deprecations.jl
so we remember to delete it in next release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/test_sorted_containers.jl
Outdated
using Base: Ordering, Forward, Reverse, ForwardOrdering, | ||
ReverseOrdering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fits on one line
using Base: Ordering, Forward, Reverse, ForwardOrdering, | |
ReverseOrdering | |
using Base: Ordering, Forward, Reverse, ForwardOrdering, ReverseOrdering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import Base.lt | ||
import Base.ForwardOrdering | ||
import Base.ReverseOrdering | ||
import DataStructures.IntSemiToken | ||
import DataStructures.eq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make the imports into using
s i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import DataStructures.eq | ||
using Base: Ordering, Forward, Reverse, ForwardOrdering, | ||
ReverseOrdering | ||
using DataStructures: IntSemiToken | ||
import Base.lt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this called Base.lessthan
now days?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked ordering.jl
in Base
, and I still see lt
. Where did you see lessthan
?
|
||
struct CaseInsensitive <: Ordering | ||
end | ||
|
||
lt(::CaseInsensitive, a, b) = isless(lowercase(a), lowercase(b)) | ||
eq(::CaseInsensitive, a, b) = isequal(lowercase(a), lowercase(b)) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# for testing custom ordering |
test/test_sorted_containers.jl
Outdated
function remove_spaces(s::String) | ||
b = Vector{UInt8}() | ||
for c in s | ||
if !isspace(c) | ||
push!(b,UInt8(c)) | ||
end | ||
end | ||
String(b) | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function remove_spaces(s::String) | |
b = Vector{UInt8}() | |
for c in s | |
if !isspace(c) | |
push!(b,UInt8(c)) | |
end | |
end | |
String(b) | |
end | |
remove_spaces(s::String) = replace(s, re"\s+"=>"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found out that it is actually r"\s+"
rather than re"\s+"
, but aside from that, this is done.
@@ -1765,3 +2093,4 @@ end | |||
end | |||
end | |||
end | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we label the new tests with a comment including the URL of the github issues that they prove is solved?
Renamed push_return_token! to push_return_semitoken!, which accurately describes the operation. Documented the fact that values(s) is not currently writeable.
@oxinabox I believe I have addressed all of your comments in this PR. In the last commit that I just made, I have added comments to the test sets to indicate where the various issues are tested. Let me know if you need anything else before you can merge this PR. |
src/DataStructures.jl
Outdated
@@ -39,12 +42,14 @@ module DataStructures | |||
export SortedDict, SortedMultiDict, SortedSet | |||
export SDToken, SDSemiToken, SMDToken, SMDSemiToken | |||
export SetToken, SetSemiToken | |||
export startof | |||
export startof, endof ## both are deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to explictly export deprecated things @deprecate
does that for us
export startof, endof ## both are deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
You have merge rights.
Merge when you are happy
Major update to sorted containers.
Address issue #239: Dict constructors on typed data but without explicit type declarations should no longer return
{Any,Any}
items because collect is now used when explicit types are omittedAddress issue #479: Deprecate
insert!
,pop!(s)
, introducepopfirst!
,poplast!
,ss_push!
,sd_push!
,smd_push!
Address issue #528:
push!
documentation improvedAddress issue #667: Docs extensively cleaned up and much greater use of documenter.jl
Address issue #669: Expanded use of tokens including
a:b
notation for tokensAddress issue #671: Add reverse iteration
Address issue #742: New union, merge, symdiff, setdiff, intersect algorithms for sorted containers to improve performance
Address issue #726: Clarify docs on
ss_push!
,sd_push!
, andsdm_push!
(formerlyinsert!
)Address issue #767: Implement firstindex, lastindex, reverse for sorted container iteration
Address issue #781: Use collect on an intermediate step
to infer types in untyped construction of sorted containers
Also, this commit cleans up the constructors and adds two new iteration-helper functions
inclusive_key
andexclusive_key
that were useful in my own work.