Skip to content

Extend std.range.chunks to work with non-forward input ranges.#5624

Merged
dlang-bot merged 2 commits intodlang:masterfrom
quickfur:input-chunks
Aug 1, 2017
Merged

Extend std.range.chunks to work with non-forward input ranges.#5624
dlang-bot merged 2 commits intodlang:masterfrom
quickfur:input-chunks

Conversation

@quickfur
Copy link
Member

Rationale: sometimes all you can get is an input range (e.g., File.byLine), and it is undesirable to have to buffer the entire input range just to be able to call chunks on it. This extension of chunks to support non-forward input ranges allows for the possibility of caching such ranges by chunks so that you process it incrementally while offering better range primitives on the cached portion of the range.

For example:

File("data").byLineCopy
    .chunks(bufSize)
    .map!(chunk => chunk.array) // cache each chunk on a buffer
    ... // now each chunk is a forward range and
        // you can do more stuff with it
        // Even though outer range is still only input range

For now, I'm implementing this as a separate overload of Chunks, due to the unfortunate historical accident that Chunks was exposed as a public API rather than encapsulated as a Voldemort or private module-global struct. The implementation is fundamentally different, since without forward range primitives the current code simply cannot be made to work correctly. Originally I tried to use static if inside the struct to separate the two implementations, but it was very messy and the diff was a sight to behold. So I'm keeping it as a separate overload for now.

@dlang-bot
Copy link
Contributor

dlang-bot commented Jul 17, 2017

Thanks for your pull request, @quickfur!

Bugzilla references

Auto-close Bugzilla Description
15759 chunks should work with only an input range

Copy link
Contributor

@wilzbach wilzbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I needed exactly this a while ago :)
So you can add "Fix 15759" to the commit message.

I made a few comments on my first round.

private size_t curSizeLeft;
bool empty;

this(Source _r, size_t _chunkSize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these parameter are publicly exposed, so I think it's better to use this.x = x below (even if it's a bit uglier).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer applicable in the new code.

private Source r;
private size_t chunkSize;
private size_t curSizeLeft;
bool empty;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is writable by the user -> should be private and wrapped .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
private Chunks* impl;

@property bool empty() { return impl.curSizeLeft == 0 || impl.r.empty; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be const if impl.r.empty is const. This is typically checked with sth. like:

static if (is(typeof((cast(const typeof(impl.r))impl.r).empty))) 

but it's not very clean and thus not done at many places in Phobos ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't dmd infer attributes on these methods, since the outer struct is a template?

empty = r.empty;
}

@property Chunk front() { return Chunk(&this); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried about this. Won't the wrapping Chunks get deconstructed when you leave the scope, e.g. by returning its front?
(we should at least test this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, just saw this comment... but this is already causing a breakage when compiled with -dip25 -dip1000 because the new isInputRange tests front with typeof((R r) => r.front), which will escape the reference to this, and causes it to reject this as being a range.

I suppose I should turn this into a class rather than a struct... in the original non-Phobos version of this in my own code, the wrapping Chunks is always allocated on the heap. The problem with -dip25 -dip1000 still exists, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just reworked the code a bit to move the implementation bits into a heap-allocated struct, so that the outer Chunks is just a wrapper around a pointer to it. This makes it cleaner and also fixes the problems with -dip25 -dip1000.

@wilzbach
Copy link
Contributor

due to the unfortunate historical accident that Chunks was exposed as a public API rather than encapsulated as a Voldemort or private module-global struct.

Btw two simple things that we could do to improve the status quo:

  • add the missing options (e.g. SearchPolicy, TransverseOptions) to bookmark table (or create a new one) and hide the quick index
  • undocument the publicly exposed structs, s.t. at least the docs look nicer for newcomers

Opinions?

@quickfur
Copy link
Member Author

Sounds like a good idea, kill off the autoindex (which is very ugly anyway) and undocument the wrapper structs. Don't make them private just yet, to avoid breaking any existing code that may actually refer to their names. But hide them from the docs so that no new code will continue to rely on them.

@wilzbach
Copy link
Contributor

Sounds like a good idea, kill off the autoindex (which is very ugly anyway) and undocument the wrapper structs. Don't make them private just yet, to avoid breaking any existing code that may actually refer to their names. But hide them from the docs so that no new code will continue to rely on them.

Great that we are in agreement -> #5625


@safe unittest
{
import std.algorithm.comparison : equal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import std.internal.test.dummyrange : ReferenceInputRange;
auto data = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ];
auto r = new ReferenceInputRange!int(data);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


private this(Source r, size_t chunkSize)
{
impl = new Impl(r, chunkSize, chunkSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is putting this on the heap really nessesary? It's just that you're throwing out @nogc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame that RefCounted is still unsafe...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about this. Ideally, we should use RefCounted here, I think byChunk does that, but then it's a toss-up between whether you want to keep @nogc or @safe, but you can't have both.

I'm OK either way -- if you guys feel it's more consistent to use RefCounted ala byChunk, then it's a relatively easy change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more arguments for RefCounted:

  • making safe RC (core.rc) is on the agenda, so if we use RefCounted now, chances are good that it might turn @safe in the future
  • it's easier to use @trusted on your code than the hacky and rather unknown assumeNoGc.
    In any case it's probably a good idea to open an issue about it, s.t. it's not forgotten.

fewer than $(D chunkSize) elements.

If `Source` is an input range but not a forward range, the resulting range and
chunks will be single-pass only: iterating over `front` will shrink the chunk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO there's no need for the long explaination; people should already know what an input range is if they're reading this

If Source is an input range and not a forward range, each chunk will also be an input range. Any calls to popFront on Chunks will also invalidate any lingering references to previous values in each chunk.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I just wanted to make sure people come with no expectations that we will cache range elements or otherwise make the resulting range "better behaved". A lot of Phobos algorithms will act funny when given a non-forward input range (try it sometime), and I have heard complaints in the past about why something doesn't behave "intuitively", even though that is impossible given input range semantics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If Source is a forward range, the resulting chunks will be forward ranges as well. Otherwise (i.e. Source is an input range), the resulting chunks will be input ranges consuming the same input."

@@ -7147,6 +7208,43 @@ if (isForwardRange!Source)
assert(equal(retro(array(chunks)), array(retro(chunks))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the following example to show the new functionality

int i;
auto inputRange = generate!(() => ++i).take(10);
auto chunked = inputRange.chunks(2);

assert(inputRange.front.equal([1, 2]));
assert(inputRange.front.empty);
inputRange.popFront;
assert(inputRange.front.equal([3, 4]));

or something similar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@quickfur
Copy link
Member Author

quickfur commented Jul 19, 2017

Hmm, I tried converting the heap allocation to use RefCounted instead, but got this error:

std/conv.d(4366): Error: static assert  "Impl cannot be emplaced from (RefCounted!(Impl, cast(RefCountedAutoInitialize)1)*)."
std/conv.d(4447):        instantiated from here: emplaceRef!(Impl, Impl, RefCounted!(Impl, cast(RefCountedAutoInitialize)1)*)
std/typecons.d(5221):        instantiated from here: emplace!(Impl, RefCounted!(Impl, cast(RefCountedAutoInitialize)1)*)
std/typecons.d(5312):        instantiated from here: initialize!(RefCounted!(Impl, cast(RefCountedAutoInitialize)1)*)
std/range/package.d(7160):        ... (1 instantiations, -v to show) ...
std/range/package.d(7185):        instantiated from here: Chunks!(Take!(Generator!(delegate () => i += 1)))
std/range/package.d(7217):        instantiated from here: chunks!(Take!(Generator!(delegate () => i += 1)))

Any ideas what might be wrong?

@JackStouffer
Copy link
Contributor

TBH when I asked about putting it on the heap, I was talking about in general and not about using the GC. Is there a specific reason why you're using an indirection here?

@quickfur
Copy link
Member Author

quickfur commented Jul 21, 2017

Very good question. The original code I drew this from actually did not have the indirection... but I quickly discovered that mixing non-forward input ranges with by-value semantics was a very bad combination, because in a complex UFCS pipeline, you easily end up inadvertently passing part of the state by value, thus causing the range wrapper to go out of sync with the input range being wrapped (esp. when the latter has by-reference semantics, or references external state). After hacking the code with various workarounds, eventually I just gave up and decided that the cleanest way to guarantee correct behaviour was to simply make the whole thing by-reference, including the Chunks.

Even that turned out to be insufficient in the face of structs' pass-by-value semantics, so eventually all state has to be stored in the wrapper's state, and Chunk itself became basically just a wrapper to this "mothership" wrapper in order to work in all cases.

@quickfur
Copy link
Member Author

I'd very much like the RefCounted version to be merged, actually. Following groupBy's pattern, I think this is a better solution in terms of the general direction of making Phobos @nogc. If I can only figure out why RefCounted is asserting... :-(

@quickfur
Copy link
Member Author

Gah, I'm an idiot. Was trying to assign new RefCounted!Impl(...) to impl... forgot to take out the "new" (what was I thinking?!). Works fine now, except the unittests are no longer @safe.

@JackStouffer
Copy link
Contributor

Hmm, do we want @safe or @nogc?

@quickfur
Copy link
Member Author

Haha, that was also my reaction. :-P Ideally both, but that will have to wait for RefCounted to become @safe. I haven't been following recent developments on that front, though... not sure what exactly is involved or how soon we can expect it.

@wilzbach
Copy link
Contributor

Hmm, do we want @safe or @nogc?

Copy/pasting my PoV from the sub-comments:

  • making safe RC (core.rc) is on the agenda, so if we use RefCounted now, chances are good that it might turn @safe in the future
  • it's easier to use @trusted on your code than the hacky and rather unknown assumeNoGc

@JackStouffer
Copy link
Contributor

not sure what exactly is involved or how soon we can expect it.

IIRC it requires the DIP1000 stuff to be fully working

fewer than $(D chunkSize) elements.

If `Source` is an input range but not a forward range, the resulting range and
chunks will be single-pass only: iterating over `front` will shrink the chunk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If Source is a forward range, the resulting chunks will be forward ranges as well. Otherwise (i.e. Source is an input range), the resulting chunks will be input ranges consuming the same input."


Returns: Forward range of all chunks with propagated bidirectionality,
conditional random access and slicing.
Returns: Input range of all chunks with propagated forwardness,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"forwardness" is a bit awkward, rephrase? Just say what you mean, you've already explained the details: "Returns: Range of chunks"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

bidirectionality, conditional random access and slicing.
*/
struct Chunks(Source)
if (isForwardRange!Source)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that implementation is distinct for input and forward ranges is an implementation detail. Use the weakest constraint here and use static if inside the struct to fork the implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, this is what I did but it made the diff look really ugly, that's why I went with splitting it into two overloads. But since you asked for it, I'll go back to doing it that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

size_t _chunkSize;
}

/// ditto
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge with the struct above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private Source r;
private size_t chunkSize;
private size_t curSizeLeft;
private bool _empty;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the underscore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent being overwritten from outside, as pointed out by @wilzbach's review.

private this(Source r, size_t chunkSize)
{
impl = RefCounted!Impl(r, chunkSize, chunkSize);
impl._empty = r.empty;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does impl = RefCounted!Impl(r, chunkSize, chunkSize, r.empty); work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
impl.curSizeLeft--;
impl.r.popFront();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this impl.curSizeLeft -= impl.r.popFrontN(impl.curSizeLeft);?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (!impl.r.empty)
impl.curSizeLeft = impl.chunkSize;
else
impl._empty = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like impl._empty is just a cache for impl.r.empty, why not eliminate? Alternatively, put a 0 in impl.chunkSize when done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't eliminate it because the original range may have been exhausted after the last chunk is iterated, but we still need to allow one more call to popFront to pop off the last chunk (conceptually, of course it's a no-op).

But your idea of setting impl.chunkSize to 0 when empty is a good one, it lets us eliminate _empty after all. I'll go with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@quickfur
Copy link
Member Author

@andralex The diff now looks rather ugly, but if you add ?w=1 to the Github URL, it will suppress whitespace diffs and will look cleaner. Here's the link.

Copy link
Member

@andralex andralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for a solid piece of work

@andralex
Copy link
Member

please submit a changelog entry too

}

/// Non-forward input ranges are supported, but with limited semantics.
/*@safe*/ unittest // FIXME: can't be @safe because RefCounted isn't.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @quickfur, this needs to be marked @system in order to pass circle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

assert(chunked.front.equal([3, 4]));
}

/*@safe*/ unittest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

`std.range.chunks` was extended to support non-forward input ranges.

Now `std.range.chunks` can be used with input ranges that are not forward
ranges, albeit with limited semantics as imposed by the underlying range.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quickfur Trailing whitespace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants