Implement chunkBySize and splitInto in Seq, List and Array #261

Closed
wants to merge 1 commit into from

PatrickMcDonald
Contributor

This PR implements chunksOf: int -> M('T) -> M(M('T)) and divideInto: int -> M('T) -> M(M('T))

See fsharp/fslang-design#25

CheckThrowsArgumentException (fun () -> Seq.divideInto 3 nullSeq |> ignore)

// invalidArg
CheckThrowsArgumentException (fun () -> Seq.divideInto 0 [1..10] |> ignore)
Contributor

Should the test also check the argument name (the ParamName property of ArgumentException)?

Contributor Author

This isn't done elsewhere in the tests that I can see, and if it is done here it should probably be changed for all ArgumentException checks.
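As an illustration of the check being discussed, here is a minimal sketch of asserting on the argument name as well as the exception type. `chunkSketch` and `thrownParamName` are hypothetical names introduced for this example, and the message text is made up; neither is part of the PR or of the existing test helpers.

```fsharp
open System

// Hypothetical stand-in for the function under test; only the argument check matters here.
let chunkSketch (chunkSize: int) (items: 'T list) : 'T list list =
    // invalidArg raises ArgumentException with ParamName set to the first argument.
    if chunkSize <= 0 then invalidArg "chunkSize" "The input must be positive."
    List.chunkBySize chunkSize items

// Hypothetical helper: runs f and returns the ParamName of the
// ArgumentException it throws, or None if nothing was thrown.
let thrownParamName (f: unit -> unit) : string option =
    try
        f ()
        None
    with
    | :? ArgumentException as e -> Some e.ParamName

// A test could then compare against the expected argument name.
let observed = thrownParamName (fun () -> chunkSketch 0 [ 1 .. 10 ] |> ignore)
```

With a helper like this, a test can fail when the wrong argument is blamed, not only when no exception is thrown.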

@rojepp
Contributor

rojepp commented Feb 23, 2015

This is confusing for me. I thought it was supposed to be chunkBy ('t -> 'key) instead of int?

@PatrickMcDonald
Contributor Author

@rojepp I was asked to implement this version. I guess this would not prevent the original version from being implemented if the demand is there for it.

[<CompiledName("ChunksOf")>]
let chunksOf chunkSize (array:'T[]) =
    checkNonNull "array" array
    if chunkSize <= 0 then invalidArg "chunkSize" (SR.GetString(SR.inputMustBeNonNegative))
Contributor

We need an error message inputMustBePositive, since 0 is non-negative but still invalid. From a quick scan, about 80% of usages for this message are correct (check for < 0) but a handful are inaccurate (check for <= 0).

Should not block this contribution, just something to look at later.

Contributor Author

👍 I knew it wasn't quite right, but it was used in windowed with the same check and I assumed there was more to it than just adding a new constant :)

@latkin
Contributor

latkin commented Feb 23, 2015

Implementation looks good, and nice test coverage, just a few comments there. On design I still wonder about a few things.


I understand overall goal of keeping return types same as input types, but I wonder whether it's worth it for functions like this.

  • e.g. for Seq, internally return object is Seq<'t[]> but we spend time/memory to hide this and return Seq<Seq<'t>>
  • On the other hand, for List it is Lists all the way through (at least with current impl) and that offers various benefits

For divideInto, I wonder what the most intuitive design is. Current design results in user having 2 unfortunate uncertainties, all the time:

  1. When I ask for divideInto n <length m> I don't know if the result will actually be length n (regardless of whether n < m)
  2. I don't know if each of the returned collections will have the same length.

No. 2 is unavoidable unless n|m, and we are fine in that case.

If n > m then the only way to provide guaranteed result length is to include a bunch of empty results... maybe that's ok but seems wasteful.

If n <= m then we can provide guaranteed result length by chunking differently: return n objs with length (m/n), with first element getting the remainder. e.g. for divideInto 5 <length 7> current impl returns 4 objs, lengths 2, 2, 2, 1. Alternative is to return 5 objs, lengths 3, 1, 1, 1, 1.
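To make the two schemes concrete, the sketch below computes only the resulting group lengths. It assumes the current behaviour amounts to chunking by size ceil(m / n), which matches the 2, 2, 2, 1 example above; `ceilingChunkSizes` and `remainderFirstSizes` are hypothetical names, not functions from the PR.

```fsharp
// Lengths produced by chunking m elements with chunk size ceil(m / n).
// Assumes n > 0; may return fewer than n groups.
let ceilingChunkSizes (n: int) (m: int) : int list =
    if m = 0 then []
    else
        let size = (m + n - 1) / n
        [ for start in 0 .. size .. m - 1 -> min size (m - start) ]

// Alternative: always n groups of length m / n, with the first group
// also taking the remainder. Assumes 0 < n <= m.
let remainderFirstSizes (n: int) (m: int) : int list =
    (m / n + m % n) :: List.replicate (n - 1) (m / n)

let current = ceilingChunkSizes 5 7       // [2; 2; 2; 1]
let alternative = remainderFirstSizes 5 7 // [3; 1; 1; 1; 1]
```

The first scheme keeps groups near-equal but does not guarantee n results; the second guarantees n results but can make the first group much larger than the rest.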


Sorry for long post, just going through possibilities in my head. Would love for someone to convince me that these are bad ideas. 😆

@PatrickMcDonald
Contributor Author

On the types, one alternative option would be to use the following:

Seq.chunksOf: int -> Seq<'T> -> Seq<'T []>
Array.chunksOf: int -> 'T [] -> 'T [] []
List.chunksOf: int -> 'T list -> 'T list list

It could be argued that maybe we should have gone this route with windowed, given that we already had that signature for Seq.windowed. However we chose consistency over the first option. So another option is to reuse the windowed signature for divideInto and chunksOf:

Seq.chunksOf: int -> Seq<'T> -> Seq<'T []>
Array.chunksOf: int -> 'T [] -> 'T [] []
List.chunksOf: int -> 'T list -> 'T [] list

(One way of thinking about chunksOf is that it's kind of like windowed only with fewer windows, so maybe they should have the same signatures.)

If we go with the first option, we should consider revisiting List.windowed
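For reference, the shapes in the first option can be checked with type annotations. This sketch uses `chunkBySize`, the name the function has in released FSharp.Core, instead of the `chunksOf` name used at this point in the thread; the bindings themselves are illustrative.

```fsharp
// Each annotation only compiles if the result type matches the first option.
let seqChunks : seq<int[]> = Seq.chunkBySize 3 (seq { 1 .. 10 })
let arrayChunks : int[][] = Array.chunkBySize 3 [| 1 .. 10 |]
let listChunks : int list list = List.chunkBySize 3 [ 1 .. 10 ]
```

Only the Seq version changes the inner collection type (to arrays); Array and List keep their own type at both levels.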

@latkin
Contributor

latkin commented Feb 23, 2015

Hmm, interesting. Yes, windowed for array and list hasn't shipped yet, so I guess we can still tweak the signature. Your option 1 looks like a nice possibility.

@dsyme
Contributor

dsyme commented Feb 23, 2015

Hi all,

First, thank you @PatrickMcDonald for looking at this. As I've returned to doing bread-and-butter F# programming (rather than compiler work) in the context of updating Expert F# and doing some distributed cloud programming with MBrace, and based on years of seeing people want these functions, I've become convinced that we should add these to F#.

So I felt we could at least put good implementations/designs of them on the table for F# 4.0, incorporating them should schedule permit. This is why I mentioned to @PatrickMcDonald that he might take a look at this. Thanks again for this. Equally, we all understand that schedule might not permit, and we'll all be ok with that :)

OK, so on to the feedback...

First naming.... I suggest we use chunkBySize and dividedInto.

I'm uncomfortable with both of the names "chunksOf" and "chunkBy"

  • "chunks" literally makes me feel queasy (there are connotations). That said, I quite like the verb "chunk" for some reason - it's sufficiently undefined that we can give it a behaviour.
  • "Of" is used for type conversions in F# collections parlance
  • "chunks" is a noun and all the other aggregate operators use verbs (map) or adjectives (windowed).
  • a naked "By" indicates a key projection.
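The naming conventions referred to in these bullets can be seen in existing FSharp.Core functions. A small illustrative sketch (the sample values are made up):

```fsharp
let words = [ "pear"; "fig"; "banana" ]

// "By" takes a key projection: sort by the length of each word.
let byLength = words |> List.sortBy String.length

// "of" converts between collection types.
let asArray = Array.ofList words

// An adjective-style name for an aggregate operator.
let pairs = [ 1 .. 4 ] |> List.windowed 2
```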

I'm somewhat loath to use up "Into" as a suffix, as it has a potential variety of other meanings - for some collection libraries, for example, it means "give a continuation which consumes the results", as in "zipInto". But I can't find anything better than divideInto, it's a tricky operation to name.

For the semantics of divideInto/splitByGroupCount, we should definitely split groups as evenly as possible. One of the typical use cases of this function is to divide work into evenly balanced groups in multi-core or distributed algorithms, e.g. see here and here. It seems simplest to use a design that puts extra elements to the left, e.g. [2; 2; 1; 1] and [2; 2; 2; 1; 1] and [2; 2; 1; 1; 1] when uneven splitting is needed.
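The even split with extra elements to the left can be sketched as a calculation of group lengths. `evenSplitSizes` is a hypothetical helper written for this illustration, not the PR's implementation; it assumes `count > 0` and caps the number of groups at the input length.

```fsharp
// Lengths of the groups when len elements are split into (at most) count
// groups whose sizes differ by no more than 1, larger groups first.
let evenSplitSizes (count: int) (len: int) : int list =
    if len = 0 then []
    else
        let groups = min count len
        let baseSize = len / groups
        let extras = len % groups   // this many leading groups get one extra element
        [ for i in 0 .. groups - 1 -> if i < extras then baseSize + 1 else baseSize ]

let a = evenSplitSizes 4 6   // [2; 2; 1; 1]
let b = evenSplitSizes 5 8   // [2; 2; 2; 1; 1]
let c = evenSplitSizes 5 7   // [2; 2; 1; 1; 1]
```

These reproduce the three example shapes given above, for inputs of length 6, 8 and 7 respectively.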

For the types, I agree with option 1 of @PatrickMcDonald's suggestion here. Indeed I would suggest we also consider changing List.windowed (which is new in F# 4.0) to return a list-of-lists, for consistency, which would undo a previous decision we've made but consistency is very important here.

@latkin
Contributor

latkin commented Feb 24, 2015

Other possible verbs: partition, bin, split.

So @dsyme you would suggest the bin sizes never differing by more than 1, with bigger bins front-loaded?

Fascinating that words in this space ("chunks", "runs") evoke such unpleasant stuff 😷

@tpetricek
Contributor

The chunking functions in Deedle might be relevant to the discussion. See the relevant documentation section.

  • We use fooInto when specifying continuation.
  • We use a couple of options to specify what to do with incomplete chunks.
  • We also have chunkWhile where you specify 'k -> 'k -> bool (to start new chunk after specified condition on keys holds) and (even fancier :-)) chunkDist.
  • We don't have a divide function yet (AFAIK)

@dsyme
Contributor

dsyme commented Feb 24, 2015

Thanks Tomas, that's very useful

It looks like in Deedle chunkWhile should be better called chunkWith (unless there's some Pandas reason for the former?) - following the FSharp.Core convention that With indicates a comparer function (though in this case it's an equality function, rather than a comparer).

@PatrickMcDonald
Contributor Author

I initially used chunk but changed to chunksOf on the basis that there were many ways of chunking. Seeing as Deedle uses the same meaning for chunk then this is also something we could consider.

Some other possibilities for divideInto: apportion, divvy, divide

@dsyme
Contributor

dsyme commented Feb 24, 2015

@latkin - yes, bins never differing by more than 1, with bigger bins front-loaded

To finalize the design for this, we need to

  1. choose between "chunk" and "chunkBySize". I favour the latter.
  2. determine if there are any other viable names for "divideInto". Of the above only "apportion" strikes me as viable (partition is used, split doesn't give the right sense, bin has other meanings).

@dsyme
Contributor

dsyme commented Feb 24, 2015

Here are the online dictionary synonyms for apportion :)

Synonyms: accord, admeasure, administer, allocate, allot, assign, bestow, cut, cut up, deal, dispense, distribute, divvy, divvy up, dole out, give, lot, measure, mete, parcel, part, partition, piece up, prorate, ration, slice, split, split up

@dsyme
Contributor

dsyme commented Feb 24, 2015

Giving possibilities "cutInto" and "splitInto" (the Into clarifies the meaning of split).

If you had to vote between "splitInto" and "divideInto" which would you choose and why? Any other suggestions? Thx

@PatrickMcDonald
Contributor Author

There's also distribute

there's two hard problems in computer science: we only have one joke and it's not funny

@PatrickMcDonald
Contributor Author

An alternative to ...Into is ...Between

@enricosada
Contributor

I vote for chunk because it is easy to understand.

chunkBySize seems wrong. What is size? The position?

When I search for chunk in Google, I always get the answer for "split list into sublists of even size".

Others: cluster, groupBySize

Quick search for other languages (consistency is good, easy to search)

@dsyme
Contributor

dsyme commented Feb 25, 2015

Thanks, great feedback. Could you add links to the library functions of these other languages?

@rojepp
Contributor

rojepp commented Feb 25, 2015

[1 .. 10] |> chunksOf 3 reads nicely. :)

| len ->
    let chunkCons = freshConsNoTail list.Head
    let res = freshConsNoTail chunkCons
    let (lenDivCount, lenModCount) = System.Math.DivRem(len, min len count)
Contributor Author

This line is failing in one of the portable builds; I'll fix it.

@PatrickMcDonald changed the title from "Implement chunksOf and divideInto in Seq, List and Array" to "Implement chunkBySize and divideInto in Seq, List and Array" on Feb 25, 2015
@enricosada
Contributor

@dsyme updated with references.
Sorry, I added these from mobile this morning; the list is more accurate now.

@latkin
Contributor

latkin commented Feb 25, 2015

Mathematica uses Partition. Though as @dsyme mentioned earlier, we already use this name in List and Array.

@latkin
Contributor

latkin commented Feb 25, 2015

I like splitInto over divideInto. Given existing purely mathematical operations sum, average, min, and max, I don't think "divide" should be a structural operation. Especially when we have "splitAt" which is a closely-related structural operation. "splitInto" and "splitAt" seems like a natural pair, both name-wise and action-wise.

@dsyme
Contributor

dsyme commented Feb 26, 2015

@latkin - yes, I think "arr |> Array.splitInto 10" does read very well.

OK, so let's settle on

chunkBySize
splitInto

@PatrickMcDonald - can you make the updates to finalize this?

Thanks
Don
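A short usage sketch of the two names settled on here, as they behave in released FSharp.Core. The input values are illustrative.

```fsharp
// chunkBySize: fixed chunk size, the last chunk may be shorter.
let chunks = [ 1 .. 10 ] |> List.chunkBySize 3
// [[1; 2; 3]; [4; 5; 6]; [7; 8; 9]; [10]]

// splitInto: at most `count` groups, sizes differ by no more than 1,
// larger groups first.
let parts = [ 1 .. 10 ] |> List.splitInto 3
// [[1; 2; 3; 4]; [5; 6; 7]; [8; 9; 10]]

// When count exceeds the length, no empty groups are produced.
let small = [| 1 .. 5 |] |> Array.splitInto 10
// five single-element arrays
```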

@PatrickMcDonald changed the title from "Implement chunkBySize and divideInto in Seq, List and Array" to "Implement chunkBySize and splitInto in Seq, List and Array" on Feb 26, 2015
@PatrickMcDonald
Contributor Author

Fantastic, I'll make these changes this evening.

@latkin Do you have a strong opinion against me rebasing this branch to tidy up the history? I know they get rebased anyway but the history of commits gets left in the final commit message.

@latkin
Contributor

latkin commented Feb 26, 2015

@PatrickMcDonald I have no preference either way, feel free to rearrange as you see fit

@latkin
Contributor

latkin commented Feb 27, 2015

LGTM. Some tricky index arithmetic in this one!

@latkin closed this in a1a27a4 on Feb 27, 2015
@latkin added the fixed label on Feb 27, 2015
@PatrickMcDonald deleted the chunk branch on March 1, 2015 10:12
@ErikSchierboom
Contributor

Great addition! Looking forward to using this.

7 participants