11# ` DataViewRowCursor ` Notes
22
3- This document includes some more in depth notes on some expert topics for
3+ This document includes some more in-depth notes on some expert topics for
44` DataViewRow ` and ` DataViewRowCursor ` derived classes.
55
66## ` Batch `
77
88Multiple cursors can be returned through a method like
99` IDataView.GetRowCursorSet ` . Operations can happen on top of these cursors --
1010most commonly, transforms creating new cursors on top of them for parallel
11- evaluation of a data pipeline. But the question is, if you need to "recombine"
12- them into a sequence again, how do to it? The ` Batch ` property is the
13- mechanism by which the data from these multiple cursors returned by
11+ evaluation of a data pipeline. But the question is if you need to "recombine"
12+ them into a sequence again, how to do it? The ` Batch ` property is the
13+ mechanism by which the data from these multiple cursors, returned by
1414` IDataView.GetRowCursorSet ` can be reconciled into a single, cohesive,
1515sequence.
1616
17- The question might be, why recombine. This can be done for several reasons: we
17+ The question might be, why recombine? This can be done for several reasons: we
1818may want repeatability and determinism in such a way that requires we view the
1919rows in a simple sequence, or the cursor may be stateful in some way that
2020precludes partitioning it, or some other consideration. And, since a core
21- ` IDataView ` design principle is repeatability, we now have a problem of how to
22- reconcile those separate partitioning.
21+ ` IDataView ` design principle is repeatability, we now have a problem with how to
22+ reconcile those separate partitions.
2323
2424Incidentally, for those working on the ML.NET codebase, there is an internal
2525utility method, ` DataViewUtils.ConsolidateGeneric ` , to perform this
26- function. It may be helpful to understand how it works intuitively, so that we
26+ function. It may be helpful to understand how it works intuitively so that we
2727can understand ` Batch ` 's requirements: when we reconcile the outputs of
2828multiple cursors, the consolidator will take the set of cursors. It will find
29- the one with the "lowest" ` Batch ` ID. (This must be uniquely determined: that
30- is, no two cursors should ever return the same ` Batch ` value.) It will iterate
29+ the one with the "lowest" ` Batch ` ID. (This must be uniquely determined:
30+ no two cursors should ever return the same ` Batch ` value.) It will iterate
3131on that cursor until the ` Batch ` ID changes. Whereupon, the consolidator will
3232find the next cursor with the next lowest batch ID (which should be greater,
3333of course, than the ` Batch ` value we were just iterating on).
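
The consolidation walk described above can be sketched roughly as follows. This is a toy Python model, not the actual `DataViewUtils.ConsolidateGeneric` code: cursors are modeled as plain iterables of hypothetical `(batch, payload)` pairs, and the function name `consolidate` is an assumption made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a cursor position: (batch id, row payload).
Row = Tuple[int, str]


def consolidate(cursors: List[Iterable[Row]]) -> Iterator[Row]:
    """Merge partitioned cursors back into one sequence, ordered by batch id."""
    # Each entry holds the current (not yet emitted) row of a cursor and its iterator.
    heads = []
    for cursor in cursors:
        it = iter(cursor)
        head = next(it, None)
        if head is not None:
            heads.append([head, it])

    while heads:
        # Find the cursor whose current row has the lowest batch id.
        entry = min(heads, key=lambda e: e[0][0])
        batch = entry[0][0]
        # Iterate on that cursor until its batch id changes or it is exhausted.
        while entry[0] is not None and entry[0][0] == batch:
            yield entry[0]
            entry[0] = next(entry[1], None)
        # Drop exhausted cursors before picking the next lowest batch.
        heads = [e for e in heads if e[0] is not None]


# Two cursors that split four batches between them.
cursor_a = [(0, "a"), (0, "b"), (2, "e")]
cursor_b = [(1, "c"), (1, "d"), (3, "f")]
merged = list(consolidate([cursor_a, cursor_b]))
```

The sketch relies on the same requirement the text states: at any moment no two cursors sit on the same batch value, so the "lowest" cursor is always uniquely determined.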
@@ -60,7 +60,7 @@ typical and perfectly fine for `Batch` to just be `0`.
6060
6161## ` MoveNext `
6262
63- Once ` MoveNext ` returns ` false ` , naturally all subsequent calls to either of
63+ Once ` MoveNext ` returns ` false ` , naturally, all subsequent calls to either of
6464that method should return ` false ` . It is important that they not throw, return
6565` true ` , or have any other behavior.
6666
@@ -73,7 +73,7 @@ over what is supposed to be the same data, for example, in an `IDataView` a
7373cursor set will produce the same data as a serial cursor, just partitioned,
7474and a shuffled cursor will produce the same data as a serial cursor or any
7575other shuffled cursor, only shuffled. The ID exists for applications that need
76- to reconcile which entry is actually which. Ideally this ID should be unique,
76+ to reconcile which entry is actually which. Ideally, this ID should be unique,
7777but for practical reasons, it suffices if collisions are simply extremely
7878improbable.
7979
@@ -104,18 +104,18 @@ follow, in order to ensure that downstream components have a fair shake at
104104producing unique IDs themselves, which I will here attempt to do:
105105
106106Duplicate IDs being improbable is practically accomplished with a
107- hashing-derived mechanism. For this we have the ` DataViewRowId ` methods
107+ hashing-derived mechanism. For this, we have the ` DataViewRowId ` methods
108108` Fork ` , ` Next ` , and ` Combine ` . See their documentation for specifics, but they
109109all have in common that they treat the ` DataViewRowId ` as some sort of
110- intermediate hash state, then return a new hash state based on hashing of a
110+ intermediate hash state, then return a new hash state based on the hashing of a
111111block of additional bits. (Since the additional bits hashed in ` Fork ` and
112112` Next ` are specific, that is, effectively ` 0 ` , and ` 1 ` , this can be very
113113efficient.) The basic assumption underlying all of this is that collisions
114114between two different hash states on the same data, or hashes on the same hash
115115state on different data, are unlikely.
116116
117117Note that this is also the reason why ` DataViewRowId ` was introduced;
118- collisions become likely when we have the number of elements on the order of
118+ collisions become likely when we have the number of elements in the order of
119119the square root of the hash space. The square root of ` UInt64.MaxValue ` is
120120only several billion, a totally reasonable number of instances in a dataset,
121121whereas a collision in a 128-bit space is less likely.
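
To put rough numbers on the square-root argument, here is a small Python sketch using the standard birthday-bound approximation, p ≈ 1 - exp(-n² / 2N), for n IDs drawn uniformly from a space of size N. The uniformity assumption and the function name `collision_probability` are ours, not part of the ML.NET API.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n uniform IDs of the given width."""
    space = 2.0 ** bits
    # -expm1(-x) equals 1 - exp(-x), but stays accurate when x is tiny.
    return -math.expm1(-(n * n) / (2.0 * space))


# The square root of the 64-bit space: about 4.29 billion rows.
rows = math.isqrt(2**64 - 1)

p64 = collision_probability(rows, 64)    # roughly 0.39
p128 = collision_probability(rows, 128)  # roughly 2.7e-20
```

With a dataset of that size, a 64-bit ID would collide more than a third of the time under this model, while the same dataset in a 128-bit space is vanishingly unlikely to see a collision.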
@@ -142,24 +142,24 @@ operate on acceptable sets.
142142
1431434 . As a generalization of the above, if for each element of an acceptable set,
144144 you built the set comprised of the single application of ` Fork ` on that ID
145- followed by the set of any number of application of ` Next ` , the union of
145+ followed by the set of any number of applications of ` Next ` , the union of
146146 all such sets would itself be an acceptable set. (This is useful, for
147147 example, for operations that produce multiple items per input item. So, if
148- you produced two rows based on every single input row, if the input ID were
148+ you produced two rows based on every single input row and if the input ID were
149149 _id_, then, the ID of the first row could be ` Fork ` of _id_, and the second
150150 row could have ID of ` Fork ` then ` Next ` of the same _id_.)
151151
1521525 . If you have potentially multiple acceptable sets, while the union of them
153153 obviously might not be acceptable, if you were to form a mapping from each
154154 set, to a different ID of some other acceptable set (each such ID should be
155155 different), and then for each such set/ID pairing, create the set created
156- from ` Combine ` of the items of that set with that ID, and then union of
156+ from ` Combine ` of the items of that set with that ID, and then the union of
157157 those sets will be acceptable. (This is useful, for example, if you had
158158 something like a join, or a Cartesian product transform, or something like
159159 that.)
160160
1611616 . Moreover, similar to the note about the use of ` Fork ` , and ` Next ` , if
162- during the creation of one of those sets describe above, you were to form
162+ during the creation of one of those sets described above, you were to form
163163 for each item of that set, a set resulting from multiple applications of
164164 ` Next ` , the union of all those would also be an acceptable set.
165165
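
As a rough illustration of rules 4 and 5, here is a toy Python model. It does not reproduce how `DataViewRowId` actually hashes: `fork`, `next_id`, and `combine` below are hypothetical stand-ins built on BLAKE2b with a 128-bit digest, chosen only because they share the stated property of taking a hash state and mixing in more bits (a `0` for `Fork`, a `1` for `Next`). The small integer input IDs are likewise an assumption.

```python
import hashlib
from itertools import product


def _mix(state: bytes, extra: bytes) -> bytes:
    """Toy hash step: derive a new 128-bit state from the old state plus extra bits."""
    return hashlib.blake2b(state + extra, digest_size=16).digest()


def fork(row_id: bytes) -> bytes:
    return _mix(row_id, b"\x00")   # hypothetical analogue of Fork


def next_id(row_id: bytes) -> bytes:
    return _mix(row_id, b"\x01")   # hypothetical analogue of Next


def combine(row_id: bytes, other: bytes) -> bytes:
    return _mix(row_id, other)     # hypothetical analogue of Combine


# Assumed acceptable input set: four distinct 128-bit IDs.
inputs = [i.to_bytes(16, "little") for i in range(4)]

# Rule 4: two output rows per input row, Fork(id) and then Next(Fork(id)).
two_per_row = []
for row_id in inputs:
    first = fork(row_id)
    second = next_id(first)
    two_per_row += [first, second]

# Rule 5: a Cartesian product, pairing every left ID with every right ID.
left = inputs[:3]
right = inputs
joined = [combine(l, r) for l, r in product(left, right)]
```

In this model the eight IDs from the one-to-two expansion and the twelve IDs from the product are all distinct with overwhelming probability, which is the sense of "acceptable" the rules aim for.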
@@ -193,12 +193,12 @@ transformations, or other such things like this, in which case the details
193193above become important.
194194
195195One common thought that comes up is the idea that we can have some "global
196- position" instead of ID. This was actually the first idea by the original
197- implementor, and if if it *were* possible it would definitely make for a
196+ position" instead of ID. This was actually the first idea of the original
197+ implementor, and if it *were* possible it would definitely make for a
198198cleaner, simpler solution, and multiple people have asked the question to the
199199point where it would probably be best to have a ready answer about where it
200- broke down, to undersatnd how it fails. It runs afoul of the earlier desire
201- with regard to data view cursor sets, that is, that ` IDataView ` cursors
200+ broke down, to understand how it fails. It runs afoul of the earlier desire
201+ with regard to data view cursor sets, that is, ` IDataView ` cursors
202202should, if possible, present split cursors that can run independently on
203203"batches" of the data. But, let's imagine something like the operation for
204204filtering; if I have a batch ` 0 ` comprised of 64 rows, and a batch ` 1 ` with
@@ -209,4 +209,4 @@ why we wanted to have cursor sets in the first place. The same is true also
209209for one-to-many ` IDataView ` implementations (for example, joins, or something
210210like that), where even a strictly increasing (but not necessarily contiguous)
211211value may not be possible, since you cannot even bound the number. So,
212- regrettably, that simpler solution would not work.
212+ regrettably, that simpler solution would not work.