
Commit 3ea882c
committed

Merge remote-tracking branch 'origin/main' into reorganize_dataframe_files

# Conflicts:
#	src/Microsoft.Data.Analysis/DataFrameColumns/ArrowStringDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/DataFrameColumns/StringDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/PrimitiveColumnContainer.cs

2 parents 8e8aa01 + 796cb35

File tree

7 files changed (+40 −28 lines)

docs/code/DataViewRowCursor.md

Lines changed: 24 additions & 24 deletions
@@ -1,33 +1,33 @@
 # `DataViewRowCursor` Notes
 
-This document includes some more in depth notes on some expert topics for
+This document includes some more in-depth notes on some expert topics for
 `DataViewRow` and `DataViewRowCursor` derived classes.
 
 ## `Batch`
 
 Multiple cursors can be returned through a method like
 `IDataView.GetRowCursorSet`. Operations can happen on top of these cursors --
 most commonly, transforms creating new cursors on top of them for parallel
-evaluation of a data pipeline. But the question is, if you need to "recombine"
-them into a sequence again, how do to it? The `Batch` property is the
-mechanism by which the data from these multiple cursors returned by
+evaluation of a data pipeline. But the question is if you need to "recombine"
+them into a sequence again, how to do it? The `Batch` property is the
+mechanism by which the data from these multiple cursors, returned by
 `IDataView.GetRowCursorSet` can be reconciled into a single, cohesive,
 sequence.
 
-The question might be, why recombine. This can be done for several reasons: we
+The question might be, why recombine? This can be done for several reasons: we
 may want repeatability and determinism in such a way that requires we view the
 rows in a simple sequence, or the cursor may be stateful in some way that
 precludes partitioning it, or some other consideration. And, since a core
-`IDataView` design principle is repeatability, we now have a problem of how to
-reconcile those separate partitioning.
+`IDataView` design principle is repeatability, we now have a problem with how to
+reconcile those separate partitions.
 
 Incidentally, for those working on the ML.NET codebase, there is an internal
 method `DataViewUtils.ConsolidateGeneric` utility method to perform this
-function. It may be helpful to understand how it works intuitively, so that we
+function. It may be helpful to understand how it works intuitively so that we
 can understand `Batch`'s requirements: when we reconcile the outputs of
 multiple cursors, the consolidator will take the set of cursors. It will find
-the one with the "lowest" `Batch` ID. (This must be uniquely determined: that
-is, no two cursors should ever return the same `Batch` value.) It will iterate
+the one with the "lowest" `Batch` ID. (This must be uniquely determined:
+no two cursors should ever return the same `Batch` value.) It will iterate
 on that cursor until the `Batch` ID changes. Whereupon, the consolidator will
 find the next cursor with the next lowest batch ID (which should be greater,
 of course, than the `Batch` value we were just iterating on).
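The consolidation scheme this hunk describes — repeatedly drain whichever live cursor holds the lowest `Batch` ID — can be sketched in a few lines. This is an illustrative Python model, not ML.NET's internal `DataViewUtils.ConsolidateGeneric` (which is C#); all names here are invented:

```python
import heapq

def consolidate(cursors):
    """Recombine partitioned cursors into one sequence ordered by Batch ID.

    Each cursor yields (batch, row) pairs whose batch values are
    non-decreasing, and no two cursors ever share a batch value, so
    always draining the cursor that currently holds the lowest batch ID
    reconstructs a single cohesive sequence.
    """
    heap = []
    for tie_break, cursor in enumerate(cursors):
        it = iter(cursor)
        head = next(it, None)
        if head is not None:
            heap.append((head[0], tie_break, head[1], it))
    heapq.heapify(heap)
    while heap:
        batch, tie_break, row, it = heapq.heappop(heap)
        yield row
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head[0], tie_break, head[1], it))
```

For example, consolidating one cursor yielding batches 0 and 2 with another yielding batch 1 interleaves their rows back into batch order.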
@@ -60,7 +60,7 @@ typical and perfectly fine for `Batch` to just be `0`.
 
 ## `MoveNext`
 
-Once `MoveNext` returns `false`, naturally all subsequent calls to either of
+Once `MoveNext` returns `false`, naturally, all subsequent calls to either of
 that method should return `false`. It is important that they not throw, return
 `true`, or have any other behavior.
 
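The contract this hunk touches (once `MoveNext` returns false, it must keep returning false, never throw or resume) can be modeled with a small hypothetical wrapper; this is a Python sketch, not an ML.NET type:

```python
class LatchedCursor:
    """Illustrative cursor enforcing the MoveNext contract: after the
    first False, every later call returns False rather than throwing
    or yielding more rows."""

    def __init__(self, rows):
        self._it = iter(rows)
        self._done = False
        self.current = None

    def move_next(self):
        if self._done:
            return False  # latched: stay False forever
        try:
            self.current = next(self._it)
            return True
        except StopIteration:
            self._done = True
            return False
```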

@@ -73,7 +73,7 @@ over what is supposed to be the same data, for example, in an `IDataView` a
 cursor set will produce the same data as a serial cursor, just partitioned,
 and a shuffled cursor will produce the same data as a serial cursor or any
 other shuffled cursor, only shuffled. The ID exists for applications that need
-to reconcile which entry is actually which. Ideally this ID should be unique,
+to reconcile which entry is actually which. Ideally, this ID should be unique,
 but for practical reasons, it suffices if collisions are simply extremely
 improbable.
 

@@ -104,18 +104,18 @@ follow, in order to ensure that downstream components have a fair shake at
 producing unique IDs themselves, which I will here attempt to do:
 
 Duplicate IDs being improbable is practically accomplished with a
-hashing-derived mechanism. For this we have the `DataViewRowId` methods
+hashing-derived mechanism. For this, we have the `DataViewRowId` methods
 `Fork`, `Next`, and `Combine`. See their documentation for specifics, but they
 all have in common that they treat the `DataViewRowId` as some sort of
-intermediate hash state, then return a new hash state based on hashing of a
+intermediate hash state, then return a new hash state based on the hashing of a
 block of additional bits. (Since the additional bits hashed in `Fork` and
 `Next` are specific, that is, effectively `0`, and `1`, this can be very
 efficient.) The basic assumption underlying all of this is that collisions
 between two different hash states on the same data, or hashes on the same hash
 state on different data, are unlikely to collide.
 
 Note that this is also the reason why `DataViewRowId` was introduced;
-collisions become likely when we have the number of elements on the order of
+collisions become likely when we have the number of elements in the order of
 the square root of the hash space. The square root of `UInt64.MaxValue` is
 only several billion, a totally reasonable number of instances in a dataset,
 whereas a collision in a 128-bit space is less likely.
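The `Fork`/`Next`/`Combine` scheme in this hunk — treat the 128-bit ID as hash state, absorb a block of bits, return a new state — can be mimicked with a toy Python model. This is only a sketch: the real `DataViewRowId` has its own mixing function, and BLAKE2 here is an arbitrary stand-in:

```python
import hashlib

def _mix(state, block):
    """Absorb a 128-bit block into a 128-bit hash state, returning a
    new state. Stand-in for DataViewRowId's actual mixing."""
    digest = hashlib.blake2b(
        state.to_bytes(16, "little") + block.to_bytes(16, "little"),
        digest_size=16,
    ).digest()
    return int.from_bytes(digest, "little")

def fork(rid):
    return _mix(rid, 0)      # Fork: absorb the constant 0

def next_id(rid):
    return _mix(rid, 1)      # Next: absorb the constant 1

def combine(rid, other):
    return _mix(rid, other)  # Combine: absorb another full 128-bit ID
```

Following the acceptable-set rules given later in this document, a transform emitting two output rows per input row with ID _id_ could assign them `fork(id)` and `next_id(fork(id))`; collisions are possible in principle but astronomically improbable in a 128-bit space.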
@@ -142,24 +142,24 @@ operate on acceptable sets.
 
 4. As a generalization of the above, if for each element of an acceptable set,
    you built the set comprised of the single application of `Fork` on that ID
-   followed by the set of any number of application of `Next`, the union of
+   followed by the set of any number of applications of `Next`, the union of
    all such sets would itself be an acceptable set. (This is useful, for
    example, for operations that produce multiple items per input item. So, if
-   you produced two rows based on every single input row, if the input ID were
+   you produced two rows based on every single input row and if the input ID were
    _id_, then, the ID of the first row could be `Fork` of _id_, and the second
    row could have ID of `Fork` then `Next` of the same _id_.)
 
 5. If you have potentially multiple acceptable sets, while the union of them
    obviously might not be acceptable, if you were to form a mapping from each
    set, to a different ID of some other acceptable set (each such ID should be
    different), and then for each such set/ID pairing, create the set created
-   from `Combine` of the items of that set with that ID, and then union of
+   from `Combine` of the items of that set with that ID, and then the union of
    those sets will be acceptable. (This is useful, for example, if you had
    something like a join, or a Cartesian product transform, or something like
    that.)
 
 6. Moreover, similar to the note about the use of `Fork`, and `Next`, if
-   during the creation of one of those sets describe above, you were to form
+   during the creation of one of those sets described above, you were to form
    for each item of that set, a set resulting from multiple applications of
    `Next`, the union of all those would also be an acceptable set.
 

@@ -193,12 +193,12 @@ transformations, or other such things like this, in which case the details
 above become important.
 
 One common thought that comes up is the idea that we can have some "global
-position" instead of ID. This was actually the first idea by the original
-implementor, and if if it *were* possible it would definitely make for a
+position" instead of ID. This was actually the first idea of the original
+implementor, and if it *were* possible it would definitely make for a
 cleaner, simpler solution, and multiple people have asked the question to the
 point where it would probably be best to have a ready answer about where it
-broke down, to undersatnd how it fails. It runs afoul of the earlier desire
-with regard to data view cursor sets, that is, that `IDataView` cursors
+broke down, to understand how it fails. It runs afoul of the earlier desire
+with regard to data view cursor sets, that is, `IDataView` cursors
 should, if possible, present split cursors that can run independently on
 "batches" of the data. But, let's imagine something like the operation for
 filtering; if I have a batch `0` comprised of 64 rows, and a batch `1` with
@@ -209,4 +209,4 @@ why we wanted to have cursor sets in the first place. The same is true also
 for one-to-many `IDataView` implementations (for example, joins, or something
 like that), where even a strictly increasing (but not necessarily contiguous)
 value may not be possible, since you cannot even bound the number. So,
-regrettably, that simpler solution would not work.
\ No newline at end of file
+regrettably, that simpler solution would not work.
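The breakdown these hunks describe can be made concrete with a toy Python sketch of assigning "global positions" after a filter: each batch's starting position depends on how many rows of every earlier batch survived, so the batches cannot be processed independently (names here are illustrative only):

```python
def global_positions(batches, keep):
    """Assign a global position to each row surviving a filter.

    The running offset is exactly where parallelism fails: batch N's
    first position is unknowable until all earlier batches have been
    fully filtered, defeating the point of independent cursor sets.
    """
    out = []
    offset = 0
    for batch in batches:
        survivors = [row for row in batch if keep(row)]
        out.append(list(range(offset, offset + len(survivors))))
        offset += len(survivors)
    return out
```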

src/Microsoft.Data.Analysis/DataFrameColumns/ArrowStringDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -226,7 +226,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex, out int indexInBuffe
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
 
             // Since the strings here could be of variable length, scan linearly

src/Microsoft.Data.Analysis/DataFrameColumns/StringDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -82,7 +82,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
             return (int)(rowIndex / MaxCapacity);
         }

src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -84,7 +84,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
 
             return (int)(rowIndex / MaxCapacity);

src/Microsoft.Data.Analysis/PrimitiveColumnContainer.cs

Lines changed: 1 addition & 1 deletion

@@ -313,7 +313,7 @@ public int GetIndexOfBufferContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
            {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
             return (int)(rowIndex / ReadOnlyDataFrameBuffer<T>.MaxCapacity);
         }
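All four renamed call sites guard the same chunked-buffer lookup: a column stores its values in a list of fixed-capacity buffers, and a global row index maps onto a buffer by integer division. A Python sketch of that arithmetic (the capacity constant is an arbitrary stand-in for `ReadOnlyDataFrameBuffer<T>.MaxCapacity`):

```python
MAX_CAPACITY = 1_000_000  # hypothetical; stands in for ReadOnlyDataFrameBuffer<T>.MaxCapacity

def get_buffer_index_containing_row_index(row_index, length):
    """Map a global row index to (buffer index, index within buffer)
    for a column stored as fixed-capacity buffers, rejecting indices
    at or beyond the column's length as the C# guard does."""
    if row_index >= length:
        raise IndexError("Index cannot be greater than the Column's Length")
    return row_index // MAX_CAPACITY, row_index % MAX_CAPACITY
```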

src/Microsoft.Data.Analysis/Strings.Designer.cs

Lines changed: 9 additions & 0 deletions
(Generated file; diff not rendered by default.)

src/Microsoft.Data.Analysis/Strings.resx

Lines changed: 3 additions & 0 deletions

@@ -183,6 +183,9 @@
   <data name="InconsistentNullBitMapAndNullCount" xml:space="preserve">
     <value>Inconsistent null bitmaps and NullCounts</value>
   </data>
+  <data name="IndexIsGreaterThanColumnLength" xml:space="preserve">
+    <value>Index cannot be greater than the Column's Length</value>
+  </data>
   <data name="InvalidColumnName" xml:space="preserve">
     <value>Column '{0}' does not exist</value>
   </data>
