
Commit 3ea882c
committed

Merge remote-tracking branch 'origin/main' into reorganize_dataframe_files

# Conflicts:
#	src/Microsoft.Data.Analysis/DataFrameColumns/ArrowStringDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/DataFrameColumns/StringDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs
#	src/Microsoft.Data.Analysis/PrimitiveColumnContainer.cs

2 parents 8e8aa01 + 796cb35

File tree

7 files changed (+40 −28 lines)

docs/code/DataViewRowCursor.md

Lines changed: 24 additions & 24 deletions
@@ -1,33 +1,33 @@
 # `DataViewRowCursor` Notes
 
-This document includes some more in depth notes on some expert topics for
+This document includes some more in-depth notes on some expert topics for
 `DataViewRow` and `DataViewRowCursor` derived classes.
 
 ## `Batch`
 
 Multiple cursors can be returned through a method like
 `IDataView.GetRowCursorSet`. Operations can happen on top of these cursors --
 most commonly, transforms creating new cursors on top of them for parallel
-evaluation of a data pipeline. But the question is, if you need to "recombine"
-them into a sequence again, how do to it? The `Batch` property is the
-mechanism by which the data from these multiple cursors returned by
+evaluation of a data pipeline. But the question is if you need to "recombine"
+them into a sequence again, how to do it? The `Batch` property is the
+mechanism by which the data from these multiple cursors, returned by
 `IDataView.GetRowCursorSet` can be reconciled into a single, cohesive,
 sequence.
 
-The question might be, why recombine. This can be done for several reasons: we
+The question might be, why recombine? This can be done for several reasons: we
 may want repeatability and determinism in such a way that requires we view the
 rows in a simple sequence, or the cursor may be stateful in some way that
 precludes partitioning it, or some other consideration. And, since a core
-`IDataView` design principle is repeatability, we now have a problem of how to
-reconcile those separate partitioning.
+`IDataView` design principle is repeatability, we now have a problem with how to
+reconcile those separate partitions.
 
 Incidentally, for those working on the ML.NET codebase, there is an internal
 method `DataViewUtils.ConsolidateGeneric` utility method to perform this
-function. It may be helpful to understand how it works intuitively, so that we
+function. It may be helpful to understand how it works intuitively so that we
 can understand `Batch`'s requirements: when we reconcile the outputs of
 multiple cursors, the consolidator will take the set of cursors. It will find
-the one with the "lowest" `Batch` ID. (This must be uniquely determined: that
-is, no two cursors should ever return the same `Batch` value.) It will iterate
+the one with the "lowest" `Batch` ID. (This must be uniquely determined:
+no two cursors should ever return the same `Batch` value.) It will iterate
 on that cursor until the `Batch` ID changes. Whereupon, the consolidator will
 find the next cursor with the next lowest batch ID (which should be greater,
 of course, than the `Batch` value we were just iterating on).
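The consolidation scheme this hunk describes — repeatedly drain whichever live cursor holds the lowest `Batch` ID — can be sketched in a few lines. This is an illustrative Python model, not ML.NET's internal `DataViewUtils.ConsolidateGeneric` (which is C#); all names here are invented:

```python
import heapq

def consolidate(cursors):
    """Recombine partitioned cursors into one sequence ordered by Batch ID.

    Each cursor yields (batch, row) pairs whose batch values are
    non-decreasing, and no two cursors ever share a batch value, so
    always draining the cursor that currently holds the lowest batch ID
    reconstructs a single cohesive sequence.
    """
    heap = []
    for tie_break, cursor in enumerate(cursors):
        it = iter(cursor)
        head = next(it, None)
        if head is not None:
            heap.append((head[0], tie_break, head[1], it))
    heapq.heapify(heap)
    while heap:
        batch, tie_break, row, it = heapq.heappop(heap)
        yield row
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head[0], tie_break, head[1], it))
```

For example, consolidating one cursor yielding batches 0 and 2 with another yielding batch 1 interleaves their rows back into batch order.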
@@ -60,7 +60,7 @@ typical and perfectly fine for `Batch` to just be `0`.
 
 ## `MoveNext`
 
-Once `MoveNext` returns `false`, naturally all subsequent calls to either of
+Once `MoveNext` returns `false`, naturally, all subsequent calls to either of
 that method should return `false`. It is important that they not throw, return
 `true`, or have any other behavior.
 
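The contract this hunk touches (once `MoveNext` returns false, it must keep returning false, never throw or resume) can be modeled with a small hypothetical wrapper; this is a Python sketch, not an ML.NET type:

```python
class LatchedCursor:
    """Illustrative cursor enforcing the MoveNext contract: after the
    first False, every later call returns False rather than throwing
    or yielding more rows."""

    def __init__(self, rows):
        self._it = iter(rows)
        self._done = False
        self.current = None

    def move_next(self):
        if self._done:
            return False  # latched: stay False forever
        try:
            self.current = next(self._it)
            return True
        except StopIteration:
            self._done = True
            return False
```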

@@ -73,7 +73,7 @@ over what is supposed to be the same data, for example, in an `IDataView` a
 cursor set will produce the same data as a serial cursor, just partitioned,
 and a shuffled cursor will produce the same data as a serial cursor or any
 other shuffled cursor, only shuffled. The ID exists for applications that need
-to reconcile which entry is actually which. Ideally this ID should be unique,
+to reconcile which entry is actually which. Ideally, this ID should be unique,
 but for practical reasons, it suffices if collisions are simply extremely
 improbable.
 

@@ -104,18 +104,18 @@ follow, in order to ensure that downstream components have a fair shake at
 producing unique IDs themselves, which I will here attempt to do:
 
 Duplicate IDs being improbable is practically accomplished with a
-hashing-derived mechanism. For this we have the `DataViewRowId` methods
+hashing-derived mechanism. For this, we have the `DataViewRowId` methods
 `Fork`, `Next`, and `Combine`. See their documentation for specifics, but they
 all have in common that they treat the `DataViewRowId` as some sort of
-intermediate hash state, then return a new hash state based on hashing of a
+intermediate hash state, then return a new hash state based on the hashing of a
 block of additional bits. (Since the additional bits hashed in `Fork` and
 `Next` are specific, that is, effectively `0`, and `1`, this can be very
 efficient.) The basic assumption underlying all of this is that collisions
 between two different hash states on the same data, or hashes on the same hash
 state on different data, are unlikely to collide.
 
 Note that this is also the reason why `DataViewRowId` was introduced;
-collisions become likely when we have the number of elements on the order of
+collisions become likely when we have the number of elements in the order of
 the square root of the hash space. The square root of `UInt64.MaxValue` is
 only several billion, a totally reasonable number of instances in a dataset,
 whereas a collision in a 128-bit space is less likely.
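The `Fork`/`Next`/`Combine` scheme in this hunk — treat the 128-bit ID as hash state, absorb a block of bits, return a new state — can be mimicked with a toy Python model. This is only a sketch: the real `DataViewRowId` has its own mixing function, and BLAKE2 here is an arbitrary stand-in:

```python
import hashlib

def _mix(state, block):
    """Absorb a 128-bit block into a 128-bit hash state, returning a
    new state. Stand-in for DataViewRowId's actual mixing."""
    digest = hashlib.blake2b(
        state.to_bytes(16, "little") + block.to_bytes(16, "little"),
        digest_size=16,
    ).digest()
    return int.from_bytes(digest, "little")

def fork(rid):
    return _mix(rid, 0)      # Fork: absorb the constant 0

def next_id(rid):
    return _mix(rid, 1)      # Next: absorb the constant 1

def combine(rid, other):
    return _mix(rid, other)  # Combine: absorb another full 128-bit ID
```

Following the acceptable-set rules given later in this document, a transform emitting two output rows per input row with ID _id_ could assign them `fork(id)` and `next_id(fork(id))`; collisions are possible in principle but astronomically improbable in a 128-bit space.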
@@ -142,24 +142,24 @@ operate on acceptable sets.
 
 4. As a generalization of the above, if for each element of an acceptable set,
    you built the set comprised of the single application of `Fork` on that ID
-   followed by the set of any number of application of `Next`, the union of
+   followed by the set of any number of applications of `Next`, the union of
    all such sets would itself be an acceptable set. (This is useful, for
    example, for operations that produce multiple items per input item. So, if
-   you produced two rows based on every single input row, if the input ID were
+   you produced two rows based on every single input row and if the input ID were
    _id_, then, the ID of the first row could be `Fork` of _id_, and the second
    row could have ID of `Fork` then `Next` of the same _id_.)
 
 5. If you have potentially multiple acceptable sets, while the union of them
    obviously might not be acceptable, if you were to form a mapping from each
    set, to a different ID of some other acceptable set (each such ID should be
    different), and then for each such set/ID pairing, create the set created
-   from `Combine` of the items of that set with that ID, and then union of
+   from `Combine` of the items of that set with that ID, and then the union of
    those sets will be acceptable. (This is useful, for example, if you had
    something like a join, or a Cartesian product transform, or something like
    that.)
 
 6. Moreover, similar to the note about the use of `Fork`, and `Next`, if
-   during the creation of one of those sets describe above, you were to form
+   during the creation of one of those sets described above, you were to form
    for each item of that set, a set resulting from multiple applications of
    `Next`, the union of all those would also be an acceptable set.
 

@@ -193,12 +193,12 @@ transformations, or other such things like this, in which case the details
 above become important.
 
 One common thought that comes up is the idea that we can have some "global
-position" instead of ID. This was actually the first idea by the original
-implementor, and if if it *were* possible it would definitely make for a
+position" instead of ID. This was actually the first idea of the original
+implementor, and if it *were* possible it would definitely make for a
 cleaner, simpler solution, and multiple people have asked the question to the
 point where it would probably be best to have a ready answer about where it
-broke down, to undersatnd how it fails. It runs afoul of the earlier desire
-with regard to data view cursor sets, that is, that `IDataView` cursors
+broke down, to understand how it fails. It runs afoul of the earlier desire
+with regard to data view cursor sets, that is, `IDataView` cursors
 should, if possible, present split cursors that can run independently on
 "batches" of the data. But, let's imagine something like the operation for
 filtering; if I have a batch `0` comprised of 64 rows, and a batch `1` with
@@ -209,4 +209,4 @@ why we wanted to have cursor sets in the first place. The same is true also
 for one-to-many `IDataView` implementations (for example, joins, or something
 like that), where even a strictly increasing (but not necessarily contiguous)
 value may not be possible, since you cannot even bound the number. So,
-regrettably, that simpler solution would not work.
\ No newline at end of file
+regrettably, that simpler solution would not work.
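The breakdown these hunks describe can be made concrete with a toy Python sketch of assigning "global positions" after a filter: each batch's starting position depends on how many rows of every earlier batch survived, so the batches cannot be processed independently (names here are illustrative only):

```python
def global_positions(batches, keep):
    """Assign a global position to each row surviving a filter.

    The running offset is exactly where parallelism fails: batch N's
    first position is unknowable until all earlier batches have been
    fully filtered, defeating the point of independent cursor sets.
    """
    out = []
    offset = 0
    for batch in batches:
        survivors = [row for row in batch if keep(row)]
        out.append(list(range(offset, offset + len(survivors))))
        offset += len(survivors)
    return out
```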

src/Microsoft.Data.Analysis/DataFrameColumns/ArrowStringDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -226,7 +226,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex, out int indexInBuffe
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
 
             // Since the strings here could be of variable length, scan linearly

src/Microsoft.Data.Analysis/DataFrameColumns/StringDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -82,7 +82,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
             return (int)(rowIndex / MaxCapacity);
         }

src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs

Lines changed: 1 addition & 1 deletion

@@ -84,7 +84,7 @@ private int GetBufferIndexContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
             {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
 
             return (int)(rowIndex / MaxCapacity);

src/Microsoft.Data.Analysis/PrimitiveColumnContainer.cs

Lines changed: 1 addition & 1 deletion

@@ -313,7 +313,7 @@ public int GetIndexOfBufferContainingRowIndex(long rowIndex)
         {
             if (rowIndex >= Length)
            {
-                throw new ArgumentOutOfRangeException(Strings.RowIndexOutOfRange, nameof(rowIndex));
+                throw new ArgumentOutOfRangeException(Strings.IndexIsGreaterThanColumnLength, nameof(rowIndex));
             }
             return (int)(rowIndex / ReadOnlyDataFrameBuffer<T>.MaxCapacity);
         }
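All four renamed call sites guard the same chunked-buffer lookup: a column stores its values in a list of fixed-capacity buffers, and a global row index maps onto a buffer by integer division. A Python sketch of that arithmetic (the capacity constant is an arbitrary stand-in for `ReadOnlyDataFrameBuffer<T>.MaxCapacity`):

```python
MAX_CAPACITY = 1_000_000  # hypothetical; stands in for ReadOnlyDataFrameBuffer<T>.MaxCapacity

def get_buffer_index_containing_row_index(row_index, length):
    """Map a global row index to (buffer index, index within buffer)
    for a column stored as fixed-capacity buffers, rejecting indices
    at or beyond the column's length as the C# guard does."""
    if row_index >= length:
        raise IndexError("Index cannot be greater than the Column's Length")
    return row_index // MAX_CAPACITY, row_index % MAX_CAPACITY
```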

src/Microsoft.Data.Analysis/Strings.Designer.cs

Lines changed: 9 additions & 0 deletions
(Generated file; diff not rendered by default.)

src/Microsoft.Data.Analysis/Strings.resx

Lines changed: 3 additions & 0 deletions

@@ -183,6 +183,9 @@
   <data name="InconsistentNullBitMapAndNullCount" xml:space="preserve">
     <value>Inconsistent null bitmaps and NullCounts</value>
   </data>
+  <data name="IndexIsGreaterThanColumnLength" xml:space="preserve">
+    <value>Index cannot be greater than the Column's Length</value>
+  </data>
   <data name="InvalidColumnName" xml:space="preserve">
     <value>Column '{0}' does not exist</value>
   </data>
