-
Notifications
You must be signed in to change notification settings - Fork 344
Shorten names. Implement string columns and computations #2660
Conversation
// TODO: Using indexing is VERY inefficient here. Each indexer call will find the "right" buffer and then return the value | ||
if (Length != column.Length) | ||
{ | ||
throw new ArgumentException($"Column lengths are mismatched", nameof(column)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be adding the exception messages (and any other localizable strings) to a .resx file from the start? It will be easier to add them now, then go through all the strings and try to do it after the fact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good idea. I'll do that later this week, once I'm done with app building
{ | ||
IList<string> buffer = stringBuffers[i]; | ||
int bufferLen = buffer.Count; | ||
for (int j = 0; j < bufferLen; j++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the Equals
method above use a single for
loop over Length
and the index, but here we are doing 2 for
loops over the stringBuffers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote that initially thinking the 2 for loop version may be more optimized since I'm avoiding indexing. I changed it now to use the single for loop though. When I fix #2659 though, the single loop version should be super efficient.
} | ||
|
||
[Fact] | ||
public void TestIndexer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to write some more tests. I don't think these tests cover the amount of code that is in the library.
internal static class PrimitiveColumnComputation<T> | ||
where T : struct | ||
{ | ||
public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will new up the Computation object every time it is called. That seems wasteful. Instead, we should cache it in a field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? Because this is static, shouldn't this new up a computation object once per T
and reuse that for next time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this new up a computation object once per T and reuse that for next time?
I would think that is the behavior we want, but the code you've written doesn't do that.
public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>();
That is short-hand for:
public static IPrimitiveColumnComputation<T> Instance
{
get { return PrimitiveColumnComputation.GetComputation<T>(); }
}
Which means every time the property is invoked/called it will call GetComputation<T>()
, which in turn returns a new object every time it is invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=>
means execute the following code each time. =
means assignment. You need the last one to initialize the member only once when the static constructor is called.
public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>(); | |
public static readonly IPrimitiveColumnComputation<T> Instance = PrimitiveColumnComputation.GetComputation<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the suggested change is turning Instance
from a property into a field. And as written above, allows the field to be publicly settable.
In general, we shouldn't be exposing fields publicly unless there is a strong reason for it. We should keep it a read-only property. But it should reuse the same object each time it is invoked, so it doesn't create unnecessary allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, fixed. Thanks!
Should this rule be applied to internal classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that it was an internal class. That does change the "property or field" decision.
It is probably personal preference, but I generally tend to keep all fields (static or instance) private by default and exposed through properties. And only expose them as fields if there are specific reasons for it.
namespace Microsoft.Data | ||
{ | ||
/// <summary> | ||
/// A column to hold primitive values such as int, float etc. Other value types are not really supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Other value types are not really supported
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to say that only primitive types are really supported. A user defined struct for example will throw if some binary op is called. I removed the last sentence and changed it to now say "A column to hold primitive types such as int, float etc"
} | ||
} | ||
|
||
// This method involves boxing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to have T
based indexers as well? That way we can avoid boxing in certain situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First thing I tried. I don't think generic indexers are supported in C# yet. There's an issue for it here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimitiveColumn<T>
is already generic. So you can return T
here. Just like we do in List<T>
.
The csharplang feature you referred to is for when you want to have a generic indexer that isn't in a generic class (or the indexer returns a different type than what the class is generic for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but I'm overriding a virtual method from BaseColumn
though? Unless I'm not seeing something, I would have to make BaseColumn
's indexer generic and then deal with type conversions in the derived class? I thought boxing to return a single value seemed acceptable in this instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I am thinking is adding a new
indexer on the derived class that is typed as T
.
At first I was thinking:
public class BaseColumn
{
public virtual object this[long rowIndex]
{
get => throw new NotImplementedException();
set => throw new NotImplementedException();
}
}
public class PrimitiveColumn<T> : BaseColumn
where T : struct
{
public override object this[long rowIndex]
{
get => base[rowIndex];
set => base[rowIndex] = value;
}
public new T this[long rowIndex]
{
get => default;
set => base[rowIndex] = value;
}
}
But that doesn't compile:
CS0111 Type 'PrimitiveColumn' already defines a member called 'this' with the same parameter types
So instead, we could do:
public class BaseColumn
{
public object this[long rowIndex]
{
get => GetValue(rowIndex);
set => SetValue(rowIndex, value);
}
protected virtual object GetValue(long rowIndex) => throw new NotImplementedException();
protected virtual void SetValue(long rowIndex, object value) => throw new NotImplementedException();
}
public class PrimitiveColumn<T> : BaseColumn
where T : struct
{
protected override object GetValue(long rowIndex)
{
return base.GetValue(rowIndex);
}
protected override void SetValue(long rowIndex, object value)
{
base.SetValue(rowIndex, value);
}
public new T this[long rowIndex]
{
get => default;
set => base[rowIndex] = value;
}
}
This allows us to not box/unbox when the column is casted to a PrimitiveColumn<T>
.
You can just switch on If not, you should be able to use pattern matching by simply saying Refers to: src/Microsoft.Data/PrimitiveColumn.BinaryOperations.cs:21 in 63bd555. [](commit_id = 63bd555, deletion_comment = False) |
Another possible pattern that could be used here is the same approach that ML.NET uses for conversions. Where you have a map of Refers to: src/Microsoft.Data/PrimitiveColumn.BinaryOperations.cs:985 in 63bd555. [](commit_id = 63bd555, deletion_comment = False) |
} | ||
|
||
private DataFrameTable(IList<BaseDataFrameColumn> columns) | ||
public DataFrameTable(IList<BaseColumn> columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this now public?
Nevermind, the whole class is internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What criteria do we use to choose between public and internal visibility on members, on an internal class? Assuming the member is not impelmenting an interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question: In this case it doesn't matter if this constructor public or internal. I just chose public, there was no particular reason to do so.
Is the |
{ | ||
public partial class StringColumn : BaseColumn | ||
{ | ||
public override BaseColumn Add(BaseColumn column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want strongly typed versions of these methods? That way if I have a StringColumn
, and I call one of these methods, I should get back a StringColumn
, not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, so at the moment, I've taken the less strongly typed approach and since one of the main use cases is notebooks, it pairs well with returning a copy everytime(since I don't know the return type until runtime). In the case of a StringColumn
, Add
will always returns a StringColumn
, but for intColumn + 5.0
, the result will be a PrimitiveColumn<double>
for example. To enable the strong typed versions, I would have to define APIs for all possible combinations of all columns. How about I consider that in a separate issue? Since these methods are general, it's easy enough to build the strong APIs on top of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a separate issue for this sounds good to me.
As for every combination of type, I don't think I'd go that far. But for the case where I have a strongly-typed string
or Int32
column, if I add a string
or Int32
to that column, I know I get back a string
or Int32
.
The analogy here is this code:
string a = "foo";
string b = "bar";
??? c = a + b;
It's pretty obvious that c
should be a string
, not object
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed here
where T : struct | ||
{ | ||
public static IPrimitiveDataFrameColumnArithmetic<T> Instance => PrimitiveDataFrameColumnArithmetic.GetArithmetic<T>(); | ||
public static IPrimitiveColumnArithmetic<T> Instance = PrimitiveColumnArithmetic.GetArithmetic<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come this one is a field, but PrimitiveColumnComputations
is a property? It's nice to be consistent in these kinds of things.
FYI, this build is failing. GitHub just isn't showing the results properly from the CI legs: https://dev.azure.com/dnceng/public/_build/results?buildId=183617 |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
|
||
public object this[long startIndex, int length] | ||
{ | ||
get => GetValue(startIndex, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note: On a notebook, most calls such as the following will still box at the moment:
var intColumn = df.Column('intColumn'); //intColumn is a PrimitiveColumn<int>, but returned as a BaseColumn var thisWillBox = intColumn[2];
|
||
internal override BaseColumn GetAscendingSortIndices() | ||
{ | ||
// Is Comparer<T>.Default guaranteed to sort in ascending order? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is guaranteed for ALL IComparer
implementations.
Returns: A signed integer that indicates the relative values of x and y, as shown in the following table.
Value | Meaning |
---|---|
Less than zero | x is less than y. |
Zero | x equals y. |
Greater than zero | x is greater than y. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this comment (in the next PR if you want)
In reply to: 284780397 [](ancestors = 284780397)
|
||
public new T this[long startIndex, int length] | ||
{ | ||
get => (T)GetValue(startIndex, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will still box. Need to replace this with the code inside GetValue itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - that is the point of these "new" strongly typed members. No boxing should be necessary.
public DataFrame Sort(string columnName, bool ascending = true) | ||
{ | ||
BaseColumn column = this[columnName]; | ||
BaseColumn sortIndices = column.GetAscendingSortIndices(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should only call public BaseColumn methods so this works with external derived classes. As it exists today - this will throw an exception because there is no way for an external column to override an internal method.
</Compile> | ||
</ItemGroup> | ||
|
||
<ItemGroup> | ||
<EmbeddedResource Update="Strings.resx"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the casing needs to match the actual file casing - is it possible to rename the file?
See https://stackoverflow.com/questions/1793735/change-case-of-a-file-on-windows for how to do this on Windows.
} | ||
else | ||
{ | ||
throw new ArgumentException(Strings.MismatchedValueType + " string", nameof(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When throwing exceptions with a message that comes from a resource, you shouldn't concat strings. Instead use formatted strings - {0}
. The reason is because when translating to other languages, the hard-coded type name may need to move in the sentence to make sense in the language.
|
||
public new string this[long rowIndex] | ||
{ | ||
get => (string)GetValue(rowIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to implement this without the cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good. Keep the good work coming!
feel free to merge now, and address my additional comments in a separate PR.
* Shorten names, implement sort, binary op extensions and computations
More work towards #26854 to add a DataFrame type to .NET Core
This patch:
Improves binary operations on primitive columns
Creates a StringColumn to support columns of strings. It also implements some meaningful binary ops on string columns
Implements basic computations on primitive columns
I added another patch that implements "DataFrame.Sort" too. The changes are mostly new and incremental. Apologies for putting it in this patch and making the review harder. I'm on app building for the rest of the week, so I wanted to get this out soon