Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Shorten names. Implement string columns and computations #2660

Merged
merged 13 commits into from
May 21, 2019

Conversation

pgovind
Copy link

@pgovind pgovind commented Apr 23, 2019

More work towards #26854 to add a DataFrame type to .NET Core

This patch:
Improves binary operations on primitive columns
Creates a StringColumn to support columns of strings. It also implements some meaningful binary ops on string columns
Implements basic computations on primitive columns

I added another patch that implements "DataFrame.Sort" too. The changes are mostly new and incremental. Apologies for putting it in this patch and making the review harder. I'm on app building for the rest of the week, so I wanted to get this out soon

@eerhardt
Copy link
Member

using System;

need a copyright header on every code file.


Refers to: src/Microsoft.Data/StringColumn.cs:1 in 63bd555. [](commit_id = 63bd555, deletion_comment = False)

// TODO: Using indexing is VERY inefficient here. Each indexer call will find the "right" buffer and then return the value
if (Length != column.Length)
{
throw new ArgumentException($"Column lengths are mismatched", nameof(column));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be adding the exception messages (and any other localizable strings) to a .resx file from the start? It will be easier to add them now, then go through all the strings and try to do it after the fact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good idea. I'll do that later this week, once I'm done with app building

{
IList<string> buffer = stringBuffers[i];
int bufferLen = buffer.Count;
for (int j = 0; j < bufferLen; j++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the Equals method above use a single for loop over Length and the index, but here we are doing 2 for loops over the stringBuffers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote that initially thinking the 2 for loop version may be more optimized since I'm avoiding indexing. I changed it now to use the single for loop though. When I fix #2659 though, the single loop version should be super efficient.

}

[Fact]
public void TestIndexer()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to write some more tests. I don't think these tests cover the amount of code that is in the library.

internal static class PrimitiveColumnComputation<T>
where T : struct
{
public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will new up the Computation object every time it is called. That seems wasteful. Instead, we should cache it in a field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? Because this is static, shouldn't this new up a computation object once per T and reuse that for next time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this new up a computation object once per T and reuse that for next time?

I would think that is the behavior we want, but the code you've written doesn't do that.

public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>();

That is short-hand for:

public static IPrimitiveColumnComputation<T> Instance
{
    get { return PrimitiveColumnComputation.GetComputation<T>(); }
}

Which means every time the property is invoked/called it will call GetComputation<T>(), which in turn returns a new object every time it is invoked.

Copy link
Contributor

@YohDeadfall YohDeadfall May 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> means execute the following code each time. = means assignment. You need the last one to initialize the member only once when the static constructor is called.

Suggested change
public static IPrimitiveColumnComputation<T> Instance => PrimitiveColumnComputation.GetComputation<T>();
public static readonly IPrimitiveColumnComputation<T> Instance = PrimitiveColumnComputation.GetComputation<T>();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the suggested change is turning Instance from a property into a field. And as written above, allows the field to be publicly settable.

In general, we shouldn't be exposing fields publicly unless there is a strong reason for it. We should keep it a read-only property. But it should reuse the same object each time it is invoked, so it doesn't create unnecessary allocations.

Copy link
Contributor

@YohDeadfall YohDeadfall May 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, fixed. Thanks!

Should this rule be applied to internal classes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that it was an internal class. That does change the "property or field" decision.

It is probably personal preference, but I generally tend to keep all fields (static or instance) private by default and exposed through properties. And only expose them as fields if there are specific reasons for it.

namespace Microsoft.Data
{
/// <summary>
/// A column to hold primitive values such as int, float etc. Other value types are not really supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does Other value types are not really supported mean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to say that only primitive types are really supported. A user defined struct for example will throw if some binary op is called. I removed the last sentence and changed it to now say "A column to hold primitive types such as int, float etc"

}
}

// This method involves boxing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have T based indexers as well? That way we can avoid boxing in certain situations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thing I tried. I don't think generic indexers are supported in C# yet. There's an issue for it here:

dotnet/csharplang#2054

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveColumn<T> is already generic. So you can return T here. Just like we do in List<T>.

https://github.com/dotnet/corefx/blob/a26fd9bb1b17510d51a9b5a2ce7aadc1a5b631ee/src/Common/src/CoreLib/System/Collections/Generic/List.cs#L145-L167

The csharplang feature you referred to is for when you want to have a generic indexer that isn't in a generic class (or the indexer returns a different type than what the class is generic for).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I'm overriding a virtual method from BaseColumn though? Unless I'm not seeing something, I would have to make BaseColumn's indexer generic and then deal with type conversions in the derived class? I thought boxing to return a single value seemed acceptable in this instance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am thinking is adding a new indexer on the derived class that is typed as T.

At first I was thinking:

    public class BaseColumn
    {
        public virtual object this[long rowIndex]
        {
            get => throw new NotImplementedException();
            set => throw new NotImplementedException();
        }
    }

    public class PrimitiveColumn<T> : BaseColumn
       where T : struct
    {
        public override object this[long rowIndex]
        {
            get => base[rowIndex];
            set => base[rowIndex] = value;
        }

        public new T this[long rowIndex]
        {
            get => default;
            set => base[rowIndex] = value;
        }
    }

But that doesn't compile:

CS0111 Type 'PrimitiveColumn' already defines a member called 'this' with the same parameter types

So instead, we could do:

    public class BaseColumn
    {
        public object this[long rowIndex]
        {
            get => GetValue(rowIndex);
            set => SetValue(rowIndex, value);
        }

        protected virtual object GetValue(long rowIndex) => throw new NotImplementedException();
        protected virtual void SetValue(long rowIndex, object value) => throw new NotImplementedException();
    }

    public class PrimitiveColumn<T> : BaseColumn
       where T : struct
    {
        protected override object GetValue(long rowIndex)
        {
            return base.GetValue(rowIndex);
        }

        protected override void SetValue(long rowIndex, object value)
        {
            base.SetValue(rowIndex, value);
        }

        public new T this[long rowIndex]
        {
            get => default;
            set => base[rowIndex] = value;
        }
    }

This allows us to not box/unbox when the column is casted to a PrimitiveColumn<T>.

@eerhardt
Copy link
Member

<#@ template debug="false" hostspecific="false" language="C#" #>

What's the difference between this file and PrimitiveColumnComputations.tt?


Refers to: src/Microsoft.Data/PrimitiveColumn.Computations.tt:1 in 63bd555. [](commit_id = 63bd555, deletion_comment = False)

@eerhardt
Copy link
Member

    #region Binary Operations

Do we need this region? Everything in this file is a binary operation.


Refers to: src/Microsoft.Data/PrimitiveColumn.BinaryOperations.cs:16 in 63bd555. [](commit_id = 63bd555, deletion_comment = False)

@eerhardt
Copy link
Member

            case Type boolType when boolType == typeof(bool):

You can just switch on column and use case PrimitiveColumn<bool> boolColumn, can't you?

If not, you should be able to use pattern matching by simply saying if (column is PrimitiveColumn<bool> boolColumn)


Refers to: src/Microsoft.Data/PrimitiveColumn.BinaryOperations.cs:21 in 63bd555. [](commit_id = 63bd555, deletion_comment = False)

@eerhardt
Copy link
Member

        switch (typeof(T))

Another possible pattern that could be used here is the same approach that ML.NET uses for conversions.

https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/Data/Conversion.cs#L65-L67

Where you have a map of (Type, Type) tuples that map to Delegates that perform the function. So here we would have a delegate that would take in 2 ints and divide them. Then another that would take an int and a short and divide them. Two doubles. A double and a float, etc. etc.


Refers to: src/Microsoft.Data/PrimitiveColumn.BinaryOperations.cs:985 in 63bd555. [](commit_id = 63bd555, deletion_comment = False)

}

private DataFrameTable(IList<BaseDataFrameColumn> columns)
public DataFrameTable(IList<BaseColumn> columns)
Copy link
Member

@eerhardt eerhardt Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this now public?

Nevermind, the whole class is internal

Copy link
Member

@danmoseley danmoseley Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What criteria do we use to choose between public and internal visibility on members, on an internal class? Assuming the member is not impelmenting an interface.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: In this case it doesn't matter if this constructor public or internal. I just chose public, there was no particular reason to do so.

@eerhardt
Copy link
Member

Is the ColumnArithmeticTemplate.cs file necessary? It is empty.

{
public partial class StringColumn : BaseColumn
{
public override BaseColumn Add(BaseColumn column)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want strongly typed versions of these methods? That way if I have a StringColumn, and I call one of these methods, I should get back a StringColumn, not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, so at the moment, I've taken the less strongly typed approach and since one of the main use cases is notebooks, it pairs well with returning a copy everytime(since I don't know the return type until runtime). In the case of a StringColumn, Add will always returns a StringColumn, but for intColumn + 5.0, the result will be a PrimitiveColumn<double> for example. To enable the strong typed versions, I would have to define APIs for all possible combinations of all columns. How about I consider that in a separate issue? Since these methods are general, it's easy enough to build the strong APIs on top of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a separate issue for this sounds good to me.

As for every combination of type, I don't think I'd go that far. But for the case where I have a strongly-typed string or Int32 column, if I add a string or Int32 to that column, I know I get back a string or Int32.

The analogy here is this code:

string a = "foo";
string b = "bar";

??? c = a + b;

It's pretty obvious that c should be a string, not object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed here

where T : struct
{
public static IPrimitiveDataFrameColumnArithmetic<T> Instance => PrimitiveDataFrameColumnArithmetic.GetArithmetic<T>();
public static IPrimitiveColumnArithmetic<T> Instance = PrimitiveColumnArithmetic.GetArithmetic<T>();
Copy link
Member

@eerhardt eerhardt May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this one is a field, but PrimitiveColumnComputations is a property? It's nice to be consistent in these kinds of things.

@ahsonkhan
Copy link
Member

FYI, this build is failing. GitHub just isn't showing the results properly from the CI legs: https://dev.azure.com/dnceng/public/_build/results?buildId=183617

@ahsonkhan
Copy link
Member

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).


public object this[long startIndex, int length]
{
get => GetValue(startIndex, length);
Copy link
Author

@pgovind pgovind May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note: On a notebook, most calls such as the following will still box at the moment:
var intColumn = df.Column('intColumn'); //intColumn is a PrimitiveColumn<int>, but returned as a BaseColumn var thisWillBox = intColumn[2];


internal override BaseColumn GetAscendingSortIndices()
{
// Is Comparer<T>.Default guaranteed to sort in ascending order?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is guaranteed for ALL IComparer implementations.

See https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.icomparer-1.compare?view=netframework-4.8

Returns: A signed integer that indicates the relative values of x and y, as shown in the following table.

Value Meaning
Less than zero x is less than y.
Zero x equals y.
Greater than zero x is greater than y.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this comment (in the next PR if you want)


In reply to: 284780397 [](ancestors = 284780397)


public new T this[long startIndex, int length]
{
get => (T)GetValue(startIndex, length);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still box. Need to replace this with the code inside GetValue itself

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - that is the point of these "new" strongly typed members. No boxing should be necessary.

Prashanth Govindarajan added 2 commits May 16, 2019 11:02
public DataFrame Sort(string columnName, bool ascending = true)
{
BaseColumn column = this[columnName];
BaseColumn sortIndices = column.GetAscendingSortIndices();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only call public BaseColumn methods so this works with external derived classes. As it exists today - this will throw an exception because there is no way for an external column to override an internal method.

</Compile>
</ItemGroup>

<ItemGroup>
<EmbeddedResource Update="Strings.resx">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the casing needs to match the actual file casing - is it possible to rename the file?

See https://stackoverflow.com/questions/1793735/change-case-of-a-file-on-windows for how to do this on Windows.

}
else
{
throw new ArgumentException(Strings.MismatchedValueType + " string", nameof(value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When throwing exceptions with a message that comes from a resource, you shouldn't concat strings. Instead use formatted strings - {0}. The reason is because when translating to other languages, the hard-coded type name may need to move in the sentence to make sense in the language.


public new string this[long rowIndex]
{
get => (string)GetValue(rowIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to implement this without the cast?

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. Keep the good work coming!

feel free to merge now, and address my additional comments in a separate PR.

@pgovind pgovind merged commit a4be93c into dotnet:master May 21, 2019
pgovind pushed a commit to pgovind/corefxlab that referenced this pull request Jun 5, 2019
* Shorten names, implement sort, binary op extensions and computations
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants