Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame: Add DateTime column type #6302

Merged
merged 9 commits into from
Sep 20, 2022
Merged

DataFrame: Add DateTime column type #6302

merged 9 commits into from
Sep 20, 2022

Conversation

beccamc
Copy link
Contributor

@beccamc beccamc commented Aug 24, 2022

Fixes #6213, #5698, #5676

Add the DateTime type as a PrimitiveDataFrameColumn. This will better allow conversion between IDataView and the DataFrame.

Some DateTime code exists already, like the DateTime computations. However, the partial implementation left bugs, particularly when trying to convert between IDataView and DataFrame.

I tried to follow the existing patterns but if this shouldn't be bucketed in with the PrimitiveDataFrameColumns I can make changes.

@codecov
Copy link

codecov bot commented Aug 24, 2022

Codecov Report

Merging #6302 (03b7b8a) into main (de9afb5) will increase coverage by 0.33%.
The diff coverage is 35.76%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6302      +/-   ##
==========================================
+ Coverage   68.39%   68.72%   +0.33%     
==========================================
  Files        1141     1171      +30     
  Lines      244820   248493    +3673     
  Branches    25405    26032     +627     
==========================================
+ Hits       167444   170777    +3333     
- Misses      70722    70988     +266     
- Partials     6654     6728      +74     
Flag Coverage Δ
Debug 68.72% <35.76%> (+0.33%) ⬆️
production 63.16% <22.58%> (+0.31%) ⬆️
test 89.11% <100.00%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...imitiveDataFrameColumn.ReversedBinaryOperations.cs 50.00% <ø> (ø)
...a.Analysis/PrimitiveDataFrameColumnComputations.cs 45.66% <0.00%> (-0.03%) ⬇️
...ysis.Tests/DataFrameColumn.BinaryOperationTests.cs 100.00% <ø> (ø)
...FrameColumn.BinaryOperationAPIs.ExplodedColumns.cs 7.59% <16.66%> (+0.02%) ⬆️
...eColumn.BinaryOperationImplementations.Exploded.cs 47.65% <16.66%> (-1.38%) ⬇️
...lysis/PrimitiveDataFrameColumn.BinaryOperations.cs 41.15% <18.00%> (+0.85%) ⬆️
...ata.Analysis/PrimitiveDataFrameColumnArithmetic.cs 48.98% <24.00%> (-0.51%) ⬇️
...Microsoft.Data.Analysis/DateTimeDataFrameColumn.cs 40.00% <40.00%> (ø)
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 79.34% <87.50%> (+0.19%) ⬆️
src/Microsoft.Data.Analysis/IDataView.Extension.cs 95.91% <100.00%> (+0.17%) ⬆️
... and 83 more

@beccamc
Copy link
Contributor Author

beccamc commented Aug 31, 2022

All the .tt files are re-generated!

Copy link
Contributor

@LittleLittleCloud LittleLittleCloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -65,6 +65,8 @@ namespace Microsoft.Data.Analysis
}

<# foreach (TypeConfiguration type in typeConfiguration) { #>
<# if (type.TypeName == "DateTime") {#>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this current plan that no computation function will be supported for DataTime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an existing file called DateTimeComputations that have all the date time computations. They aren't similar to the other primitive types so I just left it as is. I've added a comment here.

Copy link
Contributor

@dakersnar dakersnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the .tt files in this codebase? I'm not familiar with that extension.

@beccamc
Copy link
Contributor Author

beccamc commented Sep 2, 2022

The .tt files are text generator files. Since there is so much repeated code for the data frames they generate a bunch of classes and methods. For example, the exploded API file.

@beccamc
Copy link
Contributor Author

beccamc commented Sep 6, 2022

@tarekgh Could I get a review/approval on this PR please?

DataFrameBuffer<DateTime> newBuffer = new DataFrameBuffer<DateTime>();
ret.Buffers.Add(newBuffer);
newBuffer.EnsureCapacity(buffer.Length);
ReadOnlySpan<T> span = buffer.ReadOnlySpan;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T

does T will be always DateTime type ? if so, can you please assert for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, this was leftover from an incorrect comparison of DateTime and Bool column. This isn't actually necessary for the DateTime column and I'm removing it.

@@ -1445,6 +1493,7 @@ internal DataFrameColumn AndImplementation<U>(PrimitiveDataFrameColumn<U> column
case Type uintType when uintType == typeof(uint):
case Type ulongType when ulongType == typeof(ulong):
case Type ushortType when ushortType == typeof(ushort):
case Type DateTimeType when DateTimeType == typeof(DateTime):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is really this code for? I mean what happen if we delete this code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment apply to other similar code in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this is auto-generated from the accompanying .tt file. All of these single line changes are generated off the additional DateTime type added here. I went with a "don't rock the boat" approach for this and just continued with the existing auto generated patterns.

Nothing would happen if we deleted this code, but deleting it requires going through the .tt file and finding all the correct places these pointless cases are generated, removing it, and ensuring it has no ill effects. I didn't consider it high priority but could be swayed if you believe it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess changing the order of the types would change the order of these cases and potentially break the experience. I guess that would a decent reason to remove these in favor of the default case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not getting this. what order change can affect the behavior? it is a switch statement and testing the type against every case. you cannot have 2 cases equal same type. this whole switch statement can be written as simple if statement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I am not asking to change it as part of this PR. but will be good if you can track cleaning up tt template files with some issue.

var span = mutableBuffer.Span;
for (int i = 0; i < span.Length; i++)
{
ret[i] = (span[i] == scalar);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

span[i] == scalar

How much we care comparing 2 DateTime objects which containing same number of ticks but created with different time zone? the remark https://docs.microsoft.com/en-us/dotnet/api/system.datetime.op_equality?view=net-6.0#remarks explain more about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment apply for other equality methods in this file too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question here too. I don't think this is a concern until we hear customer feedback that they need better equality support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we understand the user scenarios? How will the data get supplied? Changing that in the future can be a breaking change. Will be good if we understand the expectation and code for that.

@tarekgh
Copy link
Member

tarekgh commented Sep 6, 2022

I left a few comments here. Does using DateTime is expected to be a common scenario in ML.NET? DateTime in general can cause some issues if users using it combined with a time zone. Do we have any idea how users are going to use it?

In the runtime, to avoid such problem we introduced DateTimeOffset. I was going to suggest to use that instead of DateTime but the only concern with that is DateTimeOffset size is bigger than DateTime which can be issue with big data sets.

@beccamc
Copy link
Contributor Author

beccamc commented Sep 7, 2022

@tarekgh Great question! The main goal for adding DateTime as a type is for parity with the IDataView (MB issue/ ML Issue). Right now this gap makes it hard to use the DataFrame as a way to process large amounts of data and then pass to ML.NET.

Time zones are always a pain for dates, but as we're just following the existing support/expectations of the DateTime type I don't think it'll be a problem. Since we are just trying to light up Data Frame <-> IDataView parity at this point good time zone support isn't necessary. If customers end up needing a lot of date time manipulation we could invest in the DateTimeOffset at that point.

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a little comments. I'll leave it to you to decide if you are comfortable enough with the change to merge. Thanks for getting this done.

@beccamc
Copy link
Contributor Author

beccamc commented Sep 8, 2022

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 6302 in repo dotnet/machinelearning

@LittleLittleCloud
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@ericstj
Copy link
Member

ericstj commented Sep 14, 2022

So this is failing in some image-based tests on Ubuntu https://dev.azure.com/dnceng-public/public/_build/results?buildId=15034&view=ms.vss-test-web.build-test-results-tab&runId=290276&resultId=101333&paneView=dotnet-dnceng.dnceng-anon-build-release-tasks.helix-anon-test-information-tab

 System.ArgumentException : Parameter is not valid.
   at System.Drawing.SafeNativeMethods.Gdip.CheckStatus(Int32 status)
   at System.Drawing.Bitmap.UnlockBits(BitmapData bitmapdata)
   at Microsoft.ML.Transforms.Image.VectorToImageConvertingTransformer.Mapper.<>c__DisplayClass5_0`1.<GetterFromType>b__0(Bitmap& dst) in /__w/1/s/src/Microsoft.ML.ImageAnalytics/VectorToImageTransform.cs:line 431
   at Microsoft.ML.Tests.ImageTests.TestBackAndForthConversionWithoutAlphaNoInterleaveNoOffset() in /__w/1/s/test/Microsoft.ML.Tests/ImagesTests.cs:line 804

We know that System.Drawing doesn't work well on non-Windows and it's no longer supported. @tarekgh would it make sense to just disable these tests since they're causing noise and re-enable once we have the System.Drawing replacement?

@tarekgh
Copy link
Member

tarekgh commented Sep 14, 2022

We know that System.Drawing doesn't work well on non-Windows and it's no longer supported. @tarekgh would it make sense to just disable these tests since they're causing noise and re-enable once we have the System.Drawing replacement?

I am wondering if such tests are flakey as we didn't disable them till now. We can temporary disable them if it is consistently failing the CI.

@LittleLittleCloud LittleLittleCloud merged commit 50e5068 into dotnet:main Sep 20, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot run Transform on DataFrame with DateTime type
5 participants