-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrame: Add DateTime column type #6302
Conversation
src/Microsoft.Data.Analysis/PrimitiveDataFrameColumn.BinaryOperations.cs
Outdated
Show resolved
Hide resolved
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #6302 +/- ##
==========================================
+ Coverage 68.39% 68.72% +0.33%
==========================================
Files 1141 1171 +30
Lines 244820 248493 +3673
Branches 25405 26032 +627
==========================================
+ Hits 167444 170777 +3333
- Misses 70722 70988 +266
- Partials 6654 6728 +74
Flags with carried forward coverage won't be shown. Click here to find out more.
|
src/Microsoft.Data.Analysis/PrimitiveDataFrameColumn.BinaryOperations.tt
Show resolved
Hide resolved
src/Microsoft.Data.Analysis/PrimitiveDataFrameColumnArithmetic.tt
Outdated
Show resolved
Hide resolved
All the .tt files are re-generated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -65,6 +65,8 @@ namespace Microsoft.Data.Analysis | |||
} | |||
|
|||
<# foreach (TypeConfiguration type in typeConfiguration) { #> | |||
<# if (type.TypeName == "DateTime") {#> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this current plan that no computation function will be supported for DataTime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an existing file called DateTimeComputations that have all the date time computations. They aren't similar to the other primitive types so I just left it as is. I've added a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of the .tt files in this codebase? I'm not familiar with that extension.
The .tt files are text generator files. Since there is so much repeated code for the data frames they generate a bunch of classes and methods. For example, the exploded API file. |
@tarekgh Could I get a review/approval on this PR please? |
DataFrameBuffer<DateTime> newBuffer = new DataFrameBuffer<DateTime>(); | ||
ret.Buffers.Add(newBuffer); | ||
newBuffer.EnsureCapacity(buffer.Length); | ||
ReadOnlySpan<T> span = buffer.ReadOnlySpan; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this help better here https://docs.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.unsafe.as?view=net-6.0#system-runtime-compilerservices-unsafe-as-2(-0?
In general will be good if we can avoid the boxing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, this was leftover from an incorrect comparison of DateTime and Bool column. This isn't actually necessary for the DateTime column and I'm removing it.
@@ -1445,6 +1493,7 @@ internal DataFrameColumn AndImplementation<U>(PrimitiveDataFrameColumn<U> column | |||
case Type uintType when uintType == typeof(uint): | |||
case Type ulongType when ulongType == typeof(ulong): | |||
case Type ushortType when ushortType == typeof(ushort): | |||
case Type DateTimeType when DateTimeType == typeof(DateTime): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is really this code for? I mean what happen if we delete this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment apply to other similar code in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this is auto-generated from the accompanying .tt file. All of these single line changes are generated off the additional DateTime type added here. I went with a "don't rock the boat" approach for this and just continued with the existing auto generated patterns.
Nothing would happen if we deleted this code, but deleting it requires going through the .tt file and finding all the correct places these pointless cases are generated, removing it, and ensuring it has no ill effects. I didn't consider it high priority but could be swayed if you believe it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess changing the order of the types would change the order of these cases and potentially break the experience. I guess that would a decent reason to remove these in favor of the default case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not getting this. what order change can affect the behavior? it is a switch statement and testing the type against every case. you cannot have 2 cases equal same type. this whole switch statement can be written as simple if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I am not asking to change it as part of this PR. but will be good if you can track cleaning up tt template files with some issue.
var span = mutableBuffer.Span; | ||
for (int i = 0; i < span.Length; i++) | ||
{ | ||
ret[i] = (span[i] == scalar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much we care comparing 2 DateTime objects which containing same number of ticks but created with different time zone? the remark https://docs.microsoft.com/en-us/dotnet/api/system.datetime.op_equality?view=net-6.0#remarks explain more about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment apply for other equality methods in this file too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question here too. I don't think this is a concern until we hear customer feedback that they need better equality support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we understand the user scenarios? How will the data get supplied? Changing that in the future can be a breaking change. Will be good if we understand the expectation and code for that.
I left a few comments here. Does using In the runtime, to avoid such problem we introduced DateTimeOffset. I was going to suggest to use that instead of DateTime but the only concern with that is DateTimeOffset size is bigger than DateTime which can be issue with big data sets. |
@tarekgh Great question! The main goal for adding DateTime as a type is for parity with the IDataView (MB issue/ ML Issue). Right now this gap makes it hard to use the DataFrame as a way to process large amounts of data and then pass to ML.NET. Time zones are always a pain for dates, but as we're just following the existing support/expectations of the DateTime type I don't think it'll be a problem. Since we are just trying to light up Data Frame <-> IDataView parity at this point good time zone support isn't necessary. If customers end up needing a lot of date time manipulation we could invest in the DateTimeOffset at that point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a little comments. I'll leave it to you to decide if you are comfortable enough with the change to merge. Thanks for getting this done.
/azp run |
Commenter does not have sufficient privileges for PR 6302 in repo dotnet/machinelearning |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
So this is failing in some image-based tests on Ubuntu https://dev.azure.com/dnceng-public/public/_build/results?buildId=15034&view=ms.vss-test-web.build-test-results-tab&runId=290276&resultId=101333&paneView=dotnet-dnceng.dnceng-anon-build-release-tasks.helix-anon-test-information-tab
We know that System.Drawing doesn't work well on non-Windows and it's no longer supported. @tarekgh would it make sense to just disable these tests since they're causing noise and re-enable once we have the System.Drawing replacement? |
I am wondering if such tests are flakey as we didn't disable them till now. We can temporary disable them if it is consistently failing the CI. |
Fixes #6213, #5698, #5676
Add the DateTime type as a PrimitiveDataFrameColumn. This will better allow conversion between IDataView and the DataFrame.
Some DateTime code exists already, like the DateTime computations. However, the partial implementation left bugs, particularly when trying to convert between IDataView and DataFrame.
I tried to follow the existing patterns but if this shouldn't be bucketed in with the PrimitiveDataFrameColumns I can make changes.