-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to merge using multiple columns in JOIN condition #5838
Conversation
Merge latest changes from dotnet/machinelearning
Back merge from base repository
Backmerge latest changes from dotnet/machinelearning
…://github.com/asmirnov82/machinelearning into feature/5657_dataframe_merge_multiple_columns # Conflicts: # src/Microsoft.Data.Analysis/DataFrame.cs # src/Microsoft.Data.Analysis/GroupBy.cs # test/Microsoft.Data.Analysis.Tests/DataFrameGroupByTests.cs
…ltiple_columns Feature/5657 dataframe merge multiple columns
Codecov Report
@@ Coverage Diff @@
## main #5838 +/- ##
==========================================
+ Coverage 68.19% 68.26% +0.07%
==========================================
Files 1142 1142
Lines 242367 242534 +167
Branches 25355 25378 +23
==========================================
+ Hits 165282 165574 +292
+ Misses 70406 70279 -127
- Partials 6679 6681 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
{ | ||
foreach (long nullIndex in smallerDataFrameColumnNullIndices) | ||
var newValue = kvp.Value.Where(i => occurrences[kvp.Key].Contains(i)).ToArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q for next round of review: Is LINQ necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asmirnov82 : I think you can get rid of the ToArray here right? newValue
will then be an IEnumerable<long>
and you can still call newValue.Any()
in the next line.
Edit: Never mind, I see that shrinkedOccurences
needs a value of ICollection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments. They are mostly for myself right now. Please don't answer them yet. I will do another round of reviewing these changes tomorrow.
Overall this PR is looking quite good. I like the changes, great job! Apologies for the late review here. I was super busy with .NET 6 stuff that I didn't have time to focus on DataFrame for 2 months. I'm relatively free all of September though, so I can review changes as they come in for DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done with the 2nd review. LGTM. Thanks for the work here @asmirnov82. This is a great addition to DataFrame!
I looked at the test failures and it looks unrelated. Do you mind rebasing on the latest main branch? That might fix the CI failures and let us merge this PR |
Backmerge latest changes from original repo
I merged latest into my repo, this didn't help. Several tests failed, due to infrastructure issue, for example one with unability to load nuget package due to connection timeout (".dotnet\sdk\6.0.100-preview.3.21202.5\NuGet.targets(131,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to download package 'runtime.native.System.4.3.0' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/45bacae2-5efb-47c8-91e5-8ec20c22b4f8/nuget/v3/flat2/runtime.native.system/4.3.0/runtime.native.system.4.3.0.nupkg'. |
@michaelgsharp : I'm seeing this error in CI:
Any idea if this is being tracked somewhere? |
@pgovind its not yet, but I did bring this up yesterday. There are some tests that are flaky, and this is one of them. It doesn't happen always, and I believe its only with arm stuff right? It would be good to make an issue about flaky tests |
Unrelated CI failure. Merging |
Fixes #5657
PR contains 3 main changes (these changes were done independently, step by step):
Avoid code duplication in Merge implementation (commit - dc8daf0). To achive that I created new static function, that merges 2 dataframes (retained and supplementary). All merge implementations (Left. right, inner and outer) uses this function to perform correct merge.
Change templated (generic) DataFrame Merge (...) function to general (not templated) DataFrame Merge(...)
Commit a4735ce. To achive the goal I added method to find element entrance of one column into another inside the column (column is already templated). Base column implementation provides 2 methods: one is public not templated and abstract (used like an interface).
public abstract Dictionary<long, ICollection<long>> GetGroupedOccurrences(DataFrameColumn other, out HashSet<long> otherColumnNullIndices);
Second provides real implementation and is templated:
protected Dictionary<long, ICollection<long>> GetGroupedOccurrences<TKey>(DataFrameColumn other, out HashSet<long> otherColumnNullIndices)
Each column implements abstract method by calling protected method providing correct Type as a generic
This step was neccessary to allow group by multiple columns, that may have different data types