Update samples #238
[Column(ordinal: "4")]
[ColumnName("Label")]
public string Label;
I think this is not needed.
Agreed, just removed
9ceb805 to ed14986
Hi @OliaG, thanks for updating the PR. I see we are deleting the files for taxi, at least judging from the change line count (+34,393 −2,048,602). While the addition of these large files was unfortunate, the damage is in some sense already done, because a git clone preserves the whole history. Any sort of "edits" to these files, deletions included, merely compound the problem. If we decide to do a rebase to solve the problem once and for all, any touching will just complicate that process. Perhaps we should just avoid touching them?
Maybe if we did
This looks like MNIST to me. Generally I'd prefer that related sample data be stored in directories. If we're going to move all data files to a single directory, then we will need to give them meaningful names. "Train-Tiny" is only a meaningful name if you know it is MNIST. I was only able to guess it was MNIST because I am very familiar with this dataset; other people may not have that advantage. Refers to: examples/datasets/Train-Tiny-28x28.txt:1 in 9647337.
{
    for (var j = 0; j < metrics.ConfusionMatrix.ClassNames.Count; j++)
    {
        Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t");
Were the double tabs on both the start and end intentional?
It was intentional; it's using Console.Write, not WriteLine, so the trailing tab is for subsequent writes.
Are you sure? I could see having a trailing tab, or a leading tab, but I don't think having both is intentional?
Let's imagine you're outputting, on one row, the numbers, say, 0, 1, 2, 3. The output is \t0\t\t1\t\t2\t\t3\t. Right? That seems wrong.
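To make the doubled tabs concrete, here is a minimal C# sketch that reproduces the 0, 1, 2, 3 row from the comment above and contrasts it with a single tab between cells. The class and method names are hypothetical and not part of the PR, and the single-separator variant is only one possible fix, not necessarily what the author intended.

```csharp
using System;
using System.Linq;

public static class TabSeparatorDemo
{
    // Mirrors the reviewed call: a tab before and after every cell.
    public static string TabOnBothSides(int[] cells) =>
        string.Concat(cells.Select(c => "\t" + c + "\t"));

    // One possible alternative (an assumption): a single tab between cells.
    public static string TabBetween(int[] cells) =>
        string.Join("\t", cells);

    // Replaces real tab characters with the two-character text \t so they are visible.
    public static string ShowTabs(string s) => s.Replace("\t", "\\t");

    public static void Main()
    {
        var row = new[] { 0, 1, 2, 3 };
        Console.WriteLine(ShowTabs(TabOnBothSides(row))); // \t0\t\t1\t\t2\t\t3\t
        Console.WriteLine(ShowTabs(TabBetween(row)));     // 0\t1\t2\t3
    }
}
```

With a tab on both sides, adjacent cells end up separated by two tabs, which is the behavior being questioned.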
private static async Task<PredictionModel<TaxiTrip, TaxiTripFarePrediction>> TrainAsync()
{
    // LearningPipeline holds all steps of the learning process: data, transforms, learners.
    var pipeline = new LearningPipeline
These are very nice comments. #ByDesign
That's correct. I've undone the shrinking so now it's a pure move without changes.
Good point.
Git doesn't track renames.
OK, so I'll rename
    <Folder Include="datasets\" />
  </ItemGroup>

</Project>
What is the rationale behind creating a separate project for "Binary Classification sentiment analysis"? Same for MC_Iris, TaxiFarePrediction.
What I'm seeing is that you have taken the scenario test and broken it down into many files. What does this accomplish?
We want a separate project for each problem type so that the samples align with our docs.
        [Column("1", name: "Label")]
        public float Sentiment;
    }
}
Why does this class need its own file? You can keep SentimentData, SentimentPrediction and TestSentimentData in Program.cs.
The general .NET coding guidelines use one class per file and we want our samples to use idiomatic C#.
@OliaG Can you please point me to this guideline? The vast majority of ML.NET code has multiple class definitions in one file.
I can't find a written rule, but based on my conversation with @terrajobst, that's what the .NET team usually does.
If there isn't a written rule, then we should keep things consistent with the current code base. Here is an example from the .NET CoreFX repo that shows multiple classes defined in one file.
If there isn't a written rule then we should keep things consistent with the current code base
I disagree. Samples aren't part of the ML.NET codebase. They are for our customers, and the vast majority of them neither need to know nor care about the ML.NET coding conventions. That's why we try to follow the conventions that the vast majority of our customers are using.
That being said, I don't have a problem if we were to merge all these classes into one file as it might even help to keep the sample easier to understand.
The guideline is interesting, and one we generally try to obey ourselves, but not when separating definitions would lead to less comprehensible code. So we tend to declare types meant to be understood together as a unit, in the same file. (This isn't Java after all. :) ) This clearly falls into that bucket. Especially in this case where the intent is pedagogical, we would "bend" any rules we usually follow about style (even if this was an actual policy, which, as Zeeshan points out, it is not).
In this case what I'd rather do is have (private?) nested types, declared right next to their usages in the method. This will structure the example so that the method and classes can be understood together. Then you can get rid of these three files, and have only one, more easily understood example.
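As a rough illustration of that suggestion, the sketch below keeps the data classes as private nested types in the same file as the code that uses them. It is an assumption-laden outline, not the sample from the PR: the `SentimentText` field, the `Predict` placeholder, and `PredictIsPositive` are hypothetical, and no ML.NET API is called.

```csharp
using System;

public static class SentimentSample
{
    // Hypothetical stand-ins for the sample's data classes, declared as
    // private nested types right next to the code that uses them.
    private class SentimentData
    {
        public string SentimentText; // assumed field name, not taken from the PR
        public float Sentiment;
    }

    private class SentimentPrediction
    {
        public bool Sentiment;
    }

    // Placeholder for a trained model; this is NOT an ML.NET call.
    private static SentimentPrediction Predict(SentimentData input) =>
        new SentimentPrediction { Sentiment = input.SentimentText.Contains("good") };

    // Public entry point so callers never need to see the nested types.
    public static bool PredictIsPositive(string text) =>
        Predict(new SentimentData { SentimentText = text }).Sentiment;

    public static void Main()
    {
        Console.WriteLine(PredictIsPositive("This is a good sample"));
    }
}
```

Everything a reader needs sits in one file, and the nested types stay invisible outside the sample class.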
{
    var pipeline = new LearningPipeline();

    pipeline.Add(new TextLoader<GitHubIssue>(DataPath, useHeader: true));
How is this line even compiling? Are you using some outdated nuget package? The loader API has been changed.
These are samples, which means they are building against the latest package that is published to nuget.org, not what is currently in master. Hence, they are working as we're consuming what our customers would, which is what we want.
Hmmm. The alternate argument is, if they're checked in with the repo, they ought to target the code in the repo, since then the samples are simply always up to date, if we make them part of our build/test cycle. Indeed until I saw that we were using this outdated stuff, I had fully expected that to be the case.
The nuget story makes the samples far more difficult to maintain as a code artifact. Especially in these days, when the API is still in heavy flux. I'm considering two worlds:
- The samples are changed incrementally as we move towards 0.2, one at a time, by the person authoring each PR (who therefore, hopefully, understands how the API should be used), so that I can see how development happens. Also, at any given point, the samples are correct.
- The samples remain unchanged. Come release time, it is someone's job (whose?) to comprehend all PRs against the API in that time, so they have a clue of how the sample should be updated, and then we hope they do all that correctly. Meanwhile the release is blocked on this point.
Generally, anything that reduces maintainability and adds costs to our release process is something we ought to be hostile towards.
Actually yeah, I was thinking about it, and the nuget approach just seems deeply wrong. You couldn't even write the PR to update the samples until after you've actually published the nuget. This means that any release branch we make will either have by design outdated samples, or we will have to accept a world in which the nuget published for version X never corresponds to the branch release/X. Regarding it being what we want, I'm not sure who wants that, but whoever they are, let's have them want something else. :)
- Iris sample
- Sentiment Analysis sample
- TaxiFare sample
- GitHub issues classification sample
Moved datasets from tests to examples
Removed model files
🕐
No description provided.