Update samples #238

OliaG · 2018-05-24T19:29:26Z

No description provided.

zeahmed · 2018-05-24T19:55:45Z

examples/GitHubLabeler/GitHubLabeler/GitHubIssue.cs

+
+        [Column(ordinal: "4")]
+        [ColumnName("Label")]
+        public string Label;


[Column(ordinal: "4")] [ColumnName("Label")] public string Label;

I think this is not needed.

Agreed, just removed

TomFinley · 2018-05-24T23:22:07Z

Hi @OliaG , thanks for updating the PR. So I see we are deleting the files for taxi, at least, so I judge from the change line count (+34,393 −2,048,602), what I see of test/data/taxi-fare-train.csv, etc. Is that correct?

While the addition of these large files was unfortunate, due to the way git works with cloning preserving the whole history, the damage is already done in some sense. Any sort of "edits" on this file to delete or whatever merely compound the problem, especially if we decide to do a rebase to solve the problem once and for all any touching will just complicate that process. Perhaps we should just avoid touching it?

TomFinley · 2018-05-24T23:24:20Z

Maybe if we did git move on those files it would realize this is really a move, but if I look at the diff over here it seems to think we've deleted and added.

TomFinley · 2018-05-24T23:27:15Z

5 784 152:3 153:18 154:18 155:18 156:126 157:136 158:175 159:26 160:166 161:255 162:247 163:127 176:30 177:36 178:94 179:154 180:170 181:253 182:253 183:253 184:253 185:253 186:225 187:172 188:253 189:242 190:195 191:64 203:49 204:238 205:253 206:253 207:253 208:253 209:253 210:253 211:253 212:253 213:251 214:93 215:82 216:82 217:56 218:39 231:18 232:219 233:253 234:253 235:253 236:253 237:253 238:198 239:182 240:247 241:241 260:80 261:156 262:107 263:253 264:253 265:205 266:11 268:43 269:154 289:14 290:1 291:154 292:253 293:90 319:139 320:253 321:190 322:2 347:11 348:190 349:253 350:70 376:35 377:241 378:225 379:160 380:108 381:1 405:81 406:240 407:253 408:253 409:119 410:25 434:45 435:186 436:253 437:253 438:150 439:27 463:16 464:93 465:252 466:253 467:187 493:249 494:253 495:249 496:64 518:46 519:130 520:183 521:253 522:253 523:207 524:2 544:39 545:148 546:229 547:253 548:253 549:253 550:250 551:182 570:24 571:114 572:221 573:253 574:253 575:253 576:253 577:201 578:78 596:23 597:66 598:213 599:253 600:253 601:253 602:253 603:198 604:81 605:2 622:18 623:171 624:219 625:253 626:253 627:253 628:253 629:195 630:80 631:9 648:55 649:172 650:226 651:253 652:253 653:253 654:253 655:244 656:133 657:11 676:136 677:253 678:253 679:253 680:212 681:135 682:132 683:16

This looks like MNIST to me.

Generally I'd prefer that related sample data be stored in directories. If we're going to move all data files to a single directory, then we will need to give them meaningful names.

"Train-Tiny" is only a meaningful name if you know it is MNIST. I was only able to guess it was MNIST because I am very familiar with this dataset, other people may not have that advantage.

Refers to: examples/datasets/Train-Tiny-28x28.txt:1 in 9647337. [](commit_id = 9647337, deletion_comment = False)

TomFinley · 2018-05-24T23:29:28Z

examples/GettingStarted/MulticlassClassification_Iris/Program.cs

+            {
+                for (var j = 0; j < metrics.ConfusionMatrix.ClassNames.Count; j++)
+                {
+                    Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t");


Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t"); [](start = 20, length = 59)

Were the double tabs on both the start and end intentional?

It was intentional; it's using Console.Write not WriteLine, so the trailing tab is for subsequent writes.

Are you sure? I could see having a trailing tab, or a leading tab, but I don't think having both is intentional?

Let's imagine you're outputting, on one row, the numbers, say, 0, 1, 2, 3. The output is \t0\t\t1\t\t2\t\t3\t. Right? That seems wrong.

In reply to: 190758071 [](ancestors = 190758071)

TomFinley · 2018-05-24T23:30:40Z

examples/GettingStarted/Regression_TaxiFarePrediction/Program.cs

+        private static async Task<PredictionModel<TaxiTrip, TaxiTripFarePrediction>> TrainAsync()
+        {
+            // LearningPipeline holds all steps of the learning process: data, transforms, learners.
+            var pipeline = new LearningPipeline


These are very nice comments. #ByDesign

OliaG · 2018-05-24T23:57:25Z

@TomFinley

I see we are deleting the files for taxi, at least, so I judge from the change line count (+34,393 −2,048,602), what I see of test/data/taxi-fare-train.csv, etc. Is that correct?

That's correct. I've undone the shrinking so now it's a pure move without changes.

While the addition of these large files was unfortunate, due to the way git works with cloning preserving the whole history, the damage is already done in some sense.

Good point.

Maybe if we did git move on those files it would realize this is really a move, but if I look at the diff over here it seems to think we've deleted and added.

Git doesn't track renames. git move is syntax sugar for add and remove. Git will track renames later based on a similarity threshold, which is AFAIK 75% by default and the massive shrinking would be above that threshold.

"Train-Tiny" is only a meaningful name if you know it is MNIST.

OK, so I'll rename Train-Tiny-28x28 to MNIST-Train-Tiny-28x28.

codemzs · 2018-05-25T00:00:59Z

...Started/BinaryClassification_SentimentAnalysis/BinaryClassification_SentimentAnalysis.csproj

+    <Folder Include="datasets\" />
+  </ItemGroup>
+
+</Project>


What is the rationale behind creating a separate project for "Binary Classification sentiment analysis"? Same for MC_Iris, TaxiFarePrediction.

What I'm seeing is you have taken the scenario test and broke it down into many files. What does this accomplish?

We want to have separate project for different problem types so it aligns with our docs

codemzs · 2018-05-25T00:07:42Z

examples/GettingStarted/BinaryClassification_SentimentAnalysis/SentimentData.cs

+        [Column("1", name: "Label")]
+        public float Sentiment;
+    }
+}


Why does this class needs its own file? You can keep SentimentData, SentimentPrediction and TestSentimentData in Program.cs.

The general .NET coding guidelines use one class per file and we want our samples to use idiomatic C#.

@OliaG Can you please point me to this guideline? Vast majority of ML.Net code has multiple class definitions in one file.

Can't find the written up rule, but based on my conversation with @terrajobst that's what the .NET team is usually doing.

If there isn't a written rule then we should keep things consistent with the current code base. Here is an example from .Net Corefx repo that shows multiple classes defined in one file.

@codemzs

If there isn't a written rule then we should keep things consistent with the current code base

I disagree. Samples aren't part of the ML.NET codebase. They are for our customers and the vast majority doesn't need to know nor care about the ML.NET coding conventions. That's why we try to follow the conventions that the vast majority of our customers are using.

That being said, I don't have a problem if we were to merge all these classes into one file as it might even help to keep the sample easier to understand.

The guideline is interesting, and one we generally try to obey ourselves, but not when separating definitions would lead to less comprehensible code. So we tend to declare types meant to be understood together as a unit, in the same file. (This isn't Java after all. :) ) This clearly falls into that bucket. Especially in this case where the intent is pedagogical, we would "bend" any rules we usually follow about style (even if this was an actual policy, which, as Zeeshan points out, it is not).

In this case what I'd rather do is have a (private?) nested type, declared right next to their usages in the method. This will structure the example so that the method and class can be understood. Then you can get rid of these three files, and have only one, more easily understood example.

In reply to: 190942775 [](ancestors = 190942775)

codemzs · 2018-05-25T00:09:11Z

examples/GitHubLabeler/GitHubLabeler/Predictor.cs

+        {
+            var pipeline = new LearningPipeline();
+
+            pipeline.Add(new TextLoader<GitHubIssue>(DataPath, useHeader: true));


pipeline.Add(new TextLoader(DataPath, useHeader: true)); [](start = 9, length = 72)

How is this line even compiling? Are you using some outdated nuget package? The loader API has been changed.

These are samples, which means they are building against the latest package that is published to nuget.org, not what is currently in master. Hence, they are working as we're consuming what our customers would, which is what we want.

Hmmm. The alternate argument is, if they're checked in with the repo, they ought to target the code in the repo, since then the samples are simply always up to date, if we make them part of our build/test cycle. Indeed until I saw that we were using this outdated stuff, I had fully expected that to be the case.

The nuget story makes the samples far more difficult to maintain as a code artifact. Especially in these days, when the API is still in heavy flux. I'm considering two worlds:

The samples are incrementally changed so that as we move towards 0.2, changing the samples one at a time by the person authoring that PR (and so, hopefully, understands how it should be use), I see how development happens. Also at any given point, the samples are correct.

The samples remain unchanged. Come release time, it is someone's job (whose?) to comprehend all PRs against the API in that time, so they have a clue of how the sample should be updated, and then we hope they do all that correctly. Meanwhile the release is blocked on this point.

Generally anything that reduces maintainability and adds costs to our release process, is something we ought to be hostile towards.

In reply to: 190764164 [](ancestors = 190764164)

Actually yeah, I was thinking about it, and the nuget approach just seems deeply wrong. You couldn't even write the PR to update the samples until after you've actually published the nuget. This means that any release branch we make will either have by design outdated samples, or we will have to accept a world in which the nuget published for version X never corresponds to the branch release/X. Regarding it being what we want, I'm not sure who wants that, but whoever they are, let's have them want something else. :)

In reply to: 190949466 [](ancestors = 190949466,190764164)

- Iris sample - Sentiment Analysis sample - TaxiFare sample - GitHub issues classification sample

Moved datasets from tests to examples Removed model files

codemzs

🕐

zeahmed reviewed May 24, 2018

View reviewed changes

OliaG force-pushed the samples branch 2 times, most recently from 9ceb805 to ed14986 Compare May 24, 2018 22:55

OliaG requested a review from terrajobst May 24, 2018 23:01

OliaG changed the title ~~WIP: Update samples~~ Update samples May 24, 2018

shauheen requested review from TomFinley and codemzs May 24, 2018 23:19

TomFinley reviewed May 24, 2018

View reviewed changes

OliaG force-pushed the samples branch from 9647337 to d49629d Compare May 24, 2018 23:29

TomFinley reviewed May 24, 2018

View reviewed changes

OliaG force-pushed the samples branch from d49629d to d6c6949 Compare May 24, 2018 23:52

OliaG added API Issues pertaining the friendly API and removed API Issues pertaining the friendly API labels May 24, 2018

codemzs reviewed May 25, 2018

View reviewed changes

oliag added 6 commits May 24, 2018 17:13

Added GitHubLabeler and GettingStarted

c690212

- Iris sample - Sentiment Analysis sample - TaxiFare sample - GitHub issues classification sample

Address review comments

560293c

Reorganized datasets

3b1e5eb

Moved datasets from tests to examples Removed model files

Fixed unit tests

27439e8

Add README for GettingStarted sln

5658af3

Address more code review feedback

dc15aef

OliaG force-pushed the samples branch from 458c344 to 3a998bb Compare May 25, 2018 00:16

codemzs requested changes May 25, 2018

View reviewed changes

Make MNIST dataset name more meaningful

1c55277

OliaG force-pushed the samples branch from 3a998bb to 1c55277 Compare May 25, 2018 01:13

terrajobst approved these changes May 25, 2018

View reviewed changes

asthana86 self-requested a review May 25, 2018 13:10

OliaG closed this May 25, 2018

OliaG deleted the samples branch May 30, 2018 19:40

ghost locked as resolved and limited conversation to collaborators Mar 30, 2022

Update samples #238

Update samples #238

Uh oh!

Conversation

OliaG commented May 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zeahmed May 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley commented May 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomFinley commented May 24, 2018

Uh oh!

TomFinley commented May 24, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OliaG commented May 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codemzs May 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

terrajobst May 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codemzs left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

OliaG commented May 24, 2018 •

edited

Loading

zeahmed May 24, 2018 •

edited

Loading

TomFinley commented May 24, 2018 •

edited

Loading

TomFinley May 24, 2018 •

edited

Loading

OliaG commented May 24, 2018 •

edited

Loading

codemzs May 25, 2018 •

edited

Loading

terrajobst May 25, 2018 •

edited

Loading