ML taxi fare tutorial updates #5898

pkulikov · 2018-06-10T20:40:01Z

This PR tries to make the tutorial a little smoother to follow. It also adds xref links to the ML API.

@aditidugar @JRAlexander thanks a lot for creating the ML taxi fare tutorial; please check whether the changes I made are correct.

Fixes #5629

pkulikov

Some clarifying comments.

pkulikov · 2018-06-10T20:42:43Z

docs/machine-learning/tutorials/taxi-fare.md


 To predict the taxi fare, you first select the appropriate machine learning task. You are looking to predict a real value (a double that represents price) based on the other factors in the dataset. You choose a [**regression**](../resources/glossary.md#regression) task.

-The process of training the model identifies which factors in the dataset are most influential when predicting the final fare price.


I've removed that sentence because it's the gun on the wall that never fires: the tutorial never comes back to that topic. And as the tutorial is already long, I thought it wasn't worth leaving loose ends in.

pkulikov · 2018-06-10T20:45:24Z

docs/machine-learning/tutorials/taxi-fare.md

+
+`TaxiTrip` is the input data class and has definitions for each of the data set columns. Use the [Column](xref:Microsoft.ML.Runtime.Api.ColumnAttribute) attribute to specify the indices of the source columns in the data set.
+
+The `TaxiTripFarePrediction` class is used to represent predicted results. It has a single float (`FareAmount`) field with a `Score` [ColumnName](xref:Microsoft.ML.Runtime.Api.ColumnNameAttribute) attribute applied. The **Score** column is the special column in ML.NET. The model outputs predicted values into that column.


@aditidugar please check that the description of the Score column is correct. The current version of the tutorial doesn't explain why the ColumnName attribute is used with the Score name. However, it's quite important to have that attribute (I guess a field named Score would also work).

@OliaG might know better on this one. It sounds correct based on my understanding.

@pkulikov that is correct. Either a field named Score or the attribute works.
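
For reference, a minimal sketch of the two options discussed in this thread. The first class follows the diff hunk above (a `FareAmount` field with a `Score` `ColumnName` attribute). The second class, `TaxiTripFarePredictionByName`, is a hypothetical name used only for illustration, and the claim that a field named `Score` works without the attribute rests on the reviewers' statements above. Neither class was compiled here.

```csharp
using Microsoft.ML.Runtime.Api;

// Option 1 (as in the diff above): the field keeps a descriptive name and is
// mapped to the Score column through the ColumnName attribute.
public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

// Option 2 (hypothetical class name): per the review comments, a field that is
// itself named Score should work without the attribute.
public class TaxiTripFarePredictionByName
{
    public float Score;
}
```

Option 1 keeps the domain-specific field name in calling code, which is presumably why the tutorial uses it.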

pkulikov · 2018-06-10T20:48:12Z

docs/machine-learning/tutorials/taxi-fare.md

-You'll refer to the columns without the underscores in the code you're creating. Copy the `FareAmount` column into a new column called "Label" using the `ColumnCopier()` function. This column is the **Label**.
+In the next steps we refer to the columns by the names defined in the `TaxiTrip` class.
+
+When the model is trained, the values in the column named **Label** are considered as values to be predicted. As we want to predict the taxi trip fare, copy the `FareAmount` column into the **Label** column. To do that, use <xref:Microsoft.ML.Transforms.ColumnCopier> and add the following code:


@aditidugar please check the updated wording about the Label column, which is another special column in ML.NET.

@OliaG again.

Yes, or have one of your fields called Label.

OliaG

Thank you very much Petr for your contribution!

GalOshri · 2018-06-18T16:12:42Z

docs/machine-learning/tutorials/taxi-fare.md

 > * Understand the problem
 > * Select the appropriate machine learning task
-> * Prepare and understand your data
+> * Prepare and understand the data


@JRAlexander, what are the conventions for "your" vs "the" for this repo and .NET docs? I see other tutorials are also using phrases like "your data".

@mairaw would be the best person to respond.

Our conventions accept both, so it's decided case by case. I think in this case using "the" is better.

GalOshri · 2018-06-18T16:20:34Z

docs/machine-learning/tutorials/taxi-fare.md

+## Load and transform data

-Next, load your data into the pipeline. Point to the `_datapath` created initially and specify the delimiter of the .csv file (,). Add the following code into the `Train()` method underneath the last step:
+The first step that the learning pipeline performs is loading data from the training data set. In our case, training data set is stored in the text file, which path is defined by the value of the `_datapath` constant. That file contains the header with the column names, so the first row should be ignored while loading data. Columns in the file are separated by the comma (","). Add the following code into the `Train` method:


Perhaps "with a path defined by the `_datapath` constant"?

GalOshri · 2018-06-18T16:22:13Z

docs/machine-learning/tutorials/taxi-fare.md

 ```

-The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code:
+The last step in data preparation combines all of the **features** into one vector using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This necessary step helps the algorithm to easily process the features. Add the following code:


It's not really that it helps the algorithm to easily process the features, but rather that the learner looks for the "Features" column by default and only uses that column as the features.

GalOshri · 2018-06-18T16:23:34Z

docs/machine-learning/tutorials/taxi-fare.md

 ## Choose a learning algorithm

-After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a learner called `FastTreeRegressor()` to the pipeline that utilizes **gradient boosting**.
+After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a <xref:Microsoft.ML.Trainers.FastTreeRegressor> learner that utilizes **gradient boosting** to the pipeline.


Perhaps it's worth mentioning that there are other learners that can also be used for regression? This makes it sound like FastTreeRegressor is the only option.

That's a good suggestion. However, I suggest adding it later, when the topic that describes learners is ready. Then there will be something to reference. What do you think?

That works as well. My only concern is that if someone does not read all the way to the end, they might leave with the impression that it is the only learner. Perhaps just change the wording a bit and expand on it later?

pkulikov · 2018-06-18T23:49:11Z

@GalOshri thank you for the review. I've addressed your feedback in the last three commits. Please check them.

I've mentioned that there are other regression learners available. Anyway, I'll come back to that paragraph after PR #5698 is merged and published.

GalOshri · 2018-06-19T06:29:23Z

docs/machine-learning/tutorials/taxi-fare.md

 ```

-The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code:
+The last step in data preparation combines all of the feature columns into the **Features** column using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This step is necessary as a learner processes only features from the **Features** column. Add the following code:


Not sure whether it is worth clarifying that using the Features column is the default behavior but that it can be changed to a different column (still a single column), or whether this just complicates the tutorial.

I think it would complicate the tutorial, so let's keep that out.
However, a separate topic that explains the predefined/special columns in ML.NET and how to work with them might be useful.
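
To make the roles of the special **Label** and **Features** columns concrete, here is a hedged sketch of the pipeline steps discussed in this review. It assumes the preview-era `LearningPipeline` API that the xref links in the diff point to. The reduced `TaxiTrip` class, its column indices, the choice of feature columns, and the `PipelineSketch.Build` helper are illustrative assumptions, and the `TextLoader(...).CreateFrom<T>` call reflects my understanding of that preview API. The code was not compiled against the package.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

// Reduced input class (assumption: indices match the columns of the data file).
public class TaxiTrip
{
    [Column("2")] public float PassengerCount;
    [Column("4")] public float TripDistance;
    [Column("6")] public float FareAmount;
}

public static class PipelineSketch
{
    // Hypothetical helper that assembles the steps described in the review.
    public static LearningPipeline Build(string dataPath)
    {
        var pipeline = new LearningPipeline();

        // Load the comma-separated file and skip the header row.
        pipeline.Add(new TextLoader(dataPath)
            .CreateFrom<TaxiTrip>(useHeader: true, separator: ','));

        // Copy the value to predict into the special Label column.
        pipeline.Add(new ColumnCopier(("FareAmount", "Label")));

        // Combine the input columns into the single Features column,
        // which the learner reads by default.
        pipeline.Add(new ColumnConcatenator("Features",
            "PassengerCount", "TripDistance"));

        // One of several regression learners.
        pipeline.Add(new FastTreeRegressor());

        return pipeline;
    }
}
```

Only numeric columns are concatenated in this sketch; the categorical columns of the real data set would need to be encoded before they could join **Features**.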

GalOshri

Thank you for making these changes!

pkulikov · 2018-06-19T10:52:34Z

@JRAlexander @mairaw it looks like this PR is ready to be merged.

JRAlexander

LGTM. Thanks, @pkulikov, for your contribution! I'll merge and you should see it live in a few days.

pkulikov added 2 commits June 10, 2018 20:31

Updated ML taxi fare tutorial

a394351

Mention that the Score column is special

85fc30a

pkulikov requested a review from JRAlexander as a code owner June 10, 2018 20:40

pkulikov commented Jun 10, 2018


mairaw requested a review from OliaG June 11, 2018 02:34

mairaw added the Area - ML.NET Guide label Jun 12, 2018

mairaw requested review from GalOshri and aditidugar-zz June 12, 2018 21:49

mairaw assigned pkulikov Jun 12, 2018

mairaw added this to the Sprint 137 (06/11/18 - 06/29/18) milestone Jun 12, 2018

mairaw added the waiting-on-reviews label and removed the waiting-on-feedback label Jun 12, 2018

OliaG approved these changes Jun 16, 2018


GalOshri reviewed Jun 18, 2018


pkulikov added 3 commits June 18, 2018 18:25

Addressed feedback

49f0b87

Addressed feedback

ddc20d0

Addressed feedback

7ca2924

GalOshri reviewed Jun 19, 2018


GalOshri approved these changes Jun 19, 2018


JRAlexander approved these changes Jun 20, 2018


JRAlexander merged commit 4903733 into dotnet:master Jun 20, 2018

JRAlexander removed the waiting-on-reviews label Jun 20, 2018

pkulikov deleted the ml-taxi-fare-tutorial-updates branch June 20, 2018 18:56

BillWagner added the dotnet-ml/svc label Feb 9, 2021

BillWagner removed the 📚 Area - ML.NET Guide label Feb 9, 2021


