ML taxi fare tutorial updates #5898
pkulikov
left a comment
Some clarifying comments.
|
|
||
| To predict the taxi fare, you first select the appropriate machine learning task. You are looking to predict a real value (a double that represents price) based on the other factors in the dataset. You choose a [**regression**](../resources/glossary.md#regression) task. | ||
|
|
||
| The process of training the model identifies which factors in the dataset are most influential when predicting the final fare price. |
I've removed that sentence because it's the gun on the wall that never fires: the tutorial never comes back to that topic. As the tutorial is already long, I thought it wasn't worth leaving loose ends in.
|
|
||
| `TaxiTrip` is the input data class and has definitions for each of the data set columns. Use the [Column](xref:Microsoft.ML.Runtime.Api.ColumnAttribute) attribute to specify the indices of the source columns in the data set. | ||
|
|
||
| The `TaxiTripFarePrediction` class is used to represent predicted results. It has a single float (`FareAmount`) field with a `Score` [ColumnName](xref:Microsoft.ML.Runtime.Api.ColumnNameAttribute) attribute applied. The **Score** column is the special column in ML.NET. The model outputs predicted values into that column. |
@aditidugar please check that the description of the Score column is correct. The current version of the tutorial doesn't explain why the ColumnName attribute is used with the Score name. However, that attribute is quite important (I guess a field named Score would also work).
@OliaG might know better on this one. It sounds correct based on my understanding.
@pkulikov that is correct. Either a field named Score or the attribute works.
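For illustration, the two options confirmed in this thread could look roughly like the sketch below. It assumes the `Microsoft.ML.Runtime.Api` namespace that the xref links in the diff point to; the second class name is hypothetical and exists only to show both variants side by side.

```csharp
using Microsoft.ML.Runtime.Api;

// Option 1 (as in the tutorial): keep a descriptive field name and
// map it to the special Score column with the ColumnName attribute.
public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

// Option 2 (hypothetical class name): name the field Score directly,
// so no attribute is needed.
public class TaxiTripFarePredictionByName
{
    public float Score;
}
```

The attribute variant lets the calling code read `prediction.FareAmount`, which fits the fare-prediction scenario better than a generic `Score` field.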
| You'll refer to the columns without the underscores in the code you're creating. Copy the `FareAmount` column into a new column called "Label" using the `ColumnCopier()` function. This column is the **Label**. | ||
| In the next steps we refer to the columns by the names defined in the `TaxiTrip` class. | ||
|
|
||
| When the model is trained, the values in the column named **Label** are considered as values to be predicted. As we want to predict the taxi trip fare, copy the `FareAmount` column into the **Label** column. To do that, use <xref:Microsoft.ML.Transforms.ColumnCopier> and add the following code: |
@aditidugar please check the updated wording about the Label column, which is another special column in ML.NET.
@OliaG again.
Yes, or have one of your fields called Label.
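A minimal sketch of the two ways to provide the Label column mentioned in this thread, assuming the pipeline API the tutorial targets at the time of this PR. The helper and class names are hypothetical, and the column index `6` is an assumption about where the fare sits in the data file.

```csharp
using Microsoft.ML;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Transforms;

public static class LabelSketch
{
    // Option 1 (as in the tutorial): copy FareAmount into the Label column.
    public static void AddLabelCopy(LearningPipeline pipeline)
    {
        pipeline.Add(new ColumnCopier(("FareAmount", "Label")));
    }
}

// Option 2 (hypothetical class): name the input field Label directly,
// so no copy step is needed. The index "6" is an assumption.
public class TaxiTripWithLabelField
{
    [Column("6")]
    public float Label;
}
```

The copy step keeps the descriptive `FareAmount` name in the input class, at the cost of one extra transform in the pipeline.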
OliaG
left a comment
Thank you very much Petr for your contribution!
| > * Understand the problem | ||
| > * Select the appropriate machine learning task | ||
| > * Prepare and understand your data | ||
| > * Prepare and understand the data |
@JRAlexander, what are the conventions for "your" vs "the" for this repo and .NET docs? I see other tutorials are also using phrases like "your data".
@mairaw would be the best person to respond.
Our conventions accept both, so it's decided case by case. I think in this case using "the" is better.
| ## Load and transform data | ||
|
|
||
| Next, load your data into the pipeline. Point to the `_datapath` created initially and specify the delimiter of the .csv file (,). Add the following code into the `Train()` method underneath the last step: | ||
| The first step that the learning pipeline performs is loading data from the training data set. In our case, training data set is stored in the text file, which path is defined by the value of the `_datapath` constant. That file contains the header with the column names, so the first row should be ignored while loading data. Columns in the file are separated by the comma (","). Add the following code into the `Train` method: |
Perhaps "with a path defined by the `_datapath` constant"?
| ``` | ||
|
|
||
| The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code: | ||
| The last step in data preparation combines all of the **features** into one vector using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This necessary step helps the algorithm to easily process the features. Add the following code: |
It's not really that it helps the algorithm to easily process the features, but rather that the learner looks for the "Features" column by default and only uses that column as the features.
| ## Choose a learning algorithm | ||
|
|
||
| After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a learner called `FastTreeRegressor()` to the pipeline that utilizes **gradient boosting**. | ||
| After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a <xref:Microsoft.ML.Trainers.FastTreeRegressor> learner that utilizes **gradient boosting** to the pipeline. |
Perhaps it's worth mentioning that there are other learners that can also be used for regression? This makes it sound like FastTreeRegressor is the only option.
That's a good suggestion. However, I suggest adding it later, when the topic that describes learners is ready. Then there will be something to reference. What do you think?
That works as well. My only concern is that if someone does not read all the way to the end, they might leave with the impression that it is the only learner. Perhaps just change the wording a bit and expand on it later?
| ``` | ||
|
|
||
| The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code: | ||
| The last step in data preparation combines all of the feature columns into the **Features** column using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This step is necessary as a learner processes only features from the **Features** column. Add the following code: |
Not sure whether it is worth clarifying that using the Features column is the default behavior and that it can be changed to a different column (still a single column), or whether this just complicates the tutorial.
I think it would complicate the tutorial, so let's keep that out.
However, a separate topic that explains the predefined/special columns in ML.NET and how to work with them might be useful.
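As a rough sketch of the step under discussion, the code below combines feature columns into the **Features** column and then adds the learner, which reads that column by default. It assumes the pipeline API the tutorial targets at the time of this PR. The helper name is hypothetical, and the input column names `PassengerCount` and `TripDistance` are assumptions, since the diff excerpts here name only `FareAmount`.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

public static class FeaturesSketch
{
    // Hypothetical helper: expects a pipeline that already has a data loader.
    public static void AddFeaturesAndLearner(LearningPipeline pipeline)
    {
        // First argument is the output column; the rest are the inputs.
        // The input column names are assumptions for illustration.
        pipeline.Add(new ColumnConcatenator(
            "Features", "PassengerCount", "TripDistance"));

        // The learner picks up the Features column by default.
        pipeline.Add(new FastTreeRegressor());
    }
}
```

Any categorical columns would need to be encoded as numbers before concatenation; that step is omitted here.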
GalOshri
left a comment
Thank you for making these changes!
@JRAlexander @mairaw it looks like this PR might be ready to merge.
JRAlexander
left a comment
LGTM. Thanks, @pkulikov, for your contribution! I'll merge and you should see it live in a few days.
This PR tries to make the tutorial a little bit smoother to follow. Also added xref-links to the ML API.
@aditidugar @JRAlexander thanks a lot for creating the ML taxi fare tutorial; please check whether the changes I made are correct.
Fixes #5629