@pkulikov pkulikov commented Jun 10, 2018

This PR tries to make the tutorial a little bit smoother to follow. Also added xref-links to the ML API.

@aditidugar @JRAlexander thanks a lot for creating the ML taxi fare tutorial; please check whether the changes I made are correct.

Fixes #5629

@pkulikov pkulikov requested a review from JRAlexander as a code owner June 10, 2018 20:40

Contributor Author

@pkulikov pkulikov left a comment

Some clarifying comments.


To predict the taxi fare, you first select the appropriate machine learning task. You are looking to predict a real value (a double that represents price) based on the other factors in the dataset. You choose a [**regression**](../resources/glossary.md#regression) task.

The process of training the model identifies which factors in the dataset are most influential when predicting the final fare price.

Contributor Author

I've removed that sentence because it's a gun on the wall that never fires: the tutorial never comes back to that topic. As the tutorial is already long, I thought it wasn't worth keeping loose ends in.


`TaxiTrip` is the input data class and has definitions for each of the data set columns. Use the [Column](xref:Microsoft.ML.Runtime.Api.ColumnAttribute) attribute to specify the indices of the source columns in the data set.

The `TaxiTripFarePrediction` class is used to represent predicted results. It has a single float (`FareAmount`) field with a `Score` [ColumnName](xref:Microsoft.ML.Runtime.Api.ColumnNameAttribute) attribute applied. The **Score** column is the special column in ML.NET. The model outputs predicted values into that column.

Contributor Author

@aditidugar please check that the description of the Score column is correct. The current version of the tutorial doesn't explain why the ColumnName attribute is used with the Score name. However, it's quite important to have that attribute (I guess a field named Score would also work).

Contributor

@OliaG might know better on this one. It sounds correct based on my understanding.

Contributor

@pkulikov that is correct. Either a field named Score or the attribute works.
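
For readers following this thread, here is a minimal sketch of the two data classes under discussion. It assumes the early `Microsoft.ML.Runtime.Api` attributes that the diff links to and needs the Microsoft.ML NuGet package to compile. The column index and the class name `TaxiTripFarePredictionAlt` are illustrative assumptions, not taken from the tutorial.

```csharp
using Microsoft.ML.Runtime.Api;

// Input row: each field maps to a source column by its index.
public class TaxiTrip
{
    // Assumption: the fare is stored in column 6 of the data file.
    [Column("6")]
    public float FareAmount;
}

// Output row, option 1: any field name, mapped to the special Score column.
public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

// Output row, option 2 (hypothetical class, per the replies above):
// a field that is literally named Score needs no attribute.
public class TaxiTripFarePredictionAlt
{
    public float Score;
}
```

Option 1 keeps a domain-specific field name in calling code, which appears to be why the tutorial uses the attribute.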

You'll refer to the columns without the underscores in the code you're creating. Copy the `FareAmount` column into a new column called "Label" using the `ColumnCopier()` function. This column is the **Label**.
In the next steps we refer to the columns by the names defined in the `TaxiTrip` class.

When the model is trained, the values in the column named **Label** are considered as values to be predicted. As we want to predict the taxi trip fare, copy the `FareAmount` column into the **Label** column. To do that, use <xref:Microsoft.ML.Transforms.ColumnCopier> and add the following code:

Contributor Author

@aditidugar please check the updated wording about the Label column, which is another special column in ML.NET.

Contributor

@OliaG again.

Contributor

Yes, or have one of your fields called Label.

@mairaw mairaw requested a review from OliaG June 11, 2018 02:34
@mairaw mairaw requested review from GalOshri and aditidugar-zz June 12, 2018 21:49
@mairaw mairaw added waiting-on-reviews and removed waiting-on-reviews waiting-on-feedback labels Jun 12, 2018

Contributor

@OliaG OliaG left a comment

Thank you very much Petr for your contribution!

> * Understand the problem
> * Select the appropriate machine learning task
> * Prepare and understand your data
> * Prepare and understand the data

@JRAlexander, what are the conventions for "your" vs "the" for this repo and .NET docs? I see other tutorials are also using phrases like "your data".

Contributor

@mairaw would be the best person to respond.

Contributor

Our conventions accept both, so it's decided case by case. I think using "the" is better in this case.

## Load and transform data

Next, load your data into the pipeline. Point to the `_datapath` created initially and specify the delimiter of the .csv file (,). Add the following code into the `Train()` method underneath the last step:
The first step that the learning pipeline performs is loading data from the training data set. In our case, training data set is stored in the text file, which path is defined by the value of the `_datapath` constant. That file contains the header with the column names, so the first row should be ignored while loading data. Columns in the file are separated by the comma (","). Add the following code into the `Train` method:

Perhaps "with a path defined by the `_datapath` constant"?

```

The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code:
The last step in data preparation combines all of the **features** into one vector using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This necessary step helps the algorithm to easily process the features. Add the following code:

It's not really that it helps the algorithm to easily process the features, but rather that the learner looks for the "Features" column by default and only uses that column as the features.

## Choose a learning algorithm

After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a learner called `FastTreeRegressor()` to the pipeline that utilizes **gradient boosting**.
After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learning algorithm trains the model. You chose a **regression task** for this problem, so you add a <xref:Microsoft.ML.Trainers.FastTreeRegressor> learner that utilizes **gradient boosting** to the pipeline.

Perhaps it's worth mentioning that there are other learners that can also be used for regression? This makes it sound like FastTreeRegressor is the only option.

Contributor Author

That's a good suggestion. However, I suggest adding it later, when the topic that describes learners is ready. Then there will be something to reference. What do you think?

That works as well. My only concern is that if someone does not read all the way to the end, they might leave with the impression that it is the only learner. Perhaps just change the wording a bit and expand on it later?

@pkulikov
Contributor Author

@GalOshri thank you for the review. I've addressed your feedback in the last three commits. Please check them.

I've mentioned that there are other regression learners available. Anyway, I'll come back to that paragraph after PR #5698 is merged and published.

```

The last step in data preparation combines all of your **features** into one vector using the `ColumnConcatenator()` function. This necessary step helps the algorithm easily process your features. Add the following code:
The last step in data preparation combines all of the feature columns into the **Features** column using the <xref:Microsoft.ML.Transforms.ColumnConcatenator> transformation class. This step is necessary as a learner processes only features from the **Features** column. Add the following code:

Not sure if it is worth clarifying that using the Features column is the default behavior, but that it can be changed to a different column (still a single column), or whether this just complicates the tutorial.

Contributor Author

I think it would complicate the tutorial, so let's keep that out.
However, a separate topic that explains predefined/special columns in ML.NET and how to work with them might be useful.
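
To make the special-column discussion concrete, here is a hedged sketch of how the pipeline steps quoted in this review could fit together. It assumes the early `LearningPipeline` API and the Microsoft.ML NuGet package. The `TextLoader` call shape changed between early releases, so treat that line in particular as an assumption. The trimmed `TaxiTrip` class, its column indices, and the `PipelineSketch` helper are hypothetical and exist only to keep the example self-contained.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

// Hypothetical, trimmed-down input class; the column indices are assumptions.
public class TaxiTrip
{
    [Column("0")] public float PassengerCount;
    [Column("1")] public float TripDistance;
    [Column("2")] public float FareAmount;
}

public static class PipelineSketch
{
    public static LearningPipeline Build(string dataPath)
    {
        var pipeline = new LearningPipeline();

        // Load comma-separated data and skip the header row.
        pipeline.Add(new TextLoader(dataPath)
            .CreateFrom<TaxiTrip>(useHeader: true, separator: ','));

        // Label: the learner treats this column as the value to predict.
        pipeline.Add(new ColumnCopier(("FareAmount", "Label")));

        // Features: by default the learner reads its inputs only from this column.
        pipeline.Add(new ColumnConcatenator("Features", "PassengerCount", "TripDistance"));

        // One of several available regression learners.
        pipeline.Add(new FastTreeRegressor());

        return pipeline;
    }
}
```

The sketch leans on the default **Label** and **Features** column names throughout, which matches the decision above to keep custom column names out of the tutorial.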

@GalOshri GalOshri left a comment

Thank you for making these changes!

@pkulikov
Contributor Author

@JRAlexander @mairaw it looks like this PR can be merged.

Contributor

@JRAlexander JRAlexander left a comment

LGTM. Thanks, @pkulikov, for your contribution! I'll merge and you should see it live in a few days.

@JRAlexander JRAlexander merged commit 4903733 into dotnet:master Jun 20, 2018
@pkulikov pkulikov deleted the ml-taxi-fare-tutorial-updates branch June 20, 2018 18:56
Successfully merging this pull request may close these issues.

Reason to exclude "trip_time_in_secs" ?