
Commit 1aa1f3c

Update SentimentAnalysis tutorial with new datasets (#5616)
* Update SentimentAnalysis tutorial with new datasets
* Fixed link
* Revised based on feedback
1 parent a90328a commit 1aa1f3c

1 file changed: +30 -31 lines changed

docs/machine-learning/tutorials/sentiment-analysis.md

Lines changed: 30 additions & 31 deletions
@@ -1,30 +1,29 @@
11
---
22
title: Use ML.NET in a sentiment analysis classification scenario
33
description: Discover how to use ML.NET in a classification scenario to understand how to use sentiment prediction to take the appropriaste action.
4-
ms.date: 05/21/2018
4+
ms.date: 05/24/2018
55
ms.custom: mvc
66
#Customer intent: As a developer, I want to use ML.NET to apply a binary classification task so that I can understand how to use sentiment prediction to take appropriaste action.
77
---
8-
# Tutorial: Use the ML.NET APIs in a sentiment analysis classification scenario
8+
# Tutorial: Use ML.NET in a sentiment analysis classification scenario
99

10-
This sample tutorial illustrates using the ML.NET API to create a sentiment classifier via a .NET Core console application using C# in Visual Studio 2017.
10+
> [!NOTE]
11+
> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet).
12+
13+
This sample tutorial illustrates using ML.NET to create a sentiment classifier via a .NET Core console application using C# in Visual Studio 2017.
1114

1215
In this tutorial, you learn how to:
1316
> [!div class="checklist"]
1417
> * Understand the problem
1518
> * Create the learning pipeline
1619
> * Load a classifier
1720
> * Train the model
18-
> * Predict the model
21+
> * Predict the test data outcomes with the model
1922
> * Evaluate the model with a different dataset
2023
2124
## Sentiment analysis sample overview
2225

23-
The sample is a console app that uses the ML.NET API to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from University of California, Irvine (UCI).
24-
25-
Prediction and evaluation results are displayed accordingly so that analysis and action can be taken.
26-
27-
Sentiment analysis is either positive or negative. So, you can use classification to train the model, for prediction, and for evaluation.
26+
The sample is a console app that uses ML.NET to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from the WikiDetox project.
2827

2928
## Machine learning workflow
3029

@@ -52,13 +51,17 @@ You then need to **determine** the sentiment, which helps you with the machine l
5251
With this problem, you know the following facts:
5352

5453
Training data: website comments can be positive or negative (**sentiment**).
55-
Predict the **sentiment** of a new website comment, either positive or negative.
54+
Predict the **sentiment** of a new website comment, either positive or negative, such as in the following examples:
55+
56+
* Please refrain from adding nonsense to Wikipedia.
57+
* He is the best, and the article should say that.
5658

5759
## Prerequisites
5860

5961
* [Visual Studio 2017 15.6 or later](https://www.visualstudio.com/downloads/?utm_medium=microsoft&utm_source=docs.microsoft.com&utm_campaign=button+cta&utm_content=download+vs2017) with the ".NET Core cross-platform development" workload installed.
6062

61-
* [The UCI Sentiment Labeled Sentences dataset zip file](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip)
63+
* The [Wikipedia detox line data tab separated file (wikiPedia-detox-250-line-data.tsv)](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-data.tsv).
64+
* The [Wikipedia detox line test tab separated file (wikipedia-detox-250-line-test.tsv)](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-test.tsv).
6265

6366
## Create a console application
6467

@@ -74,15 +77,10 @@ Predict the **sentiment** of a new website comment, either positive or negative.
7477

7578
### Prepare your data
7679

77-
1. Download [The UCI Sentiment Labeled Sentences dataset zip file (see citations in the following note)](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip), unzip the file and copy the following two files into the *Data* directory you created:
80+
1. Download the [WikiPedia detox-250-line-data.tsv](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-data.tsv) and the [wikipedia-detox-250-line-test.tsv](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-test.tsv) data sets and save them to the *Data* folder previously created. The first dataset trains the machine learning model and the second can be used to evaluate how accurate your model is.
7881

79-
* *imdb_labelled.txt*
80-
* *yelp_labelled.txt*
8182

82-
> [!NOTE]
83-
> The datasets this tutorial uses are from the 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015, and hosted at the UCI Machine Learning Repository - Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
84-
85-
2. In Solution Explorer, right-click each of the \*.txt files and select **Properties**. Under **Advanced**, change the value of **Copy to Output Directory** to **Always**.
83+
2. In Solution Explorer, right-click each of the \*.tsv files and select **Properties**. Under **Advanced**, change the value of **Copy to Output Directory** to **Always**.
8684

8785
### Create classes and define paths
8886

@@ -113,7 +111,7 @@ Remove the existing class definition and add the following code, which has two c
113111

114112
[!code-csharp[DeclareTypes](../../../samples/machine-learning/tutorials/SentimentAnalysis/SentimentData.cs#2 "Declare data record types")]
115113

116-
`SentimentData` is the input dataset class and has a string for the comment (`SentimentText`), a `float` (`Sentiment`) that has a value for sentiment of either positive or negative. Both fields have `Column` attributes attached to them. This attribute describes the order of each field in the data file, and which is the `Label` field. `SentimentPrediction` is the class used for prediction after the model has been trained. It has a single boolean (`Sentiment`) and a `PredictedLabel` `ColumnName` attribute. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
114+
`SentimentData` is the input dataset class and has a `float` (`Sentiment`) that has a value for sentiment of either positive or negative, and a string for the comment (`SentimentText`). Both fields have `Column` attributes attached to them. This attribute describes the order of each field in the data file, and which is the `Label` field. `SentimentPrediction` is the class used for prediction after the model has been trained. It has a single boolean (`Sentiment`) and a `PredictedLabel` `ColumnName` attribute. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
117115

118116
In the *Program.cs* file, replace the `Console.WriteLine("Hello World!")` line with the following code in the `Main` method:
119117

@@ -149,7 +147,7 @@ The <xref:Microsoft.ML.TextLoader%601> object is the first part of the pipeline,
149147

150148
Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results. ML.NET's transform pipelines allow you to compose a custom set of transforms that are applied to your data before training or testing. The transforms' primary purpose is for data featurization. A transform pipeline's advantage is that after transform pipeline definition, save the pipeline to apply it to test data.
151149

152-
Apply a <xref:Microsoft.ML.Transforms.TextFeaturizer> to convert the `SentimentText` column into a numeric vector called `Features` used by the machine learning algorithm. This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model. Add `TextFeaturizer` to the pipeline as the next line of code:
150+
Apply a <xref:Microsoft.ML.Transforms.TextFeaturizer> to convert the `SentimentText` column into a [numeric vector](../resources/glossary.md#numerical-feature-vector) called `Features` used by the machine learning algorithm. This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model. Add `TextFeaturizer` to the pipeline as the next line of code:
153151

154152
[!code-csharp[TextFeaturizer](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#7 "Add a TextFeaturizer to the pipeline")]
155153

@@ -167,7 +165,7 @@ Classification tasks are frequently one of the following types:
167165
* Binary: either A or B.
168166
* Multiclass: multiple categories that can be predicted by using a single model.
169167

170-
The <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier> object is a decision tree learner you'll use in this pipeline. Similar to the featurization step, trying out different learners available in ML.NET and changing their parameters leads to different results. For tuning, you can set hyperparameters like <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumTrees>, <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumLeaves>, and <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.MinDocumentsInLeafs>. These hyperparameters are set before anything affects the model and are model specific. They're used to tune the decision tree for performance, so larger values can negatively impact performance.
168+
The <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier> object is a decision tree learner you'll use in this pipeline. Similar to the featurization step, trying out different learners available in ML.NET and changing their parameters leads to different results. For tuning, you can set [hyperparameters](../resources/glossary.md#hyperparameter) like <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumTrees>, <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumLeaves>, and <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.MinDocumentsInLeafs>. These hyperparameters are set before anything affects the model and are model-specific. They're used to tune the decision tree for performance, so larger values can negatively impact performance.
171169

172170
Add the following code to the `TrainAndPredict` method:
173171

@@ -181,7 +179,7 @@ Add the following code to the `TrainAndPredict` method:
181179

182180
[!code-csharp[TrainModel](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#9 "Train the model")]
183181

184-
## Predict the model
182+
## Predict the test data outcomes with the model
185183

186184
Add some comments to test the trained model's predictions in the `TrainAndPredict` method:
187185

@@ -210,7 +208,7 @@ To do that, right-click on the project node in **Solution Explorer** and select
210208

211209
#### Return the model trained to use for evaluation
212210

213-
Return the model at the end of the `TrainAndPredict` method. At this point, you could then save it to a zip file or continue to work with it. For this tutorial, you're going to work with it, so add the following code to the next line in `TrainAndPredict`:
211+
Return the model at the end of the `TrainAndPredict` method. At this point, you have a model that can be integrated into any of your existing or new .NET applications, or continue to work with it. For this tutorial, you're going to work with it, so add the following code to the next line in `TrainAndPredict`:
214212

215213
[!code-csharp[ReturnModel](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#15 "Return the model")]
216214

@@ -243,7 +241,7 @@ The <xref:Microsoft.ML.Models.BinaryClassificationMetrics> contains the overall
243241

244242
### Displaying the metrics for model validation
245243

246-
Use the following code to display the metrics, share the results, and act on them accordingly:
244+
Use the following code to display the metrics, share the results, and then act on them:
247245

248246
[!code-csharp[DisplayMetrics](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#21 "Display selected metrics")]
249247

@@ -252,18 +250,19 @@ Use the following code to display the metrics, share the results, and act on the
252250
Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings, or processing messages. These have been removed from the following results for clarity.
253251

254252
```
253+
255254
Sentiment Predictions
256255
---------------------
257-
Sentiment: Contoso's 11 is a wonderful experience | Prediction: Positive
258-
Sentiment:The acting in this movie is really bad | Prediction: Negative
259-
Sentiment: Joe versus the Volcano Coffee Company is a great film. | Prediction: Positive
256+
Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Negative
257+
Sentiment: He is the best, and the article should say that. | Prediction: Positive
260258
261259
262260
PredictionModel quality metrics evaluation
263261
------------------------------------------
264-
Accuracy: 67.30%
265-
Auc: 73.78%
266-
F1Score: 65.25%
262+
Accuracy: 66.67%
263+
Auc: 94.44%
264+
F1Score: 75.00%
265+
267266
```
268267

269268
Congratulations! You've now successfully built a machine learning model for classifying and predicting messages sentiment. You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/SentimentAnalysis) repository.
@@ -276,7 +275,7 @@ In this tutorial, you learned how to:
276275
> * Create the learning pipeline
277276
> * Load a classifier
278277
> * Train the model
279-
> * Predict the model
278+
> * Predict the test data outcomes with the model
280279
> * Evaluate the model with a different dataset
281280
282281
Advance to the next tutorial to learn more
