docs/machine-learning/tutorials/sentiment-analysis.md
Lines changed: 30 additions & 31 deletions
@@ -1,30 +1,29 @@
1
1
---
2
2
title: Use ML.NET in a sentiment analysis classification scenario
3
3
description: Discover how to use ML.NET in a classification scenario to understand how to use sentiment prediction to take the appropriate action.
4
-
ms.date: 05/21/2018
4
+
ms.date: 05/24/2018
5
5
ms.custom: mvc
6
6
#Customer intent: As a developer, I want to use ML.NET to apply a binary classification task so that I can understand how to use sentiment prediction to take appropriate action.
7
7
---
8
-
# Tutorial: Use the ML.NET APIs in a sentiment analysis classification scenario
8
+
# Tutorial: Use ML.NET in a sentiment analysis classification scenario
9
9
10
-
This sample tutorial illustrates using the ML.NET API to create a sentiment classifier via a .NET Core console application using C# in Visual Studio 2017.
10
+
> [!NOTE]
11
+
> This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit [the ML.NET introduction](https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet).
12
+
13
+
This sample tutorial illustrates using ML.NET to create a sentiment classifier via a .NET Core console application using C# in Visual Studio 2017.
11
14
12
15
In this tutorial, you learn how to:
13
16
> [!div class="checklist"]
14
17
> * Understand the problem
15
18
> * Create the learning pipeline
16
19
> * Load a classifier
17
20
> * Train the model
18
-
> * Predict the model
21
+
> * Predict the test data outcomes with the model
19
22
> * Evaluate the model with a different dataset
20
23
21
24
## Sentiment analysis sample overview
22
25
23
-
The sample is a console app that uses the ML.NET API to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from University of California, Irvine (UCI).
24
-
25
-
Prediction and evaluation results are displayed accordingly so that analysis and action can be taken.
26
-
27
-
Sentiment analysis is either positive or negative. So, you can use classification to train the model, for prediction, and for evaluation.
26
+
The sample is a console app that uses ML.NET to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from the WikiDetox project.
28
27
29
28
## Machine learning workflow
30
29
@@ -52,13 +51,17 @@ You then need to **determine** the sentiment, which helps you with the machine l
52
51
With this problem, you know the following facts:
53
52
54
53
Training data: website comments can be positive or negative (**sentiment**).
55
-
Predict the **sentiment** of a new website comment, either positive or negative.
54
+
Predict the **sentiment** of a new website comment, either positive or negative, such as in the following examples:
55
+
56
+
* Please refrain from adding nonsense to Wikipedia.
57
+
* He is the best, and the article should say that.
56
58
57
59
## Prerequisites
58
60
59
61
* [Visual Studio 2017 15.6 or later](https://www.visualstudio.com/downloads/?utm_medium=microsoft&utm_source=docs.microsoft.com&utm_campaign=button+cta&utm_content=download+vs2017) with the ".NET Core cross-platform development" workload installed.
60
62
61
-
* [The UCI Sentiment Labeled Sentences dataset zip file](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip)
63
+
* The [Wikipedia detox line data tab separated file (wikipedia-detox-250-line-data.tsv)](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-data.tsv).
64
+
* The [Wikipedia detox line test tab separated file (wikipedia-detox-250-line-test.tsv)](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-test.tsv).
62
65
63
66
## Create a console application
64
67
@@ -74,15 +77,10 @@ Predict the **sentiment** of a new website comment, either positive or negative.
74
77
75
78
### Prepare your data
76
79
77
-
1. Download [The UCI Sentiment Labeled Sentences dataset zip file (see citations in the following note)](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip), unzip the file and copy the following two files into the *Data* directory you created:
80
+
1. Download the [wikipedia-detox-250-line-data.tsv](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-data.tsv) and the [wikipedia-detox-250-line-test.tsv](https://github.com/dotnet/machinelearning/blob/master/test/data/wikipedia-detox-250-line-test.tsv) datasets and save them to the *Data* folder you created previously. The first dataset trains the machine learning model, and the second can be used to evaluate how accurate your model is.
78
81
79
-
* *imdb_labelled.txt*
80
-
* *yelp_labelled.txt*
81
82
82
-
> [!NOTE]
83
-
> The datasets this tutorial uses are from 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015, and hosted at the UCI Machine Learning Repository - Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
84
-
85
-
2. In Solution Explorer, right-click each of the \*.txt files and select **Properties**. Under **Advanced**, change the value of **Copy to Output Directory** to **Always**.
83
+
2. In Solution Explorer, right-click each of the \*.tsv files and select **Properties**. Under **Advanced**, change the value of **Copy to Output Directory** to **Always**.
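Before wiring the files into a pipeline, it can help to check that they parse the way you expect. The following is a minimal sketch, assuming each row holds a numeric label, a tab, and then the comment text; that layout, the `TsvSketch` class, and the sample rows (including their 0/1 labels) are illustrative assumptions and are not taken from the tutorial's sample code or the real data files.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TsvSketch
{
    // Parses "label<TAB>comment" lines into (label, text) pairs.
    // Lines whose label field is not numeric (such as a header row) are skipped.
    public static List<(float Label, string Text)> Parse(IEnumerable<string> lines)
    {
        var rows = new List<(float Label, string Text)>();
        foreach (string line in lines)
        {
            // Split on the first tab only, so tabs inside the comment are preserved.
            string[] parts = line.Split(new[] { '\t' }, 2);
            if (parts.Length < 2)
                continue; // empty or malformed line

            if (!float.TryParse(parts[0], NumberStyles.Float,
                                CultureInfo.InvariantCulture, out float label))
                continue; // header row or non-numeric label

            rows.Add((label, parts[1]));
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical rows; the labels are placeholders, not real annotations.
        var sample = new[]
        {
            "Sentiment\tSentimentText",
            "1\tPlease refrain from adding nonsense to Wikipedia.",
            "0\tHe is the best, and the article should say that."
        };

        foreach (var (label, text) in Parse(sample))
            Console.WriteLine($"{label}: {text}");
    }
}
```

To run the same check against a downloaded file, you could pass `System.IO.File.ReadLines(path)` to `Parse` instead of the in-memory sample.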
86
84
87
85
### Create classes and define paths
88
86
@@ -113,7 +111,7 @@ Remove the existing class definition and add the following code, which has two c
113
111
114
112
[!code-csharp[DeclareTypes](../../../samples/machine-learning/tutorials/SentimentAnalysis/SentimentData.cs#2 "Declare data record types")]
115
113
116
-
`SentimentData` is the input dataset class and has a string for the comment (`SentimentText`), a `float` (`Sentiment`) that has a value for sentiment of either positive or negative. Both fields have `Column` attributes attached to them. This attribute describes the order of each field in the data file, and which is the `Label` field. `SentimentPrediction` is the class used for prediction after the model has been trained. It has a single boolean (`Sentiment`) and a `PredictedLabel` `ColumnName` attribute. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
114
+
`SentimentData` is the input dataset class. It has a `float` (`Sentiment`) that holds the sentiment value, either positive or negative, and a string for the comment (`SentimentText`). Both fields have `Column` attributes attached to them. This attribute describes the order of each field in the data file and identifies which field is the `Label`. `SentimentPrediction` is the class used for prediction after the model has been trained. It has a single boolean (`Sentiment`) with a `ColumnName` attribute set to `PredictedLabel`. The `Label` is used to create and train the model, and it's also used with a second dataset to evaluate the model. The `PredictedLabel` is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
117
115
118
116
In the *Program.cs* file, replace the `Console.WriteLine("Hello World!")` line with the following code in the `Main` method:
119
117
@@ -149,7 +147,7 @@ The <xref:Microsoft.ML.TextLoader%601> object is the first part of the pipeline,
149
147
150
148
Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results. ML.NET's transform pipelines allow you to compose a custom set of transforms that are applied to your data before training or testing. The transforms' primary purpose is data featurization. An advantage of a transform pipeline is that, after you define it, you can save the pipeline and apply it to test data.
151
149
152
-
Apply a <xref:Microsoft.ML.Transforms.TextFeaturizer> to convert the `SentimentText` column into a numeric vector called `Features` used by the machine learning algorithm. This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model. Add `TextFeaturizer` to the pipeline as the next line of code:
150
+
Apply a <xref:Microsoft.ML.Transforms.TextFeaturizer> to convert the `SentimentText` column into a [numeric vector](../resources/glossary.md#numerical-feature-vector) called `Features` used by the machine learning algorithm. This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model. Add `TextFeaturizer` to the pipeline as the next line of code:
153
151
154
152
[!code-csharp[TextFeaturizer](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#7 "Add a TextFeaturizer to the pipeline")]
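To make the idea of turning text into a numeric vector concrete, here is a deliberately simplified bag-of-words sketch in plain C#. It is not how `TextFeaturizer` works internally, and the `BagOfWordsSketch` class and its methods are hypothetical names introduced only for illustration; it just shows one basic way a comment can become a fixed-length vector of word counts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BagOfWordsSketch
{
    // Lowercases the text and splits it into words on any character
    // that is not a letter or digit.
    public static string[] Tokenize(string text) =>
        new string(text.ToLowerInvariant()
                       .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
                       .ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Assigns each distinct word an index, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> texts)
    {
        var vocab = new Dictionary<string, int>();
        foreach (string word in texts.SelectMany(Tokenize))
        {
            if (!vocab.ContainsKey(word))
                vocab.Add(word, vocab.Count);
        }
        return vocab;
    }

    // Counts how often each vocabulary word occurs in the text.
    // Words that are not in the vocabulary are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocab)
    {
        var features = new float[vocab.Count];
        foreach (string word in Tokenize(text))
        {
            if (vocab.TryGetValue(word, out int index))
                features[index] += 1f;
        }
        return features;
    }

    public static void Main()
    {
        // Vocabulary order: he, is, the, best, article
        var vocab = BuildVocabulary(new[] { "He is the best", "the best article" });

        // Prints: 0, 0, 2, 2, 0
        Console.WriteLine(string.Join(", ", Featurize("the best, the best!", vocab)));
    }
}
```

The vector length equals the vocabulary size, so every comment maps to the same number of features regardless of its length, which is what a learner needs as input.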
155
153
@@ -167,7 +165,7 @@ Classification tasks are frequently one of the following types:
167
165
* Binary: either A or B.
168
166
* Multiclass: multiple categories that can be predicted by using a single model.
169
167
170
-
The <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier> object is a decision tree learner you'll use in this pipeline. Similar to the featurization step, trying out different learners available in ML.NET and changing their parameters leads to different results. For tuning, you can set hyperparameters like <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumTrees>, <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumLeaves>, and <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.MinDocumentsInLeafs>. These hyperparameters are set before anything affects the model and are modelspecific. They're used to tune the decision tree for performance, so larger values can negatively impact performance.
168
+
The <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier> object is a decision tree learner you'll use in this pipeline. Similar to the featurization step, trying out different learners available in ML.NET and changing their parameters leads to different results. For tuning, you can set [hyperparameters](../resources/glossary.md#hyperparameter) like <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumTrees>, <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.NumLeaves>, and <xref:Microsoft.ML.Trainers.FastTreeBinaryClassifier.MinDocumentsInLeafs>. These hyperparameters are set before training begins and are model-specific. They're used to tune the decision tree, and larger values can negatively impact performance.
171
169
172
170
Add the following code to the `TrainAndPredict` method:
173
171
@@ -181,7 +179,7 @@ Add the following code to the `TrainAndPredict` method:
181
179
182
180
[!code-csharp[TrainModel](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#9 "Train the model")]
183
181
184
-
## Predict the model
182
+
## Predict the test data outcomes with the model
185
183
186
184
Add some comments to test the trained model's predictions in the `TrainAndPredict` method:
187
185
@@ -210,7 +208,7 @@ To do that, right-click on the project node in **Solution Explorer** and select
210
208
211
209
#### Return the model trained to use for evaluation
212
210
213
-
Return the model at the end of the `TrainAndPredict` method. At this point, you could then save it to a zip file or continue to work with it. For this tutorial, you're going to work with it, so add the following code to the next line in `TrainAndPredict`:
211
+
Return the model at the end of the `TrainAndPredict` method. At this point, you have a model that you can integrate into any of your existing or new .NET applications, or you can continue to work with it. For this tutorial, you're going to work with it, so add the following code to the next line in `TrainAndPredict`:
214
212
215
213
[!code-csharp[ReturnModel](../../../samples/machine-learning/tutorials/SentimentAnalysis/Program.cs#15 "Return the model")]
216
214
@@ -243,7 +241,7 @@ The <xref:Microsoft.ML.Models.BinaryClassificationMetrics> contains the overall
243
241
244
242
### Displaying the metrics for model validation
245
243
246
-
Use the following code to display the metrics, share the results, and act on them accordingly:
244
+
Use the following code to display the metrics, share the results, and then act on them:
@@ -252,18 +250,19 @@ Use the following code to display the metrics, share the results, and act on the
252
250
Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings or processing messages. These have been removed from the following results for clarity.
253
251
254
252
```
253
+
255
254
Sentiment Predictions
256
255
---------------------
257
-
Sentiment: Contoso's 11 is a wonderful experience | Prediction: Positive
258
-
Sentiment:The acting in this movie is really bad | Prediction: Negative
259
-
Sentiment: Joe versus the Volcano Coffee Company is a great film. | Prediction: Positive
256
+
Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Negative
257
+
Sentiment: He is the best, and the article should say that. | Prediction: Positive
260
258
261
259
262
260
PredictionModel quality metrics evaluation
263
261
------------------------------------------
264
-
Accuracy: 67.30%
265
-
Auc: 73.78%
266
-
F1Score: 65.25%
262
+
Accuracy: 66.67%
263
+
Auc: 94.44%
264
+
F1Score: 75.00%
265
+
267
266
```
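For readers who want to see how two of the reported metrics relate to individual predictions, the sketch below computes accuracy and F1 score from confusion-matrix counts using their standard definitions. The `MetricsSketch` class is a hypothetical helper, and the counts are made up for the arithmetic; they are not derived from the tutorial's datasets. AUC is omitted because it depends on the ranking of prediction scores, not only on these four counts.

```csharp
using System;
using System.Globalization;

public static class MetricsSketch
{
    // Fraction of all predictions that were correct.
    public static double Accuracy(int tp, int fp, int fn, int tn) =>
        (double)(tp + tn) / (tp + fp + fn + tn);

    // Harmonic mean of precision and recall.
    // Returns NaN when there are no true positives and no predicted or actual positives.
    public static double F1(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp);
        double recall = (double)tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    private static string Percent(double value) =>
        (100 * value).ToString("F2", CultureInfo.InvariantCulture) + "%";

    public static void Main()
    {
        // Hypothetical counts: true positives, false positives,
        // false negatives, true negatives.
        int tp = 3, fp = 2, fn = 0, tn = 1;

        // Accuracy = (3 + 1) / 6; F1 = 2 * 0.6 * 1.0 / 1.6
        Console.WriteLine("Accuracy: " + Percent(Accuracy(tp, fp, fn, tn)));
        Console.WriteLine("F1Score: " + Percent(F1(tp, fp, fn)));
    }
}
```

With these counts, accuracy is 4/6 (about 0.667) and F1 is 0.75. The gap between the two shows why it is worth reading more than one metric: the false positives lower precision and accuracy even though recall is perfect.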
268
267
269
268
Congratulations! You've now successfully built a machine learning model for classifying and predicting message sentiment. You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/SentimentAnalysis) repository.
@@ -276,7 +275,7 @@ In this tutorial, you learned how to: