Create a spam classifier with Watson Natural Language Classifier. This repo provides code samples and instructions to support the IBM developerWorks article, "Create a natural language classifier that identifies spam".
Learn how to train a spam classifier, validate its accuracy, classify new texts, and run it as a web application, all with Watson Natural Language Classifier.
This project contains:
- Training data
- Test data
- A Python script to measure accuracy

You also need:
- An instance of the Watson Natural Language Classifier service on IBM Cloud (see blog)

Files in this repository:
- data/SpamHam-Train.csv - SpamHam training data
- data/SpamHam-Test.json - SpamHam test data
- spam.py - a Python script used to measure the accuracy of the classifier
- web/ - the Node.js based web demo (http://watsonnlcspam.mybluemix.net)

Data files are a transform of [SMS Spam Collection v.1](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) (UCI's SMS Spam Collection Data Set).
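
The transform itself is not described here, so the following is only a minimal sketch of what such a conversion might look like. It assumes the original collection stores one message per line as `label<TAB>text`, and that Watson NLC training CSV expects the text first and the class second; the function name `to_nlc_csv` and the two sample messages are hypothetical and are not taken from the data set.

```python
import csv
import io


def to_nlc_csv(lines):
    """Convert 'label<TAB>text' lines into 'text,label' CSV rows (assumed formats)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so tabs inside the message survive.
        label, text = line.split("\t", 1)
        writer.writerow([text, label])
    return out.getvalue()


# Made-up sample lines, for illustration only.
raw = ["ham\tSee you at 5, ok?\n", "spam\tWin cash now\n"]
print(to_nlc_csv(raw))
```

The `csv` module quotes any message that contains a comma, which keeps each row at exactly two columns.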
See the "Create a natural language classifier that identifies spam" developerWorks article for details or (for less detail) follow the general outline below.
- Go to IBM Cloud
- From the IBM Cloud catalog, select Watson Natural Language Classifier
Training the classifier is straightforward: provide training data in a Watson NLC compatible format and POST a request to the Watson NLC /classifiers REST endpoint.
- Open data/SpamHam-Train.csv to view the data format
- Train Watson NLC:

      curl -X POST -u username:password \
        -F training_data=@SpamHam-Train.csv \
        -F training_metadata="{\"language\":\"en\",\"name\":\"My Classifier\"}" \
        "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"
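
For readers who prefer Python to curl, the same training request can be sketched with the `requests` library. This is an illustration under assumptions, not the article's method: the helper names `build_training_metadata` and `train_classifier` are hypothetical, the URL and form field names are copied from the curl command above, and the credentials are placeholders you supply yourself.

```python
import json

# URL copied from the curl command in this README.
CLASSIFIERS_URL = (
    "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"
)


def build_training_metadata(name, language="en"):
    """Return the JSON string sent as the training_metadata form field."""
    return json.dumps({"language": language, "name": name})


def train_classifier(csv_path, username, password, name="My Classifier"):
    """POST the training CSV and metadata; return the decoded JSON response."""
    import requests  # imported lazily so the helper above works without it

    with open(csv_path, "rb") as training_file:
        response = requests.post(
            CLASSIFIERS_URL,
            auth=(username, password),  # HTTP basic auth, like curl -u
            files={
                "training_data": training_file,
                # (None, value) sends a plain form field with no filename.
                "training_metadata": (None, build_training_metadata(name)),
            },
        )
    response.raise_for_status()
    return response.json()
```

A call such as `train_classifier("data/SpamHam-Train.csv", "username", "password")` would send the request; inspect the returned JSON for the identifier of the new classifier.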
- Open spam.py and supply values for:
  - YOUR_CLASSIFIER_ID
  - YOUR_CLASSIFIER_USERNAME
  - YOUR_CLASSIFIER_PASSWORD
- Run pip install requests
- Run python spam.py
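
The accuracy measurement can be pictured as follows. This is a minimal sketch and not the contents of spam.py: it assumes the test data can be reduced to (text, expected label) pairs, and it uses a hypothetical keyword-based stub, `keyword_stub`, in place of a real call to the trained classifier. The sample messages are made up for illustration.

```python
def measure_accuracy(examples, classify):
    """Return the fraction of (text, expected_label) pairs classified correctly."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for text, expected in examples if classify(text) == expected)
    return correct / len(examples)


def keyword_stub(text):
    """Stand-in for a request to the trained classifier (hypothetical)."""
    return "spam" if "free" in text.lower() else "ham"


# Made-up messages, for illustration only.
sample = [
    ("Win a FREE prize now", "spam"),
    ("Are we still on for lunch?", "ham"),
    ("Call this number to claim your reward", "spam"),
    ("Free for coffee later?", "ham"),
]
print(measure_accuracy(sample, keyword_stub))  # prints 0.5
```

Passing the classifier in as a function keeps the scoring logic separate from the network call, so the stub can later be swapped for one that queries Watson NLC.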
Use Watson Natural Language Classifier to predict spam. The training data is a public set of 5,574 English SMS messages collected for mobile phone spam research.
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam. More information can be found here.
A comprehensive study of this data can be found in the following papers:
- Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. (preprint)
- Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012. (preprint)
- Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18. (Invited paper - full version)
Copyright 2015-2018 Carmine M DiMascio
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.