This repository contains code to generate word embeddings using the Swivel algorithm on IBM Watson Machine Learning. This model is part of the IBM Code Model Asset Exchange.
Machine learning algorithms usually expect numeric inputs. When a data scientist wants to use text to create a machine learning model, they must first find a way to represent the text as vectors of numbers. These vectors are called word embeddings. The Swivel algorithm is a frequency-based word embedding method that uses a co-occurrence matrix. The idea is that words with similar meanings tend to occur together in a text corpus. As a result, words with similar meanings have vector representations that are closer together than those of unrelated words.
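
To make the co-occurrence idea concrete, here is a minimal sketch that counts how often pairs of words appear near each other. It is not the repository's preprocessing code: the function name, the `window` parameter, and the toy sentence are assumptions for illustration, and Swivel itself works on a far larger matrix with additional processing that is omitted here.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, context) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Toy corpus, chosen only for illustration.
tokens = "the dog chased the cat and the cat chased the dog".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[("dog", "chased")], counts[("cat", "chased")])
```

Each row of the resulting matrix describes the contexts a word appears in; an embedding method then compresses those rows into short dense vectors.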
This demo contains scripts to run the Swivel algorithm on a preprocessed Wikipedia text corpus. To generate word embeddings from your own text corpus, see the instructions in the original repository here.
Domain | Application | Industry | Framework | Training Data | Input Data Format |
---|---|---|---|---|---|
Text/NLP | Natural Language | General | TensorFlow | Any Text Corpus (e.g. Wiki Dump) | Text |
[1] N. Shazeer, R. Doherty, C. Evans, C. Waterson, "Swivel: Improving Embeddings by Noticing What's Missing," arXiv preprint arXiv:1602.02215 (2016).
Component | License | Link |
---|---|---|
This repository | Apache 2.0 | LICENSE |
Model Code (3rd party) | Apache 2.0 | TensorFlow Models |
Data | CC BY-SA 3.0 | Wikipedia Text Dump |
- This experiment requires a provisioned instance of IBM Watson Machine Learning service.
- Create an IBM Cloud Object Storage account if you don't have one (https://www.ibm.com/cloud/storage)
- Create credentials for either reading and writing or just reading
  - From the Bluemix console page (https://console.bluemix.net/dashboard/apps/), choose `Cloud Object Storage`
  - On the left side, click `service credentials`
  - Click the `new credentials` button to create new credentials
  - In the `Add New Credentials` popup, use the parameter `{"HMAC":true}` in `Add Inline Configuration...`
  - When you create the credentials, copy the `access_key_id` and `secret_access_key` values
  - Make a note of the endpoint URL
    - On the left side of the window, click `Endpoint`
    - Copy the relevant public or private endpoint. [I chose the us-geo private endpoint.]
- In addition, set up the AWS S3 command line, which can be used to create buckets and/or add files to COS.
  - Export `AWS_ACCESS_KEY_ID` with your COS `access_key_id` and `AWS_SECRET_ACCESS_KEY` with your COS `secret_access_key`
- Install IBM Cloud CLI
  - Log in using `bx login`, or `bx login --sso` if within IBM
- Install ML CLI Plugin
  - After installing, check whether any plugins need an update:
    - `bx plugin update`
  - Make sure to set up the various environment variables correctly:
    - `ML_INSTANCE`, `ML_USERNAME`, `ML_PASSWORD`, `ML_ENV`
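
Before running the scripts, it can help to confirm that the exports above actually took effect. The following is a hypothetical helper, not part of the repository; only the variable names come from the setup steps, and the assumption is that all six must be non-empty.

```python
import os

# Variable names taken from the setup steps; the helper itself is hypothetical.
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "ML_INSTANCE", "ML_USERNAME", "ML_PASSWORD", "ML_ENV",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```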
The `train.sh` utility script deploys the experiment to WML and starts the training as a `training-run`:

    train.sh

After training starts, the script should print the training ID, which is needed for the steps below:

    Starting to train ...
    OK
    Model-ID is 'training-GCtN_YRig'

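
If you capture the script's output, the training ID can be pulled out programmatically. This is a minimal sketch that assumes the output always contains a line of the form `Model-ID is '...'`, as in the sample above; the function name is hypothetical.

```python
import re

def extract_training_id(output):
    """Return the ID quoted on the "Model-ID is '...'" line, or None if it is absent."""
    match = re.search(r"Model-ID is '([^']+)'", output)
    return match.group(1) if match else None

# Sample text copied from the output shown in this README.
sample = "Starting to train ...\nOK\nModel-ID is 'training-GCtN_YRig'\n"
print(extract_training_id(sample))
```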
- To list the training runs: `bx ml list training-runs`
- To monitor a specific training run: `bx ml show training-runs <training-id>`
- To monitor the output (stdout) from the training run: `bx ml monitor training-runs <training-id>`
  - This prints the first few lines of output and may time out.
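
When scripting these checks, it may be convenient to build the command lines in one place. The helper below is a hypothetical sketch; it only assembles the three commands listed above and does not call the CLI itself.

```python
import shlex

def bx_training_command(action, training_id=None):
    """Build the argument list for one of the three `bx ml` commands listed above."""
    if action == "list":
        return ["bx", "ml", "list", "training-runs"]
    if action in ("show", "monitor"):
        if not training_id:
            raise ValueError(f"'{action}' needs a training ID")
        return ["bx", "ml", action, "training-runs", training_id]
    raise ValueError(f"unknown action: {action!r}")

# Print the command as it would be typed in a shell.
print(shlex.join(bx_training_command("monitor", "training-GCtN_YRig")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `bx` CLI is installed and logged in.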
The `demo.sh` utility script downloads the results from the bucket, converts the embeddings into binary vector format, and runs a Python application to explore the embeddings:

    demo.sh

When querying a single word, the results list words that are similar in meaning.

    query> dog
    dog
    dogs
    cat

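
A single-word query of this kind is typically answered by ranking all words by cosine similarity to the query vector. The sketch below illustrates that idea with made-up three-dimensional vectors; it is not the demo application's code, and the function names and toy values are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, embeddings, k=3):
    """Return the k words whose vectors are most similar to `word`, the word itself included."""
    query = embeddings[word]
    ranked = sorted(embeddings, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:k]

# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
toy = {
    "dog": [0.9, 0.1, 0.0],
    "dogs": [0.85, 0.15, 0.05],
    "cat": [0.7, 0.3, 0.0],
    "car": [0.0, 0.2, 0.9],
}
print(nearest("dog", toy))
```

The query word ranks first because its similarity to itself is the maximum possible value.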
It is also possible to query for the completion of an analogy (e.g., a man is to a woman as a king is to...).

    query> man woman king
    king
    queen
    princess

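
Analogy queries are commonly answered with vector arithmetic: take `woman - man + king` and look for the words closest to the result. The following sketch shows that approach on made-up two-dimensional vectors. Whether the demo application uses exactly this method is an assumption, and all names and values here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, embeddings, k=3):
    """Rank words by similarity to the vector b - a + c ("a is to b as c is to ...")."""
    target = [bv - av + cv for av, bv, cv in zip(embeddings[a], embeddings[b], embeddings[c])]
    ranked = sorted(embeddings, key=lambda w: cosine(target, embeddings[w]), reverse=True)
    return ranked[:k]

# Toy vectors invented for illustration: first axis ~ royalty, second axis ~ gender.
toy = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "princess": [0.6, -1.0],
    "apple": [-1.0, 0.0],
}
print(analogy("man", "woman", "king", toy, k=2))
```

With real embeddings the offset between word pairs is much smaller relative to the vectors themselves, so a query word such as `king` often stays near the top of the ranking, as in the sample output above.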
If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions here.