[WIP][HIVEMALL-118] word2vec #116
base: master
 * specific language governing permissions and limitations
 * under the License.
 */
package hivemall.unsupervised;
Please move the package from hivemall.unsupervised to hivemall.embedding.
import javax.annotation.Nonnegative;
import javax.annotation.Nonnull;

public abstract class AbstractWord2vecModel {
Please rename Word2vec to Word2Vec, as seen in Spark.
}
}

protected static float sigmoid(final float v, final int MAX_SIGMOID,
No need to pass the constants as arguments: final int MAX_SIGMOID, final int SIGMOID_TABLE_SIZE.
public abstract class AbstractWord2VecModel {
// cached sigmoid function parameters
protected final int MAX_SIGMOID = 6;
Constants should be static final.
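The two review points above (no constants as method arguments, constants declared static final) can be illustrated with a small sketch of a cached sigmoid table. This is not the code from the pull request: the class name SigmoidTableSketch, the buildTable helper, and the indexing scheme are assumptions made for illustration; only the values 6 and 1000 come from the diff.

```java
// Hypothetical sketch, not the PR's implementation.
final class SigmoidTableSketch {
    // Shared by every instance, so declared static final.
    private static final int MAX_SIGMOID = 6;
    private static final int SIGMOID_TABLE_SIZE = 1000;
    private static final float[] SIGMOID_TABLE = buildTable();

    private SigmoidTableSketch() {}

    // Precompute sigmoid(x) for x spread evenly over [-MAX_SIGMOID, MAX_SIGMOID).
    private static float[] buildTable() {
        final float[] table = new float[SIGMOID_TABLE_SIZE];
        for (int i = 0; i < SIGMOID_TABLE_SIZE; i++) {
            final double x = (i / (double) SIGMOID_TABLE_SIZE * 2.0 - 1.0) * MAX_SIGMOID;
            table[i] = (float) (1.0 / (1.0 + Math.exp(-x)));
        }
        return table;
    }

    // The constants are read from the class, not passed in as arguments.
    static float sigmoid(final float v) {
        if (v >= MAX_SIGMOID) {
            return 1.f;
        }
        if (v <= -MAX_SIGMOID) {
            return 0.f;
        }
        final int raw = (int) ((v + (double) MAX_SIGMOID) * SIGMOID_TABLE_SIZE / (2.0 * MAX_SIGMOID));
        // Clamp to guard against rounding at the upper edge of the range.
        final int index = Math.min(raw, SIGMOID_TABLE_SIZE - 1);
        return SIGMOID_TABLE[index];
    }
}
```

With the constants on the class, callers only pass the value to squash, for example SigmoidTableSketch.sigmoid(0.f).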
protected static final int SIGMOID_TABLE_SIZE = 1000;
protected float[] sigmoidTable;

Remove the unnecessary blank line.
@Nonnegative
protected int dim;
protected int win;
Add @Nonnegative for each variable (win, neg, iter).
protected Int2FloatOpenHashTable S;
protected int[] aliasWordId;

protected AbstractWord2VecModel(final int dim, final int win, final int neg, final int iter,
Add @Nonnegative for each constructor argument and in the caller methods.
}
}

protected static float sigmoid(final float v, final float[] sigmoidTable) {
Add @Nonnull for sigmoidTable.
}

protected void updateLearningRate() {
// TODO: valid lr?
Remove this TODO comment and the blank lines.
import java.util.List;

public final class CBoWModel extends AbstractWord2VecModel {
protected CBoWModel(final int dim, final int win, final int neg, final int iter,
Add a blank line before the constructor.
updateLearningRate();

int docLength = doc.length;
Use final int docLength.
opts.addOption("win", "window", true, "Context window size [default: 5]");
opts.addOption("neg", "negative", true,
    "The number of negative sampled words per word [default: 5]");
opts.addOption("iter", "iteration", true, "The number of iterations [default: 5]");
Use consistent naming, "iters" and "iterations", as seen in SLIM.
opts.addOption("model", "modelName", true,
    "The model name of word2vec: skipgram or cbow [default: skipgram]");
opts.addOption(
    "lr",
Use consistent naming, eta0 and learningRate, for the initial learning rate.
I see. Should longOpt remain learningRate, or should this field be removed?
Keep learningRate for longOpt and use eta0 for initialLearningRate.
}

modelName = cl.getOptionValue("model", modelName);
if (!(modelName.equals("skipgram") || modelName.equals("cbow"))) {
"skipgram".equals(modelName) is null safe.
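A minimal sketch of the null-safe comparison suggested above. Calling equals on the string literal means a null modelName simply yields false instead of throwing a NullPointerException. The class and method names (ModelNameCheck, isSupportedModel) are hypothetical and do not appear in the pull request.

```java
// Hypothetical helper for illustration only.
final class ModelNameCheck {
    private ModelNameCheck() {}

    // Returns true only for the two model names listed in the option help text.
    // The literal is the receiver, so a null argument cannot cause an NPE.
    static boolean isSupportedModel(final String modelName) {
        return "skipgram".equals(modelName) || "cbow".equals(modelName);
    }
}
```

The validation in the diff could then read if (!ModelNameCheck.isSupportedModel(modelName)), or inline the two literal-first comparisons directly.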
@myui I resolved conflicts.
What changes were proposed in this pull request?
Add a new algorithm: skip-gram with negative sampling (a.k.a. word2vec).
What type of PR is it?
Feature
What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-118
How was this patch tested?
Manual tests on EMR.
To train word2vec, I used a Wikipedia dataset preprocessed by this Perl script.
I evaluated the word vectors with https://github.com/kudkudak/word-embeddings-benchmarks .
CBoW model of Hivemall
Skip-gram of Hivemall
CBoW of Hivemall when the number of reducers for training is 4
How to use this feature?
Please see word2vec.md.
Checklist
Did you run mvn formatter:format for your commit?