Add n-gram option for embedding training #81

AngusKung · 2018-11-15T05:47:21Z

Feature implemented:
N-gram phrase detection using gensim's Phrases and Phraser, tested successfully on a local machine for both in-memory (data_size < max_memory) and out-of-core (data_size > max_memory) computation. Controlled via args.ngram, which defaults to 1 (unigram, i.e. the original behavior).
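To make the idea concrete, here is a minimal standard-library sketch of what a phrase-merging pass over random-walk sequences could look like. It is not the code in this pull request and not gensim's implementation; the function names, the scoring rule, and the parameters `min_count`, `threshold`, and `delimiter` are assumptions chosen for illustration.

```python
from collections import Counter


def merge_frequent_pairs(walks, min_count=1, threshold=1.0, delimiter="_"):
    """One pass of bigram merging over random-walk sequences.

    Adjacent tokens (a, b) are joined into one token when
    (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    exceeds `threshold`. This scoring rule is an assumption of the sketch.
    """
    unigrams = Counter(tok for walk in walks for tok in walk)
    bigrams = Counter(pair for walk in walks for pair in zip(walk, walk[1:]))
    vocab_size = len(unigrams)

    def score(a, b):
        return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

    merged = []
    for walk in walks:
        out, i = [], 0
        while i < len(walk):
            # Greedy left-to-right merge: a merged pair consumes two tokens.
            if i + 1 < len(walk) and score(walk[i], walk[i + 1]) > threshold:
                out.append(walk[i] + delimiter + walk[i + 1])
                i += 2
            else:
                out.append(walk[i])
                i += 1
        merged.append(out)
    return merged


def apply_ngrams(walks, ngram=1, **kwargs):
    """Run the merging pass (ngram - 1) times; ngram=1 returns walks unchanged."""
    for _ in range(ngram - 1):
        walks = merge_frequent_pairs(walks, **kwargs)
    return walks


# Node ids as strings, the way walks are usually fed to an embedding model.
walks = [["1", "2", "3"], ["1", "2", "4"], ["1", "2", "5"], ["5", "4", "3"]]
bigram_walks = apply_ngrams(walks, ngram=2)
```

With `ngram=1` the walks pass through untouched, which matches the default described above. Because each extra pass can join tokens that were already merged, values above 2 in this sketch may produce tokens spanning more than `ngram` nodes.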

Tests have been run; the logs are below:

python setup.py test (after removing 'from deepwalk import deepwalk' in test_deepwalk.py)

running test
/Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/site-packages/setuptools/dist.py:517: UserWarning: Module argparse was already imported from /Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/argparse.py, but /Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/site-packages/argparse-1.4.0-py3.6.egg is being added to sys.path
pkg_resources.working_set.add(dist, replace=True)
running egg_info
writing deepwalk.egg-info/PKG-INFO
writing dependency_links to deepwalk.egg-info/dependency_links.txt
writing entry points to deepwalk.egg-info/entry_points.txt
writing requirements to deepwalk.egg-info/requires.txt
writing top-level names to deepwalk.egg-info/top_level.txt
reading manifest file 'deepwalk.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching 'pycache' found under directory '*'
writing manifest file 'deepwalk.egg-info/SOURCES.txt'
running build_ext
test_something (tests.test_deepwalk.TestDeepwalk) ... ok

Ran 1 test in 0.000s

OK

tox test:

____________________________________________________________________________________ summary _____________________________________________________________________________________
py26: commands succeeded
py27: commands succeeded
py33: commands succeeded
ERROR: py34: InterpreterNotFound: python3.4
*NOTE: I tested Python 3.4 manually in a conda environment and it works fine. The failure here appears to be an issue with the test setup (the python3.4 interpreter was not found) rather than with the code.

Delete debug pdb and add complete comments for new function
Add 'ngram' argument to control this n-gram feature, set default to off
Add phrases to enable ngrams for random walk sequences

GTmac · 2018-11-17T05:19:32Z

Thanks for the pull request! I have a question: since random walks are a first-order Markov process, what benefit do we expect to get from using n-grams? Are there any tasks or experimental results showing that n-gram features are beneficial?

AngusKung · 2018-11-18T13:02:35Z

Hi GTmac,

Good question!
To be honest, I haven't found any experiments supporting n-gram features.
I'm guessing it might help for e-commerce use cases.

More specifically, when items are represented as a weighted directed graph (as illustrated in Figure 2 of the paper),
it might help to find two or more items that co-occur so often that they should be treated as a single n-gram and given their own embedding.
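As a rough illustration of that scenario, the sketch below runs weighted random walks over a tiny, made-up item graph and counts how often each ordered pair of items appears side by side. The graph, its weights, and the helper names are all hypothetical; this only shows how strongly weighted edges turn into frequent adjacent pairs, which is what a phrase detector would pick up on.

```python
import random
from collections import Counter

# Hypothetical item graph: edge weights stand in for how often one item
# follows another (made-up numbers, for illustration only).
ITEM_GRAPH = {
    "phone": {"case": 9, "charger": 1},
    "case": {"charger": 1, "phone": 1},
    "charger": {"phone": 1, "case": 1},
}


def weighted_walk(graph, start, length, rng):
    """Random walk that picks each next node in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: no outgoing edges
        nodes, weights = zip(*neighbors.items())
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def adjacent_pair_counts(walks):
    """Count ordered pairs of items that appear next to each other in walks."""
    return Counter(pair for walk in walks for pair in zip(walk, walk[1:]))


rng = random.Random(0)  # fixed seed so the run is repeatable
walks = [
    weighted_walk(ITEM_GRAPH, node, 10, rng)
    for node in ITEM_GRAPH
    for _ in range(50)
]
pair_counts = adjacent_pair_counts(walks)
print(pair_counts.most_common(3))
```

Pairs that dominate these counts are the candidates that would be merged into a single token and receive their own embedding.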

In fact, I'd like to experiment with this, so I implemented it and opened this pull request to share it.
I'd be fine with either option:

- merge this request now, since it defaults to the original unigram behavior and might encourage more people to experiment with the idea
- wait until my experiments or someone else's support this feature

Thanks for reading and replying, and please advise :)

Add phrases to enable ngrams for random walk sequences

b4d5b4f

