Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add n-gram option for embedding training #81

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AngusKung
Copy link

@AngusKung AngusKung commented Nov 15, 2018

#81

Feature implemented:
N-gram Phrases and Phraser by gensim, test successfully for both in-memory(data_size<max_memory) and out-of-core(data_size>max_memory) computation on local machine. Controlled via args.ngram, default set to 1 (unigram and thus original).

Test has been carried out, please find below log:

python setup.py test (after removing 'from deepwalk import deepwalk' in test_deepwalk.py)

running test
/Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/site-packages/setuptools/dist.py:517: UserWarning: Module argparse was already imported from /Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/argparse.py, but /Users/i351465/miniconda3/envs/deepwalk-dev/lib/python3.6/site-packages/argparse-1.4.0-py3.6.egg is being added to sys.path
pkg_resources.working_set.add(dist, replace=True)
running egg_info
writing deepwalk.egg-info/PKG-INFO
writing dependency_links to deepwalk.egg-info/dependency_links.txt
writing entry points to deepwalk.egg-info/entry_points.txt
writing requirements to deepwalk.egg-info/requires.txt
writing top-level names to deepwalk.egg-info/top_level.txt
reading manifest file 'deepwalk.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching 'pycache' found under directory '*'
writing manifest file 'deepwalk.egg-info/SOURCES.txt'
running build_ext
test_something (tests.test_deepwalk.TestDeepwalk) ... ok


Ran 1 test in 0.000s

OK

tox test:

____________________________________________________________________________________ summary _____________________________________________________________________________________
py26: commands succeeded
py27: commands succeeded
py33: commands succeeded
ERROR: py34: InterpreterNotFound: python3.4
*NOTE: tested python-3.4 manually with conda environment, working fine. The failure here should be test scripts issue.

Delete debug pdb and add complete comments for new function

Add 'ngram' argument to control this n-gram feature, set default to off

Add phrases to enable ngrams for random walk sequences
@GTmac
Copy link
Collaborator

GTmac commented Nov 17, 2018

Thanks for the pull request! I have a question here: considering that random walks are first-order Markov process, what benefits are we expected to get from using the n-grams? Is there any task or experimental results to prove that n-grams features are beneficial?

@AngusKung
Copy link
Author

AngusKung commented Nov 18, 2018

Hi GTmac,

Good question!
To be honest, I haven't found any experiments supporting n-gram features.
I'm guessing it might help for e-commerce use cases.

More specific, when representing items with weighted directed graph (as the paper illustrated in Figure 2.),
it might help finding two or more items highly coexisting that we should mark it as a single n-gram and give it an independent embedding.

In fact, I'd like to experiment this, so I implement this and create this pull request to share.
I'd agree both

  1. to merge this request first since it default set to the original unigram and might trigger more people to experiment this idea
  2. to wait until mine or others experiment that supports this feature.

Thanks for replying, reading and please advise :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants