There are three files here. One downloads and sets up the tests that you need. Another runs the tests. The last is a log file with the scores from the GloVe embeddings. To run the tests, you must provide a pickled Python dictionary of your embeddings. The keys should be strings (the words), and the values should be arrays of floats or ints (the vectors).
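As a rough illustration of the expected input, the sketch below builds a small dictionary and pickles it. The words, vector values, and the file name my_embeddings.pkl are all made up for this example. Plain Python lists stand in for the arrays here, which is an assumption; if the test script expects NumPy arrays, convert each value with numpy.asarray before pickling.

```python
import pickle

# Hypothetical toy vocabulary; real vectors would come from your own model.
# Keys are strings, values are sequences of floats (or ints).
embeddings = {
    "cat": [0.10, 0.30, -0.20],
    "dog": [0.20, 0.25, -0.10],
    "car": [-0.40, 0.00, 0.50],
}

# "my_embeddings.pkl" is an arbitrary file name chosen for this sketch.
with open("my_embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)

# Read the file back to confirm it round-trips as a dictionary.
with open("my_embeddings.pkl", "rb") as f:
    loaded = pickle.load(f)
```

The resulting file path is what you would pass as <path to dict> when running the tests.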
Recorded scores are the squared error between the similarity given in the tests and the similarity of the same word pairs in the embeddings you provide. In theory, a lower score indicates 'better' embeddings. The GloVe scores are provided as a reference, since scores are more meaningful in a relative sense than in an absolute sense.
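To make the idea of the score concrete, here is a minimal sketch of one way such a squared error could be computed. It is not the code in test_embeddings.py. It assumes cosine similarity as the embedding similarity, gold scores already rescaled to the range 0 to 1, averaging over pairs, and skipping pairs with a word missing from the dictionary. The function names and toy data are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_error_score(embeddings, pairs):
    """Mean squared error between gold and embedding similarities.

    pairs is a list of (word1, word2, gold) tuples, with gold assumed
    to be rescaled to [0, 1]. Pairs with a missing word are skipped
    (an assumption made for this sketch).
    """
    total = 0.0
    used = 0
    for w1, w2, gold in pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue
        predicted = cosine_similarity(embeddings[w1], embeddings[w2])
        total += (predicted - gold) ** 2
        used += 1
    return total / used if used else float("nan")

# Toy data, invented for illustration only.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "car": [0.0, 1.0]}
toy_pairs = [
    ("cat", "dog", 1.0),      # predicted 1.0, error 0.0
    ("cat", "car", 0.5),      # predicted 0.0, error 0.25
    ("cat", "unicorn", 0.5),  # skipped: "unicorn" is not in the dictionary
]
score = squared_error_score(toy_embeddings, toy_pairs)
```

Under these assumptions, two sets of embeddings can be compared by running the same pairs through both and checking which score is lower.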
Run download.sh
Run test_embeddings.py <path to dict> <log file name>
Here are the links to the tests that are used.
MEN: http://clic.cimec.unitn.it/~elia.bruni/MEN.html
MTurk: http://www2.mta.ac.il/~gideon/mturk771.html
WS-353: http://alfonseca.org/eng/research/wordsim353.html
SimLex: http://www.cl.cam.ac.uk/~fh295/simlex.html#
Here are the GloVe embeddings used as a baseline.