Releases: piskvorky/gensim-data
fasttext-wiki-news-subwords-300
1 million pre-trained FastText word vectors, trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).
Feature | Description |
---|---|
File size | 959MB |
Number of vectors | 999999 |
Dimension | 300 |
License | https://creativecommons.org/licenses/by-sa/3.0/ |
Read more:
- https://fasttext.cc/docs/en/english-vectors.html
- Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin: "Advances in Pre-Training Distributed Word Representations"
- Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov: "Bag of Tricks for Efficient Text Classification"
Example
import gensim.downloader as api
model = api.load("fasttext-wiki-news-subwords-300")
model.most_similar(positive=["russia", "river"])
"""
Output:
[(u'russias', 0.6939424276351929),
(u'danube', 0.6881916522979736),
(u'river.', 0.6683923006057739),
(u'crimea', 0.6638611555099487),
(u'rhine', 0.6632323861122131),
(u'rivermouth', 0.6602864265441895),
(u'wester', 0.6586191058158875),
(u'finland', 0.6585439443588257),
(u'volga', 0.6576792001724243),
(u'ukraine', 0.6569074392318726)]
"""
semeval-2016-2017-task3-subtaskBC
The SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain the train+development set (317 original questions, 3,169 related questions, and 31,690 comments) and the test sets, all in English. The tasks and the collected data are described in sections 3 and 4.1 of the 2016 task paper linked in section “Papers” of #18.
Related issue #18
attribute | value |
---|---|
File size | 6MB |
Number of records | 4 (upper level) |
Read more:
- SemEval task 3: community question answering
- Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor: "SemEval-2017 Task 3: Community Question Answering"
Produced by: https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english
Example:
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
import numpy as np
def read_corpus():
    for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
        yield simple_preprocess(thread["RelQuestion"]["RelQSubject"])
        yield simple_preprocess(thread["RelQuestion"]["RelQBody"])
        for relcomment in thread["RelComments"]:
            yield simple_preprocess(relcomment["RelCText"])

dictionary = Dictionary(read_corpus())
datasets = api.load("semeval-2016-2017-task3-subtaskBC")

def produce_test_data(dataset):
    for orgquestion in datasets[dataset]:
        relquestions = [
            (
                dictionary.doc2bow(simple_preprocess(thread["RelQuestion"]["RelQSubject"]) + simple_preprocess(thread["RelQuestion"]["RelQBody"])),
                thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"] in ("PerfectMatch", "Relevant")
            )
            for thread in orgquestion["Threads"]
        ]
        relcomments = [
            (
                dictionary.doc2bow(simple_preprocess(relcomment["RelCText"])),
                relcomment["RELC_RELEVANCE2ORGQ"] == "Good"
            )
            for thread in orgquestion["Threads"] for relcomment in thread["RelComments"]
        ]
        orgquestion = dictionary.doc2bow(simple_preprocess(orgquestion["OrgQSubject"]) + simple_preprocess(orgquestion["OrgQBody"]))
        yield orgquestion, dict(subtaskB=relquestions, subtaskC=relcomments)

def average_precision(similarities, relevance):
    precision = [
        (num_correct + 1) / (num_total + 1)
        for num_correct, num_total in enumerate(
            num_total for num_total, (_, relevant) in enumerate(
                sorted(zip(similarities, relevance), reverse=True)
            )
            if relevant)
    ]
    return np.mean(precision) if precision else 0.0

def evaluate(dataset, subtask):
    results = []
    for orgquestion, subtasks in produce_test_data(dataset):
        documents, relevance = zip(*subtasks[subtask])
        index = MatrixSimilarity(documents, num_features=len(dictionary))
        similarities = index[orgquestion]
        results.append(average_precision(similarities, relevance))
    return np.mean(results) * 100.0

for dataset in ("2016-dev", "2016-test", "2017-test"):
    print("MAP score on the {} dataset:\t{:.2f} (Subtask B)\t{:.2f} (Subtask C)".format(dataset, evaluate(dataset, "subtaskB"), evaluate(dataset, "subtaskC")))
"""
Output:
MAP score on the 2016-dev dataset: 41.89 (Subtask B) 3.33 (Subtask C)
MAP score on the 2016-test dataset: 51.42 (Subtask B) 5.59 (Subtask C)
MAP score on the 2017-test dataset: 23.65 (Subtask B) 0.74 (Subtask C)
"""
semeval-2016-2017-task3-subtaskA-unannotated
The SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.
Related issue #18
attribute | value |
---|---|
File size | 224MB |
Number of records | 189941 |
Read more:
- http://alt.qcri.org/semeval2017/task3/
- http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf
Produced by: https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english
Example:
import gensim.downloader as api
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    print("Question subjects: {}\n".format(thread["RelQuestion"]["RelQSubject"]))
    print("Question body: {}\n".format(thread["RelQuestion"]["RelQBody"]))
    print("Relevant comments: ")
    for idx, relcomment in enumerate(thread["RelComments"]):
        print("\t#{}: {}\n".format(idx + 1, relcomment["RelCText"]))
    break
"""
Output:
Question subjects: Thailand:IT Minsitry blocks CNN; Facebook;
Question body: The state of Internet in Thailand:IT Minsitry blocks CNN; Facebook; Yahoo; Flickr Thai Immigration website listed as dangerousFull story: http://www.thaivisa.com/forum/Thai-Govt-Blocks-Cnn-Yahoo-Financ-t321851.html
Relevant comments:
#1: have they blocked porn??? <img src="http://www.qatarliving.com/files/images/Da.gif">
#2: like trying to contain a tsunami with a hand towel ************************************ I'm Jack's complete lack of surprise
#3: oops double post.. ----------------- "HE WHO DARES WINS" Derek Edward Trotter
#4: What next they gonna ban all *** tourist from entering the country? ----------------- "HE WHO DARES WINS" Derek Edward Trotter
#5: Or you can always make your own there with some thai babys Rules are a guideline for intelligent people; but they must be adhered to by idiots.
#6: why CNN? they want to die ignorant of what happens around?
"""
patent-2017
Raw full text and metadata of patent grants, from the US Patent and Trademark Office (USPTO), as distributed by Reed Tech.
Contains the full text including tables, International Patent Classification (IPC) and Cooperative Patent Classification (CPC), sequence data and 'in-line' mathematical expressions of each patent grant issued in 2017.
Read more about dataset history, usage and conditions:
attribute | value |
---|---|
File size | 3GB |
Number of patents | 353,197 |
For alternative patent datasets, see the discussion in issue #8.
Example:
import gensim.downloader as api
import json
dataset = api.load("patent-2017")
for idx, document in enumerate(dataset):
    print(json.dumps(document, indent=2))
    break  # show only the first patent
"""
Output:
{
"description": {
"p": [
"The present application claims the benefit under 35 U.S.C. \u00a7119 to U.S. provisional patent application Ser. No. 61/768,295, filed Feb. 22, 2013. The foregoing application is hereby incorporated by reference into the present application in its entirety.",
"The present inventions relate to tissue stimulation systems, and more particularly, to systems and methods for adjusting the stimulation provided to tissue to minimize the energy requirements of the systems.",
"Implantable neurostimulation systems have proven therapeutic in a wide variety of diseases and disorders. Pacemakers and Implantable Cardiac Defibrillators (ICDs) have proven highly effective in the treatment of a number of cardiac conditions (e.g., arrhythmias). Spinal Cord Stimulation (SCS) systems have long been accepted as a therapeutic modality for the treatment of chronic pain syndromes, and the application of spinal stimulation has begun to expand to additional applications, such as angina pectoris and incontinence. Deep Brain Stimulation (DBS) has also been applied therapeutically for well over a decade for the treatment of refractory Parkinson's Disease, and DBS has also recently been applied in additional areas, such as essential tremor and epilepsy. Further, in recent investigations, Peripheral Nerve Stimulation (PNS) systems have demonstrated efficacy in the treatment of chronic pain syndromes and incontinence, and a number of additional applications are currently under investigation. Furthermore, Functional Electrical Stimulation (FES) systems such as the Freehand system by NeuroControl (Cleveland, Ohio) have been applied to restore some functionality to paralyzed extremities in spinal cord injury patients.",
"Each of these implantable neurostimulation systems typically includes one or more electrode carrying stimulation leads, which are implanted at the desired stimulation site, and a neurostimulation device implanted remotely from the stimulation site, but coupled either directly to the stimulation lead(s) or indirectly to the stimulation lead(s) via a lead extension. Thus, electrical pulses can be delivered from the neurostimulation device to the electrode(s) to activate a volume of tissue in accordance with a set of stimulation parameters and provide the desired efficacious therapy to the patient. In particular, electrical energy conveyed between at least one cathodic electrode and at least one anodic electrode creates an electrical field, which when strong enough, depolarizes (or \u201cstimulates\u201d) the neurons beyond a threshold level, thereby evoking action potentials (APs) that propagate along the neural fibers. A typical stimulation parameter set may include the electrodes that are sourcing (anodes) or returning (cathodes) the modulating current at any given time, as well as the amplitude, duration, and rate of the stimulation pulses.",
"The neurostimulation system may further comprise a handheld patient programmer to remotely instruct the neurostimulation device to generate electrical stimulation pulses in accordance with selected stimulation parameters. The handheld programmer in the form of a remote control (RC) may, itself, be programmed by a clinician, for example, by using a clinician's programmer (CP), which typically includes a general purpose computer, such as a laptop, with a programming software package installed thereon.",
"Of course, neurostimulation devices are active devices requiring energy for operation, and thus, the neurostimulation system may oftentimes includes an external charger to recharge a neurostimulation device, so that a surgical procedure to replace a power depleted neurostimulation device can be avoided. To wirelessly convey energy between the external charger and the implanted neurostimulation device, the charger typically includes an alternating current (AC) charging coil that supplies energy to a similar charging coil located in or on the neurostimulation device. The energy received by the charging coil located on the neurostimulation device can then be used to directly power the electronic componentry contained within the neurostimulation device, or can be stored in a rechargeable battery within the neurostimulation device, which can then be used to power the electronic componentry on-demand.",
"Typically, the therapeutic effect for any given neurostimulation application may be optimized by adjusting the stimulation parameters. Although the threshold for evoking action potentials may be a good indication of whether a desired therapeutic result is achieved, it is usually not directly observable when programming the neurostimulation device. For this reason, the programmer of the neurostimulation system is often required to identify the efficacy threshold and the side-effect threshold based on the patient's perception. For instance, the programmer of the neurostimulation system may identify the efficacy threshold by asking the patient whether the pain is relieved or perceived paresthesia, and record the set of stimulation parameters of that stimulation level. Similarly, the side-effect threshold is identified by adjusting the stimulation until the patient perceives any undesired side-effects such as slurred speech or involuntary muscle contraction, and records the set of stimulation parameters of that stimulation level. Then, the neurostimulation system is configured with a certain set of stimulation parameters to generate stimulation at an arbitrary level within the therapeutic window so that the stimulation is perceptible by the patient without causing any undesirable side effects.",
"There are a few issues that need to be considered when using this approach. Many neurostimulation therapies take time to develop the clinical benefit. For example, the patient may need to be on a certain level of stimulation for a few hours or even days before he or she can actually feel the pain relief or regain muscles mobility. Also, the side effect threshold is often not perfectly correlated with the therapeutic effect. Therefore, relying on the subjective clinical assessment (e.g., perception threshold) at the acute setting and configuring the stimulation parameters may result in an erroneous therapeutic window. Moreover, various changes, including postural changes, leads movement and tissue maturation, may occur in the patient during the course of therapy, and the stimulation parameters may need to be re-calibrated using the same unreliable subjective clinical assessment approach, thus the therapeutic window is often chosen to be very broad. That is, the gap between the efficacy threshold and the side-effect threshold is set as far as possible. In order to prevent under-stimulation and over-stimulation, a set of stimulation parameters are chosen to generate a stimulation pulse at the mid-level of the wide therapeutic window. The set of stimulation parameters for generating such stimulation pulse is more energy-intensive than necessary to achieve the therapy, which in turn causes decreased battery life, more frequent recharge cycles, and/or in the case where non-chargeable primary cell devices are used, more frequent surgeries for replacing the battery.",
"There, thus, remains a need to decrease the energy requirements for neurostimulation therapy.",
"In accordance with the present inventions, a neurostimulation system is provided. The system comprises stimulation output circuitry configured for delivering stimulation pulses to target tissue in accordance with a set of stimulation parameters (e.g., at least one of a pulse amplitude, a pulse width, a pulse rate, a duty cycle, a burst rate, and an electrode combination), monitoring circuitry configured for continuously measuring action potentials evoked in the target tissue (e.g., one of an evoked compound action potential and an evoked compound muscle action potential) in response to the delivery of the stimulation pulses to the target tissue, memory configured for storing a characteristic of a reference evoked action potential (e.g., at least one of peak delay, width, amplitude, and waveform morphology), which may be a therapeutic evoked action potential or a side-effect evoked action potential, and at least one processor configured for initiating an automatic mode, in which a characteristic of the measured evoked action potentials is compared to the corresponding characteristic of the reference evoked action potential, and one or more stimulation parameter values in the set of stimulation parameters are adjusted to decrease or increase the energy level of the stimulation pulses, thereby evoking action potentials in the target tissue having substantially the same corresponding characteristic as the reference evoked action potential.",
"In one embodiment, the processor(s) is configured for triggering the automatic mode base...
conceptnet-numberbatch-17-06-300
ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.
Related issue #9.
attribute | value |
---|---|
File size | 1.14GB |
Number of vectors | 1917247 |
Dimension | 300 |
License | https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt |
Read more:
- http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972
- https://github.com/commonsense/conceptnet-numberbatch
- http://conceptnet.io/
Example
import gensim.downloader as api
model = api.load("conceptnet-numberbatch-17-06-300")
for word, distance in model.most_similar("/c/en/beer"):
    print(u"{}: {:4f}".format(word, distance))
"""
output:
/c/ca/birra: 0.995633
/c/eu/zerbeza: 0.995058
/c/hi/बियर: 0.994754
/c/ja/ビア: 0.994656
/c/ja/ビヤ: 0.994406
/c/ja/ビーア: 0.994406
/c/eu/garagardo: 0.994178
/c/ku/بیرە: 0.993689
/c/eu/biera: 0.993634
/c/sh/пиво: 0.992218
"""
word2vec-ruscorpora-300
Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words).
Related issue #3.
attribute | value |
---|---|
File size | 199MB |
Number of vectors | 184973 |
Preprocessing | The corpus (used for training) was lemmatized and tagged with Universal PoS |
Window size | 10 |
Dimension | 300 |
License | https://creativecommons.org/licenses/by/4.0/deed.en |
Read more:
- https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
- http://rusvectores.org/en/
Example
import gensim.downloader as api
model = api.load("word2vec-ruscorpora-300")
for word, distance in model.most_similar(u"кот_NOUN"):
    print(u"{}: {:.3f}".format(word, distance))
"""
output:
кошка_NOUN: 0.757
котенок_NOUN: 0.668
пес_NOUN: 0.563
мяукать_VERB: 0.562
тобик_NOUN: 0.559
фоксик_NOUN: 0.557
собака_NOUN: 0.557
мяучать_VERB: 0.554
харлашка_NOUN: 0.552
котяра_NOUN: 0.551
"""
wiki-english-20171001
Plaintext extracted from raw XML Wikipedia dump from October 2017. Each article is split into its constituent sections and their headlines (see the section_texts and section_titles attributes of each record).
attribute | value |
---|---|
File size | 6.3GB |
Number of articles | 4,924,894 |
Total number of sections | 23,179,735 |
Read more:
Produced by
python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-english-20171001.gz
Example:
import gensim.downloader as api
data = api.load("wiki-english-20171001")
for article in data:
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
    break
"""
Section title: Introduction
Section text:
'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful.
While anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including—but not limited to—the state system. Anarchism is usually considered a far-left ideology and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism or participatory economics.
Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types and traditions of anarchism exist, not all of which are mutually exclusive. Anarchist schools of thought can differ fundamentally, supporting anything from extreme individualism to complete collectivism. Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications.
Section title: Etymology and terminology
Section text:
The word ''anarchism'' is composed from the word ''anarchy'' and the suffix ''-ism'', themselves derived respectively from the Greek , i.e. ''anarchy'' (from , ''anarchos'', meaning "one without rulers"; from the privative prefix ἀν- (''an-'', i.e. "without") and , ''archos'', i.e. "leader", "ruler"; (cf. ''archon'' or , ''arkhē'', i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix or (''-ismos'', ''-isma'', from the verbal infinitive suffix -ίζειν, ''-izein''). The first known use of this word was in 1539. Various factions within the French Revolution labelled opponents as anarchists (as Robespierre did the Hébertists) although few shared many views of later anarchists. There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation, such as William Godwin and Wilhelm Weitling, but they did not use the word ''anarchist'' or ''anarchism'' in describing themselves or their beliefs.
The first political philosopher to call himself an anarchist was Pierre-Joseph Proudhon, marking the formal birth of anarchism in the mid-nineteenth century. Since the 1890s, and beginning in France, the term "libertarianism" has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States; its use as a synonym is still common outside the United States. On the other hand, some use libertarianism to refer to individualistic free-market philosophy only, referring to free-market anarchism as libertarian anarchism.
Section title: History
Section text:
===Origins===
Woodcut from a Diggers document by William Everard
The earliest anarchist themes can be found in the 6th century BC among the works of Taoist philosopher Laozi and in later centuries by Zhuangzi and Bao Jingyan. Zhuangzi's philosophy has been described by various sources as anarchist. Zhuangzi wrote: "A petty thief is put in jail. A great brigand becomes a ruler of a Nation". Diogenes of Sinope and the Cynics, as well as their contemporary Zeno of Citium, the founder of Stoicism, also introduced similar topics. Jesus is sometimes considered the first anarchist in the Christian anarchist tradition. Georges Lechartier wrote: "The true founder of anarchy was Jesus Christ and ... the first anarchist society was that of the apostles". In early Islamic history, some manifestations of anarchic thought are found during the Islamic civil war over the Caliphate, where the Kharijites insisted that the imamate is a right for each individual within the Islamic society.
The French renaissance political philosopher Étienne de La Boétie wrote in his most famous work the ''Discourse on Voluntary Servitude'' what some historians consider an important anarchist precedent. The radical Protestant Christian Gerrard Winstanley and his group the Diggers are cited by various authors as proposing anarchist social measures in the 17th century in England. The term "anarchist" first entered the English language in 1642, during the English Civil War, as a term of abuse, used by Royalists against their Roundhead opponents. By the time of the French Revolution some, such as the ''Enragés'', began to use the term positively, in opposition to Jacobin centralisation of power, seeing "revolutionary government" as oxymoronic. By the turn of the 19th century, the English word "anarchism" had lost its initial negative connotation.
Modern anarchism emerged from the secular or religious thought of the Enlightenment, particularly Jean-Jacques Rousseau's arguments for the moral centrality of freedom.
As part of the political turmoil of the 1790s in the wake of the French Revolution, William Godwin developed the first expression of modern anarchist thought. Godwin was, according to Peter Kropotkin, "the first to formulate the political and economical conceptions of anarchism, even though he did not give that name to the ideas developed in his work", while Godwin attached his anarchist ideas to an early Edmund Burke.
William Godwin, "the first to formulate the political and economical conceptions of anarchism, even though he did not give that name to the ideas developed in his work".
Godwin is generally regarded as the founder of the school of thought known as 'philosophical anarchism'. He argued in ''Political Justice'' (1793) that government has an inherently malevolent influence on society, and that it perpetuates dependency and ignorance. He thought that the spread of the use of reason to the masses would eventually cause government to wither away as an unnecessary force. Although he did not accord the state with moral legitimacy, he was against the use of revolutionary tactics for removing the government from power. Rather, he advocated for its replacement through a process of peaceful evolution.
His aversion to the imposition of a rules-based society led him to denounce, as a manifestation of the people's 'mental enslavement', the foundations of law, property rights and even the institution of marriage. He considered the basic foundations of society as constraining the natural development of individuals to use their powers of reasoning to arrive at a mutually beneficial method of social organisation. In each case, government and its institutions are shown to constrain the development of our capacity to live wholly in accordance with the full and free exercise of private judgement.
The French Pierre-Joseph Proudhon is regarded as the first ''self-proclaimed'' anarchist, a label he adopted in his groundbreaking work, ''What is Property?'', published in 1840. It is for this reason that some claim Proudhon as the founder of modern anarchist theory. He developed the theory of spontaneous order in society, where organisation emerges without a central coordinator imposing its own idea of order against the wills of individuals acting in their own interests. His famous quote on the matter is "Liberty is the mother, not the daughter, of order". In ''What is Property?'' Proudhon answers with the famous accusation "Property is theft." In this work, he opposed the institution of decreed "property" (''propriété''), where owners have complete rights to "use and abuse" their property as they wish. He contrasted this with what he called "possession," or limited ownership of resources and goods only while in more or less continuous use. Later, however, Proudhon added that "Property is Liberty" and argued that it was a bulwark against state power. His opposition to the state, organised religion, and certain capitalist practices inspired subsequent anarchists, and made him one of the leading social thinkers of his time.
The anarcho-communist Joseph Déjacque was the first person to describe himself as "libertarian". Unlike Pierre-Joseph Proudhon, he argued that, "it is not the product of his or her labour that the worker has a right to, but to the satisfaction of his or her needs, whatever may be their nature." In 1844 in Germany the post-hegelian philosopher Max Stirner published the book, ''The Ego and Its Own'', which would later be considered an influential early text of individualist anarchism. French anarchists active in the 1848 Revolution included Anselme Bellegarrigue, Ernest Coeurderoy, Joseph Déjacque and Pierre Joseph Proudhon.
===First International and the Paris Commune===
Anarchist Mikhail Bakunin opposed the Marxist aim of dictatorship of the proletariat in favour of universal rebellion, and allied himself with the federalists in the First International before his expulsion by the Marxists.
In Europe, harsh reaction followed the revolut...
quora-duplicate-questions
Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not.
attribute | value |
---|---|
File size | 21MB |
Number of pairs | 404290 |
License | probably https://www.quora.com/about/tos |
Read more:
Example
import gensim.downloader as api
import json
data = api.load("quora-duplicate-questions")
for question_pair in data:
    print(json.dumps(question_pair, indent=4))
    break
"""
Output:
{
"qid1": "1",
"question2": "What is the step by step guide to invest in share market?",
"qid2": "2",
"is_duplicate": "0",
"question1": "What is the step by step guide to invest in share market in india?",
"id": "0"
}
"""
word2vec-google-news-300
Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in "Distributed Representations of Words and Phrases and their Compositionality".
Feature | Description |
---|---|
File size | 1.6GB |
Number of vectors | 3000000 |
Dimension | 300 |
Read more:
- https://code.google.com/archive/p/word2vec/
- Efficient Estimation of Word Representations in Vector Space
- Distributed Representations of Words and Phrases and their Compositionality
- Linguistic Regularities in Continuous Space Word Representations
Example
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
model.most_similar(positive=["king", "woman"], negative=["man"])
"""
Output:
[(u'queen', 0.7118192911148071),
(u'monarch', 0.6189674139022827),
(u'princess', 0.5902431011199951),
(u'crown_prince', 0.5499460697174072),
(u'prince', 0.5377321243286133),
(u'kings', 0.5236844420433044),
(u'Queen_Consort', 0.5235945582389832),
(u'queens', 0.518113374710083),
(u'sultan', 0.5098593235015869),
(u'monarchy', 0.5087411999702454)]
"""
__testing_word2vec-matrix-synopsis
❗ For testing purposes only ❗
This is a word2vec model of matrix-synopsis.