Grand Challenge #38
Comments
Hi! That sounds really awesome! Please do keep us posted about this. :-) I'm also looking forward to learning more about your tricks as well as the details of your configuration. How many human contestants will you have? I assume they will be amateurs in the domain?
Btw.: Are any new releases planned for the near future (~6 weeks)? I currently use 1.4 and am wondering whether I should upgrade to 1.5 or wait for the next version.
I'm planning to tag the current master as 1.6 whenever I have a moment to consolidate the wiki benchmarks list etc. It's retrained on a much higher quality dataset of questions than before; on the other hand, if you are retraining the models with your own datasets, I think there aren't many other improvements.
(We are focusing on developing some neural language models right now, but we already started integrating them back into YodaQA a few days ago.)
Neural models sound great, looking forward to that.
Grand Challenge is done. I asked questions which a colleague of mine (who is not involved in my team or project) wrote for me, so I couldn't influence them in any way. Most normal people could only answer around 15 questions correctly, but one particularly strong candidate managed to get 24 right. Afterwards, I ran my system against it. Yoda itself answered 13 correctly; the complete system got 24, just like the best human contender. So the result of my Grand Challenge of man vs. machine is: it is currently a draw. We can win against average people, but we are "only" on par with the best.
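A quick back-of-the-envelope sketch (assuming the 30-question total mentioned later in the thread; the labels are mine, the scores are the ones reported above) of how those scores translate into accuracies:

```python
# Scores as reported in this thread; the 30-question total is stated
# later in the discussion ("25 out of 30").
TOTAL = 30
scores = {
    "average human": 15,
    "best human": 24,
    "stock YodaQA": 13,
    "complete system": 24,
}

for name, correct in scores.items():
    print(f"{name}: {correct}/{TOTAL} = {correct / TOTAL:.1%}")
# e.g. best human: 24/30 = 80.0%
```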
@jbauer180266 - I'm wondering if you have results published somewhere? I'd love to take a look!
Not yet, soon. I'll let you know.
We have superhuman performance after all: I didn't activate the Bing backend when I did my tests, but with it, exactly one additional question can be answered correctly, which is one more than the best human managed. 25 out of 30. Very nice.
Update:
Congratulations! To test pure stock YodaQA on the challenge, I have created a small JSON dataset:
[
{"qId": "gch000000", "qText": "What is the capital of Zimbabwe?", "answers": ["Harare"]},
{"qId": "gch000001", "qText": "Who invented the Otto engine?", "answers": ["Nikolaus Otto"]},
{"qId": "gch000002", "qText": "When was Pablo Picasso born?", "answers": ["1881"]},
{"qId": "gch000003", "qText": "What is 7*158 + 72 - 72 + 9?", "answers": ["1115"]},
{"qId": "gch000004", "qText": "Who wrote the novel The Light Fantastic?", "answers": ["Terry Pratchett"]},
{"qId": "gch000005", "qText": "In which city was Woody Allen born?", "answers": ["New York"]},
{"qId": "gch000006", "qText": "Who is the current prime minister of Italy?", "answers": ["Matteo Renzi"]},
{"qId": "gch000007", "qText": "What is the equatorial radius of Earth's moon?", "answers": ["1738"]},
{"qId": "gch000008", "qText": "When did the Soviet Union dissolve?", "answers": ["1991"]},
{"qId": "gch000009", "qText": "What is the core body temperature of a human?", "answers": ["37", "98.6"]},
{"qId": "gch000010", "qText": "Who is the current Dalai Lama?", "answers": ["Tenzin Gyatso"]},
{"qId": "gch000011", "qText": "What is 2^23?", "answers": ["8388608"]},
{"qId": "gch000012", "qText": "Who is the creator of Star Trek?", "answers": ["Gene Roddenberry"]},
{"qId": "gch000013", "qText": "In which city is the Eiffel Tower?", "answers": ["Paris"]},
{"qId": "gch000014", "qText": "12 metric tonnes in kilograms?", "answers": ["12000"]},
{"qId": "gch000015", "qText": "Where is the mouth of the river Rhine?", "answers": ["the Netherlands"]},
{"qId": "gch000016", "qText": "Where is Buckingham Palace located?", "answers": ["London"]},
{"qId": "gch000017", "qText": "Who directed the movie The Green Mile?", "answers": ["Frank Darabont"]},
{"qId": "gch000018", "qText": "When did Franklin D. Roosevelt die?", "answers": ["1945"]},
{"qId": "gch000019", "qText": "Who was the first man in space?", "answers": ["Yuri Gagarin"]},
{"qId": "gch000020", "qText": "Where was the Peace of Westphalia signed?", "answers": ["Osnabrück", "Münster", "Westphalia"]},
{"qId": "gch000021", "qText": "Who was the first woman to be awarded a Nobel Prize?", "answers": ["Marie Curie"]},
{"qId": "gch000022", "qText": "12.1147 inches to yards?", "answers": ["0.3365194444"]},
{"qId": "gch000023", "qText": "What is the atomic number of potassium?", "answers": ["19"]},
{"qId": "gch000024", "qText": "Where is the Tiananmen Square?", "answers": ["China"]},
{"qId": "gch000025", "qText": "What is the binomial name of horseradish?", "answers": ["Armoracia Rusticana"]},
{"qId": "gch000026", "qText": "How long did Albert Einstein live?", "answers": ["76"]},
{"qId": "gch000027", "qText": "Who earned the most Academy Awards?", "answers": ["Walt Disney", "Katharine Hepburn"]},
{"qId": "gch000028", "qText": "How many lines does the London Underground have?", "answers": ["11"]},
{"qId": "gch000029", "qText": "When is the next planned German Federal Convention?", "answers": []}
]
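As a sanity check on the dataset above, here is a small hypothetical Python sketch (not part of the repository) that parses a few of the deterministic arithmetic entries and verifies their gold answers against independently computed values; answers are normalized to plain numbers for comparison.

```python
import json

# A few entries inlined from the dataset above instead of reading a file.
dataset = json.loads("""[
  {"qId": "gch000003", "qText": "What is 7*158 + 72 - 72 + 9?", "answers": ["1115"]},
  {"qId": "gch000011", "qText": "What is 2^23?", "answers": ["8388608"]},
  {"qId": "gch000014", "qText": "12 metric tonnes in kilograms?", "answers": ["12000"]},
  {"qId": "gch000022", "qText": "12.1147 inches to yards?", "answers": ["0.3365194444"]}
]""")

# Expected values computed independently of the dataset.
expected = {
    "gch000003": 7 * 158 + 72 - 72 + 9,  # arithmetic: 1115
    "gch000011": 2 ** 23,                # 8388608
    "gch000014": 12 * 1000,              # tonnes -> kilograms
    "gch000022": 12.1147 / 36,           # inches -> yards (36 in per yd)
}

for entry in dataset:
    gold = float(entry["answers"][0])
    assert abs(gold - expected[entry["qId"]]) < 1e-6, entry["qId"]
print("all arithmetic answers check out")
```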
Seems like, combined with Wolfram Alpha, the system could answer at least 4 more non-factoid questions, and probably help with at least 2 factoids, which would bring it to 24; but it's pretty likely Wolfram Alpha also knows some of the other missed factoids. For your thesis, I'd also recommend comparing this to plain Wolfram Alpha and Google QA. For the latter, we have a script you can easily use in https://github.com/brmson/google-qa (though it may not correctly extract the answer from all non-movie-related results; there is some variability in the HTML code). Great work!
Thank you very much for the hints; both comparisons were already done from the start, though. I compared my results to both systems and also confirmed the great synergy between Yoda and Wolfram that I reported to you after the pilot study (as you probably remember). The questions are a bit treacherous, btw.: for this particular set, Yoda and Wolfram combined come close to answering all the questions my full system answers correctly, but on larger test sets it became apparent that relying only on a combination of Wolfram and Yoda still leaves some holes, which I've been able to plug at least to some degree.
Btw: The evaluation is finished after 26 subjects (13 male, 13 female). The system still takes the top spot. [Just to document a minor discrepancy: I had 27 subjects, but couldn't find a 14th female subject, so I dropped a male candidate to get equal numbers by gender, which should slightly increase external validity. The one I dropped was neither the best nor the worst and was randomly chosen from all males. Btw: The two best human contenders are male, the third best is female.]
Hi,
just wanted to let you know that I plan to hold a competition between human contestants and "my" Yoda version in mid to late April as the grand finale of my contributions so far. Thanks to some tricks, I currently get about twice as many correct (top-1) answers in one third of the time compared to the default configuration, so I'm cautiously optimistic the system could win this challenge. Hence, unless some higher power prevents it, this will take place.
It might not be Jeopardy grand champions or covered on live TV, but if Yoda with some additions can win against people walking around a university, that would be a great milestone imho. I'm currently somewhere between fear and excitement and will keep you posted about the results.
Best wishes,
Joe