An answer ends with extraneous characters #5

Open
teetone opened this issue Aug 2, 2022 · 2 comments
teetone commented Aug 2, 2022

I found an example in the MedQA EN questions in dev.jsonl where one of the answer options has extra characters (a newline and a double quote) appended to it:

Answer choice E is Pulmonary embolism\n\"

{"question": "A 47-year-old man comes to the physician because of severe retrosternal chest pain and shortness of breath for 45 minutes. He has dyslipidemia, hypertension, and type 2 diabetes mellitus. Current medications include hydrochlorothiazide, lisinopril, metformin, and atorvastatin. He has smoked 1 pack of cigarettes daily for 20 years. He appears pale and diaphoretic. His temperature is 37°C (98.6°F), pulse is 115/min, and blood pressure is 140/70 mm Hg. Breath sounds are normal. The remainder of the examination shows no abnormalities. An ECG shows left ventricular hypertrophy with ST-segment elevation in leads I, aVL, and V1–V6. High-dose aspirin, clopidogrel, metoprolol, sublingual nitroglycerin, and unfractionated heparin are administered. As the patient awaits transport to the nearest emergency room, he collapses and becomes unresponsive. His pulse and blood pressure cannot be detected. Despite resuscitative efforts, the patient dies. Which of the following is the most likely cause of death in this patient?", "answer": "Ventricular fibrillation", "options": {"A": "Papillary muscle rupture", "B": "Left ventricular failure", "C": "Ventricular fibrillation", "D": "Septal wall rupture", "E": "Pulmonary embolism\n\""}, "meta_info": "step2&3", "answer_idx": "C"}

teetone (Author) commented Aug 2, 2022

I also noticed the same issue for multiple answers in the 4_options set.
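To gauge how many options are affected, a scan along the following lines may help. This is a minimal sketch, not part of the repository: the function name `find_dirty_options` is hypothetical, and it assumes each line of the file is one JSON record with an "options" dictionary, as in the example above.

```python
import json


def find_dirty_options(lines):
    """Yield (line_index, option_key, option_text) for suspicious options.

    An option is flagged if it contains a newline or a double quote,
    the two extraneous characters reported in this issue.
    """
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, text in record["options"].items():
            if "\n" in text or '"' in text:
                yield i, key, text


# Two made-up records in the same shape as the dataset entries.
sample = [
    '{"question": "q1", "answer": "b", "options": {"A": "a", "B": "b"}, "answer_idx": "B"}',
    '{"question": "q2", "answer": "c", "options": {"A": "c", "B": "Pulmonary embolism\\n\\""}, "answer_idx": "A"}',
]
hits = list(find_dirty_options(sample))
print(hits)

# To scan a real file (the path is an assumption):
# with open("dev.jsonl", encoding="utf-8") as f:
#     for hit in find_dirty_options(f):
#         print(hit)
```

An open file can be passed directly because the function only iterates over lines.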

jind11 (Owner) commented Aug 8, 2022

Hi, thanks for the report! These are most likely stray strings that were attached to the options on the webpages we parsed during data collection. An easy fix is to remove them by string matching, e.g., x.replace('"', '').replace('\n', ''), where x is the option string.
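Applied to a whole record, the string matching suggested above might look like the sketch below. The helper names `clean_option` and `clean_record` are hypothetical, and the record is a shortened, made-up version of the one reported in this issue.

```python
import json


def clean_option(x):
    """Drop double quotes and newlines from an option string."""
    return x.replace('"', '').replace('\n', '')


def clean_record(record):
    """Return a copy of the record with every option string cleaned."""
    cleaned = dict(record)
    cleaned["options"] = {
        key: clean_option(text) for key, text in record["options"].items()
    }
    return cleaned


# Shortened record in the same shape as the dataset entries.
record = {
    "answer": "Ventricular fibrillation",
    "options": {
        "C": "Ventricular fibrillation",
        "E": 'Pulmonary embolism\n"',
    },
    "answer_idx": "C",
}
cleaned = clean_record(record)
print(json.dumps(cleaned["options"]))
```

Note that this removes every double quote in an option, not just trailing ones, so an option that legitimately contains a quotation would be altered as well. It also leaves the "answer" field untouched; whether that field ever carries the same stray characters is not established in this thread.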
