lexicon file missing #16
Hey guys,

Happy to be part of this competition, but one thing we noticed in our training runs is that there is no lexicon file in the original repo. There are many references to lexicon.txt, or related files of that nature, being in a temporary directory. Could you provide a link to the module that creates them, or the actual files themselves? There's also a words.txt file we were looking at that might be the source for creating the lexicon files, but we're not sure because it isn't included in the repository either. Any feedback or help is much appreciated, and thanks again for everything here.

-Oliver Shetler and Brian Parbhu
Hi Oliver & Brian! The language model files you can download on dryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.x69p8czpq) should be all that's needed to run the 3-gram or 5-gram language model decoder (assuming you've also compiled the language model decoder https://github.com/fwillett/speechBCI/tree/main/LanguageModelDecoder). Once you've done those steps, you should be able to run the example inference notebook (https://github.com/fwillett/speechBCI/blob/main/AnalysisExamples/rnn_step3_baselineRNNInference.ipynb). Are you running into errors trying to do those steps, or do you just want to examine the lexicon.txt file? The lexicon.txt file was not included. |
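For anyone following these steps, a quick sanity check can catch an incomplete download before running the notebook. This is only a sketch: the directory and file names below (languageModel, TLG.fst, words.txt) are assumptions based on typical Kaldi/OpenFst-style decoders and the files mentioned in this thread, not confirmed paths from the repo.

```python
# Hypothetical sanity check before running rnn_step3_baselineRNNInference.ipynb.
# Verifies that the language model files downloaded from Dryad unpacked where
# the decoder will look for them. All names here are assumptions.
import os

LM_DIR = "languageModel"             # assumed unpack location for the Dryad files
EXPECTED = ["TLG.fst", "words.txt"]  # assumed decoder inputs (Kaldi-style naming)

missing = [f for f in EXPECTED if not os.path.isfile(os.path.join(LM_DIR, f))]
if missing:
    print(f"Missing language model files in {LM_DIR!r}: {missing}")
else:
    print("All expected language model files found.")
```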
Hi Frank,

The lexicon.txt file was not included. We're trying to examine it because we want to make sure the tokens come only from the train/test-set sentences in the competition data. Our concern is that, given the relatively "small" size of the lexicon, our tokenizer might fail to find important tokens from the holdout set. If that happened, it could prevent the CTC beam-search algorithm from finding correct answers on the holdout set. We just want to make sure our tokenizer has the same linguistic inputs as yours for constructing the lexicon. If you could clarify what linguistic data was used for lexicon construction, that would be very helpful as well.

Cheers,
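This concern can be tested directly once the lexicon file is in hand: collect every word in the holdout sentences and report the ones absent from lexicon.txt. A minimal sketch, assuming a Kaldi-style lexicon format (one `WORD PH1 PH2 ...` entry per line) and placeholder holdout sentences; the file path and format are assumptions, not confirmed details from the repo.

```python
# Hypothetical sketch: check holdout sentences for words missing from the
# decoder's lexicon. An out-of-vocabulary word can never be produced by a
# lexicon-constrained CTC beam search, which is the failure mode feared above.

def load_lexicon_words(path):
    """Assume a Kaldi-style lexicon: one 'WORD PH1 PH2 ...' entry per line."""
    words = set()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                words.add(parts[0].lower())
    return words

def find_oov(sentences, lexicon_words):
    """Return words appearing in `sentences` that are absent from the lexicon."""
    oov = set()
    for sentence in sentences:
        for token in sentence.lower().split():
            token = token.strip(".,!?\"'")
            if token and token not in lexicon_words:
                oov.add(token)
    return oov

lexicon = load_lexicon_words("lexicon.txt")   # assumed filename
holdout = ["placeholder holdout sentence"]    # placeholder data
print(sorted(find_oov(holdout, lexicon)))
```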
Hi again,

For information about the language model, see section 8 of the supplementary materials of the paper (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-023-06377-x/MediaObjects/41586_2023_6377_MOESM1_ESM.pdf). The language model contains all 125k words in the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). It was not built from the small amount of text in the train/test set; it was built from OpenWebText2 (https://openwebtext2.readthedocs.io/en/latest/). Feel free to use large datasets outside of the competition for the language model part. We didn't intend for the language model to be constrained to only the text in the competition data, which would surely miss some words in the holdout set and would just be a poorly performing model in general.

Attaching the lexicon.txt file here.

Best,
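Since the lexicon covers the full CMU dictionary, a similar lexicon.txt can be reconstructed directly from cmudict. The sketch below is under stated assumptions: the input is the plain-text cmudict-0.7b release (latin-1 encoded), alternate pronunciations like WORD(2) are collapsed into their base word, and stress digits are stripped to a stress-free ARPAbet set; whether the attached lexicon keeps stress markers is not stated in this thread.

```python
# Hypothetical sketch: build a Kaldi-style lexicon.txt from the CMU Pronouncing
# Dictionary. Filenames and the stress-stripping choice are assumptions, not
# confirmed details of the attached lexicon.
import re

def cmudict_to_lexicon(cmudict_path, lexicon_path):
    with open(cmudict_path, encoding="latin-1") as src, \
            open(lexicon_path, "w") as dst:
        for line in src:
            if line.startswith(";;;") or not line.strip():
                continue                                     # skip comments/blanks
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word)             # WORD(2) -> WORD
            phones = [re.sub(r"\d", "", p) for p in phones]  # AH0 -> AH
            dst.write(f"{word}\t{' '.join(phones)}\n")

cmudict_to_lexicon("cmudict-0.7b", "lexicon.txt")  # assumed filenames
```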