SST output cleaning algorithm. It fixes misstakes/typos of a SST model result with availability of the original speech transcription.
In examples/
folder you can compare chunks_in.txt
with algorithm result chunks_out.txt
. check_text.txt
was used as original text.
Run:
python3 transcribe-trimmer/__main__.py -c transcribe-trimmer/config.yaml
config.yaml is a file that contains algorthm settings. Parameters description:
- config.yaml:path:in_text - path to input dataset file
- config.yaml:path:check_text - path to original text file
- config.yaml:path:out_text - path to output file
- config.yaml:chars:chunk_separator - character that will be treated as separator between text phrases
- config.yaml:chars:strip - list of characters to be striped in out_text
- config.yaml:chars:space_replace - list of characters that will be replaced with space sign
- config.yaml:algorithm:small_phrase_size - min number of characters in the phrase that triggers different searching algorithm
- config.yaml:algorithm:words_ahead_to_check - for how many words ahead to look for a similar phrase in original text
- config.yaml:algorithm:words_behind_to_check - for how many words behind to look for a similar phrase in original text
- config.yaml:algorithm:mistakes_border - number of characters in one phrase that