Commit

updating tests to reflect unk changes
pique0822 committed Oct 19, 2023
1 parent c883ac3 commit be7072f
Showing 5 changed files with 5 additions and 14 deletions.
2 changes: 1 addition & 1 deletion docs/Usage.md
@@ -107,7 +107,7 @@ must also be disabled with `--disable-approx-alignment`.
### Synonyms
Synonyms allow for reference words to be equivalent to similar forms (determined by the user) for error counting. They are accepted for any input formats and passed into the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.

-In addition to allowing for custom synonyms to be passed in via CLI, fstalign also automatically generates synonyms based on the reference and hypothesis text. Currently, it does this for two cases: cutoff words (hello-) and compound hyphenated words (long-term). In both cases, a synonym is dynamically generated with the hyphen removed. Both of these synonym types can be disabled through the CLI by passing in `--disable-cutoffs` and `--disable-hyphen-ignore`, respectively.
+In addition to allowing for custom synonyms to be passed in via CLI, fstalign also automatically generates synonyms based on the reference and hypothesis text. Currently, it does this for three cases: cutoff words (e.g. hello-), compound hyphenated words (e.g. long-term), and tags or codes that follow the regular expression: `<.*>` (e.g. <laugh>). In the first two cases, a synonym is dynamically generated with the hyphen removed. Both of these synonym types can be disabled through the CLI by passing in `--disable-cutoffs` and `--disable-hyphen-ignore`, respectively. For the last case of tags, we will automatically allow for `<unk>` to be a valid synonym -- currently, this feature cannot be turned off.
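
The three automatic cases described in this paragraph can be sketched roughly as follows. This is an illustration only, not fstalign's implementation: `AutoSynonyms` and its two boolean parameters are hypothetical names that stand in for the `--disable-cutoffs` and `--disable-hyphen-ignore` flags, and the sketch assumes "hyphen removed" means the hyphen characters are simply deleted.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of fstalign): returns the alternative forms
// that would be accepted for a single token under the three automatic cases.
std::vector<std::string> AutoSynonyms(const std::string& token,
                                      bool disable_cutoffs = false,
                                      bool disable_hyphen_ignore = false) {
  std::vector<std::string> alternatives;
  static const std::regex tag_pattern("<.*>");

  // Case 3: tags or codes such as <laugh> accept <unk> as a synonym.
  // There is deliberately no switch for this case.
  if (std::regex_match(token, tag_pattern)) {
    if (token != "<unk>") alternatives.push_back("<unk>");
    return alternatives;
  }

  // Tokens without a hyphen get no automatic synonym.
  if (token.find('-') == std::string::npos) return alternatives;

  // Case 1: cutoff words end in a hyphen (hello-).
  // Case 2: any other hyphenated token is treated as a compound (long-term).
  const bool is_cutoff = token.back() == '-';
  if ((is_cutoff && disable_cutoffs) || (!is_cutoff && disable_hyphen_ignore)) {
    return alternatives;
  }

  // Assumption: removing the hyphen means deleting every '-' character.
  std::string stripped = token;
  stripped.erase(std::remove(stripped.begin(), stripped.end(), '-'),
                 stripped.end());
  if (!stripped.empty()) alternatives.push_back(stripped);
  return alternatives;
}
```

Under these assumptions, `AutoSynonyms("hello-")` yields `hello`, `AutoSynonyms("long-term")` yields `longterm`, and `AutoSynonyms("<laugh>")` yields `<unk>`; passing `true` for the matching disable parameter returns an empty list for the first two.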

### Normalizations
Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in a JSON format, with the unique ID as keys. Example to illustrate the schema:
9 changes: 0 additions & 9 deletions src/fstalign.cpp
@@ -594,15 +594,6 @@ void write_stitches_to_nlp(vector<Stitching>& stitches, ofstream &output_nlp_fil
       logger->warn("an unnormalized token was found: {}", ref_tk);
     }
   } else if (IsNoisecodeToken(original_nlp_token)) {
-    // if we have a noisecode <.*> in the nlp token, we inject it here
-    if (stitch.comment.length() == 0) {
-      if (ref_tk == DEL || ref_tk == "") {
-        stitch.comment = "sub(<eps>)";
-      } else {
-        stitch.comment = "sub(" + ref_tk + ")";
-      }
-    }
-
     ref_tk = original_nlp_token;
   } else if (stitch.comment.find("ins") == 0) {
     assert(add_inserts);
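
To make the effect of this hunk concrete, here is a minimal before/after sketch of the noisecode branch. It is not the project's code: `StitchSketch`, `kDelSketch`, `NoisecodeBefore`, and `NoisecodeAfter` are hypothetical stand-ins, and the value given to the `DEL` placeholder is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for the comment field of fstalign's Stitching type.
struct StitchSketch {
  std::string comment;
};

// Assumed placeholder for the DEL constant; the real value is not shown in this diff.
const std::string kDelSketch = "<del>";

// Before the commit: an empty comment on a noisecode token was filled with a
// sub(...) annotation, then the original NLP token was restored.
std::string NoisecodeBefore(StitchSketch& stitch, const std::string& ref_tk,
                            const std::string& original_nlp_token) {
  if (stitch.comment.empty()) {
    if (ref_tk == kDelSketch || ref_tk.empty()) {
      stitch.comment = "sub(<eps>)";
    } else {
      stitch.comment = "sub(" + ref_tk + ")";
    }
  }
  return original_nlp_token;
}

// After the commit: the original NLP token is restored and the comment is
// left untouched.
std::string NoisecodeAfter(StitchSketch& stitch, const std::string& ref_tk,
                           const std::string& original_nlp_token) {
  (void)stitch;  // unused on purpose: no annotation is injected any more
  (void)ref_tk;
  return original_nlp_token;
}
```

With `ref_tk` set to `<unk>` and the original token `<laugh>`, the first function leaves `sub(<unk>)` in the comment while the second leaves it empty, which is consistent with the `sub(<unk>)` entries dropped from the test fixtures in this commit.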
2 changes: 1 addition & 1 deletion test/data/align_1.aligned.punc_case.nlp
@@ -4,7 +4,7 @@ b|1|3.0000|4.0000|||LC|[]|[]||||
c|1|5.0000|6.0000|||LC|[]|[]||||
d|1|7.0000|8.0000|,||LC|[]|[]||||
,|1|7.0000|8.0000|||||[]||||
-<laugh>|1|9.0000|10.0000|.||LC|['0:FALLBACK']|[]|||sub(<unk>)|
+<laugh>|1|9.0000|10.0000|.||LC|['0:FALLBACK']|[]||||
.|1|11.0000|12.0000|||||[]|||sub(e)|
e|1|11.0000|12.0000|||LC|[]|[]|||sub(.)|
f|1|13.0000|14.0000|||LC|[]|[]||||
2 changes: 1 addition & 1 deletion test/data/align_1.ref.aligned.nlp
@@ -3,7 +3,7 @@ a|1|1.0000|2.0000|||CA|[]|[]||||
b|1|3.0000|4.0000|||LC|[]|[]||||
c|1|5.0000|6.0000|||LC|[]|[]||||
d|1|7.0000|8.0000|,||LC|[]|[]||||
-<laugh>|1|9.0000|10.0000|.||LC|['0:FALLBACK']|[]|||sub(<unk>)|
+<laugh>|1|9.0000|10.0000|.||LC|['0:FALLBACK']|[]||||
e|1|11.0000|12.0000|||LC|[]|[]||||
f|1|13.0000|14.0000|||LC|[]|[]||||
g|1|15.0000|16.0000|||LC|[]|[]||||
4 changes: 2 additions & 2 deletions test/data/noise_1.hyp2.aligned
@@ -3,11 +3,11 @@ a|1|1.0000|2.0000|||CA|[]|[]||||
b|1|3.0000|4.0000|||LC|[]|[]||||
c|1|5.0000|6.0000|||LC|[]|[]||||
d|1|7.0000|8.0000|,||LC|[]|[]||||
-<inaudible>|1|9.0000|10.0000|,||LC|[]|[]|||sub(<unk>)|
+<inaudible>|1|9.0000|10.0000|,||LC|[]|[]||||
e|1|11.0000|12.0000|||LC|[]|[]||||
F|1|13.0000|14.0000|||LC|[]|[]||||
G|1|15.0000|16.0000|||LC|[]|[]||||
h|1|17.0000|18.0000|||LC|[]|[]||||
-<foreign>|1|19.0000|20.0000|,||LC|[]|[]|||sub(<unk>)|
+<foreign>|1|19.0000|20.0000|,||LC|[]|[]||||
i|1|21.0000|22.0000|||LC|[]|[]||||
j|1|23.0000|24.0000|||LC|[]|[]||||
