Spellchecking ASR customization model #6179

Merged Jun 2, 2023 (197 commits)
545598f
bug fixes
Oct 12, 2022
adb1ce2
fix bugs, add preparation and evaluation scripts, add readme
Oct 19, 2022
37693f4
small fixes
Oct 19, 2022
16a75f0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2022
7f02059
add real coverage calculation, small fixes, more debug information
bene-ges Nov 3, 2022
2ee091c
add option to pass a filelist and output folder - to handle inference…
bene-ges Nov 4, 2022
540ed99
added preprocessing for yago wikipedia articles - finding yago entiti…
bene-ges Nov 24, 2022
1932dce
yago wiki preprocessing, sampling, pseudonormalization
bene-ges Nov 28, 2022
047c7c8
more scripts for preparation of training examples
bene-ges Dec 9, 2022
ba1a79b
bug fixes
bene-ges Dec 10, 2022
996aa5e
add some alphabet checks
bene-ges Dec 15, 2022
ee2fe28
add bert on subwords, concatenate it to bert on characters
bene-ges Nov 11, 2022
14d8c80
add calculation of character_pos_to_subword_pos
bene-ges Nov 12, 2022
4c975e1
bug fix
bene-ges Nov 12, 2022
a4069dd
bug fix
bene-ges Nov 12, 2022
82b2b4c
pdb
bene-ges Nov 12, 2022
9ad5b7c
tensor join bug fix
bene-ges Nov 12, 2022
6d63ad5
double hidden_size in classifier
bene-ges Nov 12, 2022
bee59f2
pdb
bene-ges Nov 12, 2022
fcd5e8f
default index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
fb3e927
pad index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
b2c5c5b
remove pdb
bene-ges Nov 12, 2022
c389b36
fix bugs, add creation of tarred dataset
bene-ges Dec 16, 2022
9d638a7
add possibility to change sequence len at inference
bene-ges Dec 18, 2022
50d1147
change sampling of dummy candidates at inference, add candidate info …
bene-ges Dec 19, 2022
55205fd
fix import
bene-ges Dec 19, 2022
be14005
fix bug
bene-ges Dec 19, 2022
02bc90a
update transcription now uses info
bene-ges Dec 20, 2022
c12b45f
write path
bene-ges Dec 20, 2022
69fd821
1. add tarred dataset support(untested). 2. fix bug with ban_ngrams i…
bene-ges Dec 22, 2022
1c3793c
skip short_sent if no real candidates
bene-ges Dec 22, 2022
0e6a981
fix import
bene-ges Dec 22, 2022
b3f0f28
add braceexpand
bene-ges Dec 22, 2022
f087a6d
fixes
bene-ges Dec 22, 2022
955e59a
fix bug
bene-ges Dec 22, 2022
ecf6ca5
fix bug
bene-ges Dec 22, 2022
d82f47f
fix bug in np.ones
bene-ges Dec 28, 2022
68ee337
fix bug in collate
bene-ges Dec 28, 2022
cd6a265
change tensor type to long because of error in torch.gather
bene-ges Dec 28, 2022
37ae2df
fix for empty spans tensor
bene-ges Dec 28, 2022
4049984
same fixes in _collate_fn for tarred dataset
bene-ges Dec 28, 2022
b0adc87
fix bug from previous commit
bene-ges Dec 28, 2022
9328161
change int types to be shorter to minimize tar size
bene-ges Dec 28, 2022
b198c0c
refactoring of datasets and inference
bene-ges Dec 30, 2022
0c93c11
bug fix
bene-ges Dec 30, 2022
d9cd84e
bug fix
bene-ges Dec 30, 2022
02fb3f8
bug fix
bene-ges Dec 30, 2022
6454517
tar by 100k examples, small fixes
bene-ges Jan 5, 2023
f088491
small fixes, add analytics script
bene-ges Jan 9, 2023
d22c7ff
Add functions for dynamic programming comparison to get best path by …
bene-ges Jan 13, 2023
28deac6
fixes
bene-ges Feb 1, 2023
a5706ff
small fix
bene-ges Feb 1, 2023
4b2dee1
fixes to support testing on SPGISpeech
bene-ges Feb 15, 2023
ca33fb9
add preprocessing for userlibri
bene-ges Feb 21, 2023
82c909b
some refactoring
bene-ges Mar 5, 2023
287d67b
some refactoring
bene-ges Mar 6, 2023
3bdff5d
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
2acc888
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
f778141
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
28c0246
small refactoring before pr. Add bash-scripts reproducing evaluation
bene-ges Mar 12, 2023
3843abd
style fix
bene-ges Mar 13, 2023
bc1f8c1
small fixes in inference
bene-ges Apr 19, 2023
cbc24d8
bug fix - didn't move window on last symbol
bene-ges Apr 21, 2023
0d9c001
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
2dd24ed
fix bug - shuffle was before truncation of sorted candidates
bene-ges Apr 21, 2023
fe75a76
refactoring, fix some bugs
bene-ges Apr 27, 2023
edd48fa
various fixes. Add word_indices at inference
bene-ges May 2, 2023
c989676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
a055199
add candidate positions
bene-ges May 3, 2023
1df3e90
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 2, 2023
ef83a8e
Move data preparation and evaluation to other repo
bene-ges May 5, 2023
110f2df
add infer_reproduce_paper. Refactoring
bene-ges May 6, 2023
4fb86fc
refactor inference using fragment indices
bene-ges May 10, 2023
1b8dafe
add some helper functions
bene-ges May 13, 2023
33a6e9f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 10, 2023
fd86468
fix bug with parameters order
bene-ges May 13, 2023
abf4a8f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
3b174a0
fix bugs
bene-ges May 17, 2023
0454953
refactoring, fix bug
bene-ges May 20, 2023
05c8fe7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
e83b38b
add multiple variants of adjusting start/end positions
bene-ges May 20, 2023
4ce665c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
2d01c26
more fixes
bene-ges May 21, 2023
7b09133
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
c918d40
add unit tests, other fixes
bene-ges May 22, 2023
c79b676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 22, 2023
997b940
fix
bene-ges May 22, 2023
9db922d
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 22, 2023
ecd58e4
fix CodeQl warnings
bene-ges May 23, 2023
7929aa0
bug fixes
Oct 12, 2022
dd3f784
fix bugs, add preparation and evaluation scripts, add readme
Oct 19, 2022
3277dd2
small fixes
Oct 19, 2022
a358134
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2022
76e637b
add real coverage calculation, small fixes, more debug information
bene-ges Nov 3, 2022
b32dac0
add option to pass a filelist and output folder - to handle inference…
bene-ges Nov 4, 2022
1e8b103
added preprocessing for yago wikipedia articles - finding yago entiti…
bene-ges Nov 24, 2022
e4528f7
yago wiki preprocessing, sampling, pseudonormalization
bene-ges Nov 28, 2022
e1d5d04
more scripts for preparation of training examples
bene-ges Dec 9, 2022
5cb6d32
bug fixes
bene-ges Dec 10, 2022
3d78770
add some alphabet checks
bene-ges Dec 15, 2022
e803d4e
add bert on subwords, concatenate it to bert on characters
bene-ges Nov 11, 2022
a381a61
add calculation of character_pos_to_subword_pos
bene-ges Nov 12, 2022
b8c6e4f
bug fix
bene-ges Nov 12, 2022
30bc4cd
bug fix
bene-ges Nov 12, 2022
4c323c4
pdb
bene-ges Nov 12, 2022
4f9d0c8
tensor join bug fix
bene-ges Nov 12, 2022
0e98191
double hidden_size in classifier
bene-ges Nov 12, 2022
a081e56
pdb
bene-ges Nov 12, 2022
d4f29af
default index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
5b581ba
pad index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
6fab32e
remove pdb
bene-ges Nov 12, 2022
66e11dc
fix bugs, add creation of tarred dataset
bene-ges Dec 16, 2022
bf3bed1
add possibility to change sequence len at inference
bene-ges Dec 18, 2022
5142e73
change sampling of dummy candidates at inference, add candidate info …
bene-ges Dec 19, 2022
c19b54c
fix import
bene-ges Dec 19, 2022
650ca4c
fix bug
bene-ges Dec 19, 2022
5f848e0
update transcription now uses info
bene-ges Dec 20, 2022
8ce1037
write path
bene-ges Dec 20, 2022
aee765c
1. add tarred dataset support(untested). 2. fix bug with ban_ngrams i…
bene-ges Dec 22, 2022
46566d3
skip short_sent if no real candidates
bene-ges Dec 22, 2022
96335d6
fix import
bene-ges Dec 22, 2022
b8dc2aa
add braceexpand
bene-ges Dec 22, 2022
353f016
fixes
bene-ges Dec 22, 2022
e8ecf54
fix bug
bene-ges Dec 22, 2022
d3cdf00
fix bug
bene-ges Dec 22, 2022
1a2dbf5
fix bug in np.ones
bene-ges Dec 28, 2022
3bda8f7
fix bug in collate
bene-ges Dec 28, 2022
c73eb22
change tensor type to long because of error in torch.gather
bene-ges Dec 28, 2022
2401fc4
fix for empty spans tensor
bene-ges Dec 28, 2022
e21781c
same fixes in _collate_fn for tarred dataset
bene-ges Dec 28, 2022
6cfe2c7
fix bug from previous commit
bene-ges Dec 28, 2022
0d87dc7
change int types to be shorter to minimize tar size
bene-ges Dec 28, 2022
f345ead
refactoring of datasets and inference
bene-ges Dec 30, 2022
bb89dfd
bug fix
bene-ges Dec 30, 2022
c91b0c5
bug fix
bene-ges Dec 30, 2022
a966fa9
bug fix
bene-ges Dec 30, 2022
2080c87
tar by 100k examples, small fixes
bene-ges Jan 5, 2023
740ed63
small fixes, add analytics script
bene-ges Jan 9, 2023
1d8d1b0
Add functions for dynamic programming comparison to get best path by …
bene-ges Jan 13, 2023
af50cdb
fixes
bene-ges Feb 1, 2023
2276bb2
small fix
bene-ges Feb 1, 2023
6a1e11f
fixes to support testing on SPGISpeech
bene-ges Feb 15, 2023
82d551d
add preprocessing for userlibri
bene-ges Feb 21, 2023
8fd5b34
some refactoring
bene-ges Mar 5, 2023
cf9484c
some refactoring
bene-ges Mar 6, 2023
fca2a11
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
e1c43a0
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
7ffb782
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
ff00416
small refactoring before pr. Add bash-scripts reproducing evaluation
bene-ges Mar 12, 2023
17e36b4
style fix
bene-ges Mar 13, 2023
c497fd2
small fixes in inference
bene-ges Apr 19, 2023
1bdef8f
bug fix - didn't move window on last symbol
bene-ges Apr 21, 2023
e459556
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
32a1535
fix bug - shuffle was before truncation of sorted candidates
bene-ges Apr 21, 2023
98c7486
refactoring, fix some bugs
bene-ges Apr 27, 2023
7554ab4
various fixes. Add word_indices at inference
bene-ges May 2, 2023
8a91b03
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
a28c305
add candidate positions
bene-ges May 3, 2023
7e86efd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 2, 2023
1fccfa3
Move data preparation and evaluation to other repo
bene-ges May 5, 2023
681a323
add infer_reproduce_paper. Refactoring
bene-ges May 6, 2023
8915c0c
refactor inference using fragment indices
bene-ges May 10, 2023
e74cd27
add some helper functions
bene-ges May 13, 2023
5251684
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 10, 2023
a63d6be
fix bug with parameters order
bene-ges May 13, 2023
ad8e2ff
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
2b80e25
fix bugs
bene-ges May 17, 2023
1b790bc
refactoring, fix bug
bene-ges May 20, 2023
7300de5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
5ef0c65
add multiple variants of adjusting start/end positions
bene-ges May 20, 2023
303f571
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
ec1442a
more fixes
bene-ges May 21, 2023
374664c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
02467c3
add unit tests, other fixes
bene-ges May 22, 2023
cf08a46
fix
bene-ges May 22, 2023
3d9ba36
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 22, 2023
868e248
fix CodeQl warnings
bene-ges May 23, 2023
9997419
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 23, 2023
691d7eb
add script for full inference pipeline, refactoring
bene-ges May 26, 2023
f6b819a
add tutorial
bene-ges May 26, 2023
3fa3b62
take example data from HuggingFace
bene-ges May 27, 2023
f35e331
add docs
bene-ges May 27, 2023
ce13037
fix comment
bene-ges May 27, 2023
8b2c1fa
fix bug
bene-ges May 30, 2023
4eddb80
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 30, 2023
428b32e
small fixes for PR
bene-ges May 31, 2023
0365036
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
f5d5ffd
add some more tests
bene-ges May 31, 2023
0d34292
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
c8a60e5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2023
c1edc67
try to fix tests adding with_downloads
bene-ges May 31, 2023
c40d2a1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
35a1ec1
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 31, 2023
ecde7bc
skip tests with tokenizer download
bene-ges Jun 1, 2023
2262ec1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
e36ce3d
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
a8664da
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 2, 2023
change tensor type to long because of error in torch.gather
Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
bene-ges committed May 22, 2023
commit cd6a26597510beb32a8483229e46173c16699eb0
@@ -180,16 +180,16 @@ def _collate_fn(self, batch):
             padded_spans.append(spans)
 
         return (
-            torch.IntTensor(padded_input_ids),
-            torch.IntTensor(padded_input_mask),
-            torch.IntTensor(padded_segment_ids),
-            torch.IntTensor(padded_input_ids_for_subwords),
-            torch.IntTensor(padded_input_mask_for_subwords),
-            torch.IntTensor(padded_segment_ids_for_subwords),
-            torch.IntTensor(padded_character_pos_to_subword_pos),
-            torch.IntTensor(padded_labels_mask),
-            torch.IntTensor(padded_labels),
-            torch.IntTensor(padded_spans)
+            torch.LongTensor(padded_input_ids),
+            torch.LongTensor(padded_input_mask),
+            torch.LongTensor(padded_segment_ids),
+            torch.LongTensor(padded_input_ids_for_subwords),
+            torch.LongTensor(padded_input_mask_for_subwords),
+            torch.LongTensor(padded_segment_ids_for_subwords),
+            torch.LongTensor(padded_character_pos_to_subword_pos),
+            torch.LongTensor(padded_labels_mask),
+            torch.LongTensor(padded_labels),
+            torch.LongTensor(padded_spans)
         )
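
The commit message attributes the change to an error in torch.gather but does not show the failing call. A minimal sketch of the likely cause, assuming the model gathers subword embeddings by character position: torch.gather documents its index argument as a LongTensor, so an index built with torch.IntTensor is expected to be rejected. The toy tensors and the gather_subwords helper below are hypothetical and are not taken from the PR.

```python
import torch

# Hypothetical toy data: 1 sequence, 4 subword embeddings of size 2.
subword_embeddings = torch.arange(8, dtype=torch.float32).reshape(1, 4, 2)
# Hypothetical mapping from 6 character positions to subword positions.
char_to_subword = [[0, 0, 1, 2, 2, 3]]


def gather_subwords(embeddings, positions):
    """Pick one subword embedding per character position (illustrative helper)."""
    # Expand the index to (batch, chars, hidden) so gather selects whole vectors along dim 1.
    index = positions.unsqueeze(-1).expand(-1, -1, embeddings.shape[-1])
    return torch.gather(embeddings, 1, index)


# int64 index, as produced by _collate_fn after this commit.
long_result = gather_subwords(subword_embeddings, torch.LongTensor(char_to_subword))

# int32 index, as produced before this commit; record whether gather rejects it.
try:
    gather_subwords(subword_embeddings, torch.IntTensor(char_to_subword))
    int_index_failed = False
except RuntimeError:
    int_index_failed = True

print(long_result.shape, int_index_failed)
```

Converting every returned tensor to LongTensor, as the diff does, costs more memory per batch than converting only the index tensor, but it keeps the collate output uniform.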