Support non-linguistic symbol in uio #819
Conversation
1. Lint fix.
Hi, thanks for your great contribution! However, I do think we can make the implementation much simpler. For example, if we force non-linguistic symbols to appear in a fixed format, the split reduces to:

```python
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
pieces = non_lang_syms_pattern.split(txt.upper())
pieces = [w for w in pieces if len(w.strip()) > 0]
```

Below is the verification:

```python
import re

txt = "我是{NOISE} do you known im{AAA} haha {nnb}s {c}"

# Solution-a
non_lang_syms = ["{NOISE}", "{AAA}", "{nnb}", "{c}"]
rs = [re.compile(re.escape(x)) for x in non_lang_syms]


def exist_or_not(i, match_pos):
    """Return the [start, end) span in match_pos covering index i, if any."""
    start_pos = None
    end_pos = None
    for pos in match_pos:
        if pos[0] <= i < pos[1]:
            start_pos = pos[0]
            end_pos = pos[1]
            break
    return start_pos, end_pos


def __tokenize_non_lang_syms(txt, non_lang_sym_patterns):
    """Split txt by non_lang_sym_patterns.

    Args:
        txt: original transcript.
        non_lang_sym_patterns: non-lang-sym regex patterns.

    Returns:
        [[piece, is_non_lang_sym], [piece, True], ..., [piece, False]]
    """
    match_pos = []
    for r in non_lang_sym_patterns:
        i = 0
        while i >= 0:
            m = r.search(txt, i)
            if m:
                match_pos.append([m.start(), m.end()])
                i = m.end()
            else:
                break
    if len(match_pos) > 0:
        pieces = []
        i = 0
        while i < len(txt):
            start_pos, end_pos = exist_or_not(i, match_pos)
            if start_pos is not None:
                pieces.append([txt[start_pos:end_pos], True])
                i = end_pos
            else:
                if len(pieces) == 0 or pieces[-1][1]:
                    pieces.append([txt[i], False])
                else:
                    pieces[-1][0] += txt[i]
                i += 1
        return pieces
    else:
        return [[txt, False]]


parts_a = __tokenize_non_lang_syms(txt, rs)

# Solution-b
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
parts_b = non_lang_syms_pattern.split(txt.upper())
parts_b = [w for w in parts_b if len(w.strip()) > 0]

# Check
assert len(parts_a) == len(parts_b)
for i, j in zip(parts_a, parts_b):
    assert i[0].strip().upper() == j.strip().upper()
```

Besides, we can also simplify the control flow of the tokenization function:

```python
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')  # fixed format
for sample in data:
    assert 'txt' in sample
    txt = sample['txt'].strip()
    parts = non_lang_syms_pattern.split(txt.upper())
    parts = [w for w in parts if len(w.strip()) > 0]
    for part in parts:
        if part in non_lang_syms:
            tokens.append(part)
        else:
            assert part[0] != '{'  # do some checks
            if bpe_model is not None:
                ...  # reuse BPE logic
            else:
                ...  # reuse non-BPE logic
```
In my own data, non-lang symbols are all surrounded by "{}". But in some data, the formats of the non-lang symbols are mixed (e.g. {BRK}, #, ##, and so on), so we would need to construct a complicated pattern. Also, most parts of this commit are copied from text2token.py, and I am not sure if there is a potential purpose behind that. Otherwise, we need to add some documentation to warn users to use specific formats.
We can just add a format check after reading the non-linguistic symbol table, in case it contains symbols in an unexpected format. The reasons for choosing a fixed format lie in:
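A minimal sketch of such a format check, assuming the three fixed formats ({..}, [..], <..>) proposed later in this thread; the function name `check_non_lang_syms` is hypothetical:

```python
import re

# Accept only symbols in one of three fixed formats: {..}, [..], <..>.
# (Assumption: these formats come from the pattern discussed in this thread.)
VALID_SYM = re.compile(r"\{[^}]+\}|\[[^\]]+\]|<[^>]+>")


def check_non_lang_syms(symbols):
    """Raise ValueError if any symbol is not in a recognized fixed format."""
    for sym in symbols:
        if VALID_SYM.fullmatch(sym) is None:
            raise ValueError(f"unsupported non-linguistic symbol format: {sym}")


check_non_lang_syms(["{NOISE}", "[BRK]", "<spk1>"])  # passes silently
```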
@yzmyyff It seems xingchen's solution is simpler; could you refine the PR according to his suggestion?
In the test code, if we change `non_lang_syms` to a mixed-format set, we must construct a more complicated regex pattern to cover it. I think this just moves the complexity to the pattern-construction stage.
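As a sketch of that construction stage (the symbol table below is made up for illustration): escaping each symbol and sorting longest-first lets one regex cover mixed formats, at the cost of rebuilding the pattern whenever the table changes:

```python
import re

# Hypothetical mixed-format symbol table.
non_lang_syms = ["{NOISE}", "[BRK]", "<spk>", "##", "#"]

# Escape each symbol; sort longest-first so "##" is tried before "#".
alternation = "|".join(
    re.escape(s) for s in sorted(non_lang_syms, key=len, reverse=True)
)
pattern = re.compile("(" + alternation + ")")

parts = [p for p in pattern.split("hello{NOISE}world## ok#") if p]
# parts: ['hello', '{NOISE}', 'world', '##', ' ok', '#']
```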
Got it. Thanks for your suggestion!
We have to take the non-linguistic symbol pattern into account. For example, the pattern is:
I mean we can achieve this by replacing the regex pattern with: `non_lang_syms_pattern = r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})"`
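For reference, a quick check of how that pattern behaves on a made-up transcript. One subtlety: with three capture groups, `re.split` emits `None` for the groups that did not match, so they must be filtered out along with empty pieces:

```python
import re

non_lang_syms_pattern = re.compile(r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})")

txt = "hi [BRK] there <spk1> ok {NOISE}"
# Each separator contributes one entry per group; unmatched groups are None.
parts = [p for p in non_lang_syms_pattern.split(txt)
         if p is not None and p.strip()]
# parts: ['hi ', '[BRK]', ' there ', '<spk1>', ' ok ', '{NOISE}']
```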
As I mentioned before: if we do not consider forward compatibility and we add some documentation about the special symbols, the new method would be a good starting point. Otherwise, the previous version may be simpler.
Yes, we see. xingchen just wants to keep it simple and stupid; the pattern is just for some English datasets that have special symbols, and the above three patterns are enough to cover them.
In fact, I have seen some corpora use "#" to indicate the length of a speech pause: # means stop level 1, ## means stop level 2, and so on. I think leaving it to the user is a good choice.
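For example (made-up transcript), a user-supplied pattern can handle such pause markers, as long as the longer marker comes first in the alternation:

```python
import re

# "##" must precede "#" in the alternation, or a level-2 pause would be
# split into two level-1 markers.
pause_pattern = re.compile(r"(##|#)")

parts = [p for p in pause_pattern.split("今天#天气##很好") if p]
# parts: ['今天', '#', '天气', '##', '很好']
```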
Alright
resolves #796