Support non-linguistic symbol in uio #819

Merged · 5 commits · Nov 26, 2021

Conversation

yzmyyff
Contributor

@yzmyyff yzmyyff commented Nov 25, 2021

resolves #796

@yzmyyff yzmyyff requested a review from robin1001 November 25, 2021 08:50
@xingchensong
Member

Hi, thanks for your great contribution! However, I do think we can make the implementation much simpler. For example, if we force non-linguistic symbols to appear in a fixed format (i.e. {XXXXXX}), then the function __tokenize_non_lang_syms can simply be replaced with three lines of code:

non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
pieces = non_lang_syms_pattern.split(txt.upper())
pieces = [w for w in pieces if len(w.strip()) > 0]

Below is the verification:

import re

txt = "我是{NOISE} do you known im{AAA} haha {nnb}s {c}"

# Solution-a
non_lang_syms = ["{NOISE}", "{AAA}", "{nnb}", "{c}"]
rs = [re.compile(re.escape(x)) for x in non_lang_syms]


def exist_or_not(i, match_pos):
    """Return (start, end) of the match covering position i, or (None, None)."""
    start_pos = None
    end_pos = None
    for pos in match_pos:
        if pos[0] <= i < pos[1]:
            start_pos = pos[0]
            end_pos = pos[1]
            break

    return start_pos, end_pos


def __tokenize_non_lang_syms(txt, non_lang_sym_patterns):
    """split txt by non_lang_sym_patterns
    Args:
        txt: original transcript.
        non_lang_sym_patterns: non lang sym regex pattern.
    Returns:
        [[piece, is_non_lang_sym], [piece, True], ..., [piece, False]]
    """
    match_pos = []
    for r in non_lang_sym_patterns:
        i = 0
        while i >= 0:
            m = r.search(txt, i)
            if m:
                match_pos.append([m.start(), m.end()])
                i = m.end()
            else:
                break

    if len(match_pos) > 0:
        pieces = []
        i = 0
        while i < len(txt):
            start_pos, end_pos = exist_or_not(i, match_pos)
            if start_pos is not None:
                pieces.append([txt[start_pos: end_pos], True])
                i = end_pos
            else:
                if len(pieces) == 0 or (len(pieces) > 0 and pieces[-1][1]):
                    pieces.append([txt[i], False])
                else:
                    pieces[-1][0] += txt[i]
                i += 1

        return pieces
    else:
        return [[txt, False]]


parts_a = __tokenize_non_lang_syms(txt, rs)

# Solution-b
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
parts_b = non_lang_syms_pattern.split(txt.upper())
parts_b = [w for w in parts_b if len(w.strip()) > 0]


# Check
assert len(parts_a) == len(parts_b)
for i, j in zip(parts_a, parts_b):
    assert i[0].strip().upper() == j.strip().upper()
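
For reference, running the script above unchanged, both solutions end up with the same eight pieces; the printout below is just the expected result:

print(parts_b)
# ['我是', '{NOISE}', ' DO YOU KNOWN IM', '{AAA}', ' HAHA ', '{NNB}', 'S ', '{C}']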

Besides, we can also simplify the control flow of the tokenize() function by reusing the existing BPE/non-BPE branches without any modification:

non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')  # fixed format
for sample in data:
    assert 'txt' in sample
    txt = sample['txt'].strip()
    parts = non_lang_syms_pattern.split(txt.upper())
    parts = [w for w in parts if len(w.strip()) > 0]
    for part in parts:
        if part in non_lang_syms:
            tokens.append(part)
        else:
            assert part[0] != '{'  # do some checks
            if bpe_model is not None:
                ...  # reuse BPE logic (unchanged)
            else:
                ...  # reuse non-BPE logic (unchanged)
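
To make that control flow concrete, here is a self-contained toy version: the {XXX} split is the same, but the BPE branch is stubbed out and a naive character-level fallback stands in for wenet's real BPE/non-BPE logic (toy_tokenize and the fallback are illustrative assumptions, not the PR's code):

import re

non_lang_syms = {'{NOISE}', '{AAA}'}
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')  # fixed {XXX} format


def toy_tokenize(txt):
    """Split out {XXX} symbols first, then tokenize the remaining text."""
    tokens = []
    parts = non_lang_syms_pattern.split(txt.upper())
    parts = [w for w in parts if len(w.strip()) > 0]
    for part in parts:
        if part in non_lang_syms:
            tokens.append(part)  # keep the whole symbol as one token
        else:
            assert part[0] != '{'  # no unknown {XXX} symbols should remain
            # naive char-level fallback, standing in for the BPE/non-BPE branches
            tokens.extend(ch for ch in part if ch != ' ')
    return tokens


print(toy_tokenize('我是{NOISE} ok{AAA}'))
# ['我', '是', '{NOISE}', 'O', 'K', '{AAA}']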

@yzmyyff
Contributor Author

yzmyyff commented Nov 25, 2021

In my own data, the non-lang symbols are all wrapped in "{...}",
so the regex split works for me.

In some data, however, the format of the non-lang symbols is mixed (e.g. {BRK}, #, ##, and so on), so we would need to construct a more complicated pattern.

Also, most of this commit is copied from text2token.py; I am not sure whether its complexity serves a purpose I am missing...

Otherwise, we would need to add some documentation warning users to use the specific format.

@xingchensong
Member

We can just add a format check after reading the non-lang symbol table:
https://github.com/wenet-e2e/wenet/pull/819/files#diff-a5e6ec48421b12c254127394ff8947e243de71257d2d7b4868dc9fdcb06adb4eR42

If it contains # or anything else outside that format, raise an error and give a suggestion in the error message: "non-lang symbols should be formatted as {XXXX}; consider changing # to {SHARP}".
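
A minimal sketch of such a check, assuming the symbol table is read into a plain list (the function name check_non_lang_syms is hypothetical, not the PR's actual code):

import re


def check_non_lang_syms(non_lang_syms):
    """Raise if any non-linguistic symbol does not follow the {XXXX} format."""
    pattern = re.compile(r'^\{[A-Za-z]+\}$')
    for sym in non_lang_syms:
        if pattern.match(sym) is None:
            raise ValueError(
                'non-lang symbol "{}" should be formatted as {{XXXX}}, '
                'e.g. consider changing "#" to "{{SHARP}}"'.format(sym))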

The reasons for choosing a fixed format are:

  • the code is easier to read and understand
  • the number of non-lang symbols is small, so it is also easy for users to define and customize their own non-lang symbol table

@robin1001
Collaborator

@yzmyyff xingchen's solution seems simpler; could you refine the PR according to his suggestion?

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

In the test code, if we change non_lang_syms to ["{NOISE}", "{AAA}", "{nnb}"], the two methods give different results.

We would have to construct a more complicated regex pattern to cover this; I think that just moves the complexity to the pattern-construction stage.
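
To see the divergence concretely, only the non_lang_syms line of the verification script needs to change; the comments below describe the expected behaviour, assuming the rest of the script is unchanged:

non_lang_syms = ["{NOISE}", "{AAA}", "{nnb}"]  # "{c}" removed
# Solution-a now leaves "{c}" glued to the surrounding text:
#   ..., ['{nnb}', True], ['s {c}', False]
# Solution-b still splits "{C}" out as its own piece, so
# `assert len(parts_a) == len(parts_b)` raises an AssertionError.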

@xingchensong
Member

> In the test code, if we change non_lang_syms to ["{NOISE}", "{AAA}", "{nnb}"], the two methods give different results.
>
> We would have to construct a more complicated regex pattern to cover this; I think that just moves the complexity to the pattern-construction stage.

  1. leaving {c} as a normal linguistic symbol is kind of weird
  2. we can double-check non_lang_syms in tokenize(), see the sketch below:

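A minimal sketch of what that double check could look like, placed right after the non-lang symbol table is loaded in tokenize() (this is my reading of the suggestion, not the code from the attached screenshot):

# inside tokenize(), after the symbol table has been read:
for sym in non_lang_syms:
    assert non_lang_syms_pattern.fullmatch(sym) is not None, (
        'non-lang symbol "{}" does not follow the {{XXXX}} format'.format(sym))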

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

Got it. Thanks for your suggestion!

@robin1001
Collaborator

robin1001 commented Nov 26, 2021

We have to take the non-linguistic symbol patterns of existing corpora into account. For example, the pattern is:
[xxx] for Switchboard and Fisher
<xxx> for WSJ and CHiME-4

@robin1001
Collaborator

robin1001 commented Nov 26, 2021

I mean we should take {xxx}, <xxx>, and [xxx] into account.

We can achieve this by replacing the regex pattern with:

non_lang_syms_pattern = r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})"
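
A quick check of how that pattern behaves; note that re.split inserts None for the capture groups that did not match, so the filtering step (my assumption about how it would be wired up) has to drop them as well:

import re

non_lang_syms_pattern = re.compile(r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})")

txt = "UH [NOISE] OKAY <SPOKEN_NOISE> RIGHT {BRK}"
parts = non_lang_syms_pattern.split(txt)
parts = [w for w in parts if w is not None and len(w.strip()) > 0]
print(parts)
# ['UH ', '[NOISE]', ' OKAY ', '<SPOKEN_NOISE>', ' RIGHT ', '{BRK}']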

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

As I mentioned before, if we do not need to consider forward compatibility and we add some documentation about the special symbol format, the new method is a good starting point.

Otherwise, the previous version may be simpler.

@robin1001
Collaborator

Yes, we see. xingchen just wants to keep it simple and stupid; the pattern is only for some English datasets that contain special symbols, and the three patterns above are enough to cover them.

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

In fact, I have seen some corpora use "#" to indicate the length of a speech pause: # means pause level 1, ## means pause level 2, and so on.

I think leaving this to the user is a good choice.

@robin1001
Collaborator

#1/#2/#3 are used in TTS corpora. Here we only consider public speech recognition datasets, for the convenience of researchers.
These symbols are meaningless in production speech datasets.

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

Alright

@robin1001 robin1001 merged commit 0355919 into wenet-e2e:main Nov 26, 2021