Support non-linguistic symbol in uio #819

Merged · 5 commits · Nov 26, 2021

Conversation

yzmyyff
Contributor

@yzmyyff yzmyyff commented Nov 25, 2021

resolves #796

@yzmyyff yzmyyff requested a review from robin1001 November 25, 2021 08:50
@xingchensong
Member

Hi, thanks for your great contribution! However, I do think we can make the implementation much simpler. For example, if we force non-linguistic symbols to appear in a fixed format (i.e. {XXXXXX}), then the function __tokenize_non_lang_syms can simply be replaced with three lines of code:

non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
pieces = non_lang_syms_pattern.split(txt.upper())
pieces = [w for w in pieces if len(w.strip()) > 0]

Below is the verification:

import re

txt = "我是{NOISE} do you known im{AAA} haha {nnb}s {c}"

# Solution-a
non_lang_syms = ["{NOISE}", "{AAA}", "{nnb}", "{c}"]
rs = [re.compile(re.escape(x)) for x in non_lang_syms]


def exist_or_not(i, match_pos):
    """Return (start, end) of the match covering position i, or (None, None)."""
    start_pos = None
    end_pos = None
    for pos in match_pos:
        if pos[0] <= i < pos[1]:
            start_pos = pos[0]
            end_pos = pos[1]
            break

    return start_pos, end_pos


def __tokenize_non_lang_syms(txt, non_lang_sym_patterns):
    """split txt by non_lang_sym_patterns
    Args:
        txt: original transcript.
        non_lang_sym_patterns: non lang sym regex pattern.
    Returns:
        [[piece, is_non_lang_sym], [piece, True], ..., [piece, False]]
    """
    match_pos = []
    for r in non_lang_sym_patterns:
        i = 0
        while i >= 0:
            m = r.search(txt, i)
            if m:
                match_pos.append([m.start(), m.end()])
                i = m.end()
            else:
                break

    if len(match_pos) > 0:
        pieces = []
        i = 0
        while i < len(txt):
            start_pos, end_pos = exist_or_not(i, match_pos)
            if start_pos is not None:
                pieces.append([txt[start_pos: end_pos], True])
                i = end_pos
            else:
                if len(pieces) == 0 or (len(pieces) > 0 and pieces[-1][1]):
                    pieces.append([txt[i], False])
                else:
                    pieces[-1][0] += txt[i]
                i += 1

        return pieces
    else:
        return [[txt, False]]


parts_a = __tokenize_non_lang_syms(txt, rs)

# Solution-b
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')
parts_b = non_lang_syms_pattern.split(txt.upper())
parts_b = [w for w in parts_b if len(w.strip()) > 0]


# Check
assert len(parts_a) == len(parts_b)
for i, j in zip(parts_a, parts_b):
    assert i[0].strip().upper() == j.strip().upper()
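
For reference, running the script above unchanged, both solutions end up with the same eight pieces; the printout below is just the expected result:

print(parts_b)
# ['我是', '{NOISE}', ' DO YOU KNOWN IM', '{AAA}', ' HAHA ', '{NNB}', 'S ', '{C}']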

Besides, we can also simplify the control flow of the tokenize() function by reusing the existing BPE/non-BPE branches without any modification:

non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')  # fixed format
for sample in data:
    assert 'txt' in sample
    txt = sample['txt'].strip()
    parts = non_lang_syms_pattern.split(txt.upper())
    parts = [w for w in parts if len(w.strip()) > 0]
    for part in parts:
        if part in non_lang_syms:
            tokens.append(part)
        else:
            assert part[0] != '{'  # do some checks
            if bpe_model is not None:
                ...  # reuse BPE logic (unchanged)
            else:
                ...  # reuse non-BPE logic (unchanged)
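
To make that control flow concrete, here is a self-contained toy version: the {XXX} split is the same, but the BPE branch is stubbed out and a naive character-level fallback stands in for wenet's real BPE/non-BPE logic (toy_tokenize and the fallback are illustrative assumptions, not the PR's code):

import re

non_lang_syms = {'{NOISE}', '{AAA}'}
non_lang_syms_pattern = re.compile(r'(\{[A-Z]+\})')  # fixed {XXX} format


def toy_tokenize(txt):
    """Split out {XXX} symbols first, then tokenize the remaining text."""
    tokens = []
    parts = non_lang_syms_pattern.split(txt.upper())
    parts = [w for w in parts if len(w.strip()) > 0]
    for part in parts:
        if part in non_lang_syms:
            tokens.append(part)  # keep the whole symbol as one token
        else:
            assert part[0] != '{'  # no unknown {XXX} symbols should remain
            # naive char-level fallback, standing in for the BPE/non-BPE branches
            tokens.extend(ch for ch in part if ch != ' ')
    return tokens


print(toy_tokenize('我是{NOISE} ok{AAA}'))
# ['我', '是', '{NOISE}', 'O', 'K', '{AAA}']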

@yzmyyff
Contributor Author

yzmyyff commented Nov 25, 2021

In my own data, the non-lang symbols are all wrapped in "{...}",
so the regex split works for me.

In some data, however, the format of the non-lang symbols is mixed (e.g. {BRK}, #, ##, and so on), so we would need to construct a more complicated pattern.

Also, most of this commit is copied from text2token.py; I am not sure whether its complexity serves a purpose I am missing...

Otherwise, we would need to add some documentation warning users to use the specific format.

@xingchensong
Member

We can just add a format check after reading the non-lang symbol table:
https://github.com/wenet-e2e/wenet/pull/819/files#diff-a5e6ec48421b12c254127394ff8947e243de71257d2d7b4868dc9fdcb06adb4eR42

If it contains # or anything else outside that format, raise an error and give a suggestion in the error message: "non-lang symbols should be formatted as {XXXX}; consider changing # to {SHARP}".
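
A minimal sketch of such a check, assuming the symbol table is read into a plain list (the function name check_non_lang_syms is hypothetical, not the PR's actual code):

import re


def check_non_lang_syms(non_lang_syms):
    """Raise if any non-linguistic symbol does not follow the {XXXX} format."""
    pattern = re.compile(r'^\{[A-Za-z]+\}$')
    for sym in non_lang_syms:
        if pattern.match(sym) is None:
            raise ValueError(
                'non-lang symbol "{}" should be formatted as {{XXXX}}, '
                'e.g. consider changing "#" to "{{SHARP}}"'.format(sym))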

The reasons for choosing a fixed format are:

  • the code is easier to read and understand
  • the number of non-lang symbols is small, so it is also easy for users to define and customize their own non-lang symbol table

@robin1001
Collaborator

@yzmyyff xingchen's solution seems simpler; could you refine the PR according to his suggestion?

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

In the test code, if we change non_lang_syms to ["{NOISE}", "{AAA}", "{nnb}"], the two methods give different results.

We would have to construct a more complicated regex pattern to cover this; I think that just moves the complexity to the pattern-construction stage.
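
To see the divergence concretely, only the non_lang_syms line of the verification script needs to change; the comments below describe the expected behaviour, assuming the rest of the script is unchanged:

non_lang_syms = ["{NOISE}", "{AAA}", "{nnb}"]  # "{c}" removed
# Solution-a now leaves "{c}" glued to the surrounding text:
#   ..., ['{nnb}', True], ['s {c}', False]
# Solution-b still splits "{C}" out as its own piece, so
# `assert len(parts_a) == len(parts_b)` raises an AssertionError.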

@xingchensong
Member

> In the test code, if we change non_lang_syms to ["{NOISE}", "{AAA}", "{nnb}"], the two methods give different results.
>
> We would have to construct a more complicated regex pattern to cover this; I think that just moves the complexity to the pattern-construction stage.

  1. leaving {c} as a normal linguistic symbol is kind of weird
  2. we can double-check non_lang_syms in tokenize(), see the sketch below:

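A minimal sketch of what that double check could look like, placed right after the non-lang symbol table is loaded in tokenize() (this is my reading of the suggestion, not the code from the attached screenshot):

# inside tokenize(), after the symbol table has been read:
for sym in non_lang_syms:
    assert non_lang_syms_pattern.fullmatch(sym) is not None, (
        'non-lang symbol "{}" does not follow the {{XXXX}} format'.format(sym))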

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

Got it. Thanks for your suggestion!

@robin1001
Collaborator

robin1001 commented Nov 26, 2021

We have to take the non-linguistic symbol patterns of existing corpora into account. For example, the pattern is:
[xxx] for Switchboard and Fisher
<xxx> for WSJ and CHiME-4

@robin1001
Collaborator

robin1001 commented Nov 26, 2021

I mean we should take {xxx}, <xxx>, and [xxx] into account.

We can achieve this by replacing the regex pattern with:

non_lang_syms_pattern = r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})"
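
A quick check of how that pattern behaves; note that re.split inserts None for the capture groups that did not match, so the filtering step (my assumption about how it would be wired up) has to drop them as well:

import re

non_lang_syms_pattern = re.compile(r"(\[[^\]]+\])|(<[^>]+>)|({[^}]+})")

txt = "UH [NOISE] OKAY <SPOKEN_NOISE> RIGHT {BRK}"
parts = non_lang_syms_pattern.split(txt)
parts = [w for w in parts if w is not None and len(w.strip()) > 0]
print(parts)
# ['UH ', '[NOISE]', ' OKAY ', '<SPOKEN_NOISE>', ' RIGHT ', '{BRK}']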

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

As I mentioned before, if we do not need to consider forward compatibility and we add some documentation about the special symbol format, the new method is a good starting point.

Otherwise, the previous version may be simpler.

@robin1001
Collaborator

Yes, we see. xingchen just wants to keep it simple and stupid; the pattern is only for some English datasets that contain special symbols, and the three patterns above are enough to cover them.

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

In fact, I have seen some corpora use "#" to indicate the length of a speech pause: # means pause level 1, ## means pause level 2, and so on.

I think leaving this to the user is a good choice.

@robin1001
Collaborator

#1/#2/#3 are used in TTS corpora. Here we only consider public speech recognition datasets, for the convenience of researchers.
These symbols are meaningless in production speech datasets.

@yzmyyff
Contributor Author

yzmyyff commented Nov 26, 2021

Alright

@robin1001 robin1001 merged commit 0355919 into wenet-e2e:main Nov 26, 2021