Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Associated Phrases v2 #130

Merged
merged 15 commits into from
Mar 13, 2024
Merged

Associated Phrases v2 #130

merged 15 commits into from
Mar 13, 2024

Conversation

lukhnos
Copy link
Collaborator

@lukhnos lukhnos commented Mar 12, 2024

這個 PR 支援多字聯想詞,同時考慮聯想詞詞首讀音。例如打「重」,現在「ㄔㄨㄥˊ」跟「ㄓㄨㄥˋ」會有不同聯想詞(例如:重新 vs. 重要)。

同時,引進新資料格式,大幅加速聯想詞載入速度:跟先前相比,在高階筆電上加快 200x-300x。

資料檔跟資料製作腳本都暫先放於本專案。待測試調整後,再行移動到 McBopomofo 專案下的 Source/Data/

Large PR, but it should be able to review each commit independently.

Technical details:

  • Introduces a new data format to allow efficient prefix search.
  • Use byte-sorted prefix search keys to allow the use of ParselessPhraseDB.
  • Introduces a MemoryMappedFile to facilitate the use of mmap.
  • Database opening latency reduced from an average of 10693 μs (that's 10.7 ms) to 35 μs, a ~300x speedup, measured using a Release binary on a Dell XPS laptop.
  • Associated phrases are now found using prefixes that take about the value (the actual characters) and the reading (the bopomofos) into consideration.
  • ParselessPhraseDB no longer asserts pragma validation failures; the db now becomes no-op upon such failures.
  • UTF8Helper is moved to under Engine/ to allow code sharing.

Bug fixes:

  • Fixes a bug where, if Shift-Enter was pressed on a candidate page that was not Page 1, and when the associated phrases candidate panel was cancelled, the restored candidate panel did not have the current candidate index or highlight.
  • Switching input modes no longer reloads the associated phrases.

@lukhnos lukhnos requested a review from zonble March 12, 2024 00:53
lukhnos added 10 commits March 11, 2024 22:53
It uses a new format for the associated phrases, which take a phrase's
reading and value into account. The entries are designed to allow fast
prefix search. Note that the entries are sorted by the entry text's
bytes, not by the score. Actual ranking-by-the-score will be performed
later by the input method.
This allows the Engine code to use it.
It now turns the db a no-op one upon validation error.
Also creates a separate class for managing memory-mapped files. This
commit does not refactor other existing mmap-based language model
classes.
Both Bopomofo and Plain Bopomofo are supported. Bopomofo readings are
now taken into consideration when finding associated phrases. In
addition, in Bopomofo mode, we now enable multi-character prefixes.

This also includes some refactoring the make the API more consistent.
For example, functions now consistently take the reading before taking
the value parameter.
@lukhnos lukhnos force-pushed the dev/associated-phrases-v2 branch from 8ef99e6 to 1151d39 Compare March 12, 2024 02:53
Copy link
Collaborator

@zonble zonble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take care on the MS-BPMF like cursor mode. Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if the script should be here or not. I remember that we are cooking our data in the repo of the macOS version, right?

Copy link
Collaborator Author

@lukhnos lukhnos Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do that eventually. Right now the script lives here so that I can iterate on this feature work quickly, but I'll make sure to move it out to McBopomofo once this is done.

void* data;
size_t length;
private:
int fd_ = -1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming looks confusing. When I read the code for the first time, I cannot understand what the pointer is for, Maybe just data_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented on what fd_ and ptr_ represent. PTAL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood your comment. ptr_ is now renamed to data_.

size_t cursorIndex, const std::string& prefixReading,
const std::string& prefixValue, const std::string& associatedPhraseReading,
const std::string& associatedPhraseValue) {
// First, override the current node.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using MS-Bopomofo style cursor, and it does not correctly insert the associated phrases. I guess we need to care about the line 1386 that sets cursor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

data/derive_associated_phrases.py Outdated Show resolved Hide resolved
data/derive_associated_phrases.py Outdated Show resolved Hide resolved
src/Engine/AssociatedPhrasesV2.cpp Outdated Show resolved Hide resolved
@xatier
Copy link
Contributor

xatier commented Mar 12, 2024

On a side node, I really like the idea of the associated phrases library that takes care of phrase/reading pairs. It'd be great if we can provide an API, or create a CLI tool with it.

To manage my user-defined phrase dictionary, I found that typing ㄅㄆㄇㄈ readings can be annoying with vim; hence I have a python script that helps me generate the following output, with that, all I need to do is just to pick the proper reading and paste it to my user dictionary.

 $ ./f.py 輸入法
輸 ㄕㄨ
入 ㄖㄨˋ
法 ㄈㄚˇ  ㄈㄚˊ  ㄈㄚˋ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˇ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˊ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˋ
Script
#!/usr/bin/env python

import itertools
import sys

BPMF_FILE = '/usr/share/fcitx5/data/mcbopomofo-data-plain-bpmf.txt'


def main() -> None:
    lookup: dict[str, list[str]] = {}
    with open(BPMF_FILE) as f:
        lines: list[str] = f.readlines()[508:] # skip the kaomoji secions

    for line in lines:
        phonetic: str
        char: str
        phonetic, char = line.split()[:2]
        if char not in lookup:
            lookup[char] = [phonetic]
        else:
            lookup[char].append(phonetic)

    phrase: str = sys.argv[1]
    try:
        combo: list[list[str]] = [lookup[c] for c in phrase]

        for c in phrase:
            print(f'{c} {"  ".join(lookup[c])}')

        for t in itertools.product(*combo):
            print(f'{phrase} {"-".join(t)}')
    except KeyError as e:
        print(f'{e} not found from {BPMF_FILE}')


if __name__ == '__main__':
    main()

@lukhnos
Copy link
Collaborator Author

lukhnos commented Mar 12, 2024

Please take care on the MS-BPMF like cursor mode. Thanks!

@zonble Done, PTAL.

It'd be great if we can provide an API, or create a CLI tool with it.

@xatier Good points. I'm hoping by introducing things Entry as a Python @dataclass we are moving towards that direction. :)

I found that typing ㄅㄆㄇㄈ readings can be annoying with vim

@xatier Have you tried the feature Ctrl-Enter? That is, after you finish type "輸入法", don't press Enter, but press Ctrl-Enter. The output should be "ㄕㄨ-ㄖㄨˋ-ㄈㄚˇ".

@lukhnos lukhnos force-pushed the dev/associated-phrases-v2 branch from a662a44 to ac32ca1 Compare March 12, 2024 21:55
lukhnos added 4 commits March 12, 2024 18:03
To correctly find where the prefix node is, we need to take into
consideration that (1) the cursor is always *after* the prefix when
Shift-Enter is pressed, but (2) a user may enter the Associated Phrases
state by pressing Shift-Enter within a candidate panel, and the cursor
there does not always match the prefix node location due to the Hanyin
mode requirements.

We therefore make sure that the actual prefix node location is recorded
in the AssociatedPhrase state object, and we now provide an extra
builder method that takes the ChoosingCandidate->AssociatedPhrases state
transition into consideration.
This improves the user experience of inserting an associated phrase
after the prefix, as the affected node may have non-prefix parts whose
values would be changed upon the next walk if no overrides were made.
@lukhnos lukhnos force-pushed the dev/associated-phrases-v2 branch from ac32ca1 to 6831f6f Compare March 12, 2024 22:03
@xatier
Copy link
Contributor

xatier commented Mar 12, 2024

@lukhnos yes, I'm aware of that the ctrl-enter feature, but I'm also using some other scripts to manage my user dictionary (such as the sorting script I showed in another issue).

@lukhnos
Copy link
Collaborator Author

lukhnos commented Mar 13, 2024

I'm also using some other scripts to manage my user dictionary (such as the sorting script I showed in another issue).

@xatier Ah got it. Now I understand better the idea of wanting an API/CLI tool to the data files!

@xatier
Copy link
Contributor

xatier commented Mar 13, 2024

@lukhnos 👍, this is not urgent at all, just throwing out some ideas that can be future works.

@zonble zonble merged commit fc15114 into master Mar 13, 2024
6 checks passed
@lukhnos lukhnos deleted the dev/associated-phrases-v2 branch March 13, 2024 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants