Associated Phrases v2 #130

lukhnos · 2024-03-12T00:53:11Z

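This PR supports multi-character associated phrases and also takes the reading of the phrase's first character into account. For example, when typing 「重」, 「ㄔㄨㄥˊ」 and 「ㄓㄨㄥˋ」 now yield different associated phrases (e.g. 重新 vs. 重要).

It also introduces a new data format that greatly speeds up loading of the associated phrases: 200x-300x faster than before on a high-end laptop.

The data file and the data-generation script both live in this project for now. After testing and tuning, they will be moved to Source/Data/ in the McBopomofo project.

Large PR, but each commit should be reviewable independently.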

Technical details:

Introduces a new data format to allow efficient prefix search.
Uses byte-sorted prefix search keys to allow the use of ParselessPhraseDB.
Introduces a MemoryMappedFile to facilitate the use of mmap.
Database opening latency is reduced from an average of 10693 μs (10.7 ms) to 35 μs, a ~300x speedup, measured using a Release binary on a Dell XPS laptop.
Associated phrases are now found using prefixes that take both the value (the actual characters) and the reading (the Bopomofo symbols) into consideration.
ParselessPhraseDB no longer asserts on pragma validation failures; the db now becomes a no-op upon such failures.
UTF8Helper is moved under Engine/ to allow code sharing.
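
For reviewers who want a feel for why byte-sorted keys make prefix search cheap, here is a minimal Python sketch. It is not the ParselessPhraseDB code: `prefix_range` and the sample keys are hypothetical, and the sketch assumes the keys are valid UTF-8, so the byte 0xFF never appears in them and can serve as an upper sentinel.

```python
from bisect import bisect_left


def prefix_range(sorted_keys: list[bytes], prefix: bytes) -> tuple[int, int]:
    """Return the half-open index range [lo, hi) of keys that start with prefix.

    sorted_keys must be sorted by raw bytes. Assumption: keys are valid UTF-8,
    so prefix + b"\xff" sorts after every key that begins with prefix.
    """
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, prefix + b"\xff")
    return lo, hi


if __name__ == "__main__":
    # Hypothetical sample keys; the real data file has its own key layout.
    keys = sorted(s.encode("utf-8") for s in ["重要", "輸入", "重新", "音樂", "重量"])
    lo, hi = prefix_range(keys, "重".encode("utf-8"))
    print([k.decode("utf-8") for k in keys[lo:hi]])
```

Because all keys sharing a prefix are contiguous in byte order, two binary searches are enough to find them, and nothing needs to be parsed up front.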

Bug fixes:

Fixes a bug where, if Shift-Enter was pressed on a candidate page other than Page 1 and the associated phrases candidate panel was then cancelled, the restored candidate panel lost the current candidate index and highlight.
Switching input modes no longer reloads the associated phrases.

The associated phrases data uses a new format that takes a phrase's reading and value into account. The entries are designed to allow fast prefix search. Note that the entries are sorted by the bytes of the entry text, not by score; the input method ranks them by score later.
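
To illustrate the split between lookup order and ranking order, here is a small Python sketch. The `Entry` fields, the `rank_matches` helper, and the scores are all hypothetical; the real Entry dataclass in derive_associated_phrases.py and the ranking code in the input method may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical layout, for illustration only.
    text: str
    score: float


def rank_matches(entries: list[Entry], prefix: str) -> list[Entry]:
    """Pick the entries whose text starts with prefix, then order by score.

    The stored order (byte order of the text) only serves the lookup;
    the highest score comes first in the returned list.
    """
    matches = [e for e in entries if e.text.startswith(prefix)]
    return sorted(matches, key=lambda e: e.score, reverse=True)


if __name__ == "__main__":
    # Made-up scores, stored in byte order of the text.
    stored = [Entry("重新", -3.2), Entry("重要", -2.5), Entry("重量", -4.1)]
    for entry in rank_matches(stored, "重"):
        print(entry.text, entry.score)
```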

Both Bopomofo and Plain Bopomofo are supported. Bopomofo readings are now taken into consideration when finding associated phrases. In addition, in Bopomofo mode, we now enable multi-character prefixes. This also includes some refactoring to make the API more consistent. For example, functions now consistently take the reading parameter before the value parameter.
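
A toy Python model of the reading-aware lookup, using the 重 example from the PR description. The table contents, the `find_associated_phrases` name, and the "-" joined reading for multi-character prefixes are assumptions for illustration, not the Engine API; the only detail taken from this PR is that the reading comes before the value.

```python
# Hypothetical table: (prefix reading, prefix value) -> associated phrases.
ASSOCIATED_PHRASES: dict[tuple[str, str], list[str]] = {
    ("ㄔㄨㄥˊ", "重"): ["重新"],
    ("ㄓㄨㄥˋ", "重"): ["重要"],
    # A multi-character prefix, as enabled in Bopomofo mode.
    ("ㄕㄨ-ㄖㄨˋ", "輸入"): ["輸入法"],
}


def find_associated_phrases(prefix_reading: str, prefix_value: str) -> list[str]:
    """Look up phrases for a prefix; the reading parameter comes first."""
    return list(ASSOCIATED_PHRASES.get((prefix_reading, prefix_value), []))


if __name__ == "__main__":
    print(find_associated_phrases("ㄔㄨㄥˊ", "重"))
    print(find_associated_phrases("ㄓㄨㄥˋ", "重"))
    print(find_associated_phrases("ㄕㄨ-ㄖㄨˋ", "輸入"))
```

The same character with two different readings maps to two different result lists, which is the behavior change described above.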

zonble

Please take care of the MS-BPMF-like cursor mode. Thanks!

zonble · 2024-03-12T04:52:57Z

data/derive_associated_phrases.py

I am wondering if the script should be here or not. I remember that we are cooking our data in the repo of the macOS version, right?

We should do that eventually. Right now the script lives here so that I can iterate on this feature work quickly, but I'll make sure to move it out to McBopomofo once this is done.

zonble · 2024-03-12T05:06:37Z

src/Engine/MemoryMappedFile.h

-  void* data;
-  size_t length;
+ private:
+  int fd_ = -1;


The naming looks confusing. When I read the code for the first time, I could not understand what the pointer is for. Maybe just data_?

Commented on what fd_ and ptr_ represent. PTAL.

I misunderstood your comment. ptr_ is now renamed to data_.

zonble · 2024-03-12T05:28:53Z

src/KeyHandler.cpp

+    size_t cursorIndex, const std::string& prefixReading,
+    const std::string& prefixValue, const std::string& associatedPhraseReading,
+    const std::string& associatedPhraseValue) {
+  // First, override the current node.


I am using the MS-Bopomofo style cursor, and it does not correctly insert the associated phrases. I guess we need to look at line 1386, which sets the cursor.

xatier · 2024-03-12T10:27:35Z

On a side note, I really like the idea of the associated phrases library that takes care of phrase/reading pairs. It'd be great if we can provide an API, or create a CLI tool with it.

To manage my user-defined phrase dictionary, I found that typing ㄅㄆㄇㄈ readings can be annoying with vim, so I have a Python script that generates the following output. With that, all I need to do is pick the proper reading and paste it into my user dictionary.

 $ ./f.py 輸入法
輸 ㄕㄨ
入 ㄖㄨˋ
法 ㄈㄚˇ  ㄈㄚˊ  ㄈㄚˋ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˇ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˊ
輸入法 ㄕㄨ-ㄖㄨˋ-ㄈㄚˋ

Script

#!/usr/bin/env python

import itertools
import sys

BPMF_FILE = '/usr/share/fcitx5/data/mcbopomofo-data-plain-bpmf.txt'


def main() -> None:
    if len(sys.argv) != 2:
        sys.exit(f'usage: {sys.argv[0]} PHRASE')

    # Map each character to every reading listed for it in the data file.
    lookup: dict[str, list[str]] = {}
    with open(BPMF_FILE, encoding='utf-8') as f:
        lines: list[str] = f.readlines()[508:]  # skip the kaomoji sections

    for line in lines:
        fields: list[str] = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        phonetic, char = fields[:2]
        lookup.setdefault(char, []).append(phonetic)

    phrase: str = sys.argv[1]
    try:
        combo: list[list[str]] = [lookup[c] for c in phrase]

        # Per-character readings first, then every combination for the phrase.
        for c in phrase:
            print(f'{c} {"  ".join(lookup[c])}')

        for t in itertools.product(*combo):
            print(f'{phrase} {"-".join(t)}')
    except KeyError as e:
        print(f'{e} not found in {BPMF_FILE}')


if __name__ == '__main__':
    main()

lukhnos · 2024-03-12T21:52:29Z

Please take care of the MS-BPMF-like cursor mode. Thanks!

@zonble Done, PTAL.

It'd be great if we can provide an API, or create a CLI tool with it.

@xatier Good points. I'm hoping that by introducing things like Entry as a Python @dataclass we are moving in that direction. :)

I found that typing ㄅㄆㄇㄈ readings can be annoying with vim

@xatier Have you tried the Ctrl-Enter feature? That is, after you finish typing "輸入法", don't press Enter, but press Ctrl-Enter. The output should be "ㄕㄨ-ㄖㄨˋ-ㄈㄚˇ".

To correctly find where the prefix node is, we need to take into consideration that (1) the cursor is always *after* the prefix when Shift-Enter is pressed, but (2) a user may enter the Associated Phrases state by pressing Shift-Enter within a candidate panel, and the cursor there does not always match the prefix node location due to the Hanyin mode requirements. We therefore make sure that the actual prefix node location is recorded in the AssociatedPhrase state object, and we now provide an extra builder method that takes the ChoosingCandidate->AssociatedPhrases state transition into consideration.

xatier · 2024-03-12T23:39:26Z

@lukhnos Yes, I'm aware of the Ctrl-Enter feature, but I'm also using some other scripts to manage my user dictionary (such as the sorting script I showed in another issue).

lukhnos · 2024-03-13T00:22:21Z

I'm also using some other scripts to manage my user dictionary (such as the sorting script I showed in another issue).

@xatier Ah, got it. Now I better understand the idea of wanting an API/CLI tool for the data files!

xatier · 2024-03-13T00:32:30Z

@lukhnos 👍, this is not urgent at all, just throwing out some ideas that could be future work.

lukhnos requested a review from zonble March 12, 2024 00:53

lukhnos added 10 commits March 11, 2024 22:53

Move UTF8Helper to Engine/

cde4aa4

This allows the Engine code to use it.

Improve ParselessPhraseDB's pragma validation

322902b

It now turns the db into a no-op upon validation error.

Introduce AssociatedPhrasesV2

15d3630

Also creates a separate class for managing memory-mapped files. This commit does not refactor other existing mmap-based language model classes.

Drop in the pre-built v2 associated phrases

ebb45bc

Make the v2 associated phrases available

89d74e1

Support for AssociatedPhrasesV2 in McBopomofoLM

8f4d158

Remove the use of AssociatedPhrases v1

2c4bd93

Delete AssociatedPhrases v1 from the Engine code

1151d39

lukhnos force-pushed the dev/associated-phrases-v2 branch from 8ef99e6 to 1151d39 Compare March 12, 2024 02:53

zonble requested changes Mar 12, 2024

xatier reviewed Mar 12, 2024

lukhnos force-pushed the dev/associated-phrases-v2 branch from a662a44 to ac32ca1 Compare March 12, 2024 21:55

lukhnos added 4 commits March 12, 2024 18:03

Split-and-override the node containing the prefix

4b2a84f

This improves the user experience of inserting an associated phrase after the prefix, as the affected node may have non-prefix parts whose values would be changed upon the next walk if no overrides were made.

Address C++ code review comments

f335358

Address Python code review comments

6831f6f

lukhnos force-pushed the dev/associated-phrases-v2 branch from ac32ca1 to 6831f6f Compare March 12, 2024 22:03

Rename ptr_ to data_

54c6972

zonble approved these changes Mar 13, 2024

zonble merged commit fc15114 into master Mar 13, 2024
6 checks passed

lukhnos deleted the dev/associated-phrases-v2 branch March 13, 2024 16:45
