-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Associated Phrases v2 #130
Conversation
It uses a new format for the associated phrases, which take a phrase's reading and value into account. The entries are designed to allow fast prefix search. Note that the entries are sorted by the entry text's bytes, not by the score. Actual ranking-by-the-score will be performed later by the input method.
This allows the Engine code to use it.
It now turns the db a no-op one upon validation error.
Also creates a separate class for managing memory-mapped files. This commit does not refactor other existing mmap-based language model classes.
Both Bopomofo and Plain Bopomofo are supported. Bopomofo readings are now taken into consideration when finding associated phrases. In addition, in Bopomofo mode, we now enable multi-character prefixes. This also includes some refactoring the make the API more consistent. For example, functions now consistently take the reading before taking the value parameter.
8ef99e6
to
1151d39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take care on the MS-BPMF like cursor mode. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if the script should be here or not. I remember that we are cooking our data in the repo of the macOS version, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do that eventually. Right now the script lives here so that I can iterate on this feature work quickly, but I'll make sure to move it out to McBopomofo once this is done.
src/Engine/MemoryMappedFile.h
Outdated
void* data; | ||
size_t length; | ||
private: | ||
int fd_ = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming looks confusing. When I read the code for the first time, I cannot understand what the pointer is for, Maybe just data_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented on what fd_
and ptr_
represent. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood your comment. ptr_
is now renamed to data_
.
src/KeyHandler.cpp
Outdated
size_t cursorIndex, const std::string& prefixReading, | ||
const std::string& prefixValue, const std::string& associatedPhraseReading, | ||
const std::string& associatedPhraseValue) { | ||
// First, override the current node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using MS-Bopomofo style cursor, and it does not correctly insert the associated phrases. I guess we need to care about the line 1386 that sets cursor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
On a side node, I really like the idea of the associated phrases library that takes care of phrase/reading pairs. It'd be great if we can provide an API, or create a CLI tool with it. To manage my user-defined phrase dictionary, I found that typing ㄅㄆㄇㄈ readings can be annoying with vim; hence I have a python script that helps me generate the following output, with that, all I need to do is just to pick the proper reading and paste it to my user dictionary.
Script#!/usr/bin/env python
import itertools
import sys
BPMF_FILE = '/usr/share/fcitx5/data/mcbopomofo-data-plain-bpmf.txt'
def main() -> None:
lookup: dict[str, list[str]] = {}
with open(BPMF_FILE) as f:
lines: list[str] = f.readlines()[508:] # skip the kaomoji secions
for line in lines:
phonetic: str
char: str
phonetic, char = line.split()[:2]
if char not in lookup:
lookup[char] = [phonetic]
else:
lookup[char].append(phonetic)
phrase: str = sys.argv[1]
try:
combo: list[list[str]] = [lookup[c] for c in phrase]
for c in phrase:
print(f'{c} {" ".join(lookup[c])}')
for t in itertools.product(*combo):
print(f'{phrase} {"-".join(t)}')
except KeyError as e:
print(f'{e} not found from {BPMF_FILE}')
if __name__ == '__main__':
main() |
@zonble Done, PTAL.
@xatier Good points. I'm hoping by introducing things
@xatier Have you tried the feature Ctrl-Enter? That is, after you finish type "輸入法", don't press Enter, but press Ctrl-Enter. The output should be "ㄕㄨ-ㄖㄨˋ-ㄈㄚˇ". |
a662a44
to
ac32ca1
Compare
To correctly find where the prefix node is, we need to take into consideration that (1) the cursor is always *after* the prefix when Shift-Enter is pressed, but (2) a user may enter the Associated Phrases state by pressing Shift-Enter within a candidate panel, and the cursor there does not always match the prefix node location due to the Hanyin mode requirements. We therefore make sure that the actual prefix node location is recorded in the AssociatedPhrase state object, and we now provide an extra builder method that takes the ChoosingCandidate->AssociatedPhrases state transition into consideration.
This improves the user experience of inserting an associated phrase after the prefix, as the affected node may have non-prefix parts whose values would be changed upon the next walk if no overrides were made.
ac32ca1
to
6831f6f
Compare
@lukhnos yes, I'm aware of that the ctrl-enter feature, but I'm also using some other scripts to manage my user dictionary (such as the sorting script I showed in another issue). |
@xatier Ah got it. Now I understand better the idea of wanting an API/CLI tool to the data files! |
@lukhnos 👍, this is not urgent at all, just throwing out some ideas that can be future works. |
這個 PR 支援多字聯想詞,同時考慮聯想詞詞首讀音。例如打「重」,現在「ㄔㄨㄥˊ」跟「ㄓㄨㄥˋ」會有不同聯想詞(例如:重新 vs. 重要)。
同時,引進新資料格式,大幅加速聯想詞載入速度:跟先前相比,在高階筆電上加快 200x-300x。
資料檔跟資料製作腳本都暫先放於本專案。待測試調整後,再行移動到 McBopomofo 專案下的
Source/Data/
。Large PR, but it should be able to review each commit independently.
Technical details:
Bug fixes: