
feat(developer): output new TrieModel format when compiling 💾 #12129

Merged 3 commits into epic/model-encoding from feat/developer/compress-compiled-tries on Aug 28, 2024

Conversation

jahorton
Contributor

@jahorton jahorton commented Aug 8, 2024

Fixes #10336.

With these changes in place, Developer will output the new, comparatively-compressed encoded-string Trie format for Trie-based lexical model projects.

For an obvious first trial, when compiling the nrc.en.mtnt model, version 0.3.2:

Current multi-model comparison table: #12129 (comment)

@keymanapp-test-bot skip

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

@keymanapp-test-bot

keymanapp-test-bot bot commented Aug 8, 2024

User Test Results

Test specification and instructions

User tests are not required

@mcdurdin
Member

mcdurdin commented Aug 8, 2024

I like this. Before we get any further, can you provide some more detail on this:

  1. Does this generate backward-compatible model code? Or will it work only with 18.0?
  2. The linked issue has a lot of detail about the proposed design. Does this precisely match the actual design -- particularly the emitted file format?
  3. I'd like to see performance and memory use comparisons (not just screenshots; tables would be good)

@mcdurdin
Member

mcdurdin commented Aug 8, 2024

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

What does this mean?

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

I like this. Before we get any further, can you provide some more detail on this:

  1. Does this generate backward-compatible model code? Or will it work only with 18.0?

As the changes from #12128 are required to interpret the new code, it will only work with 18.0 at this time. We could consider embedding the decompression functions and a bit more rigging (within the model and the worker) to selectively auto-decompress it if not 18.0+, but per a recent discussion we had, I don't think we'll be planning to do this. The decompression methods should minify pretty well, though.

  • I don't believe that the worker currently tells the model what version of the engine is active. We'd likely want to provide this as a value usable to conditionally prevent immediate full decompression.
    • For 18.0+, we already know how to decode it, so we don't need to decompress during startup. We would likely nullify a considerable portion of our performance gains here if we did.
    • For 17.0 and before, when the value isn't available, we'd let it trigger, knowing that the engine lacks decoding support of its own.

All models compiled with 12.0 - 17.0 will continue to operate normally, though.
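The version gate described above could be sketched roughly like this. This is a hypothetical illustration, not the actual worker API; the names (`ModelSource`, `engineVersion`, `decompress`, `selectModelData`) are all made up for the example.

```typescript
// Hypothetical sketch of version-conditional decompression. Assumption:
// the worker would pass the active engine's major version to the model,
// with `undefined` meaning a pre-18.0 engine that never reports one.
interface ModelSource {
  compressed: string;
  decompress(data: string): object;
}

// Returns the data the model hands to the engine: the compressed form for
// 18.0+ (which can decode it natively), a decompressed structure otherwise.
function selectModelData(
  source: ModelSource,
  engineVersion: number | undefined
): string | object {
  if (engineVersion !== undefined && engineVersion >= 18) {
    return source.compressed;
  }
  // Old engine: eagerly decompress, accepting the startup cost.
  return source.decompress(source.compressed);
}
```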

  2. The linked issue has a lot of detail about the proposed design. Does this precisely match the actual design -- particularly the emitted file format?

Linked issue: #10336

I ended up entirely dropping the stored key value - and thus, also keyLength entries. We can always rebuild the keys with the model's included keying function, after all. Before I did so, the redundancy was crystal clear when looking at the encoded-string format; you'd see an extremely high rate of "duplicated words," as the keyed and unkeyed versions were stored adjacent to each other.

  • I've now used strikeout formatting to indicate this change on the base issue.
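To illustrate why the stored keys are redundant: since every key is a pure function of its word, keys can be rebuilt at load time. A minimal sketch, assuming a simple default keying function (the real one is defined per-model; `toKey` and `rebuildKeys` here are illustrative names only):

```typescript
// Stand-in for a model's keying function: case-fold and strip
// combining marks. Real models supply their own.
function toKey(word: string): string {
  return word.toLowerCase().normalize('NFD').replace(/\p{M}/gu, '');
}

// Because key = toKey(word) is always recomputable, only the word needs
// to be stored; the adjacent keyed duplicate the old format carried can
// be dropped entirely.
function rebuildKeys(words: string[]): Map<string, string> {
  return new Map(words.map(w => [toKey(w), w]));
}
```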

Outside of that, it should be fairly precise. There are a few differences, though:

  • We add +0x0020 to all encoded numbers so that they render more nicely in text editors. (Without the offset, VS Code would think the resulting scripts are binary-encoded, despite being .js files.)
  • Leaves and internal nodes overload their type flag onto their entry count.
    • It's extremely doubtful that a leaf will have over 0x7fff entries, just as a node should not have over 0x7fff direct children.
    • High bit: node type
    • Remainder: entry / child count
    • I have now updated the original issue accordingly.
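The two bullets above (the +0x0020 offset and the overloaded type flag) can be shown together. This is a loose sketch of the idea, not the exact wire format; note that the real format additionally \u-escapes values that land in the lone-surrogate range.

```typescript
const OFFSET = 0x20;      // shift so headers render as visible text
const LEAF_FLAG = 0x8000; // high bit: node type

// Pack node type + entry/child count into a single character.
function encodeHeader(isLeaf: boolean, count: number): string {
  if (count > 0x7fff) throw new Error('count exceeds 15 bits');
  return String.fromCodePoint(((isLeaf ? LEAF_FLAG : 0) | count) + OFFSET);
}

function decodeHeader(ch: string): { isLeaf: boolean; count: number } {
  const v = ch.codePointAt(0)! - OFFSET;
  return { isLeaf: (v & LEAF_FLAG) !== 0, count: v & 0x7fff };
}
```

For example, an internal node with 3 children encodes as code point 0x23, the printable character `#`.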
  3. I'd like to see performance and memory use comparisons (not just screenshots; tables would be good)

Could I get some more specifics here? What sort of scenarios and comparisons are you expecting for these tables?

Would it basically be "the same comparisons," just boiled down to raw text tables, but for more models?

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

What does this mean?

HYPOTHETICAL_TEST:

  1. Clone the lexical-models repository.
  2. Download the Developer artifact for this build.
  3. Open the release/gff/gff.am.gff_amharic lexical model project from the lexical-models repository within Developer.
  4. Recompile it.
  5. Using an Android device, install the Android artifact for this build.
  6. Install a keyboard for the Amharic language if you have not previously done so.
    • Important: do this before the next step!
  7. Upload the newly recompiled package for gff.am.gff_amharic to the same Android device, then install it.
  8. Verify that the model works as normal.

That's a single test using two separate build artifacts (Developer, Android) plus a locally-built one (a model package, built via Developer).

@keymanapp-test-bot keymanapp-test-bot bot changed the title feat(developer): output new TrieModel format when compiling 💾 feat(developer): output new TrieModel format when compiling 💾 📖 Aug 8, 2024
@jahorton
Contributor Author

jahorton commented Aug 8, 2024

A small excerpt of the formatting (with line breaks added to loosely simulate word-wrapping):

LMLayerWorker.loadModel(new models.TrieModel({"data":"&ஓ\"ᒲBtaoyifbwhnmsjlcgduprekvqz314567892
 倐\"ᒲ*hoiawreuyv ೋ\"ᒲ'eaioruy ђ\"ᒲ+�yrminsoafe ,\"ᒲ耡 '\"ᒲthe v 䭿%�rvld - 䭿耡 ( 䭿they 0 
ॼ耡 + ॼthey're 0 ƶ耡 + ƶthey've 0 Ɗ耡 + Ɗthey'll / ĸ耡 * ĸthey'd ŝ ⍂#eam ¤ ⍂&�sfobl . ⍂耡 )
 ⍂there 0 ৎ耡 + ৎthere's J ú!o D ú!r > ú耢 - útherefore , 'therefor 0 .耡 + .thereof 0 -耡 +
 -thereby 1 )耡 , )there'll � �!p � �#yie 0 �耡 + �therapy X a!s R a!t L a\"�s 2 a耡 -
 atherapist 3 -耡 . -therapists 4 2耡 / 2therapeutic g ;\"ao 0 ;耡 + ;thermal P *\"sn 3 *耡 .
 *thermostat 6 %耡 1 %thermonuclear ¼ ⊨$�sea - ⊨耡 ( ⊨them 3 ʆ耡 . ʆthemselves T ¯#�sd . ¯耡 ) 

For a minor breakdown:

  • taoyifbwhnmsjlcgduprekvqz3145678921 is the list of all one-letter prefixes supported by the model, in sorted frequency order.

    • From the default prediction set seen when active:
      (screenshot: the default prediction set)
    • Note the order in which each prefix first appears: t, a, o, y, i
  • Similarly, hoiawreuyv is the ordering for "starting with t, what's the second letter?"

  • Then, eaioruy for the third letter...

  • And �yrminsoafe for the fourth.

    • The initial entry there is U+FDD0, our "sentinel value", indicating that there's an actual word here: the.

    • And sure enough, shortly after, you can see the word "the" included in full plain text.

    • We can also get a great supporting live-use picture for this level:

      An active prediction with prefix the

  • �rvld continues from the y on the prior bullet point - we get they plus its contractions, all in order.

  • eam continues from the sibling r - ther isn't a word, but there is, and thera (therapy) and therm (thermal) are valid prefixes.

The "random" characters before and between each section encode character lengths and the weight entries from the original, JSON-friendly Trie structure. Those parts are admittedly not human-readable: char-encoding numbers is far more compact than preserving human-friendly forms for them, while remaining quite simple to convert in code.
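The frequency ordering visible in the excerpt (t, a, o, y, i, ...) can be reproduced with a tiny sketch: at each level, children are listed in descending order of total weight. The function name and the word/weight data below are made up for illustration.

```typescript
// Given (word, weight) pairs, list the first letters in descending
// order of the total weight of words starting with each letter.
function firstLetterOrder(entries: [string, number][]): string {
  const totals = new Map<string, number>();
  for (const [word, weight] of entries) {
    const c = word[0];
    totals.set(c, (totals.get(c) ?? 0) + weight);
  }
  return [...totals.keys()]
    .sort((a, b) => totals.get(b)! - totals.get(a)!)
    .join('');
}
```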

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

For more fun, I locally checked out the original dotland.hy.armenian 1.0 model - the original 160,000-word version - and compiled it to the new format. It did take about 708 ms to load, but it's usable so far as I can tell - and this is on our old SM-T350.

  • Old format: 26887 KB
    • Load time: 17.1 seconds
  • New format: 5941 KB
    • Load time: 708 ms
    • File-size ratio: 22.1%

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

Here's some data for the first model in the repo I found with a primarily non-BMP script: alfareh.xsa.himyarit_musnad. Don't want to break that by accident, after all.

Old format: 548 KB

  • Load time: 303 ms

New format: 129 KB

  • Filesize ratio: 23.5%
    • Note: this is with heavy use of \u-escaping for unpaired surrogates.
  • Load time: 30 ms
Other notes: this keyboard appears to have font issues on the device I was using for the test. Lots of lovely Unicode rectangles.
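The \u-escaping for unpaired surrogates mentioned above could look roughly like this. A sketch only: paired surrogates (valid non-BMP characters) pass through intact, while a lone surrogate is escaped so the emitted .js file remains valid text.

```typescript
// Escape lone (unpaired) surrogates as \uXXXX sequences. A high surrogate
// not followed by a low one, or a low surrogate not preceded by a high
// one, is replaced; well-formed pairs are left untouched.
function escapeUnpairedSurrogates(s: string): string {
  return s.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    c => '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0')
  );
}
```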

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

So, for the lexical models I've tested so far (granted, with one reload each):

| Model | Notes | 17.0 size | 18.0 size | Size ratio | 17.0 load time | 18.0 load time | Time ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nrc.en.mtnt, 0.3.2 | default model | 2652 KB | 453 KB | 17.1% | 2.12 sec | 163 ms | 7.7% |
| dotland.hy.armenian, 1.0.0 | 160000 words | 26887 KB | 5941 KB | 22.1% | 17.1 sec | 708 ms | 4.1% |
| alfareh.xsa.himyarit_musnad, 1.1 | non-BMP script | 548 KB | 129 KB | 23.5% | 303 ms | 30 ms | 10% |

@darcywong00 darcywong00 modified the milestones: A18S8, A18S9 Aug 17, 2024
@jahorton jahorton changed the title feat(developer): output new TrieModel format when compiling 💾 📖 feat(developer): output new TrieModel format when compiling 💾 Aug 23, 2024
Co-authored-by: Marc Durdin <marc@durdin.net>
Base automatically changed from feat/common/models/templates/integrate-trie-compression to epic/model-encoding August 28, 2024 07:02
@jahorton jahorton merged commit 636a3ee into epic/model-encoding Aug 28, 2024
5 of 6 checks passed
@jahorton jahorton deleted the feat/developer/compress-compiled-tries branch August 28, 2024 07:02