
feat(developer): output new TrieModel format when compiling 💾 #12129

Merged 3 commits into epic/model-encoding from feat/developer/compress-compiled-tries on Aug 28, 2024

Conversation

jahorton
Contributor

@jahorton jahorton commented Aug 8, 2024

Fixes #10336.

With these changes in place, Developer will output the new, comparatively-compressed encoded-string Trie format for Trie-based lexical model projects.

For an obvious first trial, when compiling the nrc.en.mtnt model, version 0.3.2:

Current multi-model comparison table: #12129 (comment)

@keymanapp-test-bot skip

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

@keymanapp-test-bot

keymanapp-test-bot bot commented Aug 8, 2024

User Test Results

Test specification and instructions

User tests are not required

@mcdurdin
Member

mcdurdin commented Aug 8, 2024

I like this. Before we get any further, can you provide some more detail on this:

  1. Does this generate backward-compatible model code? Or will it work only with 18.0?
  2. The linked issue has a lot of detail about the proposed design. Does this precisely match the actual design -- particularly the emitted file format?
  3. I'd like to see performance and memory use comparisons (not just screenshots; tables would be good)

@mcdurdin
Member

mcdurdin commented Aug 8, 2024

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

What does this mean?

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

I like this. Before we get any further, can you provide some more detail on this:

  1. Does this generate backward-compatible model code? Or will it work only with 18.0?

As the changes from #12128 are required to interpret the new code, it will only work with 18.0 at this time. We could consider embedding the decompression functions and a bit more rigging (within the model and the worker) to selectively auto-decompress it if not 18.0+, but per a recent discussion we had, I don't think we'll be planning to do this. The decompression methods should minify pretty well, though.

  • I don't believe that the worker currently tells the model what version of the engine is active. We'd likely want to provide this as a value usable to conditionally prevent immediate full decompression.
    • For 18.0+, we already know how to decode it, so we don't need to decompress during startup. We would likely nullify a considerable portion of our performance gains here if we did.
    • For 17.0 and before, when the value isn't available, we'd let it trigger, knowing that the engine lacks decoding support of its own.

All models compiled with 12.0 - 17.0 will continue to operate normally, though.
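The version gate described above could be sketched roughly like this. This is a hypothetical illustration, not the actual worker API; the names (`ModelSource`, `engineVersion`, `decompress`, `selectModelData`) are all made up for the example.

```typescript
// Hypothetical sketch of version-conditional decompression. Assumption:
// the worker would pass the active engine's major version to the model,
// with `undefined` meaning a pre-18.0 engine that never reports one.
interface ModelSource {
  compressed: string;
  decompress(data: string): object;
}

// Returns the data the model hands to the engine: the compressed form for
// 18.0+ (which can decode it natively), a decompressed structure otherwise.
function selectModelData(
  source: ModelSource,
  engineVersion: number | undefined
): string | object {
  if (engineVersion !== undefined && engineVersion >= 18) {
    return source.compressed;
  }
  // Old engine: eagerly decompress, accepting the startup cost.
  return source.decompress(source.compressed);
}
```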

  2. The linked issue has a lot of detail about the proposed design. Does this precisely match the actual design -- particularly the emitted file format?

Linked issue: #10336

I ended up entirely dropping the stored key value - and thus, also keyLength entries. We can always rebuild the keys with the model's included keying function, after all. Before I did so, the redundancy was crystal clear when looking at the encoded-string format; you'd see an extremely high rate of "duplicated words," as the keyed and unkeyed versions were stored adjacent to each other.

  • I've now used strikeout formatting to indicate this change on the base issue.
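To illustrate why the stored keys are redundant: since every key is a pure function of its word, keys can be rebuilt at load time. A minimal sketch, assuming a simple default keying function (the real one is defined per-model; `toKey` and `rebuildKeys` here are illustrative names only):

```typescript
// Stand-in for a model's keying function: case-fold and strip
// combining marks. Real models supply their own.
function toKey(word: string): string {
  return word.toLowerCase().normalize('NFD').replace(/\p{M}/gu, '');
}

// Because key = toKey(word) is always recomputable, only the word needs
// to be stored; the adjacent keyed duplicate the old format carried can
// be dropped entirely.
function rebuildKeys(words: string[]): Map<string, string> {
  return new Map(words.map(w => [toKey(w), w]));
}
```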

Outside of that, it should be fairly precise. There are a few differences, though:

  • We add +0x0020 to all encoded numbers so that they render more nicely in text editors. (Without the offset, VS Code would think the resulting scripts are binary-encoded, despite being .js files.)
  • Leaves and internal nodes overload their type flag onto their entry count.
    • It's extremely doubtful that a leaf will have over 0x7fff entries, just as a node should not have over 0x7fff direct children.
    • High bit: node type
    • Remainder: entry / child count
    • I have now updated the original issue accordingly.
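The two bullets above (the +0x0020 offset and the overloaded type flag) can be shown together. This is a loose sketch of the idea, not the exact wire format; note that the real format additionally \u-escapes values that land in the lone-surrogate range.

```typescript
const OFFSET = 0x20;      // shift so headers render as visible text
const LEAF_FLAG = 0x8000; // high bit: node type

// Pack node type + entry/child count into a single character.
function encodeHeader(isLeaf: boolean, count: number): string {
  if (count > 0x7fff) throw new Error('count exceeds 15 bits');
  return String.fromCodePoint(((isLeaf ? LEAF_FLAG : 0) | count) + OFFSET);
}

function decodeHeader(ch: string): { isLeaf: boolean; count: number } {
  const v = ch.codePointAt(0)! - OFFSET;
  return { isLeaf: (v & LEAF_FLAG) !== 0, count: v & 0x7fff };
}
```

For example, an internal node with 3 children encodes as code point 0x23, the printable character `#`.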
  3. I'd like to see performance and memory use comparisons (not just screenshots; tables would be good)

Could I get some more specifics here? What sort of scenarios and comparisons are you expecting for these tables?

Would it basically be "the same comparisons," just boiled down to raw text tables, but for more models?

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

Though, conceivably, we could request compiling an already-existing model with the Developer artifact, then loading it with the Android or iOS artifact.

What does this mean?

HYPOTHETICAL_TEST:

  1. Clone the lexical-models repository.
  2. Download the Developer artifact for this build.
  3. Open the release/gff/gff.am.gff_amharic lexical model project from the lexical-models repository within Developer.
  4. Recompile it.
  5. Using an Android device, install the Android artifact for this build.
  6. Install a keyboard for the Amharic language if you have not previously done so.
    • Important: do this before the next step!
  7. Upload the newly recompiled package for gff.am.gff_amharic to the same Android device, then install it.
  8. Verify that the model works as normal.

That's a single test using two separate build artifacts (Developer, Android) plus a locally-built one (a model package, built via Developer).

@keymanapp-test-bot keymanapp-test-bot bot changed the title feat(developer): output new TrieModel format when compiling 💾 feat(developer): output new TrieModel format when compiling 💾 📖 Aug 8, 2024
@jahorton
Contributor Author

jahorton commented Aug 8, 2024

A small excerpt of the formatting (with line breaks added to loosely simulate word-wrapping):

LMLayerWorker.loadModel(new models.TrieModel({"data":"&ஓ\"ᒲBtaoyifbwhnmsjlcgduprekvqz314567892
 倐\"ᒲ*hoiawreuyv ೋ\"ᒲ'eaioruy ђ\"ᒲ+�yrminsoafe ,\"ᒲ耡 '\"ᒲthe v 䭿%�rvld - 䭿耡 ( 䭿they 0 
ॼ耡 + ॼthey're 0 ƶ耡 + ƶthey've 0 Ɗ耡 + Ɗthey'll / ĸ耡 * ĸthey'd ŝ ⍂#eam ¤ ⍂&�sfobl . ⍂耡 )
 ⍂there 0 ৎ耡 + ৎthere's J ú!o D ú!r > ú耢 - útherefore , 'therefor 0 .耡 + .thereof 0 -耡 +
 -thereby 1 )耡 , )there'll � �!p � �#yie 0 �耡 + �therapy X a!s R a!t L a\"�s 2 a耡 -
 atherapist 3 -耡 . -therapists 4 2耡 / 2therapeutic g ;\"ao 0 ;耡 + ;thermal P *\"sn 3 *耡 .
 *thermostat 6 %耡 1 %thermonuclear ¼ ⊨$�sea - ⊨耡 ( ⊨them 3 ʆ耡 . ʆthemselves T ¯#�sd . ¯耡 ) 

For a minor breakdown:

  • taoyifbwhnmsjlcgduprekvqz3145678921 is the list of all one-letter prefixes supported by the model, in sorted frequency order.

    • From the default prediction set seen when active:
      (screenshot: the default prediction set)
    • Note the order in which each prefix first appears: t, a, o, y, i
  • Similarly, hoiawreuyv is the ordering for "starting with t, what's the second letter?"

  • Then, eaioruy for the third letter...

  • And �yrminsoafe for the fourth.

    • The initial entry there is U+FDD0, our "sentinel value", indicating that there's an actual word here: the.

    • And sure enough, shortly after, you can see the word "the" included in full plain text.

    • We can also get a great supporting live-use picture for this level:

      An active prediction with prefix the

  • �rvld continues from the y on the prior bullet point - we get they plus its contractions, all in order.

  • eam continues from the sibling r - ther isn't a word, but there is, and thera (therapy) and therm (thermal) are valid prefixes.

The "random" characters before and between each section encode character lengths and the weight entries from the original, JSON-friendly Trie structure. Those parts are admittedly not human-readable: char-encoding numbers is far more compact than preserving human-friendly forms for them, while remaining quite simple to convert in code.
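The frequency ordering visible in the excerpt (t, a, o, y, i, ...) can be reproduced with a tiny sketch: at each level, children are listed in descending order of total weight. The function name and the word/weight data below are made up for illustration.

```typescript
// Given (word, weight) pairs, list the first letters in descending
// order of the total weight of words starting with each letter.
function firstLetterOrder(entries: [string, number][]): string {
  const totals = new Map<string, number>();
  for (const [word, weight] of entries) {
    const c = word[0];
    totals.set(c, (totals.get(c) ?? 0) + weight);
  }
  return [...totals.keys()]
    .sort((a, b) => totals.get(b)! - totals.get(a)!)
    .join('');
}
```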

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

For more fun, I locally checked out the original dotland.hy.armenian 1.0 model - the original 160,000-word version - and compiled it to the new format. It did take about 708 ms to load, but it's usable so far as I can tell - and this is on our old SM-T350.

  • Old format: 26887 KB
    • Load time: 17.1 seconds
  • New format: 5941 KB
    • Load time: 708 ms
    • File-size ratio: 22.1%

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

Here's some data for the first model in the repo I found with a primarily non-BMP script: alfareh.xsa.himyarit_musnad. Don't want to break that by accident, after all.

Old format: 548 KB

  • Load time: 303 ms

New format: 129 KB

  • Filesize ratio: 23.5%
    • Note: this is with heavy use of \u-escaping for unpaired surrogates.
  • Load time: 30 ms
Other notes: this keyboard appears to have font issues on the device I was using for the test. Lots of lovely Unicode rectangles.
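The \u-escaping for unpaired surrogates mentioned above could look roughly like this. A sketch only: paired surrogates (valid non-BMP characters) pass through intact, while a lone surrogate is escaped so the emitted .js file remains valid text.

```typescript
// Escape lone (unpaired) surrogates as \uXXXX sequences. A high surrogate
// not followed by a low one, or a low surrogate not preceded by a high
// one, is replaced; well-formed pairs are left untouched.
function escapeUnpairedSurrogates(s: string): string {
  return s.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    c => '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0')
  );
}
```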

@jahorton
Contributor Author

jahorton commented Aug 8, 2024

So, for the lexical models I've tested so far (granted, with one reload each):

| Model | Notes | 17.0 size | 18.0 size | Size ratio | 17.0 load time | 18.0 load time | Time ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nrc.en.mtnt, 0.3.2 | default model | 2652 KB | 453 KB | 17.1% | 2.12 sec | 163 ms | 7.7% |
| dotland.hy.armenian, 1.0.0 | 160000 words | 26887 KB | 5941 KB | 22.1% | 17.1 sec | 708 ms | 4.1% |
| alfareh.xsa.himyarit_musnad, 1.1 | non-BMP script | 548 KB | 129 KB | 23.5% | 303 ms | 30 ms | 10% |

@darcywong00 darcywong00 modified the milestones: A18S8, A18S9 Aug 17, 2024
@jahorton jahorton changed the title feat(developer): output new TrieModel format when compiling 💾 📖 feat(developer): output new TrieModel format when compiling 💾 Aug 23, 2024
Co-authored-by: Marc Durdin <marc@durdin.net>
Base automatically changed from feat/common/models/templates/integrate-trie-compression to epic/model-encoding August 28, 2024 07:02
@jahorton jahorton merged commit 636a3ee into epic/model-encoding Aug 28, 2024
5 of 6 checks passed
@jahorton jahorton deleted the feat/developer/compress-compiled-tries branch August 28, 2024 07:02