feat(lyra): Add WebAssembly support #194

jkomyno · 2022-11-22T00:38:12Z

Context

This PR introduces Rust+WebAssembly support in Lyra, as asked privately by @micheleriva.
I'm currently targeting Node.js only, but you can extend this PR as needed for supporting Deno, browsers, and other JS runtimes.

As a motivating example, we were asked to replace the intersectTokenScores function originally defined here with the intersect_token_scores function defined in the new lyra-utils crate here (and exposed to TypeScript via the lyra-utils-wasm crate here).

Tests

It should be noted that tests are currently failing, but we believe that is only due to a different ordering strategy used by Rust for intersect_token_scores. However, we invite the Lyra authors to carefully check that's the case.
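To illustrate the ordering difference mentioned above, here is a minimal sketch of a deterministic tie-break. It assumes token scores are `[id, score]` pairs (the `TokenScore` alias and `sortTokenScores` helper are hypothetical names) and that ties are resolved by lexicographic id order, as a later commit title in this PR suggests. It is not the actual Lyra or lyra-utils implementation.

```typescript
// Hypothetical shape: [document id, tf-idf score].
type TokenScore = [id: string, score: number];

// Sort by score (descending); break ties by id in lexicographic order
// so that the result does not depend on insertion order.
function sortTokenScores(scores: TokenScore[]): TokenScore[] {
  return [...scores].sort(([idA, scoreA], [idB, scoreB]) => {
    if (scoreA !== scoreB) return scoreB - scoreA;
    return idA < idB ? -1 : idA > idB ? 1 : 0;
  });
}

const sorted = sortTokenScores([
  ["doc-b", 0.5],
  ["doc-a", 0.5],
  ["doc-c", 0.9],
]);
console.log(sorted.map(([id]) => id));
```

With an explicit tie-break like this, the JS and Rust sides would agree on the order of equally scored documents regardless of how each runtime's sort treats equal elements.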

How to build Rust → Wasm artifacts

With `Rust` and `Node`

Install Rust v1.65
Install Node v16.15.1 or later
cargo update -p wasm-bindgen
cargo install -f wasm-bindgen-cli@0.2.83
cd ./rust
(cd ./scripts && npm i)
Optional: export LYRA_WASM_PROFILE="release"
Optional: export LYRA_WASM_TARGET="nodejs"
node ./scripts/wasmAll.mjs

With `docker` (used by the CI)

Install Docker

docker buildx build --load \
  -f Dockerfile --build-context rust=rust \
  . -t lyrasearch/lyra-wasm \
  --progress plain

docker create --name tmp lyrasearch/lyra-wasm
docker cp tmp:/opt/app/src/wasm ./src/wasm
docker rm -f tmp

In both cases, you should observe the following artifacts in ./src/wasm/:

lyra_utils_wasm_bg.wasm
lyra_utils_wasm_bg.wasm.d.ts
lyra_utils_wasm.d.ts
lyra_utils_wasm.js

These artifacts will need to be included in the bundler of your choice. Moreover, you likely do not wish to store them in the repo, but would rather generate them on the fly in the CI. Feel free to change this as you see fit.

ShogunPanda · 2022-11-23T11:32:53Z

rust/lyra-utils/src/tokenscore.rs

+      }
+    }
+
+    if found == 0 {


@jkomyno I'm not sure about this. Why are we returning early?

micheleriva · 2022-11-24T10:23:56Z

PR working fine on Node.js via ESM and CJS. TextEncoder class breaks on browsers, trying to fix it right now.

ShogunPanda

LGTM!

micheleriva · 2022-11-29T08:54:57Z

Life is pain, but this PR is suffering folks

jkomyno · 2022-11-30T22:53:08Z

Note: in the CI, I see warnings about requires in the automatically generated Wasm bindings.
When bundling "nodejs esm" modules, wasm-bindgen should be called with the --target=bundler option, as documented here.

micheleriva · 2022-12-03T20:47:05Z

Ok so @ShogunPanda, @jkomyno, this PR is ready to get merged.

My main concern is that it will force us to introduce a breaking change by making the search function async, as we need to use dynamic imports to load the correct WASM file for every runtime.

The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.

My idea would be either to:

Create a searchAsync function (or with a similar name) that returns a promise and uses WASM underneath
Move search to a promise-based approach to support WASM without breaking on different runtimes

A nice fact is that given how Lyra is built, functions are totally isolated and tree-shakable, so it's up to the user to decide if they want to import the WASM implementation or use the default JS fallback.

Pros for making search async by default:

The WASM implementation is a LOT faster (official benchmark will come soon), so the throughput might not be a real problem IMHO

Cons of making search async:

Would force the user to download a ~100 kB WASM file; not a problem on the server, but I can see why people might not like this in a browser.

I think maintaining two separate search functions could also be a problem if we don't split up the function's internals, but by doing that, we will likely slow down the overall execution by introducing more context-switching and function calls.

Let me know what you think folks 🙏

RafaelGSS · 2022-12-03T21:02:00Z

The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.

How would it reduce the throughput?

Also, my 2 cents is to introduce the breaking change by making search return a promise. However, that doesn't mean it will be 100% async: even when returning a promise, I think it will run synchronously when WASM isn't used.

marco-ippolito · 2022-12-03T21:26:44Z

I'd go for the async search; I don't see why performance would be inferior. When I changed the tree I was thinking of proposing it, because I wanted to make findAllWords async. Let's benchmark the current implementation with async and see the difference.
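A rough sketch of how such a comparison could be set up is shown here. `searchSync` and `searchAsync` are stand-in workloads, not Lyra's search functions, and a micro-benchmark like this is noisy, so the numbers should only be read as a first indication.

```typescript
// Stand-in workloads; these are NOT Lyra's search functions.
function searchSync(term: string): number {
  return term.length;
}

async function searchAsync(term: string): Promise<number> {
  return term.length;
}

// Runs both variants `iterations` times and reports elapsed milliseconds.
// `checksum` ends at 0 when both variants return the same values.
async function compare(
  iterations: number,
): Promise<{ syncMs: number; asyncMs: number; checksum: number }> {
  let checksum = 0;

  const syncStart = Date.now();
  for (let i = 0; i < iterations; i++) checksum += searchSync("foo");
  const syncMs = Date.now() - syncStart;

  const asyncStart = Date.now();
  for (let i = 0; i < iterations; i++) checksum -= await searchAsync("foo");
  const asyncMs = Date.now() - asyncStart;

  return { syncMs, asyncMs, checksum };
}

compare(100_000).then(({ syncMs, asyncMs }) => {
  console.log(`sync: ${syncMs} ms, async: ${asyncMs} ms`);
});
```

Each `await` defers to the microtask queue once per call, so the gap between the two timings approximates the per-call promise overhead for a trivial workload; for a real search the relative cost would be smaller.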

micheleriva · 2022-12-04T07:30:57Z

@RafaelGSS

How would it reduce the throughput?

There's a lot of literature regarding promise performance, and I am wondering if that is a legit concern for Lyra; just to name a couple of great articles (there are also benchmarks in there):

But you're the expert here Rafael, I trust you 🙂

I know @ShogunPanda has a different opinion on promises though, so let's wait for him too.

One other concern I have is about DX and consistency: Lyra will expose the following fundamental functions:

create (sync)
insert (sync)
delete (sync)
search (async)
insertBatch (async)

While the reason why insertBatch is async would be clear to the user (as stated in the docs, it will avoid event loop freezes), it's not immediately clear why we can create a new db, insert, delete data synchronously, but we need to search asynchronously. I think having a consistent approach in all the "key" functions would be better.

About @marco-ippolito's comment: I think we could easily make a couple of search internals async as well. So yes, that's another good point for async search.

That said, I'd also personally prefer making search async, these are just my concerns.

RafaelGSS

LGTM.

RafaelGSS · 2022-12-05T12:33:17Z

.github/workflows/wasm.yml

+    strategy:
+      fail-fast: true
+      matrix:
+        os: [ubuntu-latest]


We might need to include OSX and Windows here

ShogunPanda · 2022-12-05T23:21:42Z

I already mentioned this to @micheleriva directly, but I'm leaving my opinion here just in case.

I think we should go with an approach similar to fastify and other packages: the search function might accept the intersectTokenScores as a function and, based on it, decide whether it should return a promise.
On the caller side, the developer can either choose to always await or to detect it.

In other words, the function will become something like this (pseudocode):

function search<T>(/* */): T | Promise<T> {
  // Do something

  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if (typeof intersected.then === 'function') {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}
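For reference, a runnable sketch of the pseudocode above. The `TokenScore`, `MaybePromise` and `Intersect` types, the toy `finalizeSearch`, and the two stand-in intersecters are assumptions made for illustration only. It uses `instanceof Promise` rather than duck-typing on `.then`, which keeps TypeScript's narrowing simple but would not recognise non-native thenables.

```typescript
// Hypothetical shapes, for illustration only.
type TokenScore = [id: string, score: number];
type MaybePromise<T> = T | Promise<T>;
type Intersect = (sets: TokenScore[][]) => MaybePromise<TokenScore[]>;

// Toy finalization step: keep only the document ids.
function finalizeSearch(hits: TokenScore[]): string[] {
  return hits.map(([id]) => id);
}

// Returns a promise only when the injected intersecter returns one.
function search(sets: TokenScore[][], intersectTokenScores: Intersect): MaybePromise<string[]> {
  const intersected = intersectTokenScores(sets);
  if (intersected instanceof Promise) {
    return intersected.then(finalizeSearch);
  }
  return finalizeSearch(intersected);
}

// Stand-ins that simply return the first set; a real intersecter would combine all sets.
const syncIntersect: Intersect = (sets) => sets[0] ?? [];
const asyncIntersect: Intersect = async (sets) => sets[0] ?? [];

const sets: TokenScore[][] = [[["doc-1", 0.7]]];
const syncResult = search(sets, syncIntersect);
const asyncResult = search(sets, asyncIntersect);
console.log(syncResult instanceof Promise, asyncResult instanceof Promise);
```

Note that the static return type is the union in both cases, so a caller either awaits unconditionally or checks the result at runtime.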

micheleriva · 2022-12-06T08:39:01Z

@ShogunPanda my main concern with this approach is that it can be confusing for some developers... here's my proposal.

I'd ship WASM as an experimental feature for now, that can be enabled in the following way:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

The optimizations property will contain all the code optimizations that we will perform on Lyra. While in an experimental status, the wasm property will be false by default.

Given that we always pass a Lyra instance to the search function, we can easily detect whether the wasm option is set as true or false:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

search(db, { term: 'foo' }) // <--- search knows if wasm interop is enabled

Given that any Lyra instance is essentially an object, we can override this value with ease on demand:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

db.optimizations.wasm = false

I'm not sure I like it, but might be useful for benchmarks and tests while in an experimental stage.

With that being said, we could easily change the search type signature from T to Promise<T> depending on the Lyra instance passed as a first argument, but I see a bad DX pattern there. By mutating the original optimizations.wasm parameter, we would make the signature uncertain and possibly non-deterministic.

I'd propose then to either choose to move on with a Promise by default or create an asyncSearch function to take care of this.
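The typing concern described above can be sketched as follows. `Db`, the single-argument `create`, and the overloaded `search` are simplified, hypothetical stand-ins rather than Lyra's real types; the point is only that a signature derived from the instance can go stale once the flag is mutated.

```typescript
// Hypothetical instance shape; only the flag discussed above is modelled.
interface Db<W extends boolean> {
  optimizations: { wasm: W };
}

function create<W extends boolean>(wasm: W): Db<W> {
  return { optimizations: { wasm } };
}

// Overloads: the static return type follows the type-level flag.
function search(db: Db<true>, term: string): Promise<string[]>;
function search(db: Db<false>, term: string): string[];
function search(db: Db<boolean>, term: string): string[] | Promise<string[]> {
  const hits = [term]; // placeholder for the real lookup
  return db.optimizations.wasm ? Promise.resolve(hits) : hits;
}

const db = create(true);
const before = search(db, "foo"); // typed as a Promise, and is one at runtime

// Mutating through a wider alias compiles, but changes the runtime behaviour.
const alias: Db<boolean> = db;
alias.optimizations.wasm = false;
const after = search(db, "foo"); // still typed as Promise<string[]>, but is a plain array

console.log(before instanceof Promise, after instanceof Promise);
```

After the mutation, the compiler still believes `after` is a promise while the runtime value is an array, which is the kind of uncertain signature the comment warns about.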

marco-ippolito · 2022-12-06T08:49:36Z

I already mentioned this to @micheleriva directly, but I'm leaving my opinion here just in case.

I think we should go with an approach similar to fastify and other packages: the search function might accept the intersectTokenScores as a function and, based on it, decide whether it should return a promise. On the caller side, the developer can either choose to always await or to detect it.

In other words, the function will become something like this (pseudocode):
function search<T>(/* */): T | Promise<T> {
  // Do something

  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if (typeof intersected.then === 'function') {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}

I'm usually not a big fan of this approach. It's kinda hard to maintain (fast-jwt verify 💀) and to build new components on top of, because you always have to think about whether it will return a promise or not. Most of the time you will put an await in front regardless, just to avoid dealing with the dual return type.

I'd propose then to either choose to move on with a Promise by default or create an asyncSearch function to take care of this.

I'd choose whether to go promise by default or asyncSearch based on the benchmark results.

ShogunPanda · 2022-12-06T08:56:20Z

@micheleriva @marco-ippolito I see both your points. Let's try something different.

What about exposing two different interfaces and forbidding passing optimizations to the regular one?

Something like allowing both of the following snippets to be valid:

import { create, search } from '@lyrasearch/lyra'

/*
This create does not allow for optimizations. 
We specifically validate this in JavaScript (instead of relying on TS types) so that we
can guide the user in picking the right one.
*/
const db = create(/* ... */)

const results = search(db)

and

import { create, search } from '@lyrasearch/lyra/async'

/*
Technically create should not be async (yet).
But since we are establishing the new interface we declare it as async now so
that we don't need a breaking change later.

Create here accepts optimizations.
*/
const db = await create(/* ... */)
const results = await search(db)

At the beginning of each method in both versions I would add a quick check to avoid mixing the APIs.
This way the DX should be fine, and we will have room for future expansion. Moreover it will encourage us to write composable code.
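A minimal sketch of that quick check, assuming each instance carries a marker recording which API created it. The `mode` field, `assertMode`, and the `syncApi` / `asyncApi` objects are hypothetical stand-ins for the two entry points, collapsed into one file for illustration.

```typescript
type Mode = "sync" | "async";

interface Instance {
  mode: Mode; // hypothetical marker recording which API created the instance
  docs: string[];
}

// Runtime guard, so plain JavaScript callers get a clear error too.
function assertMode(db: Instance, expected: Mode): void {
  if (db.mode !== expected) {
    throw new Error(`Instance was created with the ${db.mode} API; use the ${db.mode} functions with it.`);
  }
}

// Stand-in for the regular entry point.
const syncApi = {
  create(docs: string[]): Instance {
    return { mode: "sync", docs };
  },
  search(db: Instance, term: string): string[] {
    assertMode(db, "sync");
    return db.docs.filter((doc) => doc.includes(term));
  },
};

// Stand-in for the async entry point.
const asyncApi = {
  async create(docs: string[]): Promise<Instance> {
    return { mode: "async", docs };
  },
  async search(db: Instance, term: string): Promise<string[]> {
    assertMode(db, "async");
    return db.docs.filter((doc) => doc.includes(term));
  },
};

const db = syncApi.create(["foo bar", "baz"]);
console.log(syncApi.search(db, "foo"));
asyncApi.search(db, "foo").catch((error: Error) => console.log(error.message));
```

Because the guard runs inside an async function in the async variant, mixing the APIs there surfaces as a rejected promise rather than a synchronous throw.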

micheleriva · 2022-12-06T08:58:06Z

@ShogunPanda not a bad idea. So we'd basically alias the methods right?

ShogunPanda · 2022-12-06T08:59:20Z

Pretty much.
In the async version we allow most parts (intersecter, tokenizer, stemmer and so forth) to be pluggable and async.

marco-ippolito · 2022-12-06T09:01:23Z

@ShogunPanda I love this idea. it's like the require('fs').promises

ShogunPanda · 2022-12-06T09:02:56Z

@ShogunPanda I love this idea. it's like the require('fs').promises

How do you know where I copied the idea from? 😁

micheleriva · 2022-12-06T11:08:51Z

@jkomyno I keep on getting the following error when importing the compiled JS binding:

  1) tests/lyra.dataset.test.ts lyra.dataset should correctly populate the database with a large dataset `unwrap_throw` failed:
     Error: `unwrap_throw` failed
      at module.exports.__wbindgen_throw (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:178:9)
      at wasm://wasm/000671c6:wasm-function[126]:0x13c5f
      at wasm://wasm/000671c6:wasm-function[8]:0x51f5
      at Object.module.exports.intersectTokenScores (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:128:20)
      at intersectTokenScores (src/wasm/loader.ts:8:24)
      at search (src/lyra.ts:673:57)
      at Test.<anonymous> (tests/lyra.dataset.test.ts:71:28)

Here is the full CI log:
https://github.com/LyraSearch/lyra/actions/runs/3628897502/jobs/6120466268#step:5:627

any idea?

micheleriva · 2022-12-14T08:36:07Z

Update: we decided that ALL the Lyra functions will be async. Currently working on this.

micheleriva · 2022-12-14T14:51:05Z

All Lyra functions are now async. WASM support will be experimental, and users will have to opt-in via the following interface:

import { create } from '@lyrasearch/lyra'
import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/wasm/intersectTokenScores' // placeholder 

await create({
  schema: {
    foo: 'string'
  },
  components: {
    algorithms: {
      intersectTokenScores
    }
  }
})

This will provide a unified interface that will allow people to bring their own optimizations, and optionally opt-in for the built-in ones.
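As a sketch of how such a pluggable component could be resolved, under assumptions: the `IntersectTokenScores` signature, the `Components` shape, and the `resolveIntersect` helper below are hypothetical and may differ from what Lyra actually ships. The fallback assumes ids are unique within each set.

```typescript
type TokenScore = [id: string, score: number];
// Assumed signature; the real component contract may differ.
type IntersectTokenScores = (sets: TokenScore[][]) => Promise<TokenScore[]>;

interface Components {
  algorithms?: { intersectTokenScores?: IntersectTokenScores };
}

// Plain-JS fallback: keep ids that appear in every set and sum their scores.
const defaultIntersect: IntersectTokenScores = async (sets) => {
  const totals = new Map<string, { count: number; score: number }>();
  for (const set of sets) {
    for (const [id, score] of set) {
      const entry = totals.get(id) ?? { count: 0, score: 0 };
      totals.set(id, { count: entry.count + 1, score: entry.score + score });
    }
  }
  const result: TokenScore[] = [];
  totals.forEach(({ count, score }, id) => {
    if (count === sets.length) result.push([id, score]);
  });
  return result;
};

// Use the user-supplied algorithm when present, otherwise the fallback.
function resolveIntersect(components?: Components): IntersectTokenScores {
  return components?.algorithms?.intersectTokenScores ?? defaultIntersect;
}

const sets: TokenScore[][] = [
  [["doc-1", 1], ["doc-2", 2]],
  [["doc-2", 3]],
];
resolveIntersect()(sets).then((hits) => console.log(hits));
```

Because every algorithm shares one promise-returning signature, a WASM-backed implementation and a plain JS one are interchangeable from the caller's point of view.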

Gonna merge this and provide a separate build system for our WASM optimizations.

Thank you all folks, what a ride!

jkomyno added 6 commits November 22, 2022 01:32

feat(lyra): extract intersectTokenScores into the "lyra-utils-wasm" Rust crate

57dc9f5

chore: add bash scripts to build wasm artifacts

10a94dc

feat(lyra): replaced "intersectTokenScores" with its Wasm counterpart

6cd1179

chore: add experimental nix script to build Rust / Wasm

5129e66

ci: add experimental wasm workflow

e71b499

chore: swapped scripts interpreter to "/bin/bash"

ad7782d

jkomyno requested a review from micheleriva November 22, 2022 00:56

ci: attempt to fix "bad interpreter: no such file or directory " error

8804f47

micheleriva reviewed Nov 22, 2022

src/utils.ts (resolved)

src/wasm/lyra_utils_wasm.d.ts (outdated, resolved)

src/wasm/lyra_utils_wasm.js (outdated, resolved)

micheleriva reviewed Nov 22, 2022

rust/scripts/wasm_all.sh (outdated, resolved)

ShogunPanda reviewed Nov 23, 2022

jkomyno and others added 12 commits November 24, 2022 02:05

chore: removed nix

5225e95

ci: replaced bash scripts with node.js scripts

f56d8c6

ci: added Dockerfile for Wasm generation pipeline

5fb6822

ci: replaced nix with Docker

eb99522

chore: add examples of optimized wasm and textual wat representation

4dea2f9

chore: swapped order in test to match Rust's lexicographic order when tf-idf score is the same

b5688f2

chore: removed leftover from Dockerfile

4e32323

ci: trigger wasm workflow on PRs

1becdca

ci: fix typo

d1f51a0

ci: fix typo in "docker cp" command

1f0469d

test: removes generated files from tests

75fb53b

build: wip: moves generated files to dist

240fb39

ShogunPanda approved these changes Nov 24, 2022

jkomyno marked this pull request as ready for review November 26, 2022 14:11

build(wasm): wip on wasm bundling

67519f3

ShogunPanda approved these changes Nov 29, 2022

build(wasm): removes rust compilation process from bundling system

5e03604

Merge branch 'main' into feat/add-wasm

0043cdc

RafaelGSS added the semver-major label Dec 5, 2022

RafaelGSS approved these changes Dec 5, 2022

build(wasm): wip on wasm build system

7dfa3b5

build(wasm): adds rust build command

9958fba

micheleriva mentioned this pull request Dec 7, 2022

feat(lyra): adds async aliases #203

Merged

ShogunPanda and others added 4 commits December 7, 2022 13:39

fix: Fixed WASM loader.

edd6fa7

feat: wip on build system

ebad3c9

Merge branch 'main' into feat/add-wasm

b883437

Merge branch 'main' into feat/add-wasm

8561bac

micheleriva added 2 commits December 14, 2022 13:21

Merge branch 'main' into feat/add-wasm

f09a459

feat(lyra): adds components

341cd6a

micheleriva merged commit 151bebe into main Dec 14, 2022

micheleriva deleted the feat/add-wasm branch December 14, 2022 14:52
