Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(lyra): Add WebAssembly support #194

Merged
merged 38 commits into from
Dec 14, 2022
Merged

feat(lyra): Add WebAssembly support #194

merged 38 commits into from
Dec 14, 2022

Conversation

jkomyno
Copy link
Contributor

@jkomyno jkomyno commented Nov 22, 2022

Context

This PR introduces Rust+WebAssembly support in Lyra, as asked privately by @micheleriva.
I'm currently targeting Node.js only, but you can extend this PR as needed for supporting Deno, browsers, and other JS runtimes.

As a motivating example, we were asked to write a skeleton for the intersectTokenScores function originally defined here in favor of the intersect_token_scores defined in the new lyra-utils crate here (and exposed to TypeScript via the lyra-utils-wasm crate here).

Tests

It should be noted that tests are currently failing, but we believe that is only due to a different ordering strategy used by Rust for intersect_token_scores. However, we invite the Lyra authors to carefully check that's the case.

How to build Rust → Wasm artifacts

With Rust and Node

  • Install Rust v1.6.5
  • Install Node v16.15.1 or superior
  • cargo update -p wasm-bindgen
  • cargo install -f wasm-bindgen-cli@0.2.83
  • cd ./rust
  • (cd ./scripts && npm i)
  • Optional: export LYRA_WASM_PROFILE="release"
  • Optional: export LYRA_WASM_TARGET="nodejs"
  • node ./scripts/wasmAll.mjs

With docker (used by the CI)

  • Install Docker
  • docker buildx build --load \
      -f Dockerfile --build-context rust=rust \
      . -t lyrasearch/lyra-wasm \
      --progress plain
  • docker create --name tmp lyrasearch/lyra-wasm
  • docker cp dummy:/opt/app/src/wasm ./src/wasm
  • docker rm -f tmp

In both cases, you should observe the following artifacts in ./src/wasm/:

  • lyra_utils_wasm_bg.wasm
  • lyra_utils_wasm_bg.wasm.d.ts
  • lyra_utils_wasm.d.ts
  • lyra_utils_wasm.js

This will need to be included in the bundler of your choice. Moreover, you likely do not wish to store these artifacts in the repo, but would rather generate them on the fly in the CI. Feel free to change this as you see fit.

@jkomyno jkomyno requested a review from micheleriva November 22, 2022 00:56
src/utils.ts Show resolved Hide resolved
src/wasm/lyra_utils_wasm.d.ts Outdated Show resolved Hide resolved
src/wasm/lyra_utils_wasm.js Outdated Show resolved Hide resolved
}
}

if found == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkomyno I'm not sure about this. Why are we returning early?

@micheleriva
Copy link
Member

PR working fine on Node.js via ESM and CJS. TextEncoder class breaks on browsers, trying to fix it right now.

Copy link
Contributor

@ShogunPanda ShogunPanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkomyno jkomyno marked this pull request as ready for review November 26, 2022 14:11
@micheleriva
Copy link
Member

Life is pain, but this PR is suffering folks

@jkomyno
Copy link
Contributor Author

jkomyno commented Nov 30, 2022

Note: in the CI, I see warnings about requires in the automatically generated Wasm bindings.
When bundling "nodejs esm" modules, wasm-bindgen should be called with the --target=bundler option, as documented here.

@micheleriva
Copy link
Member

Ok so @ShogunPanda, @jkomyno, this PR is ready to get merged.

My main concern is that it will force us to introduce a breaking change by making the search function async, as we need to use dynamic imports to load the correct WASM file for every runtime.

The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.

My idea would be either to:

  1. Create a searchAsync function (or with a similar name) that returns a promise and uses WASM underneath
  2. Move search to a promise-based approach to support WASM without breaking on different runtimes

A nice fact is that given how Lyra is built, functions are totally isolated and tree-shakable, so it's up to the user to decide if they want to import the WASM implementation or use the default JS fallback.

Pros for making search async by default:

  • The WASM implementation is a LOT faster (official benchmark will come soon), so the throughput might not be a real problem IMHO

Cons of making search async:

  • Would force the user to download ~100kb WASM file, not a problem on the server but I can see why people might not like this on a browser.

I think maintaining two separate search functions could also be a problem if we don't split up the function's internals, but by doing that, we will likely slow down the overall execution by introducing more context-switching and function calls.

Let me know what you think folks 🙏

@RafaelGSS
Copy link
Contributor

The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.

How would it reduce the throughput?

Also, my 2 cents is to introduce the breaking change by making search return a promise. However, it doesn't mean it will be 100% async. Even returning a promise I think when no WASM it will be synchronous.

@marco-ippolito
Copy link
Contributor

I'd go for the async search, I dont see why performance would be inferior. When I changed the tree I was wondering to propose it because I wanted to make findAllWords async. Lets benchmark the current implementation with async and see the difference.

@micheleriva
Copy link
Member

@RafaelGSS

How would it reduce the throughput?

There's a lot of literature regarding promises performances, and I am wondering if that is a legit concern for Lyra; just to name a couple of great articles (there are also benchmarks in there):

But you're the expert here Rafael, I trust you 🙂

I know @ShogunPanda has a different opinion on promises though, so let's wait for him too.

One other concern I have is about DX and consistency: Lyra will expose the following fundamental functions:

  • create (sync)
  • insert (sync)
  • delete (sync)
  • search (async)
  • insertBatch (async)

While the reason why insertBatch is async would be clear to the user (as stated in the docs, it will avoid event loop freezes), it's not immediately clear why we can create a new db, insert, delete data synchronously, but we need to search asynchronously. I think having a consistent approach in all the "key" functions would be better.

About @marco-ippolito comment: I think we could easily make a couple of search internals async as well. So yes, that's another good point for async search.

That said, I'd also personally prefer making search async, these are just my concerns.

Copy link
Contributor

@RafaelGSS RafaelGSS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

strategy:
fail-fast: true
matrix:
os: [ubuntu-latest]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to include OSX and Windows here

@ShogunPanda
Copy link
Contributor

I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.

I think we should go with an approach similar to fastify and other packages: the search function might accept the intersectTokenScores as a function and, based on it, decided whether it should return a promise.
On the caller side, the developer can either choose to always await or to detect it.

In other words, the function will become something like this (pseudocode):

function search<T>(/* */): T | Promise<T> {
  // Do something

  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if(typeof intersected.then === function) {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}

@micheleriva
Copy link
Member

@ShogunPanda my main concern with this approach is that it can be confusing for some developers... here's my proposal.

I'd ship WASM as an experimental feature for now, that can be enabled in the following way:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

The optimizations property will contain all the code optimizations that we will perform on Lyra. While in an experimental status, the wasm property will be false by default.

Given that we always pass a Lyra instance to the search function, we can easily detect whether the wasm option is set as true or false:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

search(db, { term: 'foo' }) // <--- search knows if wasm interop is enabled

Given that any Lyra instance is essentially an object, we can override this value with ease on demand:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

db.optimizations.wasm = false

I'm not sure I like it, but might be useful for benchmarks and tests while in an experimental stage.

With that being said, we could easily change the search type signature from T to Promise<T> depending on the Lyra instance passed as a first argument, but I see a bad DX pattern there. By mutating the original optimization.wasm parameter, we would make the signature uncertain and possibly non-deterministic.

I'd propose then to either choose to move on with a Promise by default or create an asyncSearch function to take care of this.

@marco-ippolito
Copy link
Contributor

marco-ippolito commented Dec 6, 2022

I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.

I think we should go with an approach similar to fastify and other packages: the search function might accept the intersectTokenScores as a function and, based on it, decided whether it should return a promise. On the caller side, the developer can either choose to always await or to detect it.

In other words, the function will become something like this (pseudocode):

function search<T>(/* */): T | Promise<T> {
  // Do something

  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if(typeof intersected.then === function) {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}

I'm usually not a big fan of this approach, It's kinda hard to maintain (fast-jwt verify 💀) and to build new components above because you always have to think whether it will return a promise or not. Most of the time you will put an await before independently just not to deal with the dual type return.

I'd propose then to either choose to move on with a Promise by default or create an asyncSearch function to take care of this.

I'd choose whether to go promise by default or asyncSearch based on the benchmark results.

@ShogunPanda
Copy link
Contributor

ShogunPanda commented Dec 6, 2022

@micheleriva @marco-ippolito I see both your points. Let's try something different.

What about exposing two different interfaces and forbid passing optimizations to the regular one?

Something like allowing both the following codes to be valid:

import { create, search } from '@lyrasearch/lyra'

/*
This create does not allow for optimizations. 
We specifically validate this in Javascript (instead of relying on TS types) so that we
can guide the user in picking the right one.
*/
const db = create(/* ... */)

const results = search(db)

and

import { create, search } from '@lyrasearch/lyra/async'

/*
Technically create should not be async (yet).
But since we are establishing the new interface we declare as async now so 
that we don't need a breaking change later.

Create here accepts optimizations.
*/
const db = await create(/* ... */)
const results = await search(db)

At the beginning of each methods for both version I would add a quick check to avoid mixing API.
This way the DX should be fine, and we will have room for future expansion. Moreover it will encourage us to write composable code.

@micheleriva
Copy link
Member

@ShogunPanda not a bad idea. So we'd basically alias the methods right?

@ShogunPanda
Copy link
Contributor

Pretty much.
In the async version we allow most parts (intersecter, tokenizer, stemmer and so forth) to be pluggable and async.

@marco-ippolito
Copy link
Contributor

@ShogunPanda I love this idea. it's like the require('fs').promises

@ShogunPanda
Copy link
Contributor

@ShogunPanda I love this idea. it's like the require('fs').promises

How do you know where I copied the idea from? 😁

@micheleriva
Copy link
Member

@jkomyno I keep on getting the following error when importing the compiled JS binding:

  1) tests/lyra.dataset.test.ts lyra.dataset should correctly populate the database with a large dataset `unwrap_throw` failed:
     Error: `unwrap_throw` failed
      at module.exports.__wbindgen_throw (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:178:9)
      at wasm://wasm/000671c6:wasm-function[126]:0x13c5f
      at wasm://wasm/000671c6:wasm-function[8]:0x51f5
      at Object.module.exports.intersectTokenScores (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:128:20)
      at intersectTokenScores (src/wasm/loader.ts:8:24)
      at search (src/lyra.ts:673:57)
      at Test.<anonymous> (tests/lyra.dataset.test.ts:71:28)

Here is the full CI log:
https://github.com/LyraSearch/lyra/actions/runs/3628897502/jobs/6120466268#step:5:627

any idea?

@micheleriva
Copy link
Member

Update: we decided that ALL the Lyra functions will be async. Currently working on this.

@micheleriva
Copy link
Member

All Lyra functions are now async. WASM support will be experimental, and users will have to opt-in via the following interface:

import { create } from '@lyrasearch/lyra'
import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/wasm/intersectTokenScores' // placeholder 

await create({
  schema: {
    foo: 'string'
  },
  components: {
    algorithms: {
      intersectTokenScores
    }
  }
})

This will provide a unified interface that will allow people to bring their own optimizations, and optionally opt-in for the built-in ones.

Gonna merge this and provide a separate build system for our WASM optimizations.

Thank you all folks, what a ride!

@micheleriva micheleriva merged commit 151bebe into main Dec 14, 2022
@micheleriva micheleriva deleted the feat/add-wasm branch December 14, 2022 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants