[Work in Progress DO NOT MERGE] Port math library from clojure to python #1893
base: edge
Conversation
Now to figure out what is stored where :)
This would be amazing!
We serialize to JSON and, optionally, to a temp file. We will probably need to expand this to take kwargs into account.
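As a minimal sketch of what such an argument serializer could look like in Python -- the helper names and the temp-file behaviour here are illustrative assumptions, not the actual code in this PR:

```python
import json
import tempfile

def serialize_args(args, kwargs=None):
    """Serialize positional and keyword arguments to a JSON string.

    Hypothetical helper for illustration; assumes all arguments are
    JSON-serializable (lists, dicts, numbers, strings).
    """
    payload = {"args": list(args), "kwargs": kwargs or {}}
    return json.dumps(payload)

def serialize_args_to_tempfile(args, kwargs=None):
    """Optionally persist the serialized arguments to a temp file."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as f:
        f.write(serialize_args(args, kwargs))
        return f.name
```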
And some unit tests for fun, and even coverage!
The Makefile will avoid having to remember the test syntax, making the tests frictionless to run. The partial runs are a way to keep focusing on just the output of whichever part we want -- although they don't save us from the biggest drag: Clojure launch time.
We now have serialization of arguments in Clojure, with tests (in Docker), and serialization+deserialization of arguments in Python, with tests (locally, not in Docker). Next: wrapping a Clojure function with a serializer of its input and output, plus a Python script that replays the recorded call against the Python function and checks that the outputs match.
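The replay side in Python could look roughly like the sketch below -- the recording format and function names are assumptions for illustration, not the PR's actual interface:

```python
import json
import numpy as np

def replay_recorded_call(recording_path, python_fn, rtol=1e-6):
    """Replay a call recorded on the Clojure side against the Python port.

    Assumes the recording is a JSON file with "args", "kwargs" and
    "output" keys (an illustrative format, not the PR's actual one).
    """
    with open(recording_path) as f:
        recording = json.load(f)

    actual = python_fn(*recording["args"], **recording["kwargs"])
    expected = np.asarray(recording["output"])

    # Allow small numerical differences between the two implementations.
    np.testing.assert_allclose(np.asarray(actual), expected, rtol=rtol)
```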
I am adding notes as I go. Let's start with the PCA, as it's well known and nicely isolated.
First, checking its Clojure call graph:

```mermaid
graph TD
  wrapped_pca[wrapped-pca] --> powerit_pca[powerit-pca]
  powerit_pca --> power_iteration[power-iteration]
  powerit_pca --> rand_starting_vec[rand-starting-vec]
  powerit_pca --> factor_matrix[factor-matrix]
  power_iteration --> xtxr[xtxr]
  power_iteration --> repeatv[repeatv]
  factor_matrix --> proj_vec[proj-vec]
  xtxr --> repeatv
  pca_project[pca-project]
  sparsity_aware_project_ptpts[sparsity-aware-project-ptpts] --> sparsity_aware_project_ptpt[sparsity-aware-project-ptpt]
  pca_project_cmnts[pca-project-cmnts] --> sparsity_aware_project_ptpts
```
- Add cloverage.
- Explicitly deactivate conv-man tests, which require a full database, as warned in the comments.
- Fix a bug in the new test runner when parsing command-line arguments.
I am trying to get full _branch_ coverage of the function `power-iteration`, but failing. There must be some corner cases I am not surfacing (and neither is Cursor). I will leave it aside for now, note that it needs fixing, and move on to covering the entirely missing lines in the `wrapped-pca` function.
In spite of lots of PCA tests, I do not get full branch coverage in the key function. I do wonder whether we really need full branch coverage, knowing that the PCA method, while currently an elegantly hand-coded power iteration, might eventually be handed off to scikit-learn or LAPACK. But before that, we will need to check whether the iterative nature of the power iteration is exploited for incremental updates of the conversation, as I suspect it is. So for now, we will stick with `power-iteration`.
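For context, here is a minimal numpy sketch of textbook power iteration for the leading component -- an illustration of the standard algorithm, assuming the Clojure `power-iteration` follows it; the names and stopping criterion are not the ported code:

```python
import numpy as np

def power_iteration(x, n_iters=100, tol=1e-9, seed=0):
    """Estimate the leading eigenvector of x^T x by power iteration.

    Textbook sketch (not the ported code): x is an (n_samples,
    n_features) matrix, assumed already centered.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape[1])
    v /= np.linalg.norm(v)

    for _ in range(n_iters):
        # One multiplication by x^T x, without forming it explicitly.
        w = x.T @ (x @ v)
        w /= np.linalg.norm(w)
        # Stop when the direction no longer changes.
        if np.linalg.norm(w - v) < tol:
            return w
        v = w
    return v
```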
See compdemocracy#1894 and `README.pythonport.md` in this commit for explanations of why the code is unreachable.
I would rather keep it in the math folder, but Calva does not find it there...
This will allow changes or new files added in the container to be reflected, which is easier when developing.
Awesome! 🔥 Using JAX?
Thanks for the enthusiasm :) At first, using numpy and sklearn for the port, because we are so far using CPU nodes (a constant worker polling the database every few seconds), and as far as I know JAX is not necessarily faster than numpy on that type of node. Plus we don't necessarily want the cost of running a GPU node 24/7 for a PCA and a k-means 😅 Also thinking the codebase will be more accessible without JAX, for now.

However, the idea is to keep the current functional-style code organization (thanks Clojure!), which will then lend itself well to a potential move to jax.numpy once we start seeing the need (whether through the scale of the matrices, or through algorithmic ideas -- for example RL when we review comment routing ;) )

I'm curious: what use case did you have in mind for a JAX port? Now's a great time to send tons of ideas :)
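To illustrate that design choice with a hypothetical example (not code from this PR): if the core functions stay pure and use only the numpy array API, a later move to JAX is largely a matter of swapping the array module:

```python
import numpy as np
# A later backend switch could be as simple as:
# import jax.numpy as np   # jax.numpy mirrors most of the numpy API

def project_onto_components(votes, components):
    """Pure function: project a centered vote matrix onto PCA components.

    Hypothetical example of the functional style -- no mutation, no
    global state, only array operations available in both backends.
    """
    return votes @ components.T
```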
Ah, sorry, it was just a random thought, as my Google friend always throws JAX at any topic 😆 Maybe some JIT compilers will work better than JAX for optimizing.
Yes, for example numba is a very powerful JIT compiler these days. It does not necessarily accelerate the core functions of numpy, which are often outsourced to BLAS (or Intel MKL if you have it set up), but it does speed up all the Python code around them.

JAX is really powerful indeed, when you have hardware accelerators :) And I love the split-stream RNG for reproducibility, as well as the instruction-level parallelization.

The marginal cost to take into account, however, is that it does come with extra complexity (see e.g. https://docs.jax.dev/en/latest/notebooks/Common_Gotchas_in_JAX.html ), and can get difficult to debug. Within Google DeepMind I used to have tons of Research Engineer colleagues, who are terrific at helping sort out such issues, but whom we do not have outside of Google 😆 Plus it's not yet version 1.0, so the API can and does shift from version to version, thus requiring extra maintenance -- unlike numpy, which is stable (well, except when moving from 1.x to 2.0!)

It's definitely a cost worth paying when the time comes, though. I'm dreaming of having full-on differentiable reinforcement learning environments with the conversations...
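A minimal numba sketch of the kind of speedup meant here (a generic illustration, not code from this PR): JIT-compiling a plain-Python loop that numpy alone would not handle well:

```python
import numpy as np
from numba import njit

@njit
def count_agreements(votes_a, votes_b):
    """Count indices where two vote vectors agree and are non-missing.

    Hypothetical example: a scalar loop like this is slow in pure
    Python but compiles to fast machine code under numba's @njit.
    """
    count = 0
    for i in range(votes_a.shape[0]):
        if votes_a[i] == votes_b[i] and not np.isnan(votes_a[i]):
            count += 1
    return count

# Example usage:
# a = np.array([1.0, -1.0, np.nan, 1.0])
# b = np.array([1.0, 1.0, np.nan, 1.0])
# count_agreements(a, b)  # -> 2
```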
I am not familiar with these libraries; my comfort zone was to optimize specific algorithms myself with C++/Rust using SIMD, or maybe Highway depending on the deployment environment, so I guess I need to adopt modern development practices. My friend always tries to convince me to use JAX, but I feel so comfortable with PyTorch (and maybe also PyTorch/XLA for TPUs) that I never really used JAX (except for Neural Tangents, I guess). I am not sure about the differentiable reinforcement learning environments with the conversations... I prefer a rule-based/ranking-algorithm-based routing strategy, with a metric reflecting opinion diversity as input to the routing strategy. [sidetracked] Working with people at CivicTechTO on user research to ensure the deliverables are aligned with the needs of end users IRL (and a data-lakehouse-based pipeline as a modularized backend upgrade to our Polis fork, so that the ranking algorithm can be integrated and other routing solutions tested).
Context
We (the Polis core team and advisors) have been discussing for the last few years whether Clojure is still the optimal language for the math library, given the evolution of the landscape and of Polis's needs.
This came up again when @metasoarous raised potential performance issues in #1579 (comment).
Porting any codebase is a Very Big Endeavour and super risky, let alone in this case: 7000+ lines of scientific Clojure written by a very smart developer (hats off @metasoarous !), with tons of embedded real-world safeguards and ten years of battle-testing. Most of all, we need to keep all the domain knowledge that is embedded in the current codebase. This is not a rewrite from scratch, but a port!
So, crazy, but there's a lot to gain (the massive ML ecosystem: people, libraries, etc.), so we would be remiss not to at least explore how far we can go. The worst that can happen is that this completely fails, we lose time, and I've got egg on my face. I'll mitigate the former by still working on the new LLM features with @colinmegill, and as for the latter, well, I can live with that :)
So let's go!
Plan
I'll be focusing first on the core functionality: the math. Once that is clear and done, I will work on the poller and runner. I'll be using `numpy` for all vector operations.
Preparation
Core iteration
then, iterating this way:
Expectations
It'll be a real slog at first for steps 0 and 1, getting familiar with running the various functions one by one in Clojure when needed (i.e. without the realtime poller etc., which I'll keep for the end), but the pace should then increase.
Performance should also mechanically improve, as per #1579 and #1062 and #1580 .
The math part will be fun -- although we might start to see some small numerical differences appearing as we go; hopefully we can keep them small.