Integrate latentscope into Vector-io #29

Closed
dhruv-anand-aintech opened this issue Mar 3, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@dhruv-anand-aintech
Contributor

Hi @enjalot,

I'm working on a project called Vector-io (https://github.com/AI-Northstar-Tech/vector-io), which lets people port their vector datasets across various vector DBs and store snapshots on disk in a simple format called VDF (parquet files and a metadata json file).

I would love to integrate latentscope as a way to visualize the vectors that people have stored in their dataset.

I'm linking to the issue I have in my repo for the integration: AI-Northstar-Tech/vector-io#61.

I wanted to start by asking for help on a bug that I faced while using the web UI to load data from a parquet file in an example dataset I have: https://huggingface.co/datasets/aintech/vdf_20240125_130746_ac5a6_medium_articles/blob/main/medium_articles/medium_articles_2.parquet

I was able to complete the embedding step (though I plan to integrate with the functionality you're planning that lets people use existing vectors), but for the clustering step I got this error:

Loading environment variables from: /Users/dhruvanand/Code/vector-io/.env
loading embeddings
RUNNING: umap-001
loading umap None
Traceback (most recent call last):
  File "/Users/dhruvanand/miniforge3/bin/ls-umap", line 8, in <module>
    sys.exit(main())
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 29, in main
    umapper(args.dataset_id, args.embedding_id, args.neighbors, args.min_dist, save=args.save, init=args.init, align=args.align)
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 153, in umapper
    initial_df = pd.read_parquet(os.path.join(umap_dir, f"{init}.parquet"))
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 670, in read_parquet
    return impl.read(
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 265, in read
    path_or_handle, handles, filesystem = _get_path_or_handle(
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/parquet.py", line 139, in _get_path_or_handle
    handles = get_handle(
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/pandas/io/common.py", line 872, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: './medium_articles_2/umaps/None.parquet'
@enjalot
Owner

enjalot commented Mar 3, 2024

Oh no, I think I know what the issue is here. Could you go to the Job History page on the setup for your medium_articles_2 dataset? You should see the umap command that errored. It would be helpful to know the full command it tried to execute.

@dhruv-anand-aintech
Contributor Author

info_map job 577ae365-c579-4fe7-bacf-11f0db68fa4c
error
Running umap
ls-umap info_map undefined 25 0.1 --init=None
Loading environment variables from: /Users/dhruvanand/Code/vector-io/.env
loading embeddings
Traceback (most recent call last):
  File "/Users/dhruvanand/miniforge3/bin/ls-umap", line 8, in <module>
    sys.exit(main())
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 29, in main
    umapper(args.dataset_id, args.embedding_id, args.neighbors, args.min_dist, save=args.save, init=args.init, align=args.align)
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/latentscope/scripts/umapper.py", line 49, in umapper
    with h5py.File(embedding_path, 'r') as f:
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/dhruvanand/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './info_map/embeddings/undefined.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
🔁 Rerun
8838 seconds since last update
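
Editor's sketch: the job output above shows the literal strings `undefined` and `None` reaching the command line, where they were then used to build file paths. One way a CLI could guard against this is to normalize such sentinel strings while parsing arguments. The following is a minimal illustration only, not latentscope's actual parser or the fix that was released. The positional order mirrors the command shown above; the helper names, the sentinel list, and the `embedding-001` id are assumptions made for the example.

```python
import argparse

# Strings a web front end might send for an unset value (assumed list).
_UNSET = {"", "none", "null", "undefined"}


def optional_str(value):
    """argparse type: map unset-looking strings to None, else return the value."""
    return None if value.strip().lower() in _UNSET else value


def required_id(value):
    """argparse type: reject unset-looking strings for required identifiers."""
    if value.strip().lower() in _UNSET:
        raise argparse.ArgumentTypeError(f"expected an identifier, got {value!r}")
    return value


def build_parser():
    # Hypothetical parser; argument names follow the umapper(...) call in the traceback.
    parser = argparse.ArgumentParser(prog="umap-sketch")
    parser.add_argument("dataset_id", type=required_id)
    parser.add_argument("embedding_id", type=required_id)
    parser.add_argument("neighbors", type=int)
    parser.add_argument("min_dist", type=float)
    parser.add_argument("--init", type=optional_str, default=None)
    return parser


args = build_parser().parse_args(
    ["info_map", "embedding-001", "25", "0.1", "--init=None"]
)
print(args.init)  # prints: None (the Python object, not the string "None")
```

With this kind of guard, passing `undefined` as the embedding id would stop with a usage error at parse time instead of failing later with a `FileNotFoundError` for `undefined.h5`.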

@enjalot
Owner

enjalot commented Mar 4, 2024

Ah, thank you, I see the issue. I'm working on a fix and will release a patch shortly.

@enjalot
Owner

enjalot commented Mar 4, 2024

I've released 0.1.2 which fixes the issue you encountered. Do you mind upgrading and trying again?

@enjalot
Owner

enjalot commented Mar 4, 2024

As a first step towards supporting a more direct integration I made a function in embed.py that allows you to create a latentscope embedding from a numpy array. I used your dataset as an example in this notebook:
https://github.com/enjalot/latent-scope/blob/main/notebooks/medium-articles.ipynb

It looks like the VDF file format would have all the parameters you'd need to call this function too.
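
Editor's sketch: a rough idea of how vectors stored in a VDF parquet file might be turned into the numpy array this function expects. The `import_embeddings` call copies the form quoted later in this thread from the notebook; everything else (the helper names, the `vector_column` parameter, and the assumption that each parquet row holds one vector) is hypothetical. Any setup the notebook performs before that call is omitted here.

```python
import numpy as np


def stack_vectors(rows):
    """Stack a sequence of equal-length vectors into a 2-D float32 array."""
    rows = list(rows)
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty set of vectors that all share one length")
    return np.asarray(rows, dtype=np.float32)


def import_vdf_parquet(parquet_path, vector_column, dataset_id, text_column, model_id):
    """Hypothetical glue code: read one VDF parquet file and hand its vectors to latentscope."""
    # Imported lazily so stack_vectors can be used without these packages installed.
    import pandas as pd
    import latentscope as ls

    df = pd.read_parquet(parquet_path)
    embeddings = stack_vectors(df[vector_column])
    # Call form taken from the notebook cell quoted in this thread.
    ls.import_embeddings(dataset_id, embeddings, text_column=text_column, model_id=model_id)
    return embeddings.shape
```

The vector column name and the model id would presumably come from the VDF metadata json file, which is why they are left as parameters rather than hard-coded.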

@enjalot enjalot added the enhancement New feature or request label Mar 4, 2024
@dhruv-anand-aintech
Contributor Author

dhruv-anand-aintech commented Mar 5, 2024

When I run the notebook, I'm getting:

AttributeError                            Traceback (most recent call last)
Cell In[15], line 1
----> 1 ls.import_embeddings("medium_articles", embeddings, text_column="title", model_id="openai-text-embedding-3-small")

AttributeError: module 'latentscope' has no attribute 'import_embeddings'

This happens even after updating to 0.1.2.

@dhruv-anand-aintech
Contributor Author

When I re-run via UI, I run into #30 again

@enjalot
Owner

enjalot commented Mar 5, 2024

Sorry, I didn't include import_embeddings in v0.1.2, so I just pushed v0.1.3, which should add the function.

@dhruv-anand-aintech
Contributor Author

That line works in the python notebook now.

When I try to load the file in the Web UI, the UI crashes (without any error logs on server side).

@dhruv-anand-aintech
Contributor Author

dhruv-anand-aintech commented Mar 5, 2024

There is some confusion in my setup, since I have imported files with the same name multiple times, and it seems to open the same scope for them. It would be good to create a new scope for each new parquet file loaded in via the UI.

Renaming the parquet file and loading it in again works (Web UI loads).
Screenshot 2024-03-05 at 11 16 24 PM

It would be nice to have the existing vector columns (listed on top) as options in the embeddings menu below.

Side note: Making the panes resizable would be nice, as the plot is shown in a narrow section on my machine
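
Editor's sketch: the same-name collision described in this comment could be avoided by deriving a distinct id for each upload. The helper below is a hypothetical illustration of that idea, not how latentscope names datasets or scopes; the function name and the numeric-suffix scheme are assumptions.

```python
from pathlib import Path


def unique_dataset_id(filename, existing_ids):
    """Return the file stem, with a numeric suffix added if that id is already taken."""
    base = Path(filename).stem
    if base not in existing_ids:
        return base
    suffix = 2
    while f"{base}_{suffix}" in existing_ids:
        suffix += 1
    return f"{base}_{suffix}"


existing = {"medium_articles_2"}
print(unique_dataset_id("medium_articles_2.parquet", existing))  # prints: medium_articles_2_2
```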

@enjalot
Owner

enjalot commented Mar 5, 2024 via email

@enjalot
Owner

enjalot commented Mar 8, 2024

I'm going to close this issue, as I've captured the new ideas in #34 and vector-io compatibility in #32, and we fixed the original bugs you ran into. Thank you!

@enjalot enjalot closed this as completed Mar 8, 2024