dani
I have a question: when I store a stream in the storage, how does it decide whether the insertion should be done or whether the index already exists? I have seen that DataSource.FromStream does not use a path. Doesn't that make it difficult to find an element?
HavenDV — 05/02/2024 9:58 PM
DataSource.FromStream simply retrieves Documents from this Stream. Although there is metadata here, it is not currently used in any way, and nothing checks whether the same data is already present in the VectorCollection
dani — 05/02/2024 10:01 PM
I think I did not express myself well. For example, in the code I am testing I insert files from a repository. Can I decide whether or not to insert depending on whether the vector already exists in the database?
foreach (var f in files)
{
    // Skip files whose extension is on the ignore list.
    if (ignoreExt.Contains(Path.GetExtension(f.FilePath).ToLower())) continue;
    index = await vectordb.AddDocumentsFromAsync<GitlabLoader>(
        embeddings,
        dimensions: dimensions,
        dataSource: DataSource.FromStream(f.ContentToStream),
        collectionName: collection);
}
HavenDV — 05/02/2024 10:06 PM
IVectorDatabase.IsCollectionExistsAsync is probably the best choice at the moment, if you can store files in different collections
IVectorCollection.IsEmptyAsync can also be used, but it is not yet implemented/tested for all databases
dani — 05/02/2024 10:11 PM
My idea was to use one collection to store an entire repository. Would it be viable to have a method to check whether a path already exists? And could DataSource.FromStream have an option to pass it a path, so that it is indexed in some way?
HavenDV — 05/02/2024 10:16 PM
I understand your problem and I'll think about it. For now, the only solution is to recreate the entire collection for all files if any of the files has changed
The problem is that one DataSource can turn into several Documents and, as a result, into several vectors in the database. So we need to add metadata such as FilePath to the database, and then check for the presence of vectors with this metadata
Using the path as the Id of a specific vector won't work, because an Id needs to be unique and one file can produce several vectors
dani — 05/02/2024 10:19 PM
You could retrieve a count of the vectors with the same path, right?
HavenDV — 05/02/2024 10:23 PM
But what if the file changes partially?
dani — 05/02/2024 10:25 PM
Maybe you could store a hash of the file and, if it doesn't match, delete and reinsert?
HavenDV — 05/02/2024 10:31 PM
Yeah, that sounds like a good suggestion.
And add AddOrUpdateFrom, which does this automatically.
Also add the ability to pass metadata to the DataSource so that you can control this.
But this seems to go a little beyond the scope of the current release and sounds like a good plan for the near future
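The hash check suggested above could be sketched roughly as follows. This is only an illustration, not part of LangChain .NET: the `ContentHash` class and its methods are hypothetical names, and it assumes the previously stored hash is available from somewhere (for example, metadata, once that is supported).

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not a LangChain .NET type.
public static class ContentHash
{
    // Returns the hex-encoded SHA-256 of the bytes remaining in the stream.
    // Note: this reads the stream to its end.
    public static string Compute(Stream content)
    {
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(content));
    }

    // True when there is no stored hash or it differs from the current content,
    // i.e. the file should be deleted from the collection and reinserted.
    public static bool HasChanged(string? storedHash, Stream content) =>
        !string.Equals(storedHash, Compute(content), StringComparison.OrdinalIgnoreCase);
}
```

Because hashing consumes the stream, a seekable stream would need to be rewound (`Position = 0`) before being handed to something like DataSource.FromStream afterwards.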
I'm currently working on a way to search by metadata, including file path (I think we are trying to do the same thing!). So what you would do is loop through the files, search for entries with that metadata/file path, delete them (there is no easy way to update existing entries, since their number could change), and then re-add them.
In the future we could look at storing the file hash in the metadata to do what you suggested.
The code is written for SQLite and in-memory, and I hope to have MongoDB and PostgreSQL done soon.
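To make the delete-and-re-add idea concrete, here is a minimal in-memory sketch. It does not use the real LangChain .NET API: `StoredChunk`, `InMemoryChunkStore`, and the `"FilePath"` metadata key are assumptions made up for illustration, standing in for whatever the metadata search ends up looking like.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one stored vector; not a LangChain .NET type.
public sealed record StoredChunk(string Id, string Text, Dictionary<string, string> Metadata);

// Hypothetical in-memory collection that can be filtered by a "FilePath" metadata key.
public sealed class InMemoryChunkStore
{
    private readonly List<StoredChunk> _chunks = new();

    private static bool HasPath(StoredChunk chunk, string filePath) =>
        chunk.Metadata.TryGetValue("FilePath", out var path) && path == filePath;

    // Number of stored chunks that came from the given file.
    public int CountByPath(string filePath) => _chunks.Count(c => HasPath(c, filePath));

    // Deletes every chunk tagged with filePath, then adds the new chunks.
    // Each chunk gets its own unique Id; the path lives only in the metadata.
    public void ReplaceByPath(string filePath, IEnumerable<string> newChunks)
    {
        _chunks.RemoveAll(c => HasPath(c, filePath));
        foreach (var text in newChunks)
        {
            _chunks.Add(new StoredChunk(
                Id: Guid.NewGuid().ToString(),
                Text: text,
                Metadata: new Dictionary<string, string> { ["FilePath"] = filePath }));
        }
    }
}
```

Replacing all entries for a path sidesteps the problem that a changed file may split into a different number of chunks than before.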