Activity
bmaltais commented on Jul 4, 2024
Do you mean updating an already indexed document, or adding new ones? Adding new ones works and will just index the added doc. I have not tried updating an existing document. The question could also be expanded to: can you remove an indexed document by deleting it from the input folder?
CraftsMan-Labs commented on Jul 4, 2024
Yep, adding new docs to an already indexed system. Yes, we would eventually need both upsert and remove, but for starters upsert would be great.
bmaltais commented on Jul 4, 2024
Adding new documents to the input folder will trigger indexing for those new documents. However, it will not re-index existing ones. Be aware that existing communities might get re-generated each time you add new documents, which can be time-consuming and consume valuable LLM credits.
It would be beneficial to have an option to create only new communities and skip reprocessing existing ones, allowing users to decide when to update existing community summaries. This approach would save significant LLM processing and cost, at the expense of a slight decrease in precision.
Personally, I prefer quickly indexing new documents, creating any necessary new communities, and then, at the end of the day, allowing the system to rebuild existing communities if needed based on the new documents added.
CraftsMan-Labs commented on Jul 4, 2024
We can't regenerate a new parquet file and communities every time a single file is added when we have thousands of files preprocessed.
Like in other vector DBs, we need something that just adds the new file dynamically to the main DB and builds relationships automatically.
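For illustration, this is the pattern meant here, shown with chromadb as a stand-in vector DB (chromadb is just an example, not something GraphRAG uses for this): one new document is a single upsert, and nothing else gets rebuilt.

```python
import chromadb

# Illustrative only: the incremental-upsert pattern common to vector DBs.
# The contrast is with GraphRAG's current full pipeline re-run.
client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection("docs")

# Adding one more document is a single call; existing entries are untouched.
collection.upsert(
    ids=["doc-1001"],
    documents=["Text of the newly added file."],
)
```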
bmaltais commented on Jul 4, 2024
I totally agree. Even with 10 files it quickly becomes super cumbersome and lengthy every time a new file is added to the mix.
The claim_extraction section has an enabled: true setting that is commented out. I assume the default is false... so the same would be nice for community_report, maybe?
Then again, that would prevent new communities from being created, so it's perhaps not optimal... Maybe a new optional variable called only_generate_new_communities: true? Something like the sketch below.
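A sketch of what this could look like in settings.yaml; only_generate_new_communities is hypothetical and does not exist in GraphRAG today:

```yaml
claim_extraction:
  enabled: false  # existing flag; commented out in the default template

community_reports:
  # Hypothetical flag -- not a real GraphRAG setting. It would generate
  # reports for new communities only, leaving existing summaries
  # untouched until the user opts to rebuild them.
  only_generate_new_communities: true
```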
bgonzalezfractal commented on Jul 19, 2024
Hi, I've been playing a bit with graphrag, especially changing the way LLM calls are made to increase concurrent requests and support my own inference class. I can't fork and PR because I've already changed too much, but I had a couple of ideas I could validate:
1. Deleting an item from the graph should be relatively easy: delete the document_id, then the text_unit, then the relationships, claims, and other rows from the parquet files. That would be enough, but index problems from pandas could appear. Cons: index issues. (A rough sketch is below.)
2. To add new items, I made community creation optional and used an LLM to classify which community the new text should be linked to. Cons: limited communities; the first upload should be pretty general so the communities can form, and after that you just classify.
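A minimal sketch of idea 1 as a pandas cascade delete. The artifact file names and the id / document_ids / text_unit_ids columns are assumptions based on typical GraphRAG output layouts, not confirmed against the codebase, so check them against your own artifacts folder:

```python
import pandas as pd

# Sketch only: cascade-delete one document from the parquet artifacts.
# File and column names are assumptions, not confirmed against graphrag.
ARTIFACTS = "output/<run-id>/artifacts"  # placeholder path

def delete_document(doc_id: str) -> None:
    # 1. Drop the document row itself.
    docs = pd.read_parquet(f"{ARTIFACTS}/create_final_documents.parquet")
    docs = docs[docs["id"] != doc_id].reset_index(drop=True)
    docs.to_parquet(f"{ARTIFACTS}/create_final_documents.parquet")

    # 2. Drop text units sourced from this document, keeping their ids
    #    so dependent tables can be filtered in turn.
    units = pd.read_parquet(f"{ARTIFACTS}/create_final_text_units.parquet")
    dead = units["document_ids"].apply(lambda ids: doc_id in ids)
    dead_unit_ids = set(units.loc[dead, "id"])
    units = units[~dead].reset_index(drop=True)
    units.to_parquet(f"{ARTIFACTS}/create_final_text_units.parquet")

    # 3. Same filter for relationships, claims, and any other table that
    #    carries text_unit_ids. The reset_index(drop=True) after every
    #    filter is what sidesteps the pandas index issues mentioned above.
    rels = pd.read_parquet(f"{ARTIFACTS}/create_final_relationships.parquet")
    keep = rels["text_unit_ids"].apply(
        lambda ids: not dead_unit_ids.intersection(ids)
    )
    rels = rels[keep].reset_index(drop=True)
    rels.to_parquet(f"{ARTIFACTS}/create_final_relationships.parquet")
```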
@bmaltais @CraftsMan-Labs you two seem to be the only ones interested in this with me. As I said before, I don't have time to make a PR, but I could upload code blocks with these ideas so we can validate together. Interested?
CraftsMan-Labs commented on Jul 19, 2024
I think let's do it. But I need more details on how you're doing the upsert after reading all the community details.
sebnapi commented on Jul 19, 2024
When I add a new document and use the usual index method, it looks like it is recreating everything from scratch: it creates new artifacts/parquet files covering both documents. Which part, @bmaltais, is not redone in your opinion?
bmaltais commented on Jul 19, 2024
It will not reprocess all documents… but adding new documents will lead to needing to update a lot of existing claims and community notes… and those are what will take the bulk of the time.
sebnapi commented on Jul 20, 2024
Thank you for making it clearer. Can you point me to the code? I went through the index module and couldn't spot the moment where we do a diff on the documents.
kouskouss commented on Jul 24, 2024
I have a similar question. I performed two insertions of data into the same graphrag, which resulted in the creation of two different folders containing .parquet files. Where does the local search look for data? Which of the two folders should I use for the notebook examples involving local search?
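A minimal sketch of the choice in question, assuming the default layout where each indexing run writes its own timestamped folder under output/ and the local-search notebook's INPUT_DIR points at a single run's artifacts (typically the latest):

```python
from pathlib import Path

# Assumes the default output/<timestamp>/artifacts layout; adjust if
# your storage config writes artifacts somewhere else.
runs = sorted(p for p in Path("output").iterdir() if p.is_dir())
INPUT_DIR = str(runs[-1] / "artifacts")  # artifacts of the most recent run
print(INPUT_DIR)
```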
natoverse commented on Jul 26, 2024
Consolidating index update requests with #741