Poor node deduplication - how to have more control? #495

Open
jbartot opened this issue Dec 19, 2024 · 3 comments

Comments

jbartot commented Dec 19, 2024

I am seeing issues with node deduplication. When ingesting transcripts of informal conversations, I get duplicate nodes for what is clearly the same person, e.g., "John Doe" and "John T. Doe". Is there a way to have more control over this, or even a post-training capability to collapse these nodes into a single one?
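To make the "collapse these nodes" request concrete, here is a minimal sketch of a post-hoc merge. It is not LightRAG's API: it assumes the graph has been exported to a plain list of `(source, target)` name pairs, and the helpers `name_key`, `build_canonical`, and `merge_nodes` are hypothetical names introduced for illustration. The matching rule (same first and last name token, middle initials ignored) is a deliberately crude heuristic and will over-merge distinct people who share both.

```python
import re
from collections import defaultdict


def name_key(name: str) -> tuple[str, str]:
    """Reduce a name to (first token, last token), lowercased.

    Middle names and initials are dropped, so "John Doe" and
    "John T. Doe" map to the same key. Assumption: Latin-script names.
    """
    tokens = re.findall(r"[A-Za-z]+", name.lower())
    return (tokens[0], tokens[-1]) if tokens else ("", "")


def build_canonical(names: list[str]) -> dict[str, str]:
    """Map every name variant to one canonical spelling per group."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)

    canonical: dict[str, str] = {}
    for variants in groups.values():
        keep = max(variants, key=len)  # arbitrary choice: keep the longest variant
        for variant in variants:
            canonical[variant] = keep
    return canonical


def merge_nodes(
    edges: list[tuple[str, str]], canonical: dict[str, str]
) -> list[tuple[str, str]]:
    """Rewrite edges onto canonical names, dropping duplicates and self-loops."""
    merged = set()
    for src, dst in edges:
        s, d = canonical.get(src, src), canonical.get(dst, dst)
        if s != d:
            merged.add((s, d))
    return sorted(merged)


names = ["John Doe", "John T. Doe", "Acme Corp"]
edges = [
    ("John Doe", "Acme Corp"),
    ("John T. Doe", "Acme Corp"),
    ("John Doe", "John T. Doe"),
]
canonical = build_canonical(names)
merged_edges = merge_nodes(edges, canonical)
```

In this toy example both spellings collapse onto "John T. Doe" and the three edges reduce to one. A real merge would also need to combine node descriptions and edge attributes, which this sketch ignores.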

rabner commented Dec 19, 2024

I posted the same question on Discord yesterday. I'm testing LightRAG with scientific papers, which use abbreviated names very often. The same happens with entities that are rephrased.

jbartot commented Dec 19, 2024

@rabner - I guess I could try to preprocess the raw text and normalize the names before training, but being able to have some control over the deduping seems like a basic capability that is still missing.
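The preprocessing idea mentioned above could look roughly like the following sketch. It assumes a hand-curated alias table (`ALIASES` is a hypothetical example, not something LightRAG provides) and rewrites each known variant to its canonical form before the text is inserted. It does not discover aliases on its own.

```python
import re

# Hypothetical alias table: canonical name -> known variants.
# In practice this would have to be curated per corpus.
ALIASES = {
    "John Doe": ["John T. Doe", "J. Doe"],
}


def normalize_names(text: str, aliases: dict[str, list[str]] = ALIASES) -> str:
    """Replace every known name variant in `text` with its canonical form."""
    # Flatten to variant -> canonical for lookup during substitution.
    lookup = {v: canon for canon, variants in aliases.items() for v in variants}
    if not lookup:
        return text

    # Try longer variants first so "John T. Doe" wins over shorter overlaps.
    alternation = "|".join(
        re.escape(v) for v in sorted(lookup, key=len, reverse=True)
    )
    # Lookarounds keep matches from starting or ending inside a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return pattern.sub(lambda m: lookup[m.group(0)], text)


raw = "John T. Doe met J. Doe's team."
cleaned = normalize_names(raw)
```

The obvious limitation is that the table must already list each variant, so this only helps for entities you know about in advance.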

flight505 commented Jan 5, 2025

@jbartot Hi, I am having the same issues, and like @rabner I am dealing with scientific papers. Did either of you find a solution? I am also trying to ensure references are consistent, which seems to be a hot topic in the RAG world. One obvious issue is that papers are referenced in different styles depending on the publisher. I have tried using Marker with an LLM and made some changes to the joint metadata.json, so that each paper has several fields, including the equations and references. I am not sure if this is the right way of doing it, since I am also creating opportunities for introducing errors.
What are you using for converting PDF to txt?
