Poor node deduplication - how to have more control? #495

jbartot · 2024-12-19T16:55:32Z

I see issues with node deduplication. When ingesting transcripts of informal conversations, I get duplicate nodes for what is clearly the same person, e.g., "John Doe" and "John T. Doe". Is there a way to have more control over this, or even a post-training capability to collapse these nodes into a single one?
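
To make the "collapse these nodes" idea concrete, here is a minimal sketch of merging one node into another after the graph has been built. It is not LightRAG's API: the graph is assumed to be a plain dict of node descriptions plus a set of directed edges, and `merge_nodes` is a hypothetical helper. If the graph lives in NetworkX instead, `networkx.contracted_nodes` performs a comparable contraction.

```python
def merge_nodes(nodes, edges, keep, drop):
    """Collapse node `drop` into node `keep`.

    nodes: dict mapping node name -> description (assumed layout)
    edges: set of (source, target) tuples (assumed layout)
    Returns new (nodes, edges); the inputs are not modified.
    """
    merged_nodes = dict(nodes)
    if drop in merged_nodes:
        dropped_desc = merged_nodes.pop(drop)
        kept_desc = merged_nodes.get(keep, "")
        # Keep both descriptions so no information is lost in the merge.
        merged_nodes[keep] = " | ".join(d for d in (kept_desc, dropped_desc) if d)

    merged_edges = set()
    for source, target in edges:
        source = keep if source == drop else source
        target = keep if target == drop else target
        # Skip self-loops created by merging two connected duplicates.
        if source != target:
            merged_edges.add((source, target))
    return merged_nodes, merged_edges


nodes = {
    "John Doe": "engineer",
    "John T. Doe": "works at Acme",
    "Acme": "company",
}
edges = {
    ("John Doe", "Acme"),
    ("John T. Doe", "Acme"),
    ("John T. Doe", "John Doe"),
}
new_nodes, new_edges = merge_nodes(nodes, edges, keep="John Doe", drop="John T. Doe")
```

Deciding which pairs to merge (exact alias list, fuzzy string match, embedding similarity) is the hard part and is left out of this sketch.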

rabner · 2024-12-19T19:24:36Z

I posted the same question on Discord yesterday. I'm testing LightRAG with scientific papers, which very often use abbreviated names. The same happens with entities that are rephrased.

jbartot · 2024-12-19T23:21:43Z

@rabner - I guess I could try to preprocess the raw text and normalize the names before training, but being able to have some control over the deduping seems like a basic capability that is still missing.
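
A rough sketch of that preprocessing step, assuming a hand-maintained alias table. `ALIASES` and `normalize_names` are hypothetical names for illustration, not part of LightRAG:

```python
import re

# Hypothetical alias table: variant spelling -> canonical name.
ALIASES = {
    "John T. Doe": "John Doe",
    "J. Doe": "John Doe",
}


def normalize_names(text, aliases):
    """Replace every known name variant in `text` with its canonical form."""
    if not aliases:
        return text
    # Try longer variants first so a short alias never pre-empts a longer one.
    variants = sorted(aliases, key=len, reverse=True)
    # The lookarounds stop a variant from matching inside a longer word.
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(v) + r"(?!\w)" for v in variants)
    )
    return pattern.sub(lambda match: aliases[match.group(0)], text)


cleaned = normalize_names("John T. Doe met J. Doe.", ALIASES)
```

This only catches variants listed in the table, so it does not help with rephrased entities or names that were never anticipated.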

flight505 · 2025-01-05T17:08:22Z

@jbartot Hi, I am having the same issues, and like @rabner I am dealing with scientific papers. Did either of you find a solution? I am also trying to ensure references are consistent, which seems to be a hot topic in the RAG world. One obvious issue is that papers are referenced in different styles, which even vary by publisher. I have tried using Marker with an LLM and made some changes to the joint metadata.json, so each paper now has several fields, including the equations and references. I am not sure this is the right way of doing it, since I am also creating opportunities for introducing errors.
What are you using for converting PDF to text?
