[Issue]: Neo4j graphrag import notebook is outdated #1550

DanielGBabel · 2024-12-22T18:36:54Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

This tutorial is indicating the "description_embeddings" and "title" columns and a few other things that have changed since v0.4.0

I need to know how can I get the description_embeddings again in the .parquet files since the new workflow removes them from there and now are represented directly into a vectorstore.

What would be the most appropiate way to import this to neo4j now ?

Steps to reproduce

Install graphrag v0.4.0 or higher +
index any inputs, and follow the instructions in the neo4j import notebook

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store: 
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-large
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.(txt|md)$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event,concept,component,specification, business entity, attribute, value, field, system, process, role]
  max_gleanings: 3

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 2

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

No response

Additional Information

GraphRAG Version: v0.4.0 +
Operating System: Linux
Python Version: 3.12
Related Issues: [Bug]: (#1296) KeyError: "Column(s) ['description_embedding'] do not exist" #1345
Related Commits: 17658c5

The text was updated successfully, but these errors were encountered:

DanielGBabel added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue]: Neo4j graphrag import notebook is outdated #1550

[Issue]: Neo4j graphrag import notebook is outdated #1550

DanielGBabel commented Dec 22, 2024

[Issue]: Neo4j graphrag import notebook is outdated #1550

[Issue]: Neo4j graphrag import notebook is outdated #1550

Comments

DanielGBabel commented Dec 22, 2024

Do you need to file an issue?

Describe the issue

Steps to reproduce

GraphRAG Config Used

Logs and screenshots

Additional Information