[Bug]: Blob storage loads all files regardless of base_dir #2115

@Robotuks

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

I am using blob storage for input, but while running indexing I noticed it tries to read more files than it should: it loads every file in the container, not just those under base_dir.

input config example:

input:
  storage:
    type: blob # [file, blob]
    base_dir: "input/folder_1/"
    container_name: "graphrag-container"
    storage_account_blob_url: "https://graphragexample.blob.core.windows.net"
  file_type: text

With this example config, I get warning messages like this during indexing:
2025-10-24 11:30:38.0647 - WARNING - graphrag.storage.blob_pipeline_storage - Error getting key input/folder_1/input/folder_2/text.txt

In the code, I noticed that all blobs in the container are listed:

all_blobs = list(container_client.list_blobs())

I couldn't pinpoint exactly where files outside base_dir fail to be filtered out, but listing only the blobs under the base directory would avoid the issue:
all_blobs = list(container_client.list_blobs(name_starts_with=base_dir))
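For illustration, the effect of `name_starts_with` can be sketched in plain Python. In the Azure SDK the prefix filter is applied server-side by `ContainerClient.list_blobs`, but the result is equivalent to a simple string-prefix check; the blob names below are hypothetical, chosen to mirror the warning above.

```python
# Hypothetical blob names in the container; only those under base_dir
# should be picked up by indexing.
blobs = [
    "input/folder_1/a.txt",
    "input/folder_1/sub/b.txt",
    "input/folder_2/text.txt",
    "other/readme.md",
]

base_dir = "input/folder_1/"

# Equivalent of container_client.list_blobs(name_starts_with=base_dir):
# keep only blobs whose names start with the base_dir prefix.
filtered = [name for name in blobs if name.startswith(base_dir)]
# filtered → ["input/folder_1/a.txt", "input/folder_1/sub/b.txt"]
```

With the server-side filter, blobs such as `input/folder_2/text.txt` would never be returned, so the "Error getting key" warnings for files outside base_dir should disappear.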

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 2.6.0
  • Operating System: Ubuntu 22
  • Python Version: 3.12

Metadata

    Labels

      • backlog: We've confirmed some action is needed on this and will plan it
      • bug: Something isn't working
