Building wheels for collected package: jq failed in Windows #4396

AlbelTec · 2023-05-09T12:58:59Z

System Info

Hi, can't update langchain. any insight ?

Building wheels for collected packages: jq
  Building wheel for jq (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for jq (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      Executing: ./configure CFLAGS=-fPIC --prefix=C:\Users\mysuser\AppData\Local\Temp\pip-install-7643mu3e\jq_64b3898552df463e990cf884cae8a414\_deps\build\onig-install-6.9.4
      error: [WinError 2] The system cannot find the file specified
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for jq
Failed to build jq
ERROR: Could not build wheels for jq, which is required to install pyproject.toml-based projects

Who can help?

No response

Information

The official example notebooks/scripts
My own modified scripts

Related Components

Reproduction

pip install langchain[all] --upgrade

Expected behavior

no issue during installtion

The text was updated successfully, but these errors were encountered:

Badrul-Goomblepop · 2023-05-12T17:59:10Z

I have the same issue. JQ (https://pypi.org/project/jq/) just doesn't seem to be supported for windows.

Given that I don't expect help to arrive any time soon I tool a look into the JSON loader code. It seems ultimately just returns a document object which needs a text and a dict from what I can tell. Going to try and just write my own code that gives back this rather than relying on jq.

Update: Here is the code I came up with for my particular use case. You should be able to modify it easily for your own.


from langchain.docstore.document import Document
import json

def loadJSONFile(file_path):
    docs=[]
    # Load JSON file
    with open(file_path) as file:
        data = json.load(file)

    # Iterate through 'pages'
    for page in data['pages']:
        parenturl = page['parenturl']
        pagetitle = page['pagetitle']
        indexeddate = page['indexeddate']
        snippets = page['snippets']
        metadata={"title":pagetitle}

        # Process snippets for each page
        for snippet in snippets:
            index = snippet['index']
            childurl = snippet['childurl']
            text = snippet['text']
            docs.append(Document(page_content=text, metadata=metadata))
    return docs

AlbelTec · 2023-05-13T12:26:51Z

@Badrul-Goomblepop Thanks. Indeed I came across some comments and it appears that jq is linux based and not windows. The only issue I'm seeing so far, is that when doing : pip install langchain[all] --upgrade. As windows users we can't anymore do it like so as jq will put in fail the whole stuff. Really annoying when people add OS dependencies.

Badrul-Goomblepop · 2023-05-13T16:04:18Z

As a further update here is the final code I went with followed by example usage. Someone brighter than me should probably turn it into a generic class that could be submitted back to LangChain as an alternative to the current JSONLoader. My one is tailored for my JSON file which is of the structure:
{pages {parenturl:"",pagetitle:"",snippets[ {...} ]}

"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        
    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        # Load JSON file
        with open(self.file_path) as file:
            data = json.load(file)

            # Iterate through 'pages'
            for page in data['pages']:
                parenturl = page['parenturl']
                pagetitle = page['pagetitle']
                indexeddate = page['indexeddate']
                snippets = page['snippets']

                # Process snippets for each page
                for snippet in snippets:
                    index = snippet['index']
                    childurl = snippet['childurl']
                    text = snippet['text']
                    metadata = dict(
                        source=childurl,
                        title=pagetitle)

                    docs.append(Document(page_content=text, metadata=metadata))
        return docs

file_path='data/data.json'
loader = JSONLoader(file_path=file_path)
data = loader.load()
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

tsaijamey · 2023-07-08T03:18:52Z

As a further update here is the final code I went with followed by example usage. Someone brighter than me should probably turn it into a generic class that could be submitted back to LangChain as an alternative to the current JSONLoader. My one is tailored for my JSON file which is of the structure: {pages {parenturl:"",pagetitle:"",snippets[ {...} ]}

"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        
    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        # Load JSON file
        with open(self.file_path) as file:
            data = json.load(file)

            # Iterate through 'pages'
            for page in data['pages']:
                parenturl = page['parenturl']
                pagetitle = page['pagetitle']
                indexeddate = page['indexeddate']
                snippets = page['snippets']

                # Process snippets for each page
                for snippet in snippets:
                    index = snippet['index']
                    childurl = snippet['childurl']
                    text = snippet['text']
                    metadata = dict(
                        source=childurl,
                        title=pagetitle)

                    docs.append(Document(page_content=text, metadata=metadata))
        return docs

file_path='data/data.json'
loader = JSONLoader(file_path=file_path)
data = loader.load()
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

@Badrul-Goomblepop Well done! thanks for your work

ajeeto · 2023-07-11T10:58:50Z

This is very useful. how can i change your code for the my json file, which is pretty compelx and nested json

utkucanaytac · 2023-07-14T10:12:49Z

Hello. I had the same problem with JsonLoader on Windows. I found this repo : https://github.com/jerryjliu/llama_index . Please check this. It may help to load any data into LLM

ajeeto · 2023-07-14T10:34:32Z

Thank you, but how can then I use it as part of langchain. I have the chains working well for all other documents except json, i am looking for help for a generic code like

As a further update here is the final code I went with followed by example usage. Someone brighter than me should probably turn it into a generic class that could be submitted back to LangChain as an alternative to the current JSONLoader. My one is tailored for my JSON file which is of the structure: {pages {parenturl:"",pagetitle:"",snippets[ {...} ]}

"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        
    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        # Load JSON file
        with open(self.file_path) as file:
            data = json.load(file)

            # Iterate through 'pages'
            for page in data['pages']:
                parenturl = page['parenturl']
                pagetitle = page['pagetitle']
                indexeddate = page['indexeddate']
                snippets = page['snippets']

                # Process snippets for each page
                for snippet in snippets:
                    index = snippet['index']
                    childurl = snippet['childurl']
                    text = snippet['text']
                    metadata = dict(
                        source=childurl,
                        title=pagetitle)

                    docs.append(Document(page_content=text, metadata=metadata))
        return docs

file_path='data/data.json'
loader = JSONLoader(file_path=file_path)
data = loader.load()
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

@Badrul-Goomblepop Well done! thanks for your work

This is great, but like you said , need someoone to make it work for any json data if possible.

utkucanaytac · 2023-07-22T09:23:16Z

We were unable to discover a solution to make it work on Windows too. Due to JsonLoader being based on jq scheme, unfortunately, there are no available wheel packages tailored for Windows. Let me know if you find any workarounds ?

ddematheu · 2023-08-09T23:12:48Z

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

kumar9005 · 2023-08-10T18:16:00Z

file_path='data/data.json'
loader = JSONLoader(file_path=file_path)
data = loader.load()

This is really great, thank you so much for your work.

abdkumar · 2023-08-29T12:50:20Z

@ddematheu , @Badrul-Goomblepop

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

How about using JMESPath library to replace jq in windows os?

wujianming1996 · 2023-09-03T13:42:34Z

thanks !!!

Childcity · 2023-09-08T23:35:12Z

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

You missed to add self in def create_documents(processed_data): as first arg 😀

oldgitt · 2023-09-28T11:04:22Z

ddematheu is the real MVP!

ddematheu · 2023-09-28T17:28:02Z

I have published the loader through a small package that we created which includes both CSV and JSON versions of it.

https://pypi.org/project/neumai-tools/

Keven1894 · 2023-10-04T17:08:19Z

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

Thanks, very helpful. Very minor change if anyone received "takes 1 positional argument but 2 were given" error. def create_documents(processed_data): --> def create_documents(self,processed_data):

hexbenjamin · 2023-10-23T13:35:35Z

I have published the loader through a small package that we created which includes both CSV and JSON versions of it.

https://pypi.org/project/neumai-tools/

imho, this is the absolute best-case ending to an issue like this one. excited to go try it, thank you @ddematheu for your work!

ddematheu · 2023-10-23T22:38:00Z

Let me know if you have any questions :)

sabatale · 2024-01-06T19:29:17Z

With the missing self argument and utf-8 encoding in case anyone reads this:

"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(self, processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, mode="r", encoding="utf-8") as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

zpf7879 · 2024-02-04T08:56:15Z

How can further improve the lib to add the support for meta_data mapping function like:

def item_metadata_func(record: dict, metadata: dict) -> dict: 

    metadata["name"] = record.get("name")
    metadata["url"] = record.get("url")

    return metadata

loader = JSONLoader(
        file_path="services.json",
        content_key='description'),
        metadata_func=item_metadata_func)

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

Thanks, very helpful. Very minor change if anyone received "takes 1 positional argument but 2 were given" error. def create_documents(processed_data): --> def create_documents(self,processed_data):

MattMindQ · 2024-02-28T17:47:29Z

How can further improve the lib to add the support for meta_data mapping function like:

def item_metadata_func(record: dict, metadata: dict) -> dict: 

    metadata["name"] = record.get("name")
    metadata["url"] = record.get("url")

    return metadata

loader = JSONLoader(
        file_path="services.json",
        content_key='description'),
        metadata_func=item_metadata_func)

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

Thanks, very helpful. Very minor change if anyone received "takes 1 positional argument but 2 were given" error. def create_documents(processed_data): --> def create_documents(self,processed_data):

This one can use metadata function, so far it is working for me :

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[dict, dict], dict]] = None,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self.content_key = content_key 
        self.metadata_func = metadata_func

    def create_documents(self, processed_data):
        """
        Creates Document objects from processed data.
        """
        documents = []
        for item in processed_data:
            content = item.get('content', '')  
            metadata = item.get('metadata', {})
            document = Document(page_content=content, metadata=metadata)
            documents.append(document)
        return documents

    def process_json(self, data):
        """
        Processes JSON data to prepare for document creation, extracting content based on the content_key
        and applying the metadata function if provided.
        """
        processed_data = []
        if isinstance(data, list):
            for item in data:
                content = item.get(self.content_key, '') if self.content_key else ''
                metadata = {}
                if self.metadata_func and isinstance(item, dict):
                    metadata = self.metadata_func(item, {})
                processed_data.append({'content': content, 'metadata': metadata})
        return processed_data

    def load(self) -> List[Document]:
        """
        Load and return documents from the JSON file.
        """
        docs = []
        with open(self.file_path, mode="r", encoding="utf-8") as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

skywolf123 · 2024-03-01T08:56:41Z

How can further improve the lib to add the support for meta_data mapping function like:

def item_metadata_func(record: dict, metadata: dict) -> dict: 

    metadata["name"] = record.get("name")
    metadata["url"] = record.get("url")

    return metadata

loader = JSONLoader(
        file_path="services.json",
        content_key='description'),
        metadata_func=item_metadata_func)

A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        ):
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
    
    def create_documents(processed_data):
        documents = []
        for item in processed_data:
            content = ''.join(item)
            document = Document(page_content=content, metadata={})
            documents.append(document)
        return documents
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""

        docs=[]
        with open(self.file_path, 'r') as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

Thanks, very helpful. Very minor change if anyone received "takes 1 positional argument but 2 were given" error. def create_documents(processed_data): --> def create_documents(self,processed_data):

This one can use metadata function, so far it is working for me :

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[dict, dict], dict]] = None,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self.content_key = content_key 
        self.metadata_func = metadata_func

    def create_documents(self, processed_data):
        """
        Creates Document objects from processed data.
        """
        documents = []
        for item in processed_data:
            content = item.get('content', '')  
            metadata = item.get('metadata', {})
            document = Document(page_content=content, metadata=metadata)
            documents.append(document)
        return documents

    def process_json(self, data):
        """
        Processes JSON data to prepare for document creation, extracting content based on the content_key
        and applying the metadata function if provided.
        """
        processed_data = []
        if isinstance(data, list):
            for item in data:
                content = item.get(self.content_key, '') if self.content_key else ''
                metadata = {}
                if self.metadata_func and isinstance(item, dict):
                    metadata = self.metadata_func(item, {})
                processed_data.append({'content': content, 'metadata': metadata})
        return processed_data

    def load(self) -> List[Document]:
        """
        Load and return documents from the JSON file.
        """
        docs = []
        with open(self.file_path, mode="r", encoding="utf-8") as json_file:
            try:
                data = json.load(json_file)
                processed_json = self.process_json(data)
                docs = self.create_documents(processed_json)
            except json.JSONDecodeError:
                print("Error: Invalid JSON format in the file.")
        return docs

Add support for jsonl format files:

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[dict, dict], dict]] = None,
        json_lines: bool = False
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        self._metadata_func = metadata_func
        self._json_lines = json_lines

    def create_documents(self, processed_data):
        """
        Creates Document objects from processed data.
        """
        documents = []
        for item in processed_data:
            content = item.get('content', '')  
            metadata = item.get('metadata', {})
            document = Document(page_content=content, metadata=metadata)
            documents.append(document)
        return documents

    def process_json(self, data):
        """
        Processes JSON data to prepare for document creation, extracting content based on the content_key
        and applying the metadata function if provided.
        """
        processed_data = []
        if isinstance(data, list):
            for item in data:
                content = item.get(self.content_key, '') if self.content_key else ''
                metadata = {}
                if self._metadata_func and isinstance(item, dict):
                    metadata = self._metadata_func(item, {})
                processed_data.append({'content': content, 'metadata': metadata})
        return processed_data

    def load(self) -> List[Document]:
        """
        Load and return documents from the JSON or JSON Lines file.
        """
        docs = []
        with open(self.file_path, 'r', encoding="utf-8") as file:
            if self._json_lines:
                # Handle JSON Lines
                for line_number, line in enumerate(file, start=1):
                    try:
                        data = json.loads(line)
                        processed_json = self.process_json(data)
                        docs.extend(self.create_documents(processed_json))
                    except json.JSONDecodeError:
                        print(f"Error: Invalid JSON format at line {line_number}.")
            else:
                # Handle regular JSON
                try:
                    data = json.load(file)
                    processed_json = self.process_json(data)
                    docs = self.create_documents(processed_json)
                except json.JSONDecodeError:
                    print("Error: Invalid JSON format in the file.")
        return docs

skywolf123 · 2024-03-01T13:54:04Z

Source code based JSONLoader class modification, only removing the jq library, no problem with simple JSON and JSONL files.

class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
        text_content: bool = True,
        json_lines: bool = False,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        self._metadata_func = metadata_func
        self._text_content = text_content
        self._json_lines = json_lines

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""
        docs: List[Document] = []
        if self._json_lines:
            with self.file_path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        self._parse(line, docs)
        else:
            self._parse(self.file_path.read_text(encoding="utf-8"), docs)
        return docs

    def _parse(self, content: str, docs: List[Document]) -> None:
        """Convert given content to documents."""
        data = json.loads(content)

        # Perform some validation
        # This is not a perfect validation, but it should catch most cases
        # and prevent the user from getting a cryptic error later on.
        if self._content_key is not None:
            self._validate_content_key(data)
        if self._metadata_func is not None:
            self._validate_metadata_func(data)

        for i, sample in enumerate(data, len(docs) + 1):
            text = self._get_text(sample=sample)
            metadata = self._get_metadata(sample=sample, source=str(self.file_path), seq_num=i)
            docs.append(Document(page_content=text, metadata=metadata))

    def _get_text(self, sample: Any) -> str:
        """Convert sample to string format"""
        if self._content_key is not None:
            content = sample.get(self._content_key)
        else:
            content = sample

        if self._text_content and not isinstance(content, str):
            raise ValueError(
                f"Expected page_content is string, got {type(content)} instead. \
                    Set `text_content=False` if the desired input for \
                    `page_content` is not a string"
            )

        # In case the text is None, set it to an empty string
        elif isinstance(content, str):
            return content
        elif isinstance(content, dict):
            return json.dumps(content) if content else ""
        else:
            return str(content) if content is not None else ""

    def _get_metadata(self, sample: Dict[str, Any], **additional_fields: Any) -> Dict[str, Any]:
        """
        Return a metadata dictionary base on the existence of metadata_func
        :param sample: single data payload
        :param additional_fields: key-word arguments to be added as metadata values
        :return:
        """
        if self._metadata_func is not None:
            return self._metadata_func(sample, additional_fields)
        else:
            return additional_fields

    def _validate_content_key(self, data: Any) -> None:
        """Check if a content key is valid"""
        sample = data.first()
        if not isinstance(sample, dict):
            raise ValueError(
                f"Expected the jq schema to result in a list of objects (dict), \
                    so sample must be a dict but got `{type(sample)}`"
            )

        if sample.get(self._content_key) is None:
            raise ValueError(
                f"Expected the jq schema to result in a list of objects (dict) \
                    with the key `{self._content_key}`"
            )

    def _validate_metadata_func(self, data: Any) -> None:
        """Check if the metadata_func output is valid"""

        sample = data.first()
        if self._metadata_func is not None:
            sample_metadata = self._metadata_func(sample, {})
            if not isinstance(sample_metadata, dict):
                raise ValueError(
                    f"Expected the metadata_func to return a dict but got \
                        `{type(sample_metadata)}`"
                )

AngelaO · 2024-03-19T13:04:58Z

Thanks so much @skywolf123 🙏 the above worked for me with a couple of tweaks:

changed to from typing import Any, Callable, Dict, List, Optional, Union - was missing the Any import
no first() method available if data is type List, which it was in my case, so changed to data: List instead of data: Any in both the _validate_content_key and _validate_metadata_func methods plus switched sample = data.first() to sample = data[0] in both those methods as well
changed if sample.get(self._content_key) is None: to if not self._content_key in sample.keys(): - the value can be None but the key is still there
changed the default value of self.text_content to bool = False - just a heads up to others using this, set this to True if your JSON field text_content should never be None, the code will throw exception then if it is, otherwise it will replace with empty string
Thanks again for saving me some time reinventing the wheel with this! :)

sivaprasad49 · 2024-04-29T12:38:59Z

Thanks you so much @skywolf123 . But this expects a json file to be passed as parameter. Is there a way to pass an API data i.e dict to call this JSONLoader class??

sivaprasad49 · 2024-04-29T12:42:33Z

Thank you so much @sabatale . This worked for me with json files. But I couldn't pass a dict (a API data) to this class. How to modify this function?

Olfredos6 · 2024-06-12T04:59:06Z

For those looking to run jq on Windows, there's a wheel available for jq at https://jeffreyknockel.com/jq/. The author mentions that there are not many tests done on this yet . So, if you do choose to work with it, it's at your own risks. Been using it, so far, no issue. How to go about it?

Download the .whlfile matching your Python version
Install it:
```
   pip install [PATH TO .WHL FILE]
```

Ref: mwilliamson/jq.py#20 (comment)

rasavant-ms mentioned this issue Nov 29, 2023

jq requirement causing errors on Windows microsoft/rag-experiment-accelerator#167

Closed

parisologist mentioned this issue Dec 8, 2023

Can you update docs to clarify yq doesn't work on windows? mikefarah/yq#1897

Closed

CchenDdong mentioned this issue Apr 19, 2024

[BUG] python init_database.py --recreate-vs时报错 chatchat-space/Langchain-Chatchat#3121

Closed

dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 11, 2024

dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 18, 2024

dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building wheels for collected package: jq failed in Windows #4396

Building wheels for collected package: jq failed in Windows #4396

AlbelTec commented May 9, 2023

Badrul-Goomblepop commented May 12, 2023 •

edited

Loading

AlbelTec commented May 13, 2023 •

edited

Loading

Badrul-Goomblepop commented May 13, 2023 •

edited

Loading

tsaijamey commented Jul 8, 2023

ajeeto commented Jul 11, 2023

utkucanaytac commented Jul 14, 2023

ajeeto commented Jul 14, 2023

utkucanaytac commented Jul 22, 2023

ddematheu commented Aug 9, 2023 •

edited

Loading

kumar9005 commented Aug 10, 2023

abdkumar commented Aug 29, 2023 •

edited

Loading

wujianming1996 commented Sep 3, 2023

Childcity commented Sep 8, 2023 •

edited

Loading

oldgitt commented Sep 28, 2023 •

edited

Loading

ddematheu commented Sep 28, 2023

Keven1894 commented Oct 4, 2023

hexbenjamin commented Oct 23, 2023

ddematheu commented Oct 23, 2023

sabatale commented Jan 6, 2024 •

edited

Loading

zpf7879 commented Feb 4, 2024

MattMindQ commented Feb 28, 2024

skywolf123 commented Mar 1, 2024 •

edited

Loading

skywolf123 commented Mar 1, 2024

AngelaO commented Mar 19, 2024 •

edited

Loading

sivaprasad49 commented Apr 29, 2024 •

edited

Loading

sivaprasad49 commented Apr 29, 2024

Olfredos6 commented Jun 12, 2024 •

edited

Loading

Building wheels for collected package: jq failed in Windows #4396

Building wheels for collected package: jq failed in Windows #4396

Comments

AlbelTec commented May 9, 2023

System Info

Who can help?

Information

Related Components

Reproduction

Expected behavior

Badrul-Goomblepop commented May 12, 2023 • edited Loading

AlbelTec commented May 13, 2023 • edited Loading

Badrul-Goomblepop commented May 13, 2023 • edited Loading

tsaijamey commented Jul 8, 2023

ajeeto commented Jul 11, 2023

utkucanaytac commented Jul 14, 2023

ajeeto commented Jul 14, 2023

utkucanaytac commented Jul 22, 2023

ddematheu commented Aug 9, 2023 • edited Loading

kumar9005 commented Aug 10, 2023

abdkumar commented Aug 29, 2023 • edited Loading

wujianming1996 commented Sep 3, 2023

Childcity commented Sep 8, 2023 • edited Loading

oldgitt commented Sep 28, 2023 • edited Loading

ddematheu commented Sep 28, 2023

Keven1894 commented Oct 4, 2023

hexbenjamin commented Oct 23, 2023

ddematheu commented Oct 23, 2023

sabatale commented Jan 6, 2024 • edited Loading

zpf7879 commented Feb 4, 2024

MattMindQ commented Feb 28, 2024

skywolf123 commented Mar 1, 2024 • edited Loading

skywolf123 commented Mar 1, 2024

AngelaO commented Mar 19, 2024 • edited Loading

sivaprasad49 commented Apr 29, 2024 • edited Loading

sivaprasad49 commented Apr 29, 2024

Olfredos6 commented Jun 12, 2024 • edited Loading

Badrul-Goomblepop commented May 12, 2023 •

edited

Loading

AlbelTec commented May 13, 2023 •

edited

Loading

Badrul-Goomblepop commented May 13, 2023 •

edited

Loading

ddematheu commented Aug 9, 2023 •

edited

Loading

abdkumar commented Aug 29, 2023 •

edited

Loading

Childcity commented Sep 8, 2023 •

edited

Loading

oldgitt commented Sep 28, 2023 •

edited

Loading

sabatale commented Jan 6, 2024 •

edited

Loading

skywolf123 commented Mar 1, 2024 •

edited

Loading

AngelaO commented Mar 19, 2024 •

edited

Loading

sivaprasad49 commented Apr 29, 2024 •

edited

Loading

Olfredos6 commented Jun 12, 2024 •

edited

Loading