Building wheels for collected package: jq failed in Windows #4396
Comments
I have the same issue. jq (https://pypi.org/project/jq/) just doesn't seem to be supported on Windows. Given that I don't expect help to arrive any time soon, I took a look at the JSON loader code. Ultimately it seems to just return a Document object, which needs a text and a dict from what I can tell. I'm going to try writing my own code that returns this rather than relying on jq. Update: Here is the code I came up with for my particular use case. You should be able to modify it easily for your own.
|
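The snippet from the comment above is not shown in this thread, so the following is only a minimal sketch of the idea it describes: read the file with the standard `json` module and return objects that carry a text and a dict. `Document` here is a stand-in dataclass mirroring the `page_content`/`metadata` shape of LangChain's class, and `load_json_records` and its `text_key` parameter are hypothetical names, not the commenter's code.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: a text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def load_json_records(file_path: str, text_key: str) -> List[Document]:
    """Build one Document per object in a JSON file that holds a list of objects.

    `text_key` (hypothetical parameter) names the field used as the text;
    every other field is copied into the metadata.
    """
    path = Path(file_path)
    records = json.loads(path.read_text(encoding="utf-8"))
    docs = []
    for seq_num, record in enumerate(records, start=1):
        metadata = {k: v for k, v in record.items() if k != text_key}
        metadata.update({"source": str(path), "seq_num": seq_num})
        text = str(record.get(text_key, ""))
        docs.append(Document(page_content=text, metadata=metadata))
    return docs
```

With the real LangChain `Document` imported instead of the stand-in, the returned list should be usable wherever a loader's output is expected, since no jq call is involved.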
@Badrul-Goomblepop Thanks. Indeed, I came across some comments, and it appears that jq is Linux-based and doesn't support Windows. The only issue I'm seeing so far is with pip install langchain[all] --upgrade: as Windows users we can no longer run it, because the jq build failure makes the whole installation fail. It's really annoying when people add OS-specific dependencies. |
As a further update, here is the final code I went with, followed by example usage. Someone brighter than me should probably turn it into a generic class that could be submitted back to LangChain as an alternative to the current JSONLoader. Mine is tailored for my JSON file, which has the structure:
|
@Badrul-Goomblepop Well done! thanks for your work |
This is very useful. How can I change your code for my JSON file, which is pretty complex and nested? |
Hello. I had the same problem with JSONLoader on Windows. I found this repo: https://github.com/jerryjliu/llama_index . Please check it out. It may help to load any data into an LLM. |
Thank you, but how can I then use it as part of LangChain? I have the chains working well for all other documents except JSON. I am looking for help with generic code like
This is great, but like you said, someone needs to make it work for any JSON data if possible. |
We were unable to find a solution to make it work on Windows either. JSONLoader is based on the jq schema, and unfortunately there are no wheel packages available for Windows. Let me know if you find any workarounds. |
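One way to keep code portable while jq is unavailable on Windows is to check for the package at runtime and fall back to the standard library. This is only a sketch: `jq_available` and `choose_json_backend` are hypothetical helper names, and the fallback parsing itself is left to the caller.

```python
import importlib.util


def jq_available() -> bool:
    """Return True if the `jq` Python package can be imported in this environment."""
    return importlib.util.find_spec("jq") is not None


def choose_json_backend() -> str:
    """Pick a backend name: "jq" when the package is installed, else the stdlib "json"."""
    return "jq" if jq_available() else "json"
```

A loader could branch on `choose_json_backend()` so the same script runs on Linux (with jq) and on Windows (without it).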
A bit late to the party, but wrote a more generic loader that simply processes the JSON into individual documents for each property in it:
|
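The loader code from the comment above is not shown in this thread. As a rough sketch of the idea it describes (one document per property of the JSON), the following walks nested objects and arrays and emits one record per leaf value. The function names and the dotted-path format are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_properties(value: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf_value) pairs for every leaf in nested JSON data."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_properties(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_properties(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def properties_to_records(data: Any, source: str = "") -> List[Dict[str, Any]]:
    """Turn each leaf property into a dict with `page_content` and `metadata`."""
    return [
        {"page_content": f"{path}: {leaf}", "metadata": {"source": source, "path": path}}
        for path, leaf in iter_properties(data)
    ]


# Example call on a small nested object.
records = properties_to_records(json.loads('{"user": {"name": "Ada", "tags": ["x", "y"]}}'))
```

Each record has the two fields a LangChain `Document` takes, so it could be converted with `Document(**record)`.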
This is really great, thank you so much for your work. |
@ddematheu , @Badrul-Goomblepop
How about using the JMESPath library to replace jq on Windows? |
thanks !!! |
You missed to add |
ddematheu is the real MVP! |
I have published the loader through a small package that we created which includes both CSV and JSON versions of it. |
Thanks, very helpful. A very minor change if anyone gets the "takes 1 positional argument but 2 were given" error: change def create_documents(processed_data): to def create_documents(self, processed_data): |
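To illustrate why that error appears: Python passes the instance as the first argument of a method call, so a method defined without `self` receives one argument more than it declares. The class and helper names below are hypothetical and exist only for this demonstration.

```python
class BrokenLoader:
    def create_documents(processed_data):  # `self` is missing
        return list(processed_data)


class FixedLoader:
    def create_documents(self, processed_data):  # instance is received as `self`
        return list(processed_data)


def call_create_documents(loader, data):
    """Call loader.create_documents(data) and return the result or the TypeError text."""
    try:
        return loader.create_documents(data)
    except TypeError as exc:
        return f"TypeError: {exc}"


broken_result = call_create_documents(BrokenLoader(), [1, 2])  # error text
fixed_result = call_create_documents(FixedLoader(), [1, 2])    # the list itself
```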
imho, this is the absolute best-case ending to an issue like this one. excited to go try it, thank you @ddematheu for your work! |
Let me know if you have any questions :) |
With the missing self argument and utf-8 encoding in case anyone reads this:
|
How can the lib be further improved to add support for a metadata mapping function like:
|
This one supports a metadata function; so far it is working for me:
|
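For reference, a metadata function in this thread has the shape `Callable[[dict, dict], dict]`: it receives one record plus the metadata collected so far and returns the metadata to store. A minimal sketch, where the field names `author` and `title` are made up for illustration:

```python
from typing import Any, Dict


def metadata_func(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy selected fields of a record into the metadata passed in by the loader."""
    enriched = dict(metadata)  # copy so the caller's dict is not mutated
    enriched["author"] = record.get("author")  # hypothetical field name
    enriched["title"] = record.get("title")    # hypothetical field name
    return enriched
```

It would be passed to a loader as `metadata_func=metadata_func`, alongside a `content_key` that names the text field.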
Add support for jsonl format files:
import json
from pathlib import Path
from typing import Callable, List, Optional, Union

# LangChain import paths at the time of this issue; newer releases may differ.
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[dict, dict], dict]] = None,
        json_lines: bool = False,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        self._metadata_func = metadata_func
        self._json_lines = json_lines

    def create_documents(self, processed_data):
        """
        Creates Document objects from processed data.
        """
        documents = []
        for item in processed_data:
            content = item.get('content', '')
            metadata = item.get('metadata', {})
            document = Document(page_content=content, metadata=metadata)
            documents.append(document)
        return documents

    def process_json(self, data):
        """
        Processes JSON data to prepare for document creation, extracting content based on the content_key
        and applying the metadata function if provided.
        """
        # A JSON Lines record is a single object, so wrap it to reuse the list path.
        if isinstance(data, dict):
            data = [data]
        processed_data = []
        if isinstance(data, list):
            for item in data:
                # Fixed: the attribute is stored as _content_key, not content_key.
                content = item.get(self._content_key, '') if self._content_key else ''
                metadata = {}
                if self._metadata_func and isinstance(item, dict):
                    metadata = self._metadata_func(item, {})
                processed_data.append({'content': content, 'metadata': metadata})
        return processed_data

    def load(self) -> List[Document]:
        """
        Load and return documents from the JSON or JSON Lines file.
        """
        docs = []
        with open(self.file_path, 'r', encoding="utf-8") as file:
            if self._json_lines:
                # Handle JSON Lines: one JSON value per line
                for line_number, line in enumerate(file, start=1):
                    try:
                        data = json.loads(line)
                        processed_json = self.process_json(data)
                        docs.extend(self.create_documents(processed_json))
                    except json.JSONDecodeError:
                        print(f"Error: Invalid JSON format at line {line_number}.")
            else:
                # Handle regular JSON
                try:
                    data = json.load(file)
                    processed_json = self.process_json(data)
                    docs = self.create_documents(processed_json)
                except json.JSONDecodeError:
                    print("Error: Invalid JSON format in the file.")
        return docs
|
A modification of the source-code JSONLoader class that only removes the jq library. It has no problems with simple JSON and JSONL files.
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Union

# LangChain import paths at the time of this issue; newer releases may differ.
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
        text_content: bool = True,
        json_lines: bool = False,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        self._metadata_func = metadata_func
        self._text_content = text_content
        self._json_lines = json_lines

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""
        docs: List[Document] = []
        if self._json_lines:
            with self.file_path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        self._parse(line, docs)
        else:
            self._parse(self.file_path.read_text(encoding="utf-8"), docs)
        return docs

    def _parse(self, content: str, docs: List[Document]) -> None:
        """Convert given content to documents."""
        data = json.loads(content)
        # A JSON Lines record (or a single top-level object) is one sample,
        # so wrap it in a list instead of iterating over its keys.
        if not isinstance(data, list):
            data = [data]
        if not data:
            return
        # Perform some validation
        # This is not a perfect validation, but it should catch most cases
        # and prevent the user from getting a cryptic error later on.
        if self._content_key is not None:
            self._validate_content_key(data)
        if self._metadata_func is not None:
            self._validate_metadata_func(data)
        for i, sample in enumerate(data, len(docs) + 1):
            text = self._get_text(sample=sample)
            metadata = self._get_metadata(sample=sample, source=str(self.file_path), seq_num=i)
            docs.append(Document(page_content=text, metadata=metadata))

    def _get_text(self, sample: Any) -> str:
        """Convert sample to string format"""
        if self._content_key is not None:
            content = sample.get(self._content_key)
        else:
            content = sample
        if self._text_content and not isinstance(content, str):
            raise ValueError(
                f"Expected page_content is string, got {type(content)} instead. "
                "Set `text_content=False` if the desired input for "
                "`page_content` is not a string"
            )
        # In case the text is None, set it to an empty string
        elif isinstance(content, str):
            return content
        elif isinstance(content, dict):
            return json.dumps(content) if content else ""
        else:
            return str(content) if content is not None else ""

    def _get_metadata(self, sample: Dict[str, Any], **additional_fields: Any) -> Dict[str, Any]:
        """
        Return a metadata dictionary based on the existence of metadata_func
        :param sample: single data payload
        :param additional_fields: key-word arguments to be added as metadata values
        :return:
        """
        if self._metadata_func is not None:
            return self._metadata_func(sample, additional_fields)
        else:
            return additional_fields

    def _validate_content_key(self, data: Any) -> None:
        """Check if a content key is valid"""
        # Fixed: jq results offer .first(), but a plain list is indexed instead.
        sample = data[0]
        if not isinstance(sample, dict):
            raise ValueError(
                "Expected the JSON data to be a list of objects (dict), "
                f"so sample must be a dict but got `{type(sample)}`"
            )
        if sample.get(self._content_key) is None:
            raise ValueError(
                "Expected the JSON data to be a list of objects (dict) "
                f"with the key `{self._content_key}`"
            )

    def _validate_metadata_func(self, data: Any) -> None:
        """Check if the metadata_func output is valid"""
        sample = data[0]
        if self._metadata_func is not None:
            sample_metadata = self._metadata_func(sample, {})
            if not isinstance(sample_metadata, dict):
                raise ValueError(
                    "Expected the metadata_func to return a dict but got "
                    f"`{type(sample_metadata)}`"
                )
|
Thanks so much @skywolf123 🙏 the above worked for me with a couple of tweaks:
|
Thank you so much @skywolf123. But this expects a JSON file to be passed as a parameter. Is there a way to pass API data, i.e. a dict, to this JSONLoader class? |
Thank you so much @sabatale. This worked for me with JSON files. But I couldn't pass a dict (API data) to this class. How should this function be modified? |
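One possible answer to the two questions above, offered as a sketch rather than a tested change to the class: skip the file entirely and run the same text and metadata logic over data that is already in memory. `records_from_data` and its `source` label are hypothetical names, and the sketch assumes each sample is a dict whenever `content_key` is given.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def records_from_data(
    data: Any,
    content_key: Optional[str] = None,
    metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
    source: str = "in-memory",
) -> List[Dict[str, Any]]:
    """Build page_content/metadata records from a dict or a list of dicts."""
    samples = data if isinstance(data, list) else [data]  # a lone dict is one sample
    records = []
    for seq_num, sample in enumerate(samples, start=1):
        content = sample.get(content_key) if content_key is not None else sample
        if isinstance(content, str):
            text = content
        elif content is None:
            text = ""
        else:
            text = json.dumps(content)  # serialize non-string content
        metadata = {"source": source, "seq_num": seq_num}
        if metadata_func is not None:
            metadata = metadata_func(sample, metadata)
        records.append({"page_content": text, "metadata": metadata})
    return records
```

Each record can then be turned into a LangChain document with `Document(**record)`, for example on the parsed body of an API response.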
For those looking to run
|
System Info
Hi, I can't update langchain. Any insight?
Who can help?
No response
Information
Related Components
Reproduction
pip install langchain[all] --upgrade
Expected behavior
No issue during installation