A bug of Dataset.to_json() function #7037

Open
LinglingGreat opened this issue Jul 10, 2024 · 2 comments · May be fixed by #7039
@LinglingGreat

Describe the bug

When Dataset.to_json() is called with lines=False, the output file is malformed. The stored data should be a single JSON list, but the file actually contains multiple lists, which causes an error when the data is read back.
The reason is that to_json() writes the file in several segments according to the batch size. This is not a problem when lines=True, but it is incorrect when lines=False: each batch is written as its own list, so the file contains multiple lists whenever len(dataset) > batch_size.
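
The effect can be illustrated without the datasets library. The sketch below is not the library's implementation; write_in_batches is a hypothetical stand-in that, like the behavior described above, serializes each batch as its own JSON array and concatenates the results. With a single batch the output parses, and with several batches json.loads() fails.

```python
import json


def write_in_batches(records, batch_size, indent=None):
    """Hypothetical stand-in for batched writing (not the datasets code).

    Each batch is dumped as its own JSON array, and the arrays are
    concatenated one after another, separated by newlines.
    """
    chunks = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        chunks.append(json.dumps(batch, indent=indent) + "\n")
    return "".join(chunks)


records = [{"id": i} for i in range(5)]

# One batch: the output is a single JSON array and parses fine.
single = write_in_batches(records, batch_size=10)
assert json.loads(single) == records

# Several batches: the output holds several top-level arrays,
# which is not a valid JSON document.
multi = write_in_batches(records, batch_size=2)
try:
    json.loads(multi)
    parse_error = None
except json.JSONDecodeError as err:
    parse_error = err.msg  # "Extra data", the same message as in the report
```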

Steps to reproduce the bug

Run the following code:

from datasets import load_dataset
import json

train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]
output_path = "./harmless-base_hftojs.json"
print(len(train_dataset))
train_dataset.to_json(output_path, lines=False, force_ascii=False, indent=2)

with open(output_path, encoding="utf-8") as f:
    data = json.loads(f.read())

It raises an error: json.decoder.JSONDecodeError: Extra data: line 4003 column 1 (char 1373709)

Extra square brackets appear in the output file (screenshot attached to the original issue).
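
Until the bug is fixed, a file that has already been written this way can still be read back. The following is a minimal sketch, assuming every top-level value in the file is a JSON array of records; load_concatenated_arrays is a hypothetical helper name, not part of datasets. It uses json.JSONDecoder.raw_decode from the standard library to decode one array at a time and merge them.

```python
import json


def load_concatenated_arrays(text):
    """Parse text holding several top-level JSON arrays back to back.

    Hypothetical helper: assumes each top-level value is a list and
    returns all of their items merged into one list.
    """
    decoder = json.JSONDecoder()
    rows = []
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        rows.extend(value)
    return rows
```

For the file from the reproduction above, this would be called as load_concatenated_arrays(f.read()) in place of json.loads(f.read()). Writing with lines=True, which the report says is unaffected, avoids the problem in the first place.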

Expected behavior

The code runs without error: the output file contains a single list that json.loads() can parse.

Environment info

datasets=2.20.0

@albertvillanova albertvillanova self-assigned this Jul 10, 2024
@albertvillanova
Member

Thanks for reporting, @LinglingGreat.

I confirm this is a bug.

@albertvillanova albertvillanova added the bug Something isn't working label Jul 10, 2024
@varadhbhatnagar
Contributor

@albertvillanova I would like to take a shot at this if you aren't working on it currently. Let me know!
