read_json should support usecols option (or hook for arbitrary transformation after each line is read) #19821
Comments
Correct me if I am wrong, but whitespace is not significant in JSON, so I don't think the hook you envision is generalizable. Are your memory issues on creation of the DataFrame or purely from parsing the JSON? I suppose there are some options for the former, but the latter would be quite the undertaking, if even possible.
I am not sure why whitespace not being significant is important; this is specifically for […].
My point was that, with how you described it, the hook would not be generalizable without […]. You are aware of the […]?
I see your point, though I still think it is helpful to have those arguments when […].
@sam-cohan having a calling hook on every line would make this unbearably slow. This would require quite a major effort, and the JSON parser is just not flexible enough to allow it. This suggestion is already noted in the design of the next-generation parser being contemplated by @wesm; see wesm/pandas2#71, though this is still very much in the early design phase.
@jreback yes, I am essentially reading the file line by line and doing the same myself, and that is very slow. I was hoping there would be some internal optimization that could be built in to discard certain keys from the JSON after each line is read. I guess that is what the enhancement you are referring to is for. Thanks.
One other workaround is to strip all the unneeded columns out of the JSON file once, during development, so subsequent reads in production will be faster.
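That one-time pre-filtering pass can be sketched with the standard library alone (a minimal sketch, assuming a line-delimited JSON file; the key names are hypothetical):

```python
import json

def strip_columns(src_lines, keep):
    """Rewrite line-delimited JSON, keeping only the keys in `keep`."""
    for line in src_lines:
        record = json.loads(line)
        yield json.dumps({k: v for k, v in record.items() if k in keep})

# Hypothetical records with a large unwanted field.
raw = [
    '{"id": 1, "name": "a", "payload": "...big blob..."}',
    '{"id": 2, "name": "b", "payload": "...big blob..."}',
]

slim = list(strip_columns(raw, keep={"id", "name"}))
print(slim[0])  # {"id": 1, "name": "a"}
```

Writing the slimmed lines to a new file means every later `read_json` call parses only the columns that are actually needed.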
Code Sample, a copy-pastable example if possible
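A minimal sketch of the asymmetry, using hypothetical inline data:

```python
import io

import pandas as pd

csv_data = "a,b,c\n1,2,3\n4,5,6\n"
json_lines = '{"a": 1, "b": 2, "c": 3}\n{"a": 4, "b": 5, "c": 6}\n'

# read_csv can drop unwanted columns up front via usecols...
df_csv = pd.read_csv(io.StringIO(csv_data), usecols=["a", "c"])

# ...but read_json has no equivalent: every key is parsed and kept in memory.
df_json = pd.read_json(io.StringIO(json_lines), lines=True)

print(list(df_csv.columns))   # ['a', 'c']
print(list(df_json.columns))  # ['a', 'b', 'c']
```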
Problem description
The `read_csv` function's `usecols` argument is very helpful for reading a subset of columns from a very large file. Unfortunately, it does not exist in `read_json`, so I have to manually read the file line by line and parse out the fields I am interested in to avoid memory issues. This comes at a big cost in speed when loading JSON files.

One possible implementation worth considering might be hooks that allow user-defined transformations after each line is read and after the JSON transformation. That way, we could also support an additional use case: applying a custom decode function to each line when dealing with hybrid proprietary file types.
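The manual line-by-line workaround described above can be sketched as a `usecols`-style filter (an illustrative sketch, not pandas API; the field names and the helper `read_json_usecols` are hypothetical):

```python
import io
import json

import pandas as pd

def read_json_usecols(fh, usecols):
    """Parse line-delimited JSON, keeping only `usecols` from each record."""
    keep = set(usecols)
    rows = (
        {k: v for k, v in json.loads(line).items() if k in keep}
        for line in fh
    )
    # Only the kept fields are ever materialized as DataFrame data.
    return pd.DataFrame(rows, columns=list(usecols))

data = io.StringIO(
    '{"a": 1, "b": 2, "huge": "x"}\n'
    '{"a": 3, "b": 4, "huge": "y"}\n'
)
df = read_json_usecols(data, usecols=["a", "b"])
print(df.shape)  # (2, 2)
```

This avoids holding the unwanted fields in memory, but still pays the full JSON-parsing cost per line, which is the slowness the proposal is about.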