
read_json should support usecols option (or hook for arbitrary transformation after each line is read) #19821

Closed
sam-cohan opened this issue Feb 21, 2018 · 7 comments
Labels: IO JSON (read_json, to_json, json_normalize), Performance (memory or execution speed)

Comments

@sam-cohan (Contributor) commented Feb 21, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

# Proposed API: the usecols argument does not currently exist in read_json.
pd.read_json(path_to_big_file, lines=True, usecols=some_subset_of_cols)

Problem description

The read_csv function's usecols argument is very helpful for reading a subset of columns from a very large file. Unfortunately, it does not exist in read_json, so I have to manually read the file line by line and parse out the fields I am interested in to avoid memory issues, which makes loading JSON files much slower.
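
A minimal sketch of that manual workaround (the path and column names are placeholders):

import json
import pandas as pd

def read_json_subset(path, usecols):
    # Parse a JSON-lines file keeping only the requested keys, so the
    # full set of columns never has to be held in memory at once.
    records = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            records.append({k: rec.get(k) for k in usecols})
    return pd.DataFrame(records, columns=usecols)

df = read_json_subset("big_file.jsonl", ["col_A", "col_B"])  # placeholder names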

One possible implementation worth considering would be hooks that allow user-defined transformations after each line is read and after it is parsed into JSON. That way, we could also support an additional use case: applying a custom decode function to each line when dealing with hybrid proprietary file types.

def strip_preamble(line):
    '''Example transformation applied to each raw line before JSON parsing.'''
    return line.split("|", 1)[1]

def select_subset(json_rec):
    '''Example selection of a subset of keys from each parsed JSON line.'''
    return {k: json_rec.get(k) for k in ["col_A", "col_B"]}

# Proposed API: neither line_loaded_hook nor json_created_hook exists today.
pd.read_json(path_to_big_file, lines=True, line_loaded_hook=strip_preamble, json_created_hook=select_subset)
@WillAyd (Member) commented Feb 22, 2018

Correct me if I am wrong, but whitespace is not significant in JSON, so I don't think the hook you envision is generalizable. Are your memory issues from creating the DataFrame or purely from parsing the JSON? I suppose there are some options for the former, but the latter would be quite the undertaking, if it is even possible.

@sam-cohan (Contributor, Author) commented:

I am not sure why whitespace not being significant matters here. This is specifically for lines=True, so presumably there could be a hook just before and just after each line is parsed into JSON, passing the object out for arbitrary manipulation rather than waiting for the whole DataFrame to be created before dropping some of the columns.

@WillAyd (Member) commented Feb 22, 2018

My point was that, as you described it, the hook would not be generalizable without lines=True, so it would be pretty difficult to implement usecols outside of a limited scope, if at all. I'm not an expert on JSON, so perhaps others can chime in.

You are aware of the chunksize argument, right? If you are running into memory issues parsing the file, it can help at the cost of some performance.
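
A minimal sketch of that approach (the path and column names are placeholders; each chunk is still parsed in full, but only a bounded number of rows is held at once):

import pandas as pd

# With lines=True, chunksize returns an iterator of DataFrames, so the
# unwanted columns can be dropped chunk by chunk.
chunks = pd.read_json("big_file.jsonl", lines=True, chunksize=10_000)
df = pd.concat((chunk[["col_A", "col_B"]] for chunk in chunks), ignore_index=True)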

@sam-cohan (Contributor, Author) commented:

I see your point, though I still think it would be helpful to have those arguments when lines=True. I guess one way around my current problem is indeed the chunksize parameter.

@jreback (Contributor) commented Feb 23, 2018

@sam-cohan having a callback hook on every line would make this unbearably slow. This would require quite a major effort, and the json parser is just not flexible enough to allow it. This suggestion is already noted in the design of the next-generation parser being contemplated by @wesm; see wesm/pandas2#71, though this is still very much in the early design phase.

@jreback closed this as completed Feb 23, 2018
@jreback added the IO JSON and Performance labels Feb 23, 2018
@jreback added this to the No action milestone Feb 23, 2018
@sam-cohan (Contributor, Author) commented:

@jreback yes, I am essentially reading the file line by line and doing the same thing myself, and that is very slow. I was hoping some internal optimization could be built in to discard certain keys from the JSON after each line is read. I guess that is what the enhancement you are referring to is for. Thanks.

@JackOfSpade commented:

One other workaround is to strip out the unneeded columns from the json file during development, so subsequent reads in production will be faster.
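
A minimal sketch of that one-time preprocessing step (the file names and column list are placeholders):

import json

keep = ["col_A", "col_B"]
# Rewrite the JSON-lines file keeping only the needed keys, so later
# reads parse a much smaller file.
with open("big_file.jsonl") as src, open("big_file_slim.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        dst.write(json.dumps({k: rec.get(k) for k in keep}) + "\n")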
