
read_json should support usecols option (or hook for arbitrary transformation after each line is read) #19821

Closed
sam-cohan opened this issue Feb 21, 2018 · 7 comments
Labels: IO JSON (read_json, to_json, json_normalize), Performance (memory or execution speed)

Comments

@sam-cohan (Contributor) commented Feb 21, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

# Proposed API: the usecols argument does not currently exist in read_json.
pd.read_json(path_to_big_file, lines=True, usecols=some_subset_of_cols)

Problem description

The read_csv function's usecols argument is very helpful for reading a subset of columns from a very large file. Unfortunately, it does not exist in read_json, so I have to manually read the file line by line and parse out the fields I am interested in to avoid memory issues, which makes loading JSON files much slower.
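
A minimal sketch of that manual workaround (the path and column names are placeholders):

import json
import pandas as pd

def read_json_subset(path, usecols):
    # Parse a JSON-lines file keeping only the requested keys, so the
    # full set of columns never has to be held in memory at once.
    records = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            records.append({k: rec.get(k) for k in usecols})
    return pd.DataFrame(records, columns=usecols)

df = read_json_subset("big_file.jsonl", ["col_A", "col_B"])  # placeholder names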

One possible implementation worth considering would be hooks that allow user-defined transformations after each line is read and after it is parsed into JSON. That way, we could also support an additional use case: applying a custom decode function to each line when dealing with hybrid proprietary file types.

def strip_preamble(line):
    '''Example transformation applied to each raw line before JSON parsing.'''
    return line.split("|", 1)[1]

def select_subset(json_rec):
    '''Example selection of a subset of keys from each parsed JSON line.'''
    return {k: json_rec.get(k) for k in ["col_A", "col_B"]}

# Proposed API: neither line_loaded_hook nor json_created_hook exists today.
pd.read_json(path_to_big_file, lines=True, line_loaded_hook=strip_preamble, json_created_hook=select_subset)
@WillAyd (Member) commented Feb 22, 2018

Correct me if I am wrong, but whitespace is not significant in JSON, so I don't think the hook you envision is generalizable. Are your memory issues from creating the DataFrame or purely from parsing the JSON? I suppose there are some options for the former, but the latter would be quite the undertaking, if it is even possible.

@sam-cohan (Contributor, Author) commented:

I am not sure why whitespace not being significant matters here. This is specifically for lines=True, so presumably there could be a hook just before and just after each line is parsed into JSON, passing the object out for arbitrary manipulation rather than waiting for the whole DataFrame to be created before dropping some of the columns.

@WillAyd (Member) commented Feb 22, 2018

My point was that, as you described it, the hook would not be generalizable without lines=True, so it would be pretty difficult to implement usecols outside of a limited scope, if at all. I'm not an expert on JSON, so perhaps others can chime in.

You are aware of the chunksize argument, right? If you are running into memory issues parsing the file, it can help at the cost of some performance.
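
A minimal sketch of that approach (the path and column names are placeholders; each chunk is still parsed in full, but only a bounded number of rows is held at once):

import pandas as pd

# With lines=True, chunksize returns an iterator of DataFrames, so the
# unwanted columns can be dropped chunk by chunk.
chunks = pd.read_json("big_file.jsonl", lines=True, chunksize=10_000)
df = pd.concat((chunk[["col_A", "col_B"]] for chunk in chunks), ignore_index=True)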

@sam-cohan (Contributor, Author) commented:

I see your point, though I still think it would be helpful to have those arguments when lines=True. I guess one way around my current problem is indeed the chunksize parameter.

@jreback (Contributor) commented Feb 23, 2018

@sam-cohan having a callback hook on every line would make this unbearably slow. This would require quite a major effort, and the json parser is just not flexible enough to allow it. This suggestion is already noted in the design of the next-generation parser being contemplated by @wesm; see wesm/pandas2#71, though this is still very much in the early design phase.

@jreback closed this as completed Feb 23, 2018
@jreback added the IO JSON and Performance labels Feb 23, 2018
@jreback added this to the No action milestone Feb 23, 2018
@sam-cohan (Contributor, Author) commented:

@jreback yes, I am essentially reading the file line by line and doing the same thing myself, and that is very slow. I was hoping some internal optimization could be built in to discard certain keys from the JSON after each line is read. I guess that is what the enhancement you are referring to is for. Thanks.

@JackOfSpade commented:

One other workaround is to strip out the unneeded columns from the json file during development, so subsequent reads in production will be faster.
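
A minimal sketch of that one-time preprocessing step (the file names and column list are placeholders):

import json

keep = ["col_A", "col_B"]
# Rewrite the JSON-lines file keeping only the needed keys, so later
# reads parse a much smaller file.
with open("big_file.jsonl") as src, open("big_file_slim.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        dst.write(json.dumps({k: rec.get(k) for k in keep}) + "\n")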
