-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254
Comments
Thanks for the report. Is this the same as #59233? |
Yes, it is. Missed that, sorry about that. |
Found a solution that resolved it for "me". The change involves modifying the _recursive_extract function in _normalize.py. Below is the specific change I made: def _recursive_extract(data, path, seen_meta, level: int = 0, root_obj=None) -> None:
if isinstance(data, dict):
data = [data]
if len(path) > 1:
for obj in data:
if root_obj is None:
root_obj = obj
for val, key in zip(_meta, meta_keys):
if level + 1 == len(val):
seen_meta[key] = _pull_field(root_obj, val)
_recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1, root_obj=root_obj)
else:
for obj in data:
recs = _pull_records(obj, path[0])
recs = [nested_to_record(r, sep=sep, max_level=max_level) if isinstance(r, dict) else r for r in recs]
# For repeating the metadata later
lengths.append(len(recs))
for val, key in zip(_meta, meta_keys):
if level + 1 > len(val):
meta_val = seen_meta[key]
else:
meta_val = _pull_field(root_obj, val)
meta_vals[key].append(meta_val)
records.extend(recs) |
It does support nested lists, however it seems to make the assumption that all except the last element agree with data = {
"level1": [
{
"rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
"meta1": 1,
},
{
"rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}],
"meta1": 2,
},
],
}
df = pd.json_normalize(data, record_path=["level1", "rows"], meta=[["level1", "meta1"]])
print(df)
# col1 col2 level1.meta1
# 0 1 2 1
# 1 3 4 1
# 2 5 6 2
# 3 7 8 2 While I'm not very familiar with this functionality, I believe the intention is to have metadata that sits alongside each collection of records. If the metadata does not sit alongside each collection of records, then I think the result would necessarily be a constant column. Is that your desire @DavidNaizheZhou? |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Description of the Issue
This reproducible example demonstrates the challenges and potential pitfalls when using
pandas.json_normalize()
to extract and flatten hierarchical data structures with nested metadata:Data Structure
The
data
dictionary is multi-layered, with nested dictionaries and a list of dictionaries (rows
) underlevel1
. Additionally,meta1
is structured as a dictionary containing subfields.Successful Normalization
The first call to
pd.json_normalize()
extracts the data fromrows
underlevel1 and includes
meta1as a top-level metadata field. This works as intended because
meta1 is accessed directly as a single key.Output:
KeyError with Nested Meta Fields
The second
pd.json_normalize()
call attempts to extract subfields frommeta1
using a nested path (meta=[["meta1", "meta_sub1"]]
). This results in aKeyError
becausejson_normalize()
does not natively support nested lists for specifying paths within themeta
parameter.Expected Behavior
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.cp1252
pandas : 2.2.3
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : 8.17.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.3
lxml.etree : 5.2.2
matplotlib : 3.8.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 15.0.0
pyreadstat : None
pytest : 8.1.1
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: