BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

DavidNaizheZhou · 2024-11-08T20:06:24Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = {
    "level1": {
        "rows": [
            {"col1": 1, "col2": 2},
        ]
    },
    "meta1": {
        "meta_sub1": 1,
    }, 
}

df = pd.json_normalize(data, record_path=["level1", "rows"], meta=["meta1"])
print(df)

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)

Issue Description

Description of the Issue

This reproducible example demonstrates the challenges and potential pitfalls when using pandas.json_normalize() to extract and flatten hierarchical data structures with nested metadata:

Data Structure

The data dictionary is multi-layered, with nested dictionaries and a list of dictionaries (rows) under level1. Additionally, meta1 is structured as a dictionary containing subfields.

Successful Normalization

The first call to pd.json_normalize() extracts the data from rows under level1 and includes meta1as a top-level metadata field. This works as intended becausemeta1 is accessed directly as a single key.

Output:

   col1  col2             meta1
0     1     2  {'meta_sub1': 1}

KeyError with Nested Meta Fields

The second pd.json_normalize() call attempts to extract subfields from meta1 using a nested path (meta=[["meta1", "meta_sub1"]]). This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

Expected Behavior

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)

   col1  col2  meta1
0     1     2     1

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.cp1252

pandas : 2.2.3
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : 8.17.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.3
lxml.etree : 5.2.2
matplotlib : 3.8.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 15.0.0
pyreadstat : None
pytest : 8.1.1
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

rhshadrach · 2024-11-08T21:53:03Z

Thanks for the report. Is this the same as #59233?

DavidNaizheZhou · 2024-11-08T22:01:25Z

Yes, it is. Missed that, sorry about that.

DavidNaizheZhou · 2024-11-08T22:49:17Z

Found a solution that resolved it for "me". The change involves modifying the _recursive_extract function in _normalize.py. Below is the specific change I made:

    def _recursive_extract(data, path, seen_meta, level: int = 0, root_obj=None) -> None:
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                if root_obj is None:
                    root_obj = obj
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(root_obj, val)

                _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1, root_obj=root_obj)
        else:
            for obj in data:
                recs = _pull_records(obj, path[0])
                recs = [nested_to_record(r, sep=sep, max_level=max_level) if isinstance(r, dict) else r for r in recs]

                # For repeating the metadata later
                lengths.append(len(recs))
                for val, key in zip(_meta, meta_keys):
                    if level + 1 > len(val):
                        meta_val = seen_meta[key]
                    else:
                        meta_val = _pull_field(root_obj, val)
                    meta_vals[key].append(meta_val)
                records.extend(recs)

rhshadrach · 2024-11-10T14:46:26Z

This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

It does support nested lists, however it seems to make the assumption that all except the last element agree with record_path. E.g.

data = {
    "level1": [
        {
            "rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
            "meta1": 1,
        },
        {
            "rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}],
            "meta1": 2,
        },
    ],
}
df = pd.json_normalize(data, record_path=["level1", "rows"], meta=[["level1", "meta1"]])
print(df)
#    col1  col2 level1.meta1
# 0     1     2            1
# 1     3     4            1
# 2     5     6            2
# 3     7     8            2

While I'm not very familiar with this functionality, I believe the intention is to have metadata that sits alongside each collection of records.

If the metadata does not sit alongside each collection of records, then I think the result would necessarily be a constant column. Is that your desire @DavidNaizheZhou?

DavidNaizheZhou added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 8, 2024

rhshadrach added IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 8, 2024

rhshadrach mentioned this issue Nov 10, 2024

BUG: json_normalize KeyError Key not found #59233

Closed

3 tasks

rhshadrach removed the Needs Info Clarification about behavior needed to assess issue label Nov 10, 2024

rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Nov 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

DavidNaizheZhou commented Nov 8, 2024 •

edited

Loading

INSTALLED VERSIONS

rhshadrach commented Nov 8, 2024

DavidNaizheZhou commented Nov 8, 2024

DavidNaizheZhou commented Nov 8, 2024 •

edited

Loading

rhshadrach commented Nov 10, 2024

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

Comments

DavidNaizheZhou commented Nov 8, 2024 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Description of the Issue

Data Structure

Successful Normalization

Output:

KeyError with Nested Meta Fields

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Nov 8, 2024

DavidNaizheZhou commented Nov 8, 2024

DavidNaizheZhou commented Nov 8, 2024 • edited Loading

rhshadrach commented Nov 10, 2024

DavidNaizheZhou commented Nov 8, 2024 •

edited

Loading

DavidNaizheZhou commented Nov 8, 2024 •

edited

Loading