Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

Open
3 tasks done
DavidNaizheZhou opened this issue Nov 8, 2024 · 4 comments
Open
3 tasks done
Labels
Bug IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue

Comments

@DavidNaizheZhou
Copy link

DavidNaizheZhou commented Nov 8, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = {
    "level1": {
        "rows": [
            {"col1": 1, "col2": 2},
        ]
    },
    "meta1": {
        "meta_sub1": 1,
    }, 
}

df = pd.json_normalize(data, record_path=["level1", "rows"], meta=["meta1"])
print(df)

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)

Issue Description

Description of the Issue

This reproducible example demonstrates the challenges and potential pitfalls when using pandas.json_normalize() to extract and flatten hierarchical data structures with nested metadata:

Data Structure

The data dictionary is multi-layered, with nested dictionaries and a list of dictionaries (rows) under level1. Additionally, meta1 is structured as a dictionary containing subfields.

Successful Normalization

The first call to pd.json_normalize() extracts the data from rows under level1 and includes meta1as a top-level metadata field. This works as intended becausemeta1 is accessed directly as a single key.

Output:

   col1  col2             meta1
0     1     2  {'meta_sub1': 1}

KeyError with Nested Meta Fields

The second pd.json_normalize() call attempts to extract subfields from meta1 using a nested path (meta=[["meta1", "meta_sub1"]]). This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

Expected Behavior

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)
   col1  col2  meta1
0     1     2     1

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.cp1252

pandas : 2.2.3
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : 8.17.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.3
lxml.etree : 5.2.2
matplotlib : 3.8.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 15.0.0
pyreadstat : None
pytest : 8.1.1
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@DavidNaizheZhou DavidNaizheZhou added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 8, 2024
@rhshadrach
Copy link
Member

Thanks for the report. Is this the same as #59233?

@rhshadrach rhshadrach added IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 8, 2024
@DavidNaizheZhou
Copy link
Author

Yes, it is. Missed that, sorry about that.

@DavidNaizheZhou
Copy link
Author

DavidNaizheZhou commented Nov 8, 2024

Found a solution that resolved it for "me". The change involves modifying the _recursive_extract function in _normalize.py. Below is the specific change I made:

    def _recursive_extract(data, path, seen_meta, level: int = 0, root_obj=None) -> None:
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                if root_obj is None:
                    root_obj = obj
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(root_obj, val)

                _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1, root_obj=root_obj)
        else:
            for obj in data:
                recs = _pull_records(obj, path[0])
                recs = [nested_to_record(r, sep=sep, max_level=max_level) if isinstance(r, dict) else r for r in recs]

                # For repeating the metadata later
                lengths.append(len(recs))
                for val, key in zip(_meta, meta_keys):
                    if level + 1 > len(val):
                        meta_val = seen_meta[key]
                    else:
                        meta_val = _pull_field(root_obj, val)
                    meta_vals[key].append(meta_val)
                records.extend(recs)

@rhshadrach rhshadrach removed the Needs Info Clarification about behavior needed to assess issue label Nov 10, 2024
@rhshadrach
Copy link
Member

This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

It does support nested lists, however it seems to make the assumption that all except the last element agree with record_path. E.g.

data = {
    "level1": [
        {
            "rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
            "meta1": 1,
        },
        {
            "rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}],
            "meta1": 2,
        },
    ],
}
df = pd.json_normalize(data, record_path=["level1", "rows"], meta=[["level1", "meta1"]])
print(df)
#    col1  col2 level1.meta1
# 0     1     2            1
# 1     3     4            1
# 2     5     6            2
# 3     7     8            2

While I'm not very familiar with this functionality, I believe the intention is to have metadata that sits alongside each collection of records.

If the metadata does not sit alongside each collection of records, then I think the result would necessarily be a constant column. Is that your desire @DavidNaizheZhou?

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

2 participants