parquet-cli reports nested columns as null #3095

Open
acdha opened this issue Dec 3, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

Using Parquet CLI 1.15.0 installed via Homebrew on macOS, I noticed surprising behaviour in how parquet-cli displays nested columns.

parquet schema catalog.parquet returns a schema showing the nested types (I've trimmed the field list slightly):

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "item_id",
    "type" : "string"
  }, {
    "name" : "title",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "language",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "subjects",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "list",
        "fields" : [ {
          "name" : "element",
          "type" : "string"
        } ]
      }
    } ],
    "default" : null
  }, {
    "name" : "authors",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "list",
        "namespace" : "list2",
        "fields" : [ {
          "name" : "element",
          "type" : "string"
        } ]
      }
    } ],
    "default" : null
  } ]
}

parquet dictionary -c subjects.list.element catalog.parquet will return the expected values for those fields as well:

Row group 0 dictionary for "subjects.list.element":
     0: "Bestsellers"
     1: "Biography"
     2: "Fantasy Fiction"
     3: "Music Theory"
     4: "Disability"
     5: "Family"
     6: "Young Adult"

However, when I use cat or head to display the file contents, those fields are shown as null:

{"bmc_id": "id1", "title": null, "language": "en", "subjects": null, "authors": null}
{"bmc_id": "id2", "title": null, "language": "en", "subjects": null, "authors": null}
{"bmc_id": "id3", "title": null, "language": "en", "subjects": null, "authors": null}

Other tools like PyArrow or Pandas do display those values as arrays. I filed this as a bug because the output looks valid, and since those fields are nullable, there's no way to tell from cat or head alone whether a null is the real value or a display error.
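
For anyone trying to triage this on their own files, here is a rough sketch of a check I would suggest: it reads JSON-lines text shaped like the cat/head output above and lists the columns that are null in every record. The function name `all_null_columns` and the sample rows are made up for illustration, and this is only a heuristic. A column it flags may be legitimately null, so the result still needs to be compared against something like the `parquet dictionary` output for that column.

```python
import json


def all_null_columns(json_lines):
    """Return, sorted, the columns whose value is null in every record."""
    seen = set()      # every column name encountered
    non_null = set()  # columns with at least one non-null value
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for name, value in record.items():
            seen.add(name)
            if value is not None:
                non_null.add(name)
    return sorted(seen - non_null)


# Hypothetical sample rows, shaped like the cat output in this report.
sample = [
    '{"bmc_id": "id1", "title": null, "language": "en", "subjects": null, "authors": null}',
    '{"bmc_id": "id2", "title": null, "language": "en", "subjects": null, "authors": null}',
    '{"bmc_id": "id3", "title": null, "language": "en", "subjects": null, "authors": null}',
]

suspects = all_null_columns(sample)
print(suspects)  # columns worth cross-checking against another reader
```

Any nested column that shows up in this list while `parquet dictionary` still returns values for it is a likely instance of the behaviour described here.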

Component(s)

No response

@acdha acdha added the Type: bug label Dec 3, 2024