Add returnUnmatched to spec options #96

Merged · 6 commits into main · Oct 7, 2024
Conversation

@pipliggins (Collaborator)

Adds a setting to return values when they aren't converted or mapped (or, if an apply function is used, warning text that includes the values).
@jsbrittain
There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but the parquet interface breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or opening an issue, unless you know how to fix it now.

I tried the parser on D1 (you get better error reporting by calling the parser directly; add the following code for quick testing):

import pandas as pd

if __name__ == "__main__":
    df = pd.read_excel("data.xlsx")
    print(parse(df))

The error is (some filenames removed):

[---] parsing tmpfgbmk_ow.csv: 4217it [00:00, 14305.54it/s]
Traceback (most recent call last):
  File "---", line 14, in <module>
    print(parse(df))
          ^^^^^^^^^
  File "---", line 10, in parse
    return parse_adtl(df, spec_file, ["linelist"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/InsightBoard/src/InsightBoard/parsers/__init__.py", line 40, in parse_adtl
    parsed.write_parquet(table_name, parsed_temp_file.name)
  File "---/python3.12/site-packages/adtl/__init__.py", line 961, in write_parquet
    df = pl.DataFrame(data, infer_schema_length=len(data))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/dataframe/frame.py", line 374, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 460, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 712, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity

@jsbrittain commented Oct 3, 2024

@pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non'; i.e. this is what appears in the data table.

@pipliggins (Collaborator, Author)

> @pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non'; i.e. this is what appears in the data table.

@jsbrittain pushed a commit that changes this behaviour
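Roughly speaking, the behaviour change can be sketched like this (a hypothetical `map_value` helper for illustration only; adtl's actual mapping code is more involved):

```python
def map_value(value, mapping, return_unmatched=False):
    """Sketch of a values-mapping lookup.

    With return_unmatched the raw value passes through unchanged,
    instead of a "No matches found" message landing in the data table.
    """
    if value in mapping:
        return mapping[value]
    if return_unmatched:
        return value
    return f"No matches found for: '{value}'"
```

With `return_unmatched=True`, looking up `'Non'` in a mapping that lacks it now yields `'Non'` itself rather than the error string.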

@pipliggins (Collaborator, Author)

> There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but the parquet interface breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or opening an issue, unless you know how to fix it now.
>
> polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

Paraphrasing from Teams: this happens because values returned unmapped are often the 'wrong' type according to the schema. E.g. 'eight' cannot be converted to a float, so when returned unconverted a string ends up in the supposedly float-typed age_years column. Parquet requires each column to have a single type, hence the error. We will therefore use the original CSV format to pass data from adtl into InsightBoard; this will require some reworking of the validation against a JSON schema outside of ADTL (and therefore outside the scope of this PR/repo).
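As a minimal stdlib-only illustration of why CSV sidesteps the problem (the `age_years` column name is taken from the example above; the rows are made up): CSV is untyped text, so a column mixing numbers and unconverted strings round-trips without complaint, whereas Parquet would need a single dtype for that column.

```python
import csv
import io

# returnUnmatched can leave the raw string "eight" in a column the
# schema types as float; in CSV everything is a string, so this is fine.
rows = [{"age_years": "30.0"}, {"age_years": "eight"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["age_years"])
writer.writeheader()
writer.writerows(rows)

# Reading it back preserves the mixed column untouched.
round_tripped = list(csv.DictReader(io.StringIO(buf.getvalue())))
```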

@jsbrittain left a comment

Looks good. I haven't tested it extensively, but the 'no matches found' error appears to have been resolved in the data table, and we're aware of the parquet issue now. (Perhaps add a note that exporting parquet with returnUnmatched set to true is likely to fail?)

@pipliggins (Collaborator, Author)

> Looks good. I haven't tested it extensively, but the 'no matches found' error appears to have been resolved in the data table, and we're aware of the parquet issue now. (Perhaps add a note that exporting parquet with returnUnmatched set to true is likely to fail?)

Thanks! I've added a note about it to the specification file, but I'm planning to add a check to the CLI for parquet + returnUnmatched before merging.

@pipliggins merged commit 5b8498c into main on Oct 7, 2024
7 checks passed
2 participants