Add returnUnmatched to spec options #96

Merged · 6 commits into main · Oct 7, 2024
Conversation

@pipliggins (Collaborator)

Adds a setting to return values when they aren't converted or mapped (or, if an apply function is used, warning text that includes the values).
@jsbrittain
There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but the parquet interface breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or opening an issue, unless you know how to fix it now.

I tried the parser on D1 (you get better error reporting by calling the parser directly; add the following code for quick testing):

import pandas as pd

if __name__ == "__main__":
    df = pd.read_excel("data.xlsx")
    print(parse(df))

The error is (some filenames removed):

[---] parsing tmpfgbmk_ow.csv: 4217it [00:00, 14305.54it/s]
Traceback (most recent call last):
  File "---", line 14, in <module>
    print(parse(df))
          ^^^^^^^^^
  File "---", line 10, in parse
    return parse_adtl(df, spec_file, ["linelist"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/InsightBoard/src/InsightBoard/parsers/__init__.py", line 40, in parse_adtl
    parsed.write_parquet(table_name, parsed_temp_file.name)
  File "---/python3.12/site-packages/adtl/__init__.py", line 961, in write_parquet
    df = pl.DataFrame(data, infer_schema_length=len(data))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/dataframe/frame.py", line 374, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 460, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 712, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity

@jsbrittain commented Oct 3, 2024

@pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non'; i.e. this is what appears in the data table.

@pipliggins (Collaborator, Author)

> @pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non'; i.e. this is what appears in the data table.

@jsbrittain pushed a commit that changes this behaviour
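Roughly speaking, the behaviour change can be sketched like this (a hypothetical `map_value` helper for illustration only; adtl's actual mapping code is more involved):

```python
def map_value(value, mapping, return_unmatched=False):
    """Sketch of a values-mapping lookup.

    With return_unmatched the raw value passes through unchanged,
    instead of a "No matches found" message landing in the data table.
    """
    if value in mapping:
        return mapping[value]
    if return_unmatched:
        return value
    return f"No matches found for: '{value}'"
```

With `return_unmatched=True`, looking up `'Non'` in a mapping that lacks it now yields `'Non'` itself rather than the error string.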

@pipliggins (Collaborator, Author)

> There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but the parquet interface breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or opening an issue, unless you know how to fix it now.
>
> polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

Paraphrasing from Teams: this happens because values returned unmapped are often the 'wrong' type according to the schema. E.g. 'eight' cannot be converted to a float, so when returned unconverted a string ends up in the supposedly float-typed age_years column. Parquet requires each column to have a single type, hence the error. We will therefore use the original CSV format to pass data from adtl into InsightBoard; this will require some reworking of the validation against a JSON schema outside of ADTL (and therefore outside the scope of this PR/repo).
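As a minimal stdlib-only illustration of why CSV sidesteps the problem (the `age_years` column name is taken from the example above; the rows are made up): CSV is untyped text, so a column mixing numbers and unconverted strings round-trips without complaint, whereas Parquet would need a single dtype for that column.

```python
import csv
import io

# returnUnmatched can leave the raw string "eight" in a column the
# schema types as float; in CSV everything is a string, so this is fine.
rows = [{"age_years": "30.0"}, {"age_years": "eight"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["age_years"])
writer.writeheader()
writer.writerows(rows)

# Reading it back preserves the mixed column untouched.
round_tripped = list(csv.DictReader(io.StringIO(buf.getvalue())))
```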

@jsbrittain left a comment

Looks good. I haven't tested it extensively, but the 'no matches found' error appears to have been resolved in the data table, and we're aware of the parquet issue now. (Perhaps add a note that exporting parquet with returnUnmatched set to true is likely to fail?)

@pipliggins (Collaborator, Author)

> Looks good. I haven't tested it extensively, but the 'no matches found' error appears to have been resolved in the data table, and we're aware of the parquet issue now. (Perhaps add a note that exporting parquet with returnUnmatched set to true is likely to fail?)

Thanks! I've added a note about it to the specification file, but I'm planning to add a check to the CLI for parquet + returnUnmatched before merging.

@pipliggins merged commit 5b8498c into main on Oct 7, 2024
7 checks passed
2 participants