Add returnUnmatched to spec options #96
Conversation
(or warning text including the values, if an apply function is used)
There is an issue with this mode and polars/parquet. If I amend `parse_adtl()` to read/write using csv it works fine, but using the parquet interface it breaks as follows (I would consider this non-critical in the short term as we can use csv, but worth investigating further / raising an issue / etc., unless you know how to fix it now). I tried the parser on D1; you get better error reporting by calling the parser directly after adding the following code for quick testing:

```python
import pandas as pd  # needed if not already imported at the top of the parser module

if __name__ == "__main__":
    df = pd.read_excel("data.xlsx")
    print(parse(df))
```

The error is (some filenames removed):

```
[---] parsing tmpfgbmk_ow.csv: 4217it [00:00, 14305.54it/s]
Traceback (most recent call last):
File "---", line 14, in <module>
print(parse(df))
^^^^^^^^^
File "---", line 10, in parse
return parse_adtl(df, spec_file, ["linelist"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "---/InsightBoard/src/InsightBoard/parsers/__init__.py", line 40, in parse_adtl
parsed.write_parquet(table_name, parsed_temp_file.name)
File "---/python3.12/site-packages/adtl/__init__.py", line 961, in write_parquet
df = pl.DataFrame(data, infer_schema_length=len(data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "---/python3.12/site-packages/polars/dataframe/frame.py", line 374, in __init__
self._df = sequence_to_pydf(
^^^^^^^^^^^^^^^^^
File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 460, in sequence_to_pydf
return _sequence_to_pydf_dispatcher(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "---/python3.12/functools.py", line 907, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 712, in _sequence_of_dict_to_pydf
pydf = PyDataFrame.from_dicts(
^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`
it might also be that a value overflows the data-type's capacity
```
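For reference, the quick-test snippet above calls a small `parse()` wrapper around `parse_adtl`; a minimal sketch of what the traceback frames point at (the import path and spec filename here are assumptions for illustration, not copied from the repo):

```python
import pandas as pd
from InsightBoard.parsers import parse_adtl  # module shown in the traceback

spec_file = "linelist-spec.toml"  # placeholder; the real path is elided above

def parse(df: pd.DataFrame):
    # mirrors the call shown in the traceback
    return parse_adtl(df, spec_file, ["linelist"])
```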
@pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non'; i.e. this is what appears in the data table.
@jsbrittain pushed a commit that changes this behaviour
Paraphrasing from Teams: this happens because when you return unmapped values, they are often the 'wrong' type according to the schema. E.g. 'eight' cannot be converted to a float value, so when returned unconverted, a string ends up in the supposedly float-typed column.
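A constructed example of the same failure shape (the column name and values are made up, not taken from D1 or the adtl code): when one row carries an unmatched value whose type differs from the rest of the column, polars cannot build a single schema for the frame.

```python
import polars as pl

# Rows of the shape adtl can emit with returnUnmatched enabled: most rows hold
# a plain string, but one unmatched value keeps its original list form instead
# of being coerced to the expected type.
rows = [
    {"topping": "tomato"},
    {"topping": "cheese"},
    {"topping": ["crust"]},  # unmatched value returned as-is
]

# Mirrors adtl's write_parquet construction; even with the inference window
# covering every row, the mismatched row cannot be appended and polars raises
# a ComputeError, as in the traceback above (the exact message can vary by
# polars version).
pl.DataFrame(rows, infer_schema_length=len(rows))
```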
Looks good. I haven't extensively tested it, but the 'no matches found' error appears to have been resolved in the datatable, and we're aware of the parquet issue now (perhaps add a note that trying to export parquet with `returnUnmatched` set true is likely to fail?)
Thanks! I've added a note to the specification file about it, but I'm planning to add a check to the CLI for parquet + `returnUnmatched` before merging.
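One possible shape for that check (purely illustrative; the function name, spec layout, and message are hypothetical, not the code that was merged):

```python
def check_parquet_compatibility(spec: dict, output_format: str) -> None:
    """Hypothetical guard: refuse parquet output when returnUnmatched is set,
    since unmatched values can break polars' schema inference (see above)."""
    # Assumes the option lives under the spec's top-level "adtl" section.
    return_unmatched = spec.get("adtl", {}).get("returnUnmatched", False)
    if output_format == "parquet" and return_unmatched:
        raise SystemExit(
            "returnUnmatched is incompatible with parquet output; "
            "use csv output instead"
        )
```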
Add setting to return values when they aren't converted or mapped.