NANs are not properly handled with newer pandas version #68

rsanchezgarc · 2024-10-18T00:09:59Z

starfile version: 0.5.8 and master branch
Python version: 3.11
Operating System: Ubuntu 22.04

Description

I updated my environment to newer starfile and pandas versions, and the behaviour of starfile.read() with respect to NaNs has changed. Columns containing NaNs are now parsed as object, whereas before they were parsed as floats (starfile==0.4.2, pandas==2.0, numpy==1.26.3).

Minimal example to reproduce it

import tempfile

import starfile
import pandas as pd
import numpy as np

parts = pd.DataFrame({"property1": np.arange(10), "property2": np.random.rand(10)})
parts.loc[parts.index[-1], "property2"] = np.nan  # set the last value to NaN
print(parts)
data = {
    "particles":parts
}

with tempfile.NamedTemporaryFile(mode="w") as tmpfile:
    starfile.write(data, tmpfile.name)
    tmpfile.seek(0)
    data = starfile.read(tmpfile.name)
    print(data["property2"].dtype) #This should be a float, not object

Potential fixes?

Perhaps it is as easy as changing this line:

starfile/src/starfile/parser.py

Line 130 in a2e9927

keep_default_na=False,
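
For context, here is a minimal sketch of what that option does in plain pandas, independent of starfile. The whitespace-separated text and the "nan" token are assumptions chosen for illustration; they are not taken from the starfile parser or from a real STAR file.

```python
import io

import pandas as pd

# Hypothetical two-column table; "nan" stands in for a missing float.
text = "a b\n1 0.5\n2 nan\n"

# pandas defaults: "nan" is in the default NA list, so column b is float.
default = pd.read_csv(io.StringIO(text), sep=r"\s+")

# keep_default_na=False with no na_values: "nan" is kept as a plain string,
# so column b can no longer be parsed as a float column.
no_default = pd.read_csv(io.StringIO(text), sep=r"\s+", keep_default_na=False)

print(default["b"].dtype)
print(no_default["b"].dtype)
```

With the defaults the second column is read as float64; with keep_default_na=False it falls back to a string-like dtype (object in pandas 2.x), which matches the behaviour described above.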

jojoelfe · 2024-10-18T00:28:50Z

Hi @rsanchezgarc ,

thanks a lot for the bug report! I have reproduced this with pandas version 2.2.0. Do you happen to know a version of pandas where this should work?

Changing the line you suggested fails a bunch of tests, so this might be a bit more complicated.

Best,

Johannes

jojoelfe · 2024-10-18T00:41:33Z

Ok, I have a temporary fix in #69, could you try that out? I also copied and pasted your example code as a unit test, is that ok?

The main problem was that the default list of possible NaN values includes an empty string, which we/I had previously decided should be parsed as a string/object. Now we manually add back the strings that should be considered NaN.
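
The approach described here can be sketched in plain pandas as follows. This is not the code from #69: the NA_STRINGS list is a hypothetical subset, and the comma-separated input is an assumption made to keep the example short.

```python
import io

import pandas as pd

# Hypothetical subset of strings to treat as NaN; the empty string is
# deliberately left out so that empty fields stay strings.
NA_STRINGS = ["nan", "NaN", "<NA>"]

# Column x has a missing float, column label has an empty string.
text = "x,label\n0.5,a\nnan,\n"

df = pd.read_csv(
    io.StringIO(text),
    keep_default_na=False,  # drop the pandas default NA list (includes "")
    na_values=NA_STRINGS,   # add back only the strings we want as NaN
)

print(df["x"].dtype)         # numeric column with a NaN
print(df["label"].tolist())  # the empty string is preserved
```

The design choice is that disabling the defaults and then listing NaN strings explicitly keeps float columns numeric without turning empty strings into missing values.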

rsanchezgarc · 2024-10-18T11:03:10Z

That works!
Feel free to use my code as test.

Just for completeness, I would add a few more possible NaN strings. Here are the pandas defaults:

"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null".

jojoelfe mentioned this issue Oct 18, 2024

Fix NaN parsing #69

Merged
