NANs are not properly handled with newer pandas version #68

Open
rsanchezgarc opened this issue Oct 18, 2024 · 3 comments

@rsanchezgarc

rsanchezgarc commented Oct 18, 2024

  • starfile version: 0.5.8 and master branch
  • Python version: 3.11
  • Operating System: Ubuntu 22.04

Description

I updated my environment to newer starfile and pandas versions, and the behaviour of starfile.read() with respect to NaNs has changed. Columns that contain NaNs are now parsed as object, whereas before they were parsed as floats (starfile==0.4.2, pandas==2.0, numpy==1.26.3).

Minimal example to reproduce it

import tempfile

import starfile
import pandas as pd
import numpy as np

# Small table whose float column ends with a NaN
parts = pd.DataFrame({"property1": np.arange(10), "property2": np.random.rand(10)})
parts.loc[parts.index[-1], "property2"] = np.nan
print(parts)
data = {
    "particles": parts
}

with tempfile.NamedTemporaryFile(mode="w") as tmpfile:
    starfile.write(data, tmpfile.name)
    tmpfile.seek(0)
    data = starfile.read(tmpfile.name)
    print(data["property2"].dtype)  # This should be a float, not object

Potential fixes?

Perhaps it is as easy as changing this line:

keep_default_na=False,
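
To illustrate what that option does, here is a minimal sketch using plain pandas rather than starfile's actual parsing code. The whitespace-separated text and the helper name `read_block` are assumptions made up for this example; only `pd.read_csv` and its `keep_default_na` parameter come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical whitespace-separated block, similar in spirit to a STAR loop body
text = "0 0.25\n1 0.50\n2 nan\n"


def read_block(keep_default_na: bool) -> pd.DataFrame:
    """Parse the sample text, toggling pandas' built-in NaN string list."""
    return pd.read_csv(
        io.StringIO(text),
        sep=r"\s+",
        header=None,
        names=["property1", "property2"],
        keep_default_na=keep_default_na,
    )


with_defaults = read_block(True)
without_defaults = read_block(False)

# Compare how the column containing "nan" is typed in each case
print(with_defaults["property2"].dtype)
print(without_defaults["property2"].dtype)
```

With the default NaN strings enabled, the "nan" token should be read as a missing float; with them disabled and nothing supplied in their place, the column is expected to fall back to a non-numeric dtype, which matches the behaviour reported above.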

@jojoelfe
Collaborator

Hi @rsanchezgarc,

thanks a lot for the bug report! I have reproduced this with pandas version 2.2.0. Do you happen to know a version of pandas where this should work?

Changing the line you suggested fails a bunch of tests, so this might be a bit more complicated.

Best,

Johannes

@jojoelfe
Collaborator

Ok, I have a temporary fix in #69, could you try that out? I also copied and pasted your example code as a unit test, is that ok?

The main problem was that pandas' default list of NaN strings includes the empty string, which we/I had previously decided should be parsed as a string/object. Now we disable the defaults and manually add back the strings that should be considered NaN.
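
The idea can be sketched with plain pandas as follows. This is not the code from #69: the `NA_STRINGS` list and the comma-separated sample are assumptions chosen for illustration, and the actual list used in the fix may differ.

```python
import io

import pandas as pd

# Illustrative subset of NaN spellings; deliberately excludes the empty string
NA_STRINGS = ["nan", "NaN", "-nan", "-NaN", "NA", "N/A", "NULL", "null"]

# Simplified comma-separated stand-in: the second row has an empty name and a NaN value
text = "name,value\nA,1.5\n,nan\nC,2.5\n"

df = pd.read_csv(
    io.StringIO(text),
    keep_default_na=False,  # drop pandas' defaults, which include ""
    na_values=NA_STRINGS,   # add back only the spellings we want treated as NaN
)

print(df["name"].tolist())  # the empty field should survive as a string
print(df["value"].dtype)    # the numeric column should stay a float dtype
```

The design choice is to start from an empty NaN list and opt in explicitly, so empty strings keep their meaning while numeric columns with missing values remain floats.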

@rsanchezgarc
Author

That works!
Feel free to use my code as a test.

Just for completeness, I would add a few more possible NaN strings. Here are the ones pandas uses by default:

"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null".
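
As a rough check of that suggestion, the sketch below takes the default spellings listed above, drops the empty string, and confirms that each remaining token is read as NaN when passed through `na_values`. The names `PANDAS_DEFAULT_NA`, `NA_STRINGS`, and `parses_as_nan` are hypothetical, and the one-column CSV is only a stand-in for a real STAR file.

```python
import io

import pandas as pd

# Default NaN spellings as listed in this thread (copied by hand, not read from pandas)
PANDAS_DEFAULT_NA = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None",
    "n/a", "nan", "null",
]

# Keep everything except the empty string, which should remain a plain string
NA_STRINGS = [token for token in PANDAS_DEFAULT_NA if token != ""]


def parses_as_nan(token: str) -> bool:
    """Return True if `token` becomes a missing value in a float column."""
    text = f"value\n1.0\n{token}\n"
    column = pd.read_csv(
        io.StringIO(text),
        keep_default_na=False,
        na_values=NA_STRINGS,
    )["value"]
    return bool(pd.api.types.is_float_dtype(column) and column.isna().sum() == 1)


results = {token: parses_as_nan(token) for token in NA_STRINGS}
print(results)
```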
