Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTHofmann2023 additions #305

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from
Draft

BTHofmann2023 additions #305

wants to merge 15 commits into from

Conversation

steinnymir
Copy link
Member

@steinnymir steinnymir commented Nov 29, 2023

Bug fixes and utilities added during the beamtime at FLASH in nov2023.

Additions:

  • A pretty print of the processor, including some statistical information about the data loaded. Still requires testing for non-flash loaders.
  • filter the dataframe before binning to only look at the required columns. This avoids errors when there are issues with values in other columns (i.e. if a column is entirely nans)
  • added the option of sequentially loading data in the flash loader. This helps to track errors otherwise lost in the parallel processes.

Bugs:

  • added an error catching when trying to bin a column which has only nans. The error handled by numba was otherwise unclear.

Comment on lines +296 to +297
# filter dataframe to use only the columns needed for the binning
df = df[axes]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this before in your commits. I don't think this is helpful, as it introduces another graph layer. Did you try whether this really improve computation time? In my tests, it slowed things down. Or why did you introduce this in the first place?

@rettigl
Copy link
Member

rettigl commented Dec 1, 2023

  • filter the dataframe before binning to only look at the required columns. This avoids errors when there are issues with values in other columns (i.e. if a column is entirely nans)

I don't quite understand this issue, and as I have commented, it makes computation of the dataframe, and thus binning, slower. Can you elaborate on this problem? I tried to reproduce by adding a NaN column to the df, which worked without problems...

@steinnymir
Copy link
Member Author

steinnymir commented Dec 1, 2023

it makes computation of the dataframe, and thus binning, slower.

did you test this? does it really make it slower?

I cannot recall exactly what the problem was, but will get back to it during analysis certainly. The problem was with a column which had dtypes not accepted by numba. I don't know why numba needed to test all dtypes of all columns, but this was a quick fix which solved the problem.

@rettigl
Copy link
Member

rettigl commented Dec 1, 2023

it makes computation of the dataframe, and thus binning, slower.

did you test this? does it really make it slower?

I cannot recall exactly what the problem was, but will get back to it during analysis certainly. The problem was with a column which had dtypes not accepted by numba. I don't know why numba needed to test all dtypes of all columns, but this was a quick fix which solved the problem.

For mpes, I get something like this pretty consistently:
grafik
And also binning is notably slower. For Flash dataframes, the difference is smaller, but also there.
grafik
Note the added graph layer:
grafik

@rettigl
Copy link
Member

rettigl commented Feb 5, 2024

Tests for the repr methods are failing. This is due to various reasons, the first being that you don't check if the dataframe is loaded, another that you don't check if the columns you request exist (e.g. for non-Flash-loaders).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract statistical data from FLASH parquets
2 participants