StimulusSet data types may clash with DataAssembly data types after s3 upload #905

Open
benlonnqvist opened this issue Jun 7, 2024 · 1 comment
@benlonnqvist
Contributor

benlonnqvist commented Jun 7, 2024

When uploading StimulusSets, the stimulus_id has to be coded as a string, as otherwise zip packaging of the StimulusSet fails.

If the StimulusSet['stimulus_id'] field holds strings that contain, for example, only digit characters, the string data type is not preserved once the StimulusSet is saved as a .csv and loaded from s3. Any value that does not contain letters or other non-numeric characters is loaded as a different data type than the one that was saved.
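
A minimal sketch of the round trip described above, assuming the StimulusSet behaves like a plain pandas DataFrame and that loading goes through `pandas.read_csv` with default settings. The in-memory buffer stands in for the .csv file on s3, and the ids are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a StimulusSet: ids are digit-only strings.
saved = pd.DataFrame({"stimulus_id": ["101", "102", "103"]})

# Round-trip through CSV text, as a .csv upload and download would.
buffer = io.StringIO()
saved.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Before saving, every id is a Python str.
print({type(value).__name__ for value in saved["stimulus_id"]})
# After loading, the column has an integer dtype: the string type is gone.
print(loaded["stimulus_id"].dtype)
```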

This is unlike DataAssemblies, which do preserve data types when loaded.

When brain-score loads the DataAssembly and merges the StimulusSet into it along the stimulus_id dim, unexpected errors appear. The cause is a conflict between two requirements: the stimulus_id needs to be a string in the StimulusSet for the StimulusSet to be uploaded, but it also needs to be a csv-inferrable type (rather than a string) in the DataAssembly for the merge to succeed when the DataAssembly is loaded.
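
The mismatch can be illustrated without any Brain-Score code. The sketch below does not reproduce the actual merge logic; it only shows, with made-up ids, why string ids on one side and integer ids on the other never line up.

```python
# Hypothetical ids: the assembly keeps strings, while the StimulusSet
# reloaded from .csv has had its ids inferred as integers.
assembly_ids = ["101", "102", "103"]
stimulus_set_ids = [101, 102, 103]

# A str never compares equal to an int, so no id is shared.
shared = set(assembly_ids) & set(stimulus_set_ids)
print(shared)  # set()

# Casting both sides to a common type restores the match.
shared_as_str = set(assembly_ids) & {str(i) for i in stimulus_set_ids}
print(sorted(shared_as_str))
```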

This issue also affects fields other than stimulus_id: string values are saved as .csv, and the data type of each value is then inferred on a value-by-value basis. If a column of the StimulusSet mixes values that can be read as integers with values that can only be read as strings (e.g., 'condition' = {'100', '35', 'contours', 'RGB'}), they are inferred differently, leaving a mix of strings and integers in the StimulusSet after loading from s3. This breaks any test that checks the integrity of the data.
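
One way to surface this early is a check that reports which columns hold more than one Python type. This is only a sketch: `mixed_type_columns` is a hypothetical helper, not part of Brain-Score, and the frame below is built by hand to mimic the mixed result described above rather than loaded from s3.

```python
import pandas as pd


def mixed_type_columns(frame):
    """Return {column: set of type names} for columns holding more than one type."""
    report = {}
    for column in frame.columns:
        type_names = {type(value).__name__ for value in frame[column]}
        if len(type_names) > 1:
            report[column] = type_names
    return report


# Hand-built frame mimicking a StimulusSet after loading: 'condition'
# holds integers next to strings, 'stimulus_id' holds strings only.
loaded = pd.DataFrame({
    "stimulus_id": ["a", "b", "c", "d"],
    "condition": [100, 35, "contours", "RGB"],
})
print(mixed_type_columns(loaded))
```

Only `condition` is reported here, which could feed a more descriptive error message than a failed integrity test.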

This case does not seem to be fixable in the same way as stimulus_id, i.e. by enforcing the inferred data types on the DataAssembly, since DataArrays don't seem to allow mixed types. The two most reasonable workarounds seem to be either to code such values explicitly as strings (e.g., 'condition' = {'100a', '35a', 'contours', 'RGB'} instead of 'condition' = {'100', '35', 'contours', 'RGB'}) or to enforce the data types after loading.
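
The second workaround could look roughly like this, assuming the loaded StimulusSet can be treated as a pandas DataFrame. The CSV text and column names are illustrative. Declaring the dtype at read time is shown as well, although whether the Brain-Score loading code exposes that option is an assumption.

```python
import io

import pandas as pd

# Illustrative CSV content standing in for the file loaded from s3.
csv_text = "stimulus_id,condition\n101,100\n102,35\n103,contours\n104,RGB\n"

# Option 1: declare the dtypes at read time so nothing is inferred.
declared = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"stimulus_id": str, "condition": str},
)

# Option 2: let pandas infer, then cast the affected columns back to str.
inferred = pd.read_csv(io.StringIO(csv_text))
for column in ("stimulus_id", "condition"):
    inferred[column] = inferred[column].astype(str)

print(declared["stimulus_id"].tolist())
print(inferred["condition"].tolist())
```

Casting after loading cannot recover information that inference already discarded: an id saved as '007' would come back as the integer 7 and be cast to '7', so declaring the type at read time is the safer of the two where it is available.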

I would suggest saving the StimulusSet in a data format that preserves data types, e.g. xarray netcdf4 instead of .csv, or adding more descriptive error messages when the aforementioned errors occur.

@mike-ferguson
Member

Thanks @benlonnqvist for opening an issue - we will look into this and get back ASAP!

@samwinebrake samwinebrake self-assigned this Jun 7, 2024
@deirdre-k deirdre-k assigned deirdre-k and unassigned samwinebrake Jun 17, 2024