StimulusSet data types may clash with DataAssembly data types after s3 upload #905

benlonnqvist · 2024-06-07T12:16:04Z

When uploading StimulusSets, the stimulus_id has to be coded as a string, as otherwise zip packaging of the StimulusSet fails.

If the StimulusSet['stimulus_id'] field holds strings that contain only digit characters, the string data type is not preserved once the StimulusSet is saved as a .csv and loaded back from s3: any value that contains no non-digit characters is loaded as a different data type than the one that was saved.
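A minimal sketch of the round trip, assuming the StimulusSet behaves like a plain pandas DataFrame that is written with `to_csv` and read back with `read_csv` using default settings (the actual brain-score packaging and s3 code is not shown in this issue, and the column values here are made up for illustration):

```python
import io

import pandas as pd


def csv_round_trip(frame: pd.DataFrame) -> pd.DataFrame:
    """Write a frame to an in-memory CSV and read it back with default settings."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


# Hypothetical stand-in for a StimulusSet: stimulus_id is saved as strings.
original = pd.DataFrame(
    {
        "stimulus_id": ["100", "35", "7"],
        "image_label": ["cat", "dog", "bird"],
    }
)
reloaded = csv_round_trip(original)

# The digit-only strings come back as integers; the labels stay strings.
ids_are_integers = pd.api.types.is_integer_dtype(reloaded["stimulus_id"])
labels_are_strings = all(isinstance(v, str) for v in reloaded["image_label"].tolist())
print(ids_are_integers, labels_are_strings)
```

CSV carries no type information, so the reader has to guess the type from the text of each column.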

This is in contrast to DataAssemblies, which do respect data types when they are loaded.

When brain-score merges the StimulusSet into the DataAssembly along the stimulus_id dim while loading the DataAssembly, interesting errors pop up. This is because the stimulus_id needs to be a string in the StimulusSet in order for the StimulusSet to be uploaded, but it also needs to be a csv-inferrable type (rather than a string) in the DataAssembly in order for the merge of the two to succeed when loading the DataAssembly.
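The kind of clash described here can be reproduced with a plain pandas merge. This is only an analogue: the issue does not show how brain-score performs the merge internally, so the frames, column names, and the use of `DataFrame.merge` below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical StimulusSet as it looks after the CSV round trip: integer ids.
loaded_stimulus_set = pd.DataFrame(
    {"stimulus_id": [100, 35, 7], "image_label": ["cat", "dog", "bird"]}
)

# Hypothetical DataAssembly metadata that kept the ids as strings.
assembly_meta = pd.DataFrame(
    {"stimulus_id": ["100", "35", "7"], "response": [0.1, 0.2, 0.3]}
)

# pandas refuses to merge an integer key column with a string key column.
try:
    loaded_stimulus_set.merge(assembly_meta, on="stimulus_id")
    merge_error = None
except ValueError as error:
    merge_error = error
print("merge failed:", merge_error is not None)

# Once both sides use the same (csv-inferrable) type, the merge goes through.
assembly_meta_cast = assembly_meta.astype({"stimulus_id": "int64"})
merged = loaded_stimulus_set.merge(assembly_meta_cast, on="stimulus_id")
print(len(merged))
```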

This issue also affects fields other than stimulus_id: string columns are saved as .csv and the data types of their values are then inferred on a value-by-value basis. If a column of the StimulusSet contains some values that could be interpreted as integers and others that can only be strings (e.g., 'condition' = {'100', '35', 'contours', 'RGB'}), these are inferred differently, resulting in a mix of strings and integers in the StimulusSet after loading from s3. This causes errors in any tests that check the integrity of the data.

Since it does not seem to be possible to fix this as above by enforcing data types on the DataAssembly (DataArrays don't seem to allow mixed types), the two most reasonable workarounds seem to be either to code such values explicitly as strings (e.g., 'condition' = {'100a', '35a', 'contours', 'RGB'} instead of 'condition' = {'100', '35', 'contours', 'RGB'}), or to enforce the data types after loading.
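A rough sketch of the second workaround, assuming the .csv can be read with pandas directly (brain-score's own loader may not expose these options, and the column names are taken from the examples above). Types can be enforced either at read time with the `dtype` argument of `pd.read_csv`, or afterwards with `astype`:

```python
import io

import pandas as pd

# Hypothetical CSV content as it might look after the StimulusSet was saved.
csv_text = "stimulus_id,condition\n100,100\n35,35\n7,contours\n8,RGB\n"

# Option 1: tell the reader which columns are strings, so nothing is inferred.
enforced_on_read = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"stimulus_id": str, "condition": str},
)

# Option 2: read with defaults, then cast the affected columns back to strings.
enforced_after = pd.read_csv(io.StringIO(csv_text))
enforced_after["stimulus_id"] = enforced_after["stimulus_id"].astype(str)
enforced_after["condition"] = enforced_after["condition"].astype(str)

print(enforced_on_read["stimulus_id"].tolist())
print(enforced_after["condition"].tolist())
```

Enforcing the type at read time is the safer of the two: a cast after loading only converts whatever was inferred, so any formatting that was lost during inference (for example, leading zeros) is not restored.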

I would suggest either saving the StimulusSet in a data format that respects data types, e.g. xarray netcdf4 instead of .csv, or adding more descriptive error messages when the aforementioned errors occur.
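As one possible shape for a more descriptive error, here is a sketch of a check that could run after loading. The function name `assert_string_columns` and the message wording are hypothetical and not part of brain-score; the check treats the StimulusSet as a pandas DataFrame and flags every value that is not a `str` (missing values would be flagged too).

```python
import pandas as pd


def assert_string_columns(frame: pd.DataFrame, columns: list) -> None:
    """Raise a TypeError naming every column that holds non-string values."""
    problems = []
    for column in columns:
        bad_values = [v for v in frame[column].tolist() if not isinstance(v, str)]
        if bad_values:
            found = sorted({type(v).__name__ for v in bad_values})
            problems.append(
                f"column '{column}': expected str, found {found} "
                f"(e.g. {bad_values[0]!r})"
            )
    if problems:
        raise TypeError(
            "StimulusSet columns changed type after loading: " + "; ".join(problems)
        )


# Hypothetical loaded StimulusSet with a mix of integers and strings.
loaded = pd.DataFrame(
    {
        "stimulus_id": ["a1", "a2", "a3", "a4"],
        "condition": [100, 35, "contours", "RGB"],
    }
)

try:
    assert_string_columns(loaded, ["stimulus_id", "condition"])
    message = None
except TypeError as error:
    message = str(error)
print(message)
```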

mike-ferguson · 2024-06-07T15:05:20Z

Thanks @benlonnqvist for opening an issue - we will look into this and get back ASAP!

samwinebrake self-assigned this Jun 7, 2024

deirdre-k assigned deirdre-k and unassigned samwinebrake Jun 17, 2024
