Inconsistent formatting information in SPSS metadata? #77

TheManInTheShack · 2020-09-11T23:01:36Z

Hi! I've been using this package for a good while now, and love it immensely - it is the centerpiece of several advanced applications that I have written for organizing and modifying SPSS files, and it's made a real difference to my organization and clients. I can't thank you enough for providing it.

This issue is something that I detected a while back, but have heretofore just been working around; I'm not sure how to classify it, and I'm hoping I can get some information about how the metadata is gathered.

Describe the issue

The basic problem is that there is a difference between these three things:
Here's what we see in variable view of SPSS:

Here's the original_variable_types:
{'ResponseId': 'A18', 'StartDate': 'A255', 'Duration__in_seconds_': 'F40.2', 'Finished': 'F1.0'}
...and here's the variable_storage_width:
{'ResponseId': 24, 'StartDate': 1024, 'Duration__in_seconds_': 8, 'Finished': 8}

Look at the two text variables: ResponseId reads the A18 'correctly', but the StartDate field is showing A255 when it should be showing A1024. If variable_storage_width were always the reliable source, I could use it to overwrite the format, but, looking again at ResponseId, doing that here would give me A24, which would be incorrect. Note that the numeric variables do report the correct values; I just left those in for visibility/comparison.

So I guess the question is, how does original_variable_types gather its data, and is there a way that I can predict which one of these items is the one that SPSS will expect, so that I can reliably hold the 'real' format? Or is this a bug, and the A255 is showing because it's hitting some kind of small-string limit? Thinking about it as I'm writing all of this out, I suppose 255 is a very suspicious number for that to insert...

To Reproduce

This isn't really a code issue, but here's the simple code I ran to produce those, nothing out of the ordinary:

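import pyreadstat

# Read the .sav file; meta carries the SPSS variable metadata
df, meta = pyreadstat.read_sav("test_width.sav")
print(meta.original_variable_types)  # format strings such as 'A18', 'F40.2'
print(meta.variable_storage_width)   # storage width in bytes per variable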

File example

test_width.zip

Expected behavior

I guess what I'm really after is how do I reliably recreate the 'actual' format as shown in the variable view, so that I can write syntax against it that refers to the correct size.

Setup Information:

How did you install pyreadstat? (pip)
Platform (windows, 64 bit)
Python Version (3.7)
Python Distribution (plain python)
Using Virtualenv or condaenv? (No)


evanmiller · 2020-09-12T00:56:05Z

Hi, your issue looks very much like WizardMac/ReadStat#210. Try updating to pyreadstat 1.0.2 and see if that fixes the issue. pip install --upgrade pyreadstat should do it.

TheManInTheShack · 2020-09-12T01:03:20Z

I agree it does look similar, but I'm afraid this is happening on 1.0.2.

evanmiller · 2020-09-12T01:51:18Z

Okay, it looks like each variable has several "widths" that need to be distinguished.

ReadStat's display_width corresponds to SPSS's "Columns"
ReadStat's format corresponds to SPSS's "Width" and "Decimals"

Then there is the storage_width, which SPSS does not display. (For strings, this should be the format width rounded up to the nearest multiple of 8 bytes.)
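As a rough illustration of that rounding rule, here is a minimal sketch checked against the two string variables from the report. The helper name `expected_storage_width` is hypothetical and not part of pyreadstat or ReadStat; it only encodes the "round up to a multiple of 8" relationship described in this comment.

```python
def expected_storage_width(format_width: int) -> int:
    """Round a string format width up to the next multiple of 8 bytes.

    Hypothetical helper: it mirrors the rule described in the comment,
    not an actual pyreadstat or ReadStat function.
    """
    return -(-format_width // 8) * 8  # ceiling division, then scale back up


# Widths taken from the report: ResponseId is A18, StartDate should be A1024
print(expected_storage_width(18))    # 24, matches variable_storage_width
print(expected_storage_width(1024))  # 1024, already a multiple of 8
```

This explains why ResponseId reports a storage width of 24 even though its format is A18: the two values are related but not interchangeable.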

It looks like there is a bug in ReadStat similar to #210 that affects format rather than display_width. The underlying cause is similar: the old "print format" and "write format" data fields were limited to a single byte, i.e. maxed out at 255. SPSS later added other records to override those values.

While ReadStat is successfully reading the special record that indicates the length of a long string, it's only using that information to determine the storage width. It should be using that information to override the format width as well.
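Until a fixed release is installed, one possible way to spot affected variables is to look for string formats reported as A255 whose storage width is far larger than a 255-byte string would need. The sketch below is only a heuristic under that assumption; the function name `suspect_string_formats` is hypothetical, and it flags candidates rather than recovering the true width.

```python
import re


def suspect_string_formats(original_variable_types, variable_storage_width):
    """Return names of string variables whose format width looks capped at 255.

    Hypothetical heuristic: a genuine A255 variable needs at most 256 bytes
    of storage (255 rounded up to a multiple of 8), so a larger storage
    width suggests the format width was truncated.
    """
    suspects = []
    for name, fmt in original_variable_types.items():
        match = re.fullmatch(r"A(\d+)", fmt)
        if match is None:
            continue  # not a string format (e.g. F40.2)
        format_width = int(match.group(1))
        storage_width = variable_storage_width.get(name, 0)
        if format_width == 255 and storage_width > 256:
            suspects.append(name)
    return suspects


# Metadata values copied from the report above
types = {'ResponseId': 'A18', 'StartDate': 'A255',
         'Duration__in_seconds_': 'F40.2', 'Finished': 'F1.0'}
widths = {'ResponseId': 24, 'StartDate': 1024,
          'Duration__in_seconds_': 8, 'Finished': 8}
print(suspect_string_formats(types, widths))
```

For the reported file this flags only StartDate. The storage width gives an upper bound on the real format width, not necessarily the exact value, so treat the result as a hint to inspect those variables in SPSS.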

All of this is to say, I think I know what's going on, and a fix should find its way through the pipes before long.

Thanks for the detailed report!

See Roche/pyreadstat#77

TheManInTheShack · 2020-09-12T05:05:51Z

Great to hear, and thanks so much for the swift attention! :)

ofajardo · 2020-12-03T17:32:02Z

The issue will be solved in the next version, as it has already been fixed in ReadStat.

ofajardo · 2020-12-17T16:55:02Z

Solved in pyreadstat version 1.0.6.

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Sep 12, 2020

Fix format widths for very long strings

e727bcc

See Roche/pyreadstat#77

ofajardo closed this as completed Dec 17, 2020
