IO/SAS (sas7bdat) deleted observations are not filtered out #15963

Open
Winand opened this issue Apr 10, 2017 · 5 comments
Labels
Bug IO SAS SAS: read_sas

Comments

Contributor

Winand commented Apr 10, 2017

Problem description

I filled a table, then deleted two observations (rows).
Attached file: datetime.sas7bdat
Pandas reads 5 rows with read_sas:

       Date1      Date2            DateTime                 DateTimeHi  \
0 1960-01-06 1960-01-04 1677-09-21 00:12:44 1677-09-21 00:12:43.145226   
1 1960-01-03 1960-01-05 2262-04-11 23:47:16 1960-01-01 00:00:00.000000   
2        NaT        NaT                 NaT                        NaT   
3 1960-01-06 1960-01-04 1677-09-21 00:12:44 2262-04-11 23:47:16.854774   
4 1960-01-01 1960-01-01                 NaT 1960-01-01 00:00:00.000000   

        Taiw  
0 1912-01-01  
1 1960-01-02  
2        NaT  
3 1912-01-01  
4 1960-01-01  

Expected Output

A DataFrame with 3 rows instead of 5 (without the rows at index 2 and 4).

Output of pd.show_versions()

last commit is cd24fa9 (ENH: add origin to to_datetime)

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0+unknown
pytest: 3.0.6
pip: 9.0.1
setuptools: 34.3.2
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
xarray: 0.9.1
IPython: 5.3.0
sphinx: 1.5.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.5
lxml: 3.7.3
bs4: 4.5.3
html5lib: 1.0b10
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.0.9
pandas_gbq: None
pandas_datareader: None

Contributor

jreback commented Apr 10, 2017

Very odd that SAS actually keeps deleted things. Is this a 'feature'?

jreback added this to the Next Major Release milestone Apr 10, 2017

Contributor Author

Winand commented Apr 10, 2017

I think here's the answer:
When an observation is deleted from a non-compressed SAS data file, the observation is marked for deletion, but not physically removed. <...> With compressed data files, <...>, if the REUSE option is set to NO, the entry for that observation remains in the segment table, but the actual observation's data is returned to the available space pool. When REUSE=YES, <...>, the marker is removed from the segment table as well.
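
The "marked for deletion, but not physically removed" behaviour quoted above can be illustrated with a toy in-memory model. This is only a sketch: `StoredRow`, `read_all`, and `read_live` are hypothetical names, and nothing here reflects the actual sas7bdat layout or the pandas reader. It shows why a reader that ignores the marker returns 5 rows where 3 are expected.

```python
from dataclasses import dataclass


@dataclass
class StoredRow:
    """Toy stand-in for a stored observation (hypothetical, not the SAS layout)."""
    values: tuple
    deleted: bool = False  # set when the row is deleted; the data stays in place


def read_all(rows):
    """Naive reader: ignores the deletion marker and returns every stored row."""
    return [row.values for row in rows]


def read_live(rows):
    """Marker-aware reader: skips rows that are flagged as deleted."""
    return [row.values for row in rows if not row.deleted]


# Five stored rows, two of them marked as deleted (index 2 and 4).
table = [StoredRow((i,)) for i in range(5)]
table[2].deleted = True
table[4].deleted = True

naive = read_all(table)   # still contains all five rows
live = read_live(table)   # only the three rows that were not deleted
```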

Contributor

jreback commented Apr 10, 2017

OK, so I guess we should check whether they are actually deleted, then. PR welcome. It is very strange to do this, but that's SAS for you...

Contributor

jreback commented Apr 11, 2017

cc @bashtage
cc @kshedden

Contributor Author

Winand commented Apr 12, 2017

Another related bug.
If a data file has compression and deleted rows, pandas fails to read it:

Traceback (most recent call last):
  File "D:\andray\Software\Python\WinPython-64bit-3.5.3.1Qt5b3\settings\.spyder-py3\temp.py", line 17, in <module>
    d=pd.read_sas(r"D:\andray\Software\Python\projects\abdpv2\sas\bin\work\_TD508_MAKAROVAS_\q_del.sas7bdat")
  File "D:\andray\Software\Python\pandas\pandas\io\sas\sasreader.py", line 66, in read_sas
    data = reader.read()
  File "D:\andray\Software\Python\pandas\pandas\io\sas\sas7bdat.py", line 595, in read
    nrows = self.row_count
AttributeError: 'SAS7BDATReader' object has no attribute 'row_count'

troels added a commit to troels/pandas that referenced this issue Sep 9, 2018
Sas7bdat may contain rows which are actually deleted.

If the page_type has bit 128 set, there is a bitmap following
the normal row data with a bit set for a given row if it has been
deleted. Use that information to not include deleted rows in
the resulting dataframe.
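
The commit message above could be sketched roughly as follows. This is a minimal illustration, not the code from the referenced commit: the function name `deleted_row_flags` is hypothetical, the caller is assumed to have already located the bitmap bytes that follow the row data, and the bit order (one bit per row, most significant bit first within each byte) is an assumption that would need to be checked against real files.

```python
PAGE_HAS_DELETED_BITMAP = 128  # the page_type bit named in the commit message


def deleted_row_flags(page_type, bitmap, row_count):
    """Return one bool per row on the page, True where the row is flagged deleted.

    Assumptions (not confirmed by the issue): `bitmap` holds the bytes that
    follow the row data, with one bit per row, most significant bit first.
    """
    if not page_type & PAGE_HAS_DELETED_BITMAP:
        # No bitmap on this page, so no row is flagged.
        return [False] * row_count
    if len(bitmap) * 8 < row_count:
        raise ValueError("bitmap is too short for the number of rows")
    return [bool(bitmap[i // 8] & (0x80 >> (i % 8))) for i in range(row_count)]


# Example: five rows on a page whose bitmap flags rows 2 and 4.
rows = ["r0", "r1", "r2", "r3", "r4"]
flags = deleted_row_flags(0x80, bytes([0b00101000]), len(rows))
kept = [row for row, is_deleted in zip(rows, flags) if not is_deleted]
```

With the flags in hand, a reader would drop the flagged rows before building the DataFrame, which matches the 3-row result expected in the original report.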
troels added a commit to troels/pandas that referenced this issue Sep 11, 2018
troels added a commit to troels/pandas that referenced this issue Sep 16, 2018
troels added a commit to troels/pandas that referenced this issue Sep 18, 2018
troels added a commit to troels/pandas that referenced this issue Sep 23, 2018
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

4 participants