SPSS generated columns cause error in group_by when called directly but not if filtered #658

ghost · 2014-10-06T12:23:15Z

I have a large data frame that contains 3 columns of data that were generated in SPSS. I'm not actually accessing those columns in the code, but if I call group_by first, it generates an error (see below), whereas if I call filter first and pipe the result to group_by, it does not.

# I convert my data.frame to a tbl_df just to see if that resolves the problem; it does not
suicidesTbl <- tbl_df(suicideDeathDshsPost2001)

# The following is what I want to do
daysSinceIvenstigation <- suicidesTbl %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# But I get the following error
Error: column 'death.date.spss' has unsupported type

# The following sequence, with a call to filter that doesn't actually eliminate anything, works just fine
daysSinceIvenstigation <- filter(suicidesTbl, YrDeath != 1900) %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# There are three offending columns; if I eliminate them, it also works fine
daysSinceIvenstigation <- select(suicidesTbl, -death.date.spss, -birth.date.spss, -age.at.death) %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# Here is the structure of the offending death.date.spss column
> str(suicidesTbl$death.date.spss)
Classes 'labelled', 'numeric'  atomic [1:1044] 1.32e+10 1.32e+10 1.32e+10 1.32e+10 1.32e+10 ...
  ..- attr(*, "label")= Named chr "Date.Dmy(DaDeath,MoDeath,YrDeath)"
  .. ..- attr(*, "names")= chr "death_date_spss"

# I can get around this easily enough, since I don't really need those SPSS-generated columns at this point. Still, it seems like either a problem that should get fixed, or one that I should be aware of and handle on my end.

hadley · 2014-10-06T12:24:48Z

read.spss() must be including some additional attribute that dplyr doesn't know how to handle. Just strip that off and you should be fine.
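As a minimal sketch of what "strip that off" could look like in base R: the data frame below is a made-up stand-in for the SPSS import, and strip_labelled() is a hypothetical helper, not a function from dplyr or foreign. It assumes the only extras are the "labelled" class and the "label" attribute shown in the str() output above.

```r
# Made-up stand-in for the imported data: one ordinary column and one
# column carrying the extra "labelled" class and a "label" attribute.
df <- data.frame(timePeriod = c("a", "a", "b"))
df$death.date.spss <- structure(
  c(1.32e10, 1.32e10, 1.32e10),
  class = c("labelled", "numeric"),
  label = "Date.Dmy(DaDeath,MoDeath,YrDeath)"
)

# Hypothetical helper: drop the label attribute and the extra class from
# any column that inherits from "labelled"; leave other columns alone.
strip_labelled <- function(x) {
  if (inherits(x, "labelled")) {
    attr(x, "label") <- NULL
    x <- unclass(x)
  }
  x
}

# Apply to every column while keeping the data frame structure.
df[] <- lapply(df, strip_labelled)

str(df$death.date.spss)
```

After this, the column should be a plain numeric vector, so the usual group_by() and summarise() pipeline can be run on the cleaned data frame.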

ghost · 2014-10-06T13:01:56Z

Thanks for the quick response Hadley, much appreciated.

Does it make sense that it would cause a problem in one sequence but not the other? As I say, I'm okay without them, just for my education's sake.

Steven

@stevenvannoy
stevenvannoy.wordpress.com

hadley · 2014-10-06T13:09:22Z

Hmmmm, that suggests filter is misbehaving and dropping attributes. @romainfrancois can you please take a look?

romainfrancois · 2014-10-06T13:16:47Z

I think it's just because of the extra class labelled. Where do I get suicideDeathDshsPost2001 from?

ghost · 2014-10-06T13:31:04Z

Hi,

The data frame suicideDeathDshsPost2001 is partially constructed in my code, but it originates with a data file generated by SPSS, which in turn had read in an Excel spreadsheet. It appears that the problem variables are the few that were generated in SPSS (added to the original data). The data frame is 139,000 observations of 60+ variables.

I'm happy to send it to you if you wish; I could just drop the majority of the rows. I'm just learning how to use GitHub, so if it can be done through that, I can do it that way, or I could generate a data file and send it to you. Again, I can get around this problem, but I thought it should be reported in case there is an internal issue. Very happy to help you track it down.

Steven

@stevenvannoy
stevenvannoy.wordpress.com

hadley · 2014-10-06T13:40:17Z

Hmmm, it correctly errors for me on a simple test case:

library(dplyr)

x <- structure(1:10, class = "labelled", label = "abc")
df <- data.frame(y = 1:10)
df$x <- x

filter(df, y > 5)

@svannoy are you using dplyr 0.3?

Robinlovelace · 2014-10-11T16:22:52Z

This issue of read.spss in fact explains the strange behaviour I reported here: #390 . x[1:nrow(x),] stripped it for me and continues in version 0.3*. There may be a better way, but agree it's mostly read.spss's fault...

ghost · 2014-10-30T21:35:41Z

Hi Hadley,

I never heard back on my request to send you data (see below), but as I mentioned I just got around it by deleting the offending columns. If you want to track it down I'm more than happy to send you the data to reproduce it.

Steven

@stevenvannoy
stevenvannoy.wordpress.com

hadley closed this as completed Oct 30, 2014

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018