SPSS generated columns cause error in group_by when called directly but not if filtered #658

Closed
ghost opened this issue Oct 6, 2014 · 8 comments

Comments

@ghost

ghost commented Oct 6, 2014

I have a large data frame that contains three columns of data that were generated in SPSS. I'm not actually accessing those columns in the code, but if I call group_by first, it generates an error (see below), whereas if I call filter first and pipe the result to group_by, it does not.

# I convert my data.frame to a tbl_df just to see if that resolves the problem; it does not
suicidesTbl <- tbl_df(suicideDeathDshsPost2001)

# The following is what I want to do
daysSinceIvenstigation <- suicidesTbl %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# But I get the following error
Error: column 'death.date.spss' has unsupported type

# The following sequence, with a call to filter, which doesn't actually eliminate anything, works just fine
daysSinceIvenstigation <- filter(suicidesTbl, YrDeath != 1900) %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# There are three offending columns; if I eliminate them, it also works fine
daysSinceIvenstigation <- select(suicidesTbl, -death.date.spss, -birth.date.spss, -age.at.death) %>%
  group_by(timePeriod) %>%
  summarise(suicideCounts = n())

# Here is the structure of the offending death.date.spss column
> str(suicidesTbl$death.date.spss)
Classes 'labelled', 'numeric'  atomic [1:1044] 1.32e+10 1.32e+10 1.32e+10 1.32e+10 1.32e+10 ...
  ..- attr(*, "label")= Named chr "Date.Dmy(DaDeath,MoDeath,YrDeath)"
  .. ..- attr(*, "names")= chr "death_date_spss"

# I can work around this easily enough, since I don't need those SPSS-generated columns at this point, but it seems like either a problem that should be fixed or something I should be aware of and change on my end.
@hadley
Member

hadley commented Oct 6, 2014

read.spss() must be including some additional attribute that dplyr doesn't know how to handle. Just strip that off and you should be fine.
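
A minimal sketch of what stripping those attributes could look like, assuming the extra pieces are the "labelled" class and the "label" attribute shown in the str() output above. The helper name strip_labelled is hypothetical and not part of dplyr or foreign:

```r
# Hypothetical helper: remove the "labelled" class and the "label"
# attribute from every column of a data frame that carries them.
strip_labelled <- function(df) {
  df[] <- lapply(df, function(col) {
    if (inherits(col, "labelled")) {
      attr(col, "label") <- NULL
      remaining <- setdiff(class(col), "labelled")
      # Fall back to the implicit class when nothing else is left
      class(col) <- if (length(remaining) > 0) remaining else NULL
    }
    col
  })
  df
}
```

Calling suicidesTbl <- strip_labelled(suicidesTbl) before group_by may avoid the "unsupported type" error, though that depends on the dplyr version in use.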

@ghost
Author

ghost commented Oct 6, 2014

Thanks for the quick response Hadley, much appreciated.

Does it make sense that it would cause a problem in one sequence but not the other? As I say, I'm okay without them, just for my education's sake.

Steven


@stevenvannoy
stevenvannoy.wordpress.com


@hadley
Member

hadley commented Oct 6, 2014

Hmmmm, that suggests filter is misbehaving and dropping attributes. @romainfrancois can you please take a look?

@romainfrancois
Member

I think it's just because of the extra class labelled. Where do I get suicideDeathDshsPost2001 from?

@ghost
Author

ghost commented Oct 6, 2014

Hi,

The data frame suicideDeathDshsPost2001 is partially constructed in my code, but it originates from a data file generated by SPSS, which in turn had read in an Excel spreadsheet. It appears that the problem variables are the few that were generated in SPSS (added to the original data). The data frame is 139,000 observations of 60+ variables. I'm happy to send it to you if you wish; I could just drop the majority of the rows. I'm just learning how to use GitHub, so if it can be done through that, I can do it that way, or I could generate a data file and send it to you. Again, I can get around this problem, but I thought it should be reported in case there is an internal issue. Very happy to help you track it down.

Steven


@stevenvannoy
stevenvannoy.wordpress.com


@hadley
Member

hadley commented Oct 6, 2014

Hmmm, it correctly errors for me on a simple test case:

library(dplyr)

x <- structure(1:10, class = "labelled", label = "abc")
df <- data.frame(y = 1:10)
df$x <- x

filter(df, y > 5)

@svannoy are you using dplyr 0.3?

@Robinlovelace
Contributor

This issue with read.spss in fact explains the strange behaviour I reported in #390. x[1:nrow(x),] stripped it for me and continues to do so in version 0.3*. There may be a better way, but I agree it's mostly read.spss's fault...

@hadley hadley closed this as completed Oct 30, 2014
@ghost
Author

ghost commented Oct 30, 2014

Hi Hadley,

I never heard back on my request to send you data (see below), but as I mentioned I just got around it by deleting the offending columns. If you want to track it down I'm more than happy to send you the data to reproduce it.

Steven


@stevenvannoy
stevenvannoy.wordpress.com


@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018