Support multiselect survey item types via `categorical` properties on `list` field types #1039

khusmann (Collaborator) · 2024-06-19T20:36:18Z

"Multiple select" items are an extremely common type of survey question in the social, medical, bio-behavioral, and related sciences. For example:

Which fruits do you like? (Select all that apply)

a. Apple
b. Orange
c. Banana
d. Kiwi

Data from such items are often exported from survey software as a delimited list in a single field. Qualtrics will export data like this (in fact, it uses this delimited list form by default), and I believe REDCap has an option for it as well (@pschumm please correct me if I'm wrong!). For example, an exported CSV from the above item might look something like this:

id,multiselectField
0,"Apple"
1,"Apple,Orange"
2,"Apple,Banana,Kiwi"

To represent these item types in Frictionless, I'd like to propose that we allow categorical properties to be defined on `list` field types (where `itemType` is either `integer` or `string`). This way, the above multiple select item field could be represented as follows:

{
  "name": "multiselectField",
  "type": "list",
  "itemType": "string",
  "categories": ["Apple", "Orange", "Banana", "Kiwi"]
}

Or in a coded representation:

{
  "name": "multiselectField",
  "type": "list",
  "itemType": "integer",
  "categories": [
    { "value": 0, "label": "Apple"},
    { "value": 1, "label": "Orange"},
    { "value": 2, "label": "Banana"},
    { "value": 3, "label": "Kiwi"}
  ]
}
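To make the coded representation concrete, here is a minimal sketch of how a reader might turn one exported cell into labels. It assumes a comma delimiter and treats an empty cell as missing; `decode_multiselect` is a hypothetical helper, not part of any Frictionless library.

```python
# Field descriptor copied from the coded proposal above.
FIELD = {
    "name": "multiselectField",
    "type": "list",
    "itemType": "integer",
    "categories": [
        {"value": 0, "label": "Apple"},
        {"value": 1, "label": "Orange"},
        {"value": 2, "label": "Banana"},
        {"value": 3, "label": "Kiwi"},
    ],
}


def decode_multiselect(cell, field, delimiter=","):
    """Split a delimited cell and map each coded value to its label."""
    if cell == "":
        return None  # assumption: an empty cell is a missing value
    labels = {c["value"]: c["label"] for c in field["categories"]}
    return [labels[int(item)] for item in cell.split(delimiter)]


print(decode_multiselect("0,2,3", FIELD))  # ['Apple', 'Banana', 'Kiwi']
```

A code that is not declared in `categories` raises a `KeyError` here, which is one simple way to surface invalid data.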

Thoughts from other folks that frequently use categorical items? @pschumm @fomcl @djvanderlaan

djvanderlaan · 2024-07-09T12:18:15Z

To be honest, I don't have that much experience with these. From what I have seen these are usually stored either as separate 'dummy' columns (so separate column for apple, orange etc with 1 indicating that this category is selected) or in 'long format' where there are multiple rows; one for each category selected. I believe the latter is, for example, used in some of the hospital data we are working with, where patients can have 0 or more subdiagnoses (a dummy variable for each possible icd10 code would be a bit unwieldy 😆 ).

But I suspect this is also because most tools don't have direct/easy support for list type fields. In plain JS/Python this is quite easy (don't know about pandas); R also supports these, although working with list types is not easy in base-R. So it is about support and not necessarily about this being or not being a natural way to store this data.

What you suggest is only a small deviation from the current spec. Actually reading back the v3 spec: it depends a little bit what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.' I would suspect that most humans reading your example would understand what is meant.

khusmann (Collaborator, Author) · 2024-07-09T22:55:38Z

> From what I have seen these are usually stored either as separate 'dummy' columns (so separate column for apple, orange etc with 1 indicating that this category is selected) or in 'long format' where there are multiple rows; one for each category selected.

Exactly. To summarize, multiselect items are represented in tabular formats as:

1. Delimited lists in a single column (as I'm proposing here; Qualtrics and REDCap support this)
2. Exploded columns (1 boolean column for each option) (Qualtrics and REDCap also support this)
3. Exploded rows (like 2, but transformed into 'long' format; not supported by Qualtrics / REDCap, but used elsewhere, as you say)

Representations (2) and (3) can already be captured by the current Frictionless spec via boolean columns. In the current v2 spec, representation (1) is only partially supported: we can make lists of integer or string types, but we cannot define categories for them. This proposal would allow us to define the `categories` property on integer and string lists, thereby giving us full support for representation (1).

> But I suspect this is also because most tools don't have direct/easy support for list type fields. In plain JS/Python this is quite easy (don't know about pandas); R also supports these, although working with list types is not easy in base-R. So it is about support and not necessarily about this being or not being a natural way to store this data.

Historically, I think that's true, but in my experience list-columns have more recently become ubiquitous across the open software landscape. For example, Pandas has a function to explode a list-column of form (1) into exploded rows of form (3). There's also a similar function for list-columns in Polars. List-columns in base R are a little unwieldy, as you say, but they now have excellent support in the tidyverse: tibbles with list-columns work seamlessly with purrr maps, and tidyr now provides `unnest_wider` and `unnest_longer` to explode list-columns into forms (2) and (3), respectively (here's the relevant vignette).

Implementations that don't support list-columns could also easily include an option to load these fields by transforming them into an exploded form, or leave them as delimited strings.
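As a rough illustration of that fallback, the following sketch explodes form (1) into forms (2) and (3) using only plain Python, with no dataframe library. The function names and the `field_category` column naming are hypothetical choices made for this example.

```python
CATEGORIES = ["Apple", "Orange", "Banana", "Kiwi"]

# Form (1): one list value per respondent (already split on the delimiter).
rows = [
    {"id": 0, "multiselectField": ["Apple"]},
    {"id": 1, "multiselectField": ["Apple", "Orange"]},
    {"id": 2, "multiselectField": ["Apple", "Banana", "Kiwi"]},
]


def explode_columns(rows, field, categories):
    """Form (2): one boolean column per category."""
    return [
        {"id": r["id"], **{f"{field}_{c}": c in r[field] for c in categories}}
        for r in rows
    ]


def explode_rows(rows, field):
    """Form (3): one row per selected category ('long' format)."""
    return [{"id": r["id"], field: item} for r in rows for item in r[field]]


wide = explode_columns(rows, "multiselectField", CATEGORIES)
long = explode_rows(rows, "multiselectField")
```

Note that a respondent who selected nothing produces no rows at all in form (3), so a loader using that form would need to track such respondents separately.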

The larger point, I think, is that it's not uncommon to see delimited lists of categoricals for multiselect items (e.g. REDCap & Qualtrics), and so it'd be nice to directly represent this format in a Frictionless schema rather than requiring pre-transformation of the data to get it into Frictionless. Plus, we're 90% of the way there already via the existing list field type...

> What you suggest is only a small deviation from the current spec. Actually reading back the v3 spec: it depends a little bit what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.'

Exactly. We would rephrase this part of the categorical definition to include a different provision for lists of categoricals. Something like:

When the categorical property is applied to a `list` field type, the logical representation of each element in the list `MUST` exactly match one of the values in categories.

There are a few other places like this where we'd need to update the definition to allow for lists, but in general it would only be a minor deviation from the current spec, as you say.
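The per-element rule proposed above could be checked along these lines. This is only a sketch: `invalid_elements` is a hypothetical helper, and it assumes `categories` is either a list of plain values or a list of objects with a `value` key, as in the two examples at the top of the thread.

```python
def invalid_elements(values, categories):
    """Return the list elements that do not match any declared category."""
    allowed = {c["value"] if isinstance(c, dict) else c for c in categories}
    return [v for v in values if v not in allowed]


fruit = ["Apple", "Orange", "Banana", "Kiwi"]
print(invalid_elements(["Apple", "Banana"], fruit))  # []
print(invalid_elements(["Apple", "Grape"], fruit))   # ['Grape']
```

A field value would be valid under the proposed wording exactly when this function returns an empty list.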

pschumm (Collaborator) · 2024-07-10T00:30:41Z

Here are some very quick reactions—nothing I would feel strongly about without further thought. Multiselect items are typically pretty bad from a measurement perspective; people often analyze them as though they were independent responses to each of several questions (one for each possible item), but of course the task when you present someone with a list of items and ask them to "Check all that apply" is actually quite different. If you want to know a participant's response to each item, then you really need to ask about each individually. That said, I recognize that there can be legitimate use cases, and even in cases where individual items would have been better, researchers still use multiselect fields (e.g., REDCap includes them as an option), so we need a way to represent them in the data.

While systems such as Pandas allow list fields, they are not very helpful when fitting statistical models, as it is difficult to create a model matrix from a list field. Hence, Stata doesn't have list fields—don't know about SAS or SPSS. Stata does have a function to split a list field into Representation (2) above, and it's not difficult to translate to Representation (3) either. But if you are always going to do that before fitting a model, then you don't gain much by using Representation (1).

I would point out that, if the items are ["Apple","Orange","Banana","Kiwi"], then there is a big difference between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked"; specifically (1) the Stata function split will work on the latter but not on the former, (2) the ordering of the categories is ambiguous in the former but not in the latter, and (3) the latter has categories at two levels (i.e., the items and the response options for each) while the former has only one level of categories (i.e., the items). Also, missing values can work (and mean) quite different things in both cases. Thus, I think we should think a bit about whether we want to support both of these, and if not, which one do we prefer?

Finally, a very similar issue is how to handle rank choices, including responses with tied ranks. In one sense a multiselect field may be thought of as a special case of a rank choice field (i.e., one with only two ranks, both of which may have ties). Perhaps that's too great a level of abstraction for this purpose.

In sum, I don't have any specific objection to what Kyle is proposing above, and I can even think of cases where it would work very nicely (e.g., storing diagnosis codes in medical record data). My only point is that there are instances of multiselect fields (and rank choice fields) that it doesn't cover, and that are important for full support of social, behavioral, and biomedical research. For example, in the case of a multiselect field in REDCap, it is unlikely that I would want to store the resulting data in this way.

khusmann (Collaborator, Author) · 2024-07-10T22:31:12Z

> Multiselect items are typically pretty bad from a measurement perspective… researchers still use multiselect fields (e.g., REDCap includes them as an option), so we need a way to represent them in the data.

Agreed!

> Finally, a very similar issue is how to handle rank choices, including responses with tied ranks. In one sense a multiselect field may be thought of as a special case of a rank choice field (i.e., one with only two ranks, both of which may have ties). Perhaps that's too great a level of abstraction for this purpose.

On the contrary, thinking about these abstractions is exactly the discussion I was hoping to have here :)

Your comment suggests we should have a fourth representation in our list:

1. Array columns (a variable size list of the selected options, e.g. "Apple,Banana")
2. Exploded columns
3. Exploded rows
4. Tuple columns (a fixed size list where each element of the list is a boolean, e.g. "True,False,True,False")

As you point out, representation (4) can be generalized to rank choices. In fact, I think it can be generalized to any block of nested items, like a matrix grid of Likert responses, for example. For this use case, however, because we have a fixed number of items, I think it's probably better to avoid nested values and keep the exploded form (2). In this case, we would express hierarchy via field properties, perhaps something like:

[
  {
    "name": "field1",
    "type": "integer",
    "fieldGroup": "measure1"
  },
  {
    "name": "field2",
    "type": "integer",
    "fieldGroup": "measure1"
  }
]

As you point out, if we have a grouped representation like this it would also work for multiselect items, where individual items in the group were booleans representing the "checked" and "not checked" states of the instrument. As you say:

> I would point out that, if the items are ["Apple","Orange","Banana","Kiwi"], then there is a big difference between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked";
>
> missing values can work (and mean) quite different things in both cases. Thus, I think we should think a bit about whether we want to support both of these, and if not, which one do we prefer?

I totally agree, and here's my argument for why a multiselect item is categorically different (pun intended) from a group of yes-no items, even though that's how we generally treat it for analysis: The key difference is that in a group of yes-no items, you potentially have the ability to skip or not respond to one of the yes-no answers, whereas in a multiselect, there's no such thing as a skipped item – only a skipped group.

In other words, an N-item multiselect is not N discrete yes-no decisions where you may opt out of any one of those N decisions; it represents a single action of selecting a subset of items from a pool of unique items. You either saw the multiselect item and selected zero or more options, or you never saw the item because of survey logic (or never took the survey) and it should be considered missing. I know, it's annoying from an analytical perspective, but that's the behavior we're wanting to represent in our data…
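A small sketch of what that distinction means when a list value is later exploded into booleans. It assumes `None` stands for a missing item and an empty list for "seen, nothing selected"; `to_booleans` is a hypothetical helper.

```python
CATEGORIES = ["Apple", "Orange", "Banana", "Kiwi"]


def to_booleans(selection, categories):
    """Explode one multiselect response into per-option booleans.

    None means the whole item is missing (never shown, or survey not taken).
    An empty list means the item was seen and no option was selected.
    """
    if selection is None:
        return {c: None for c in categories}
    return {c: c in selection for c in categories}


print(to_booleans(["Apple", "Banana"], CATEGORIES))
print(to_booleans([], CATEGORIES))    # every option False
print(to_booleans(None, CATEGORIES))  # every option missing
```

Missingness is all-or-nothing here: there is no input that yields a missing value for one option and a boolean for another, which is the property a group of independent yes-no items cannot guarantee.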

The other way multiselect items lend themselves to a variable length collection of categorical values is that you may want to attach other metadata to the selection options (rather than to the question). So, for example, in a multiselect where an individual is choosing a list of medications they are taking, each categorical level could include information about those medications (or give the items a structure with some hierarchy).

> (2) the ordering of the categories is ambiguous in the former but not in the latter,

I disagree on this point; the order of the options is encoded in the ordering of the levels in the categorical definition.

> But if you are always going to do that before fitting a model, then you don't gain much by using Representation (1).
>
> For example, in the case of a multiselect field in REDCap, it is unlikely that I would want to store the resulting data in this way.

I can understand this sentiment for a final data product you'd share for researchers to use analytically, but I think the value of the representation is for use in data transformation pipelines. To summarize, it allows you to:

- interact with the multiselect item as a single item when you're selecting, pivoting, filtering, and otherwise wrangling/packing/coding your data. This means the step of exploding the multiselect into columns becomes an explicit part of your transformation pipeline from raw instrument -> final shared analytic data product.
- add validation constraints on the collection instrument that only make sense when the multiselect is considered as a single item, rather than a group of N independent items. (e.g., for a multiselect that asks "select up to two fruits" you could add a maxLength constraint on the list)
- have metadata associated with the levels of the multiselect, rather than having to associate it with the N independent boolean fields as if they were discrete questions. (This allows you to use the same categorical level definitions for forced choice and multiselect items; i.e. use the same codings with radio buttons as with checkboxes)
- distinguish between an N-item multiselect item and a group of N yes-no choices in your data model. (This is nice for codebook / visualization generation, and ensures you can only have a missing value for the entire multiselect item rather than having to account for a possible missing value for each of the N independent items)
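The "select up to two fruits" case from the list above might look like this. The sketch assumes that a `maxLength` constraint on a `list` field counts list elements; the field name `favoriteFruits` and the helper `within_max_length` are made up for illustration.

```python
FIELD = {
    "name": "favoriteFruits",  # hypothetical field name
    "type": "list",
    "itemType": "string",
    "categories": ["Apple", "Orange", "Banana", "Kiwi"],
    "constraints": {"maxLength": 2},  # assumed to count list elements
}


def within_max_length(values, field):
    """Check a parsed list value against the field's maxLength, if any."""
    limit = field.get("constraints", {}).get("maxLength")
    return limit is None or len(values) <= limit


print(within_max_length(["Apple", "Kiwi"], FIELD))            # True
print(within_max_length(["Apple", "Kiwi", "Orange"], FIELD))  # False
```

This kind of check has no natural home when the same item is stored as four independent boolean columns, since the limit applies to the selection as a whole.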

> (1) the Stata function split will work on the latter but not on the former,

I hear you on this, but again, I think the value of this representation is in the transformation pipelines, so by the time it gets to the Stata user as an analytic data product it will have been exploded. Furthermore, I'm expecting we'll eventually be writing most of our Frictionless loading functions for Stata, etc. as calls to a common Python / Rust library, right? So list-columns can be exploded there as well in the rare case a Stata user is loading an intermediate data product that contains them.

That said, if I haven't convinced you, I'm willing to compromise on the ontology of multiselect items if you're more interested in running with the "item groupings" approach for now. Even if I don't think the abstraction 100% fits, I think it's still 100% valid to go with something that will probably work well enough in most use cases and can simultaneously address a more general need here in the short term. (And we can always implement the array-column approach later, if we want; the two are not mutually exclusive.)

Please let me know what you think / how you'd like to proceed!
