Replies: 4 comments
-
To be honest, I don't have much experience with these. From what I have seen, they are usually stored either as separate 'dummy' columns (a separate column for apple, orange, etc., with 1 indicating that the category is selected) or in 'long format', where there are multiple rows, one for each category selected. I believe the latter is used, for example, in some of the hospital data we are working with, where patients can have zero or more subdiagnoses (a dummy variable for each possible ICD-10 code would be a bit unwieldy 😆).

But I suspect this is also because most tools don't have direct or easy support for list-type fields. In plain JS/Python this is quite easy (I don't know about pandas); R also supports these, although working with list types is not easy in base R. So it is about tool support and not necessarily about whether this is a natural way to store the data.

What you suggest is only a small deviation from the current spec. Actually, reading back the v3 spec: it depends a little on what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.' I would suspect that most humans reading your example would understand what is meant.
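Since list-valued fields are easy to handle in plain Python, here is a minimal sketch of moving from a list per respondent to the two storage forms described above: one dummy column per category, and a long format with one row per selected category. The field names (`id`, `fruits`), the category list, and the sample rows are made up for illustration.

```python
# Illustrative data: one list of selected categories per respondent.
CATEGORIES = ["apple", "orange", "banana"]
rows = [
    {"id": 1, "fruits": ["apple", "banana"]},
    {"id": 2, "fruits": []},
    {"id": 3, "fruits": ["orange"]},
]


def to_dummy_columns(rows, categories):
    """Wide form: one 0/1 column per category."""
    wide = []
    for row in rows:
        selected = set(row["fruits"])
        record = {"id": row["id"]}
        for category in categories:
            record[category] = 1 if category in selected else 0
        wide.append(record)
    return wide


def to_long_format(rows):
    """Long form: one row per (respondent, selected category)."""
    return [
        {"id": row["id"], "fruit": category}
        for row in rows
        for category in row["fruits"]
    ]


wide = to_dummy_columns(rows, CATEGORIES)
long = to_long_format(rows)
```

Note that respondent 2, who selected nothing, has no rows at all in the long form, so that form needs a separate table of respondents to tell "selected nothing" apart from "not asked".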
-
Exactly. To summarize, multiselect items are represented in tabular formats as:
Representations (2) & (3) can be presently captured by the current frictionless spec via boolean columns. In the current v2 spec, representation (1) is only partially supported: we can make lists of
Historically, I think that's true, but in my experience list-columns have recently become ubiquitous across the open software landscape. For example, Pandas has a function to explode a list-column of form (1) into exploded rows of form (3). There's also a similar function for list-columns in Polars. List-columns in base R are a little unwieldy, as you say, but they now have excellent support in the tidyverse: tibbles of list-columns work seamlessly with purrr maps, and tidyr now provides

Implementations that don't support list-columns could also easily include an option to load these fields by transforming them into an exploded form, or leave them as delimited strings.

The larger point, I think, is that it's not uncommon to see delimited lists of categoricals for multiselect items (e.g. REDCap & Qualtrics), so it'd be nice to represent this format directly in a frictionless schema rather than requiring the data to be pre-transformed to get it into frictionless. Plus, we're 90% of the way there already via the existing
Exactly. We would rephrase this part of the categorical definition to include a different provision for lists of categoricals. Something like:
There are a few other places like this where we'd need to update the definition to allow for the lists, but in general it would only be a minor deviation from the current spec, as you say.
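To make the proposed rule concrete, here is a rough sketch of how a validator might check a list field whose items are categorical: each element of the list, rather than the list as a whole, must match one of the values in `categories`. The descriptor shape below (`itemType` alongside `categories` on a `list` field) follows the proposal in this thread and is not part of the current spec; `validate_list_categorical` is a hypothetical helper.

```python
# Hypothetical field descriptor, following the proposal in this thread.
field = {
    "name": "fruits",
    "type": "list",
    "itemType": "string",
    "categories": ["apple", "orange", "banana"],
}


def validate_list_categorical(value, field):
    """Return the list elements that are not in the field's categories.

    None is treated as a missing value and passes validation.
    An empty list (nothing selected) also passes.
    """
    if value is None:
        return []
    allowed = set(field["categories"])
    return [item for item in value if item not in allowed]


ok = validate_list_categorical(["apple", "banana"], field)
bad = validate_list_categorical(["apple", "kiwi"], field)
```

Returning the offending elements, rather than a bare true/false, is just one design choice; it makes it easier to report which value in the list failed.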
-
Here are some very quick reactions; nothing I would feel strongly about without further thought.

Multiselect items are typically pretty bad from a measurement perspective; people often analyze them as though they were independent responses to each of several questions (one for each possible item), but of course the task when you present someone with a list of items and ask them to "Check all that apply" is actually quite different. If you want to know a participant's response to each item, then you really need to ask about each individually. That said, I recognize that there can be legitimate use cases, and even in cases where individual items would have been better, researchers still use multiselect fields (e.g., REDCap includes them as an option), so we need a way to represent them in the data.

While systems such as Pandas allow list fields, they are not very helpful when fitting statistical models, as it is difficult to create a model matrix from a list field. Hence, Stata doesn't have list fields (I don't know about SAS or SPSS). Stata does have a function to split a list field into Representation (2) above, and it's not difficult to translate to Representation (3) either. But if you are always going to do that before fitting a model, then you don't gain much by using Representation (1).

I would point out that, if the items are ["Apple","Orange","Banana","Kiwi"], then there is a big difference between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked"; specifically (1) the Stata function

Finally, a very similar issue is how to handle rank choices, including responses with tied ranks. In one sense a multiselect field may be thought of as a special case of a rank choice field (i.e., one with only two ranks, both of which may have ties). Perhaps that's too great a level of abstraction for this purpose.
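The contrast between "Apple,Banana" and "Checked,Not checked,Checked,Not checked" can be made concrete with a small sketch. It assumes the item order is known (here, the item list from this comment) and uses hypothetical helper names; the two forms carry the same selections only when that ordered item list travels with the data.

```python
ITEMS = ["Apple", "Orange", "Banana", "Kiwi"]


def to_positional(selected, items):
    """One "Checked"/"Not checked" entry per item, in item order."""
    chosen = set(selected)
    return ["Checked" if item in chosen else "Not checked" for item in items]


def from_positional(states, items):
    """Recover the selected items from a positional vector."""
    if len(states) != len(items):
        raise ValueError("expected one state per item")
    return [item for item, state in zip(items, states) if state == "Checked"]


positional = to_positional("Apple,Banana".split(","), ITEMS)
selected = from_positional(positional, ITEMS)
```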
In sum, I don't have any specific objection to what Kyle is proposing above, and I can even think of cases where it would work very nicely (e.g., storing diagnosis codes in medical record data). My only point is that there are instances of multiselect fields (and rank choice fields) that it doesn't cover, and that are important for full support of social, behavioral, and biomedical research. For example, in the case of a multiselect field in REDCap, it is unlikely that I would want to store the resulting data in this way.
-
Agreed!
On the contrary, thinking about these abstractions is exactly the discussion I was hoping to have here :) Your comment suggests we should have a fourth representation in our list:
As you point out, representation (4) can be generalized to rank choices. In fact, I think it can be generalized to any block of nested items, such as a matrix grid of Likert responses. For this use case, however, because we have a fixed number of items, I think it's probably better to avoid nested values and keep the exploded form (2). In this case, we would express hierarchy via field properties, perhaps something like:

    [
      {
        "name": "field1",
        "type": "integer",
        "fieldGroup": "measure1"
      },
      {
        "name": "field2",
        "type": "integer",
        "fieldGroup": "measure1"
      }
    ]

As you point out, if we have a grouped representation like this it would also work for multiselect items, where the individual items in the group are booleans representing the "checked" and "not checked" states of the instrument. As you say:
I totally agree, and here's my argument for why a multiselect item is categorically different (pun intended) from a group of yes-no items, even though that's how we generally treat it for analysis. The key difference is that in a group of yes-no items, you can skip or not respond to any one of the yes-no answers, whereas in a multiselect there's no such thing as a skipped item, only a skipped group. In other words, an N-item multiselect is not N discrete yes-no decisions, any one of which you may opt out of; it represents a single action of selecting a subset of items from a pool of unique items. You either saw the multiselect item and selected zero or more options, or you never saw the item because of survey logic (or never took the survey) and it should be considered missing. I know, it's annoying from an analytical perspective, but that's the behavior we want to represent in our data…

The other reason multiselect items lend themselves to a variable-length collection of categorical values is that you may want to attach other metadata to the selection options (rather than to the question). So, for example, in a multiselect where an individual is choosing a list of medications they are taking, each categorical level could include information about those medications (or give the items a structure with some hierarchy).
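The missingness argument can be sketched in code. Assuming a missing multiselect is stored as `None` and an answered one as a (possibly empty) list, exploding to yes/no columns yields either all-missing or fully answered values; a mix of answered and skipped options cannot occur. The names and the `None` convention are illustrative assumptions.

```python
OPTIONS = ["apple", "orange", "banana"]


def explode_multiselect(value, options):
    """Turn one multiselect response into per-option booleans.

    None -> the respondent never saw the item: every option is missing.
    []   -> the item was seen and nothing was selected: every option is False.
    """
    if value is None:
        return {option: None for option in options}
    chosen = set(value)
    return {option: option in chosen for option in options}


not_shown = explode_multiselect(None, OPTIONS)
none_selected = explode_multiselect([], OPTIONS)
some_selected = explode_multiselect(["banana"], OPTIONS)
```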
I disagree on this point; the order of the options is encoded in the ordering of the levels in the categorical definition.
I can understand this sentiment for a final data product you'd share for researchers to use analytically, but I think the value of the representation is for use in data transformation pipelines. To summarize, it allows you to:
I hear you on this, but again, I think the value of this representation is in the transformation pipelines, so by the time it gets to the Stata user as an analytic data product it will have been exploded. Furthermore, I'm expecting we'll eventually be writing most of our loading functions for frictionless Stata, etc. as calls to a common Python / Rust library, right? So they can be exploded there as well in the rare case a Stata user is loading an intermediate data product with list-columns.

That said, if I haven't convinced you, I'm willing to compromise on the ontology of multiselect items if you're more interested in running with the "item groupings" approach for now. Even if I don't think the abstraction 100% fits, I think it's still 100% valid to go with something that will probably work well enough in most use cases and can simultaneously address a more general need here in the short term. (And we can always implement the array-column approach later, if we want; the two are not mutually exclusive.)

Please let me know what you think / how you'd like to proceed!
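For the "item groupings" approach, a consumer of the schema could collect fields by group before processing them together. A minimal sketch, assuming the `fieldGroup` property floated earlier in this comment (it is a suggestion in this thread, not an existing spec property) and a hypothetical `group_fields` helper; the ungrouped `age` field is made up for illustration:

```python
from collections import defaultdict

# Field descriptors using the suggested (non-standard) "fieldGroup" property.
fields = [
    {"name": "field1", "type": "integer", "fieldGroup": "measure1"},
    {"name": "field2", "type": "integer", "fieldGroup": "measure1"},
    {"name": "age", "type": "integer"},
]


def group_fields(fields):
    """Map each fieldGroup value to the names of its member fields."""
    groups = defaultdict(list)
    for field in fields:
        group = field.get("fieldGroup")
        if group is not None:
            groups[group].append(field["name"])
    return dict(groups)


groups = group_fields(fields)
```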
-
"Multiple select" items are an extremely common type of survey question type in the social / medical / bio-behavioral / etc sciences. For example:
Which fruits do you like? (Select all that apply)
Data from such items are often exported from survey software as a delimited list in a field. Qualtrics will export data like this (in fact, it uses this delimited list form by default), and I believe REDCap has an option for it as well (@pschumm please correct me if I'm wrong!). For example, an exported CSV from the above item might look something like this:
For representing these item types in frictionless, I'd like to propose we allow `categorical` properties to be defined on `list` item types (where `itemType` is either `integer` or `string`). This way, the above multiple select item field could be represented as follows:

Or in a coded representation:
Thoughts from other folks that frequently use categorical items? @pschumm @fomcl @djvanderlaan