Handling individual columns that can expand into multiple columns #163
Comments
Hi @ptonner! Thanks for reaching out. This is an interesting use-case! Formulaic does support nested "explosions", but doesn't implement categorical encoding for them. For example:
>>> def d():
...     return {"a": {"b": [1,2,3], "c": [1,2,3]}, "b": {"b": [1,2,3], "c": [1,2,3]}}
>>> formulaic.model_matrix("d()", pandas.DataFrame({"x": [1,2,3]}))
   Intercept  d()[a][b]  d()[a][c]  d()[b][b]  d()[b][c]
0        1.0          1          1          1          1
1        1.0          2          2          2          2
2        1.0          3          3          3          3
You could manually encode the nested values. It would be relatively straightforward, I think, to add support for automatic categorical codings on top of this as well. If I did this would it meet your use-case? It would look something like:
where
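To make the column naming in the output above concrete, here is a minimal sketch of how a nested dictionary could be flattened into names such as d()[a][b]. The helper flatten_nested is hypothetical and written only for illustration; it is not formulaic's internal implementation.

```python
from typing import Any, Dict, List


def flatten_nested(name: str, values: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Flatten a nested dict of columns into a flat {column_name: values} dict.

    Hypothetical helper: mimics the bracketed names seen in the output above,
    but is not taken from formulaic itself.
    """
    flat: Dict[str, List[Any]] = {}
    for key, value in values.items():
        child = f"{name}[{key}]"
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the bracketed name.
            flat.update(flatten_nested(child, value))
        else:
            # Leaf: an ordinary column of values.
            flat[child] = list(value)
    return flat


nested = {"a": {"b": [1, 2, 3], "c": [1, 2, 3]}, "b": {"b": [1, 2, 3], "c": [1, 2, 3]}}
columns = flatten_nested("d()", nested)
print(list(columns))
```

Each leaf of the nested structure becomes one numeric column, which is why categorical coding would have to be applied separately to each leaf.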
Interesting, I didn't know that was possible! I think this could work, but there are a couple of details I'm not clear on. For one thing, it looks like Also just want to be sure - the transform Also, how would this work with re-using the generated Would there be an easy way to test out a proof-of-concept implementation of categorical factors for these expanding terms? If I could play around with that aspect, it would probably be easier for me to know if this would work for me.
In [8]: formulaic.model_matrix("d()*d( )", pandas.DataFrame({"x": [1,2,3]})) # note the space
Out[8]:
Intercept d()[a][b] d()[a][c] d()[b][b] ... d()[a][b]:d( )[b][c] d()[a][c]:d( )[b][c] d()[b][b]:d( )[b][c] d()[b][c]:d( )[b][c]
0 1.0 1 1 1 ... 1 1 1 1
1 1.0 2 2 2 ... 4 4 4 4
2 1.0 3 3 3 ... 9 9 9 9
[3 rows x 25 columns]
But it's not exactly the same (there are squares in there), and it's definitely not very satisfying... I'll see if I can think of alternative ways to get this kind of unnesting to work more nicely, but it might end up requiring the parameter-driven approach you suggested above.
That's correct.
That's correct also.
You could get started today by just manually generating the nested dictionary structure, and calling
Okay, thanks for taking the time on this already! I appreciate all the feedback. I played around a bit to see about a nested dictionary mapping onto mutations:
def M(_):
    # Hard-coded per-site values; "" marks rows with no mutation at that site.
    intermediate = pd.DataFrame(
        {"site1": ["", "a", "", "a"], "site2": ["", "", "b", "b"]}
    )
    # One-hot encode each site, drop the "" (no mutation) column, and nest by site.
    return {
        s: pd.get_dummies(c).drop(columns="").to_dict(orient="list")
        for s, c in intermediate.items()
    }
This gives:
mut = pd.Series(["", "a1b", "b2c", "a1b;b2c"])
mm = formulaic.model_matrix("M(mut)", mut.to_frame("mut"))
which looks good! Expanding this to interactions is correct, although the notation isn't ideal (sorry for different formatting):
formulaic.model_matrix("M(mut):M( mut)", mut.to_frame("mut"))
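The M transform above hard-codes its intermediate table. As a rough sketch of how the same nested structure could be derived from the mutation strings themselves, the hypothetical helper nested_indicators below parses each string and builds {site: {residue: 0/1 indicators}}. It is an assumption-laden illustration using only the standard library: it keys each indicator by the new residue (unlike the hard-coded table above), and it has not been wired into formulaic.

```python
import re
from typing import Dict, List

# Assumed mutation format: <parent residue><site number><new residue>, e.g. "a1b".
_MUTATION = re.compile(r"^([A-Za-z])(\d+)([A-Za-z])$")


def nested_indicators(mutations: List[str]) -> Dict[str, Dict[str, List[int]]]:
    """Build {site: {new_residue: 0/1 indicators}} from strings like "a1b;b2c"."""
    n = len(mutations)
    nested: Dict[str, Dict[str, List[int]]] = {}
    for row, text in enumerate(mutations):
        # An empty string means "no mutations" and contributes no tokens.
        for token in filter(None, text.split(";")):
            match = _MUTATION.match(token)
            if match is None:
                raise ValueError(f"Unrecognised mutation: {token!r}")
            _, site, new = match.groups()
            # Create the indicator column on first sight, then mark this row.
            column = nested.setdefault(f"site{site}", {}).setdefault(new, [0] * n)
            column[row] = 1
    return nested


mut = ["", "a1b", "b2c", "a1b;b2c"]
print(nested_indicators(mut))
```

Rows without a mutation at a site stay at 0 in every indicator for that site, so the unmutated parent acts as the implicit reference level.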
It would be interesting to try this with support for categorical factors. But at the moment, I'm leaning towards pushing the logic of generating a representation like
For building the formula, I was thinking some form of aliasing could be used like described here. But I'm not sure this can work for something that essentially represents multiple individual terms? E.g. something like this:
mutations = C(intermediate["site1"]) + C(intermediate["site2"])
formulaic.model_matrix("mutations", intermediate)
doesn't work b/c
Actually playing around with Doing this:
def M(data: pandas.Series):
    def encoder(*args, **kwargs):
        return pandas.get_dummies(intermediate, drop_first=True)
    return FactorValues(data, kind="categorical", encoder=encoder)
means I can run
formulaic.model_matrix("m.M(mut)", mut.to_frame("mut"))
which looks promising. Although interactions don't quite work as intended (I had to use aliasing for this, b/c I was getting an error
Do you think this is something that could be addressed by doing some more digging into how contrasts are currently built for single categorical factors? I can look around and try some more things if this seems like it would work.
@ptonner Apologies for the delay. Life has been busy.
Huh... interesting. What was the actual error message?
Hmm... I'm not sure I understand this. What is the error you are talking about? The above one? Or is the output matrix not what you expected? The biggest problem I see is that you get squared terms like I've thought about this a bit more, and am thinking about introducing an "unnesting" syntax like: The I'm a little torn, because this does add complexity, but I'm curious what you think about this. I'm pretty sure it would solve your use-case... do you think it worthwhile?
No worries on the delay! I'll have to get back to you about the errors, I need to recreate the environment I was testing in. But for your suggestion, it's a bit hard for me to know if this would address my issue. It seems like it would? But I also agree that it might be a bit complex if this is the only use case. As another alternative, would it be possible to support the idea of expanding a regular expression matching multiple columns into separate terms? E.g. something like Also, just FYI I have got a workable pipeline for my purposes similar to this second idea. Essentially:
So all this to say: if this feature would be pretty complex (seems like it would?) then I'm fine closing this for now rather than adding the complexity. I have a workable solution for my needs.
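The regular-expression idea mentioned above could be approximated outside the library by selecting matching column names and generating the terms as a string. The sketch below is one possible way to do that, not the pipeline actually used in this thread: the helper names and the column names (fitness, site_11, ...) are hypothetical. It generates only distinct pairs, which avoids the squared terms seen with the d()*d( ) workaround.

```python
import re
from itertools import combinations
from typing import Iterable, List


def terms_from_pattern(columns: Iterable[str], pattern: str) -> List[str]:
    """Wrap every column whose full name matches `pattern` in C(...)."""
    regex = re.compile(pattern)
    return [f"C({name})" for name in columns if regex.fullmatch(name)]


def formula_with_pairwise(terms: List[str]) -> str:
    """Join main effects plus every distinct pairwise interaction (no self-pairs)."""
    interactions = [f"{a}:{b}" for a, b in combinations(terms, 2)]
    return " + ".join(terms + interactions)


# Hypothetical column names for an already-expanded dataset.
columns = ["fitness", "site_11", "site_28", "site_40"]
terms = terms_from_pattern(columns, r"site_\d+")
formula = formula_with_pairwise(terms)
print(formula)
```

The resulting string uses only the C(...) and : operators, so it could then be passed to formulaic.model_matrix together with the expanded data frame.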
hi, thanks for the library!
I was wondering how best to handle a particular use-case, where a single column can "expand" into multiple categorical factors. The specific context is when analyzing datasets containing information about the effect of genetic mutations on the function of a protein (or other genetic targets). When there's a single starting sequence (sometimes called a "parent" or "wild-type"), it makes sense to represent these mutations as being relative to that parent sequence. So for a single mutation you might have a string like "A11K", which means changing the parent sequence's alanine residue ("A") at site 11 to a lysine ("K"). Then if you have multiple mutations for a single protein, it might look something like "A11K;S28T" for that version of the protein.
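As a small illustration of the notation just described, the following sketch parses a mutation string into its parts. The Mutation type and parse_mutations function are hypothetical names, and the pattern assumes single-letter residue codes separated by semicolons.

```python
import re
from typing import List, NamedTuple


class Mutation(NamedTuple):
    parent: str  # residue in the parent (wild-type) sequence
    site: int    # position in the sequence
    new: str     # residue after the mutation


# Assumed format: one letter, an integer site, one letter (e.g. "A11K").
_MUTATION = re.compile(r"^([A-Za-z])(\d+)([A-Za-z])$")


def parse_mutations(text: str, sep: str = ";") -> List[Mutation]:
    """Parse "A11K;S28T" into Mutation tuples; an empty string means no mutations."""
    mutations = []
    for token in filter(None, (t.strip() for t in text.split(sep))):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"Unrecognised mutation: {token!r}")
        parent, site, new = match.groups()
        mutations.append(Mutation(parent, int(site), new))
    return mutations


print(parse_mutations("A11K;S28T"))
```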
So an example, slightly modified from the published data of this paper, would look like this (subset of the full table):
The most typical way to use this information in machine learning is a one-hot encoding for each possible mutation. But I'd like to be able to leverage formulaic to: (1) reuse a model_spec on new datasets to have a consistent encoding, and (2) generate higher-order interactions between different factors, as this is often a goal of statistical analyses of these data.
So I'm trying to figure out how best to fit this into a pipeline relying on formulaic. There are essentially two steps that have to be done in sequence:
I could already do this "by hand" by transforming the input data for step 1 and then feeding this new matrix into a model_matrix call. E.g. something like
but the challenge here is somewhat already apparent: there's a large number of categorical factors (one for each site) that will be generated (it's not uncommon to have hundreds or thousands of sites mutated in a given dataset). So, the bookkeeping on these sites, and generating an equation that represents each one, can be somewhat unruly. There are two potential solutions I had in mind, and I wanted to know what makes the most sense.
Option 1 (preferred): Make it possible to implement a stateful transform that expands into multiple factors
My ideal API here would be to define a stateful transform that looks something like this:
where M would both do the per-site encoding as well as generate a set of individual categorical factors. I looked into the internals of how C works and it doesn't seem like stateful transforms can operate in this way? Or more generally, is it not possible for an individual formula factor to expand into multiple factors?
The reason I'd prefer this option is that it would simplify usage in more complicated analyses where other terms might be included in the formula. It could also potentially make re-use of model_spec on a new dataset more convenient (for example when a new dataset is missing mutations at a particular site).
Option 2: Build a formula programmatically for a transformed version of the dataset
I'm still digging into how you can accomplish this, but if I understand correctly formulaic already supports programmatic generation of formulas, so I could do the site expansion upstream of processing with formulaic and then generate the categorical terms for each individual site. This would be a reasonable solution, but I just wanted to double check something like option 1 couldn't work as I think it would simplify use for end-users.
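A minimal sketch of what Option 2 might look like, assuming the mutation-string format described earlier: expand the strings into one categorical column per site, then build the formula as a string. The helpers expand_sites and main_effects_formula, the site_<n> column naming, and the "WT" placeholder for unmutated sites are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

# Assumed format: <parent residue><site number><new residue>, e.g. "A11K".
_MUTATION = re.compile(r"^([A-Za-z])(\d+)([A-Za-z])$")


def expand_sites(mutations: List[str], wild_type: str = "WT") -> Dict[str, List[str]]:
    """Return {"site_<n>": [new residue or wild_type per row]} for every mutated site."""
    parsed = []
    for text in mutations:
        row: Dict[int, str] = {}
        for token in filter(None, text.split(";")):
            match = _MUTATION.match(token)
            if match is None:
                raise ValueError(f"Unrecognised mutation: {token!r}")
            _, site, new = match.groups()
            row[int(site)] = new
        parsed.append(row)
    # Sort sites numerically so the column order is stable across datasets.
    sites = sorted({site for row in parsed for site in row})
    return {f"site_{s}": [row.get(s, wild_type) for row in parsed] for s in sites}


def main_effects_formula(columns: Iterable[str]) -> str:
    """Build one categorical term per expanded site column."""
    return " + ".join(f"C({name})" for name in columns)


expanded = expand_sites(["", "A11K", "S28T", "A11K;S28T"])
formula = main_effects_formula(expanded)
print(formula)
```

The expanded dictionary could be turned into a pandas.DataFrame and passed with the formula to formulaic.model_matrix. The reference level for each site would probably need to be set explicitly, since "WT" is not necessarily the level a default treatment coding would pick as the baseline.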