Allow formatting the categorical encoded variables #158

Open
hguturu opened this issue Sep 29, 2023 · 4 comments
Labels
enhancement New feature or request
@hguturu

hguturu commented Sep 29, 2023

Currently they get formatted as C({parameter})[T.{value}], or as {parameter}[T.{value}] if it's already a string.
E.g.,

BinGrp = [0, 0, 0, 1, 1, 1]
becomes
   C(BinGrp)[T.0]  C(BinGrp)[T.1]
0               1               0
1               1               0
2               1               0
3               0               1
4               0               1
5               0               1

It would be nice if we could pass in a format string to get simpler names. E.g. BinGrp0, BinGrp1 if we pass in a format string like "{parameter}{value}"
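To make the request concrete, here is a minimal sketch of the naming step only. The helper `format_level_names` is hypothetical (it is not part of formulaic); it just shows how a template such as "{parameter}{value}" would expand into one column name per level, next to a template that reproduces the current default.

```python
def format_level_names(parameter, values, fmt="{parameter}{value}"):
    """Build one column name per level from a template (hypothetical helper, not a formulaic API)."""
    return [fmt.format(parameter=parameter, value=value) for value in values]


bin_grp = [0, 0, 0, 1, 1, 1]
levels = sorted(set(bin_grp))

# Names produced by the requested format string.
names = format_level_names("BinGrp", levels)
# Names produced by a template equivalent to the current default.
default_names = format_level_names("BinGrp", levels, fmt="C({parameter})[T.{value}]")

print(names)          # ['BinGrp0', 'BinGrp1']
print(default_names)  # ['C(BinGrp)[T.0]', 'C(BinGrp)[T.1]']
```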

Moved from #46 (comment)

@matthewwardrop matthewwardrop added the enhancement New feature or request label Oct 4, 2023
@matthewwardrop
Owner

I think it would be possible to easily add a format argument to the C() function; and the resulting formulae would look something like:

C(A, format="{variable}:{value}")

But presently the "variable" argument would be the entire C(A, format="{variable}:{value}"), not A. You could potentially fix this, but in principle you could encode A differently multiple times in the same formula... so I'm not sure yet whether this approach is worth pursuing.

Can you suggest a syntax that would make sense for you so we can further evaluate this?

@hguturu
Author

hguturu commented Oct 5, 2023

Good point. I was coming more from the perspective of having easier-to-handle variable names.

e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment)"], all_phenotypes)

model = sm.OLS([1,2,3,1,2,3], design).fit()
model.summary()

model.t_test("C(BinGrp, contr.treatment)[T.1] - C(BinGrp, contr.treatment)[T.0]") # impressively works

But, a little cumbersome to do.
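Until something like a format argument exists, one possible workaround is to rename the columns after the design matrix is built. The sketch below only computes the rename mapping with a regular expression; `simplify_names` and the pattern are illustrative assumptions, and the pattern only covers names shaped like the ones shown in this thread.

```python
import re

# Matches names such as "C(BinGrp, contr.treatment)[T.1]" or "C(BinGrp)[T.0]".
_ENCODED = re.compile(r"^C\((?P<variable>\w+)(?:,[^)]*)?\)\[T\.(?P<value>[^\]]+)\]$")


def simplify_names(columns, fmt="{variable}{value}"):
    """Map encoded column names to short ones; leave every other name untouched."""
    renamed = {}
    for column in columns:
        match = _ENCODED.match(column)
        renamed[column] = fmt.format(**match.groupdict()) if match else column
    return renamed


columns = [
    "C(BinGrp, contr.treatment)[T.0]",
    "C(BinGrp, contr.treatment)[T.1]",
    "poly(BinGrp)[1]",
]
mapping = simplify_names(columns)
print(mapping["C(BinGrp, contr.treatment)[T.1]"])  # BinGrp1
print(mapping["poly(BinGrp)[1]"])                  # poly(BinGrp)[1]
```

If the design matrix behaves like a pandas DataFrame, the mapping could then be applied with design.rename(columns=mapping) before fitting, so that the contrast can be written as "BinGrp1 - BinGrp0".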

Similarly if you had multiple encodings e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"], all_phenotypes)
   C(BinGrp, contr.treatment)[T.0]  C(BinGrp, contr.treatment)[T.1]  poly(BinGrp)[1]  exp(BinGrp)
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

Then I think you would specify a format for each one?

Using your suggested syntax:

C(BinGrp, contr.treatment) -> C(BinGrp, contr.treatment, format="{variable}:{value}")
poly(BinGrp) -> poly(BinGrp, format="poly_{variable}_{value}")
exp(BinGrp) -> exp(BinGrp, format="{variable}") # e.g. you just want the value transformed but keep the name (silly transform)


   BinGrp:0  BinGrp:1  poly_BinGrp_1    BinGrp
0         1         0      -0.408248  1.000000
1         1         0      -0.408248  1.000000
2         1         0      -0.408248  1.000000
3         0         1       0.408248  2.718282
4         0         1       0.408248  2.718282
5         0         1       0.408248  2.718282

If format is not provided it falls back to the default?

@matthewwardrop
Owner

Hmmmm... adding format arguments to every method is not really viable (we are just proxying numpy methods, and this wouldn't work for aliasing variables outside of a function call). We could obviously wrap these methods, but I'm not convinced this is a good idea.

After reflecting more on this, I think sensible (non-mutually exclusive) ways forward might include:

  1. Adding support for format strings to categorical features, to allow overriding how column names are combined with their levels, e.g.: C(X, fmt='{variable}.{level}')
  2. Adding an aliasing operator along the lines of y ~ ("my_name" := C(X, fmt='...'))
  3. Better documenting the existing aliasing functionality:
import pandas
from formulaic import model_matrix
from formulaic.transforms import C

data = pandas.DataFrame({"X": ['a', 'b', 'c']})

my_var = C(data.X)
model_matrix("~ my_var", data)  # "y ~ my_var" would also need a "y" column in data

I think I am leaning toward (1) and (3). I would consider implementing within-formula aliasing if there were enough demand for it... but remain unconvinced at present.

@hguturu
Author

hguturu commented Oct 12, 2023

I wasn't aware of (3). I tried it and it almost works, but the value part of the name is still formatted the default way.
e.g.

my_var = C(data.X)
model_matrix("~ my_var", data)

   Intercept  my_var[T.b]  my_var[T.c]
0        1.0            0            0
1        1.0            1            0
2        1.0            0            1

But, I was digging into the code a little bit and I realized there may be a simple enough way to get what is desired (although perhaps not stable across versions due to not being a "blessed" API).

import pandas
from formulaic import model_matrix
import formulaic

data = pandas.DataFrame({"X": ['a', 'b', 'c']})
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~C(X)", data)

   Intercept  C(X).b  C(X).c
0        1.0       0       0
1        1.0       1       0
2        1.0       0       1

This is almost the desired output. The C(X) wrapper is still stored in the name, but perhaps there is a similar hack for that as well, if I can track down where the name is being set.

I could do

from formulaic.transforms import C
my_var = C(data.X)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~my_var", data)

   Intercept  my_var.b  my_var.c
0        1.0         0         0
1        1.0         1         0
2        1.0         0         1

and that gets me exactly what is needed, but it requires knowing which variables in the formula are contrast-coded, which involves parsing the formula.
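For simple formulas, the parsing step could be approximated with a small scanner like the one below. This is only a rough sketch under stated assumptions: `categorical_variables` is a hypothetical helper, it does not use formulaic's own parser, and it ignores edge cases such as commas or parentheses inside quoted strings.

```python
def _is_identifier_char(ch):
    return ch.isalnum() or ch == "_"


def categorical_variables(formula):
    """Return the first argument of every C(...) call found in a formula string."""
    found = []
    i = 0
    while i < len(formula):
        # "C(" only counts when it is not the tail of a longer name such as "abC(".
        starts_call = formula.startswith("C(", i) and (
            i == 0 or not _is_identifier_char(formula[i - 1])
        )
        if not starts_call:
            i += 1
            continue
        start = j = i + 2
        depth = 1
        first_arg_end = None
        while j < len(formula) and depth > 0:
            ch = formula[j]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and first_arg_end is None:
                    first_arg_end = j
            elif ch == "," and depth == 1 and first_arg_end is None:
                first_arg_end = j
            j += 1
        # Unbalanced calls are skipped rather than guessed at.
        if first_arg_end is not None:
            found.append(formula[start:first_arg_end].strip())
        i = j
    return found


formula = "C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"
print(categorical_variables(formula))  # ['BinGrp']
```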

By chance, is there a similar format constant I can play with to get the formatting needed without an official format support?

It already works when I don't explicitly ask for a contrast coding and instead convert the values to strings.

from formulaic.transforms import C
data.X = data.X.astype(str)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~ X", data)

   Intercept  X.b  X.c
0        1.0    0    0
1        1.0    1    0
2        1.0    0    1
