Allow formatting the categorical encoded variables #158

hguturu · 2023-09-29T21:11:54Z

Currently they get formatted as C({parameter})[T.{value}], or {parameter}[T.{value}] if it's already a string.
E.g.,

BinGrp = [0, 0, 0, 1, 1, 1]
becomes
   C(BinGrp)[T.0]  C(BinGrp)[T.1]
0               1               0
1               1               0
2               1               0
3               0               1
4               0               1
5               0               1

It would be nice if we could pass in a format string to get simpler names. E.g. BinGrp0, BinGrp1 if we pass in a format string like "{parameter}{value}"
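Until something like this exists, one workaround is to rename the columns after the matrix is built. The following is only a rough sketch: `simplify_name` and its regular expression are hypothetical helpers written for this example, not part of formulaic, and they assume the default naming shown above.

```python
import re

# Matches the default names described above, e.g. "C(BinGrp)[T.0]",
# "C(BinGrp, contr.treatment)[T.0]" or "BinGrp[T.0]".
_ENCODED = re.compile(
    r"^(?:C\((?P<wrapped>[^,()]+)(?:,[^()]*)?\)|(?P<bare>[^\[\]()]+))"
    r"\[T\.(?P<value>[^\]]+)\]$"
)


def simplify_name(column, fmt="{parameter}{value}"):
    """Hypothetical helper: rewrite one encoded column name using `fmt`."""
    match = _ENCODED.match(column)
    if match is None:
        return column  # leave other columns (e.g. "Intercept") untouched
    parameter = (match.group("wrapped") or match.group("bare")).strip()
    return fmt.format(parameter=parameter, value=match.group("value"))


columns = ["Intercept", "C(BinGrp)[T.0]", "C(BinGrp)[T.1]", "Sex[T.M]"]
renamed = [simplify_name(c) for c in columns]
print(renamed)
```

With a pandas DataFrame, something like `design.rename(columns=simplify_name)` should apply the same mapping to every column, since `DataFrame.rename` accepts a callable.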

Moved from #46 (comment)


matthewwardrop · 2023-10-04T21:44:50Z

I think it would be possible to easily add a format argument to the C() function; and the resulting formulae would look something like:

C(A, format="{variable}:{value}")

But presently the "variable" argument would be the entire C(A, format="{variable}:{value}"), not A. You could potentially fix this, but in principle you could encode A differently multiple times in the same formula... so I'm not sure yet whether this approach is worth pursuing.
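To make the two concerns concrete, here is a small illustration using plain Python string formatting. It does not call formulaic at all; the variable names and the level values "a" and "b" are made up for the example, and the second half reflects one reading of the "encoded multiple times" concern.

```python
fmt = "{variable}:{value}"

# Concern 1: the factor's name is the whole expression, so substituting it
# for "variable" reproduces the entire C(...) call in the column name.
factor_expr = 'C(A, format="{variable}:{value}")'
full_name = fmt.format(variable=factor_expr, value="a")
print(full_name)

# Concern 2: if "variable" were reduced to just "A", two different encodings
# of A using the same format would produce identical column names.
first_encoding = [fmt.format(variable="A", value=v) for v in ["a", "b"]]
second_encoding = [fmt.format(variable="A", value=v) for v in ["a", "b"]]
collisions = set(first_encoding) & set(second_encoding)
print(sorted(collisions))
```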

Can you suggest a syntax that would make sense for you so we can further evaluate this?

hguturu · 2023-10-05T04:32:37Z

Good point. I was coming more from the perspective of having easier to handle variable names.

e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment)"], all_phenotypes)

model = sm.OLS([1,2,3,1,2,3], design).fit()
model.summary()

model.t_test("C(BinGrp, contr.treatment)[T.1] - C(BinGrp, contr.treatment)[T.0]") # impressively works

But it is a little cumbersome to do.

Similarly if you had multiple encodings e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"], all_phenotypes)
   C(BinGrp, contr.treatment)[T.0]  C(BinGrp, contr.treatment)[T.1]  poly(BinGrp)[1]  exp(BinGrp)
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

Then I think you would specify a format for each one?

Using your suggested syntax:

C(BinGrp, contr.treatment) -> C(BinGrp, contr.treatment, format="{variable}:{value}")
poly(BinGrp) -> poly(BinGrp, format="poly_{variable}_{value}")
exp(BinGrp) -> exp(BinGrp, format="{variable}") # e.g. you just want the value transformed but keep the name (silly transform)


                      BinGrp:0                              BinGrp:1   poly_BinGrp_1    BinGrp
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

If format is not provided it falls back to the default?

matthewwardrop · 2023-10-12T00:40:28Z

Hmmmm... adding format arguments to every method is not really viable (we are just proxying numpy methods, and this wouldn't work for aliasing variables outside of a function call). We could obviously wrap these methods, but I'm not convinced this is a good idea.

After reflecting more on this, I think sensible (non-mutually exclusive) ways forward might include:

1. Adding support for format strings to categorical features, to allow overriding how column names are combined with their levels, e.g.: C(X, fmt='{variable}.{level}')
2. Adding an aliasing operator along the lines of y ~ ("my_name":=C(X, fmt='...'))
3. Better documenting the existing aliasing functionality:

import pandas
from formulaic import model_matrix
from formulaic.transforms import C

data = pandas.DataFrame({"X": ['a', 'b', 'c']})

my_var = C(data.X)
model_matrix("~ my_var", data)  # no left-hand side: `data` has no "y" column

I think I am leaning toward (1) and (3). I would consider implementing within formula aliasing if there were enough demand for it... but remain unconvinced at present.

hguturu · 2023-10-12T01:19:02Z

I wasn't aware of 3. I tried it and it almost works, but the value var is still formatted differently.
e.g.

my_var = C(data.X)
model_matrix("~ my_var", data)

   Intercept  my_var[T.b]  my_var[T.c]
0        1.0            0            0
1        1.0            1            0
2        1.0            0            1

But, I was digging into the code a little bit and I realized there may be a simple enough way to get what is desired (although perhaps not stable across versions due to not being a "blessed" API).

import pandas
from formulaic import model_matrix
import formulaic

data = pandas.DataFrame({"X": ['a', 'b', 'c']})
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~C(X)", data)

   Intercept  C(X).b  C(X).c
0        1.0       0       0
1        1.0       1       0
2        1.0       0       1

This is almost the desired output. The C(X) wrapper is still stored in the name. But perhaps there is a similar hack for this as well, if I can track down where the name is being set?

I could do

from formulaic.transforms import C
my_var = C(data.X)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~my_var", data)

   Intercept  my_var.b  my_var.c
0        1.0         0         0
1        1.0         1         0
2        1.0         0         1

and that gets me exactly what is needed, but it requires knowing which variables in the formula are contrast-coded, which means parsing the formula.

By chance, is there a similar format constant I can play with to get the formatting needed without an official format support?

It already works when I don't explicitly ask for a contrast coding and instead convert the values to strings.

from formulaic.transforms import C
data.X = data.X.astype(str)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~ X", data)

   Intercept  X.b  X.c
0        1.0    0    0
1        1.0    1    0
2        1.0    0    1
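Since assigning to FACTOR_FORMAT changes a class attribute for the whole process, it may be safer to scope the override. Below is a minimal sketch of a context manager that restores the previous value on exit. `override_attr` is a hypothetical helper, and `FakeTreatmentContrasts` is a stand-in class so the snippet runs without formulaic; its default format string is assumed from the outputs shown earlier in this thread.

```python
from contextlib import contextmanager


@contextmanager
def override_attr(owner, name, value):
    """Temporarily set `owner.<name>` to `value`, restoring it on exit."""
    original = getattr(owner, name)
    setattr(owner, name, value)
    try:
        yield
    finally:
        setattr(owner, name, original)


# Stand-in for formulaic.transforms.contrasts.TreatmentContrasts
# (assumption: its default format looks like the one below).
class FakeTreatmentContrasts:
    FACTOR_FORMAT = "{name}[T.{field}]"


with override_attr(FakeTreatmentContrasts, "FACTOR_FORMAT", "{name}.{field}"):
    inside = FakeTreatmentContrasts.FACTOR_FORMAT.format(name="X", field="b")

outside = FakeTreatmentContrasts.FACTOR_FORMAT.format(name="X", field="b")
print(inside, outside)
```

In real use the `with` block would wrap the `model_matrix(...)` call and target the actual TreatmentContrasts class. This still relies on the same non-"blessed" attribute, so it could break across versions.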

matthewwardrop added the enhancement New feature or request label Oct 4, 2023

ptonner mentioned this issue Dec 7, 2023

Handling individual columns that can expand into multiple columns #163

Open

matthewwardrop added this to the 1.2.0 milestone Dec 16, 2024
