Variable labels as a dataframe field #11179

Closed
cdagnino opened this issue Sep 23, 2015 · 10 comments
Labels
API Design Needs Discussion Requires discussion from core team before further action

Comments

@cdagnino

I use both Stata and Pandas. Many Stata users save variable labels to describe the columns in a clearer way than the names. Running this in Stata

sysuse auto.dta
describe

gives something like

variable name storage type variable label
make str18 Make and Model
price int Price
mpg int Mileage (mpg)

For me (maybe for others too) it would be useful to have an optional field in a DataFrame with a column label dictionary. The keys would be the columns (not necessarily all of them) and the values the string labels.

This already exists as the variable_labels method of pandas.io.stata.StataReader (see the docs), which lets you import these labels when reading a Stata .dta file.

I know I could just carry around a dictionary with this information, but I think it's cleaner and less error prone to set it and save it within a DataFrame.

Additionally, storing this would allow a Stata/pandas round trip without loss of information, since to_stata could check whether this field exists. (to_stata might already accept a variable_labels dictionary as an option, but I didn't see it documented.)
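
For reference, a minimal sketch of the round trip described above. It assumes a recent pandas release in which DataFrame.to_stata accepts a variable_labels argument; the file name and the sample values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Illustrative data and labels (assumed values, not taken from auto.dta itself)
df = pd.DataFrame({"price": [4099, 4749], "mpg": [22, 17]})
labels = {"price": "Price", "mpg": "Mileage (mpg)"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "auto_small.dta")  # hypothetical file name

    # Write the labels into the .dta file alongside the data
    df.to_stata(path, write_index=False, variable_labels=labels)

    # Read them back through StataReader, as mentioned in the comment above
    with pd.read_stata(path, iterator=True) as reader:
        restored_labels = reader.variable_labels()
        restored = reader.read()

print(restored_labels)
```

The labels still travel in a separate dictionary on the pandas side, which is the inconvenience this issue is about.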

My coding prowess is quite limited, but I'd be happy to at least write test code and help out if somebody starts out.

@jreback
Contributor

jreback commented Sep 24, 2015

This would involve attaching additional meta-data to the Index object, specifically a matching list / dict of value -> description. But this would then raise quite a few issues. Maybe you can provide some pseudo examples of what you think about the following.

  • what would the Index constructor look like? What would be a natural way to specify these? e.g.
    i = Index([1,2,3],desc=[....])?
  • When/how/what would you repr these? E.g. you are showing basically df.info(). We already have a pretty complicated repr, e.g. (and this is not even a multi-index)
In [3]: df = DataFrame([[1,2]],index=Index([1],name='foo'),columns=Index(['A','B'],name='bar'))

In [4]: df
Out[4]: 
bar  A  B
foo      
1    1  2
  • aside from 'desc' or 'labels' of the data, how is this useful? These are certainly not applicable to, say, 'quantities/units' (which is much more of a dtype specification).

@jreback jreback added API Design Needs Discussion Requires discussion from core team before further action labels Sep 24, 2015
@cdagnino
Author

I see something similar was raised in #39, which was closed in favor of the general issue of allowing metadata for a DataFrame in #2485.

Let me first be more explicit about the use case and then try to answer some of your questions.

I'll take columns from different sources or create new ones. Exactly what they mean or how they were created doesn't fit into the name. In Stata I'd add a longer description to document this and a quick describe is good for refreshing memory. In pandas I'm thinking something like:

df = pd.DataFrame({'x': [3, 1], 'y' : [8, 2], 'z' : [1.1, 2.0]})
df.set_variable_labels({'x': 'This is variable x', 'y': 'This is another variable'}) # No need to specify all columns
df.info()   # Gives the table without labels (the same info given in current pandas)
df.info(labels=True)   # Gives a table with the labels
df.variable_labels  # Gives the column dictionary

Like I said before, I could carry around this metadata in a separate dictionary, but I think it would be nice to have in the DataFrame, especially if it can persist after doing some changes.

  • I don't think it's worth it to get it into the repr, but rather it could go as an option in the df.info()
  • It looks to me that adding this to the (column) Index object would be a lot of work. I was hoping there could be a way of assigning it with @property and then just appending it to the original df.info(). After modifying the DataFrame, the variable_labels dictionary could have some keys (columns) that don't exist anymore, but I don't think that would be a problem.

I'm guessing the big issue here is persistence, but at this time I don't have enough knowledge of Pandas internals to say anything more helpful.
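
One way to approximate the proposal above without touching the Index is sketched below. It stores the dictionary in DataFrame.attrs, which pandas documents as experimental. The helper names set_variable_labels and describe_labels are hypothetical and are not part of pandas.

```python
import pandas as pd


def set_variable_labels(df, labels):
    """Hypothetical helper: keep labels for existing columns in df.attrs."""
    known = {col: text for col, text in labels.items() if col in df.columns}
    df.attrs.setdefault("variable_labels", {}).update(known)
    return df


def describe_labels(df):
    """Hypothetical helper: one row per column with its dtype and label."""
    stored = df.attrs.get("variable_labels", {})
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "label": [stored.get(col, "") for col in df.columns],
        }
    )


df = pd.DataFrame({"x": [3, 1], "y": [8, 2], "z": [1.1, 2.0]})
# No need to label every column; unknown keys such as "w" are ignored
set_variable_labels(df, {"x": "This is variable x", "w": "not a column"})
table = describe_labels(df)
print(table)
```

Whether attrs survives a given operation depends on the pandas version and the operation, so the persistence concern raised above still applies.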

@mbirdi

mbirdi commented Oct 25, 2015

Giving column names an additional label property seems like an interesting feature from Stata. But as a pandas user I don't think I would use it. I like to keep my column names simple. The columns in a DataFrame are also Series objects, and having just one name for them works well for me and how I use pandas.

For example, I would take the variable names in the first example: make, price, and mpg, and would change them to make_model, price_dollars, mileage_mpg.

I do lose track of my column names from time to time. But when that happens I just create a col_names variable from the DataFrame.columns attribute.
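
The renaming approach described above can be sketched as follows. The row values are illustrative sample data, not output from the thread.

```python
import pandas as pd

# One made-up row using the column names from the first example
df = pd.DataFrame({"make": ["AMC Concord"], "price": [4099], "mpg": [22]})

# Fold the description into the column name itself
renamed = df.rename(
    columns={"make": "make_model", "price": "price_dollars", "mpg": "mileage_mpg"}
)

# Keep a plain list of the names around for quick reference
col_names = list(renamed.columns)
print(col_names)
```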

@jreback jreback added this to the Someday milestone Oct 26, 2015
@msampathkumar

Hei, I like this idea :)

So I created a small code snippet. I'm new to open source, so please share some suggestions.

from pandas import DataFrame

class myDataFrame(DataFrame):
    # Register the attribute as metadata so pandas does not mistake it for a column
    _metadata = ['columns_labels']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.columns_labels = {}

    def columns_description(self):
        # Print one tab-separated row per column: name, dtype, label
        print('\t'.join(['Column', 'Type', 'Description']))
        for name, dtype in self.dtypes.items():
            print('\t'.join([str(name), str(dtype), self.columns_labels.get(name, '')]))

    def update_columns_description(self, input_dict):
        # Only keep labels for columns that exist in this frame (self, not a global df)
        for key in input_dict:
            if key in self.columns:
                self.columns_labels[key] = input_dict[key]

df = myDataFrame({'x': [3, 1], 'y' : [8, 2], 'z' : [1.1, 2.0]})
df.columns_description()
df.update_columns_description({'x': "I'm not so sure", 'y':'Hi there!', 'z': 'want to grab some coffee with me :)'})
df.columns_description()

@donnaaboise

(I came across this issue from a comment on my Stack Exchange post on this issue.)

I am new to Pandas, and am using it for the first time in a Jupyter Notebook. I love the way Pandas displays table data, and developers have clearly put effort into making the display nice (different formatting styles, shaded table rows, etc). So it seemed obvious to me that there must be a way to have column labels (for display only) that are different from the dictionary keys. I was surprised to find that this feature didn't exist.

Here is why I think it would be a really nice feature.

  • The ability to manipulate data using short (single variable?) dictionary keys makes mathematical expressions much cleaner. I would much rather use df["e"] in a mathematical expression than
    df["Efficiency (%)"].

  • On the other hand, "e" makes for a bad column header for tables used for presentation purposes, or even just to remember what each column is.

This issue seems especially important in Jupyter Notebooks, which are designed for presentation purposes as well as actual computing.

@jreback
Contributor

jreback commented Jan 13, 2018

@donnaaboise as you can see from the comments above, I don't think we would object to having this, but practically it's quite a lot of work and there are lots of unanswered questions.

  • how would the 'labels' be specified (maybe thru an alternative index)
  • these would naturally have to propagate, this would lead to quite a lot of complexity, just having name propagate properly is hard
  • how would conflicts between the index and the 'label' be handled? what if they had some overlapping values?

@donnaaboise

Perhaps this additional meta data doesn't need to be stored at all, but only recognized in a formatter. For example, it is nice that columns can be formatted independently, i.e.

df.style.format({'e' : '{:8.2f}%', 't' : '{:12.3f}'})

Could this style also accept header labels? Something like :

df.style.format(formatstr={'e' : '{:8.2f}%', 't' : '{:12.3f}'}, labels={'e' : 'Efficiency (%)', 't' : 'Time'})

If one simply types

df

at a command prompt, the variable names are printed instead (no labels). Only when a style is specified are labels used instead (if desired).

@jreback
Contributor

jreback commented Jan 13, 2018

you can already just rename things (and then chain with .style)

In [5]: df
Out[5]: 
   A  B
0  1  4
1  2  5
2  3  6

In [6]: df.rename(columns={'A':'A long version', 'B': 'B long version'})
Out[6]: 
   A long version  B long version
0               1               4
1               2               5
2               3               6

@donnaaboise

yes - this seems like a very good approximation. The only minor drawback I can see is that the format dictionary passed to chained style.format now has to use the long names to format columns. But these can be accessed through a renaming dictionary. Something like :

di = {'e' : 'Efficiency', 't' : 'Time'}
fstr = {di["e"] : '{:8.2f}%', di["t"] : '{:12.3f}'}
df.rename(columns=di).style.format(fstr) 

@mroeschke
Member

I believe the current _metadata attribute might be able to solve this issue (https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties); therefore, I think this issue is solved by this feature. Happy to reopen this issue if _metadata doesn't completely address this use case
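
A minimal sketch of the _metadata approach, following the subclassing pattern in the linked extending guide. The class name LabeledFrame and the attribute name variable_labels are assumptions chosen for this thread, not pandas API.

```python
import pandas as pd


class LabeledFrame(pd.DataFrame):
    # Names listed in _metadata are copied to results of pandas operations
    _metadata = ["variable_labels"]

    @property
    def _constructor(self):
        # Make pandas operations return LabeledFrame rather than DataFrame
        return LabeledFrame


df = LabeledFrame({"x": [3, 1], "y": [8, 2], "z": [1.1, 2.0]})
df.variable_labels = {"x": "This is variable x", "y": "This is another variable"}

# Selecting columns keeps both the subclass and the labels
subset = df[["x", "z"]]
print(type(subset).__name__, subset.variable_labels)
```

The dictionary is passed along by reference and is not pruned, so the subset still carries the label for y. That matches the earlier remark that stale keys would not be a problem.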
