@@ -150,10 +150,10 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

-.. _categorical.objectcreation.frame:
+.. _categorical.objectcreation.existingframe:

-Creating categories from a ``DataFrame``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Creating categories from an existing ``DataFrame``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 .. versionadded:: 0.22.0

@@ -169,15 +169,6 @@ if a column does not contain all labels:
     df['A'].dtype
     df['B'].dtype

-Note that this behavior is different than instantiating a ``DataFrame`` with categorical dtype, which will only assign
-categories to each column based on the labels present in each column:
-
-.. ipython:: python
-
-    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
-    df['A'].dtype
-    df['B'].dtype
-
 When using ``astype``, you can control the categories that will be present in each column by passing
 a ``CategoricalDtype``:

@@ -199,6 +190,72 @@ discussed hold with subselection.
     df[['A', 'B']] = df[['A', 'B']].astype('category')
     df.dtypes

+Note that you can use ``apply`` to set categories on a per-column basis:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})
+    df = df.apply(lambda x: x.astype('category'))
+    df['A'].dtype
+    df['B'].dtype
+
+
+.. _categorical.objectcreation.frameconstructor:
+
+Creating categories from the ``DataFrame`` constructor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. versionchanged:: 0.22.0
+
+.. warning::
+
+    Prior to version 0.22.0, the default behavior of the ``DataFrame`` constructor when a categorical dtype was
+    passed was to operate on a per-column basis, meaning that only labels present in a given column would be categories
+    for that column.
+
+    To promote consistency of behavior, from version 0.22.0 onwards instantiating a ``DataFrame`` with categorical
+    dtype will by default use all labels present in all columns when setting categories, even if a column does not
+    contain all labels. This is consistent with the new ``astype`` behavior described above.
+
+Behavior prior to version 0.22.0:
+
+.. code-block:: ipython
+
+    In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+
+    In [3]: df
+    Out[3]:
+       A  B
+    0  a  c
+    1  b  d
+    2  c  e
+
+    In [4]: df['A'].dtype
+    Out[4]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
+
+    In [5]: df['B'].dtype
+    Out[5]: CategoricalDtype(categories=['c', 'd', 'e'], ordered=False)
+
+Behavior from version 0.22.0 onwards:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+    df
+    df['A'].dtype
+    df['B'].dtype
+
+As with ``astype``, you can control the categories that will be present in each column by passing
+a ``CategoricalDtype``:
+
+.. ipython:: python
+
+    dtype = CategoricalDtype(categories=list('abdef'), ordered=True)
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype=dtype)
+    df
+    df['A'].dtype
+    df['B'].dtype
+
 .. _categorical.categoricaldtype:

 CategoricalDtype
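
The per-column and shared-category results that this diff documents can be obtained explicitly, without depending on what a given pandas version does by default. The following is a minimal sketch, not part of the patch: it assumes a pandas release where `CategoricalDtype` is importable from `pandas.api.types`, and the names `raw`, `per_column`, `shared_dtype`, and `shared` are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same toy data as in the documentation examples.
raw = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})

# Per-column categories: each column is converted on its own, so its
# categories are only the labels that column actually contains.
per_column = raw.apply(lambda col: col.astype('category'))

# Shared categories: one explicit CategoricalDtype is applied to every
# column, so all columns end up with the same category set.
shared_dtype = CategoricalDtype(categories=list('abcde'), ordered=False)
shared = raw.astype(shared_dtype)

print(per_column['A'].cat.categories.tolist())
print(per_column['B'].cat.categories.tolist())
print(shared['A'].cat.categories.tolist())
print(shared['A'].dtype == shared['B'].dtype)
```

Spelling out the dtype this way keeps the outcome independent of the constructor and `astype('category')` defaults, which is the behavior the warning in the diff is concerned with.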