@trivialfis (Member) commented Jul 29, 2025

Ref #11088

  • Use the __arrow_c_device_array__ in cuDF 25.06.
  • Reuse the feature_types for reference encoding.
  • Change the columnar schema to include a handle to the cat container.
  • Support training continuation through re-coding the DMatrix.
  • Handle invalid input type.
  • Support all integer types.
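
For readers unfamiliar with the first item: `__arrow_c_device_array__` is the device-aware export method of the Arrow PyCapsule interface, alongside the host-only `__arrow_c_array__`. The sketch below is not code from this PR. It only illustrates, under that assumption, how a consumer might check which export protocol an input object advertises. The helper `arrow_export_kind` and the two stand-in classes are hypothetical names introduced for illustration.

```python
from typing import Any, Optional


def arrow_export_kind(obj: Any) -> Optional[str]:
    """Report which Arrow PyCapsule export method `obj` advertises.

    Hypothetical helper: prefers the device-aware protocol when both exist.
    """
    if hasattr(obj, "__arrow_c_device_array__"):
        return "device"
    if hasattr(obj, "__arrow_c_array__"):
        return "host"
    return None


class HostOnly:
    """Stand-in for a container that only exports host memory."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class DeviceAware:
    """Stand-in for a container that can export device memory."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        raise NotImplementedError("illustration only")


kinds = [arrow_export_kind(x) for x in (HostOnly(), DeviceAware(), object())]
print(kinds)
```

The check is purely structural (duck typing), so it does not import cuDF or pyarrow.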

todos

  • Handle inplace predict fallback.
  • Remove the container in the GPU predictor.
  • Test predict with re-coded DMatrix. Skip storing categories if a reference is present.

@trivialfis requested a review from Copilot July 29, 2025 20:03

@trivialfis requested a review from Copilot July 30, 2025 08:31

@trivialfis requested a review from Copilot July 31, 2025 09:45
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements training continuation support for categorical encoders, enabling XGBoost to handle changing category encodings between training and prediction phases. It adds support for using the __arrow_c_device_array__ interface in cuDF 25.06 and includes comprehensive handling of all integer types for categorical features.

  • Introduce support for training continuation with categorical features by allowing reference categories for re-coding
  • Add complete support for all unsigned integer types (uint8_t, uint16_t, uint32_t, uint64_t) in categorical encoding
  • Implement __arrow_c_device_array__ interface support for cuDF 25.06 compatibility
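
To make "re-coding" against reference categories concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation and uses no XGBoost API. The function name `recode`, the use of `-1` as the missing-value code, and the choice to raise on categories absent from the reference are all assumptions made for illustration.

```python
from typing import List, Sequence


def recode(
    ref_categories: Sequence[str],
    new_categories: Sequence[str],
    new_codes: Sequence[int],
    missing: int = -1,  # assumed sentinel for missing values
) -> List[int]:
    """Translate codes from the new encoding into the reference encoding."""
    # Position of each category in the reference (training-time) encoding.
    ref_index = {cat: i for i, cat in enumerate(ref_categories)}

    # mapping[i] is the reference code of the i-th category in the new data.
    mapping = []
    for cat in new_categories:
        if cat not in ref_index:
            # Assumption: categories unknown to the reference are an error.
            raise ValueError(f"category {cat!r} not in reference")
        mapping.append(ref_index[cat])

    return [missing if code == missing else mapping[code] for code in new_codes]


# The reference encoded a=0, b=1, c=2; the new data encodes c=0, a=1.
recoded = recode(["a", "b", "c"], ["c", "a"], [0, 1, 1, -1, 0])
print(recoded)  # [2, 0, 0, -1, 2]
```

After re-coding, the same category name maps to the same integer as in the reference, so a model trained on the reference encoding sees consistent values.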

Reviewed Changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/python/test_with_polars.py | Update test to use new export_to_arrow=True parameter |
| tests/python/test_ordinal.py | Fix typo in function name and add new test functions |
| tests/python-gpu/test_gpu_ordinal.py | Add mixed device tests and new ordinal test functions |
| tests/cpp/test_serialization.cc | Add test coverage for new unsigned integer array types |
| tests/cpp/data/test_cat_container.h | Update constructor call to use Context parameter |
| src/tree/gpu_hist/evaluate_splits.cu | Remove deprecated CUB version check |
| src/predictor/predict_fn.h | Move accessor classes to cat_container.h |
| src/predictor/gpu_predictor.cu | Refactor to use shared accessor functions |
| src/predictor/cpu_predictor.cc | Update sparse page view interface and use shared accessors |
| src/gbm/gbtree_model.h | Add CatsShared() method for shared pointer access |
| src/encoder/ordinal.h | Add support for all unsigned integer types |
| src/encoder/ordinal.cuh | Improve type checking and error messages |
| src/data/sparse_page_dmatrix.cc | Update CatContainer constructor call |
| src/data/simple_dmatrix.cu | Handle reference categorical data and encoding |
| src/data/simple_dmatrix.cc | Add support for encoded columnar batches |
| src/data/quantile_dmatrix.cu | Update CatContainer constructor |
| src/data/proxy_dmatrix.h | Add reference categories support |
| src/data/proxy_dmatrix.cuh | Handle reference categorical encoding |
| src/data/proxy_dmatrix.cu | Add type utilities and reference categories |
| src/data/proxy_dmatrix.cc | Improve DMatrix creation from proxy |
| src/data/gradient_index.cc | Add template instantiation for new batch type |
| src/data/entry.h | Move entry-related structures to dedicated header |
| src/data/ellpack_page.cu | Add specialization for encoded cuDF adapter |
| src/data/device_adapter.cuh | Implement encoded adapter batch and cuDF improvements |
| src/data/device_adapter.cu | Add reference categories parsing for cuDF |
| src/data/data.cc | Update includes and add template instantiation |
| src/data/columnar.h | Add helper functions for arrow-based categorical data |
| src/data/cat_container.h | Add accessor classes and CPU implementation |
| src/data/cat_container.cuh | Add CUDA implementation for category accessors |
| src/data/cat_container.cu | Update constructor and improve memory handling |
| src/data/cat_container.cc | Add support for all unsigned integer types |
| src/data/array_interface.h | Add TypeStr method declaration |
| src/data/array_interface.cc | Implement TypeStr method for better error messages |
| src/data/adapter.h | Refactor adapter interfaces and add encoding support |
| src/data/adapter.cc | Add reference categories parsing for columnar adapter |
| src/common/type.h | Add GetValueT utility type alias |
| src/common/quantile.cc | Add template instantiation for encoded batch |
| src/common/json.cc | Add support for new unsigned integer array types |
| src/common/hist_util.cuh | Remove deprecated CUB version checks |
| src/common/device_vector.cuh | Add missing include |
| src/common/column_matrix.h | Update includes for moved structures |
| src/c_api/c_api.cc | Update category API function signatures |
| python-package/xgboost/testing/utils.py | Add assert_allclose utility function |
| python-package/xgboost/testing/ordinal.py | Comprehensive test suite for categorical encoding |
| python-package/xgboost/testing/federated.py | Update type annotations |
| python-package/xgboost/testing/dask.py | Update type annotations |
| python-package/xgboost/data.py | Add Categories type support and reference handling |
| python-package/xgboost/core.py | Update Categories class and API calls |
| python-package/xgboost/callback.py | Update type annotations |
| python-package/xgboost/_typing.py | Add new type definitions |
| python-package/xgboost/_data_utils.py | Major refactor for reference categories and arrow support |
| python-package/pyproject.toml | Add CUDA to extension whitelist |
| include/xgboost/json_io.h | Add visitor methods for new array types |
| include/xgboost/json.h | Add new unsigned integer array value kinds |
| doc/python/python_api.rst | Add Categories class to documentation |

@trivialfis changed the title from "[wip][enc] Support training continuation." to "[enc] Support training continuation." Jul 31, 2025
@trivialfis marked this pull request as ready for review July 31, 2025 09:56
@trivialfis (Member, Author) commented

cc @rongou.

arrow.

sketch of the container.

mapping.

aif.

Pass it down.

Extract the names.

store it.

rename.

sketch of the accessor.

Pass in the batch.

Cleanup.

work on test.

copy

work on pandas.

Use list to keep the order.

no print.

Start the work on QDM.

Cleanup.

typing.

alias.

Fix.

import skip.

cleanup.

outdated

test.

note.

Move initialization.

Get cats cudf.

In the adapter.

Copy.

Work on CUDA test.

Setters.

rename.

static.

Work on cuDF acc.

fix.

cleanup.

assert.

doc string.

dispatch.

get cats.

Check.

cleanup.

Work on numeric index.

Numeric.

typo.

Remove.

Notes.

Split up arrow utilities.

Notes.

Reference.

Fix note.

Hide.

typo.

Notes.

hide.

Test.

Work on removing pyarrow as dep.

Merge.

Move.

Move.

Use the handle for the host.

Work on device.

Use host storage.

Cleanup.

Work on training continuation.

device.

More.

Cleanup the type hints.

more checks.

Work on predict check.

type hints.

More.

Tests.

params.

small optimization.

modin series.

arrow schema.

check.

Device array.

debug

fix.

Wait.

cleanup.

lint.

cleanup.

Cleanup.

Lost track.

lint.

Note.

cleanup.

Work on update.

cleanup.

Polars.

@trivialfis merged commit 9261f05 into dmlc:master Jul 31, 2025
86 of 90 checks passed
@trivialfis deleted the enc-arrow-c-array branch July 31, 2025 20:56