[enc] Support training continuation. #11598
Pull Request Overview
This PR implements training continuation support for categorical encoders, enabling XGBoost to handle category encodings that change between training and prediction. It adds support for the `__arrow_c_device_array__` interface in cuDF 25.06 and handles all unsigned integer types for categorical features.
- Introduce support for training continuation with categorical features by allowing reference categories for re-coding
- Add complete support for all unsigned integer types (uint8_t, uint16_t, uint32_t, uint64_t) in categorical encoding
- Implement `__arrow_c_device_array__` interface support for cuDF 25.06 compatibility
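
The re-coding idea in the first bullet can be illustrated with a minimal, library-independent sketch. It is not the code from this PR: the `recode` helper, its arguments, and the choice to raise on an unseen category are all assumptions made for illustration. It shows integer codes produced under one category ordering being mapped onto the ordering of a reference (training-time) category list.

```python
def recode(codes, new_categories, ref_categories):
    """Map codes that index into new_categories onto ref_categories' indexing.

    Hypothetical helper for illustration only; not part of the XGBoost API.
    """
    # Position of every reference category, e.g. {"a": 0, "b": 1, "c": 2}.
    ref_index = {cat: i for i, cat in enumerate(ref_categories)}

    # mapping[old_code] -> code under the reference ordering.
    mapping = []
    for cat in new_categories:
        if cat not in ref_index:
            # Assumption: an unseen category is treated as an error here.
            raise ValueError(f"Unknown category: {cat!r}")
        mapping.append(ref_index[cat])

    # Negative codes mark missing values (the pandas convention); keep them.
    return [mapping[c] if c >= 0 else c for c in codes]


# Categories seen at training time, in their original order.
ref_categories = ["a", "b", "c"]
# The new data encodes the same values with a different ordering.
new_categories = ["c", "a"]
new_codes = [0, 1, 1, -1, 0]

recoded = recode(new_codes, new_categories, ref_categories)
print(recoded)  # [2, 0, 0, -1, 2]
```

Building the mapping once per column and applying it by index keeps the per-value work to a single lookup, which is why the sketch separates the two steps.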
Reviewed Changes
Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/python/test_with_polars.py | Update test to use new export_to_arrow=True parameter |
| tests/python/test_ordinal.py | Fix typo in function name and add new test functions |
| tests/python-gpu/test_gpu_ordinal.py | Add mixed device tests and new ordinal test functions |
| tests/cpp/test_serialization.cc | Add test coverage for new unsigned integer array types |
| tests/cpp/data/test_cat_container.h | Update constructor call to use Context parameter |
| src/tree/gpu_hist/evaluate_splits.cu | Remove deprecated CUB version check |
| src/predictor/predict_fn.h | Move accessor classes to cat_container.h |
| src/predictor/gpu_predictor.cu | Refactor to use shared accessor functions |
| src/predictor/cpu_predictor.cc | Update sparse page view interface and use shared accessors |
| src/gbm/gbtree_model.h | Add CatsShared() method for shared pointer access |
| src/encoder/ordinal.h | Add support for all unsigned integer types |
| src/encoder/ordinal.cuh | Improve type checking and error messages |
| src/data/sparse_page_dmatrix.cc | Update CatContainer constructor call |
| src/data/simple_dmatrix.cu | Handle reference categorical data and encoding |
| src/data/simple_dmatrix.cc | Add support for encoded columnar batches |
| src/data/quantile_dmatrix.cu | Update CatContainer constructor |
| src/data/proxy_dmatrix.h | Add reference categories support |
| src/data/proxy_dmatrix.cuh | Handle reference categorical encoding |
| src/data/proxy_dmatrix.cu | Add type utilities and reference categories |
| src/data/proxy_dmatrix.cc | Improve DMatrix creation from proxy |
| src/data/gradient_index.cc | Add template instantiation for new batch type |
| src/data/entry.h | Move entry-related structures to dedicated header |
| src/data/ellpack_page.cu | Add specialization for encoded cuDF adapter |
| src/data/device_adapter.cuh | Implement encoded adapter batch and cuDF improvements |
| src/data/device_adapter.cu | Add reference categories parsing for cuDF |
| src/data/data.cc | Update includes and add template instantiation |
| src/data/columnar.h | Add helper functions for arrow-based categorical data |
| src/data/cat_container.h | Add accessor classes and CPU implementation |
| src/data/cat_container.cuh | Add CUDA implementation for category accessors |
| src/data/cat_container.cu | Update constructor and improve memory handling |
| src/data/cat_container.cc | Add support for all unsigned integer types |
| src/data/array_interface.h | Add TypeStr method declaration |
| src/data/array_interface.cc | Implement TypeStr method for better error messages |
| src/data/adapter.h | Refactor adapter interfaces and add encoding support |
| src/data/adapter.cc | Add reference categories parsing for columnar adapter |
| src/common/type.h | Add GetValueT utility type alias |
| src/common/quantile.cc | Add template instantiation for encoded batch |
| src/common/json.cc | Add support for new unsigned integer array types |
| src/common/hist_util.cuh | Remove deprecated CUB version checks |
| src/common/device_vector.cuh | Add missing include |
| src/common/column_matrix.h | Update includes for moved structures |
| src/c_api/c_api.cc | Update category API function signatures |
| python-package/xgboost/testing/utils.py | Add assert_allclose utility function |
| python-package/xgboost/testing/ordinal.py | Comprehensive test suite for categorical encoding |
| python-package/xgboost/testing/federated.py | Update type annotations |
| python-package/xgboost/testing/dask.py | Update type annotations |
| python-package/xgboost/data.py | Add Categories type support and reference handling |
| python-package/xgboost/core.py | Update Categories class and API calls |
| python-package/xgboost/callback.py | Update type annotations |
| python-package/xgboost/_typing.py | Add new type definitions |
| python-package/xgboost/_data_utils.py | Major refactor for reference categories and arrow support |
| python-package/pyproject.toml | Add CUDA to extension whitelist |
| include/xgboost/json_io.h | Add visitor methods for new array types |
| include/xgboost/json.h | Add new unsigned integer array value kinds |
| doc/python/python_api.rst | Add Categories class to documentation |

cc @rongou.
arrow. sketch of the container. mapping. aif. Pass it down. Extract the names. store it.rename. sketch of the accessor. Pass in the batch. Cleanup. work on test. copy work on pandas. Use list to keep the order. no print. Start the work on QDM. Cleanup. typing. alias. Fix. import skip. cleanup. outdated test. note. Move initialization. Get cats cudf. In the adapter. Copy. Work on CUDA test. Setters. rename. static. Work on cuDF acc. fix. cleanup. assert. doc string. dispatch. get cats. Check. cleanup. Work on numeric index. Numeric. typo. Remove. Notes. Split up arrow utilities. Notes. Reference. Fix note. Hide. typo. Notes. hide. Test. Work on removing pyarrow as dep. Merge. Move. Move. Use the handle for the host. Work on device. Use host storage. Cleanup. Work on training continuation. device. More. Cleanup the type hints. more checks. Work on predict check. type hints. More. Tests. params. small optimization. modin series.
arrow schema. check. Device array. debug fix. Wait. cleanup. lint. cleanup. Cleanup. Lost track. lint. Note. cleanup. Work on update. cleanup. Polars.
e5d30df to 85bb3a5
Ref #11088
- `__arrow_c_device_array__` in cuDF 25.06.
- `feature_types` for reference encoding.
- `DMatrix`.

todos