## Description
This PR adds support for reading Unity Catalog Delta tables in Ray Data
with automatic credential vending. This enables secure, temporary access
to Delta Lake tables stored in Databricks Unity Catalog without
requiring users to manage cloud credentials manually.
### What's Added
- **`ray.data.read_unity_catalog()`** - New public API for reading
Unity Catalog Delta tables
- **`UnityCatalogConnector`** - Handles Unity Catalog REST API
integration and credential vending
- **Multi-cloud support** - Works with AWS S3, Azure Data Lake Storage,
and Google Cloud Storage
- **Automatic credential management** - Obtains temporary,
least-privilege credentials via Unity Catalog API
- **Delta Lake integration** - Properly configures PyArrow filesystem
for Delta tables with session tokens
### Key Features
✅ **Production-ready credential vending API** - Uses stable, public Unity Catalog APIs
✅ **Secure by default** - Temporary credentials with automatic cleanup
✅ **Multi-cloud** - AWS (S3), Azure (Blob Storage), and GCP (Cloud Storage)
✅ **Delta Lake optimized** - Handles session tokens and PyArrow filesystem configuration
✅ **Comprehensive error handling** - Helpful messages for common issues (deletion vectors, permissions, etc.)
✅ **Full logging support** - Debug and info logging throughout
### Usage Example
```python
import ray

# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2",  # Optional, for AWS
)

# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```
### Implementation Notes
This is a **simplified, focused implementation** that:
- Supports **Unity Catalog tables only** (volumes are still in private
preview)
- Assumes **Delta Lake format** (most common Unity Catalog use case)
- Uses **production-ready APIs** only (no private preview features)
- Provides ~600 lines of clean, reviewable code
The full implementation with volumes and multi-format support is
available in the `data_uc_volumes` branch and can be added in a future
PR once this foundation is reviewed.
### Testing
- ✅ All ruff lint checks pass
- ✅ Code formatted per Ray standards
- ✅ Tested with real Unity Catalog Delta tables on AWS S3
- ✅ Proper PyArrow filesystem configuration verified
- ✅ Credential vending flow validated
## Related issues
Related to Unity Catalog and Delta Lake support requests in Ray Data.
## Additional information
### Architecture
The implementation follows the **connector pattern** rather than a
`Datasource` subclass because Unity Catalog is a metadata/credential
layer, not a data format. The connector:
1. Fetches table metadata from Unity Catalog REST API
2. Obtains temporary credentials via credential vending API
3. Configures cloud-specific environment variables
4. Delegates to `ray.data.read_delta()` with proper filesystem
configuration
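For orientation, here is a minimal sketch of that four-step flow. The
endpoint paths and response field names follow the public Unity Catalog
REST API but are assumptions here, as is the `_build_filesystem` helper;
the PR's actual code may differ:

```python
import requests


def _read_unity_catalog_table(url: str, token: str, table: str):
    headers = {"Authorization": f"Bearer {token}"}

    # Step 1: fetch table metadata (table ID, storage location).
    info = requests.get(
        f"{url}/api/2.1/unity-catalog/tables/{table}", headers=headers
    ).json()

    # Step 2: obtain temporary, read-scoped credentials via credential vending.
    creds = requests.post(
        f"{url}/api/2.1/unity-catalog/temporary-table-credentials",
        headers=headers,
        json={"table_id": info["table_id"], "operation": "READ"},
    ).json()

    # Steps 3-4: configure cloud credentials, then delegate to read_delta().
    filesystem = _build_filesystem(creds)  # hypothetical helper

    import ray

    return ray.data.read_delta(info["storage_location"], filesystem=filesystem)
```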
### Delta Lake Special Handling
Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration
with session tokens (environment variables alone are insufficient). This
implementation correctly creates and passes the filesystem object to the
`deltalake` library.
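A minimal sketch of that handling, assuming temporary AWS credentials
already fetched from the vending API; `creds` and `table_uri` are
illustrative names, not the PR's actual variables:

```python
import pyarrow.fs as pafs
from deltalake import DeltaTable

# Per the note above, environment variables alone are insufficient here,
# so the session token is passed to the filesystem explicitly.
fs = pafs.S3FileSystem(
    access_key=creds["access_key_id"],
    secret_key=creds["secret_access_key"],
    session_token=creds["session_token"],
    region="us-west-2",
)

# The deltalake library reads the transaction log itself, so it needs the
# credentials too; the PyArrow filesystem is used for the data files.
dt = DeltaTable(
    table_uri,
    storage_options={
        "AWS_ACCESS_KEY_ID": creds["access_key_id"],
        "AWS_SECRET_ACCESS_KEY": creds["secret_access_key"],
        "AWS_SESSION_TOKEN": creds["session_token"],
    },
)
dataset = dt.to_pyarrow_dataset(filesystem=fs)
```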
### Cloud Provider Support
| Provider | Credential Type | Implementation |
|----------|-----------------|----------------|
| AWS S3 | Temporary IAM credentials | PyArrow `S3FileSystem` with session token |
| Azure Blob | SAS tokens | Environment variables (`AZURE_STORAGE_SAS_TOKEN`) |
| GCP Cloud Storage | OAuth tokens / service account | Environment variables (`GCP_OAUTH_TOKEN`, `GOOGLE_APPLICATION_CREDENTIALS`) |
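As a sketch, the per-provider configuration might look like the
following; the `creds` field names are assumptions, and the exact
mapping in the PR may differ:

```python
import os


def _configure_cloud_env(provider: str, creds: dict) -> None:
    """Export vended credentials as the environment variables listed above."""
    if provider == "aws":
        os.environ["AWS_ACCESS_KEY_ID"] = creds["access_key_id"]
        os.environ["AWS_SECRET_ACCESS_KEY"] = creds["secret_access_key"]
        os.environ["AWS_SESSION_TOKEN"] = creds["session_token"]
    elif provider == "azure":
        os.environ["AZURE_STORAGE_SAS_TOKEN"] = creds["sas_token"]
    elif provider == "gcp":
        os.environ["GCP_OAUTH_TOKEN"] = creds["oauth_token"]
    else:
        raise ValueError(f"Unsupported cloud provider: {provider!r}")
```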
### Error Handling
Comprehensive error messages for common issues:
- **Deletion Vectors**: Guidance on upgrading deltalake library or
disabling the feature
- **Column Mapping**: Compatibility information and solutions
- **Permissions**: Clear list of required Unity Catalog permissions
- **Credential issues**: Detailed troubleshooting steps
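Illustrative only, not the PR's actual text: one way such a mapping
from low-level failures to actionable guidance might look.

```python
def _friendly_delta_error(err: Exception) -> str:
    """Map common low-level failures to actionable guidance (illustrative)."""
    msg = str(err)
    if "deletionVectors" in msg or "deletion vector" in msg.lower():
        return (
            "This table uses deletion vectors. Upgrade the `deltalake` "
            "package or disable the feature on the table."
        )
    if "PERMISSION_DENIED" in msg or "403" in msg:
        return (
            "Permission denied. Verify USE CATALOG, USE SCHEMA, and SELECT "
            "privileges on the table, and that credential vending is enabled."
        )
    return msg  # fall back to the original message
```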
### Future Enhancements
Potential follow-up PRs:
- Unity Catalog volumes support (when out of private preview)
- Multi-format support (Parquet, CSV, JSON, images, etc.)
- Custom datasource integration
- Advanced Delta Lake features (time travel, partition filters)
### Dependencies
- Requires `deltalake` package for Delta Lake support
- Uses standard Ray Data APIs (`read_delta`, `read_datasource`)
- Integrates with existing PyArrow filesystem infrastructure
### Documentation
- Full docstrings with examples
- Type hints throughout
- Inline comments with references to external documentation
- Comprehensive error messages with actionable guidance
---------
Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Co-authored-by: soffer-anyscale <173827098+soffer-anyscale@users.noreply.github.com>