1 change: 1 addition & 0 deletions README.md
@@ -124,6 +124,7 @@ Your Python dependencies can be packaged as .py files, .zip archives (containing
Your entry point script will define logic using the `Client` object, which wraps the data access layers.

You should only need the following methods:
* `read_file(file_name)` – Returns a file handle for the provided file_name
* `read_dlo(name)` – Read from a Data Lake Object by name
* `read_dmo(name)` – Read from a Data Model Object by name
* `write_to_dlo(name, spark_dataframe, write_mode)` – Write to a Data Lake Object by name with a Spark dataframe
179 changes: 179 additions & 0 deletions docs/file_reader_refactoring.md
@@ -0,0 +1,179 @@
# DefaultFileReader Class Refactoring

## Overview

The `DefaultFileReader` class has been refactored to improve testability, readability, and maintainability. This document outlines the changes made and how to use the new implementation.

## Key Improvements

### 1. **Separation of Concerns**
- **File path resolution** is now handled by dedicated methods
- **File opening** is separated from path resolution
- **Configuration management** is centralized and configurable

### 2. **Enhanced Testability**
- **Dependency injection** through constructor parameters
- **Mockable methods** for unit testing
- **Clear interfaces** between different responsibilities
- **Comprehensive test coverage** with isolated test cases

### 3. **Better Error Handling**
- **Custom exception hierarchy** for different error types
- **Descriptive error messages** with context
- **Proper exception chaining** for debugging

### 4. **Improved Configuration**
- **Configurable defaults** that can be overridden
- **Environment-specific settings** support
- **Clear configuration contract**

### 5. **Enhanced Readability**
- **Comprehensive docstrings** for all methods
- **Clear method names** that describe their purpose
- **Logical method organization** from public to private
- **Type hints** throughout the codebase

## Class Structure

### DefaultFileReader
The main class that provides the file reading framework:

```python
import io
from pathlib import Path

from datacustomcode.file.base import BaseDataAccessLayer


class DefaultFileReader(BaseDataAccessLayer):
    # Configuration constants
    DEFAULT_CODE_PACKAGE = 'payload'
    DEFAULT_FILE_FOLDER = 'files'
    DEFAULT_CONFIG_FILE = 'config.json'

    def __init__(self, code_package=None, file_folder=None, config_file=None):
        """Initialize with custom or default configuration."""
        ...

    def file_open(self, file_name: str) -> io.TextIOWrapper:
        """Main public method for opening files."""
        ...

    def get_search_locations(self) -> list[Path]:
        """Get all possible search locations."""
        ...
```

## Exception Hierarchy

```text
FileReaderError (base)
├── FileNotFoundError (file not found in any location)
└── FileAccessError (permission, I/O errors, etc.)
```
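
A minimal sketch of what this hierarchy might look like, assuming the exceptions are defined alongside `DefaultFileReader` in `datacustomcode.file.reader.default` (the actual definitions are not shown in this diff):

```python
class FileReaderError(Exception):
    """Base exception for file reader errors."""


class FileNotFoundError(FileReaderError):
    """Raised when a file is not found in any search location.

    Note: this shadows the builtin FileNotFoundError, so callers
    should import it explicitly from the reader module.
    """


class FileAccessError(FileReaderError):
    """Raised for permission and other I/O errors while opening a file."""
```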

## Usage Examples

### Basic Usage
```python
from datacustomcode.file.reader.default import DefaultFileReader

# Use default configuration
reader = DefaultFileReader()
with reader.file_open('data.csv') as f:
    content = f.read()
```
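
Because `file_open` returns a standard `io.TextIOWrapper`, the `with` statement closes the handle automatically on exit.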

### Custom Configuration
```python
from datacustomcode.file.reader.default import DefaultFileReader

# Custom configuration
reader = DefaultFileReader(
    code_package='my_package',
    file_folder='data',
    config_file='settings.json'
)
```

### Error Handling
```python
# Assuming the custom exceptions are exported alongside DefaultFileReader
# (not confirmed by this diff).
from datacustomcode.file.reader.default import FileAccessError, FileNotFoundError

try:
    with reader.file_open('data.csv') as f:
        content = f.read()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except FileAccessError as e:
    print(f"Access error: {e}")
```

## File Resolution Strategy

The file reader uses a two-tier search strategy:

1. **Primary Location**: `{code_package}/{file_folder}/{filename}`
2. **Fallback Location**: `{config_file_parent}/{file_folder}/{filename}`

This allows for flexible deployment scenarios where files might be in different locations depending on the environment.
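
A sketch of how this two-tier search could work in practice; these are illustrative standalone functions under stated assumptions, not the actual methods from this PR:

```python
from pathlib import Path


def get_search_locations(code_package: str, file_folder: str, config_file: Path) -> list[Path]:
    """Build the ordered search locations: primary first, then fallback."""
    return [
        Path(code_package) / file_folder,   # 1. {code_package}/{file_folder}
        config_file.parent / file_folder,   # 2. {config_file_parent}/{file_folder}
    ]


def resolve_file_path(file_name: str, locations: list[Path]) -> Path:
    """Return the first existing candidate, or fail with all searched paths."""
    for location in locations:
        candidate = location / file_name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{file_name!r} not found in any of: {locations}")
```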

## Testing

### Unit Tests
The refactored class includes comprehensive unit tests covering:
- Configuration initialization
- File path resolution
- Error handling scenarios
- File opening operations
- Search location determination

### Mocking
The class is designed for easy mocking in tests:
```python
from pathlib import Path
from unittest.mock import patch

from datacustomcode.file.reader.default import DefaultFileReader

# patch() requires a full dotted path; patch.object avoids spelling it out.
with patch.object(DefaultFileReader, '_resolve_file_path') as mock_resolve:
    mock_resolve.return_value = Path('/test/file.txt')
    # Test file opening logic
```

### Integration Tests
Integration tests verify the complete file resolution and opening flow using temporary directories and real file operations.
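
A hedged sketch of one such test using pytest's `tmp_path` and `monkeypatch` fixtures; it assumes the default search locations resolve relative to the working directory, which this diff does not confirm:

```python
from datacustomcode.file.reader.default import DefaultFileReader


def test_file_open_finds_file_in_primary_location(tmp_path, monkeypatch):
    # Build {code_package}/{file_folder}/{filename} using the documented defaults.
    files_dir = tmp_path / "payload" / "files"
    files_dir.mkdir(parents=True)
    (files_dir / "data.csv").write_text("a,b\n1,2\n")

    # Run from the temp directory so relative search paths resolve against it.
    monkeypatch.chdir(tmp_path)

    reader = DefaultFileReader()
    with reader.file_open("data.csv") as f:
        assert f.read() == "a,b\n1,2\n"
```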

## Migration Guide

### From Old Implementation
The old implementation had these issues:
- Hardcoded configuration values
- Mixed responsibilities in single methods
- Limited error handling
- Difficult to test

### To New Implementation
1. **Update imports**: Use `DefaultFileReader` from `datacustomcode.file.reader.default`
2. **Error handling**: Catch specific exceptions instead of generic ones (see the sketch after this list)
3. **Configuration**: Use constructor parameters for custom settings
4. **Testing**: Leverage the new mockable methods
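
For step 2, the change at call sites might look like this; the 'before' shape is an assumption about typical old code, not taken from this repository:

```python
# Before: a generic catch that also swallows unrelated failures
try:
    with reader.file_open('data.csv') as f:
        content = f.read()
except Exception as e:
    print(f"Something went wrong: {e}")

# After: specific exceptions from the reader module
try:
    with reader.file_open('data.csv') as f:
        content = f.read()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except FileAccessError as e:
    print(f"Access error: {e}")
```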

## Benefits

### For Developers
- **Easier debugging** with clear error messages
- **Better IDE support** with type hints and docstrings
- **Simplified testing** with dependency injection
- **Clearer code structure** with separated responsibilities

### For Maintainers
- **Easier to extend** with new file resolution strategies
- **Better error tracking** with custom exception types
- **Improved test coverage** with isolated test cases
- **Clearer documentation** with comprehensive docstrings

### For Users
- **More reliable** with proper error handling
- **More flexible** with configurable behavior
- **Better debugging** with descriptive error messages
- **Consistent interface** across different implementations

## Future Enhancements

The refactored structure makes it easy to add:
- **Additional file resolution strategies** (URLs, cloud storage, etc.)
- **File format detection** and automatic handling
- **Caching mechanisms** for frequently accessed files
- **Async file operations** for better performance
- **File validation** and integrity checking

## Conclusion

The refactored `DefaultFileReader` class provides a solid foundation for file reading operations while maintaining backward compatibility. The improvements in testability, readability, and maintainability make it easier to develop, test, and maintain file reading functionality in the Data Cloud Custom Code SDK.
6 changes: 3 additions & 3 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -100,7 +100,7 @@ loguru = "^0.7.3"
numpy = "*"
pandas = "*"
pydantic = "^1.8.2 || ^2.0.0"
pyspark = "3.5.1"
pyspark = "3.5.6"
python = ">=3.10,<3.12"
pyyaml = "^6.0"
salesforce-cdp-connector = "*"
10 changes: 10 additions & 0 deletions src/datacustomcode/client.py
@@ -24,9 +24,12 @@
from pyspark.sql import SparkSession

from datacustomcode.config import SparkConfig, config
from datacustomcode.file.reader.default import DefaultFileReader
from datacustomcode.io.reader.base import BaseDataCloudReader

if TYPE_CHECKING:
    import io

    from pyspark.sql import DataFrame as PySparkDataFrame

    from datacustomcode.io.reader.base import BaseDataCloudReader
@@ -112,6 +115,7 @@ class Client:
    _instance: ClassVar[Optional[Client]] = None
    _reader: BaseDataCloudReader
    _writer: BaseDataCloudWriter
    _file: DefaultFileReader
    _data_layer_history: dict[DataCloudObjectType, set[str]]

    def __new__(
@@ -154,6 +158,7 @@ def __new__(
        writer_init = writer
        cls._instance._reader = reader_init
        cls._instance._writer = writer_init
        cls._instance._file = DefaultFileReader()
        cls._instance._data_layer_history = {
            DataCloudObjectType.DLO: set(),
            DataCloudObjectType.DMO: set(),
@@ -212,6 +217,11 @@ def write_to_dmo(
        self._validate_data_layer_history_does_not_contain(DataCloudObjectType.DLO)
        return self._writer.write_to_dmo(name, dataframe, write_mode, **kwargs)

    def read_file(self, file_name: str) -> io.TextIOWrapper:
        """Read a file from the local file system."""

        return self._file.read_file(file_name)

    def _validate_data_layer_history_does_not_contain(
        self, data_cloud_object_type: DataCloudObjectType
    ) -> None:
14 changes: 14 additions & 0 deletions src/datacustomcode/file/__init__.py
@@ -0,0 +1,14 @@
# Copyright (c) 2025, Salesforce, Inc.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
19 changes: 19 additions & 0 deletions src/datacustomcode/file/base.py
@@ -0,0 +1,19 @@
# Copyright (c) 2025, Salesforce, Inc.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations


class BaseDataAccessLayer:
"""Base class for data access layer implementations."""
14 changes: 14 additions & 0 deletions src/datacustomcode/file/reader/__init__.py
@@ -0,0 +1,14 @@
# Copyright (c) 2025, Salesforce, Inc.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.