
Get data from cache #525

Merged

merged 2 commits into main from get_cache_data on Dec 17, 2024
Conversation

jan-janssen
Member

@jan-janssen jan-janssen commented Dec 17, 2024

Example:

```python
import os
import pandas
import shutil
from executorlib import Executor
from executorlib.standalone.hdf import get_cache_data

cache_directory = "./cache"
with Executor(backend="local", cache_directory=cache_directory) as exe:
    future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
    print([f.result() for f in future_lst])

df = pandas.DataFrame(get_cache_data(cache_directory=cache_directory))
df
```
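The example imports `os` and `shutil` without using them, presumably so the cache directory can be removed once the results have been inspected. A minimal sketch of that cleanup, assuming the same `./cache` path as the example:

```python
import os
import shutil

cache_directory = "./cache"
# Remove the cache directory (and all HDF5 files in it) once the
# cached results are no longer needed.
if os.path.exists(cache_directory):
    shutil.rmtree(cache_directory)
```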

Summary by CodeRabbit

  • New Features

    • Introduced a function to retrieve data from HDF5 files in a specified cache directory.
    • Globalized key mappings for consistent usage across functions.
  • Bug Fixes

    • Enhanced caching mechanism validation through new unit tests.
  • Tests

    • Added a test class to validate caching functionality and ensure data consistency.

jan-janssen and others added 2 commits December 16, 2024 22:32

coderabbitai bot commented Dec 17, 2024

Walkthrough

This pull request introduces a new function get_cache_data() in the executorlib/standalone/hdf.py module, which retrieves data from HDF5 files in a specified cache directory. The group_dict has been moved to a global scope, making it accessible across the module. A corresponding test file test_cache_executor_interactive.py has been added to validate the new caching functionality, including a test method that checks cache data retrieval and cleanup.

Changes

| File | Change Summary |
| --- | --- |
| executorlib/standalone/hdf.py | Added get_cache_data() function to retrieve HDF5 file data; moved group_dict to global scope; updated imports to include os and List |
| tests/test_cache_executor_interactive.py | Added TestCacheFunctions test class; implemented test_cache_data() method to validate caching; added tearDown() method for cache cleanup |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Executor
    participant GetCacheData
    participant HDF5Files

    Executor->>GetCacheData: Request cache data
    GetCacheData->>HDF5Files: Scan cache directory
    HDF5Files-->>GetCacheData: Return file list
    GetCacheData->>HDF5Files: Read each HDF5 file
    HDF5Files-->>GetCacheData: Extract file contents
    GetCacheData-->>Executor: Return list of cached data
```

Possibly related PRs

  • Write cache first #492: Involves caching mechanisms, which aligns with the new get_cache_data function and caching functionality introduced in this PR.

Poem

🐰 In the realm of data's embrace,
HDF files dance with coding grace,
Caching secrets, swift and light,
Rabbit's magic takes its flight!
Executors sing, cache gleams bright! 🔍



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea305da and cf7bce5.

📒 Files selected for processing (2)
  • executorlib/standalone/hdf.py (2 hunks)
  • tests/test_cache_executor_interactive.py (1 hunks)
🔇 Additional comments (3)
executorlib/standalone/hdf.py (1)

8-15: LGTM: Well-structured global mapping dictionary

Moving group_dict to global scope improves code organization by making the mapping accessible across functions while maintaining a single source of truth.
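For readers following along without the diff: `group_dict` maps HDF5 group names to the keys used in the returned dictionaries. The PR does not show its entries, so the mapping below is a hypothetical sketch of the shape such a module-level dictionary could take (the output-key names match those validated in the review's test suggestion; the HDF5 group names are assumptions):

```python
# Hypothetical module-level mapping from HDF5 group names (left, assumed)
# to the keys of the dictionaries returned by get_cache_data() (right).
group_dict = {
    "fn": "function",
    "args": "input_args",
    "kwargs": "input_kwargs",
    "output": "output",
    "runtime": "runtime",
}


def keys_in_file(available_groups):
    """Return the output keys whose groups are present in a given file,
    mirroring the `if key in hdf` filter in get_cache_data()."""
    return [key for key in group_dict.values() if key in available_groups]
```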

tests/test_cache_executor_interactive.py (2)

1-17: LGTM: Well-implemented test skip logic

The implementation properly handles h5py dependency and follows unittest best practices for conditional test execution.


15-18: LGTM: Proper test class structure

The test class follows unittest conventions and is properly decorated for conditional execution.

Comment on lines +104 to +115
```python
def get_cache_data(cache_directory: str) -> List[dict]:
    file_lst = []
    for file_name in os.listdir(cache_directory):
        with h5py.File(os.path.join(cache_directory, file_name), "r") as hdf:
            file_content_dict = {
                key: cloudpickle.loads(np.void(hdf["/" + key]))
                for key in group_dict.values()
                if key in hdf
            }
        file_content_dict["filename"] = file_name
        file_lst.append(file_content_dict)
    return file_lst
```

⚠️ Potential issue

Add error handling and improve resource management

The function needs additional safeguards for robustness:

  1. Missing directory existence check
  2. No error handling for invalid HDF5 files
  3. Resource management could be improved using context manager pattern

Consider this improved implementation:

```diff
 def get_cache_data(cache_directory: str) -> List[dict]:
+    if not os.path.isdir(cache_directory):
+        raise ValueError(f"Cache directory does not exist: {cache_directory}")
+
     file_lst = []
     for file_name in os.listdir(cache_directory):
-        with h5py.File(os.path.join(cache_directory, file_name), "r") as hdf:
-            file_content_dict = {
-                key: cloudpickle.loads(np.void(hdf["/" + key]))
-                for key in group_dict.values()
-                if key in hdf
-            }
-        file_content_dict["filename"] = file_name
-        file_lst.append(file_content_dict)
+        file_path = os.path.join(cache_directory, file_name)
+        try:
+            with h5py.File(file_path, "r") as hdf:
+                file_content_dict = {
+                    key: cloudpickle.loads(np.void(hdf["/" + key]))
+                    for key in group_dict.values()
+                    if key in hdf
+                }
+                file_content_dict["filename"] = file_name
+                file_lst.append(file_content_dict)
+        except (OSError, KeyError) as e:
+            # Log warning but continue processing other files
+            import warnings
+            warnings.warn(f"Failed to read cache file {file_name}: {str(e)}")
     return file_lst
```
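The warn-and-continue error handling proposed above can be isolated into a self-contained pattern. In this sketch, `read_entries` and its reader callables are hypothetical stand-ins for the per-file HDF5 reads, not part of executorlib:

```python
import warnings


def read_entries(readers):
    """Apply each (name, reader) pair, warning instead of raising on
    failure, so one corrupt cache file does not abort the whole scan."""
    results = []
    for name, reader in readers:
        try:
            results.append({name: reader()})
        except (OSError, KeyError) as err:
            # Warn but keep processing the remaining entries.
            warnings.warn(f"Failed to read cache file {name}: {err}")
    return results
```

Only readable entries appear in the result; each failure is surfaced as a `UserWarning` rather than an exception.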

Comment on lines +31 to +33
```python
def tearDown(self):
    if os.path.exists("cache"):
        shutil.rmtree("cache")
```

🛠️ Refactor suggestion

Update cleanup to use test's cache directory

The tearDown method should clean up the temporary directory used in the test, not a hardcoded path.

```diff
 def tearDown(self):
-    if os.path.exists("cache"):
-        shutil.rmtree("cache")
+    if hasattr(self, '_testMethodName'):
+        method = getattr(self, self._testMethodName)
+        if hasattr(method, 'cache_directory'):
+            shutil.rmtree(method.cache_directory)
```

Committable suggestion skipped: line range outside the PR's diff.
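A simpler pattern than inspecting method attributes is to allocate a fresh temporary directory per test in `setUp` and remove it in `tearDown`. The class below is an illustrative sketch, not code from this PR, and its names are hypothetical:

```python
import os
import shutil
import tempfile
import unittest


class TestCacheFunctionsSketch(unittest.TestCase):
    """Sketch: give each test its own temporary cache directory and
    clean it up afterwards, instead of relying on a hardcoded path."""

    def setUp(self):
        # A unique, per-test directory avoids collisions between tests.
        self.cache_directory = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.cache_directory, ignore_errors=True)

    def test_cache_directory_is_usable(self):
        self.assertTrue(os.path.isdir(self.cache_directory))
```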

Comment on lines +19 to +29
```python
def test_cache_data(self):
    cache_directory = "./cache"
    with Executor(backend="local", cache_directory=cache_directory) as exe:
        future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
        result_lst = [f.result() for f in future_lst]

    cache_lst = get_cache_data(cache_directory=cache_directory)
    self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
    self.assertEqual(
        sum([sum(c["input_args"][0]) for c in cache_lst]), sum(result_lst)
    )
```

🛠️ Refactor suggestion

Enhance test coverage and use temporary directory

The test could be improved in several ways:

  1. Use tempfile.mkdtemp() instead of hardcoded "./cache"
  2. Add test cases for error conditions
  3. Validate all cached fields, not just output and input_args

Consider this improved implementation:

```diff
+import tempfile
+
 def test_cache_data(self):
-    cache_directory = "./cache"
+    cache_directory = tempfile.mkdtemp()
     with Executor(backend="local", cache_directory=cache_directory) as exe:
         future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
         result_lst = [f.result() for f in future_lst]

     cache_lst = get_cache_data(cache_directory=cache_directory)
+    # Validate cache size
+    self.assertEqual(len(cache_lst), len(future_lst))
+
+    # Validate all cached fields
+    for cache_entry in cache_lst:
+        self.assertIn('function', cache_entry)
+        self.assertIn('input_args', cache_entry)
+        self.assertIn('input_kwargs', cache_entry)
+        self.assertIn('output', cache_entry)
+        self.assertIn('runtime', cache_entry)
+
     self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
     self.assertEqual(
         sum([sum(c["input_args"][0]) for c in cache_lst]), sum(result_lst)
     )
+
+def test_cache_data_invalid_directory(self):
+    with self.assertRaises(ValueError):
+        get_cache_data("/nonexistent/path")
```

@jan-janssen jan-janssen merged commit c6a7988 into main Dec 17, 2024
27 checks passed
@jan-janssen jan-janssen deleted the get_cache_data branch December 17, 2024 05:42