
Get data from cache #525

Merged

merged 2 commits into main from get_cache_data on Dec 17, 2024
Conversation

jan-janssen
Member

@jan-janssen jan-janssen commented Dec 17, 2024

Example:

```python
import os
import pandas
import shutil
from executorlib import Executor
from executorlib.standalone.hdf import get_cache_data

cache_directory = "./cache"
with Executor(backend="local", cache_directory=cache_directory) as exe:
    future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
    print([f.result() for f in future_lst])

df = pandas.DataFrame(get_cache_data(cache_directory=cache_directory))
df
```
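The example imports `os` and `shutil` without using them, presumably so the cache directory can be removed once the results have been inspected. A minimal sketch of that cleanup, assuming the same `./cache` path as the example:

```python
import os
import shutil

cache_directory = "./cache"
# Remove the cache directory (and all HDF5 files in it) once the
# cached results are no longer needed.
if os.path.exists(cache_directory):
    shutil.rmtree(cache_directory)
```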

Summary by CodeRabbit

  • New Features

    • Introduced a function to retrieve data from HDF5 files in a specified cache directory.
    • Globalized key mappings for consistent usage across functions.
  • Bug Fixes

    • Enhanced caching mechanism validation through new unit tests.
  • Tests

    • Added a test class to validate caching functionality and ensure data consistency.

jan-janssen and others added 2 commits December 16, 2024 22:32

coderabbitai bot commented Dec 17, 2024

Walkthrough

This pull request introduces a new function get_cache_data() in the executorlib/standalone/hdf.py module, which retrieves data from HDF5 files in a specified cache directory. The group_dict has been moved to a global scope, making it accessible across the module. A corresponding test file test_cache_executor_interactive.py has been added to validate the new caching functionality, including a test method that checks cache data retrieval and cleanup.

Changes

| File | Change Summary |
| --- | --- |
| executorlib/standalone/hdf.py | Added get_cache_data() function to retrieve HDF5 file data; moved group_dict to global scope; updated imports to include os and List |
| tests/test_cache_executor_interactive.py | Added TestCacheFunctions test class; implemented test_cache_data() method to validate caching; added tearDown() method for cache cleanup |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Executor
    participant GetCacheData
    participant HDF5Files

    Executor->>GetCacheData: Request cache data
    GetCacheData->>HDF5Files: Scan cache directory
    HDF5Files-->>GetCacheData: Return file list
    GetCacheData->>HDF5Files: Read each HDF5 file
    HDF5Files-->>GetCacheData: Extract file contents
    GetCacheData-->>Executor: Return list of cached data
```

Possibly related PRs

  • Write cache first #492: Involves caching mechanisms, which aligns with the new get_cache_data function and caching functionality introduced in this PR.

Poem

🐰 In the realm of data's embrace,
HDF files dance with coding grace,
Caching secrets, swift and light,
Rabbit's magic takes its flight!
Executors sing, cache gleams bright! 🔍



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea305da and cf7bce5.

📒 Files selected for processing (2)
  • executorlib/standalone/hdf.py (2 hunks)
  • tests/test_cache_executor_interactive.py (1 hunks)
🔇 Additional comments (3)
executorlib/standalone/hdf.py (1)

8-15: LGTM: Well-structured global mapping dictionary

Moving group_dict to global scope improves code organization by making the mapping accessible across functions while maintaining a single source of truth.
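For readers following along without the diff: `group_dict` maps HDF5 group names to the keys used in the returned dictionaries. The PR does not show its entries, so the mapping below is a hypothetical sketch of the shape such a module-level dictionary could take (the output-key names match those validated in the review's test suggestion; the HDF5 group names are assumptions):

```python
# Hypothetical module-level mapping from HDF5 group names (left, assumed)
# to the keys of the dictionaries returned by get_cache_data() (right).
group_dict = {
    "fn": "function",
    "args": "input_args",
    "kwargs": "input_kwargs",
    "output": "output",
    "runtime": "runtime",
}


def keys_in_file(available_groups):
    """Return the output keys whose groups are present in a given file,
    mirroring the `if key in hdf` filter in get_cache_data()."""
    return [key for key in group_dict.values() if key in available_groups]
```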

tests/test_cache_executor_interactive.py (2)

1-17: LGTM: Well-implemented test skip logic

The implementation properly handles h5py dependency and follows unittest best practices for conditional test execution.


15-18: LGTM: Proper test class structure

The test class follows unittest conventions and is properly decorated for conditional execution.

Comment on lines +104 to +115
```python
def get_cache_data(cache_directory: str) -> List[dict]:
    file_lst = []
    for file_name in os.listdir(cache_directory):
        with h5py.File(os.path.join(cache_directory, file_name), "r") as hdf:
            file_content_dict = {
                key: cloudpickle.loads(np.void(hdf["/" + key]))
                for key in group_dict.values()
                if key in hdf
            }
        file_content_dict["filename"] = file_name
        file_lst.append(file_content_dict)
    return file_lst
```

⚠️ Potential issue

Add error handling and improve resource management

The function needs additional safeguards for robustness:

  1. Missing directory existence check
  2. No error handling for invalid HDF5 files
  3. Resource management could be improved using context manager pattern

Consider this improved implementation:

```diff
 def get_cache_data(cache_directory: str) -> List[dict]:
+    if not os.path.isdir(cache_directory):
+        raise ValueError(f"Cache directory does not exist: {cache_directory}")
+
     file_lst = []
     for file_name in os.listdir(cache_directory):
-        with h5py.File(os.path.join(cache_directory, file_name), "r") as hdf:
-            file_content_dict = {
-                key: cloudpickle.loads(np.void(hdf["/" + key]))
-                for key in group_dict.values()
-                if key in hdf
-            }
-        file_content_dict["filename"] = file_name
-        file_lst.append(file_content_dict)
+        file_path = os.path.join(cache_directory, file_name)
+        try:
+            with h5py.File(file_path, "r") as hdf:
+                file_content_dict = {
+                    key: cloudpickle.loads(np.void(hdf["/" + key]))
+                    for key in group_dict.values()
+                    if key in hdf
+                }
+                file_content_dict["filename"] = file_name
+                file_lst.append(file_content_dict)
+        except (OSError, KeyError) as e:
+            # Log warning but continue processing other files
+            import warnings
+            warnings.warn(f"Failed to read cache file {file_name}: {str(e)}")
     return file_lst
```
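The warn-and-continue error handling proposed above can be isolated into a self-contained pattern. In this sketch, `read_entries` and its reader callables are hypothetical stand-ins for the per-file HDF5 reads, not part of executorlib:

```python
import warnings


def read_entries(readers):
    """Apply each (name, reader) pair, warning instead of raising on
    failure, so one corrupt cache file does not abort the whole scan."""
    results = []
    for name, reader in readers:
        try:
            results.append({name: reader()})
        except (OSError, KeyError) as err:
            # Warn but keep processing the remaining entries.
            warnings.warn(f"Failed to read cache file {name}: {err}")
    return results
```

Only readable entries appear in the result; each failure is surfaced as a `UserWarning` rather than an exception.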

Comment on lines +31 to +33
```python
def tearDown(self):
    if os.path.exists("cache"):
        shutil.rmtree("cache")
```

🛠️ Refactor suggestion

Update cleanup to use test's cache directory

The tearDown method should clean up the temporary directory used in the test, not a hardcoded path.

```diff
 def tearDown(self):
-    if os.path.exists("cache"):
-        shutil.rmtree("cache")
+    if hasattr(self, '_testMethodName'):
+        method = getattr(self, self._testMethodName)
+        if hasattr(method, 'cache_directory'):
+            shutil.rmtree(method.cache_directory)
```

Committable suggestion skipped: line range outside the PR's diff.
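A simpler pattern than inspecting method attributes is to allocate a fresh temporary directory per test in `setUp` and remove it in `tearDown`. The class below is an illustrative sketch, not code from this PR, and its names are hypothetical:

```python
import os
import shutil
import tempfile
import unittest


class TestCacheFunctionsSketch(unittest.TestCase):
    """Sketch: give each test its own temporary cache directory and
    clean it up afterwards, instead of relying on a hardcoded path."""

    def setUp(self):
        # A unique, per-test directory avoids collisions between tests.
        self.cache_directory = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.cache_directory, ignore_errors=True)

    def test_cache_directory_is_usable(self):
        self.assertTrue(os.path.isdir(self.cache_directory))
```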

Comment on lines +19 to +29
```python
def test_cache_data(self):
    cache_directory = "./cache"
    with Executor(backend="local", cache_directory=cache_directory) as exe:
        future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
        result_lst = [f.result() for f in future_lst]

    cache_lst = get_cache_data(cache_directory=cache_directory)
    self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
    self.assertEqual(
        sum([sum(c["input_args"][0]) for c in cache_lst]), sum(result_lst)
    )
```

🛠️ Refactor suggestion

Enhance test coverage and use temporary directory

The test could be improved in several ways:

  1. Use tempfile.mkdtemp() instead of hardcoded "./cache"
  2. Add test cases for error conditions
  3. Validate all cached fields, not just output and input_args

Consider this improved implementation:

```diff
+import tempfile
+
 def test_cache_data(self):
-    cache_directory = "./cache"
+    cache_directory = tempfile.mkdtemp()
     with Executor(backend="local", cache_directory=cache_directory) as exe:
         future_lst = [exe.submit(sum, [i, i]) for i in range(1, 4)]
         result_lst = [f.result() for f in future_lst]

     cache_lst = get_cache_data(cache_directory=cache_directory)
+    # Validate cache size
+    self.assertEqual(len(cache_lst), len(future_lst))
+
+    # Validate all cached fields
+    for cache_entry in cache_lst:
+        self.assertIn('function', cache_entry)
+        self.assertIn('input_args', cache_entry)
+        self.assertIn('input_kwargs', cache_entry)
+        self.assertIn('output', cache_entry)
+        self.assertIn('runtime', cache_entry)
+
     self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
     self.assertEqual(
         sum([sum(c["input_args"][0]) for c in cache_lst]), sum(result_lst)
     )
+
+def test_cache_data_invalid_directory(self):
+    with self.assertRaises(ValueError):
+        get_cache_data("/nonexistent/path")
```

@jan-janssen jan-janssen merged commit c6a7988 into main Dec 17, 2024
27 checks passed
@jan-janssen jan-janssen deleted the get_cache_data branch December 17, 2024 05:42