[Store]: Cleanup processing objects if transferring timedout (#975) #993
Summary of Changes
Hello @nickyc975, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the robustness of
Code Review
This pull request introduces a mechanism to clean up timed-out Put operations, which is a great improvement for the store's robustness. The changes are quite extensive, involving API modifications to include client_id, new configuration options for timeouts, and a new monitor thread. The logic for handling expired put operations, including staging and releasing discarded replicas, seems well-thought-out and correctly implemented. The addition of a dedicated test for the new monitor thread is also a valuable contribution.
I have a few suggestions to improve code clarity and maintainability by using more standard C++ idioms.
        bool isExpired(const std::chrono::steady_clock::time_point& now) {
            return ttl_ <= now;
        }
The isExpired method does not modify the state of the DiscardedReplicas object. It should be marked as const to reflect this. This is good practice and allows the method to be called on const objects, which can enable further optimizations and cleaner code, for example when using it with standard library algorithms like std::list::remove_if.
        bool isExpired(const std::chrono::steady_clock::time_point& now) const {
            return ttl_ <= now;
        }

        auto it = discarded_replicas_.begin();
        while (it != discarded_replicas_.end()) {
            if (it->isExpired(now)) {
                it = discarded_replicas_.erase(it);
            } else {
                it++;
            }
        }
This while loop for erasing expired replicas from the list can be simplified by using the std::list::remove_if member function. This makes the code more concise and expressive of its intent. Note that this requires the isExpired method to be const-qualified.
            discarded_replicas_.remove_if(
                [&now](const DiscardedReplicas& item) { return item.isExpired(now); });
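The two suggestions above depend on each other, so it may help to see them together in isolation. The following is a minimal sketch, not the PR's code: the `DiscardedReplicas` class here is a hypothetical stand-in that models only the TTL, and `dropExpired` is an illustrative helper name.

```cpp
#include <chrono>
#include <cstddef>
#include <list>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the PR's DiscardedReplicas; only the TTL is modeled.
class DiscardedReplicas {
   public:
    explicit DiscardedReplicas(Clock::time_point ttl) : ttl_(ttl) {}

    // const-qualified, so it can be called through the const reference
    // that the remove_if predicate below receives.
    bool isExpired(const Clock::time_point& now) const { return ttl_ <= now; }

   private:
    Clock::time_point ttl_;
};

// Illustrative helper: erases every expired entry and returns how many remain.
std::size_t dropExpired(std::list<DiscardedReplicas>& replicas,
                        const Clock::time_point& now) {
    replicas.remove_if(
        [&now](const DiscardedReplicas& item) { return item.isExpired(now); });
    return replicas.size();
}
```

Without the `const` on `isExpired`, the lambda would not compile, because its parameter is a `const DiscardedReplicas&`.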
… by the same client Signed-off-by: Chen Jinlong <chenjinlong.cjl@alibaba-inc.com>
2c393b4 to 80c1684
Thanks for the great work! I have a few suggestions:
- processing_metadata only needs to be an unordered_set<string> rather than holding a std::shared_ptr<ObjectMetadata> for each key. Since we can retrieve the metadata directly from the key, there is no need to change metadata to shared_ptr. Doing so would require modifying many places and cause a lot of conflicts with other PRs, and there is no real necessity to use shared_ptr here.
- Can we move the logic of the PutStartMonitor thread into BatchEvict (specifically in the first pass of batch eviction)? In practice, we don’t need to reclaim memory immediately when the put-start timer expires, nor do we need to scan all shards every second (which also requires locking each shard). Instead, we can simply check this as part of the eviction process.
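A rough sketch of how these two suggestions could fit together is shown below. Everything in it is assumed for illustration: the `Shard` and `ObjectMetadata` layouts are heavily simplified, `sweepTimedOutPuts` is a hypothetical helper name, and the caller is assumed to already hold the shard lock as part of the eviction pass.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified metadata; the real ObjectMetadata holds much more.
struct ObjectMetadata {
    Clock::time_point put_start_time;
    bool processing;
};

// Hypothetical shard: metadata stays stored by value, and in-flight Puts
// are tracked by key only.
struct Shard {
    std::unordered_map<std::string, ObjectMetadata> metadata;
    std::unordered_set<std::string> processing_keys;
};

// Meant to be called during the first pass of eviction, with the shard lock
// already held by the caller. Returns the keys whose Put timed out.
std::vector<std::string> sweepTimedOutPuts(Shard& shard, Clock::time_point now,
                                           Clock::duration timeout) {
    std::vector<std::string> discarded;
    auto it = shard.processing_keys.begin();
    while (it != shard.processing_keys.end()) {
        auto meta_it = shard.metadata.find(*it);
        if (meta_it == shard.metadata.end() || !meta_it->second.processing) {
            // Stale entry: the object is gone or its Put already completed.
            it = shard.processing_keys.erase(it);
        } else if (now - meta_it->second.put_start_time >= timeout) {
            // Timed out: drop the object so its space can be reclaimed.
            discarded.push_back(*it);
            shard.metadata.erase(meta_it);
            it = shard.processing_keys.erase(it);
        } else {
            ++it;
        }
    }
    return discarded;
}
```

Because the set stores only keys, the metadata map keeps its value semantics and no `shared_ptr` is needed; the cost is one extra lookup per tracked key during the sweep.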
        auto& metadata = accessor.Get();
        if (client_id != metadata.client_id) {
            LOG(ERROR) << "key=" << key << " putEnd client=" << client_id
This error log is a little misleading. The parameter here is correct; the problem is that the previous put start from this client has been aborted by the master.
        auto& metadata = accessor.Get();
        if (client_id != metadata.client_id) {
            LOG(ERROR) << "key=" << key << " putRevoke client=" << client_id
Similar issue with put start error log
Could you also add this new mechanism to the documents? Thanks
Hello, @ykwd:
I think it’s reasonable to wrap this logic into separate functions. However, a put_start timeout is a very rare case, and in the current architecture, each check requires acquiring locks for all shards. That adds a relatively expensive overhead for something that almost never happens. Would it make sense to wrap it as one or several standalone functions and call it at an appropriate point during batch eviction instead?
As discussed in #975, this PR implements the following mechanism:
put_start_monitor_thread_ is added to monitor ongoing Put operations. If an object is found stuck in the processing state for more than 10-20 minutes (this can be the same as the timeout for discarded replicas), we remove it and release the space immediately.
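To make the idea concrete, here is a minimal, self-contained sketch of such a monitor thread. It is an illustration under stated assumptions rather than the PR's implementation: the `PutStartMonitor` class, its method names, and the single map guarded by one mutex are all hypothetical, and a real store would release the allocated replicas where the comment indicates.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical monitor that forgets Puts stuck in the processing state.
class PutStartMonitor {
   public:
    PutStartMonitor(Clock::duration timeout, Clock::duration scan_interval)
        : timeout_(timeout),
          scan_interval_(scan_interval),
          thread_([this] { run(); }) {}

    ~PutStartMonitor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Records that a Put for `key` has started.
    void putStart(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        processing_[key] = Clock::now();
    }

    // Returns false if the key is no longer tracked, e.g. it was reclaimed.
    bool putEnd(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        return processing_.erase(key) > 0;
    }

    std::size_t processingCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return processing_.size();
    }

   private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_) {
            const auto now = Clock::now();
            for (auto it = processing_.begin(); it != processing_.end();) {
                if (now - it->second >= timeout_) {
                    // A real store would also release the object's space here.
                    it = processing_.erase(it);
                } else {
                    ++it;
                }
            }
            // Sleep until the next scan, waking early if asked to stop.
            cv_.wait_for(lock, scan_interval_, [this] { return stop_; });
        }
    }

    const Clock::duration timeout_;
    const Clock::duration scan_interval_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::unordered_map<std::string, Clock::time_point> processing_;
    std::thread thread_;  // Declared last so it starts after the other members.
};
```

In this sketch a late `putEnd` simply returns false once the monitor has reclaimed the key, which is the situation the review comments on the putEnd and putRevoke error logs refer to.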