fix: query staging(in-mem) when concerned with the past 5 minutes #1194

Merged
merged 3 commits into from
Feb 19, 2025

Conversation

de-sh
Contributor

@de-sh de-sh commented Feb 17, 2025

Fixes #XXXX.

Description

Currently, queries do not account for data that is still in staging because of a slow network or slow compaction.
A follow-up PR will also consider data in parquet files that have not yet been pushed to the object store.


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced time filtering logic to accurately determine if queries fall within the specified staging window.
    • Improved processing of filter expressions for more efficient query operations.

coderabbitai bot commented Feb 17, 2025

Walkthrough

This pull request changes the StandardTableProvider in src/query/stream_schema_provider.rs. The include_now function is replaced by is_within_staging_window, which changes how time filters are evaluated, and extract_primary_filter is made public. In src/utils/arrow/flight.rs, send_to_ingester now builds its time filters with extract_primary_filter and checks them with is_within_staging_window.

Changes

File: Summary of Changes

src/query/stream_schema_provider.rs
  • Replaced include_now with is_within_staging_window: new logic checks if time filters indicate a range ending within five minutes from now.
  • Updated extract_primary_filter: changed visibility to public.
  • Updated supports_filters_pushdown: enhanced to handle multiple filters.
src/utils/arrow/flight.rs
  • Removed import of include_now and added imports for extract_primary_filter and is_within_staging_window.
  • Updated send_to_ingester: logic modified to use extract_primary_filter for constructing time filters.
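
The window check summarized above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the PartialTimeFilter enum below is a simplified stand-in that stores Unix seconds, the `now` parameter is a hypothetical addition so the result is deterministic, and treating a bound exactly five minutes old as inside the window is an assumption.

```rust
use std::ops::Bound;

/// Simplified stand-in for the crate's `PartialTimeFilter` (assumption:
/// timestamps are plain Unix seconds rather than date-time values).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PartialTimeFilter {
    Low(Bound<i64>),
    High(Bound<i64>),
    Eq(i64),
}

/// Width of the staging window in seconds (five minutes).
pub const STAGING_WINDOW_SECS: i64 = 5 * 60;

/// Returns true when the filters describe a range that may end inside the
/// staging window. `now` is injected here (hypothetical parameter); the
/// function discussed in this PR takes only the filters.
pub fn is_within_staging_window(time_filters: &[PartialTimeFilter], now: i64) -> bool {
    let window_start = now - STAGING_WINDOW_SECS;
    let mut has_upper_bound = false;

    for filter in time_filters {
        // Only filters that cap the range from above are relevant.
        let upper = match filter {
            PartialTimeFilter::High(Bound::Included(t))
            | PartialTimeFilter::High(Bound::Excluded(t))
            | PartialTimeFilter::Eq(t) => Some(*t),
            _ => None,
        };
        if let Some(t) = upper {
            has_upper_bound = true;
            if t >= window_start {
                return true;
            }
        }
    }

    // With no upper bound the range is open-ended and reaches the present.
    !has_upper_bound
}
```

Passing the clock in as an argument is a design choice of this sketch only; it keeps the boundary behavior testable without depending on wall-clock time.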

Sequence Diagram(s)

sequenceDiagram
    participant QE as Query Engine
    participant TP as TableProvider
    QE->>TP: extract_primary_filter(filters, time_partition)
    TP->>TP: Process filters and return time_filters
    QE->>TP: is_within_staging_window(time_filters)
    TP->>TP: Evaluate if current time is within staging window
    TP-->>QE: Return inclusion decision
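
The two-step flow in the diagram (extract the time filters, then test them against the staging window) might look roughly like the sketch below. FilterExpr, Op, the `event_time` column used in the usage note, the `now` parameter, and should_query_staging are all hypothetical names introduced for illustration; the real functions operate on the query engine's expression types.

```rust
use std::ops::Bound;

/// Hypothetical, simplified filter expression.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterExpr {
    /// Comparison of a column against a Unix-seconds timestamp.
    Cmp { column: String, op: Op, ts: i64 },
    /// Any predicate that is not a time comparison.
    Other(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op { Gt, GtEq, Lt, LtEq, Eq }

/// Simplified stand-in for the crate's time filter type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PartialTimeFilter { Low(Bound<i64>), High(Bound<i64>), Eq(i64) }

/// Step 1: keep only comparisons on the time partition column.
pub fn extract_primary_filter(filters: &[FilterExpr], time_partition: &str) -> Vec<PartialTimeFilter> {
    filters
        .iter()
        .filter_map(|f| match f {
            FilterExpr::Cmp { column, op, ts } if column.as_str() == time_partition => Some(match op {
                Op::Gt => PartialTimeFilter::Low(Bound::Excluded(*ts)),
                Op::GtEq => PartialTimeFilter::Low(Bound::Included(*ts)),
                Op::Lt => PartialTimeFilter::High(Bound::Excluded(*ts)),
                Op::LtEq => PartialTimeFilter::High(Bound::Included(*ts)),
                Op::Eq => PartialTimeFilter::Eq(*ts),
            }),
            _ => None,
        })
        .collect()
}

/// Step 2: true if any upper bound is within five minutes of `now`,
/// or if the range has no upper bound at all.
pub fn is_within_staging_window(time_filters: &[PartialTimeFilter], now: i64) -> bool {
    let window_start = now - 5 * 60;
    let upper_bounds: Vec<i64> = time_filters
        .iter()
        .filter_map(|f| match f {
            PartialTimeFilter::High(Bound::Included(t))
            | PartialTimeFilter::High(Bound::Excluded(t))
            | PartialTimeFilter::Eq(t) => Some(*t),
            _ => None,
        })
        .collect();
    upper_bounds.is_empty() || upper_bounds.iter().any(|t| *t >= window_start)
}

/// Both steps combined into the inclusion decision returned to the caller.
pub fn should_query_staging(filters: &[FilterExpr], time_partition: &str, now: i64) -> bool {
    let time_filters = extract_primary_filter(filters, time_partition);
    is_within_staging_window(&time_filters, now)
}
```

In this sketch, a filter list containing `event_time <= now - 60` alongside an unrelated predicate yields a single High bound and a positive decision, while `event_time < now - 3600` yields a negative one.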

Poem

In the stream of time, we’ve made a change,
With filters that dance, and logic rearranged.
From now to staging, we hop with delight,
A rabbit’s cheer for queries done right!
🐇✨


coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 17, 2025
@de-sh de-sh changed the title from "fix: consider current time" to "fix: query staging when concerned with the past 5 minutes" Feb 19, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (2)
src/query/stream_schema_provider.rs (2)

957-957: 🛠️ Refactor suggestion

Add test coverage for is_within_staging_window.

The new function lacks test coverage. Consider adding tests to verify:

  1. Behavior with filters within the 5-minute window
  2. Behavior with filters outside the window
  3. Behavior with no upper bound
  4. Edge cases around the window boundary

Here's a suggested test implementation:

#[test]
fn test_staging_window() {
    // Assumes chrono's `TimeDelta` and `Utc`, `Bound`, and the crate's
    // `PartialTimeFilter` are in scope, e.g. via `use super::*;`.
    let now = Utc::now().naive_utc();
    let four_mins_ago = now - TimeDelta::minutes(4);
    let six_mins_ago = now - TimeDelta::minutes(6);

    // Test case 1: upper bound inside the 5-minute window
    let filters = vec![PartialTimeFilter::High(Bound::Included(four_mins_ago))];
    assert!(is_within_staging_window(&filters));

    // Test case 2: upper bound outside the 5-minute window
    let filters = vec![PartialTimeFilter::High(Bound::Included(six_mins_ago))];
    assert!(!is_within_staging_window(&filters));

    // Test case 3: no upper bound, so the range extends to the present
    let filters = vec![PartialTimeFilter::Low(Bound::Included(six_mins_ago))];
    assert!(is_within_staging_window(&filters));

    // Test case 4: old lower bound combined with a recent upper bound
    let filters = vec![
        PartialTimeFilter::Low(Bound::Included(six_mins_ago)),
        PartialTimeFilter::High(Bound::Included(four_mins_ago)),
    ];
    assert!(is_within_staging_window(&filters));
}

736-757: 💡 Verification agent

🧩 Analysis chain

LGTM! Well-implemented staging window check.

The implementation correctly:

  1. Calculates a 5-minute window from the current time
  2. Checks if any filter's upper bound falls within this window
  3. Handles cases with no upper bound

However, there's a potential edge case to consider:


🏁 Script executed:

#!/bin/bash
# Description: Check if there are any tests that verify the behavior of is_within_staging_window
# with different time filter combinations.

# Search for test cases
rg -A 5 "is_within_staging_window" --type rust

Length of output: 1906


Staging window check approved – please add explicit unit tests for edge-case scenarios

The implementation correctly:

  • Calculates a 5‑minute window from the current time
  • Determines when any filter’s high bound (or equality) falls within that window
  • Handles cases when no upper bound is present

However, a repository-wide search (using rg -A 5 "is_within_staging_window" --type rust) did not reveal dedicated tests verifying this function’s behavior under various time filter combinations. I recommend adding unit tests that cover scenarios such as:

  • A time filter exactly on the 5‑minute threshold
  • A mix of filters including both high bounds and non‑high bounds
  • Cases with no upper bound at all

This will help ensure that the edge-case behavior is validated and maintained over time.
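
One way to pin down the scenarios listed above is a small table of cases evaluated against a fixed clock. The sketch below is illustrative only: it uses a simplified PartialTimeFilter over Unix seconds, a hypothetical `now` parameter, and it assumes the threshold itself counts as inside the window, which is exactly the kind of decision such tests would need to confirm against the real function.

```rust
use std::ops::Bound;

/// Simplified stand-in for the crate's time filter type (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PartialTimeFilter { Low(Bound<i64>), High(Bound<i64>), Eq(i64) }

const WINDOW_SECS: i64 = 300;

/// Window check with an injected clock (hypothetical signature).
pub fn is_within_staging_window(filters: &[PartialTimeFilter], now: i64) -> bool {
    let threshold = now - WINDOW_SECS;
    let uppers: Vec<i64> = filters
        .iter()
        .filter_map(|f| match f {
            PartialTimeFilter::High(Bound::Included(t))
            | PartialTimeFilter::High(Bound::Excluded(t))
            | PartialTimeFilter::Eq(t) => Some(*t),
            _ => None,
        })
        .collect();
    uppers.is_empty() || uppers.iter().any(|t| *t >= threshold)
}

/// Edge cases as (description, filters, expected result) for a fixed `now`.
pub fn edge_cases(now: i64) -> Vec<(&'static str, Vec<PartialTimeFilter>, bool)> {
    let threshold = now - WINDOW_SECS;
    vec![
        (
            "high bound exactly on the threshold",
            vec![PartialTimeFilter::High(Bound::Included(threshold))],
            true,
        ),
        (
            "high bound one second before the threshold",
            vec![PartialTimeFilter::High(Bound::Included(threshold - 1))],
            false,
        ),
        (
            "old low bound mixed with a recent high bound",
            vec![
                PartialTimeFilter::Low(Bound::Included(now - 3600)),
                PartialTimeFilter::High(Bound::Excluded(now - 60)),
            ],
            true,
        ),
        (
            "no upper bound at all",
            vec![PartialTimeFilter::Low(Bound::Included(now - 3600))],
            true,
        ),
        (
            "equality on an old timestamp",
            vec![PartialTimeFilter::Eq(now - 3600)],
            false,
        ),
    ]
}

/// Names of the cases whose actual result differs from the expected one.
pub fn failing_cases(now: i64) -> Vec<&'static str> {
    edge_cases(now)
        .into_iter()
        .filter(|(_, filters, expected)| is_within_staging_window(filters, now) != *expected)
        .map(|(name, _, _)| name)
        .collect()
}
```

A unit test could then assert that failing_cases returns an empty list, and new scenarios can be added as rows without touching the test body.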

🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)

733-735: Documentation could be more descriptive.

The current documentation doesn't fully explain the function's purpose and behavior. Consider expanding it to include:

  • The purpose of the 5-minute window
  • The impact on data retrieval
  • Examples of when data will/won't be considered

Apply this diff to improve the documentation:

-/// We should consider data in staging for queries concerning a time period,
-/// ending within 5 minutes from now. e.g. If current time is 5
-pub fn is_within_staging_window(time_filters: &[PartialTimeFilter]) -> bool {
+/// Determines if data should be retrieved from staging based on the query's time filters.
+/// 
+/// This function checks if any of the time filters indicate a query that ends within
+/// the last 5 minutes. This ensures that recent data, which might still be in the
+/// staging area and not yet written to permanent storage, is included in query results.
+/// 
+/// # Arguments
+/// * `time_filters` - A slice of time-based filters from the query
+/// 
+/// # Returns
+/// * `true` if any filter's upper bound is within the last 5 minutes or if there's no upper bound
+/// * `false` otherwise
+/// 
+/// # Example
+/// If current time is 10:05:00:
+/// - Query ending at 10:01:00 -> returns true (within 5 minutes)
+/// - Query ending at 09:59:00 -> returns false (outside 5 minutes)
+/// - Query with no end time -> returns true (assumed current)
+pub fn is_within_staging_window(time_filters: &[PartialTimeFilter]) -> bool {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56391bd and e005ddf.

📒 Files selected for processing (2)
  • src/query/stream_schema_provider.rs (4 hunks)
  • src/utils/arrow/flight.rs (2 hunks)
🔇 Additional comments (3)
src/utils/arrow/flight.rs (2)

23-23: LGTM! Import changes align with the new timestamp handling approach.

The imports reflect the shift from include_now to the new is_within_staging_window function, which checks whether a query's time range ends within the five-minute staging window.


134-136: LGTM! Improved timestamp handling logic.

The changes correctly implement the new approach for handling recent data that may still be in staging by:

  1. Using extract_primary_filter to process time filters
  2. Using is_within_staging_window to check if data falls within the staging window
src/query/stream_schema_provider.rs (1)

829-829: LGTM! Appropriate visibility change for extract_primary_filter.

Making the function public is necessary as it's now used by the send_to_ingester function in flight.rs.

@de-sh de-sh changed the title from "fix: query staging when concerned with the past 5 minutes" to "fix: query staging(in-mem) when concerned with the past 5 minutes" Feb 19, 2025
@nitisht nitisht merged commit 3e02f29 into parseablehq:main Feb 19, 2025
14 checks passed
@de-sh de-sh deleted the fix-curr branch February 21, 2025 06:35
@coderabbitai coderabbitai bot mentioned this pull request Apr 15, 2025
2 participants