
feat: add new mv for logs page chart #2723

Merged · merged 17 commits into main from mv_for_logs_chart on Dec 10, 2024

Conversation

@ogzhanolguncu (Contributor) commented Dec 9, 2024

What does this PR do?

This PR adds minutely, hourly, and daily materialized views (MVs) for logs. We'll extend these in the future to support more granular filters.
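For reference, the hourly variant follows roughly this shape (a sketch reconstructed from the schema files as excerpted in the review below; the exact database names and column order may differ from the merged SQL):

CREATE MATERIALIZED VIEW metrics.api_requests_per_hour_mv_v1
TO metrics.api_requests_per_hour_v1
AS
SELECT
    workspace_id,
    path,
    response_status,
    host,
    method,
    -- raw requests store the timestamp as unix milliseconds; truncate it to the hour bucket
    toStartOfHour(fromUnixTimestamp64Milli(time)) AS time,
    count(*) AS count
FROM default.raw_api_requests_v1   -- database prefix is an assumption; the review only names the table
GROUP BY
    workspace_id,
    path,
    response_status,
    host,
    method,
    time;

The minutely and daily views are identical apart from toStartOfMinute and toStartOfDay.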

Fixes # (issue)

If there is not an issue for this, please create one first. This is used for tracking purposes and also helps us understand why this PR exists.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • Enhancement (small improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Test A
  • Test B

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary

Summary by CodeRabbit

  • New Features

    • Added a new step to install the goose tool in the GitHub Actions workflow.
    • Introduced a "Clickhouse Tests" section in the documentation detailing local test setup.
    • Enhanced logging functionality with new time-series data retrieval methods.
    • Created new tables and materialized views for tracking API request metrics on various time intervals (minute, hour, day).
  • Bug Fixes

    • Updated the ClickHouse container management to streamline migration handling.
  • Tests

    • Added a comprehensive test suite for validating timeseries data aggregation.
  • Chores

    • Updated dependencies and modified Dockerfile for improved base image.

vercel bot commented Dec 9, 2024

The latest updates on your projects:

Name         Status    Updated (UTC)
dashboard    ✅ Ready  Dec 10, 2024 1:08pm
engineering  ✅ Ready  Dec 10, 2024 1:08pm
play         ✅ Ready  Dec 10, 2024 1:08pm
www          ✅ Ready  Dec 10, 2024 1:08pm


changeset-bot bot commented Dec 9, 2024

⚠️ No Changeset found

Latest commit: f545f52

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



coderabbitai bot commented Dec 9, 2024

Caution

Review failed

The head commit changed during the review from 2893164 to f545f52.




github-actions bot commented Dec 9, 2024

Thank you for following the naming conventions for pull request titles! 🙏

@ogzhanolguncu marked this pull request as ready for review December 10, 2024 12:52

pullflow-com bot commented Dec 10, 2024

From Oz ‣ Andreas Thomas I didn't wanna push everything into a single PR. So let's merge this first, then move on to UI.


pullflow-com bot commented Dec 10, 2024

From Andreas Thomas ‣ Just got back from the gym
Will look asap


pullflow-com bot commented Dec 10, 2024

From Oz ‣ take your time np


coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (8)
internal/clickhouse/src/logs-timeseries.test.ts (3)

71-72: Simplify startTime assignment by removing unnecessary Date object creation

The use of new Date() followed by .getTime() is redundant since Date.now() already provides a timestamp in milliseconds. You can subtract the interval directly from Date.now().

Apply this diff to simplify the code:

- startTime: new Date(Date.now() - 24 * 60 * 60 * 1000).getTime(), // 24 hours ago
+ startTime: Date.now() - 24 * 60 * 60 * 1000, // 24 hours ago

81-82: Simplify startTime assignment by removing unnecessary Date object creation

Same as above for the hourly timeseries aggregation.

Apply this diff:

- startTime: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000).getTime(), // 7 days ago
+ startTime: Date.now() - 7 * 24 * 60 * 60 * 1000, // 7 days ago

91-92: Simplify startTime assignment by removing unnecessary Date object creation

Same as above for the daily timeseries aggregation.

Apply this diff:

- startTime: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).getTime(), // 30 days ago
+ startTime: Date.now() - 30 * 24 * 60 * 60 * 1000, // 30 days ago
internal/clickhouse/schema/045_create_api_requests_per_minute_mv_v1.sql (1)

13-19: Consider adding a PARTITION BY clause for better performance

The current implementation groups by multiple columns but doesn't specify partitioning. For time-series data, partitioning by time ranges (e.g., by day) could significantly improve query performance and data management.

 GROUP BY
     workspace_id,
     path,
     response_status,
     host,
     method,
-    time;
+    time
+PARTITION BY toYYYYMM(time);
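Worth noting: in ClickHouse DDL, PARTITION BY is declared on the target table's engine clause rather than inside the materialized view's SELECT, so adopting the suggestion would look roughly like this on the underlying table (a sketch reusing the column list and ordering key quoted later in this review, not code from the PR):

CREATE TABLE metrics.api_requests_per_minute_v1 (
    time DateTime,
    workspace_id String,
    path String,
    response_status Int,
    host String,
    method LowCardinality(String),
    count Int64
) ENGINE = SummingMergeTree()
-- partition by calendar month so whole months can be dropped or archived cheaply
PARTITION BY toYYYYMM(time)
ORDER BY (workspace_id, time, host, path, response_status, method);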
internal/clickhouse/schema/046_create_api_requests_per_day_v1.sql (1)

2-12: Consider adding column constraints and TTL

The table definition could benefit from:

  1. NOT NULL constraints where appropriate
  2. TTL (Time To Live) policy for automatic data cleanup
  3. Default values for mandatory fields
 CREATE TABLE metrics.api_requests_per_day_v1 (
-    time DateTime,
-    workspace_id String,
+    time DateTime NOT NULL,
+    workspace_id String NOT NULL,
     path String,
     response_status Int,
     host String,
     method LowCardinality(String),
     count Int64
-) ENGINE = SummingMergeTree()
+) ENGINE = SummingMergeTree()
+TTL time + INTERVAL 90 DAY DELETE
internal/clickhouse/schema/042_create_api_requests_per_hour_v1.sql (2)

2-21: Consider adding partitioning and TTL for data lifecycle management.

The table schema looks good, but consider these enhancements for production readiness:

  1. Add partitioning by time for efficient data management:
     PARTITION BY toYYYYMM(time)
  2. Define a TTL policy for automatic data cleanup:
     TTL time + INTERVAL 3 MONTH DELETE
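Combined, both suggestions land in the target table's engine clause, roughly as follows (a sketch; the 3-month retention window is the example above, not a project decision):

ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (workspace_id, time, host, path, response_status, method)
-- expire rows three months after their hour bucket
TTL time + INTERVAL 3 MONTH DELETE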

8-10: Consider adding a CHECK constraint for HTTP methods.

Since the method column expects specific HTTP methods, add a constraint to ensure data integrity:

method LowCardinality(String),
CONSTRAINT valid_method CHECK method IN ('GET', 'POST', 'PUT', 'DELETE', 'PATCH', 'HEAD', 'OPTIONS')
internal/clickhouse/src/index.ts (1)

5-10: LGTM! Consider grouping related imports

The imports are logically organized, though consider grouping all logs-related functions together for better maintainability.

-import {
-  getDailyLogsTimeseries,
-  getHourlyLogsTimeseries,
-  getLogs,
-  getMinutelyLogsTimeseries,
-} from "./logs";
+import {
+  getLogs,
+  // Timeseries functions
+  getDailyLogsTimeseries,
+  getHourlyLogsTimeseries,
+  getMinutelyLogsTimeseries,
+} from "./logs";
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2364e6b and f545f52.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (14)
  • .github/actions/install/action.yaml (1 hunks)
  • apps/engineering/content/docs/contributing/testing.mdx (1 hunks)
  • internal/clickhouse/Dockerfile (1 hunks)
  • internal/clickhouse/package.json (1 hunks)
  • internal/clickhouse/schema/042_create_api_requests_per_hour_v1.sql (1 hunks)
  • internal/clickhouse/schema/043_create_api_requests_per_hour_mv_v1.sql (1 hunks)
  • internal/clickhouse/schema/044_create_api_requests_per_minute_v1.sql (1 hunks)
  • internal/clickhouse/schema/045_create_api_requests_per_minute_mv_v1.sql (1 hunks)
  • internal/clickhouse/schema/046_create_api_requests_per_day_v1.sql (1 hunks)
  • internal/clickhouse/schema/047_create_api_requests_per_day_mv_v1.sql (1 hunks)
  • internal/clickhouse/src/index.ts (2 hunks)
  • internal/clickhouse/src/logs-timeseries.test.ts (1 hunks)
  • internal/clickhouse/src/logs.ts (2 hunks)
  • internal/clickhouse/src/testutil.ts (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • internal/clickhouse/Dockerfile
🔇 Additional comments (9)
internal/clickhouse/src/logs.ts (1)

97-218: Well-structured implementation of timeseries queries

The code for the timeseries queries and associated parameter schemas is clean, efficient, and follows best practices. The use of parameterized queries enhances security, and the structured approach improves maintainability.
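For readers unfamiliar with ClickHouse query parameters: the {name: Type} placeholders are bound by the client rather than string-interpolated, which is what makes these queries safe. A minimal sketch of the shape (identifiers are illustrative, not copied from logs.ts):

SELECT count(*) AS total
FROM default.raw_api_requests_v1
WHERE workspace_id = {workspaceId: String}   -- bound as a typed parameter by the client
  AND time >= {startTime: Int64}             -- unix milliseconds, matching the raw table
  AND time <= {endTime: Int64};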

internal/clickhouse/schema/047_create_api_requests_per_day_mv_v1.sql (1)

13-19: Confirm the use of alias time in GROUP BY clause

Ensure that using the alias time in the GROUP BY clause works as intended in ClickHouse. In some SQL dialects, the full expression must be used instead of the alias. Verifying this will prevent potential aggregation issues.

Run the following script to check if the materialized view aggregates data correctly:

✅ Verification successful

Using alias time in GROUP BY is correct for ClickHouse

The materialized view definition shows that time is properly defined as an alias using AS in the SELECT clause: toStartOfDay(fromUnixTimestamp64Milli(time)) AS time. ClickHouse explicitly supports using column aliases in GROUP BY clauses, which is one of its distinguishing features from some other SQL dialects. The usage here is correct and will work as intended.
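A quick standalone check of the general behaviour, runnable in clickhouse-client and unrelated to the PR's tables:

SELECT
    number % 3 AS bucket,   -- alias defined in the SELECT list
    count(*) AS count
FROM numbers(10)
GROUP BY bucket;            -- ClickHouse resolves "bucket" to the alias above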

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the grouping in the materialized view

# Test: Query the materialized view and check for expected outputs
echo "SELECT time, count FROM metrics.api_requests_per_day_v1 LIMIT 10;" | clickhouse-client

Length of output: 202


Script:

#!/bin/bash
# Let's examine the full materialized view definition and any related table definitions
# to understand the context of the GROUP BY clause

# Check the materialized view definition
fd "047_create_api_requests_per_day_mv_v1.sql" | xargs cat

# Look for related table definitions that might show column definitions
fd -e sql | grep -l "api_requests" | xargs cat

Length of output: 648

internal/clickhouse/schema/043_create_api_requests_per_hour_mv_v1.sql (1)

13-19: Confirm the use of alias time in GROUP BY clause

As with the daily materialized view, ensure that using the alias time in the GROUP BY clause functions correctly in ClickHouse to avoid any grouping inconsistencies.

Run the following script to check if the materialized view aggregates data correctly:

✅ Verification successful

Using time alias in GROUP BY is correct and consistent

After examining the materialized view definition and related tables, the usage of time in the GROUP BY clause is correct. The time column is properly defined using toStartOfHour(fromUnixTimestamp64Milli(time)) in the SELECT clause before being used in GROUP BY. This pattern is consistently used across all time-based materialized views (minute, hour, day) in the codebase, and ClickHouse supports using column aliases in GROUP BY clauses.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the grouping in the materialized view

# Test: Query the materialized view and check for expected outputs
echo "SELECT time, count FROM metrics.api_requests_per_hour_v1 LIMIT 10;" | clickhouse-client

Length of output: 203


Script:

#!/bin/bash
# Let's examine the materialized view definition and its source table schema
fd "043_create_api_requests_per_hour_mv_v1.sql" | xargs cat

# Also check for any related table definitions
fd -e sql | grep -i "api_requests" | xargs cat

Length of output: 4345

internal/clickhouse/schema/045_create_api_requests_per_minute_mv_v1.sql (1)

9-10: Verify time zone handling in timestamp conversion

The fromUnixTimestamp64Milli conversion followed by toStartOfMinute could be affected by server timezone settings. Consider explicitly specifying the timezone to ensure consistent results across different environments.
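If pinning is desired, fromUnixTimestamp64Milli accepts an explicit timezone argument, which makes the bucket boundary independent of the server setting. A minimal runnable sketch with an arbitrary millisecond timestamp:

-- pin the conversion to UTC so minute buckets are identical across environments
SELECT toStartOfMinute(fromUnixTimestamp64Milli(toInt64(1733750000000), 'UTC')) AS minute_bucket;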

internal/clickhouse/package.json (1)

17-17: Verify compatibility with Node.js version requirements

Execa v9.3.0 requires Node.js 16.0.0 or newer. Please ensure this aligns with the project's Node.js version requirements.

internal/clickhouse/schema/046_create_api_requests_per_day_v1.sql (1)

13-21: Verify consistency of column types with other metrics tables

Please ensure that the data types and ordering key structure are consistent with related tables (hourly and minutely metrics).

✅ Verification successful

Column types and ordering structure are consistent across all metrics tables

The verification shows that all three tables (per_minute_v1, per_hour_v1, and per_day_v1) have identical:

  • Column names and data types
  • Ordering key structure (workspace_id, time, host, path, response_status, method)
  • Table engine (SummingMergeTree)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check column types across all metrics tables
rg -A 20 "CREATE TABLE metrics\.api_requests" --type sql

Length of output: 5618

apps/engineering/content/docs/contributing/testing.mdx (1)

Line range hint 1-1: Verify consistency with related materialized views.

These tables will be populated by materialized views mentioned in the AI summary. Please ensure:

  1. The schema matches the source tables
  2. The aggregation logic in MVs aligns with these target tables
✅ Verification successful

The previous output showed the materialized view definitions but we need to verify the source and target table schemas to ensure consistency. This will help us confirm the schema alignment between the source tables and materialized views.


Schema consistency verified across materialized views and tables

The materialized views are correctly configured with matching schemas:

  • Source table raw_api_requests_v1 contains all raw fields
  • Target tables (per_hour, per_minute, per_day) have identical schemas with aggregated fields
  • MVs correctly transform the data with appropriate time-based grouping using toStartOfHour, toStartOfMinute, and toStartOfDay
  • All tables maintain consistent field types (e.g., LowCardinality(String) for method)
  • Aggregation logic in MVs properly counts requests while preserving all dimension fields
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for related materialized view definitions
rg -A 20 "CREATE MATERIALIZED VIEW.*api_requests" internal/clickhouse/schema/

# Verify source table schemas match these target tables
rg -A 20 "CREATE TABLE.*api_requests.*source" internal/clickhouse/schema/

Length of output: 5943


Script:

#!/bin/bash
# Let's check for the target tables and raw source table definitions
rg -A 20 "CREATE TABLE.*api_requests" internal/clickhouse/schema/

# Also check for any other related schema files
fd ".*api_requests.*\.sql" internal/clickhouse/schema/

Length of output: 8114

internal/clickhouse/src/testutil.ts (1)

48-50: Verify localhost accessibility in CI environment

The DSN now uses localhost instead of the container name. While this works locally, ensure it's reliable in CI environments where networking setup might differ.

✅ Verification successful

Using localhost in the DSN is correct and reliable

The DSN construction using localhost is correct and reliable because:

  1. The container is started using testcontainers which handles port mapping automatically
  2. The mapped port is obtained using container.getMappedPort(9000), ensuring the correct local port is used
  3. This approach is standard practice with testcontainers, as it manages container networking and port bindings appropriately for both local and CI environments
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if this file is used in CI tests
rg -l "ClickHouseContainer" .github/workflows/

Length of output: 46


Script:

#!/bin/bash
# Let's check where ClickHouseContainer is used and how tests are executed
rg -l "ClickHouseContainer" --type ts

# Check CI workflow files for test execution patterns
rg "npm.*test|yarn.*test|pnpm.*test" .github/workflows/

Length of output: 922


Script:

#!/bin/bash
# Let's check the test configuration to understand the test environment setup
cat vitest.integration.ts

# Check if there's any Docker Compose configuration that might handle networking
fd -e yml -e yaml docker-compose

Length of output: 147


Script:

#!/bin/bash
# Let's check the Docker Compose configuration for network setup
cat deployment/docker-compose.yaml

# Let's also check how the tests are actually using this container
rg -A 5 -B 5 "new ClickHouseContainer\(\)" internal/clickhouse/src/

Length of output: 4417


Script:

#!/bin/bash
# Let's check the test setup in testutil.ts to understand the container initialization
cat internal/clickhouse/src/testutil.ts

# Also check if there are any test configuration files in the clickhouse directory
fd -e json -e js -e ts config internal/clickhouse/

Length of output: 1852

internal/clickhouse/src/index.ts (1)

102-106: LGTM! Clean implementation of timeseries API

The implementation follows the established pattern and provides a clear interface for accessing logs timeseries data at different granularities.

✅ Verification successful

Implementation verified and follows best practices

The timeseries implementation is well-structured and robust:

  • Uses a factory function createTimeseriesQuerier to generate consistent queriers for different time intervals
  • Properly handles data validation using zod schemas
  • Implements clean SQL queries that aggregate logs data with success/warning/error counts
  • Maintains consistent interface across different time granularities (minute/hour/day)
  • Uses type-safe interval configurations through the INTERVALS constant
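Based on that description, a generated hourly query plausibly looks something like the following; the status buckets, source table, and parameter names are assumptions for illustration, not code copied from logs.ts:

SELECT
    toStartOfHour(fromUnixTimestamp64Milli(time)) AS x,
    countIf(response_status >= 200 AND response_status < 400) AS success,
    countIf(response_status >= 400 AND response_status < 500) AS warning,
    countIf(response_status >= 500) AS error
FROM default.raw_api_requests_v1
WHERE workspace_id = {workspaceId: String}
  AND time >= {startTime: Int64}
  AND time <= {endTime: Int64}
GROUP BY x
ORDER BY x ASC;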
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the implementation of the new timeseries functions
rg -A 10 "export (async )?function (getMinutelyLogsTimeseries|getHourlyLogsTimeseries|getDailyLogsTimeseries)"

Length of output: 112


Script:

#!/bin/bash
# Let's try searching without the 'export' keyword and with a broader pattern
rg -A 10 "(getMinutelyLogsTimeseries|getHourlyLogsTimeseries|getDailyLogsTimeseries)"

# Also search for timeseries-related function declarations using ast-grep
ast-grep --pattern 'function $_(querier) {
  $$$
}'

Length of output: 2026


Script:

#!/bin/bash
# Let's check the implementation of createTimeseriesQuerier since it's used to create these functions
rg -A 15 "createTimeseriesQuerier"

# Also check for the INTERVALS constant definition
rg "INTERVALS.*=.*{" -A 5

Length of output: 1519


Script:

#!/bin/bash
# Let's fix the INTERVALS search and also look for the query creation
rg "INTERVALS" -A 5

# Check the createTimeseriesQuery implementation
rg "createTimeseriesQuery" -A 10

# Look for the schema definition
rg "logsTimeseriesDataPoint" -A 5

Length of output: 3356

@chronark (Collaborator) left a comment

My only nitpick is that in the SQL files the column types are not aligned evenly, but that's okay:

x Int,
longerValue Int

vs

x           Int
longerValue Int


pullflow-com bot commented Dec 10, 2024

From Oz ‣ Andreas Thomas there is an interesting coderabbit suggestion, could you take a look at it?


pullflow-com bot commented Dec 10, 2024

From Andreas Thomas ‣ done and ignored


pullflow-com bot commented Dec 10, 2024

From Oz ‣ okay let's merge and lemme move on to chart

@ogzhanolguncu added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit fbfdf48 Dec 10, 2024
27 checks passed
@ogzhanolguncu deleted the mv_for_logs_chart branch December 10, 2024 13:32