feat: add augurs-clustering crate with DBSCAN algorithm #100

Merged: sd2k merged 7 commits into main from clustering on Sep 4, 2024

@sd2k (Collaborator) commented Jul 19, 2024

This PR adds a new crate, augurs-clustering, which adds time series clustering functionality using the DBSCAN algorithm.

Summary by CodeRabbit

  • New Features

    • Introduced a DBSCAN clustering algorithm with documentation and benchmarks.
    • Added a new module for clustering, providing Python bindings for the DBSCAN algorithm and flexibility in input formats.
  • Documentation

    • Updated README with information about the new augurs-clustering module and its functionality.
    • Added CHANGELOG for tracking changes in the augurs-clustering crate.
  • Chores

    • Simplified npm publishing process by removing unnecessary tasks related to the Grafana Labs registry.

@sd2k changed the title from "clustering" to "feat: add augurs-clustering crate with DBSCAN algorithm" Jul 19, 2024
@sd2k changed the base branch from main to dtw July 19, 2024 13:34
github-actions bot (Contributor) commented Jul 19, 2024

🐰Bencher

Report: Wed, August 21, 2024 at 19:35:57 UTC
Project: augurs
Branch: 100/merge
Testbed: ubuntu-latest

| Benchmark | Status | Latency, ns (Δ%) | Latency upper boundary, ns (%) |
|---|---|---|---|
| auto_fit/air_passengers | ✅ | 1,893,600.00 (-0.91%) | 1,965,151.35 (96.36%) |
| dbscan | ✅ | 1,659,300.00 (+17.75%) | 1,796,559.48 (92.36%) |
| distance_euclidean/None | ✅ | 202,120.00 (-0.01%) | 202,704.63 (99.71%) |
| distance_euclidean/Some(10) | ✅ | 15,607.00 (-2.05%) | 17,127.70 (91.12%) |
| distance_euclidean/Some(2) | ✅ | 3,587.10 (+2.41%) | 3,588.48 (99.96%) |
| distance_euclidean/Some(20) | ✅ | 31,401.00 (-0.39%) | 31,951.78 (98.28%) |
| distance_euclidean/Some(5) | ✅ | 7,783.00 (-0.57%) | 7,920.91 (98.26%) |
| distance_euclidean/Some(50) | ✅ | 75,451.00 (-0.20%) | 76,535.21 (98.58%) |
| distance_matrix_euclidean/window: Some(10), parallelize: false | ✅ | 2,994,900,000.00 (-0.21%) | 3,016,233,475.48 (99.29%) |
| distance_matrix_euclidean/window: Some(10), parallelize: true | ✅ | 2,994,900,000.00 (+31.90%) | 3,760,216,959.23 (79.65%) |
| distance_matrix_euclidean/window: Some(2), parallelize: false | ✅ | 537,280,000.00 (+0.30%) | 543,438,068.04 (98.87%) |
| distance_matrix_euclidean/window: Some(2), parallelize: true | ✅ | 536,900,000.00 (+27.25%) | 651,343,075.99 (82.43%) |
| fit/air_passengers | ✅ | 423,880.00 (-2.18%) | 448,634.01 (94.48%) |
| forecast/air_passengers | ✅ | 1,360.30 (-2.53%) | 1,466.70 (92.75%) |
| season_eight | ✅ | 22,023.00 (-0.48%) | 22,858.72 (96.34%) |
| vic_elec | ✅ | 39,214,000.00 (+0.62%) | 39,905,454.04 (98.27%) |


coderabbitai bot (Contributor) commented Jul 19, 2024

Warning

Rate limit exceeded

@sd2k has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 18 minutes and 39 seconds before requesting another review.

Commits

Files that changed from the base of the PR and between e1cf67e and 6c4ffa3.

Walkthrough

This update introduces a new augurs-clustering crate, which implements the DBSCAN clustering algorithm for time series. It also updates the documentation, adds a DBSCAN benchmark, runs the benchmarks with --all-features, and exposes the algorithm through the JavaScript and Python bindings.

Changes

| Files | Change Summary |
|---|---|
| .github/workflows/run_benchmarks.yaml | Modified benchmarking command to include --all-features. |
| Cargo.toml (for multiple crates) | Added augurs-clustering as a new dependency in the workspace. |
| README.md | Updated with new entry for augurs-clustering, detailing its purpose and status. |
| crates/augurs-clustering/* | Introduced new crate with various files, including CHANGELOG.md, implementation of DBSCAN, and benchmarks. |
| crates/augurs-js/* | Added clustering module and bindings for the DBSCAN algorithm. |
| crates/pyaugurs/* | Introduced Python bindings for clustering algorithms, including new Dbscan class. |
| justfile | Removed npm-related tasks for Grafana Labs registry, simplifying the npm publishing process. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Clustering
    participant DBSCAN
    participant DistanceMatrix

    User->>Clustering: Create Dbscan instance
    Clustering->>DBSCAN: Initialize with parameters
    User->>Clustering: Call fit with DistanceMatrix
    Clustering->>DistanceMatrix: Process input distance matrix
    DBSCAN->>DistanceMatrix: Perform clustering
    DBSCAN-->>Clustering: Return cluster assignments
    Clustering-->>User: Output cluster results

Poem

🐰 In the meadow where data flows,
A rabbit hops where the clustering grows.
With DBSCAN now in our paws,
We find the clusters without a pause.
Hooray for changes, let’s take a chance,
To analyze data, and make it dance! 🌼


@sd2k force-pushed the clustering branch 2 times, most recently from a030c55 to c39d876 August 4, 2024 02:34
Base automatically changed from dtw to main August 21, 2024 15:43
@sd2k marked this pull request as ready for review August 21, 2024 18:58
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (5)
crates/augurs-clustering/CHANGELOG.md (1)

9-9: Consider using "Others" instead of "Other".

The LanguageTool suggests that the plural noun "others" might fit better in this context.

 ### Other
- Add `augurs-clustering` crate
+ - Add `augurs-clustering` crate
Tools
LanguageTool

[misspelling] ~9-~9: It seems that the plural noun “others” fits better in this context.
Context: ...pec/v2.0.0.html). ## [Unreleased] ### Other - Add augurs-clustering crate

(OTHER_OTHERS)

crates/augurs-clustering/benches/dbscan.rs (1)

18-19: Consider parameterizing the DBSCAN parameters.

The parameters 10.0 and 3 are hardcoded. Consider parameterizing them to allow flexibility in benchmarking different configurations.

let eps = 10.0;
let min_points = 3;
Dbscan::new(eps, min_points).fit(&distance_matrix);
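
One way to act on this suggestion is to sweep a small grid of parameters rather than a single hardcoded pair. The following is a minimal, dependency-free sketch: `Case`, `cases`, and `time_cases` are hypothetical names that do not come from augurs, and `std::time::Instant` only stands in for whatever benchmark harness the crate actually uses.

```rust
use std::time::{Duration, Instant};

/// Hypothetical parameter set for one benchmark case (not an augurs type).
#[derive(Debug, Clone, Copy)]
pub struct Case {
    pub eps: f64,
    pub min_points: usize,
}

/// Builds the Cartesian product of the supplied parameter values.
pub fn cases(eps_values: &[f64], min_points_values: &[usize]) -> Vec<Case> {
    eps_values
        .iter()
        .flat_map(|&eps| {
            min_points_values
                .iter()
                .map(move |&min_points| Case { eps, min_points })
        })
        .collect()
}

/// Runs `f` once per case and records how long each call took.
/// A real benchmark harness would repeat each case many times instead.
pub fn time_cases<F: FnMut(Case)>(cases: &[Case], mut f: F) -> Vec<(Case, Duration)> {
    cases
        .iter()
        .map(|&case| {
            let start = Instant::now();
            f(case);
            (case, start.elapsed())
        })
        .collect()
}
```

In the benchmark itself, the closure passed to `time_cases` would wrap a call such as `Dbscan::new(case.eps, case.min_points).fit(&distance_matrix)`.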
crates/augurs-clustering/README.md (3)

4-4: Add a comma for clarity.

Consider adding a comma after "So far" for better readability.

Use this diff to improve the sentence:

 This crate contains algorithms for clustering time series.
-So far only DBSCAN is implemented, and the distance matrix must be passed directly.
+So far, only DBSCAN is implemented, and the distance matrix must be passed directly.
Tools
LanguageTool

[typographical] ~4-~4: It seems that a comma is missing.
Context: ...algorithms for clustering time series. So far only DBSCAN is implemented, and the dis...

(SO_COMMA)


30-30: Correct the phrase for clarity.

The phrase "based heavily on to the implementation" should be corrected to "based heavily on the implementation."

Use this diff to correct the phrase:

 This implementation based heavily on to the implementation in [`linfa-clustering`] and [`scikit-learn`].
-This implementation based heavily on to the implementation in [`linfa-clustering`] and [`scikit-learn`].
+This implementation is based heavily on the implementation in [`linfa-clustering`] and [`scikit-learn`].
Tools
LanguageTool

[uncategorized] ~30-~30: “to the” seems less likely than “the”.
Context: ...s This implementation based heavily on to the implementation in [linfa-clustering] ...

(AI_HYDRA_LEO_CP_TO_THE_THE)


31-31: Correct the verb agreement.

The verb "is" should be changed to "are" to match the plural subject "these."

Use this diff to correct the verb agreement:

 The main difference between these is that we operate directly on the distance matrix rather than calculating
-The main difference between these is that we operate directly on the distance matrix rather than calculating
+The main difference between these are that we operate directly on the distance matrix rather than calculating
Tools
LanguageTool

[grammar] ~31-~31: The verb ‘is’ is singular. Did you mean: “this is” or “these are”?
Context: ...it-learn`]. The main difference between these is that we operate directly on the distanc...

(SINGULAR_VERB_AFTER_THESE_OR_THOSE)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d7a3dbb and e1cf67e.

Files ignored due to path filters (1)
  • crates/augurs-clustering/data/dist.csv is excluded by !**/*.csv
Files selected for processing (18)
  • .github/workflows/run_benchmarks.yaml (1 hunks)
  • Cargo.toml (1 hunks)
  • README.md (2 hunks)
  • crates/augurs-clustering/CHANGELOG.md (1 hunks)
  • crates/augurs-clustering/Cargo.toml (1 hunks)
  • crates/augurs-clustering/LICENSE-APACHE (1 hunks)
  • crates/augurs-clustering/LICENSE-MIT (1 hunks)
  • crates/augurs-clustering/README.md (1 hunks)
  • crates/augurs-clustering/benches/dbscan.rs (1 hunks)
  • crates/augurs-clustering/src/lib.rs (1 hunks)
  • crates/augurs-js/Cargo.toml (1 hunks)
  • crates/augurs-js/src/clustering.rs (1 hunks)
  • crates/augurs-js/src/dtw.rs (1 hunks)
  • crates/augurs-js/src/lib.rs (1 hunks)
  • crates/pyaugurs/Cargo.toml (1 hunks)
  • crates/pyaugurs/src/clustering.rs (1 hunks)
  • crates/pyaugurs/src/lib.rs (2 hunks)
  • justfile (1 hunks)
Files skipped from review due to trivial changes (4)
  • crates/augurs-clustering/Cargo.toml
  • crates/augurs-clustering/LICENSE-APACHE
  • crates/augurs-clustering/LICENSE-MIT
  • justfile
Additional comments not posted (21)
crates/pyaugurs/Cargo.toml (1)

19-19: Dependency addition is consistent with workspace structure.

The addition of augurs-clustering.workspace = true aligns with the integration of the new crate and the workspace configuration.

crates/augurs-js/Cargo.toml (1)

22-22: Addition of augurs-clustering dependency approved.

The inclusion of augurs-clustering as a dependency aligns with the PR objectives of integrating clustering functionality into the project.

crates/augurs-js/src/clustering.rs (1)

1-49: Implementation of DBSCAN bindings approved.

The implementation of the DBSCAN algorithm with JavaScript bindings is well-structured and aligns with the PR objectives. The use of serde, tsify_next, and wasm_bindgen is appropriate for handling data and exposing Rust functionality to JavaScript.

However, ensure that the DistanceMatrix class is correctly implemented and compatible with the DBSCAN algorithm.

Run the following script to verify the implementation of DistanceMatrix:

Verification successful

DistanceMatrix Implementation Verified

The DistanceMatrix struct is correctly implemented and compatible with the DBSCAN algorithm. It provides necessary methods and is used consistently across the codebase.

  • DistanceMatrix is defined in crates/augurs-js/src/dtw.rs and wraps augurs_core::DistanceMatrix.
  • Methods like inner() ensure compatibility with DBSCAN.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `DistanceMatrix`.

# Test: Search for the implementation of `DistanceMatrix`. Expect: Correct implementation and compatibility with DBSCAN.
ast-grep --lang rust --pattern 'struct DistanceMatrix { $$$ }'

Length of output: 64


Script:

#!/bin/bash
# Search for the definition of the `DistanceMatrix` struct in the codebase.
ast-grep --lang rust --pattern 'struct DistanceMatrix { $$$ }'

# Search for any implementations of methods or traits for `DistanceMatrix`.
ast-grep --lang rust --pattern 'impl DistanceMatrix { $$$ }'

# Search for any usages of `DistanceMatrix` to understand its context and compatibility.
rg 'DistanceMatrix' -A 5

Length of output: 23945

Cargo.toml (1)

25-25: Dependency Addition Approved.

The addition of augurs-clustering as a dependency aligns with the PR objectives and enhances the project's functionality.

.github/workflows/run_benchmarks.yaml (1)

40-40: Benchmark Command Enhancement Approved.

The inclusion of --all-features in the benchmarking command is a beneficial change, ensuring a comprehensive performance assessment.

crates/augurs-js/src/lib.rs (1)

17-17: New Module Addition Approved.

The addition of the clustering module expands the library's capabilities and aligns with the PR objectives.

crates/pyaugurs/src/clustering.rs (3)

11-19: LGTM: Flexible input representation.

The InputDistanceMatrix enum provides a flexible way to represent distance matrices, supporting lists, numpy arrays, and augurs core distance matrices.


21-41: LGTM: Robust conversion implementation.

The TryFrom implementation effectively converts different input types into an augurs_core::DistanceMatrix, with proper error handling.
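
The pattern described in these two comments (an enum of accepted inputs plus a fallible conversion) can be sketched without any Python types. Everything below is an assumption for illustration: `DistanceMatrixSketch` stands in for `augurs_core::DistanceMatrix`, and `InputSketch`, its variants, and `NotSquare` are hypothetical names, not the identifiers used in pyaugurs.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Stand-in for `augurs_core::DistanceMatrix` (assumed to be a square matrix).
#[derive(Debug, Clone, PartialEq)]
pub struct DistanceMatrixSketch(Vec<Vec<f64>>);

impl DistanceMatrixSketch {
    pub fn rows(&self) -> &[Vec<f64>] {
        &self.0
    }
}

/// Hypothetical error returned for malformed input.
#[derive(Debug, PartialEq)]
pub struct NotSquare;

impl fmt::Display for NotSquare {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "distance matrix must be square")
    }
}

impl std::error::Error for NotSquare {}

/// The accepted input shapes, modelled with plain Rust types.
pub enum InputSketch {
    /// Nested rows, roughly what a list of lists would map to.
    Nested(Vec<Vec<f64>>),
    /// Row-major buffer with side length `n`, standing in for a 2-D array.
    Flat { data: Vec<f64>, n: usize },
    /// A matrix that has already been validated.
    Matrix(DistanceMatrixSketch),
}

impl TryFrom<InputSketch> for DistanceMatrixSketch {
    type Error = NotSquare;

    fn try_from(input: InputSketch) -> Result<Self, Self::Error> {
        let rows: Vec<Vec<f64>> = match input {
            InputSketch::Matrix(m) => return Ok(m),
            InputSketch::Nested(rows) => rows,
            InputSketch::Flat { data, n } => {
                // Reject empty buffers and buffers that are not n * n long.
                if n == 0 || data.len() != n * n {
                    return Err(NotSquare);
                }
                data.chunks(n).map(|row| row.to_vec()).collect()
            }
        };
        // Every row must be as long as the number of rows.
        let n = rows.len();
        if rows.iter().any(|row| row.len() != n) {
            return Err(NotSquare);
        }
        Ok(Self(rows))
    }
}
```

Funnelling every variant through one `TryFrom` keeps the validation in a single place, so the clustering code only ever sees a well-formed matrix.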


50-92: LGTM: Well-structured Dbscan class.

The Dbscan class is well-implemented, providing clear methods for initialization and clustering. Ensure that the integration with the rest of the codebase is verified.

Run the following script to verify the integration:

Verification successful

Dbscan class is well-integrated across the codebase.

The Dbscan class is utilized in various modules, including tests and benchmarks, and is part of both Python and JavaScript bindings. This indicates that it is effectively integrated and its functionality is being verified across different environments.

  • Locations:
    • crates/augurs-clustering/src/lib.rs: Implementation and tests.
    • crates/pyaugurs/src/clustering.rs: Python bindings.
    • crates/augurs-js/src/clustering.rs: JavaScript bindings.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of the `Dbscan` class in the codebase.

# Test: Search for the usage of the `Dbscan` class. Expect: Proper integration and usage.
rg --type rust -A 5 $'Dbscan'

Length of output: 11195

crates/pyaugurs/src/lib.rs (2)

17-17: LGTM: New clustering module added.

The clustering module has been successfully added, enhancing the library's functionality.


117-117: LGTM: Dbscan class added to Python module.

The Dbscan class is correctly added to the Python module, expanding the library's capabilities in clustering.

README.md (1)

24-24: LGTM: Documentation for augurs-clustering added.

The README update clearly describes the new augurs-clustering module, enhancing the project's documentation.

crates/augurs-js/src/dtw.rs (4)

83-84: Change to inner field type is appropriate.

The change from Vec<Vec<f64>> to augurs_core::DistanceMatrix likely enhances performance or functionality.


86-89: Addition of inner method is appropriate.

This method provides necessary encapsulation for accessing the underlying augurs_core::DistanceMatrix.


93-94: Simplification of from method is appropriate.

Directly assigning the inner field simplifies the conversion process.


100-100: Update to from method is appropriate.

Calling into_inner() on the inner field reflects the new structure and ensures proper conversion.
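
The four comments above describe a newtype wrapper around the core matrix type. A minimal sketch of that shape, assuming a stand-in `CoreMatrix` in place of `augurs_core::DistanceMatrix` and a hypothetical `MatrixWrapper` in place of the binding-side struct:

```rust
/// Stand-in for `augurs_core::DistanceMatrix`; only the methods used here are modelled.
pub struct CoreMatrix(Vec<Vec<f64>>);

impl CoreMatrix {
    pub fn new(rows: Vec<Vec<f64>>) -> Self {
        Self(rows)
    }

    /// Returns (rows, columns of the first row).
    pub fn shape(&self) -> (usize, usize) {
        (self.0.len(), self.0.first().map_or(0, |row| row.len()))
    }

    /// Consumes the matrix and returns the nested rows.
    pub fn into_inner(self) -> Vec<Vec<f64>> {
        self.0
    }
}

/// Hypothetical binding-side wrapper that owns the core matrix directly.
pub struct MatrixWrapper {
    inner: CoreMatrix,
}

impl MatrixWrapper {
    /// Borrow the wrapped matrix, e.g. to pass it to a clustering routine.
    pub fn inner(&self) -> &CoreMatrix {
        &self.inner
    }
}

impl From<CoreMatrix> for MatrixWrapper {
    fn from(inner: CoreMatrix) -> Self {
        // No copying or reshaping: the field is assigned directly.
        Self { inner }
    }
}

impl From<MatrixWrapper> for Vec<Vec<f64>> {
    fn from(wrapper: MatrixWrapper) -> Self {
        wrapper.inner.into_inner()
    }
}
```

Holding the core type rather than `Vec<Vec<f64>>` means other bindings can borrow it through `inner()` without converting back and forth.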

crates/augurs-clustering/src/lib.rs (5)

13-18: Definition of Dbscan struct is appropriate.

The fields epsilon and min_cluster_size are well-defined and relevant for the DBSCAN algorithm.


20-33: Initialization method new is appropriate.

The method correctly initializes the Dbscan struct with the provided parameters.


47-99: Implementation of fit method is robust.

The method effectively implements the DBSCAN clustering algorithm, handling clustering and noise identification.


101-111: Implementation of find_neighbours method is efficient.

The method efficiently identifies neighbors within the specified epsilon distance.
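
For readers unfamiliar with the algorithm, the `new`, `fit`, and `find_neighbours` pieces reviewed above fit together roughly as follows. This is an illustrative sketch of textbook DBSCAN over a precomputed distance matrix, not the augurs-clustering source: the `DbscanSketch` name, the `Vec<Vec<f64>>` input, the `isize` labels with `-1` for noise, and counting a point as its own neighbour are all assumptions.

```rust
/// Label for points that belong to no cluster (assumed convention).
const NOISE: isize = -1;
/// Internal marker for points that have not been visited yet.
const UNCLASSIFIED: isize = -2;

/// Illustrative DBSCAN over a precomputed distance matrix.
pub struct DbscanSketch {
    epsilon: f64,
    min_cluster_size: usize,
}

impl DbscanSketch {
    pub fn new(epsilon: f64, min_cluster_size: usize) -> Self {
        Self { epsilon, min_cluster_size }
    }

    /// Indices of all points within `epsilon` of point `i`, including `i` itself.
    fn find_neighbours(&self, i: usize, distances: &[Vec<f64>]) -> Vec<usize> {
        distances[i]
            .iter()
            .enumerate()
            .filter(|&(_, &d)| d <= self.epsilon)
            .map(|(j, _)| j)
            .collect()
    }

    /// Returns one label per point: a cluster id starting at 0, or -1 for noise.
    pub fn fit(&self, distances: &[Vec<f64>]) -> Vec<isize> {
        let mut labels = vec![UNCLASSIFIED; distances.len()];
        let mut cluster: isize = 0;
        for i in 0..distances.len() {
            if labels[i] != UNCLASSIFIED {
                continue;
            }
            let neighbours = self.find_neighbours(i, distances);
            if neighbours.len() < self.min_cluster_size {
                // Not a core point; it may still be claimed as a border point later.
                labels[i] = NOISE;
                continue;
            }
            labels[i] = cluster;
            let mut queue = neighbours;
            while let Some(j) = queue.pop() {
                if labels[j] == NOISE {
                    // Border point: joins the cluster but is not expanded.
                    labels[j] = cluster;
                    continue;
                }
                if labels[j] != UNCLASSIFIED {
                    continue;
                }
                labels[j] = cluster;
                let next = self.find_neighbours(j, distances);
                if next.len() >= self.min_cluster_size {
                    // `j` is itself a core point, so the cluster grows through it.
                    queue.extend(next);
                }
            }
            cluster += 1;
        }
        labels
    }
}
```

Because the distance matrix is supplied by the caller, the sketch never computes a distance itself; each neighbour lookup is a linear scan of one row.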


114-192: Test module is comprehensive.

The tests cover various scenarios for the DBSCAN algorithm, ensuring robustness.

crates/augurs-clustering/benches/dbscan.rs (resolved)
@sd2k merged commit 6dcc641 into main Sep 4, 2024
21 checks passed
@sd2k deleted the clustering branch September 4, 2024 12:11
This was referenced Sep 4, 2024
@sd2k mentioned this pull request Sep 22, 2024
@sd2k mentioned this pull request Oct 10, 2024