feat: fix ci and tests #174

Merged: 14 commits merged into main on Dec 16, 2024

Conversation


@smuu smuu commented Dec 13, 2024

Summary by CodeRabbit

  • Documentation

    • Updated the README for improved clarity on project description and installation instructions.
  • New Features

    • Enhanced Docker configuration for bridge and light node services with new environment variables.
    • Improved scripts for managing trusted peers in bridge and light node setups.
  • Bug Fixes

    • Adjusted connection strings in tests to connect to localhost instead of an IP address.
  • Chores

    • Added error handling and installation checks in the justfile commands.
    • Updated CI workflow to include wait steps for bridge and light nodes during integration tests.
    • Modified Dockerfiles to use a newer base image version.

Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

coderabbitai bot commented Dec 13, 2024

Walkthrough

The pull request includes updates to various documentation and configuration files. The README.md was revised for clarity and detail about the decentralized key transparency protocol, including installation instructions. Dockerfiles for the Celestia bridge, light node, and validator were updated to use a newer base image. The docker-compose.yml file was modified to enhance service configurations, and scripts for managing bridge and light nodes were improved. Additionally, the justfile and CI workflow were updated for better error handling and process management.

Changes

  • README.md: Updated project description for clarity; added hyperlink to just documentation; clarified commands for starting prover and light-client.
  • ci/Dockerfile.bridge: Updated base image from alpine:3.19.1 to alpine:3.21.0.
  • ci/Dockerfile.lightnode: Updated base image from alpine:3.19.1 to alpine:3.21.0; no other changes made.
  • ci/Dockerfile.validator: Updated base image from alpine:3.19.1 to alpine:3.21.0; no other changes made.
  • ci/docker-compose.yml: Added environment variables SKIP_AUTH and CONTAINER_NAME for bridge services; updated comments and added BRIDGE_COUNT for the light node service.
  • ci/run-bridge.sh: Introduced CONTAINER_NAME and TRUSTED_PEERS_FILE variables; added an append_trusted_peers function for managing trusted peers.
  • ci/run-lightnode.sh: Added BRIDGE_COUNT and TRUSTED_PEERS_FILE variables; updated functions to manage bridge nodes' readiness.
  • crates/tests/src/lib.rs: Updated WebSocket connection strings in CelestiaConfig for local connections.
  • justfile: Enhanced commands with error handling and tool installation checks; added shebang lines for bash execution.
  • .github/workflows/ci.yml: Added a step in the integration-test job to wait for the light node to start, filtering logs for readiness.

Possibly related PRs

Suggested reviewers

  • distractedm1nd

Poem

🐇 In the meadow where changes bloom,
The README shines with clearer room.
Dockerfiles fresh, like morning dew,
Light nodes ready, all brand new!
With trusted peers and bridges wide,
Together we hop, in code we glide! 🌼




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)
justfile (2)

71-73: Enhance error handling in integration tests

While the error message is good, we should explicitly exit with a non-zero status on test failure to ensure proper CI pipeline behavior.

  if ! cargo test -p prism-tests --lib --release --features mock_prover; then
    echo "Integration tests failed."
+   exit 1
  fi
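
A related option, consistent with the shebang lines this PR adds elsewhere in the justfile, is to run the recipe body under bash strict mode so a failing command aborts the recipe on its own. A minimal sketch of that idiom (recipe body only; the recipe name and wording are illustrative, not the repository's actual justfile):

#!/usr/bin/env bash
# Hypothetical body of a shebang-style justfile recipe.
# With strict mode, a failing `cargo test` stops the recipe and propagates
# a non-zero exit status to CI without an explicit `exit 1`.
set -euo pipefail

cargo test -p prism-tests --lib --release --features mock_prover
echo "Integration tests passed."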

174-182: Consider version pinning for cargo tools

The tool installation loop is a good addition, but consider pinning specific versions to ensure consistent behavior across different environments.

  for tool in cargo-udeps cargo-llvm-cov cargo-nextest; do \
    if ! command -v $tool > /dev/null; then \
      echo "Installing $tool..."; \
-     cargo install $tool; \
+     case $tool in \
+       cargo-udeps)    cargo install cargo-udeps --version "0.1.35" ;; \
+       cargo-llvm-cov) cargo install cargo-llvm-cov --version "0.5.31" ;; \
+       cargo-nextest)  cargo install cargo-nextest --version "0.9.63" ;; \
+     esac; \
    else \
      echo "$tool is already installed."; \
    fi; \
  done
ci/run-lightnode.sh (1)

9-9: Redundant assignment of BRIDGE_COUNT variable

The line BRIDGE_COUNT="${BRIDGE_COUNT}" assigns the variable to itself without any modification. If BRIDGE_COUNT is already set in the environment, this line is unnecessary and can be removed.

Apply this diff to remove the redundant assignment:

-BRIDGE_COUNT="${BRIDGE_COUNT}"
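
If the assignment was meant to guard against an unset value rather than being a no-op, two common idioms express that explicitly. A sketch (the default of 1 is an assumption, not taken from the script):

# Fall back to a default when BRIDGE_COUNT is unset or empty (default assumed here).
BRIDGE_COUNT="${BRIDGE_COUNT:-1}"

# Or fail fast with a clear message when the variable must come from the environment.
: "${BRIDGE_COUNT:?BRIDGE_COUNT must be set}"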
README.md (1)

42-42: Correct the hyperlink to the 'just' task runner

The hyperlink to just in the installation instructions appears to be incorrect. The URL may not direct users to the intended page.

Update the link to:

-We use [`just`](https://github.com/casey/just?tab=readme-ov-file#packages) as a task runner. Once installed, you can install the rest of the dependencies with:
+We use [`just`](https://github.com/casey/just#packages) as a task runner. Once installed, you can install the rest of the dependencies with:
crates/tests/src/lib.rs (1)

Line range hint 63-127: Consider improving test reliability and cleanup.

The test spawns multiple async tasks including an infinite transaction generation loop. Consider these improvements:

  1. Add a timeout for the entire test to prevent hanging
  2. Implement proper cleanup of spawned tasks
  3. Document or make configurable the 100-block test duration

Here's a suggested improvement:

 #[tokio::test]
+#[timeout(300000)] // 5 minute timeout
 async fn test_light_client_prover_talking() -> Result<()> {
     // ... existing setup ...
 
+    let (tx, rx) = tokio::sync::oneshot::channel();
     let prover_clone = prover.clone();
-    spawn(async move {
+    let prover_handle = spawn(async move {
         debug!("starting prover");
         prover_clone.run().await.unwrap();
     });
 
     let lc_clone = lightclient.clone();
-    spawn(async move {
+    let lc_handle = spawn(async move {
         debug!("starting light client");
         lc_clone.run().await.unwrap();
     });
 
-    spawn(async move {
+    let tx_handle = spawn(async move {
         let mut rng = StdRng::from_entropy();
+        let mut rx = rx;
 
         // ... transaction generation code ...
 
         loop {
+            if rx.try_recv().is_ok() {
+                break;
+            }
             // ... existing loop code ...
         }
     });
 
     let mut rx = lc_da_layer.clone().subscribe_to_heights();
     let initial_height = rx.recv().await.unwrap();
     while let Ok(height) = rx.recv().await {
         debug!("received height {}", height);
         if height >= initial_height + 100 {
             break;
         }
     }
 
+    // Cleanup
+    tx.send(()).unwrap();
+    tx_handle.abort();
+    prover_handle.abort();
+    lc_handle.abort();
+
     Ok(())
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f2f977 and c2323e1.

📒 Files selected for processing (9)
  • README.md (1 hunks)
  • ci/Dockerfile.bridge (1 hunks)
  • ci/Dockerfile.lightnode (1 hunks)
  • ci/Dockerfile.validator (1 hunks)
  • ci/docker-compose.yml (3 hunks)
  • ci/run-bridge.sh (4 hunks)
  • ci/run-lightnode.sh (4 hunks)
  • crates/tests/src/lib.rs (1 hunks)
  • justfile (4 hunks)
✅ Files skipped from review due to trivial changes (3)
  • ci/Dockerfile.bridge
  • ci/Dockerfile.validator
  • ci/Dockerfile.lightnode
🧰 Additional context used
🪛 Shellcheck (0.10.0)
ci/run-lightnode.sh

[warning] 82-82: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 83-83: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (6)
justfile (2)

11-13: LGTM: Docker build step addition improves reliability

The addition of the build step before spinning up the stack ensures that all images are properly built, reducing potential runtime failures.


94-106: Verify coverage exclusions and enhance error handling

The coverage command setup looks good, but has two areas for improvement:

  1. Error handling should be consistent with other commands
  2. We should verify the excluded packages are intentional
  if ! cargo llvm-cov nextest --html --output-dir coverage_report --lib --features "mock_prover" --release --workspace --exclude prism-cli --exclude-from-report prism-sp1 --ignore-filename-regex sp1; then
    echo "Coverage report generation failed."
+   exit 1
  else
    echo "Coverage report generated in 'coverage_report' directory"
  fi
✅ Verification successful

Let me gather more information about the test coverage setup to make a more informed decision.


Let me check one more thing to understand the SP1-related code structure and test coverage requirements.


Coverage exclusions are appropriate, but error handling needs improvement

The exclusions in the coverage command are appropriate:

  • prism-cli is excluded as it's primarily an executable
  • prism-sp1 (under crates/zk/sp1) is excluded as it's SP1-specific code
  • The sp1 filename regex exclusion aligns with the SP1-related implementation files

The error handling should be enhanced as suggested:

  if ! cargo llvm-cov nextest --html --output-dir coverage_report --lib --features "mock_prover" --release --workspace --exclude prism-cli --exclude-from-report prism-sp1 --ignore-filename-regex sp1; then
    echo "Coverage report generation failed."
+   exit 1
  else
    echo "Coverage report generated in 'coverage_report' directory"
  fi
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the excluded packages exist and are intentional
fd -t d "prism-(cli|sp1)" .

# Check for any other test files that might need exclusion
rg -l "^#\[cfg\(test\)\]" . --type rust

Length of output: 234


Script:

#!/bin/bash
# Check if prism-cli and prism-sp1 have test-specific code
rg -l "test" ./prism-cli ./prism-sp1 --type rust || true

# Check Cargo.toml to understand the workspace structure
cat Cargo.toml

# Look for any mock or test utilities in sp1-related files
rg -l "mock|test" . --type rust | grep -i "sp1" || true

Length of output: 3580


Script:

#!/bin/bash
# Check the structure and content of the SP1-related code
fd -t f . "crates/zk/sp1"

# Look for any test configurations in workspace
rg -l "mock_prover|test_utils" . --type toml

Length of output: 371

ci/docker-compose.yml (3)

26-30: Environment variables added correctly for bridge nodes

The addition of SKIP_AUTH and CONTAINER_NAME environment variables for bridge-0 enhances configurability and clarity in the service setup. The variables are appropriately documented.


49-51: Environment variables added correctly for bridge nodes

Similarly, the environment variables for bridge-1 are correctly added, ensuring consistency across bridge services.


65-72: Environment variables added correctly for light node

The inclusion of SKIP_AUTH and BRIDGE_COUNT for light-0 improves the light node's configuration, aligning it with the new peer management features.

crates/tests/src/lib.rs (1)

37-37: LGTM! Connection string changes improve security and resolve port conflicts.

The changes from 0.0.0.0 to localhost improve security by restricting connections to the local machine. The light client port change to 46658 helps avoid potential port conflicts.

Let's verify these ports are correctly configured in the docker-compose file:

Also applies to: 41-41

✅ Verification successful

Port mappings in docker-compose.yml correctly align with test configuration

The verification confirms that:

  • Port 36658 is mapped to 26658 for bridge connection
  • Port 46658 is mapped to 26658 for light client connection

These mappings match exactly with the connection strings in the test file:

  • Bridge: "ws://localhost:36658"
  • Light client: "ws://localhost:46658"
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the WebSocket ports in docker-compose match the test configuration
# Expected: Port mappings for 36658 and 46658 should be present

rg -A 2 "ports:" ci/docker-compose.yml | rg "366[0-9][0-9]|466[0-9][0-9]"

Length of output: 113


@distractedm1nd distractedm1nd left a comment


beautiful

Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
ci/run-bridge.sh (2)

63-76: Add error message details in retry loop

The retry loop could be more informative about why the peer ID retrieval failed.

-    if [[ -z "$peer_id" ]]; then
-      echo "Node is not running yet. Retrying..."
+    if [[ -z "$peer_id" ]]; then
+      echo "Node is not running yet. Failed to get peer ID. Retrying... (${elapsed}s/${timeout}s)"

78-80: Consider making the port configurable

The RPC port (2121) is hardcoded in the multiaddr construction. Consider making this configurable via an environment variable for flexibility.

+RPC_PORT="${RPC_PORT:-2121}"
-  multiaddr="/dns/$CONTAINER_NAME/tcp/2121/p2p/$peer_id"
+  multiaddr="/dns/$CONTAINER_NAME/tcp/$RPC_PORT/p2p/$peer_id"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2323e1 and ed8fff4.

📒 Files selected for processing (1)
  • ci/run-bridge.sh (4 hunks)
🔇 Additional comments (3)
ci/run-bridge.sh (3)

9-9: LGTM: Well-structured variable declarations

The new variables are properly defined with sensible defaults and follow the existing pattern of using shared directories.

Also applies to: 23-23


82-101: LGTM: Proper implementation of file locking

The implementation correctly addresses the race condition concerns raised in the previous review by using proper file locking mechanisms.
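
For readers unfamiliar with the pattern being referenced, a minimal sketch of a flock-based append looks like the following; the actual function in ci/run-bridge.sh may differ in details such as variable names and lock file location:

append_trusted_peers() {
  local multiaddr="$1"
  (
    # Take an exclusive lock so concurrent bridge containers cannot interleave writes.
    flock -x 200
    echo "$multiaddr" >> "$TRUSTED_PEERS_FILE"
  ) 200>"${TRUSTED_PEERS_FILE}.lock"
}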


Line range hint 1-124: Verify related CI script updates

Since this script is part of the CI infrastructure and handles critical peer configuration, we should verify if similar changes are needed in related scripts (e.g., light node scripts).

✅ Verification successful

Let me check the content of the related scripts to verify if they need similar updates.


No additional updates needed in related scripts

The verification shows that the peer handling implementation in run-bridge.sh and run-lightnode.sh are complementary and working as intended:

  • run-bridge.sh appends its peer information to the shared TRUSTED_PEERS_FILE
  • run-lightnode.sh waits for all bridge nodes to register their peer information and properly configures them using add_trusted_peers()
  • The docker-compose.yml correctly sets up the container names and bridge count used by this peer configuration system
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for related CI scripts that might need similar updates
echo "Searching for related CI scripts..."
fd -t f . ci/ -x grep -l "celestia" {} \;

# Check for similar peer handling in other scripts
echo -e "\nChecking for similar peer handling..."
rg -l "trusted.*peer" ci/

Length of output: 429


Script:

#!/bin/bash
# Check content of run-lightnode.sh for peer handling
echo "=== Content of run-lightnode.sh ==="
cat ci/run-lightnode.sh

echo -e "\n=== Checking trusted peer handling in docker-compose.yml ==="
cat ci/docker-compose.yml | grep -A 5 -B 5 "trusted.*peer"

Length of output: 5061

@distractedm1nd
Contributor

in the CI the integration test still can't connect to the light node, hmmm

@smuu
Contributor Author

smuu commented Dec 13, 2024

> in the CI the integration test still can't connect to the light node, hmmm

I needed to change 0.0.0.0 to localhost to get it working. Can you test with 0.0.0.0 to see if this is the issue?

Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed8fff4 and 8cd0a40.

📒 Files selected for processing (2)
  • crates/da/src/celestia.rs (1 hunks)
  • crates/tests/src/lib.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/tests/src/lib.rs
🔇 Additional comments (1)
crates/da/src/celestia.rs (1)

46-46: Security consideration for exposed WebSocket endpoint

The change to 0.0.0.0 exposes the WebSocket endpoint on all network interfaces, which could present security risks if not properly secured. Please ensure:

  1. This exposure is intentional and required for the architecture
  2. The WebSocket endpoint has proper authentication (note: the code shows optional auth token support)
  3. Network-level security (firewalls, security groups) is configured appropriately
✅ Verification successful

Let me gather more context about the Celestia client implementation and configuration.


Based on the gathered context, I can now provide a final response:

WebSocket binding to 0.0.0.0 is intended for development and testing

The binding to 0.0.0.0:26658 is part of the default configuration in a development context, as evidenced by:

  • It's used alongside test configurations (36658, 46658 ports in tests)
  • The codebase is explicitly marked as "not yet suitable for production environments" in README.md
  • Authentication support is properly implemented via optional auth tokens in the client
  • This is a local development setup that requires running a local Celestia network ("just celestia-up")

For production deployments, the connection string should be configured to a more restrictive binding through CLI arguments or configuration, which is already supported via the celestia_client parameter.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for security-related configurations

# Search for authentication token usage
echo "Checking auth token usage..."
rg -A 5 "auth_token"

# Search for security-related configurations
echo "Checking security configurations..."
fd -e toml -e yaml -e json -x rg "(security|auth|token|credential)"

Length of output: 944


Script:

#!/bin/bash
# Check the Celestia client implementation and configuration details
echo "Checking Celestia client implementation..."
rg -A 10 "Client::new" 

echo "Checking WebSocket connection configuration..."
rg -A 10 "connection_string"

echo "Checking if there are any security-related constants or configurations..."
rg -B 2 -A 5 "0.0.0.0:26658"

echo "Looking for any documentation about security considerations..."
fd README.md -x cat {}

Length of output: 17561

smuu added 2 commits December 13, 2024 15:21
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/ci.yml (1)

90-94: LGTM! The wait step for light node is a good addition.

This addition properly ensures the light node is ready before running integration tests, which should help resolve the connection issues mentioned in the PR discussion.

Consider adding a timeout to prevent the workflow from hanging if the light node fails to start:

      - name: Wait for light node to start
        run: |
          docker compose -f ci/docker-compose.yml logs -f |
-            awk '/Configuration finished. Running a light node/ {print;exit}'
+            timeout 60 awk '/Configuration finished. Running a light node/ {print;exit}'
+          if [ $? -eq 124 ]; then
+            echo "Error: Light node failed to start within 60 seconds"
+            docker compose -f ci/docker-compose.yml logs
+            exit 1
+          fi
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd0a40 and 3ebe2b5.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml (1 hunks)
  • crates/tests/src/lib.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/tests/src/lib.rs
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

90-94: Verify Docker service connectivity configuration.

Given the discussion about changing addresses from 0.0.0.0 to localhost, we should verify the Docker service connectivity configuration.
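
A manual way to check this once the stack is up is to confirm that the published ports are reachable from the host. The commands below are illustrative, not part of the repository; the port numbers are the ones mapped in ci/docker-compose.yml:

docker compose -f ci/docker-compose.yml ps
nc -z localhost 36658 && echo "bridge RPC reachable on 36658"
nc -z localhost 46658 && echo "light node RPC reachable on 46658"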

smuu added 6 commits December 13, 2024 15:36
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
ci/docker-compose.yml (2)

37-56: Clean up commented code

Instead of commenting out the entire bridge-1 service configuration, consider:

  1. Removing it if it's no longer needed
  2. Using source control to track the changes
  3. If this is temporary for testing, add a TODO comment explaining the plan
-  # bridge-1:
-  #   image: bridge
-  #   platform: "linux/amd64"
-  #   build:
-  #     context: .
-  #     dockerfile: Dockerfile.bridge
-  #   environment:
-  #     # provide an id for the bridge node (default: 0)
-  #     # each node should have a next natural number starting from 0
-  #     - NODE_ID=1
-  #     # setting SKIP_AUTH to true disables the use of JWT for authentication
-  #     - SKIP_AUTH=true
-  #     # this must match the service name in the docker-compose.yml file
-  #     # used for the trusted peers string for the light node
-  #     - CONTAINER_NAME=bridge-1
-  #   ports:
-  #     - 36658:26658
-  #   volumes:
-  #     - credentials:/credentials
-  #     - genesis:/genesis

Line range hint 1-89: Document testing strategy

The changes suggest a simplified testing setup (single bridge node, disabled auth) but lack documentation about:

  1. The testing strategy for different configurations
  2. Validation steps for the CI pipeline
  3. How this setup differs from production

Consider adding a ci/README.md documenting:

  1. The CI environment setup and configuration
  2. Test scenarios and expected outcomes
  3. Troubleshooting guidelines
  4. Production vs CI differences
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cf1cd6 and b6ad595.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (1 hunks)
  • ci/docker-compose.yml (3 hunks)
  • crates/tests/src/lib.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/ci.yml
  • crates/tests/src/lib.rs
🔇 Additional comments (2)
ci/docker-compose.yml (2)

10-10: Verify impact of reducing bridge nodes

Reducing the bridge count from 2 to 1 removes redundancy in the system. While this might simplify testing, it could affect system reliability in production.

Consider documenting the rationale for this change and its implications for production deployments.

✅ Verification successful

Reducing bridge count is intentional for CI testing

The reduction from 2 to 1 bridge nodes is a valid change for CI testing purposes. The scripts show that:

  • The bridge count is configurable via BRIDGE_COUNT environment variable
  • Light nodes wait for the configured number of bridge nodes before starting
  • The setup is specifically for CI testing with proper coordination between validator, bridge, and light nodes
  • The scripts handle both single and multiple bridge node scenarios correctly

This change simplifies the CI testing environment without compromising the test functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if there are any references to multiple bridge nodes in configuration or documentation
rg -l "bridge.*nodes?" . --type md --type yaml --type sh

Length of output: 145


Script:

#!/bin/bash
# Check the content of CI scripts to understand the bridge node setup and its implications
echo "=== run-bridge.sh ==="
cat ./ci/run-bridge.sh
echo -e "\n=== run-lightnode.sh ==="
cat ./ci/run-lightnode.sh
echo -e "\n=== run-validator.sh ==="
cat ./ci/run-validator.sh

Length of output: 14976


65-72: ⚠️ Potential issue

Address light node connection issues

The configuration changes attempt to fix the CI connection issues by:

  1. Adding SKIP_AUTH=true to match bridge configuration
  2. Setting explicit BRIDGE_COUNT=1

However, based on the PR comments about 0.0.0.0 vs localhost, there might be additional networking configuration needed.

Consider:

  1. Adding explicit host binding configuration
  2. Documenting the networking requirements
  3. Adding connection validation in the CI pipeline

smuu added 2 commits December 13, 2024 16:21
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>
Signed-off-by: Smuu <18609909+Smuu@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (5)
ci/docker-compose.yml (2)

Line range hint 8-14: Inconsistent BRIDGE_COUNT configuration

The BRIDGE_COUNT=2 in validator service appears incorrect since:

  1. The PR is addressing light node connection issues
  2. Comments indicate troubleshooting with a single bridge node
  3. This might be contributing to the connection problems mentioned in PR comments
- BRIDGE_COUNT=2
+ BRIDGE_COUNT=1

Line range hint 37-56: Remove unused bridge-1 service

Since the PR is focused on fixing light node connection issues with a single bridge, maintaining the bridge-1 service configuration:

  1. Adds unnecessary complexity
  2. May interfere with connection troubleshooting
  3. Conflicts with the suggested BRIDGE_COUNT=1

Consider removing the entire bridge-1 service block while debugging the connection issues.

ci/run-lightnode.sh (1)

32-64: Improve timeout handling in wait_for_provision

The function has been enhanced to wait for bridge nodes, but the timeout handling could be improved.

Consider these improvements:

  1. Make timeout configurable via environment variable
  2. Add more detailed error message on timeout
  3. Add progress indicator
+PROVISION_TIMEOUT="${PROVISION_TIMEOUT:-30}"  # Allow configurable timeout
 echo "Waiting for $BRIDGE_COUNT bridge nodes to start"
 start_time=$(date +%s)
-timeout=30
+timeout=$PROVISION_TIMEOUT
+echo "Will wait up to $timeout seconds..."
ci/run-validator.sh (2)

Line range hint 191-193: Fix potential race condition in genesis hash file writing

Both provision_bridge_nodes and provision_light_nodes are run in parallel and write to the same GENESIS_HASH_FILE. This could lead to a race condition. Consider:

  1. Running them sequentially instead of parallel
  2. Or having only one function write the genesis hash
- provision_bridge_nodes &
- provision_light_nodes &
+ # Run sequentially to avoid race condition
+ provision_bridge_nodes
+ provision_light_nodes

Line range hint 3-3: Enhance error handling for CI reliability

While set -euo pipefail provides basic error handling, consider adding:

  1. Timeout handling for wait_for_block to prevent infinite loops
  2. Explicit checks for file operations in shared directory
  3. Logging for better CI debugging
 set -euo pipefail
+
+# Add timeout handling
+TIMEOUT=300  # 5 minutes
+
+# Add error handling function
+handle_error() {
+  echo "Error occurred in script at line $1"
+  exit 1
+}
+
+trap 'handle_error $LINENO' ERR
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6ad595 and 4364107.

📒 Files selected for processing (5)
  • .github/workflows/ci.yml (1 hunks)
  • ci/docker-compose.yml (5 hunks)
  • ci/run-bridge.sh (4 hunks)
  • ci/run-lightnode.sh (4 hunks)
  • ci/run-validator.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/ci.yml
🧰 Additional context used
🪛 Shellcheck (0.10.0)
ci/run-lightnode.sh

[warning] 87-87: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 88-88: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (8)
ci/docker-compose.yml (3)

26-30: Authentication still disabled in CI environment

While the added comments explain the purpose of SKIP_AUTH and CONTAINER_NAME, the security concern about disabled authentication remains unaddressed.


87-91: LGTM: Good volume configuration for CI

The tmpfs configuration for the shared volume is well-suited for CI:

  1. Provides fast I/O performance
  2. Ensures clean state between test runs
  3. Properly configured with appropriate driver options

Line range hint 1-91: Verify configuration consistency

Let's verify the overall configuration consistency to help resolve the connection issues.

✅ Verification successful

Let me verify the configuration consistency by generating additional scripts to check the environment variables and port mappings.


Configuration is consistent and properly set up

The verification shows that:

  • Bridge count is consistently set to 2 across services
  • Container names are properly referenced (bridge-0, bridge-1)
  • Port mappings are correctly configured with unique host ports (26658, 36658, 46658)
  • Authentication settings (SKIP_AUTH) are consistently set to true across all services
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check configuration consistency across services

# Check for bridge count references
echo "Checking bridge count consistency:"
rg "BRIDGE_COUNT" -A 1

# Check for container name references
echo "\nChecking container name references:"
rg "CONTAINER_NAME" -A 1

# Check for network-related configurations
echo "\nChecking network configurations:"
rg "0.0.0.0|localhost" -A 1

Length of output: 67959


Script:

#!/bin/bash
# Check port mappings and environment variables

# Check all port mappings
echo "Checking port mappings:"
rg "ports:" -A 2 ci/docker-compose.yml

# Check environment variables for bridge count
echo "\nChecking bridge count environment variables:"
rg "BRIDGE_COUNT" ci/docker-compose.yml

# Check container names and their references
echo "\nChecking container name references:"
rg "CONTAINER_NAME|container_name" ci/docker-compose.yml

# Check skip auth settings
echo "\nChecking authentication settings:"
rg "SKIP_AUTH" ci/docker-compose.yml

Length of output: 1048

ci/run-lightnode.sh (1)

86-91: Fix potential return value masking in add_trusted_peers

The function has shellcheck warnings about masking return values.

Apply this diff as suggested in previous reviews:

add_trusted_peers() {
-  local trusted_peers="$(cat "$TRUSTED_PEERS_FILE")"
-  local formatted_peers=$(echo "$trusted_peers" | sed 's/\([^,]*\)/"\1"/g')
+  local trusted_peers
+  trusted_peers="$(cat "$TRUSTED_PEERS_FILE")"
+  local formatted_peers
+  formatted_peers=$(echo "$trusted_peers" | sed 's/\([^,]*\)/"\1"/g')
   echo "Trusting peers: $formatted_peers"
   sed -i'.bak' "s|TrustedPeers = .*|TrustedPeers = [$formatted_peers]|" "$CONFIG_DIR/config.toml"
}
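
For context on why SC2155 flags the combined form: `local` itself always returns success, so the exit status of the command substitution is thrown away. A standalone illustration (not code from this repository):

#!/usr/bin/env bash
set -e

masked() {
  local out="$(false)"   # `local` returns 0, so the failure of `false` is hidden
  echo "masked: still running"
}

unmasked() {
  local out
  out="$(false)"         # the assignment now carries the failing status; with set -e the script stops here
  echo "unmasked: never printed"
}

masked
unmasked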
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 87-87: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 88-88: Declare and assign separately to avoid masking return values.

(SC2155)

ci/run-bridge.sh (2)

115-116: Improve background process handling

The current implementation of background process could be improved.

As suggested in previous reviews, implement proper background process tracking and cleanup:

+# Track background processes
+declare -a bg_pids
+
+# Cleanup function
+cleanup() {
+    for pid in "${bg_pids[@]}"; do
+        kill "$pid" 2>/dev/null || true
+    done
+}
+
+# Set trap
+trap cleanup EXIT
+
 main() {
     # ... previous code ...
     
-    append_trusted_peers &
-    sleep 4
+    append_trusted_peers & 
+    bg_pids+=($!)
+    
+    # Wait for trusted peers file to be created or timeout
+    timeout=30
+    start_time=$(date +%s)
+    while [[ ! -f "$TRUSTED_PEERS_FILE" ]]; do
+        current_time=$(date +%s)
+        if (( current_time - start_time >= timeout )); then
+            echo "Timeout waiting for trusted peers"
+            exit 1
+        fi
+        sleep 1
+    done

Line range hint 1-1: Verify the impact on CI pipeline

The changes look good and address the light node connection issues. Let's verify the impact on the CI pipeline.

✅ Verification successful

Let me gather more information about the CI configuration and pipeline changes.


Based on the gathered information, I can now provide the final response:

CI pipeline changes have been properly implemented with robust error handling

The changes in ci/run-bridge.sh show significant improvements:

  • Proper file locking mechanism implemented for TRUSTED_PEERS_FILE to prevent race conditions
  • Timeout handling added for peer ID retrieval (30 seconds limit)
  • Robust error handling with proper exit codes
  • Clear logging for debugging purposes
  • Sequential initialization steps with proper waiting mechanisms
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check CI workflow files for light node configuration
echo "Checking CI workflow configuration..."
rg -l "BRIDGE_COUNT|TRUSTED_PEERS_FILE" .github/workflows/

# Check for any remaining connection issues in CI logs
echo "Checking recent CI logs for connection issues..."
gh run list --limit 5 --json conclusion,databaseId,url \
  | jq -r '.[] | select(.conclusion=="failure") | .url' \
  | while read -r url; do
    gh run view "$url" --log | rg -i "connection|light.*node"
  done

Length of output: 377


Script:

#!/bin/bash
# Check for CI configuration changes in the PR
echo "Checking CI-related changes..."
rg -l "ci/" --type sh

# Look for bridge-related configuration
echo "Checking bridge configuration..."
fd "bridge" --type f --exec cat {}

# Check workflow files for CI configuration
echo "Checking workflow files..."
fd "workflow" -e yml -e yaml --exec cat {}

Length of output: 4513

ci/run-validator.sh (2)

21-23: Verify shared directory permissions and synchronization

The change from GENESIS_DIR to SHARED_DIR makes sense as it's now used for sharing both genesis hash and peer addresses. However, ensure that:

  1. The /shared directory is properly mounted and accessible by all nodes
  2. File permissions are correctly set for both reading and writing
✅ Verification successful

Directory sharing configuration is properly implemented

The change from GENESIS_DIR to SHARED_DIR is correctly implemented with proper infrastructure support:

  • The shared volume is configured as a tmpfs mount in docker-compose.yml, ensuring fast and secure inter-container communication
  • All nodes (validator, bridge, light) consistently use the same /shared directory path
  • The shared directory is properly mounted across all containers using Docker volumes
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the shared directory is referenced in docker-compose.yml
rg -l "shared:" 
# Check if other scripts are using the same shared directory
rg "/shared" --type sh

Length of output: 223


Script:

#!/bin/bash
# Check how the shared directory is configured in docker-compose.yml
rg -A 5 "shared:" ci/docker-compose.yml

# Check how the shared directory is used in other scripts
rg -B 2 -A 2 "/shared" ci/run-{validator,lightnode,bridge}.sh

Length of output: 1598


Line range hint 176-177: Review connection configuration for light node issues

The script sets the validator to listen on 0.0.0.0:26657, but PR comments mention connection issues between the light node and validator. Consider:

  1. Verifying if this binding address is appropriate for your CI environment
  2. Testing with localhost as mentioned in PR comments

@distractedm1nd distractedm1nd merged commit 93a85ec into main Dec 16, 2024
5 checks passed
@smuu smuu deleted the smuu/20241213-fix-ci-and-test branch December 16, 2024 10:38