add a basic framework for testing individual agents #4

Merged

peterj merged 2 commits into main from peterj/agenttesting

Jan 23, 2025

.gitignore

@@ -1,6 +1,8 @@
     ### dotenv template
     python/.env
+    test_results
     ### Go template
     # If you prefer the allow list template instead of the deny list, see community template:
     # https://github.com/github/gitignore/blob/main/community/Golang/Go.AllowList.gitignore

python/prompts/__init__.py

@@ -1,3 +1,4 @@
     from ._istio_crd import get_istio_crd_prompt
+    from .models import IstioCrdType
-    __all__ = ["get_istio_crd_prompt"]
+    __all__ = ["get_istio_crd_prompt"]

python/prompts/_istio_crd.py

@@ -551,6 +551,7 @@ def __init__(self, crd_type: "IstioCrdType"):
             super().__init__(
                 name=f"Istio {crd_type.value} Template",
                 description=f"Template for generating {crd_type.value} configurations",
+                version="0.0.1",
                 sections=[
                     PromptSection(
                         name="Role Definition",

python/prompts/base.py

@@ -49,11 +49,13 @@ def __init__(
             self,
             name: str,
             description: str,
+            version: str,
             sections: list[PromptSection] | None,
             variables: list[TemplateVariable] | None = None,
         ):
             self.name = name
             self.description = description
+            self.version = version
             self.sections = sections
             self.variables = variables or []
             self._validate_variables()

python/tests/README.md

@@ -0,0 +1,115 @@
+    ## Artifacts
+    The test data is split into two groups - test cases and agent configurations.
+    ### Test cases
+    The test cases file consists of one or more test cases, each having an input prompt (this is the query that's sent directly to the agent) and the expected output, which is the response that the agent should return.
+    ```yaml
+    version: "1.0"
+    metadata:
+      description: "Authorization Policy Test Cases"
+    test_cases:
+      - name: deny_post_8080
+        input: "Deny results with POST method on port 8080 on all workloads in the foo namespace"
+        category: AuthorizationPolicy
+        expected_output:
+          apiVersion: security.istio.io/v1
+          kind: AuthorizationPolicy
+          metadata:
+            name: httpbin
+            namespace: foo
+          spec:
+            action: DENY
+            rules:
+              - to:
+                  - operation:
+                      methods:
+                        - POST
+                      ports:
+                        - "8080"
+    ```
+    ### Agent configuration
+    The agent configuration specifies the system prompt for the agent. This is the context that the agent uses to generate the response.
+    ```yaml
+    version: "1.0"
+    name: "istio_authpolicy_crd_agent"
+    metadata:
+      description: "Agent for generating Istio Authorization Policy CRDs"
+      version: "0.0.1"
+    system_messages:
+      - |
+        You're an Istio CRD agent. You modify or create a new JSON based on the UQ. The JSON must conform to the PROTO SPEC. The response must only include one or more AuthorizationPolicy resource type.
+        PROTO...
+    ```
+    ## Running tests
+    To run the tests, pass in the test cases file and the agent file, and specify the model you want to use:
+    ```bash
+    uv run main.py run test_cases/authpolicy_test_cases.yaml agents/istio_crd_agent_0.yaml --model gpt-4o-mini
+    ```
+    Once you've created the baseline results (or first test results), you can modify the prompt or model and run the tests again to see if the results change.
+    To check for changes in the results, you can use the `compare` command:
+    ```bash
+    uv run main.py compare test_results/results1.json test_results/results2.json
+    ```
+    The `compare` command will output the differences between the two test results files and provide a summary of the changes between the two runs:
+    ```console
+    Comparing results_20250122_153628.json with results_20250122_153642.json
+    Model changed: True
+    Prompt changed: False
+    Analyzing 2 test cases:
+    === Test 1 ===
+    Input: Deny results with POST method on port 8080 on all workloads in the foo namespace
+    Duration delta: 6569.08ms
+    Similarity: 95.46% → 95.46% (Δ: +0.00%)
+    No differences in output
+    === Test 2 ===
+    Input: Allow GET requests on port 3000 for service-a in the bar namespace
+    Duration delta: 3134.77ms
+    Similarity: 87.52% → 83.91% (Δ: -3.62%)
+    Output differences:
+    --- results_20250122_153628.json (Test 2)
+    +++ results_20250122_153642.json (Test 2)
+    @@ -2,16 +2,16 @@
+       "apiVersion": "security.istio.io/v1",
+       "kind": "AuthorizationPolicy",
+       "metadata": {
+    -    "name": "allow-get-service-a",
+    +    "name": "allow-get-port-3000",
+         "namespace": "bar"
+       },
+       "spec": {
+    +    "action": "ALLOW",
+         "selector": {
+    -      "matchLabels": {
+    +      "match_labels": {
+             "app": "service-a"
+           }
+         },
+    -    "action": "ALLOW",
+         "rules": [
+           {
+             "to": [
+    === Summary Statistics ===
+    Tests with differences: 1 of 2
+    Average similarity delta: -1.81%
+    Average duration delta: +4851.93ms
+    ```
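The `compare` implementation in `main.py` is not part of the diff shown here, so the following is only a sketch of how the per-test diff and the summary statistics in the sample output could be computed. The function names `diff_outputs` and `summarize` are hypothetical; the field names (`actual_output`, `similarity`, `duration_ms`) are taken from the `TestResult` fields used in `agent_tester.py`.

```python
import difflib
import json
from statistics import mean


def diff_outputs(old: dict, new: dict, old_name: str = "baseline", new_name: str = "candidate") -> list[str]:
    """Return unified-diff lines between two pretty-printed JSON outputs."""
    old_lines = json.dumps(old, indent=2).splitlines()
    new_lines = json.dumps(new, indent=2).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm="")
    )


def summarize(baseline: list[dict], candidate: list[dict]) -> dict:
    """Pair results by position and aggregate the deltas between two runs."""
    pairs = list(zip(baseline, candidate))
    return {
        # Dict comparison ignores key order, unlike the textual diff above.
        "tests_with_differences": sum(
            1 for old, new in pairs if old["actual_output"] != new["actual_output"]
        ),
        "total": len(pairs),
        "avg_similarity_delta": mean(new["similarity"] - old["similarity"] for old, new in pairs),
        "avg_duration_delta_ms": mean(new["duration_ms"] - old["duration_ms"] for old, new in pairs),
    }
```

This sketch assumes both result files contain the same test cases in the same order and at least one test case; `statistics.mean` raises on empty input.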

python/tests/__init__.py

Empty file.

python/tests/agent_tester.py

@@ -0,0 +1,93 @@
+    from autogen_agentchat.agents import AssistantAgent
+    from schema import TestCase, TestRunResult, TestResult
+    from datetime import datetime
+    from pathlib import Path
+    from dataclasses import asdict
+    import json
+    import hashlib
+    import difflib
+    class AgentTester:
+        def __init__(self, agent: "AssistantAgent", test_cases: list[TestCase], results_dir: str = "test_results"):
+            self.agent = agent
+            self.test_cases = test_cases
+            self.results_dir = Path(results_dir)
+            self.results_dir.mkdir(parents=True, exist_ok=True)
+        async def run_tests(self) -> TestRunResult:
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            model = self.agent._model_client._to_config().model_dump().get('model')
+            # Get all system messages once for the entire test run
+            system_msg = "\n".join([msg.content for msg in self.agent._system_messages])
+            # Create config once for the entire test run
+            config = {
+                "agent_name": self.agent.name,
+                "tools": "\n".join([tool.name for tool in self.agent._tools]),
+                "model": model,
+                "prompt": system_msg,
+                "prompt_hash": hashlib.sha256(system_msg.encode('utf-8')).hexdigest(),
+            }
+            results = []
+            for test_case in self.test_cases:
+                print(".", end="", flush=True)
+                start = datetime.now()
+                # Run the agent with the test input
+                response = await self.agent.run(task=test_case.input)
+                end = datetime.now()
+                duration_ms = (end - start).total_seconds() * 1000
+                # Use the content of the last message as the agent's answer
+                output = response.messages[-1].content
+                # Parse once and reuse for both the similarity score and the stored result
+                actual_output = json.loads(output)
+                similarity = self._calculate_similarity(test_case.expected_output, actual_output)
+                # Create test result without config
+                result = TestResult(
+                    category=test_case.category,
+                    input=test_case.input,
+                    expected_output=test_case.expected_output,
+                    actual_output=actual_output,
+                    duration_ms=duration_ms,
+                    similarity=similarity
+                )
+                results.append(result)
+            print()
+            # Create the test run result that combines config and individual results
+            test_run_result = TestRunResult(
+                timestamp=timestamp,
+                config=config,
+                results=results
+            )
+            # Save results
+            self._save_results(test_run_result)
+            return test_run_result
+        def _calculate_similarity(self, dict1: dict, dict2: dict):
+            # Convert both dictionaries to strings with consistent formatting
+            str1 = json.dumps(dict1, sort_keys=True)
+            str2 = json.dumps(dict2, sort_keys=True)
+            return difflib.SequenceMatcher(None, str1, str2).ratio() * 100
+        def _save_results(self, test_run_result: TestRunResult):
+            # Convert results to JSON-serializable format
+            results_dict = {
+                "timestamp": test_run_result.timestamp,
+                "config": test_run_result.config,
+                "results": [asdict(result) for result in test_run_result.results]
+            }
+            # Save to JSON file
+            results_file = self.results_dir / f"results_{test_run_result.timestamp}_{test_run_result.config.get('model')}.json"
+            with open(results_file, "w") as f:
+                json.dump(results_dict, f, indent=2)
+            print(f"Results saved to: {results_file}")
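The similarity score in `_calculate_similarity` is a character-level `difflib.SequenceMatcher` ratio over JSON serialized with sorted keys. A minimal standalone sketch of that idea follows; the function name `similarity` is illustrative and not part of the PR. It shows that key order does not affect the score, while list order and key spelling (for example `matchLabels` versus `match_labels`) do.

```python
import difflib
import json


def similarity(expected: dict, actual: dict) -> float:
    """Return a 0-100 similarity score between two JSON-like dicts."""
    # sort_keys=True makes the serialization independent of key order,
    # including inside nested dictionaries.
    expected_str = json.dumps(expected, sort_keys=True)
    actual_str = json.dumps(actual, sort_keys=True)
    return difflib.SequenceMatcher(None, expected_str, actual_str).ratio() * 100


# Same content, different key order: scored as identical.
same = similarity(
    {"kind": "AuthorizationPolicy", "spec": {"action": "DENY", "rules": []}},
    {"spec": {"rules": [], "action": "DENY"}, "kind": "AuthorizationPolicy"},
)

# A differently spelled key lowers the score even though the values match.
renamed = similarity(
    {"selector": {"matchLabels": {"app": "service-a"}}},
    {"selector": {"match_labels": {"app": "service-a"}}},
)
```

Because the comparison is textual, the score measures how close the serialized strings are, not whether the two policies are semantically equivalent.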

python/tests/agents/istio_crd_agent_0.yaml

@@ -0,0 +1,35 @@
+    version: "1.0"
+    name: "istio_authpolicy_crd_agent"
+    metadata:
+      description: "Agent for generating Istio Authorization Policy CRDs (no protos)"
+      version: "0.0.1"
+    system_messages:
+      - |
+        You're an Istio CRD agent. You modify or create a new JSON based on the UQ. The response must only include one or more AuthorizationPolicy resource type.
+        EXAMPLES:
+        UQ: Deny requests from dev namespace to POST method on all workloads in the foo namespace
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "deny-dev-post", "namespace": "foo"}, "spec": {"action": "DENY", "rules": [{"from": [{"source": {"namespaces": ["dev"]}}], "to": [{"operation": {"methods": ["POST"]}}]}]}}
+        UQ: Create a deny policy to deny all requests with POST method on port 8080 on all workloads in the foo namespace
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "deny-post-8080", "namespace": "foo"}, "spec": {"action": "DENY", "rules": [{"to": [{"operation": {"methods": ["POST"], "ports": ["8080"]}}]}]}}
+        UQ: Audit any GET requests to the path with the prefix /user/profile
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "audit-user-profile", "namespace": "ns1"}, "spec": {"selector": {"matchLabels": {"app": "myapi"}}, "action": "AUDIT", "rules": [{"to": [{"operation": {"methods": ["GET"], "paths": ["/user/profile/*"]}}]}]}}
+        UQ: Deny all requests to workloads in namespace foo
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "deny-all", "namespace": "foo"}, "spec": {}}
+        UQ: Allow all requests to workloads in namespace foo
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "allow-all", "namespace": "foo"}, "spec": {"rules": [{}]}}
+        UQ: Allow requests to workloads labeled with app=customers in the customers namespace if the request is from the service account cluster.local/ns/orders/orders or from the payments namespace, and the request header "foo" has the value "bar" or the request header "user" has the value "peterj".
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "allow-customers", "namespace": "customers"}, "spec": {"action": "ALLOW", "selector": {"matchLabels": {"app": "customers"}}, "rules": [{"from": [{"source": {"principals": ["cluster.local/ns/orders/sa/orders"]}}, {"source": {"namespaces": ["payments"]}}], "to": [{"operation": {"when": [{"key": "request.headers[foo]", "values": ["bar"]}, {"key": "request.headers[user]", "values": ["peterj"]}]}}]}]}}
+        UQ: Allow IP address 1.2.3.4 and IPs from block 5.6.7.0/24 to access the apps labeled with app=payments.
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "ingress-policy", "namespace": "foo"}, "spec": {"selector": {"matchLabels": {"app": "payments"}}, "action": "ALLOW", "rules": [{"from": [{"source": {"ipBlocks": ["1.2.3.4", "5.6.7.0/24"]}}]}]}}
+        UQ: Apply the policy to all workloads in the foo namespace and allows GET requests to prefix /info or POST requests to /data for workloads using cluster.local/ns/default/sleep service account or workloads in test namespace when the issuer claim is set to https://accounts.google.com
+        JSON: {"apiVersion": "security.istio.io/v1", "kind": "AuthorizationPolicy", "metadata": {"name": "allow-info-data", "namespace": "foo"}, "spec": {"action": "ALLOW", "rules": [{"from": [{"source": {"principals": ["cluster.local/ns/default/sa/sleep"]}}, {"source": {"namespaces": ["test"]}}], "to": [{"operation": {"methods": ["GET"], "paths": ["/info*"]}}, {"operation": {"methods": ["POST"], "paths": ["/data"]}}], "when": [{"key": "request.auth.claims[iss]", "values": ["https://accounts.google.com"]}]}]}}
+    tools: []
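Since the system message embeds its few-shot examples as single-line `JSON:` entries, a malformed example would silently end up in the prompt. One way to catch that before a test run is sketched below; the helper name `check_json_examples` is hypothetical, and the sketch assumes each example occupies exactly one line starting with `JSON:`, as in the file above.

```python
import json


def check_json_examples(system_message: str) -> list[dict]:
    """Parse every line that starts with 'JSON:' and return the decoded examples."""
    examples = []
    for line_no, line in enumerate(system_message.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("JSON:"):
            continue
        try:
            examples.append(json.loads(stripped[len("JSON:"):]))
        except json.JSONDecodeError as exc:
            # Report which prompt line is broken instead of failing later in a test run.
            raise ValueError(f"Invalid JSON example on line {line_no}: {exc}") from exc
    return examples
```

This only checks that the examples are well-formed JSON; it does not validate them against the AuthorizationPolicy schema.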
