
Problems testing evaluation.py #53

@sockduct


Hello,

I'm trying to replicate your results with SEAL-0.

Request: Any tips for making this run faster and for eliminating the "INFINITE LOOP DETECTED" and "Node ... permanently blocked from aggregation" errors?

Setup:

  • Install on Ubuntu 24
  • Native setup
  • Add OpenRouter API Key
  • Start backend and frontend
  • Test run of evaluation.py: python evaluation.py --num-examples 3

Issues:

  • Takes a long time to run: around an hour (48 minutes in the run below), which seems long for only 3 questions (out of the 130 questions in seal-0.csv)
  • The backend output contains many warnings/errors, for example:
      • INFINITE LOOP DETECTED: Node root stuck in PLAN_DONE for 1902.3s
      • Node root permanently blocked from aggregation - considering forced completion
  • Even after evaluation.py finishes, the backend keeps repeating:
    TaskScheduler: No nodes in READY or AGGREGATING status
    🔍 AGGREGATION DEBUG - Node root sub_graph_id: subgraph_root, found 3 children
    ⏳ AGGREGATION BLOCKED - Node root cannot AGGREGATE: 2/3 children incomplete: root.2:PLAN_DONE, root.3:PENDING
    🔍 AGGREGATION DEBUG - Node root.2 sub_graph_id: subgraph_root.2, found 4 children
    ⏳ AGGREGATION BLOCKED - Node root.2 cannot AGGREGATE: 2/4 children incomplete: root.2.3:RUNNING, root.2.4:PENDING
    🚨 INFINITE LOOP DETECTED: Node root stuck in PLAN_DONE for 2162.9s
    🔍 AGGREGATION DEBUG - Node root sub_graph_id: subgraph_root, found 3 children
    ⏳ AGGREGATION BLOCKED - Node root cannot AGGREGATE: 2/3 children incomplete: root.2:PLAN_DONE, root.3:PENDING
    🚨 Node root permanently blocked from aggregation - considering forced completion
    🚨 INFINITE LOOP DETECTED: Node root.2 stuck in PLAN_DONE for 1742.0s
    🔍 AGGREGATION DEBUG - Node root.2 sub_graph_id: subgraph_root.2, found 4 children
    ⏳ AGGREGATION BLOCKED - Node root.2 cannot AGGREGATE: 2/4 children incomplete: root.2.3:RUNNING, root.2.4:PENDING
    (...)
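
To narrow down which nodes are actually holding things up, the repeated backend lines can be reduced to their leaf blockers. The following is a minimal sketch, assuming the "AGGREGATION BLOCKED" lines always have the shape shown above; `parse_blocked` and `leaf_blockers` are hypothetical helper names for this post, not part of the project.

```python
import re

# Matches the "AGGREGATION BLOCKED" lines as they appear in the backend log above.
# Assumption: the format is stable ("Node X cannot AGGREGATE: a/b children incomplete: ...").
BLOCKED_RE = re.compile(
    r"AGGREGATION BLOCKED - Node (?P<node>\S+) cannot AGGREGATE: "
    r"\d+/\d+ children incomplete: (?P<children>.+)$"
)


def parse_blocked(lines):
    """Map each blocked node to {child_id: status} for its incomplete children."""
    blocked = {}
    for line in lines:
        match = BLOCKED_RE.search(line.strip())
        if not match:
            continue
        children = {}
        for item in match.group("children").split(","):
            # "root.2:PLAN_DONE" -> ("root.2", "PLAN_DONE")
            child, _, status = item.strip().rpartition(":")
            children[child] = status
        blocked[match.group("node")] = children
    return blocked


def leaf_blockers(blocked):
    """Incomplete children that are not themselves reported as blocked parents."""
    leaves = {}
    for children in blocked.values():
        for child, status in children.items():
            if child not in blocked:
                leaves[child] = status
    return leaves


# Two lines copied from the backend output above.
SAMPLE = [
    "⏳ AGGREGATION BLOCKED - Node root cannot AGGREGATE: 2/3 children incomplete: root.2:PLAN_DONE, root.3:PENDING",
    "⏳ AGGREGATION BLOCKED - Node root.2 cannot AGGREGATE: 2/4 children incomplete: root.2.3:RUNNING, root.2.4:PENDING",
]

for node, status in sorted(leaf_blockers(parse_blocked(SAMPLE)).items()):
    print(f"{node}: {status}")
```

On the excerpt above this points at root.2.3 (RUNNING) as the only node doing work, with root.2.4 and root.3 still PENDING, so root and root.2 appear to be waiting on root.2.3 rather than looping on their own.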

Output from running evaluation:
$ python evaluation.py --num-examples 3
🔍 Testing connection to server at http://localhost:5000...
✅ Connected to server - Profile: None
📋 Processing 3 out of 3 total queries
🏁 Starting 2 worker processes for 3 queries...
Processing queries: 0%| | 0/3 [00:00<?, ?it/s]
🤖 [EvalWorker-2] Worker initialized, connected to http://localhost:5000
🚀 [EvalWorker-2] Creating project for query #1: 'Who holds the all-time record at the Grammys for the most wins in the album of the year category?'
⏱️ [EvalWorker-2] Waiting 5.0s before request (rate limiting)
🤖 [EvalWorker-1] Worker initialized, connected to http://localhost:5000
🚀 [EvalWorker-1] Creating project for query #2: 'How many NBA players have scored 60 or more points in a regular season game since 2023?'
⏱️ [EvalWorker-1] Waiting 5.0s before request (rate limiting)
✅ [EvalWorker-1] Created project c427a462-63f7-48af-ba98-18ea1c92c5e9 for query #2
📊 Project c427a462-63f7-48af-ba98-18ea1c92c5e9 status: active
✅ [EvalWorker-2] Created project ff2c9540-f231-4419-9105-d902dbfb7c29 for query #1
📊 Project ff2c9540-f231-4419-9105-d902dbfb7c29 status: active
📊 Project c427a462-63f7-48af-ba98-18ea1c92c5e9 status: running
📊 Project ff2c9540-f231-4419-9105-d902dbfb7c29 status: running
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Timeout waiting for results. Checking worker status...
📊 Project c427a462-63f7-48af-ba98-18ea1c92c5e9 status: completed
✅ [EvalWorker-1] Completed query #2 in 1082.46s
🚀 [EvalWorker-1] Creating project for query #3: 'What is the most recent film to join the top 10 highest-grossing films of all time?'
⏱️ [EvalWorker-1] Waiting 5.0s before request (rate limiting)
Processing queries: 33%|██████████████▎ | 1/3 [18:03<36:06, 1083.00s/it]
📊 Progress: 1/3 - Latest project: c427a462-63f7-48af-ba98-18ea1c92c5e9
✅ [EvalWorker-1] Created project 96491b5f-6cbe-4cbd-aeea-3ee03274184c for query #3
📊 Project 96491b5f-6cbe-4cbd-aeea-3ee03274184c status: active
📊 Project 96491b5f-6cbe-4cbd-aeea-3ee03274184c status: running
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Project ff2c9540-f231-4419-9105-d902dbfb7c29 timed out after 1800 seconds
✅ [EvalWorker-2] Completed query #1 in 1805.53s
🏁 [EvalWorker-2] Worker finished
Processing queries: 67%|█████████████████████████████▎ | 2/3 [30:06<14:31, 871.28s/it]
📊 Progress: 2/3 - Latest project: ff2c9540-f231-4419-9105-d902dbfb7c29
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Timeout waiting for results. Checking worker status...
⏱️ Project 96491b5f-6cbe-4cbd-aeea-3ee03274184c timed out after 1800 seconds
✅ [EvalWorker-1] Completed query #3 in 1805.55s
Processing queries: 100%|████████████████████████████████████████████| 3/3 [48:08<00:00, 967.72s/it]
🏁 [EvalWorker-1] Worker finished
📊 Progress: 3/3 - Latest project: 96491b5f-6cbe-4cbd-aeea-3ee03274184c
Processing queries: 100%|████████████████████████████████████████████| 3/3 [48:08<00:00, 962.85s/it]

✅ All evaluations completed
💾 Results saved to server_eval_results.csv

📊 Evaluation Summary:
Total queries: 3
Successful: 3
Failed: 0
Average execution time: 1564.51 seconds

🌐 Projects created (viewable in frontend):

  • ff2c9540-f231-4419-9105-d902dbfb7c29: Who holds the all-time record at the Grammys for t...
  • c427a462-63f7-48af-ba98-18ea1c92c5e9: How many NBA players have scored 60 or more points...
  • 96491b5f-6cbe-4cbd-aeea-3ee03274184c: What is the most recent film to join the top 10 hi...
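
The numbers in the summary can be cross-checked against the worker lines. The sketch below is only an illustration under the assumption that the "Created project", "timed out after" and "Completed query" lines keep the format shown in the output above; `summarize` is a hypothetical helper, not something evaluation.py provides.

```python
import re
from collections import defaultdict

# Patterns for three kinds of lines seen in the evaluation output above.
CREATED_RE = re.compile(r"Created project (?P<project>[0-9a-f-]{36}) for query #(?P<query>\d+)")
TIMEOUT_RE = re.compile(r"Project (?P<project>[0-9a-f-]{36}) timed out after \d+ seconds")
DONE_RE = re.compile(
    r"\[(?P<worker>EvalWorker-\d+)\] Completed query #(?P<query>\d+) in (?P<secs>[\d.]+)s"
)


def summarize(lines):
    """Recompute the average time, list timed-out queries, and total the busiest worker."""
    project_query = {}              # project id -> query number
    timed_out = set()               # query numbers whose project hit the timeout
    durations = {}                  # query number -> seconds
    per_worker = defaultdict(float)  # worker name -> total seconds
    for line in lines:
        m = CREATED_RE.search(line)
        if m:
            project_query[m.group("project")] = int(m.group("query"))
        m = TIMEOUT_RE.search(line)
        if m:
            timed_out.add(project_query.get(m.group("project")))
        m = DONE_RE.search(line)
        if m:
            secs = float(m.group("secs"))
            durations[int(m.group("query"))] = secs
            per_worker[m.group("worker")] += secs
    if not durations:
        raise ValueError("no 'Completed query' lines found")
    return {
        "average_s": sum(durations.values()) / len(durations),
        "timed_out_queries": sorted(q for q in timed_out if q is not None),
        "busiest_worker_s": max(per_worker.values()),
    }


# Lines copied from the evaluation output above.
SAMPLE = [
    "✅ [EvalWorker-1] Created project c427a462-63f7-48af-ba98-18ea1c92c5e9 for query #2",
    "✅ [EvalWorker-2] Created project ff2c9540-f231-4419-9105-d902dbfb7c29 for query #1",
    "✅ [EvalWorker-1] Completed query #2 in 1082.46s",
    "✅ [EvalWorker-1] Created project 96491b5f-6cbe-4cbd-aeea-3ee03274184c for query #3",
    "⏱️ Project ff2c9540-f231-4419-9105-d902dbfb7c29 timed out after 1800 seconds",
    "✅ [EvalWorker-2] Completed query #1 in 1805.53s",
    "⏱️ Project 96491b5f-6cbe-4cbd-aeea-3ee03274184c timed out after 1800 seconds",
    "✅ [EvalWorker-1] Completed query #3 in 1805.55s",
]

result = summarize(SAMPLE)
print(result)
```

Run on these lines, the average comes out at about 1564.51 s, matching the summary, and queries #1 and #3 are the ones that hit the 1800-second timeout, even though the summary reports "Successful: 3". EvalWorker-1 alone accounts for roughly 2888 s (query #2 plus query #3), which is consistent with the 48:08 wall-clock time on the progress bar.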

Any suggestions appreciated.
