| # Test 3: Live Evaluation with Agent LLM Inspection |
|
|
| ## Run Command |
| ```bash |
| python evaluation/run_evaluation_sprint.py --questions 5 --output results.json |
| ``` |
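To check the run against the phases below, it helps to keep a copy of the console output. The following is a minimal sketch, assuming the script reports progress on stdout/stderr. The helpers `build_command` and `run_and_capture` and the log file name are illustrative and not part of the project.

```python
import subprocess
import sys
from pathlib import Path


def build_command(questions=5, output="results.json"):
    """Return the run command from this guide as an argument list."""
    return [
        sys.executable,
        "evaluation/run_evaluation_sprint.py",
        "--questions", str(questions),
        "--output", output,
    ]


def run_and_capture(command, log_path):
    """Run a command, save combined stdout/stderr to log_path, return (exit code, text)."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding="utf-8",   # assumption: the script emits UTF-8 (it prints symbols)
        errors="replace",   # never fail on an undecodable byte
    )
    Path(log_path).write_text(completed.stdout, encoding="utf-8")
    return completed.returncode, completed.stdout
```

Usage would look like `code, log = run_and_capture(build_command(), "eval_run.log")`. Output is only available once the run finishes, so nothing is printed live with this approach.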
|
|
| ## What to Look For |
|
|
| ### Phase 1: Orchestrator Load (should appear within the first 60 seconds) |
| ``` |
| [1/4] Loading ForgeEngine with Phase 6... |
| ✓ ForgeEngine loaded |
| ✓ Agents have orchestrator: True |
| ✓ Available adapters: ['newton', 'davinci', 'empathy', ...] |
| ``` |
|
|
| **CRITICAL:** If you see "False" or "Using template-based agents" → the orchestrator failed to load |
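The Phase 1 check can also be done on a saved log. This is a rough sketch that looks only for the two strings quoted above. The function name `orchestrator_loaded` is hypothetical, and the exact log wording is assumed to match this guide.

```python
def orchestrator_loaded(log_text):
    """Return True if the Phase 1 log lines indicate the orchestrator loaded."""
    # The fallback message is treated as decisive, even if "True" also appears.
    if "Using template-based agents" in log_text:
        return False
    return "Agents have orchestrator: True" in log_text
```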
|
|
| ### Phase 2: Agent Setup Inspection |
| ``` |
| [AGENT SETUP INSPECTION] |
| Orchestrator available: True |
| Available adapters: [...] |
| |
| Agent LLM modes: |
| Newton ✓ LLM (orch=True, adapter=newton) |
| Quantum ✓ LLM (orch=True, adapter=quantum) |
| DaVinci ✓ LLM (orch=True, adapter=davinci) |
| Philosophy ✓ LLM (orch=True, adapter=philosophy) |
| Empathy ✓ LLM (orch=True, adapter=empathy) |
| Ethics ✓ LLM (orch=True, adapter=philosophy) |
| ``` |
|
|
| **CRITICAL:** If any agent shows "✗ TEMPLATE" → that agent didn't get the orchestrator |
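With six agents it is easy to miss one line. The sketch below extracts each agent's mode from the inspection block and lists any that fell back to templates. It assumes each agent line has the shape `<Agent> <marker> LLM|TEMPLATE ...` as in the sample above. The regex and function names are illustrative.

```python
import re

# One agent per line: name, a single marker token, then the mode keyword.
AGENT_LINE = re.compile(
    r"^[ \t]*(?P<agent>[A-Za-z]+)[ \t]+\S+[ \t]+(?P<mode>LLM|TEMPLATE)\b",
    re.MULTILINE,
)


def agent_modes(log_text):
    """Map agent name -> "LLM" or "TEMPLATE" for every agent line found."""
    return {m.group("agent"): m.group("mode") for m in AGENT_LINE.finditer(log_text)}


def template_agents(log_text):
    """Return the sorted names of agents reported in TEMPLATE mode."""
    return sorted(name for name, mode in agent_modes(log_text).items() if mode == "TEMPLATE")
```

An empty list from `template_agents` together with six entries in `agent_modes` corresponds to the passing case described above.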
|
|
| ### Phase 3: First Question Synthesis Sample |
| ``` |
| [1/5] What is the speed of light in vacuum?... |
| [Phase 1-5] 2340 chars, correctness=0.50 |
| Sample: "The speed of light is a fundamental constant... |
| [Phase 6 Full] 2150 chars, correctness=0.65 |
| Sample: "Light propagates through vacuum at precisely... |
| [Phase 6 -PreFlight] 2100 chars, correctness=0.62 |
| Sample: "The speed of light, denoted by the symbol c... |
| ``` |
|
|
| **What it means**: |
| - If Phase 6 Full and Phase 6 -PreFlight have **longer** synthesis than Phase 1-5 → agents are doing more reasoning ✅ |
| - If Phase 1-5 has the **longer** synthesis → something's wrong ❌ |
| - If synthesis reads generic ("analyzing through lens") → likely templates ❌ |
| - If synthesis is specific ("speed of light is 299,792,458 m/s") → likely real LLM ✅ |
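The length comparison in the first two bullets can be sketched as a small helper. The condition labels are taken from the sample output above. The function name and the dictionary input are assumptions, since this guide does not describe the layout of `results.json`.

```python
def synthesis_length_check(lengths, reference_key="Phase 1-5"):
    """Map each other condition to True if its synthesis is longer than the reference."""
    reference = lengths[reference_key]
    return {
        name: length > reference
        for name, length in lengths.items()
        if name != reference_key
    }
```

Any `False` value marks a condition whose synthesis is not longer than Phase 1-5, which is the case to investigate.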
|
|
| ### Phase 4: Final Scores |
| Look for this pattern: |
| ``` |
| 📊 EVALUATION SUMMARY |
| Condition          | Correctness | Depth | Synthesis Len |
| ───────────────────┼─────────────┼───────┼────────────── |
| Baseline (Llama):  | 0.50        | 1     | 500 |
| Phase 1-5:         | 0.48        | 5     | 2100 |
| Phase 6 Full:      | 0.60        | 5     | 2200 |
| Phase 6 -PreFlight:| 0.58        | 5     | 2150 |
| ``` |
|
|
| **Verdict**: |
| - Phase 6 > Phase 1-5 and Phase 1-5 > Baseline → System improving ✅ |
| - If Phase 6 < Phase 1-5 → Something wrong with Phase 6 patches ❌ |
| - If Phase 6 Full ≈ Phase 1-5 → Semantics/preflight not helping much (acceptable) |
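The verdict rules can be written as a short decision function. This is a sketch under two assumptions that the guide does not state: "≈" is taken to mean a correctness gap of at most `tolerance` (0.02 here), and a result where Phase 6 beats Phase 1-5 but Phase 1-5 does not beat the baseline is reported as "mixed".

```python
def verdict(baseline, phase_1_5, phase_6, tolerance=0.02):
    """Classify correctness scores following the verdict rules in this guide."""
    if abs(phase_6 - phase_1_5) <= tolerance:
        return "no clear gain"   # Phase 6 roughly equal to Phase 1-5: acceptable
    if phase_6 < phase_1_5:
        return "regression"      # something is wrong with the Phase 6 patches
    if phase_1_5 > baseline:
        return "improving"       # Phase 6 > Phase 1-5 > Baseline
    return "mixed"               # assumption: case not covered by the guide
```

For example, `verdict(0.40, 0.50, 0.60)` returns `"improving"`.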
|
|
| ## Critical Checkpoints |
|
|
| | Checkpoint | Success | Failure | Action | |
| |-----------|---------|---------|--------| |
| | Orchestrator loads | Logs say "ready" | Logs say "error" | Check if base GGUF path exists | |
| | All agents show ✓ LLM | All 6 agents marked ✓ | Any marked ✗ | Investigate which agent failed | |
| | Synthesis length increases | Phase6 > Phase1-5 | Phase1-5 > Phase6 | Check whether agents are using the LLM | |
| | Correctness improves | Phase6 > Phase1-5 | Phase1-5 ≥ Phase6 | Adapters may be weak | |
| | Synthesis is specific | Mentions concrete details | Generic template text | Agents fell back to templates | |
|
|
| ## Expected Timeline |
|
|
| - **Orchestrator load**: ~60 seconds (one-time, then fast) |
| - **First question (debate)**: ~30-45 seconds |
| - **5 questions total**: ~3-5 minutes |
| - **Final report**: <1 second |
|
|
| ## If Something Goes Wrong |
|
|
| 1. **Orchestrator fails to load** |
| - Check: `ls J:\codette-training-lab\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF\*.gguf` |
| - Check: `ls J:\codette-training-lab\adapters\*.gguf` |
|
|
| 2. **Agents show ✗ TEMPLATE** |
| - Check logs for "CodetteOrchestrator not available:" |
| - Check Python path includes inference directory |
|
|
| 3. **Synthesis is still template-like** |
| - Check that the sample text doesn't contain "{concept}" |
| - Check if error logs show "falling back to templates" |
|
|
| 4. **Correctness doesn't improve** |
| - Adapters may be undertrained |
| - System prompts may need refinement |
| - The debate mechanism itself may be the limiting factor |
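Several of the checks above can be scripted. The following sketch covers three of them: listing `.gguf` files in a directory, confirming a directory is on the Python path, and spotting unfilled placeholders such as `{concept}`. All function names are illustrative, and the location of the inference directory is not given in this guide, so it must be supplied by the caller.

```python
import re
import sys
from pathlib import Path

# Matches unfilled template slots such as {concept}.
PLACEHOLDER = re.compile(r"\{[A-Za-z_]+\}")


def find_gguf(directory):
    """Return the sorted names of .gguf files directly inside directory ([] if missing)."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.gguf"))


def on_python_path(directory):
    """Return True if directory (resolved) is one of the entries in sys.path."""
    target = Path(directory).resolve()
    # An empty sys.path entry means the current working directory.
    return any(Path(entry or ".").resolve() == target for entry in sys.path)


def template_residue(sample_text):
    """Return the unfilled placeholders found in a synthesis sample."""
    return PLACEHOLDER.findall(sample_text)
```

For example, `find_gguf(r"J:\codette-training-lab\adapters")` returning an empty list points to the first failure case, and a non-empty result from `template_residue` points to the third.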
|
|
| ## Success Criteria ✅ |
|
|
| All of these should be true: |
| 1. Orchestrator loads successfully |
| 2. All agents show ✓ LLM mode |
| 3. Phase 6 synthesis is longer than Phase 1-5 |
| 4. First question synthesis is specific and domain-aware |
| 5. Correctness improves from Phase 1-5 to Phase 6 |
|
|
| If all 5 are true → **Mission accomplished!** 🎉 |
|
|