Evaluation with LLM Judge
Hive provides a robust evaluation framework that leverages LLM judges to verify agent outputs, detect hallucinations, and ensure execution quality. By utilizing the 3-level runtime logging system, the framework provides judges with the granular context necessary to evaluate not just the final result, but the intermediate reasoning steps and tool usage.
Overview
The evaluation system in Hive operates on the principle of Automated Verification with Human-in-the-Loop (HITL) Oversight. When an agent completes a task, the framework can generate test cases and evaluation requests. These requests compare the actual output against the original goal and the intermediate tool logs to determine if the agent's behavior was correct, efficient, and grounded in the provided data.
The 3-Level Evaluation Context
The LLM Judge uses data from the `runtime_log_schemas` to perform deep analysis. Evaluations can be performed at three distinct levels of granularity:
- Summary Evaluation (Level 1): Focuses on the `RunSummaryLog`. The judge checks whether the overall goal was met and assesses `execution_quality`.
- Node-Level Evaluation (Level 2): Uses `NodeDetail` to verify whether specific nodes (e.g., a "Researcher" or "Planner" node) succeeded and identifies `attention_reasons` for failures.
- Step-Level Evaluation (Level 3): Analyzes `NodeStepLog` and `ToolCallLog`. This level is critical for hallucination detection, as the judge compares the tool's raw `result` with the LLM's `llm_text` description of that result.
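As a sketch of how a judge might assemble context at each level, consider the following. The field names (`execution_quality`, `attention_reasons`, `result`, `llm_text`) follow the schemas above, but the dataclass bodies and the `judge_context` helper are illustrative, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the runtime log schemas.
@dataclass
class RunSummaryLog:
    goal: str
    final_output: str
    execution_quality: float  # judge-assessed, 0.0 - 1.0

@dataclass
class NodeDetail:
    node_name: str
    succeeded: bool
    attention_reasons: list[str]

@dataclass
class ToolCallLog:
    tool_name: str
    result: str  # raw, ground-truth tool output

@dataclass
class NodeStepLog:
    llm_text: str  # the LLM's description of the result
    tool_calls: list[ToolCallLog]

def judge_context(level: int, summary: RunSummaryLog,
                  nodes: list[NodeDetail],
                  steps: list[NodeStepLog]) -> str:
    """Assemble the judge's context at the requested granularity."""
    if level == 1:
        # Level 1: goal vs. final output only.
        return f"GOAL: {summary.goal}\nOUTPUT: {summary.final_output}"
    if level == 2:
        # Level 2: per-node success and attention reasons.
        return "\n".join(
            f"{n.node_name}: {'ok' if n.succeeded else n.attention_reasons}"
            for n in nodes
        )
    # Level 3: pair each claim with raw tool evidence for cross-checking.
    return "\n".join(
        f"CLAIM: {s.llm_text}\nEVIDENCE: "
        + "; ".join(t.result for t in s.tool_calls)
        for s in steps
    )
```

The point of Level 3 is that the judge sees the claim and the raw evidence side by side, which is what makes the hallucination check described later possible.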
Automated Test Generation & Approval
Hive facilitates a "Self-Improving" loop where failure data is captured and turned into test cases. To ensure safety, LLM-generated tests and evaluation criteria require user approval before being merged into the permanent test suite.
The Approval Workflow
The framework uses the `ApprovalRequest` model to handle the transition from LLM-generated evaluation to human-verified ground truth.
```python
from framework.testing.approval_types import ApprovalAction, ApprovalRequest

# Example: A request generated after an agent run
request = ApprovalRequest(
    test_id="test_research_accuracy_001",
    action=ApprovalAction.APPROVE,  # Default suggestion
    approved_by="senior_dev",
)
```
Users can interact with these evaluations via the CLI or programmatic interfaces using the following actions:
- `APPROVE`: Accept the evaluation/test as-is.
- `MODIFY`: Provide `modified_code` or updated criteria.
- `REJECT`: Decline the evaluation (requires a `reason`).
- `SKIP`: Defer the decision.
Programmatic Evaluation Result Handling
When running batch evaluations (e.g., in a CI/CD pipeline or during agent "evolution"), Hive returns a `BatchApprovalResult`. This provides a summary of the agent's performance across multiple goals.
Data Models
ApprovalResult
The result of an individual evaluation check.
| Field | Type | Description |
| :--- | :--- | :--- |
| test_id | str | Unique identifier for the evaluated test. |
| action | ApprovalAction | The decision made (approve/reject/etc). |
| success | bool | Whether the evaluation process itself completed. |
| message | str \| None | Feedback from the judge or human reviewer. |
| timestamp | datetime | When the evaluation occurred. |
BatchApprovalResult
A summary of a multi-test evaluation run.
```python
class BatchApprovalResult(BaseModel):
    goal_id: str
    total: int
    approved: int
    modified: int
    rejected: int
    skipped: int
    errors: int
    results: list[ApprovalResult]
```
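The per-action tallies above can be derived from the individual results. A sketch, using plain action strings for brevity (the real model uses `ApprovalAction`, and how errors are recorded is an assumption here):

```python
from collections import Counter

def summarize(goal_id: str, actions: list[str]) -> dict:
    """Tally per-action counts for a batch evaluation run."""
    counts = Counter(actions)
    return {
        "goal_id": goal_id,
        "total": len(actions),
        "approved": counts["approve"],
        "modified": counts["modify"],
        "rejected": counts["reject"],
        "skipped": counts["skip"],
        # Assumes failed evaluations are recorded with an "error" marker.
        "errors": counts["error"],
    }
```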
Running Evaluations via CLI
You can review pending evaluations and generated tests using the Hive interactive CLI. This interface displays the confidence scores from the LLM judge and allows for real-time code modification.
```bash
# Review pending evaluations for a specific agent goal
python -m framework.testing.approval_cli --goal-id "deep_research_v1"
```
The CLI will display:
- Criteria: What the LLM Judge was looking for.
- Confidence: The judge's certainty score (0.0 - 1.0).
- Test Code: The logic used to verify the agent's output.
Detecting Hallucinations
To detect hallucinations, Hive compares the `ToolCallLog.result` (the truth) with the `NodeStepLog.llm_text` (the agent's claim).
Evaluation Logic:
- Extract: The judge extracts facts from the `llm_text`.
- Cross-Reference: It checks whether those facts exist in the `tool_calls` results within the same trace.
- Flag: If a fact is asserted that is not present in the tool output, the `NodeDetail.needs_attention` flag is set to `True` with an `attention_reason` of `"potential_hallucination"`.
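A toy grounding check in this spirit is sketched below. The word-overlap heuristic is a deliberate simplification: the actual judge uses an LLM to extract and cross-reference facts, and `find_ungrounded_claims` is a hypothetical helper, not part of Hive:

```python
def find_ungrounded_claims(llm_text: str, tool_results: list[str]) -> list[str]:
    """Flag sentences in the agent's claim that no tool output supports.

    Naive heuristic: a sentence counts as grounded if any significant
    word (>4 chars) from it appears in the combined tool output.
    """
    evidence = " ".join(tool_results).lower()
    ungrounded = []
    for sentence in llm_text.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 4]
        if words and not any(w in evidence for w in words):
            ungrounded.append(sentence.strip())
    return ungrounded

# The agent asserted a revenue figure that no tool call returned.
needs_attention = bool(find_ungrounded_claims(
    "The population is 8 million. Revenue grew 300%",
    ["Paris population: 8 million (2023 estimate)"],
))
attention_reason = "potential_hallucination" if needs_attention else None
```

The unsupported revenue claim is flagged while the population figure, which matches the tool output, passes; that flagged/passed split is exactly the signal the `needs_attention` flag carries at Level 3.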