Evaluation with LLM Judge
Hive provides a robust evaluation framework that leverages LLM judges to verify agent outputs, detect hallucinations, and ensure execution quality. By utilizing the 3-level runtime logging system, the framework provides judges with the granular context necessary to evaluate not just the final result, but the intermediate reasoning steps and tool usage.
Overview
The evaluation system in Hive operates on the principle of Automated Verification with Human-in-the-Loop (HITL) Oversight. When an agent completes a task, the framework can generate test cases and evaluation requests. These requests compare the actual output against the original goal and the intermediate tool logs to determine if the agent's behavior was correct, efficient, and grounded in the provided data.
The 3-Level Evaluation Context
The LLM Judge uses data from the `runtime_log_schemas` to perform deep analysis. Evaluations can be performed at three distinct levels of granularity:
- Summary Evaluation (Level 1): Focuses on the `RunSummaryLog`. The judge checks whether the overall goal was met and assesses `execution_quality`.
- Node-Level Evaluation (Level 2): Uses `NodeDetail` to verify whether specific nodes (e.g., a "Researcher" or "Planner" node) succeeded and identifies `attention_reasons` for failures.
- Step-Level Evaluation (Level 3): Analyzes `NodeStepLog` and `ToolCallLog`. This level is critical for hallucination detection, as the judge compares the tool's raw `result` with the LLM's `llm_text` description of that result.
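As a sketch of how a judge might assemble context at each level, consider the following. The field names (`execution_quality`, `attention_reasons`, `result`, `llm_text`) follow the schemas above, but the dataclass bodies and the `judge_context` helper are illustrative, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the runtime log schemas.
@dataclass
class RunSummaryLog:
    goal: str
    final_output: str
    execution_quality: float  # judge-assessed, 0.0 - 1.0

@dataclass
class NodeDetail:
    node_name: str
    succeeded: bool
    attention_reasons: list[str]

@dataclass
class ToolCallLog:
    tool_name: str
    result: str  # raw, ground-truth tool output

@dataclass
class NodeStepLog:
    llm_text: str  # the LLM's description of the result
    tool_calls: list[ToolCallLog]

def judge_context(level: int, summary: RunSummaryLog,
                  nodes: list[NodeDetail],
                  steps: list[NodeStepLog]) -> str:
    """Assemble the judge's context at the requested granularity."""
    if level == 1:
        # Level 1: goal vs. final output only.
        return f"GOAL: {summary.goal}\nOUTPUT: {summary.final_output}"
    if level == 2:
        # Level 2: per-node success and attention reasons.
        return "\n".join(
            f"{n.node_name}: {'ok' if n.succeeded else n.attention_reasons}"
            for n in nodes
        )
    # Level 3: pair each claim with raw tool evidence for cross-checking.
    return "\n".join(
        f"CLAIM: {s.llm_text}\nEVIDENCE: "
        + "; ".join(t.result for t in s.tool_calls)
        for s in steps
    )
```

The point of Level 3 is that the judge sees the claim and the raw evidence side by side, which is what makes the hallucination check described later possible.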
Automated Test Generation & Approval
Hive facilitates a "Self-Improving" loop where failure data is captured and turned into test cases. To ensure safety, LLM-generated tests and evaluation criteria require user approval before being merged into the permanent test suite.
The Approval Workflow
The framework uses the `ApprovalRequest` model to handle the transition from LLM-generated evaluation to human-verified ground truth.
```python
from framework.testing.approval_types import ApprovalAction, ApprovalRequest

# Example: A request generated after an agent run
request = ApprovalRequest(
    test_id="test_research_accuracy_001",
    action=ApprovalAction.APPROVE,  # Default suggestion
    approved_by="senior_dev",
)
```
Users can interact with these evaluations via the CLI or programmatic interfaces using the following actions:
- `APPROVE`: Accept the evaluation/test as-is.
- `MODIFY`: Provide `modified_code` or updated criteria.
- `REJECT`: Decline the evaluation (requires a `reason`).
- `SKIP`: Defer the decision.
Programmatic Evaluation Result Handling
When running batch evaluations (e.g., in a CI/CD pipeline or during agent "evolution"), Hive returns a `BatchApprovalResult`. This provides a summary of the agent's performance across multiple goals.
Data Models
ApprovalResult
The result of an individual evaluation check.
| Field | Type | Description |
| :--- | :--- | :--- |
| test_id | str | Unique identifier for the evaluated test. |
| action | ApprovalAction | The decision made (approve/reject/etc). |
| success | bool | Whether the evaluation process itself completed. |
| message | str \| None | Feedback from the judge or human reviewer. |
| timestamp | datetime | When the evaluation occurred. |
BatchApprovalResult
A summary of a multi-test evaluation run.
```python
class BatchApprovalResult(BaseModel):
    goal_id: str
    total: int
    approved: int
    modified: int
    rejected: int
    skipped: int
    errors: int
    results: list[ApprovalResult]
```
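The per-action tallies above can be derived from the individual results. A sketch, using plain action strings for brevity (the real model uses `ApprovalAction`, and how errors are recorded is an assumption here):

```python
from collections import Counter

def summarize(goal_id: str, actions: list[str]) -> dict:
    """Tally per-action counts for a batch evaluation run."""
    counts = Counter(actions)
    return {
        "goal_id": goal_id,
        "total": len(actions),
        "approved": counts["approve"],
        "modified": counts["modify"],
        "rejected": counts["reject"],
        "skipped": counts["skip"],
        # Assumes failed evaluations are recorded with an "error" marker.
        "errors": counts["error"],
    }
```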
Running Evaluations via CLI
You can review pending evaluations and generated tests using the Hive interactive CLI. This interface displays the confidence scores from the LLM judge and allows for real-time code modification.
```bash
# Review pending evaluations for a specific agent goal
python -m framework.testing.approval_cli --goal-id "deep_research_v1"
```
The CLI will display:
- Criteria: What the LLM Judge was looking for.
- Confidence: The judge's certainty score (0.0 - 1.0).
- Test Code: The logic used to verify the agent's output.
Detecting Hallucinations
To detect hallucinations, Hive compares the `ToolCallLog.result` (the truth) with the `NodeStepLog.llm_text` (the agent's claim).
Evaluation Logic:
- Extract: The judge extracts facts from the `llm_text`.
- Cross-Reference: It checks whether those facts exist in the `tool_calls` results within the same trace.
- Flag: If a fact is asserted that is not present in the tool output, the `NodeDetail.needs_attention` flag is set to `True` with an `attention_reason` of `"potential_hallucination"`.
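A toy grounding check in this spirit is sketched below. The word-overlap heuristic is a deliberate simplification: the actual judge uses an LLM to extract and cross-reference facts, and `find_ungrounded_claims` is a hypothetical helper, not part of Hive:

```python
def find_ungrounded_claims(llm_text: str, tool_results: list[str]) -> list[str]:
    """Flag sentences in the agent's claim that no tool output supports.

    Naive heuristic: a sentence counts as grounded if any significant
    word (>4 chars) from it appears in the combined tool output.
    """
    evidence = " ".join(tool_results).lower()
    ungrounded = []
    for sentence in llm_text.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 4]
        if words and not any(w in evidence for w in words):
            ungrounded.append(sentence.strip())
    return ungrounded

# The agent asserted a revenue figure that no tool call returned.
needs_attention = bool(find_ungrounded_claims(
    "The population is 8 million. Revenue grew 300%",
    ["Paris population: 8 million (2023 estimate)"],
))
attention_reason = "potential_hallucination" if needs_attention else None
```

The unsupported revenue claim is flagged while the population figure, which matches the tool output, passes; that flagged/passed split is exactly the signal the `needs_attention` flag carries at Level 3.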