
# Hybrid Python Test Harness for Antigravity Skills

## Overview

This project provides a hybrid test harness for developing and evaluating Antigravity skills. It solves the "inner dev loop" problem by letting you test local, unpacked skills directly against the Antigravity engine (via the Python SDK), while an LLM evaluator grades the agent's behavior and returns a structured pass/fail verdict.

## Table of Contents

- [Prerequisites](#prerequisites)
- [How to Use](#how-to-use)
  - [Basic Usage](#basic-usage)
  - [Command Line Options](#command-line-options)
  - [Expected Output](#expected-output)
- [Testing Local Skills (Inner Dev Loop)](#testing-local-skills-inner-dev-loop)
- [Workspace Management & Copying Behavior](#workspace-management--copying-behavior)
- [Writing Playbooks](#writing-playbooks)
- [Running the Pytest Suite](#running-the-pytest-suite)
- [Writing Playbooks: Three Modes of Testing](#writing-playbooks-three-modes-of-testing)

The architecture relies on three main components:

- **Orchestrator (Python):** Drives the execution loop, reads YAML playbooks, and manages isolated workspaces for each test run to prevent session caching issues.
- **Execution Target (Antigravity SDK):** The skill is executed using the `google-antigravity` Python SDK, which manages the localharness engine in-process. This eliminates the dependency on a globally installed CLI.
- **Evaluator (Gemini API):** The semantic evaluation of the agent's output is performed via direct API calls to `gemini-2.5-flash` using `google-genai`. This bypasses brittle string-parsing and guarantees structured JSON output (Pass/Fail + Reasoning).
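
As a rough illustration of the evaluator contract described above, the sketch below validates a Pass/Fail verdict returned as JSON. The `Verdict` class, the `passed` and `reasoning` field names, and `parse_verdict` are hypothetical, so the harness's actual schema may differ. The call to the Gemini API is omitted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape of a structured evaluator response."""
    passed: bool
    reasoning: str


def parse_verdict(raw: str) -> Verdict:
    """Parse a JSON verdict and reject anything that is not well-formed."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    passed = data.get("passed")
    reasoning = data.get("reasoning")
    if not isinstance(passed, bool) or not isinstance(reasoning, str):
        raise ValueError("verdict needs a boolean 'passed' and a string 'reasoning'")
    return Verdict(passed=passed, reasoning=reasoning)
```

Validating the shape up front means a malformed evaluator response surfaces as an error instead of being silently read as a pass or a fail.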

## Prerequisites

You can run the harness directly using `uv` (recommended), which will automatically handle downloading and running with the required dependencies:

```bash
uv run harness.py playbooks/my-playbook.yaml
```

Alternatively, ensure you have a Python virtual environment set up with the required dependencies:

```bash
pip install google-antigravity google-genai pydantic pyyaml click pytest
```

You also need your Gemini API key available in your environment, or stored in `~/.gemini/key.env`:

```bash
export GEMINI_API_KEY="your_api_key_here"
```
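
A minimal sketch of how such a lookup could work is shown here. It assumes the environment variable takes precedence over the file and that `key.env` holds `GEMINI_API_KEY=...` lines (optionally prefixed with `export`). Both are assumptions, and `load_gemini_api_key` is a hypothetical helper rather than the harness's own function.

```python
import os
from pathlib import Path
from typing import Optional


def load_gemini_api_key(key_file: Optional[Path] = None) -> Optional[str]:
    """Return the key from the environment, else from a KEY=VALUE file."""
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        return key
    # Assumed default location, taken from the path mentioned above.
    path = key_file if key_file is not None else Path.home() / ".gemini" / "key.env"
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("export "):
            line = line[len("export "):]
        name, sep, value = line.partition("=")
        if sep and name.strip() == "GEMINI_API_KEY":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'")
    return None
```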

## How to Use

The main entry point is the `harness.py` CLI tool.

> [!NOTE]
> All commands in this guide assume you are running from the `tools/skill-turn-harness` directory. If running from the repository root, prefix paths accordingly (e.g., `python3 tools/skill-turn-harness/harness.py ...`).

### Basic Usage

To run a test, provide a YAML playbook:

```bash
python3 harness.py playbooks/my-playbook.yaml
```

### Command Line Options

- `playbook` (Required): The path to the YAML playbook defining the test steps.
- `--log-dir <path>` (Optional): The directory where the harness will write detailed Markdown logs, session JSONs, and JSON failure dumps. Defaults to `./logs`.
- `--skill-src <path>` (Optional): The path to a local, unpacked skill directory. See the "Testing Local Skills" section below for details.
- `--env-file <path>` (Optional): The path to a standard `.env` file containing key-value pairs (e.g. `MY_SECRET=123`). This is used for secure string substitution within your playbook steps.
- `--keep-workspace` (Optional): Preserve the temporary workspace directory (`/tmp/gemini_harness_*`) after execution to inspect files generated by the agent.
- `--agent-model <model>` (Optional): Override the model the agent uses (e.g., `gemini-2.5-pro`). Takes precedence over the playbook's `agent_model`.
- `--evaluator-model <model>` (Optional): Override the model the test harness uses to grade responses and simulate the user (e.g., `gemini-2.5-flash`). Takes precedence over the playbook's `evaluator_model`.
- `--max-deviations <number>` (Optional): Set the maximum number of minor deviations/mistakes (such as rule violations or incorrect tool calls) the agent can make during autonomous/hybrid mode before the harness fails the test run. Defaults to `3`.
- `--debug` (Optional): Enable verbose debug logging for the SDK (e.g., WebSocket traffic).

⚠️ **Security Warning regarding Logs:**
If your playbooks require secrets (like API keys or passwords) via the `env` array, the harness substitutes them into the step inputs before they are sent to the agent. Because the harness traces all inputs and outputs for debugging, **these substituted secrets will be written in plain text** to your `logs/` directory.
A default `.gitignore` is provided in the `logs/` directory to prevent committing these files, but care should still be taken to avoid leaking secrets into your repository.

### Expected Output

The harness executes the playbook, renders the agent's thoughts and tool calls in real time, and streams each step's result to the console along with the current context usage:

```text
--- Tuning: FAST Setup PoC | Workspace: /tmp/gemini_harness_abc123 ---

[Step 1]
Tester:
Hi, please activate the fast-setup-poc skill and let's configure FAST.
  🧠 Thinking:
  Let's activate the fast-setup-poc skill and check the requirements.
  🛠️ [Tool Call]: list_directory(path=.)
  ...
✅ [PASS Step 1]: The agent greeted the user ('Hi'), confirmed it was configuring FAST, and asked for the Project ID. All parts of the objective were fulfilled.

[Step 2] [Context: 4,512]
Tester:
my-super-project-123
...

✅ [SUCCESS] Playbook 'FAST Setup PoC' completed successfully.
📄 Session JSON saved to: logs/FAST_Setup_PoC_session.json
📄 Markdown log saved to: logs/FAST_Setup_PoC_log.md
```

If a step fails, the harness halts immediately and dumps the full interaction trace to a JSON file (e.g., `logs/FAST_Setup_PoC_failed.json`) for debugging.

## Testing Local Skills (Inner Dev Loop)

When developing a complex skill (with multiple markdown files, prompt templates, or tools), you don't want to package and globally install it just to run a test.

The harness supports testing local skills directly using the `--skill-src` flag:

```bash
python3 harness.py playbooks/my-playbook.yaml --skill-src ./my-local-skill/
```

**How it works under the hood:**

The harness passes the skill path to the SDK's `LocalAgentConfig(skills_paths=[...])`. The Antigravity engine loads the skill dynamically for the duration of the session. Unlike the old CLI-based linking, this is completely isolated and does not modify your global environment.

## Workspace Management & Copying Behavior

By default, the test harness executes the agent in an isolated temporary workspace (e.g., `/tmp/gemini_harness_<hash>`) to prevent session caching and protect your repository from accidental file modifications.

### Copying vs. Symlinking
To ensure both safety and compatibility with local search tools (such as the agent's built-in `search_directory` / grep tool, which can crash when traversing directory symlinks), **the harness copies the configured playbook directories recursively instead of symlinking them**.

To keep the workspace lightweight and prevent the agent from "cheating" by reading the test definitions, the copy operation implements strict exclusion rules:
- **Excluded Dependencies/Cache**: `.terraform`, `.git`, `.venv`, `venv`, `__pycache__`, `.pytest_cache` are skipped. This reduces the copied size of directories like `fast/` from 1.4GB to a few megabytes, making workspace setup near-instant.
- **Excluded Harness**: The `skill-turn-harness` directory itself is **strictly excluded** from the copy. This prevents the agent under test from walking the workspace, reading the playbook YAML definitions, and "cheating" by peeking at the expected inputs/outcomes.
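
The copy-with-exclusions behavior can be approximated with the standard library, as in the sketch below. The excluded names come from the list above, while `populate_workspace` and its arguments are hypothetical and not taken from the harness source.

```python
import shutil
import tempfile
from pathlib import Path

# Directory names taken from the exclusion rules above.
EXCLUDED_NAMES = (
    ".terraform", ".git", ".venv", "venv", "__pycache__", ".pytest_cache",
    "skill-turn-harness",
)


def populate_workspace(repo_root: Path, rel_paths: list[str]) -> Path:
    """Copy the listed directories into a fresh temporary workspace."""
    workspace = Path(tempfile.mkdtemp(prefix="gemini_harness_"))
    # ignore_patterns matches on entry names at every level of the tree.
    ignore = shutil.ignore_patterns(*EXCLUDED_NAMES)
    for rel in rel_paths:
        shutil.copytree(repo_root / rel, workspace / rel, ignore=ignore)
    return workspace
```

Because `shutil.ignore_patterns` is applied at every directory level, a nested `.terraform` or `.git` directory is skipped wherever it appears in the copied tree.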

### Linking the `skills` Directory
If your autonomous playbook instructs the agent to "activate" or "inspect" the skill, the model may attempt to search the workspace for the skill's source files (like `SKILL.md`). For these playbooks, ensure you add `skills` to the playbook's `link_paths` so the agent can resolve the lookup locally:

```yaml
tmpdir:
  link_paths:
    - fast
    - modules
    - tools
    - skills # <-- Make sure to include this so the agent can find skill files
```

### Isolated Chat History
The harness configures the SDK to write raw session state to a `chats/` subdirectory of the configured `--log-dir`. This keeps test conversations isolated so they do not pollute your global Antigravity/Jetski desktop history.

## Writing Playbooks

Playbooks are written in YAML. For autocompletion and validation in VS Code, add the schema annotation to the top of your playbook.

If your playbook requires environment variables (e.g., secrets), declare them in the `env` array. You can then reference them in your `steps` using `${VAR_NAME}`. If a variable is declared but not found in the environment (or passed via `--env-file`), the harness will safely halt before execution.

```yaml
# yaml-language-server: $schema=../playbooks/playbook.schema.json
name: "My Test Playbook"
timeout: 120
agent_model: "gemini-2.5-pro"
evaluator_model: "gemini-2.5-flash"
env:
  - MY_API_KEY
steps:
  - user_input: "Hi, activate my-skill and use this key: ${MY_API_KEY}"
    expected_outcome: "The agent should greet the user and acknowledge the key."
```
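
The substitution and fail-fast behavior described above might look roughly like the following sketch. `substitute_env` is a hypothetical helper, and the choice to leave undeclared `${...}` placeholders untouched is an assumption rather than documented harness behavior.

```python
import re
from typing import Mapping, Sequence

# Matches ${NAME} placeholders such as ${MY_API_KEY}.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text: str, declared: Sequence[str], values: Mapping[str, str]) -> str:
    """Replace ${NAME} for declared variables; fail fast if any is missing."""
    missing = [name for name in declared if name not in values]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    def repl(match: re.Match) -> str:
        name = match.group(1)
        # Assumption: undeclared placeholders are left as-is rather than guessed.
        return values[name] if name in declared else match.group(0)

    return _VAR.sub(repl, text)
```

In this sketch, `values` would be built from `os.environ` merged with the pairs read from `--env-file`. The missing-variable check runs before any step is executed.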

## Running the Pytest Suite

This repository includes a `pytest` suite in the `test/` directory to test the harness itself.

To run the fast unit tests (which mock the CLI execution):

```bash
python3 -m pytest test/test_harness.py -m "not e2e" -v
```

To run the full End-to-End (E2E) test (which dynamically links the fixture skill and hits the real Gemini API):

```bash
python3 -m pytest test/test_harness.py -m "e2e" -v
```

## Writing Playbooks: Three Modes of Testing

The harness supports three modes of execution depending on how you structure your YAML playbook. The mode is inferred automatically based on the presence of the `steps` and/or `persona` keys.
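
The inference rule can be summarized in a few lines. This is a sketch only: `infer_mode` and the returned labels are hypothetical names for the three modes described in the subsections that follow.

```python
def infer_mode(playbook: dict) -> str:
    """Pick the execution mode from the presence of 'steps' and 'persona'."""
    has_steps = bool(playbook.get("steps"))
    has_persona = bool(playbook.get("persona"))
    if has_steps and has_persona:
        return "hybrid"
    if has_steps:
        return "scripted"
    if has_persona:
        return "autonomous"
    raise ValueError("playbook must define 'steps', 'persona', or both")
```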

### 1. Scripted Mode (Unit / Regression)
**Best for:** Ensuring the exact, rigid state machine of a skill hasn't broken.

You define a strict, sequential list of `steps`. The harness feeds the `user_input` and checks if the agent's response satisfies the `expected_outcome` via an LLM evaluation.

```yaml
name: "FAST Setup PoC - Scripted"
steps:
  - user_input: "Hi, let's configure FAST."
    expected_outcome: "The agent should greet the user and ask for the Project ID."
  - user_input: "my-super-project-123"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
```

### 2. Autonomous "Pond" Mode (E2E / Fuzz Testing)
**Best for:** Testing how the skill handles the messy reality of natural language and conversational drift.

Instead of providing a rigid script, you define a declarative **Persona** with a "Pond" of knowledge and explicit `success_criteria`. A secondary LLM agent acts as the simulated user: it reads the agent's outputs, fishes data from the "pond," and generates the next input until the success criteria are met or the `max_turns` limit is reached.

```yaml
name: "FAST Setup PoC - Autonomous"
persona:
  initial_user_input: "Hi, let's configure FAST."
  context: >
    You are a GCP developer. Your Project ID is my-project-123 and region is europe-west1.
    Do not volunteer information until the agent explicitly asks for it.
  max_turns: 10
  success_criteria:
    llm_checks:
      - "The agent provided a final configuration summary containing the correct project_id and region."
    tool_calls_contain:
      run_shell_command:
        - "gcloud organizations add-iam-policy-binding"
    files_exist:
      - "0-org-setup.auto.tfvars"
```
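
The non-LLM parts of `success_criteria` lend themselves to plain deterministic checks. The sketch below assumes that `tool_calls_contain` maps a tool name to substrings that must each appear in at least one call to that tool, and that `files_exist` paths are relative to the workspace. Both readings are assumptions, `check_criteria` is a hypothetical helper, and `llm_checks` is left out because it requires the evaluator model.

```python
from pathlib import Path
from typing import Any, Mapping, Sequence


def check_criteria(
    criteria: Mapping[str, Any],
    tool_calls: Sequence[tuple[str, str]],
    workspace: Path,
) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    failures: list[str] = []
    # tool_calls is a list of (tool_name, serialized_arguments) pairs.
    for tool, needles in criteria.get("tool_calls_contain", {}).items():
        calls = [args for name, args in tool_calls if name == tool]
        for needle in needles:
            if not any(needle in args for args in calls):
                failures.append(f"no {tool} call contains {needle!r}")
    for rel in criteria.get("files_exist", []):
        if not (workspace / rel).exists():
            failures.append(f"missing file: {rel}")
    return failures
```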

### 3. Hybrid Fallback Mode
**Best for:** Testing happy-path compliance while ensuring the agent can recover from unexpected deviations.

If a playbook defines **both** `steps` and a `persona`, the harness runs in Hybrid mode. It attempts to execute the rigid `steps` first. If the skill deviates or fails a step evaluation, instead of failing the test outright, the harness **falls back** to the autonomous persona. The simulated user takes over the conversation history and attempts to guide the agent back on track to meet the `success_criteria`. If successful, the test returns a `PASS WITH WARNINGS`.
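
The fallback control flow can be sketched as follows. `run_hybrid`, its callback parameters, and the `PASS` and `FAIL` labels are hypothetical; only `PASS WITH WARNINGS` comes from the description above. The sketch also ignores `--max-deviations` and does not model whether `success_criteria` are checked when every scripted step passes.

```python
from typing import Callable, Sequence


def run_hybrid(
    steps: Sequence[dict],
    run_step: Callable[[dict], bool],
    run_persona: Callable[[], bool],
) -> str:
    """Run scripted steps; on the first failed step, hand over to the persona."""
    for step in steps:
        if not run_step(step):
            # Fallback: the simulated user tries to reach the success criteria.
            return "PASS WITH WARNINGS" if run_persona() else "FAIL"
    return "PASS"
```

Here `run_step` stands in for "send the input and evaluate the response," and `run_persona` stands in for the autonomous loop that continues from the existing conversation history.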