Hybrid Python Test Harness for Antigravity Skills
Overview
This project provides a robust, hybrid test harness for developing and evaluating Antigravity skills. It solves the "inner dev loop" problem by allowing you to test local, unpacked skills directly against the Antigravity engine (via the Python SDK), while using an LLM to deterministically grade the agent's behavior.
Table of Contents
- Prerequisites
- How to Use
- Testing Local Skills (Inner Dev Loop)
- Writing Playbooks
- Running the Pytest Suite
- Writing Playbooks: Three Modes of Testing
The architecture relies on three main components:
- Orchestrator (Python): Drives the execution loop, reads YAML playbooks, and manages isolated workspaces for each test run to prevent session caching issues.
- Execution Target (Antigravity SDK): The skill is executed using the google-antigravity Python SDK, which manages the localharness engine in-process. This eliminates the dependency on a globally installed CLI.
- Evaluator (Gemini API): The semantic evaluation of the agent's output is performed via direct API calls to gemini-2.5-flash using google-genai. This bypasses brittle string-parsing and guarantees structured JSON output (Pass/Fail + Reasoning).
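To make the evaluator's structured verdict concrete, here is a minimal standard-library sketch that parses a Pass/Fail + Reasoning reply. The `StepVerdict` type and the field names `passed` and `reasoning` are assumptions for illustration; the harness's actual schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StepVerdict:
    """Hypothetical shape of one evaluator verdict (field names are assumed)."""
    passed: bool
    reasoning: str


def parse_verdict(raw_json: str) -> StepVerdict:
    """Parse the evaluator's JSON reply, rejecting malformed verdicts."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(data.get("passed"), bool):
        raise ValueError("verdict must contain a boolean 'passed' field")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("verdict must contain a string 'reasoning' field")
    return StepVerdict(passed=data["passed"], reasoning=data["reasoning"])


verdict = parse_verdict('{"passed": true, "reasoning": "Agent asked for the Project ID."}')
print("PASS" if verdict.passed else "FAIL", "-", verdict.reasoning)
```

Validating the reply against a fixed shape, rather than searching free text for "pass" or "fail", is what lets the orchestrator treat the evaluator's answer as data.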
Prerequisites
You can run the harness directly with uv (recommended), which automatically downloads the required dependencies before running:
uv run harness.py playbooks/my-playbook.yaml
Alternatively, ensure you have a Python virtual environment set up with the required dependencies:
pip install google-antigravity google-genai pydantic pyyaml click pytest
You also need your Gemini API key available in your environment, or stored in ~/.gemini/key.env:
export GEMINI_API_KEY="your_api_key_here"
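The lookup order described above (environment first, then ~/.gemini/key.env) could be sketched as follows. The function name `resolve_api_key` is hypothetical, and the file is assumed to hold `KEY=VALUE` lines with an optional `export ` prefix; the harness's actual parsing is not documented here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    environ: Mapping[str, str] = os.environ,
    key_file: Path = Path.home() / ".gemini" / "key.env",
) -> Optional[str]:
    """Return GEMINI_API_KEY from the environment, else from key_file."""
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key
    if not key_file.is_file():
        return None
    for line in key_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Tolerate an optional leading "export " (assumed file format).
        if line.startswith("export "):
            line = line[len("export "):]
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "GEMINI_API_KEY":
            return value.strip().strip('"').strip("'")
    return None
```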
How to Use
The main entry point is the harness.py CLI tool.
Note
All commands in this guide assume you are running from the tools/skill-turn-harness directory. If running from the repository root, prefix paths accordingly (e.g., python3 tools/skill-turn-harness/harness.py ...).
Basic Usage
To run a test, provide a YAML playbook:
python3 harness.py playbooks/my-playbook.yaml
Command Line Options
- playbook (Required): The path to the YAML playbook defining the test steps.
- --log-dir <path> (Optional): The directory where the harness will write detailed Markdown logs, session JSONs, and JSON failure dumps. Defaults to ./logs.
- --skill-src <path> (Optional): The path to a local, unpacked skill directory. See the "Testing Local Skills" section below for details.
- --env-file <path> (Optional): The path to a standard .env file containing key-value pairs (e.g. MY_SECRET=123). This is used for secure string substitution within your playbook steps.
- --keep-workspace (Optional): Preserve the temporary workspace directory (/tmp/gemini_harness_*) after execution to inspect files generated by the agent.
- --agent-model <model> (Optional): Override the model the agent uses (e.g., gemini-2.5-pro). Overrides playbook definition.
- --evaluator-model <model> (Optional): Override the model the test harness uses to grade and simulate (e.g., gemini-2.5-flash). Overrides playbook definition.
- --max-deviations <number> (Optional): Set the maximum number of minor deviations/mistakes (such as rule violations or incorrect tool calls) the agent can make during autonomous/hybrid mode before the harness fails the test run. Defaults to 3.
- --debug (Optional): Enable verbose debug logging for the SDK (e.g., WebSocket traffic).
⚠️ Security Warning regarding Logs:
If your playbooks require secrets (like API keys or passwords) via the env array, the harness will substitute them before executing the CLI. Because the harness traces all inputs and outputs for debugging, these substituted secrets will be written in plain text to your logs/ directory.
A default .gitignore is provided in the logs/ directory to prevent committing these files, but care should still be taken to avoid leaking secrets into your repository.
Expected Output
The harness executes the playbook, rendering thoughts and tool calls in real-time, and streams the results to the console with active context usage stats:
--- Tuning: FAST Setup PoC | Workspace: /tmp/gemini_harness_abc123 ---
[Step 1]
Tester:
Hi, please activate the fast-setup-poc skill and let's configure FAST.
🧠 Thinking:
Let's activate the fast-setup-poc skill and check the requirements.
🛠️ [Tool Call]: list_directory(path=.)
...
✅ [PASS Step 1]: The agent greeted the user ('Hi'), confirmed it was configuring FAST, and asked for the Project ID. All parts of the objective were fulfilled.
[Step 2] [Context: 4,512]
Tester:
my-super-project-123
...
✅ [SUCCESS] Playbook 'FAST Setup PoC' completed successfully.
📄 Session JSON saved to: logs/FAST_Setup_PoC_session.json
📄 Markdown log saved to: logs/FAST_Setup_PoC_log.md
If a step fails, the harness halts immediately and dumps the full interaction trace to a JSON file (e.g., logs/FAST_Setup_PoC_failed.json) for debugging.
Testing Local Skills (Inner Dev Loop)
When developing a complex skill (with multiple markdown files, prompt templates, or tools), you don't want to package and globally install it just to run a test.
The harness supports testing local skills directly using the --skill-src flag:
python3 harness.py playbooks/my-playbook.yaml --skill-src ./my-local-skill/
How it works under the hood:
The harness passes the skill path to the SDK's LocalAgentConfig(skills_paths=[...]). The Antigravity engine loads the skill dynamically for the duration of the session. Unlike the old CLI-based linking, this is completely isolated and does not modify your global environment.
Workspace Management & Copying Behavior
By default, the test harness executes the agent in an isolated temporary workspace (e.g., /tmp/gemini_harness_<hash>) to prevent session caching and protect your repository from accidental file modifications.
Copying vs. Symlinking
To ensure both safety and compatibility with local search tools (such as the agent's built-in search_directory / grep tool, which can crash when traversing directory symlinks), the harness copies the configured playbook directories recursively instead of symlinking them.
To keep the workspace lightweight and prevent the agent from "cheating" by reading the test definitions, the copy operation implements strict exclusion rules:
- Excluded Dependencies/Cache: .terraform, .git, .venv, venv, __pycache__, .pytest_cache are skipped. This reduces the copied size of directories like fast/ from 1.4GB to a few megabytes, making workspace setup near-instant.
- Excluded Harness: The skill-turn-harness directory itself is strictly excluded from the copy. This prevents the agent under test from walking the workspace, reading the playbook YAML definitions, and "cheating" by peeking at the expected inputs/outcomes.
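The exclusion rules above map naturally onto the `ignore` callback of `shutil.copytree`. The following is a minimal sketch, not the harness's actual code: `copy_into_workspace` and `_ignore` are hypothetical names, and skipping symlinks entirely (rather than following them) is an assumption.

```python
import os
import shutil
from pathlib import Path

# Names taken from the exclusion rules above.
EXCLUDED_NAMES = {
    ".terraform", ".git", ".venv", "venv", "__pycache__", ".pytest_cache",
    "skill-turn-harness",
}


def _ignore(directory: str, names: list[str]) -> set[str]:
    """Tell copytree which entries of `directory` to skip."""
    skipped = set()
    for name in names:
        # Assumption: symlinks are skipped so search tools never traverse them.
        if name in EXCLUDED_NAMES or os.path.islink(os.path.join(directory, name)):
            skipped.add(name)
    return skipped


def copy_into_workspace(source: Path, workspace: Path) -> Path:
    """Recursively copy `source` into `workspace`, applying the exclusions."""
    destination = workspace / source.name
    shutil.copytree(source, destination, ignore=_ignore)
    return destination
```

A workspace for this sketch could be created with `tempfile.mkdtemp(prefix="gemini_harness_")`, which yields a directory name of the same shape as the paths shown in this guide.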
Linking the skills Directory
If your autonomous playbook instructs the agent to "activate" or "inspect" the skill, the model may attempt to search the workspace for the skill's source files (like SKILL.md). For these playbooks, ensure you add skills to the playbook's link_paths so the agent can resolve the lookup locally:
tmpdir:
  link_paths:
    - fast
    - modules
    - tools
    - skills # <-- Make sure to include this so the agent can find skill files
Isolated Chat History
The harness configures the SDK to write raw session state directly to the configured --log-dir (under log_dir/chats/). This ensures that test execution conversations remain isolated and do not pollute your global Antigravity/Jetski desktop history.
Writing Playbooks
Playbooks are written in YAML. For autocompletion and validation in VS Code, add the schema annotation to the top of your playbook.
If your playbook requires environment variables (e.g., secrets), declare them in the env array. You can then reference them in your steps using ${VAR_NAME}. If a variable is declared but not found in the environment (or passed via --env-file), the harness will safely halt before execution.
# yaml-language-server: $schema=../playbooks/playbook.schema.json
name: "My Test Playbook"
timeout: 120
agent_model: "gemini-2.5-pro"
evaluator_model: "gemini-2.5-flash"
env:
  - MY_API_KEY
steps:
  - user_input: "Hi, activate my-skill and use this key: ${MY_API_KEY}"
    expected_outcome: "The agent should greet the user and acknowledge the key."
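The `${VAR_NAME}` substitution and the halt-on-missing behaviour can be illustrated with a short sketch. The function name `substitute_env` is hypothetical, and leaving undeclared placeholders untouched is an assumption, not documented harness behaviour.

```python
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text: str, declared: list[str], values: Mapping[str, str]) -> str:
    """Replace ${VAR_NAME} placeholders for declared variables only."""
    missing = [name for name in declared if name not in values]
    if missing:
        # Halt before execution when a declared variable has no value.
        raise KeyError(f"declared but not set: {', '.join(missing)}")

    def _replace(match: re.Match) -> str:
        name = match.group(1)
        # Undeclared placeholders are left untouched (an assumption).
        return values[name] if name in declared else match.group(0)

    return _PLACEHOLDER.sub(_replace, text)
```

Here `values` would be the merged contents of the process environment and any file passed via --env-file.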
Running the Pytest Suite
This repository includes a pytest suite in the test/ directory to test the harness itself.
To run the fast unit tests (which mock the CLI execution):
python3 -m pytest test/test_harness.py -m "not e2e" -v
To run the full End-to-End (E2E) test (which dynamically links the fixture skill and hits the real Gemini API):
python3 -m pytest test/test_harness.py -m "e2e" -v
Writing Playbooks: Three Modes of Testing
The harness supports three modes of execution depending on how you structure your YAML playbook. The mode is inferred automatically based on the presence of the steps and/or persona keys.
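The inference rule can be summarised in a few lines. This is a sketch of the described behaviour, not the harness's code; the function name `infer_mode` and the returned mode strings are illustrative assumptions.

```python
from typing import Any, Mapping


def infer_mode(playbook: Mapping[str, Any]) -> str:
    """Infer the execution mode from the presence of 'steps' and 'persona'."""
    has_steps = bool(playbook.get("steps"))
    has_persona = bool(playbook.get("persona"))
    if has_steps and has_persona:
        return "hybrid"
    if has_steps:
        return "scripted"
    if has_persona:
        return "autonomous"
    raise ValueError("playbook must define 'steps', 'persona', or both")
```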
1. Scripted Mode (Unit / Regression)
Best for: Ensuring the exact, rigid state machine of a skill hasn't broken.
You define a strict, sequential list of steps. The harness feeds the user_input and checks if the agent's response satisfies the expected_outcome via an LLM evaluation.
name: "FAST Setup PoC - Scripted"
steps:
  - user_input: "Hi, let's configure FAST."
    expected_outcome: "The agent should greet the user and ask for the Project ID."
  - user_input: "my-super-project-123"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
2. Autonomous "Pond" Mode (E2E / Fuzz Testing)
Best for: Testing how the skill handles the messy reality of natural language and conversational drift.
Instead of providing a rigid script, you define a declarative Persona with a "Pond" of knowledge and explicit success_criteria. A secondary LLM agent acts as the simulated user, dynamically reading the CLI's outputs, fishing data from the "pond," and generating the next input until the success criteria are met or the max_turns limit is reached.
name: "FAST Setup PoC - Autonomous"
persona:
  initial_user_input: "Hi, let's configure FAST."
  context: >
    You are a GCP developer. Your Project ID is my-project-123 and region is europe-west1.
    Do not volunteer information until the agent explicitly asks for it.
  max_turns: 10
  success_criteria:
    llm_checks:
      - "The agent provided a final configuration summary containing the correct project_id and region."
    tool_calls_contain:
      run_shell_command:
        - "gcloud organizations add-iam-policy-binding"
    files_exist:
      - "0-org-setup.auto.tfvars"
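Unlike llm_checks, the tool_calls_contain and files_exist criteria can be verified without a model. The sketch below shows one way such checks might work; the function name and the representation of recorded tool calls as `(tool_name, arguments)` string pairs are assumptions.

```python
from pathlib import Path
from typing import Any, Mapping, Sequence


def check_deterministic_criteria(
    criteria: Mapping[str, Any],
    tool_calls: Sequence[tuple[str, str]],
    workspace: Path,
) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    failures = []
    expected_calls = criteria.get("tool_calls_contain") or {}
    for tool, fragments in expected_calls.items():
        for fragment in fragments:
            # A fragment is satisfied by any call to `tool` whose arguments contain it.
            if not any(name == tool and fragment in args for name, args in tool_calls):
                failures.append(f"no {tool} call containing {fragment!r}")
    for relative_path in criteria.get("files_exist") or []:
        if not (workspace / relative_path).is_file():
            failures.append(f"missing file {relative_path}")
    return failures
```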
3. Hybrid Fallback Mode
Best for: Testing happy-path compliance while ensuring the agent can recover from unexpected deviations.
If a playbook defines both steps and a persona, the harness runs in Hybrid mode. It attempts to execute the rigid steps first. If the skill deviates or fails a step evaluation, instead of failing the test outright, the harness falls back to the autonomous persona. The simulated user takes over the conversation history and attempts to guide the agent back on track to meet the success_criteria. If successful, the test returns a PASS WITH WARNINGS.
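The fallback control flow can be sketched with two injected callables standing in for the real step evaluation and the persona-driven conversation. Everything here is illustrative: `run_hybrid`, `run_step`, and `run_persona` are hypothetical names, the "PASS" and "FAIL" labels are assumed, and the sketch assumes a fully successful scripted run needs no further persona involvement.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]  # (user_input, expected_outcome)


def run_hybrid(
    steps: Sequence[Step],
    run_step: Callable[[str, str], bool],
    run_persona: Callable[[], bool],
) -> str:
    """Run scripted steps; on the first failed step, fall back to the persona."""
    for user_input, expected_outcome in steps:
        if not run_step(user_input, expected_outcome):
            # Scripted path broke: the simulated user takes over and tries
            # to steer the agent to the success criteria.
            return "PASS WITH WARNINGS" if run_persona() else "FAIL"
    return "PASS"
```

Passing the two behaviours in as callables keeps the sketch independent of the SDK and makes the fallback decision easy to unit-test in isolation.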