Ludovico Magnocavallo 81f72e8068 Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979 )

2026-05-22 17:16:54 +00:00

Hybrid Python Test Harness for Antigravity Skills

Overview

This project provides a hybrid test harness for developing and evaluating Antigravity skills. It solves the "inner dev loop" problem by letting you test local, unpacked skills directly against the Antigravity engine (via the Python SDK), while an LLM evaluator grades the agent's behavior and returns a structured pass/fail verdict.

Prerequisites
How to Use
Testing Local Skills (Inner Dev Loop)
Writing Playbooks
Running the Pytest Suite
Writing Playbooks: Three Modes of Testing

The architecture relies on three main components:

Orchestrator (Python): Drives the execution loop, reads YAML playbooks, and manages isolated workspaces for each test run to prevent session caching issues.
Execution Target (Antigravity SDK): The skill is executed using the google-antigravity Python SDK, which manages the localharness engine in-process. This eliminates the dependency on a globally installed CLI.
Evaluator (Gemini API): The semantic evaluation of the agent's output is performed via direct API calls to gemini-2.5-flash using google-genai. This bypasses brittle string-parsing and guarantees structured JSON output (Pass/Fail + Reasoning).
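
As a rough illustration of why a structured verdict is easier to consume than free text, the sketch below validates an evaluator reply using only the standard library. The field names `passed` and `reasoning`, the `Verdict` class, and `parse_verdict` are assumptions made for this example; the harness's real schema is not shown in this README.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape of one evaluator result (field names are assumptions)."""
    passed: bool
    reasoning: str


def parse_verdict(raw_json: str) -> Verdict:
    """Validate the evaluator's JSON reply instead of string-matching free text."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(data.get("passed"), bool):
        raise ValueError("verdict must contain a boolean 'passed' field")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("verdict must contain a string 'reasoning' field")
    return Verdict(passed=data["passed"], reasoning=data["reasoning"])
```

A malformed reply raises `ValueError` right away, so a bad evaluator response cannot be mistaken for a passing step.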

Prerequisites

You can run the harness directly using uv (recommended), which will automatically handle downloading and running with the required dependencies:

uv run harness.py playbooks/my-playbook.yaml

Alternatively, ensure you have a Python virtual environment set up with the required dependencies:

pip install google-antigravity google-genai pydantic pyyaml click pytest

You also need your Gemini API key available in your environment, or stored in ~/.gemini/key.env:

export GEMINI_API_KEY="your_api_key_here"
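
The two key locations above could be checked along these lines. This is a minimal sketch, assuming the environment variable takes precedence and that `key.env` holds plain `KEY=value` lines (optionally prefixed with `export`); neither detail is confirmed by this README, and `load_api_key` is a hypothetical helper rather than a harness function.

```python
import os
from pathlib import Path


def load_api_key(environ=None, key_file=None):
    """Return GEMINI_API_KEY from the environment, else from a key file."""
    environ = os.environ if environ is None else environ
    if key_file is None:
        key_file = Path.home() / ".gemini" / "key.env"
    key_file = Path(key_file)

    # Assumption: the environment variable wins over the file.
    value = environ.get("GEMINI_API_KEY", "").strip()
    if value:
        return value

    if key_file.is_file():
        for line in key_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line.startswith("export "):
                line = line[len("export "):]
            name, sep, val = line.partition("=")
            if sep and name.strip() == "GEMINI_API_KEY":
                # Drop optional surrounding quotes.
                return val.strip().strip("'\"")

    raise RuntimeError("GEMINI_API_KEY not found in environment or key file")
```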

How to Use

The main entry point is the harness.py CLI tool.

Note

All commands in this guide assume you are running from the tools/skill-turn-harness directory. If running from the repository root, prefix paths accordingly (e.g., python3 tools/skill-turn-harness/harness.py ...).

Basic Usage

To run a test, provide a YAML playbook:

python3 harness.py playbooks/my-playbook.yaml

Command Line Options

playbook (Required): The path to the YAML playbook defining the test steps.
--log-dir <path> (Optional): The directory where the harness will write detailed Markdown logs, session JSONs, and JSON failure dumps. Defaults to ./logs.
--skill-src <path> (Optional): The path to a local, unpacked skill directory. See the "Testing Local Skills" section below for details.
--env-file <path> (Optional): The path to a standard .env file containing key-value pairs (e.g. MY_SECRET=123). This is used for secure string substitution within your playbook steps.
--keep-workspace (Optional): Preserve the temporary workspace directory (/tmp/gemini_harness_*) after execution to inspect files generated by the agent.
--agent-model <model> (Optional): Override the model the agent uses (e.g., gemini-2.5-pro). Overrides playbook definition.
--evaluator-model <model> (Optional): Override the model the test harness uses to grade and simulate (e.g., gemini-2.5-flash). Overrides playbook definition.
--debug (Optional): Enable verbose debug logging for the SDK (e.g., WebSocket traffic).

⚠️ Security Warning regarding Logs: If your playbooks require secrets (like API keys or passwords) via the env array, the harness substitutes them into each step before sending it to the agent. Because the harness traces all inputs and outputs for debugging, these substituted secrets are written in plain text to your logs/ directory. A default .gitignore is provided in the logs/ directory to prevent committing these files, but take care not to leak secrets into your repository by other routes (for example, a custom --log-dir).

Expected Output

The harness executes the playbook steps, evaluates the responses, and streams the results to the console:

--- Tuning: FAST Setup PoC | Workspace: /tmp/gemini_harness_abc123 ---

[Step 1] Input: Hi, please activate the fast-setup-poc skill and let's configure FAST.
[Step 1] Output: Hi, let's configure FAST. Please provide your Google Cloud Project ID.
✅ [PASS Step 1]: The agent greeted the user ('Hi'), confirmed it was configuring FAST, and asked for the Project ID. All parts of the objective were fulfilled.

...

✅ [SUCCESS] Playbook 'FAST Setup PoC' completed successfully.
📄 Markdown log saved to: logs/FAST_Setup_PoC_log.md

If a step fails, the harness halts immediately and dumps the full interaction trace to a JSON file (e.g., logs/FAST_Setup_PoC_failed.json) for debugging.
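
The halt-on-first-failure behavior can be pictured as the loop below. It is a sketch under stated assumptions: `run_scripted`, the `agent` and `evaluate` callables, and the trace record layout are hypothetical stand-ins, and the real harness drives the Antigravity SDK and the Gemini evaluator where this example takes plain functions.

```python
import json
from pathlib import Path


def run_scripted(steps, agent, evaluate, log_dir, name):
    """Run steps in order; stop at the first failure and dump the trace."""
    trace = []
    for index, step in enumerate(steps, start=1):
        output = agent(step["user_input"])
        passed, reasoning = evaluate(output, step["expected_outcome"])
        trace.append({
            "step": index,
            "input": step["user_input"],
            "output": output,
            "passed": passed,
            "reasoning": reasoning,
        })
        if not passed:
            # Later steps are never executed once a step fails.
            log_dir = Path(log_dir)
            log_dir.mkdir(parents=True, exist_ok=True)
            dump = log_dir / f"{name}_failed.json"
            dump.write_text(json.dumps(trace, indent=2), encoding="utf-8")
            return False, trace
    return True, trace
```

Taking the agent and evaluator as arguments keeps the loop testable with stubs, without an API key.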

Testing Local Skills (Inner Dev Loop)

When developing a complex skill (with multiple markdown files, prompt templates, or tools), you don't want to package and globally install it just to run a test.

The harness supports testing local skills directly using the --skill-src flag:

python3 harness.py playbooks/my-playbook.yaml --skill-src ./my-local-skill/

How it works under the hood:

The harness passes the skill path to the SDK's LocalAgentConfig(skills_paths=[...]). The Antigravity engine loads the skill dynamically for the duration of the session. Unlike the old CLI-based linking, this is completely isolated and does not modify your global environment.

Workspace Management & Copying Behavior

By default, the test harness executes the agent in an isolated temporary workspace (e.g., /tmp/gemini_harness_<hash>) to prevent session caching and protect your repository from accidental file modifications.

Copying vs. Symlinking

To ensure both safety and compatibility with local search tools (such as the agent's built-in search_directory / grep tool, which can crash when traversing directory symlinks), the harness copies the configured playbook directories recursively instead of symlinking them.

To keep the workspace lightweight and prevent the agent from "cheating" by reading the test definitions, the copy operation implements strict exclusion rules:

Excluded Dependencies/Cache: .terraform, .git, .venv, venv, __pycache__, .pytest_cache are skipped. This reduces the copied size of directories like fast/ from 1.4GB to a few megabytes, making workspace setup near-instant.
Excluded Harness: The skill-turn-harness directory itself is strictly excluded from the copy. This prevents the agent under test from walking the workspace, reading the playbook YAML definitions, and "cheating" by peeking at the expected inputs/outcomes.
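
One way to express these exclusion rules with the standard library is `shutil.copytree` plus `shutil.ignore_patterns`. The sketch below is illustrative only: `copy_into_workspace` and the `EXCLUDED` tuple are names chosen for this example, and the harness's actual copy routine may differ.

```python
import shutil
from pathlib import Path

# Directory names taken from the exclusion list above.
EXCLUDED = (
    ".terraform", ".git", ".venv", "venv",
    "__pycache__", ".pytest_cache", "skill-turn-harness",
)


def copy_into_workspace(source, workspace):
    """Recursively copy `source` into `workspace`, skipping excluded names."""
    source = Path(source)
    target = Path(workspace) / source.name
    # ignore_patterns matches entry names at every depth of the tree.
    # With the default symlinks=False, linked files are copied as real files.
    shutil.copytree(source, target, ignore=shutil.ignore_patterns(*EXCLUDED))
    return target
```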

Linking the `skills` Directory

If your autonomous playbook instructs the agent to "activate" or "inspect" the skill, the model may attempt to search the workspace for the skill's source files (like SKILL.md). For these playbooks, ensure you add skills to the playbook's link_paths so the agent can resolve the lookup locally:

tmpdir:
  link_paths:
    - fast
    - modules
    - tools
    - skills # <-- Make sure to include this so the agent can find skill files

Isolated Chat History

The harness configures the SDK to write raw session state directly to the configured --log-dir (under log_dir/chats/). This ensures that test execution conversations remain isolated and do not pollute your global Antigravity/Jetski desktop history.

Writing Playbooks

Playbooks are written in YAML. For autocompletion and validation in VS Code, add the schema annotation to the top of your playbook.

If your playbook requires environment variables (e.g., secrets), declare them in the env array. You can then reference them in your steps using ${VAR_NAME}. If a variable is declared but not found in the environment (or passed via --env-file), the harness will safely halt before execution.

To run the test in a specific directory (e.g., the repository root), specify working_dir. If omitted, a temporary isolated workspace is created.

# yaml-language-server: $schema=../playbooks/playbook.schema.json
name: "My Test Playbook"
timeout: 120
agent_model: "gemini-2.5-pro"
evaluator_model: "gemini-2.5-flash"
working_dir: "." # Run in the directory where harness is executed
env:
  - MY_API_KEY
steps:
  - user_input: "Hi, activate my-skill and use this key: ${MY_API_KEY}"
    expected_outcome: "The agent should greet the user and acknowledge the key."
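
The env handling described above (declare, verify, then substitute `${VAR_NAME}`) might look roughly like this. It is a sketch, not the harness's code: `resolve_env` and `substitute` are hypothetical helpers, and letting `--env-file` values override the process environment is an assumption.

```python
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(declared, environ, env_file_values=None):
    """Collect a value for every declared variable, failing fast on any gap."""
    # Assumption: values from the env file take precedence over the environment.
    merged = {**environ, **(env_file_values or {})}
    missing = [name for name in declared if name not in merged]
    if missing:
        raise KeyError(f"missing playbook variables: {', '.join(missing)}")
    return {name: merged[name] for name in declared}


def substitute(text, values):
    """Replace ${NAME} for declared variables only; leave other text untouched."""
    return _VAR.sub(lambda m: values.get(m.group(1), m.group(0)), text)
```

Resolving every declared variable before the first step runs is what lets a missing secret stop the run before anything reaches the agent.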

Running the Pytest Suite

This repository includes a pytest suite in the test/ directory to test the harness itself.

To run the fast unit tests (which mock the CLI execution):

python3 -m pytest test/test_harness.py -m "not e2e" -v

To run the full End-to-End (E2E) test (which dynamically links the fixture skill and hits the real Gemini API):

python3 -m pytest test/test_harness.py -m "e2e" -v

Writing Playbooks: Three Modes of Testing

The harness supports three modes of execution depending on how you structure your YAML playbook. The mode is inferred automatically based on the presence of the steps and/or persona keys.
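
The inference rule can be summarized in a few lines. In this sketch, `infer_mode` and the returned mode labels are illustrative names, and treating an empty `steps` list or empty `persona` mapping as absent is an assumption.

```python
def infer_mode(playbook):
    """Pick the execution mode from the keys present in a parsed playbook."""
    has_steps = bool(playbook.get("steps"))
    has_persona = bool(playbook.get("persona"))
    if has_steps and has_persona:
        return "hybrid"
    if has_persona:
        return "autonomous"
    if has_steps:
        return "scripted"
    raise ValueError("playbook must define 'steps', 'persona', or both")
```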

1. Scripted Mode (Unit / Regression)

Best for: Ensuring the exact, rigid state machine of a skill hasn't broken.

You define a strict, sequential list of steps. The harness feeds the user_input and checks if the agent's response satisfies the expected_outcome via an LLM evaluation.

name: "FAST Setup PoC - Scripted"
steps:
  - user_input: "Hi, let's configure FAST."
    expected_outcome: "The agent should greet the user and ask for the Project ID."
  - user_input: "my-super-project-123"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."

2. Autonomous "Pond" Mode (E2E / Fuzz Testing)

Best for: Testing how the skill handles the messy reality of natural language and conversational drift.

Instead of providing a rigid script, you define a declarative Persona with a "Pond" of knowledge and explicit success_criteria. A secondary LLM agent acts as the simulated user, dynamically reading the agent's outputs, fishing data from the "pond," and generating the next input until the success criteria are met or the max_turns limit is reached.

name: "FAST Setup PoC - Autonomous"
persona:
  initial_user_input: "Hi, let's configure FAST."
  context: >
    You are a GCP developer. Your Project ID is my-project-123 and region is europe-west1.
    Do not volunteer information until the agent explicitly asks for it.
  max_turns: 10
  success_criteria:
    llm_checks:
      - "The agent provided a final configuration summary containing the correct project_id and region."
    tool_calls_contain:
      run_shell_command:
        - "gcloud organizations add-iam-policy-binding"
    files_exist:
      - "0-org-setup.auto.tfvars"
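
The two non-LLM criteria in this example, tool_calls_contain and files_exist, lend themselves to plain deterministic checks. The sketch below assumes recorded tool calls are available as a mapping from tool name to a list of command strings; that shape and the `check_deterministic_criteria` helper are hypothetical. The llm_checks entries are left out because they need the evaluator model.

```python
from pathlib import Path


def check_deterministic_criteria(criteria, tool_calls, workspace):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    for tool, fragments in criteria.get("tool_calls_contain", {}).items():
        recorded = tool_calls.get(tool, [])
        for fragment in fragments:
            # Substring match against every recorded call of this tool.
            if not any(fragment in call for call in recorded):
                failures.append(f"{tool}: no call containing {fragment!r}")
    for relative in criteria.get("files_exist", []):
        if not (Path(workspace) / relative).exists():
            failures.append(f"missing file: {relative}")
    return failures
```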

3. Hybrid Fallback Mode

Best for: Testing happy-path compliance while ensuring the agent can recover from unexpected deviations.

If a playbook defines both steps and a persona, the harness runs in Hybrid mode. It attempts to execute the rigid steps first. If the skill deviates or fails a step evaluation, instead of failing the test outright, the harness falls back to the autonomous persona. The simulated user takes over the conversation history and attempts to guide the agent back on track to meet the success_criteria. If successful, the test returns a PASS WITH WARNINGS.
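
The fallback flow can be sketched as follows. `run_hybrid`, `run_step`, and `run_persona` are hypothetical names. The `PASS` and `FAIL` labels, and the choice to return `PASS` without consulting the persona when every scripted step succeeds, are assumptions. Only `PASS WITH WARNINGS` comes from the description above.

```python
def run_hybrid(steps, run_step, run_persona):
    """Try scripted steps; on the first failed step hand over to the persona."""
    history = []
    for step in steps:
        passed, output = run_step(step)
        history.append((step["user_input"], output))
        if not passed:
            # The simulated user inherits the conversation so far and tries
            # to steer the agent toward the success criteria.
            recovered = run_persona(history)
            return "PASS WITH WARNINGS" if recovered else "FAIL"
    return "PASS"
```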
