Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979)

* initial version of a FAST pre-install skill
* first round of testing
* Update fast-0-org-setup-prereqs skill with improved UX and local path handling
  - Add explicit lockout warning and stop condition if the user is not a member of the provided Admin Principal group.
  - Streamline bootstrap project selection to only prompt for an override if the active gcloud project is rejected.
  - Restrict dataset discovery strictly to the `fast/stages/0-org-setup/datasets/` directory.
  - Improve location handling by referencing `defaults.schema.json` for Standard GCP and auto-configuring fixed regions for GCD.
  - Add comprehensive `local_path` management: prompt for customization, create directories, move `defaults.yaml` to the local data folder, and symlink `0-org-setup.auto.tfvars` back to the stage directory.
* add testing scenarios, implement initial changes for scenario 2
* move skills
* move to a skills/fast subfolder
* Refactor fast-0-org-setup prereqs skill
* Add skill-turn-harness utility tool
* Use relative markdown links for skill references
* Use descriptive titles for markdown links in skill references
* Add descriptions to each phase in the prerequisites workflow map
* Use backslash for markdown line breaks in skill map
* Update README security warning to mention default .gitignore
* shebang
* Update fast prereqs skill rules to force sequential question flow and refine harness tool with proper ctrl+c handling and slugified log paths
* Move playbook-gcp-dev.yaml to fast/prerequisites/gcp-dev.yaml and update fast prerequisites
* docs(skill-turn-harness): detail autonomous pond testing approach
* docs(skill-turn-harness): add final_state_checks to pond architecture and update toc
* Refine fast prereqs SKILL and gcp-dev playbook to strictly align with one-question-at-a-time rule
* feat(skill-turn-harness): update playbook schema for autonomous persona mode
* feat(skill-turn-harness): implement autonomous persona testing mode and fallback logic
* docs(skill-turn-harness): document the three modes of testing and update ToC
* implement timeout, schema validation, configurable cli
* chore: remove accidentally committed log files
* chore: ignore logs directory
* feat(skill-harness): implement tool execution interception, configurable workspace, and modularized validation
* feat(skill-harness): add model configuration and update README
* fix(skill-harness): automatically inject -y flag to gemini commands
* docs(skill-harness): add TODO.md with analysis for skill environment dependencies
* feat(skill-harness): add working_dir support and clean up fixtures
  - Implement working_dir in harness to run tests in specific directories.
  - Rename test fixtures and playbooks to be more descriptive.
  - Add E2E test for working_dir.
  - Apply code quality improvements to harness.py (imports, linting).
  - Update README with working directory considerations and usage notes.
  - Update phase3-bootstrap-and-iam.md skill doc to add execution rule against creating temp scripts.
* fix: capture customer_id and respect relative paths
* Implement isolated temp workspace sandboxing with symlinks in test harness
* Configure GCD manual autonomous playbook and align Phase 3/4 steps order
* Fix linting and schema tests failures
  - Add missing license headers to tools/skill-turn-harness files.
  - Fix trailing spaces and newlines in playbooks.
  - Ignore tools directory in schema tests workflow.
  TAG=agy CONV=1bb75453-c3e2-448b-bae9-8e332a068012
* Fix Python formatting with yapf
  TAG=agy CONV=1bb75453-c3e2-448b-bae9-8e332a068012
* Refactor skill-turn-harness to use Antigravity SDK
  - Migrated harness from gemini-cli subprocesses to Antigravity SDK.
  - Implemented real-time step streaming and console logging.
  - Added color-coded terminal output (dark gray headers, blue inputs, pink outputs).
  - Collapsed excessive newlines in streamed thoughts.
  - Excluded harness codebase from workspace copy to prevent agent cheating.
  - Enabled skills folder copy to resolve agent lookup loops.
  - Added key validation and CLI --debug flag.
* Fix autonomous turn layout: print Turn ID before execution
  - Moved the [Autonomous Turn X] header print to before running the agent turn.
  - This groups the real-time thinking and tool calls under the correct Turn ID block, instead of displaying them before the label.
* Remove obsolete .log.md from prerequisites skill directory
2026-05-22 19:16:54 +02:00
parent 1594a01c6f
commit 81f72e8068
32 changed files with 2653 additions and 1 deletions
--- a/tools/skill-turn-harness/.gitignore
+++ b/tools/skill-turn-harness/.gitignore
@@ -0,0 +1,6 @@
+.env*
+!*.env.example
+!*.env.test
+__pycache__/
+.pytest_cache/
+.ruff_cache/
--- a/tools/skill-turn-harness/.style.yapf
+++ b/tools/skill-turn-harness/.style.yapf
@@ -0,0 +1,4 @@
+[style]
+based_on_style=google
+indent_width=2
+split_before_named_assigns=false
--- a/tools/skill-turn-harness/README.md
+++ b/tools/skill-turn-harness/README.md
@@ -0,0 +1,216 @@
+# Hybrid Python Test Harness for Antigravity Skills
+
+## Overview
+
+This project provides a robust, hybrid test harness for developing and evaluating Antigravity skills. It solves the "inner dev loop" problem by allowing you to test local, unpacked skills directly against the Antigravity engine (via the Python SDK), while using an LLM to deterministically grade the agent's behavior.
+
+## Table of Contents
+
+- [Prerequisites](#prerequisites)
+- [How to Use](#how-to-use)
+  - [Basic Usage](#basic-usage)
+  - [Command Line Options](#command-line-options)
+  - [Expected Output](#expected-output)
+- [Testing Local Skills (Inner Dev Loop)](#testing-local-skills-inner-dev-loop)
+- [Workspace Management & Copying Behavior](#workspace-management--copying-behavior)
+- [Writing Playbooks](#writing-playbooks)
+- [Running the Pytest Suite](#running-the-pytest-suite)
+- [Writing Playbooks: Three Modes of Testing](#writing-playbooks-three-modes-of-testing)
+
+The architecture relies on three main components:
+
+- **Orchestrator (Python):** Drives the execution loop, reads YAML playbooks, and manages isolated workspaces for each test run to prevent session caching issues.
+- **Execution Target (Antigravity SDK):** The skill is executed using the `google-antigravity` Python SDK, which manages the localharness engine in-process. This eliminates the dependency on a globally installed CLI.
+- **Evaluator (Gemini API):** The semantic evaluation of the agent's output is performed via direct API calls to `gemini-2.5-flash` using `google-genai`. This bypasses brittle string-parsing and guarantees structured JSON output (Pass/Fail + Reasoning).
+
+## Prerequisites
+
+You can run the harness directly using `uv` (recommended), which will automatically handle downloading and running with the required dependencies:
+
+```bash
+uv run harness.py playbooks/my-playbook.yaml
+```
+
+Alternatively, ensure you have a Python virtual environment set up with the required dependencies:
+
+```bash
+pip install google-antigravity google-genai pydantic pyyaml click pytest
+```
+
+You also need your Gemini API key available in your environment, or stored in `~/.gemini/key.env`:
+
+```bash
+export GEMINI_API_KEY="your_api_key_here"
+```
+
+## How to Use
+
+The main entry point is the `harness.py` CLI tool.
+
+> [!NOTE]
+> All commands in this guide assume you are running from the `tools/skill-turn-harness` directory. If running from the repository root, prefix paths accordingly (e.g., `python3 tools/skill-turn-harness/harness.py ...`).
+
+### Basic Usage
+
+To run a test, provide a YAML playbook:
+
+```bash
+python3 harness.py playbooks/my-playbook.yaml
+```
+
+### Command Line Options
+
+- `playbook` (Required): The path to the YAML playbook defining the test steps.
+- `--log-dir <path>` (Optional): The directory where the harness will write detailed Markdown logs, session JSONs, and JSON failure dumps. Defaults to `./logs`.
+- `--skill-src <path>` (Optional): The path to a local, unpacked skill directory. See the "Testing Local Skills" section below for details.
+- `--env-file <path>` (Optional): The path to a standard `.env` file containing key-value pairs (e.g. `MY_SECRET=123`). This is used for secure string substitution within your playbook steps.
+- `--keep-workspace` (Optional): Preserve the temporary workspace directory (`/tmp/gemini_harness_*`) after execution to inspect files generated by the agent.
+- `--agent-model <model>` (Optional): Override the model the agent uses (e.g., `gemini-2.5-pro`). Overrides playbook definition.
+- `--evaluator-model <model>` (Optional): Override the model the test harness uses to grade and simulate (e.g., `gemini-2.5-flash`). Overrides playbook definition.
+- `--debug` (Optional): Enable verbose debug logging for the SDK (e.g., WebSocket traffic).
+
+⚠️ **Security Warning regarding Logs:**
+If your playbooks require secrets (such as API keys or passwords) via the `env` array, the harness substitutes them into the step inputs before running the agent. Because the harness traces all inputs and outputs for debugging, **these substituted secrets will be written in plain text** to the log directory (`logs/` by default).
+A default `.gitignore` is provided in the `logs/` directory to prevent committing these files, but care should still be taken to avoid leaking secrets into your repository.
+
+### Expected Output
+
+The harness executes the playbook steps, evaluates the responses, and streams the results to the console:
+
+```text
+--- Tuning: FAST Setup PoC | Workspace: /tmp/gemini_harness_abc123 ---
+
+[Step 1] Input: Hi, please activate the fast-setup-poc skill and let's configure FAST.
+[Step 1] Output: Hi, let's configure FAST. Please provide your Google Cloud Project ID.
+✅ [PASS Step 1]: The agent greeted the user ('Hi'), confirmed it was configuring FAST, and asked for the Project ID. All parts of the objective were fulfilled.
+
+...
+
+✅ [SUCCESS] Playbook 'FAST Setup PoC' completed successfully.
+📄 Markdown log saved to: logs/FAST_Setup_PoC_log.md
+```
+
+If a step fails, the harness halts immediately and dumps the full interaction trace to a JSON file (e.g., `logs/FAST_Setup_PoC_failed.json`) for debugging.
+
+## Testing Local Skills (Inner Dev Loop)
+
+When developing a complex skill (with multiple markdown files, prompt templates, or tools), you don't want to package and globally install it just to run a test.
+
+The harness supports testing local skills directly using the `--skill-src` flag:
+
+```bash
+python3 harness.py playbooks/my-playbook.yaml --skill-src ./my-local-skill/
+```
+
+**How it works under the hood:**
+
+The harness passes the skill path to the SDK's `LocalAgentConfig(skills_paths=[...])`. The Antigravity engine loads the skill dynamically for the duration of the session. Unlike the old CLI-based linking, this is completely isolated and does not modify your global environment.
+
+## Workspace Management & Copying Behavior
+
+By default, the test harness executes the agent in an isolated temporary workspace (e.g., `/tmp/gemini_harness_<hash>`) to prevent session caching and protect your repository from accidental file modifications.
+
+### Copying vs. Symlinking
+To ensure both safety and compatibility with local search tools (such as the agent's built-in `search_directory` / grep tool, which can crash when traversing directory symlinks), **the harness copies the configured playbook directories recursively instead of symlinking them**.
+
+To keep the workspace lightweight and prevent the agent from "cheating" by reading the test definitions, the copy operation implements strict exclusion rules:
+- **Excluded Dependencies/Cache**: `.terraform`, `.git`, `.venv`, `venv`, `__pycache__`, `.pytest_cache` are skipped. This reduces the copied size of directories like `fast/` from 1.4GB to a few megabytes, making workspace setup near-instant.
+- **Excluded Harness**: The `skill-turn-harness` directory itself is **strictly excluded** from the copy. This prevents the agent under test from walking the workspace, reading the playbook YAML definitions, and "cheating" by peeking at the expected inputs/outcomes.
+
+### Linking the `skills` Directory
+If your autonomous playbook instructs the agent to "activate" or "inspect" the skill, the model may attempt to search the workspace for the skill's source files (like `SKILL.md`). For these playbooks, ensure you add `skills` to the playbook's `link_paths` so the agent can resolve the lookup locally:
+
+```yaml
+tmpdir:
+  link_paths:
+    - fast
+    - modules
+    - tools
+    - skills # <-- Make sure to include this so the agent can find skill files
+```
+
+### Isolated Chat History
+The harness configures the SDK to write raw session state directly to the configured `--log-dir` (under `log_dir/chats/`). This ensures that test execution conversations remain isolated and do not pollute your global Antigravity/Jetski desktop history.
+
+## Writing Playbooks
+
+Playbooks are written in YAML. For autocompletion and validation in VS Code, add the schema annotation to the top of your playbook.
+
+If your playbook requires environment variables (e.g., secrets), declare them in the `env` array. You can then reference them in your `steps` using `${VAR_NAME}`. If a variable is declared but not found in the environment (or passed via `--env-file`), the harness will safely halt before execution.
+
+To run the test in a specific directory (e.g., the repository root), specify `working_dir`. If omitted, a temporary isolated workspace is created.
+
+```yaml
+# yaml-language-server: $schema=../playbooks/playbook.schema.json
+name: "My Test Playbook"
+timeout: 120
+agent_model: "gemini-2.5-pro"
+evaluator_model: "gemini-2.5-flash"
+working_dir: "." # Run in the directory where harness is executed
+env:
+  - MY_API_KEY
+steps:
+  - user_input: "Hi, activate my-skill and use this key: ${MY_API_KEY}"
+    expected_outcome: "The agent should greet the user and acknowledge the key."
+```
+
+## Running the Pytest Suite
+
+This repository includes a `pytest` suite in the `test/` directory to test the harness itself.
+
+To run the fast unit tests (which mock the CLI execution):
+
+```bash
+python3 -m pytest test/test_harness.py -m "not e2e" -v
+```
+
+To run the full End-to-End (E2E) test (which dynamically links the fixture skill and hits the real Gemini API):
+
+```bash
+python3 -m pytest test/test_harness.py -m "e2e" -v
+```
+
+## Writing Playbooks: Three Modes of Testing
+
+The harness supports three modes of execution depending on how you structure your YAML playbook. The mode is inferred automatically based on the presence of the `steps` and/or `persona` keys.
+
+### 1. Scripted Mode (Unit / Regression)
+**Best for:** Ensuring the exact, rigid state machine of a skill hasn't broken.
+
+You define a strict, sequential list of `steps`. The harness feeds the `user_input` and checks if the agent's response satisfies the `expected_outcome` via an LLM evaluation.
+
+```yaml
+name: "FAST Setup PoC - Scripted"
+steps:
+  - user_input: "Hi, let's configure FAST."
+    expected_outcome: "The agent should greet the user and ask for the Project ID."
+  - user_input: "my-super-project-123"
+    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
+```
+
+### 2. Autonomous "Pond" Mode (E2E / Fuzz Testing)
+**Best for:** Testing how the skill handles the messy reality of natural language and conversational drift.
+
+Instead of providing a rigid script, you define a declarative **Persona** with a "Pond" of knowledge and explicit `success_criteria`. A secondary LLM agent acts as the simulated user, dynamically reading the agent's outputs, fishing data from the "pond," and generating the next input until the success criteria are met or the `max_turns` limit is reached.
+
+```yaml
+name: "FAST Setup PoC - Autonomous"
+persona:
+  initial_user_input: "Hi, let's configure FAST."
+  context: >
+    You are a GCP developer. Your Project ID is my-project-123 and region is europe-west1.
+    Do not volunteer information until the agent explicitly asks for it.
+  max_turns: 10
+  success_criteria:
+    llm_checks:
+      - "The agent provided a final configuration summary containing the correct project_id and region."
+    tool_calls_contain:
+      run_shell_command:
+        - "gcloud organizations add-iam-policy-binding"
+    files_exist:
+      - "0-org-setup.auto.tfvars"
+```
+
+### 3. Hybrid Fallback Mode
+**Best for:** Testing happy-path compliance while ensuring the agent can recover from unexpected deviations.
+
+If a playbook defines **both** `steps` and a `persona`, the harness runs in Hybrid mode. It attempts to execute the rigid `steps` first. If the skill deviates or fails a step evaluation, instead of failing the test outright, the harness **falls back** to the autonomous persona. The simulated user takes over the conversation history and attempts to guide the agent back on track to meet the `success_criteria`. If successful, the test returns a `PASS WITH WARNINGS`.
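The mode inference described in the README above can be sketched as follows. This is a minimal illustration, assuming the playbook has already been parsed into a dictionary; `infer_mode` and its return labels are hypothetical names, not taken from `harness.py`.

```python
from typing import Any, Mapping


def infer_mode(playbook: Mapping[str, Any]) -> str:
    """Return 'scripted', 'autonomous', or 'hybrid' for a parsed playbook.

    Hypothetical helper: the real harness may infer the mode differently.
    """
    has_steps = bool(playbook.get("steps"))
    has_persona = bool(playbook.get("persona"))
    if has_steps and has_persona:
        return "hybrid"      # run the steps first, fall back to the persona
    if has_persona:
        return "autonomous"  # simulated user drives the conversation
    if has_steps:
        return "scripted"    # strict, sequential step evaluation
    raise ValueError("playbook must define 'steps', 'persona', or both")


if __name__ == "__main__":
    print(infer_mode({"name": "demo", "steps": [{"user_input": "Hi"}]}))
```

Rejecting a playbook that has neither key is a design choice of this sketch; the README does not say how the harness treats that case.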
--- a/tools/skill-turn-harness/TODO.md
+++ b/tools/skill-turn-harness/TODO.md
@@ -0,0 +1,40 @@
+# Test Harness TODOs & Analysis
+
+## The Problem: Skill Environment Dependencies
+
+During E2E testing of the `fast-0-org-setup-prereqs` skill, the autonomous agent hallucinated that it needed to `git clone` the `cloud-foundation-fabric` repository. 
+
+This occurred because the test harness executes the Gemini CLI in a completely empty, isolated temporary workspace (`/tmp/gemini_harness_*`). However, the FAST skill *assumes* it is being executed from the root of the `cloud-foundation-fabric` repository because it needs to:
+1. Read available datasets from `fast/stages/0-org-setup/datasets/*/defaults.yaml`.
+2. Create a symbolic link from the generated `0-org-setup.auto.tfvars` file to `fast/stages/0-org-setup/0-org-setup.auto.tfvars`.
+
+Because these directories did not exist in the isolated workspace, the agent failed to complete Phase 4 of the setup.
+
+## Proposed Solutions for Next Session
+
+We need a way to provide the agent with the expected repository structure during tests without compromising the safety and isolation of the test harness.
+
+### Option 1: Add a `working_dir` Playbook Attribute
+Allow the playbook to specify a `working_dir` (e.g., the actual path to the `cloud-foundation-fabric` repo) where the CLI should be executed, bypassing the temporary workspace creation.
+
+**Risks & Impacts:**
+*   **Chat History Pollution:** The test's conversation will be saved to the global `~/.gemini/tmp/cloud-foundation-fabric/chats/` directory, mixing test runs with the developer's actual day-to-day CLI usage.
+*   **Session Retrieval:** The harness will need to be updated to find the *newest* `session-*.json` file in that directory, rather than assuming it's the only one.
+*   **File Modification Risk:** If the agent hallucinates or a test is poorly written, it could modify or delete real files in the repository instead of sandboxed test files.
+*   **Cleanup:** The harness cannot safely clean up the workspace after the test completes.
+
+### Option 2: The "Symlink Sandbox" (Recommended)
+Keep the isolated temporary workspace (`/tmp/gemini_harness_*`), but add a `symlink_paths` array to the playbook schema. 
+
+Before the test starts, the harness would dynamically create symbolic links from the real repository (e.g., `fast/`, `modules/`) into the temporary workspace.
+
+**Benefits:**
+*   **Total Isolation:** The agent's chat history remains isolated in a temporary `.gemini/tmp/gemini_harness_*/` directory.
+*   **Safe Execution:** The agent sees the directory structure it expects (and can read the `defaults.yaml` files), but any new files it creates (like the `custom-fast-config` directory or the `0-org-setup.auto.tfvars` symlink) are created safely inside the temporary workspace.
+*   **Automatic Cleanup:** The entire workspace (including the symlinks and generated files) is safely deleted when the test finishes.
+
+## Next Steps
+1. Decide between Option 1 (`working_dir`) and Option 2 (`symlink_paths`).
+2. Implement the chosen solution in `harness.py`.
+3. Update `playbook.schema.json` and `README.md`.
+4. Re-run the `gcp-dev-autonomous.yaml` E2E test to verify Phase 4 completes successfully.
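Option 2 from the TODO above could look roughly like the sketch below. It assumes a hypothetical `build_symlink_sandbox` helper that receives the repository root and the playbook's `symlink_paths` list; only the `gemini_harness_` prefix comes from the document, everything else is illustrative.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def build_symlink_sandbox(repo_root: Path, symlink_paths: Iterable[str]) -> Path:
    """Create a temp workspace and link the listed repo paths into it.

    Hypothetical helper, not the harness implementation.
    """
    workspace = Path(tempfile.mkdtemp(prefix="gemini_harness_"))
    for rel in symlink_paths:
        source = (repo_root / rel).resolve()
        if not source.exists():
            raise FileNotFoundError(f"symlink source not found: {source}")
        target = workspace / rel
        # Nested entries such as "fast/stages" need their parent directory.
        target.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(source, target, target_is_directory=source.is_dir())
    return workspace
```

Files created at the top level of the workspace stay in the sandbox, and removing the workspace with `shutil.rmtree` deletes the links without touching their targets. Writes made through a linked directory still land in the real repository, which is a trade-off to weigh against copying the directories instead.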
--- a/tools/skill-turn-harness/docs/DESIGN.md
+++ b/tools/skill-turn-harness/docs/DESIGN.md
@@ -0,0 +1,42 @@
+# Design Decisions: Test Harness Architecture
+
+## Context
+
+This document captures architectural decisions and considerations for the `harness.py` test harness.
+
+## LangChain Integration Analysis
+
+*Date: April 15, 2026*
+
+We evaluated whether to integrate LangChain into the `harness.py` script. The script currently acts as a lightweight testing harness that uses `subprocess` to interact with the Gemini CLI and the native `google.genai` SDK for evaluation using structured outputs (Pydantic).
+
+### Potential Benefits of LangChain
+
+1. **Model-Agnostic Evaluators (Avoiding Self-Bias):**
+   Currently, the harness uses Gemini 2.5 Flash to evaluate the Gemini CLI. To avoid "self-preference bias", it is often best practice to use a different model family for evaluation. LangChain's `ChatModel` abstractions would allow swapping the evaluator model easily without rewriting API call logic.
+2. **Built-in Evaluation Frameworks:**
+   LangChain provides a dedicated evaluation module (`langchain.evaluation`). Instead of custom prompts, we could leverage pre-built evaluators (like `CriteriaEvalChain`) that are prompt-engineered to reduce hallucinations and false positives.
+3. **Observability and Tracing (LangSmith):**
+   Integration provides seamless access to LangSmith for logging evaluation runs, inspecting prompts, latency, token usage, and tracking pass/fail rates over time.
+4. **Prompt Management:**
+   LangChain's `PromptTemplate` system offers robust handling for complex evaluation criteria (e.g., few-shot examples, dynamic context).
+
+### Drawbacks and Limitations
+
+1. **Overkill for Current Scope:**
+   The current script is lightweight and readable. LangChain is a heavy dependency that introduces complex abstractions (like LCEL/Runnables), adding bloat and a steeper learning curve.
+2. **Native Structured Outputs are Sufficient:**
+   The native `google.genai` SDK already handles structured JSON outputs via `response_schema=EvaluationResult` efficiently and reliably. LangChain's structured output would merely wrap this existing capability.
+3. **External Agent Execution:**
+   LangChain excels at managing agent memory, tools, and reasoning loops. Since our harness tests an external CLI tool via `subprocess.run`, LangChain cannot orchestrate the agent and is relegated strictly to the role of a grader.
+
+### Conclusion & Recommendation
+
+**Recommendation: Hold off on LangChain for now.**
+
+The current architecture is elegant, dependency-light, and perfectly suited for its job. The native `google.genai` SDK handles the structured Pydantic evaluation flawlessly.
+
+**When to reconsider LangChain:**
+
+- We need to evaluate the CLI using non-Google models (e.g., Claude, GPT-4) to ensure unbiased grading.
+- We require visual tracking of test runs, prompt versions, and token costs using LangSmith.
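One lightweight way to keep the evaluator swappable without adopting LangChain is to have the orchestrator depend on a small interface. The sketch below is an assumption-laden illustration: `Evaluator`, `KeywordEvaluator`, and `grade_step` are hypothetical names, and a plain dataclass stands in for the Pydantic `EvaluationResult` so the example runs with the standard library only.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EvaluationResult:
    passed: bool
    reasoning: str


class Evaluator(Protocol):
    """Anything that can grade a response against an objective."""

    def evaluate(self, objective: str, response: str) -> EvaluationResult: ...


class KeywordEvaluator:
    """Toy stand-in grader: passes when every keyword appears in the response."""

    def __init__(self, keywords: list[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, objective: str, response: str) -> EvaluationResult:
        missing = [k for k in self.keywords if k not in response.lower()]
        if missing:
            return EvaluationResult(False, f"missing keywords: {missing}")
        return EvaluationResult(True, "all keywords present")


def grade_step(evaluator: Evaluator, objective: str, response: str) -> bool:
    """The orchestrator depends only on the Evaluator interface."""
    return evaluator.evaluate(objective, response).passed
```

A Gemini-backed or other-vendor evaluator would implement the same `evaluate` method, so swapping graders would not require changes to the loop that calls `grade_step`.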
--- a/tools/skill-turn-harness/docs/Hybrid
+++ b/tools/skill-turn-harness/docs/Hybrid
@@ -0,0 +1,92 @@
+# **Architecture Document: Hybrid Python/CLI Test Harness**
+
+This document outlines the architecture for testing the Fabric FAST configuration skill. It uses a hybrid approach, executing the skill in its native CLI environment while maintaining deterministic control via a Python orchestration loop.
+
+## **1\. The Approach: Hybrid Isolation**
+
+To accurately test the skill in its target environment while ensuring the reliability of the test harness, the execution and evaluation layers are strictly separated:
+
+* **Orchestrator (Python):** A Python script acts as the absolute authority. It maintains the state machine, reads the playbook, injects inputs, captures outputs, and triggers evaluations.  
+* **Execution Target (Gemini CLI):** The skill is run via the `gemini` CLI using Python's `subprocess` module. This ensures the test reflects the actual user environment. State is maintained across steps using the CLI's session management flags (e.g., `--resume`).  
+* **Evaluator (Gemini API):** The semantic evaluation of the CLI's output is performed via direct API calls to Gemini 1.5 Flash. This bypasses the string-parsing unreliability of a CLI and guarantees structured JSON output via Pydantic schemas.
+
+## **2\. The Execution Loop**
+
+The Python orchestrator executes the following rigid sequence for each step in a defined playbook:
+
+1. **Injection:** Read the mocked user input and expected outcome from the playbook step.  
+2. **Subprocess Execution:** Invoke the Gemini CLI with the user input and the designated `session_id`. Capture stdout, check the exit code and stderr to detect crashes, and enforce a timeout to handle hangs.  
+3. **Prompt Assembly:** Construct a strict evaluation prompt combining the exact playbook expectation with the raw string response captured from the CLI.  
+4. **Stateless Evaluation:** Call the Gemini API with the evaluation prompt, enforcing a structured output schema (Boolean Pass/Fail and Reasoning).  
+5. **Verdict Enforcement:** If the evaluator returns True, proceed to the next step. If False, immediately halt the loop, dump the interaction trace to a JSON file, and alert the developer.
+
+## **3\. Implementation Code**
+
+The following Python script implements the hybrid harness:
+
+```python
+import json
+import subprocess
+import sys
+
+from google import genai
+from google.genai import types
+from pydantic import BaseModel
+
+
+# 1. Define the strict evaluator schema.
+class EvaluationResult(BaseModel):
+    passed: bool
+    reasoning: str
+
+
+evaluator_client = genai.Client()
+
+
+def invoke_skill_cli(user_input: str, session_id: str) -> str:
+    # Requires the CLI to support a session resume flag for state.
+    command = ["gemini", "--resume", session_id, "-p", user_input]
+    try:
+        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
+        if result.returncode != 0:
+            print(f"⚠️ [CLI ERROR]: {result.stderr}", file=sys.stderr)
+            return f"SYSTEM_ERROR: {result.stderr}"
+        return result.stdout.strip()
+    except subprocess.TimeoutExpired:
+        print("⚠️ [CLI TIMEOUT]", file=sys.stderr)
+        return "SYSTEM_ERROR: Timeout"
+
+
+def run_hybrid_tuning_loop(playbook_name: str, playbook_steps: list, session_id: str):
+    print(f"--- Tuning: {playbook_name} | Session: {session_id} ---")
+    interaction_log = []
+
+    for step_index, step in enumerate(playbook_steps):
+        user_input, expected_outcome = step['user_input'], step['expected_outcome']
+
+        skill_response = invoke_skill_cli(user_input, session_id)
+        # A CLI error or timeout is a failed run; do not fall through to SUCCESS.
+        if skill_response.startswith("SYSTEM_ERROR"):
+            return False
+
+        eval_prompt = f"""
+        OBJECTIVE: {expected_outcome}
+        ACTUAL RESPONSE: {skill_response}
+        Evaluate if the agent fulfilled the objective.
+        """
+
+        eval_response = evaluator_client.models.generate_content(
+            model="gemini-1.5-flash", contents=eval_prompt,
+            config=types.GenerateContentConfig(
+                response_mime_type="application/json",
+                response_schema=EvaluationResult, temperature=0.0
+            )
+        )
+
+        parsed_eval = json.loads(eval_response.text)
+        interaction_log.append({"step": step_index + 1, "input": user_input, "evaluation": parsed_eval})
+
+        if not parsed_eval['passed']:
+            print(f"❌ [FAILURE Step {step_index + 1}]: {parsed_eval['reasoning']}")
+            with open(f"{playbook_name}_failed.json", "w") as f:
+                json.dump(interaction_log, f)
+            return False
+
+    print("✅ [SUCCESS]")
+    return True
+```
+
+## **4\. Critical Implementation Warnings**
+
+* **Session Data Persistence:** The CLI likely persists session state to disk (e.g., in a local database or JSON file). If you reuse the same `session_id` for consecutive test runs without deleting the cached state, the skill inherits the context of the previous run, causing immediate test failures. Either generate a UUID for every run or build a cache-clearing mechanism into the Python script.  
+* **Context Window Discipline:** The evaluation prompt is strictly limited to the current playbook objective and the immediate CLI response. Do not feed the entire CLI conversation history to the Evaluator API, as this significantly increases the risk of hallucinated grading.
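The two warnings above can be illustrated with a short sketch. It assumes hypothetical helpers `new_session_id` and `build_eval_prompt`; whether the CLI accepts an arbitrary identifier for its resume flag is also an assumption, not something the document confirms.

```python
import re
import uuid


def new_session_id(playbook_name: str) -> str:
    """Return a fresh identifier per run so no cached session state is reused."""
    slug = re.sub(r"[^a-z0-9]+", "-", playbook_name.lower()).strip("-")
    return f"{slug}-{uuid.uuid4().hex}"


def build_eval_prompt(expected_outcome: str, skill_response: str) -> str:
    """Build the grading prompt from the current step only.

    Earlier turns are deliberately left out, following the
    context window discipline described above.
    """
    return (
        f"OBJECTIVE: {expected_outcome}\n"
        f"ACTUAL RESPONSE: {skill_response}\n"
        "Evaluate if the agent fulfilled the objective."
    )


if __name__ == "__main__":
    print(new_session_id("FAST Setup PoC"))
    print(build_eval_prompt("Ask for the Project ID.", "Please provide your Project ID."))
```

Keeping the playbook slug in the identifier makes per-run artifacts easier to match to a playbook, while the random suffix keeps consecutive runs from colliding.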
--- a/tools/skill-turn-harness/harness.py
+++ b/tools/skill-turn-harness/harness.py
--- a/tools/skill-turn-harness/logs/.gitignore
+++ b/tools/skill-turn-harness/logs/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore
--- a/tools/skill-turn-harness/playbooks/.gitignore
+++ b/tools/skill-turn-harness/playbooks/.gitignore
@@ -0,0 +1 @@
+**/*.env
--- a/tools/skill-turn-harness/playbooks/fast/prerequisites/gcd-custom-manual-autonomous.yaml
+++ b/tools/skill-turn-harness/playbooks/fast/prerequisites/gcd-custom-manual-autonomous.yaml
@@ -0,0 +1,78 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbook.schema.json
+name: "FAST 0-org-setup Prereqs - GCD Custom Manual Flow (Autonomous)"
+timeout: 300
+agent_model: "gemini-3.5-flash"
+evaluator_model: "gemini-3.5-flash"
+tmpdir:
+  link_paths:
+    - fast
+    - modules
+    - tools
+    - skills
+    - .yamllint
+    - GEMINI.md
+    - AGENTS.md
+    - README.md
+    - FACTORIES.md
+persona:
+  initial_user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
+  context: >
+    You are a GCP developer setting up FAST in a Google Cloud Dedicated (GCD) environment.
+    Your target environment is Google Cloud Dedicated (GCD).
+    You prefer the agent to output commands for you to execute manually.
+    Since the execution mode is manual, the agent will output commands for you to run. Whenever it does, pretend you executed them successfully and tell the agent to proceed (e.g., say "Done", "I ran it", or "OK" to continue).
+    You are using a Custom GCD Universe (not S3NS or Berlin). When asked about the universe, reply that it is "Custom". Do not volunteer the universe details all at once. Wait for the agent to ask for each attribute individually, and then provide:
+      - For Universe Web Domain: custom.cloud.domain
+      - For Universe API Domain: custom-apis.domain
+      - For Universe Name: custom-gcd
+      - For Universe Prefix: cust
+      - For Universe Region: u-custom-region1
+    Confirm the compiled list of 5 universe values is correct when presented.
+    You are NOT authenticated with Google Cloud. When the agent asks for the workforce pool audience string, provide '//iam.googleapis.com/locations/global/workforcePools/my-pool/providers/my-provider'. When the agent outputs WIF login commands, pretend you run them successfully and confirm you are authenticated.
+    You want to use a Single User for the Admin Principal (Approach B). When the agent asks you to run the command to get your current principal, provide 'principal://iam.googleapis.com/locations/global/workforcePools/my-pool/subject/my-user@custom.cloud.domain'.
+    When asked for Organization ID, provide the Org ID '1092874262642' directly (and state there is no domain).
+    When asked for Billing Account ID, provide "012345-6789AB-CDEF01".
+    Your access level to the billing account is Scenario 3 (No Access). Confirm you want to proceed despite the warnings.
+    You do not have a pre-existing project for the bootstrap project. When the agent instructs you to create one, tell it you created it and the Project ID is "my-custom-bootstrap-project".
+    Confirm the configuration dataset is 'classic-gcd'.
+    Your base location is automatically set to u-custom-region1.
+    Your local path for output files is custom-fast-config.
+    You do not have any additional static context values.
+    When the agent instructs you to run fast-links.sh, pretend you run it and it outputs the linking commands. Then pretend you run those linking commands successfully.
+    When asked to check for existing organization policies, pretend the command output returned "constraints/compute.disableSerialPortAccess".
+    Do not volunteer information unless explicitly asked. Answer only the question asked by the agent.
+  max_turns: 30
+  success_criteria:
+    llm_checks:
+      - "The agent explicitly provided the final wrap-up instructions containing the commands 'terraform init' and 'terraform apply'."
+    files_exist:
+      - "custom-fast-config/0-org-setup.auto.tfvars"
+      - "custom-fast-config/providers/0-org-setup-providers.tf"
+      - "custom-fast-config/data/0-org-setup/defaults.yaml"
+    files_contain:
+      "custom-fast-config/data/0-org-setup/defaults.yaml":
+        - "billing_account: 012345-6789AB-CDEF01"
+        - "id: 1092874262642"
+        - "domain: custom-apis.domain"
+        - "prefix: cust"
+        - "primary: u-custom-region1"
+        - "gcp-organization-admins: principal://iam.googleapis.com/locations/global/workforcePools/my-pool/subject/my-user@custom.cloud.domain"
+      "custom-fast-config/providers/0-org-setup-providers.tf":
+        - "universe_domain"
+        - "custom-apis.domain"
+      "custom-fast-config/0-org-setup.auto.tfvars":
+        - "org_policies_imports"
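The `files_exist` and `files_contain` criteria in the playbook above are plain filesystem assertions over the temporary workspace. The following is a minimal sketch of how such checks could be evaluated, assuming workspace-relative paths and literal substring matching. `check_file_criteria` is a hypothetical helper written for illustration and is not taken from the harness.

```python
from pathlib import Path


def check_file_criteria(workspace, criteria):
  """Return failure messages; an empty list means every check passed.

  Hypothetical helper: assumes paths are relative to `workspace` and that
  `files_contain` entries are matched as literal substrings.
  """
  root = Path(workspace)
  failures = []
  for rel_path in criteria.get('files_exist', []):
    # is_file() follows symlinks, so a symlinked tfvars file still counts.
    if not (root / rel_path).is_file():
      failures.append(f'missing file: {rel_path}')
  for rel_path, needles in criteria.get('files_contain', {}).items():
    target = root / rel_path
    if not target.is_file():
      failures.append(f'missing file: {rel_path}')
      continue
    text = target.read_text()
    for needle in needles:
      if needle not in text:
        failures.append(f'{rel_path} does not contain: {needle}')
  return failures
```

Returning a list of messages rather than a boolean keeps every unmet expectation visible in a single run, which is useful when a long autonomous conversation fails on several files at once.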
--- a/tools/skill-turn-harness/playbooks/fast/prerequisites/gcp-dev-autonomous.yaml
+++ b/tools/skill-turn-harness/playbooks/fast/prerequisites/gcp-dev-autonomous.yaml
@@ -0,0 +1,62 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbook.schema.json
+name: "FAST 0-org-setup Prereqs - Standard GCP Developer Flow (Autonomous)"
+timeout: 300
+agent_model: "gemini-3.5-flash"
+evaluator_model: "gemini-3.5-flash"
+tmpdir:
+  link_paths:
+    - fast
+    - modules
+    - tools
+    - skills
+    - .yamllint
+    - GEMINI.md
+    - AGENTS.md
+    - README.md
+    - FACTORIES.md
+env:
+  - GROUP
+persona:
+  initial_user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
+  context: >
+    You are a GCP developer setting up FAST.
+    Your target environment is Standard GCP.
+    You prefer the agent to execute commands automatically.
+    You are already authenticated with Google Cloud and your current identity is correct.
+    You want to use a Group for the Admin Principal (Approach A).
+    The group email is ${GROUP}. You confirm you are a member of this group.
+    When asked for Organization ID, provide the keyword "fast-test" to search. When the list is presented, select the option that corresponds to "01".
+    When asked for Billing Account ID, provide the keyword "fast" to search. When the list is presented, select the option for the "TI billing account".
+    Your access level to the billing account is Scenario 2 (Billing User).
+    You have a pre-existing project to use as the bootstrap project, and it is already set as the active project in gcloud. Confirm it is the correct project.
+    When asked about checking services, you want the agent to check which services are enabled.
+    You approve the IAM role assignments.
+    You want to use the 'classic' dataset.
+    Your base location is europe-west1, with no overrides.
+    Your local path for output files is custom-fast-config.
+    You do not have any additional static context values.
+    Do not volunteer information unless explicitly asked. Answer only the question asked by the agent.
+  max_turns: 30
+  success_criteria:
+    llm_checks:
+      - "The agent explicitly provided the final wrap-up instructions containing the commands 'terraform init' and 'terraform apply'."
+    tool_calls_contain:
+      run_shell_command:
+        - "gcloud organizations add-iam-policy-binding"
+    files_exist:
+      - "custom-fast-config/0-org-setup.auto.tfvars"
+      - "custom-fast-config/data/0-org-setup/defaults.yaml"
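The `tool_calls_contain` criterion above maps a tool name to strings that must appear in that tool's arguments. The sketch below shows one possible interpretation, assuming each expected string only needs to appear in at least one recorded call of the named tool. The `(tool_name, arguments_text)` record shape and the `check_tool_calls` name are assumptions made for illustration; the harness may record and match tool calls differently.

```python
def check_tool_calls(recorded_calls, expected):
  """Return failure messages for unmet `tool_calls_contain` expectations.

  Hypothetical helper: `recorded_calls` is a list of
  (tool_name, arguments_text) pairs, and each expected string must occur
  in the arguments of at least one call to the named tool.
  """
  failures = []
  for tool_name, needles in expected.items():
    args_seen = [args for name, args in recorded_calls if name == tool_name]
    for needle in needles:
      if not any(needle in args for args in args_seen):
        failures.append(f'{tool_name}: no call contains {needle!r}')
  return failures
```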
--- a/tools/skill-turn-harness/playbooks/fast/prerequisites/gcp-dev.yaml
+++ b/tools/skill-turn-harness/playbooks/fast/prerequisites/gcp-dev.yaml
@@ -0,0 +1,97 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbook.schema.json
+tmpdir:
+  link_paths:
+    - fast
+    - modules
+    - tools
+    - skills
+    - .yamllint
+    - GEMINI.md
+    - AGENTS.md
+    - README.md
+    - FACTORIES.md
+env:
+  - BILLING_KEYWORD
+  - GROUP
+  - ORG_KEYWORD
+name: "FAST 0-org-setup Prereqs - Standard GCP Developer Flow"
+steps:
+  - user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
+    expected_outcome: "The agent should confirm the guide's activation and ask the user about their target environment (e.g., Standard GCP vs GCD)."
+
+  - user_input: "Standard GCP"
+    expected_outcome: "The agent should acknowledge the environment and ask for the user's preference on how to execute commands (e.g., automatically vs manually)."
+
+  - user_input: "Automatically"
+    expected_outcome: "The agent should acknowledge the execution preference and ask about the user's current Google Cloud authentication status."
+
+  - user_input: "Yes, I am already authenticated."
+    expected_outcome: "The agent should verify the current authenticated principal (e.g., using gcloud) and ask the user to confirm if it is the correct identity."
+
+  - user_input: "Yes, that is the correct identity."
+    expected_outcome: "The agent should move to the Admin Principal step and ask the user to choose an approach (e.g., Group vs Single User)."
+
+  - user_input: "Approach A, please."
+    expected_outcome: "The agent should ask for the group email address."
+
+  - user_input: "The group is ${GROUP}."
+    expected_outcome: "The agent should explicitly ask the user to confirm that their current identity is already a member of this group."
+
+  - user_input: "Yes, I confirm I am a member."
+    expected_outcome: "The agent should ask the user to provide their Organization ID, offering to list them automatically."
+
+  - user_input: "${ORG_KEYWORD}"
+    expected_outcome: "The agent should list the matching organizations and ask the user to select one."
+
+  - user_input: "1"
+    expected_outcome: "The agent should acknowledge the selected Organization and ask the user to provide their Billing Account ID, offering to list them automatically."
+
+  - user_input: "${BILLING_KEYWORD}"
+    expected_outcome: "The agent should list the matching billing accounts and ask the user to select one."
+
+  - user_input: "1"
+    expected_outcome: "The agent should ask the user about their access level to the selected Billing Account (e.g., scenarios 1, 2, or 3)."
+
+  - user_input: "Scenario 2 (Billing User)"
+    expected_outcome: "The agent should note the limitations of this access level (no billing.admin role assigned) and propose the IAM role assignments to be made."
+
+  - user_input: "Looks good, go ahead and assign the roles."
+    expected_outcome: "The agent should execute the IAM role assignments and then ask if the user has a pre-existing project to use as the bootstrap project."
+
+  - user_input: "Yes, I have a pre-existing project."
+    expected_outcome: "The agent should ask if the pre-existing project is already set as the active project in gcloud."
+
+  - user_input: "Yes, it is."
+    expected_outcome: "The agent should fetch the current active Project ID, ask for confirmation, and offer to check/enable required APIs."
+
+  - user_input: "Yes, that's the correct project. Yes, please check which services are enabled."
+    expected_outcome: "The agent should check and enable necessary APIs, and then ask the user to select a configuration Dataset."
+
+  - user_input: "I'll use the classic dataset."
+    expected_outcome: "The agent should ask the user for a base location for the resources and if there are any overrides."
+
+  - user_input: "europe-west1, no overrides."
+    expected_outcome: "The agent should propose a local path for the output files and ask for confirmation."
+
+  - user_input: "~/custom-fast-config"
+    expected_outcome: "The agent should ask if the user wants to provide any additional static context values."
+
+  - user_input: "No additional context."
+    expected_outcome: "The agent should scaffold the local files (copying defaults, creating tfvars), validate them (e.g., yamllint), and then initiate the Organization Policy Import Check."
+
+  - user_input: "Okay."
+    expected_outcome: "The agent should process any existing org policies and provide the final wrap-up instructions for applying the Terraform."
--- a/tools/skill-turn-harness/playbooks/playbook.schema.json
+++ b/tools/skill-turn-harness/playbooks/playbook.schema.json
@@ -0,0 +1,168 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "type": "object",
+  "properties": {
+    "name": {
+      "type": "string",
+      "description": "The name of the playbook."
+    },
+    "agent_model": {
+      "type": "string",
+      "description": "The model the Gemini CLI should use (e.g., gemini-2.5-pro)."
+    },
+    "evaluator_model": {
+      "type": "string",
+      "description": "The model the test harness uses to grade the test and simulate the user (e.g., gemini-2.5-flash)."
+    },
+    "tmpdir": {
+      "type": "object",
+      "description": "Configuration for running in a temporary isolated workspace with optional symlinks.",
+      "properties": {
+        "link_paths": {
+          "type": "array",
+          "description": "Relative paths to symlink from the host repository (current CWD) into the temporary workspace.",
+          "items": {
+            "type": "string"
+          }
+        }
+      },
+      "additionalProperties": false
+    },
+    "timeout": {
+      "type": "integer",
+      "description": "Timeout in seconds for each CLI invocation.",
+      "default": 60,
+      "minimum": 1
+    },
+    "env": {
+      "type": "array",
+      "description": "A list of environment variable names required by this playbook. These will be substituted in the steps and persona context.",
+      "items": {
+        "type": "string"
+      }
+    },
+    "steps": {
+      "type": "array",
+      "description": "The deterministic sequence of interactions between the user and the agent.",
+      "items": {
+        "type": "object",
+        "properties": {
+          "user_input": {
+            "type": "string",
+            "description": "The simulated input provided by the user."
+          },
+          "expected_outcome": {
+            "type": "string",
+            "description": "The expected response or behavior from the agent to evaluate against."
+          }
+        },
+        "required": [
+          "user_input",
+          "expected_outcome"
+        ],
+        "additionalProperties": false
+      }
+    },
+    "persona": {
+      "type": "object",
+      "description": "Configuration for the autonomous LLM-simulated user.",
+      "properties": {
+        "initial_user_input": {
+          "type": "string",
+          "description": "The first input to send to the agent when starting in pure autonomous mode. Variables are interpolated."
+        },
+        "context": {
+          "type": "string",
+          "description": "Freeform instructions and knowledge base for the simulated user. Variables are interpolated."
+        },
+        "max_turns": {
+          "type": "integer",
+          "description": "The maximum number of conversation turns allowed in autonomous mode before forcing a failure.",
+          "default": 10,
+          "minimum": 1
+        },
+        "success_criteria": {
+          "type": "object",
+          "description": "The conditions that must be met for the autonomous flow to be considered complete and successful.",
+          "properties": {
+            "llm_checks": {
+              "type": "array",
+              "description": "Semantic checks evaluated by the LLM (e.g., 'The agent printed a final configuration summary').",
+              "items": {
+                "type": "string"
+              }
+            },
+            "flow_contains": {
+              "type": "array",
+              "description": "Literal strings that must appear somewhere in the combined CLI stdout.",
+              "items": {
+                "type": "string"
+              }
+            },
+            "files_exist": {
+              "type": "array",
+              "description": "A list of file paths (relative to the workspace) that must exist.",
+              "items": {
+                "type": "string"
+              }
+            },
+            "files_contain": {
+              "type": "object",
+              "description": "A mapping of file paths to a list of strings that must be found within them.",
+              "patternProperties": {
+                ".*": {
+                  "type": "array",
+                  "items": {
+                    "type": "string"
+                  }
+                }
+              }
+            },
+            "tool_calls_contain": {
+              "type": "object",
+              "description": "A mapping of tool names to a list of strings that must be found within their arguments.",
+              "patternProperties": {
+                ".*": {
+                  "type": "array",
+                  "items": {
+                    "type": "string"
+                  }
+                }
+              }
+            }
+          },
+          "additionalProperties": false
+        }
+      },
+      "required": [
+        "context",
+        "success_criteria"
+      ],
+      "additionalProperties": false
+    }
+  },
+  "required": [
+    "name"
+  ],
+  "anyOf": [
+    {
+      "required": ["steps"]
+    },
+    {
+      "required": ["persona"]
+    }
+  ],
+  "if": {
+    "not": {
+      "required": ["steps"]
+    }
+  },
+  "then": {
+    "properties": {
+      "persona": {
+        "required": ["context", "success_criteria", "initial_user_input"]
+      }
+    }
+  },
+  "additionalProperties": false
+}
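The top-level rules of the schema above can be hard to read off the `anyOf` / `if` / `then` keywords: `name` is always required, at least one of `steps` or `persona` must be present, and `persona.initial_user_input` becomes mandatory only when `steps` is absent. The sketch below restates just those rules in plain Python. `playbook_shape_errors` is a hypothetical, hand-rolled mirror for illustration; it does not replace validating against the schema itself and ignores types and `additionalProperties`.

```python
def playbook_shape_errors(playbook):
  """Return messages for violated top-level playbook rules (sketch only)."""
  errors = []
  if 'name' not in playbook:
    errors.append('missing required key: name')
  has_steps = 'steps' in playbook
  has_persona = 'persona' in playbook
  if not (has_steps or has_persona):
    errors.append('need at least one of: steps, persona')
  if has_persona:
    # A persona always needs context and success_criteria.
    required = ['context', 'success_criteria']
    if not has_steps:
      # Pure autonomous mode: nothing else supplies the first user turn.
      required.append('initial_user_input')
    for key in required:
      if key not in playbook['persona']:
        errors.append(f'persona is missing required key: {key}')
  return errors
```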
--- a/tools/skill-turn-harness/pytest.ini
+++ b/tools/skill-turn-harness/pytest.ini
@@ -0,0 +1,6 @@
+[pytest]
+# Ensures the root directory is on the path so tests can import 'harness'.
+# This allows the tool and its tests to be fully portable.
+pythonpath = .
+markers =
+    e2e: mark a test as an end-to-end test that requires external APIs/CLI (deselect with '-m "not e2e"')
--- a/tools/skill-turn-harness/requirements.txt
+++ b/tools/skill-turn-harness/requirements.txt
@@ -0,0 +1,6 @@
+google-genai
+pydantic
+pyyaml
+click
+jsonschema
+google-antigravity
--- a/tools/skill-turn-harness/test/fixtures/.env.test
+++ b/tools/skill-turn-harness/test/fixtures/.env.test
@@ -0,0 +1,2 @@
+MY_SECRET_ID=dummy-secret-12345
+ANOTHER_VAR=europe-west1
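The unit test for `load_env_file` in this change expects `#` comment lines to be skipped and values to be split on the first `=` only, so that `BAZ=qux=123` yields `qux=123`. A minimal sketch of parsing with those two properties follows. `parse_env_text` is a hypothetical function that returns a dict instead of writing to `os.environ`, and its handling of blank or malformed lines is an assumption.

```python
def parse_env_text(text):
  """Parse KEY=VALUE lines into a dict (sketch, not the harness code)."""
  values = {}
  for raw_line in text.splitlines():
    line = raw_line.strip()
    # Skip blanks, comments, and lines without an assignment (assumption).
    if not line or line.startswith('#') or '=' not in line:
      continue
    # partition() splits on the first '=' only, preserving later ones.
    key, _, value = line.partition('=')
    values[key.strip()] = value.strip()
  return values
```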
--- a/tools/skill-turn-harness/test/fixtures/mock-conversation-skill/SKILL.md
+++ b/tools/skill-turn-harness/test/fixtures/mock-conversation-skill/SKILL.md
@@ -0,0 +1,25 @@
+---
+name: fast-setup-poc
+description: 'A wizard to help users configure FAST (Fabric Architecture Setup Tool) step-by-step. Use this skill when asked to configure FAST, run the FAST wizard, or setup FAST.'
+---
+
+# FAST Setup Wizard
+
+## Instructions
+You are the FAST Setup Wizard. Your goal is to collect exactly 3 pieces of information from the user in this exact order:
+1. Google Cloud Project ID
+2. Preferred Region
+3. Billing Account ID
+
+Rules:
+- Ask for exactly ONE piece of information at a time. Do not ask for the next piece until the user has provided the current one.
+- Keep your responses extremely brief. Acknowledge the received information and ask the next question.
+- For the Region, validate the user's input against the [supported regions](./references/extra_content.md). If invalid, ask again.
+- Once all three pieces of information are collected, provide a final summary of the configuration.
+- Do not execute any commands or write any files. Just collect the information and print the summary.
+
+Example Workflow:
+Wizard: "Hi, let's configure FAST. Please provide your Google Cloud Project ID."
+User: "my-project-123"
+Wizard: "Got it (my-project-123). Next, what is your preferred Region?"
+...and so on.
--- a/tools/skill-turn-harness/test/fixtures/mock-conversation-skill/references/extra_content.md
+++ b/tools/skill-turn-harness/test/fixtures/mock-conversation-skill/references/extra_content.md
@@ -0,0 +1,8 @@
+# Supported Regions
+
+For the FAST Setup Wizard, only the following regions are officially supported in this PoC:
+- europe-west1
+- us-central1
+- asia-northeast1
+
+Do not accept regions outside of this list.
--- a/tools/skill-turn-harness/test/fixtures/mock-tool-use-skill/SKILL.md
+++ b/tools/skill-turn-harness/test/fixtures/mock-tool-use-skill/SKILL.md
@@ -0,0 +1,10 @@
+---
+name: tool-test-skill
+description: 'A simple skill to test if the agent correctly executes tools.'
+---
+
+# Instructions
+You are a simple file-creating agent. When the user asks you to create a file, you MUST use the `write_file` tool to create a file named `output.txt` in the current directory.
+The content of the file must be exactly: `Hello World`
+
+Once you have successfully executed the tool, tell the user that the file has been created.
--- a/tools/skill-turn-harness/test/fixtures/playbook_autonomous_conversation.yaml
+++ b/tools/skill-turn-harness/test/fixtures/playbook_autonomous_conversation.yaml
@@ -0,0 +1,32 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbooks/playbook.schema.json
+name: "FAST Setup PoC - Autonomous"
+env:
+  - MY_SECRET_ID
+persona:
+  initial_user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
+  context: >
+    You are a GCP developer setting up FAST.
+    Your Project ID is ${MY_SECRET_ID} and your region is europe-west1.
+    Your Billing Account ID is 012345-6789AB-CDEF01.
+    Do not volunteer information unless explicitly asked.
+    If the agent asks for something not in your knowledge base, say you don't know.
+  max_turns: 8
+  success_criteria:
+    llm_checks:
+      - "The agent provided a final summary containing the correct project ID (${MY_SECRET_ID}), region (europe-west1), and billing account."
+    flow_contains:
+      - "012345-6789AB-CDEF01"
--- a/tools/skill-turn-harness/test/fixtures/playbook_autonomous_tool_use.yaml
+++ b/tools/skill-turn-harness/test/fixtures/playbook_autonomous_tool_use.yaml
@@ -0,0 +1,32 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbooks/playbook.schema.json
+name: "Tool Call Test Autonomous"
+timeout: 30
+persona:
+  initial_user_input: "Hi, please activate the tool-test-skill and create the file as instructed."
+  context: |
+    You are a user asking the agent to create a file.
+    Wait for the agent to confirm the file has been created.
+  max_turns: 3
+  success_criteria:
+    llm_checks:
+      - "The agent confirmed the file was created."
+    tool_calls_contain:
+      write_file:
+        - "output.txt"
+        - "Hello World"
+    files_exist:
+      - "output.txt"
--- a/tools/skill-turn-harness/test/fixtures/playbook_scripted_conversation.yaml
+++ b/tools/skill-turn-harness/test/fixtures/playbook_scripted_conversation.yaml
@@ -0,0 +1,28 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbooks/playbook.schema.json
+name: "FAST Setup PoC"
+steps:
+  - user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
+    expected_outcome: "The agent should greet the user, confirm it is the FAST Setup Wizard, and ask for the Google Cloud Project ID."
+
+  - user_input: "my-super-project-123"
+    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
+
+  - user_input: "europe-west1"
+    expected_outcome: "The agent should acknowledge the Region and ask for the Billing Account ID."
+
+  - user_input: "012345-6789AB-CDEF01"
+    expected_outcome: "The agent should acknowledge the Billing Account ID and provide a final summary of the configuration containing all three pieces of information."
--- a/tools/skill-turn-harness/test/fixtures/playbook_scripted_env_substitution.yaml
+++ b/tools/skill-turn-harness/test/fixtures/playbook_scripted_env_substitution.yaml
@@ -0,0 +1,30 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=../../playbooks/playbook.schema.json
+name: "FAST Setup PoC with Env"
+env:
+  - MY_SECRET_ID
+steps:
+  - user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
+    expected_outcome: "The agent should greet the user, confirm it is the FAST Setup Wizard, and ask for the Google Cloud Project ID."
+
+  - user_input: "${MY_SECRET_ID}"
+    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
+
+  - user_input: "europe-west1"
+    expected_outcome: "The agent should acknowledge the Region and ask for the Billing Account ID."
+
+  - user_input: "012345-6789AB-CDEF01"
+    expected_outcome: "The agent should acknowledge the Billing Account ID and provide a final summary of the configuration containing all three pieces of information."
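The fixture above declares `MY_SECRET_ID` under `env` and then uses `${MY_SECRET_ID}` as a user input. The sketch below shows one way the declare-then-substitute flow could work, using the standard library's `string.Template`. `required_env` and `substitute` are hypothetical names; the error text mirrors the message asserted in the harness tests, and leaving undeclared placeholders untouched is an assumption, not confirmed harness behavior.

```python
from string import Template


def required_env(names, environ):
  """Return the declared variables, failing fast if any are missing."""
  missing = [name for name in names if name not in environ]
  if missing:
    raise ValueError('Missing required environment variables: ' +
                     ', '.join(missing))
  return {name: environ[name] for name in names}


def substitute(text, values):
  """Replace ${NAME} placeholders with declared values."""
  # safe_substitute leaves unknown placeholders as-is instead of raising.
  return Template(text).safe_substitute(values)
```

Passing only the declared variables to the substitution step, rather than the whole process environment, limits what a playbook string can pull into prompts and logs.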
--- a/tools/skill-turn-harness/test/test_harness.py
+++ b/tools/skill-turn-harness/test/test_harness.py
@@ -0,0 +1,311 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import asyncio
+import json
+import os
+import subprocess
+from dataclasses import asdict
+from unittest.mock import patch, MagicMock, AsyncMock, PropertyMock
+import pytest
+
+import harness
+
+# --- Phase A: Data & Logging Unit Tests ---
+
+
+def test_parse_and_validate_env(monkeypatch):
+  playbook = {'env': ['TEST_KEY']}
+
+  # Missing key raises error
+  with pytest.raises(ValueError,
+                     match='Missing required environment variables: TEST_KEY'):
+    harness.parse_and_validate_env(playbook)
+
+  # Present key succeeds
+  monkeypatch.setenv('TEST_KEY', '123')
+  result = harness.parse_and_validate_env(playbook)
+  assert result['TEST_KEY'] == '123'
+
+
+def test_step_data_serialization():
+  step = harness.StepData(
+      step_index=0,
+      user_input='hello',
+      expected_outcome='greet back',
+      skill_response='hi',
+      parsed_eval={
+          'passed': True,
+          'reasoning': 'ok'
+      },
+      is_system_error=False,
+  )
+  d = asdict(step)
+  assert d['step_index'] == 0
+  assert d['user_input'] == 'hello'
+  assert d['expected_outcome'] == 'greet back'
+  assert d['parsed_eval']['passed'] is True
+
+
+def test_load_env_file(tmp_path):
+  env_file = tmp_path / '.env'
+  env_file.write_text('FOO=bar\n# comment\nBAZ=qux=123\n')
+
+  harness.load_env_file(str(env_file))
+  assert os.environ.get('FOO') == 'bar'
+  assert os.environ.get('BAZ') == 'qux=123'
+
+
+def test_markdown_logging(tmp_path):
+  log_file = tmp_path / 'test_log.md'
+  harness.init_markdown_log(str(log_file), 'Test Playbook')
+  harness.log_step_to_markdown(
+      md_log_path=str(log_file),
+      step_index=0,
+      user_input='input 1',
+      expected_outcome='outcome 1',
+      skill_response='response 1',
+      parsed_eval={
+          'passed': True,
+          'reasoning': 'Good job'
+      },
+  )
+  content = log_file.read_text()
+  assert '# Interaction Log: Test Playbook' in content
+  assert '## Step 1' in content
+  assert '**User:**\n\ninput 1' in content
+  assert '**Expected Outcome:**\n\noutcome 1' in content
+  assert '**Agent:**\n\nresponse 1' in content
+  assert '✅ PASS: Good job' in content
+
+
+def test_dump_failed_log(tmp_path):
+  interaction_log = [{'step': 1, 'error': 'test'}]
+  harness.dump_failed_log(str(tmp_path), 'test-playbook-prefix',
+                          interaction_log)
+  failed_file = tmp_path / 'test-playbook-prefix_failed.json'
+  assert failed_file.exists()
+  data = json.loads(failed_file.read_text())
+  assert len(data) == 1
+  assert data[0]['error'] == 'test'
+
+
+# --- Phase B: Execution Unit Tests (Mocked) ---
+
+
+@patch('harness.genai.Client')
+@patch('harness.Agent')
+def test_run_hybrid_tuning_loop_mocked_success(mock_agent_class,
+                                               mock_client_class, tmp_path):
+  # Mock Conversation
+  mock_conversation = MagicMock()
+  mock_conversation.send = AsyncMock()
+
+  async def mock_receive_steps():
+    yield harness.agy_types.Step(type=harness.agy_types.StepType.TEXT_RESPONSE,
+                                 source=harness.agy_types.StepSource.MODEL,
+                                 target=harness.agy_types.StepTarget.USER,
+                                 status=harness.agy_types.StepStatus.DONE,
+                                 content="Mocked Agent Response")
+
+  mock_conversation.receive_steps.return_value = mock_receive_steps()
+  type(mock_conversation).last_response = PropertyMock(
+      return_value="Mocked Agent Response")
+
+  # Mock Agent
+  mock_agent = MagicMock()
+  mock_agent.conversation = mock_conversation
+  mock_agent_class.return_value.__aenter__.return_value = mock_agent
+
+  # Mock Evaluator
+  mock_eval_client = MagicMock()
+  mock_client_class.return_value = mock_eval_client
+  mock_eval_response = MagicMock()
+  mock_eval_response.text = '{"passed": true, "reasoning": "Mocked pass"}'
+  mock_eval_client.models.generate_content.return_value = mock_eval_response
+
+  # Playbook
+  playbook_content = """
+name: "Mocked Playbook"
+steps:
+  - user_input: "Hello"
+    expected_outcome: "Greet"
+"""
+  playbook_file = tmp_path / "playbook.yaml"
+  playbook_file.write_text(playbook_content)
+
+  # Run the async harness loop to completion.
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
+
+  assert result is True
+  mock_conversation.send.assert_called_once_with("Hello")
+  mock_eval_client.models.generate_content.assert_called_once()
+
+
+@patch('harness.genai.Client')
+@patch('harness.Agent')
+def test_run_hybrid_tuning_loop_mocked_timeout(mock_agent_class,
+                                               mock_client_class, tmp_path):
+  # Mock genai.Client
+  mock_client_class.return_value = MagicMock()
+  # Mock Conversation whose send() times out
+  mock_conversation = MagicMock()
+  mock_conversation.send = AsyncMock(side_effect=asyncio.TimeoutError())
+
+  async def empty_gen():
+    if False:
+      yield
+
+  mock_conversation.receive_steps.return_value = empty_gen()
+
+  mock_agent = MagicMock()
+  mock_agent.conversation = mock_conversation
+  mock_agent_class.return_value.__aenter__.return_value = mock_agent
+
+  # Playbook
+  playbook_content = """
+name: "Mocked Playbook"
+steps:
+  - user_input: "Hello"
+    expected_outcome: "Greet"
+"""
+  playbook_file = tmp_path / "playbook.yaml"
+  playbook_file.write_text(playbook_content)
+
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
+
+  assert result is False
+  mock_conversation.send.assert_called_once_with("Hello")
+
+  log_files = list(tmp_path.glob('*_log.md'))
+  assert len(log_files) == 1
+  content = log_files[0].read_text()
+  assert 'SYSTEM_ERROR: Timeout' in content
+
+
+# --- Phase C: E2E Test ---
+
+
+@pytest.mark.e2e
+def test_e2e_hybrid_tuning_loop(tmp_path):
+  '''
+  Runs the actual evaluation loop against the basic FAST Setup PoC skill.
+  Uses tmp_path for log_dir so we don't pollute the actual workspace logs.
+    '''
+  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
+  skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
+  playbook_path = os.path.join(fixtures_dir,
+                               'playbook_scripted_env_substitution.yaml')
+  env_file_path = os.path.join(fixtures_dir, '.env.test')
+
+  # Load env to prime the os.environ
+  harness.load_env_file(env_file_path)
+
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
+                                     skill_src=skill_dir))
+  assert result is True
+  # Verify the log file was created in the temporary directory
+  log_files = list(tmp_path.glob('*_log.md'))
+  assert len(log_files) == 1
+  log_file = log_files[0]
+  assert log_file.exists()
+  content = log_file.read_text()
+  assert '✅ PASS' in content
+  # Verify the ${MY_SECRET_ID} placeholder was replaced by its environment value
+  assert 'dummy-secret-12345' in content
+  assert '${MY_SECRET_ID}' not in content
+
+
+@pytest.mark.e2e
+def test_e2e_autonomous_tuning_loop(tmp_path):
+  '''
+  Runs the autonomous evaluation loop against the basic FAST Setup PoC skill.
+  '''
+  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
+  skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
+  playbook_path = os.path.join(fixtures_dir,
+                               'playbook_autonomous_conversation.yaml')
+  env_file_path = os.path.join(fixtures_dir, '.env.test')
+
+  harness.load_env_file(env_file_path)
+
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
+                                     skill_src=skill_dir))
+  assert result is True
+  log_files = list(tmp_path.glob('*_log.md'))
+  assert len(log_files) == 1
+  content = log_files[0].read_text()
+
+  # Check that the autonomous turns were logged
+  assert '## Autonomous Turn 1' in content
+  assert 'dummy-secret-12345' in content
+
+
+@pytest.mark.e2e
+def test_e2e_tool_calls_contain(tmp_path):
+  '''
+  Runs an autonomous evaluation loop to verify the deterministic
+  tool_calls_contain checks.
+  '''
+  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
+  skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
+  playbook_path = os.path.join(fixtures_dir,
+                               'playbook_autonomous_tool_use.yaml')
+
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
+                                     skill_src=skill_dir))
+
+  assert result is True
+  # Verify that the session JSON was saved
+  session_files = list(tmp_path.glob('*_session.json'))
+  assert len(session_files) == 1
+  assert session_files[0].exists()
+
+
+@pytest.mark.e2e
+def test_e2e_working_dir(tmp_path):
+  '''
+  Runs an evaluation loop to verify working_dir functionality.
+  '''
+  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
+  skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
+
+  # Create a specific subdirectory in tmp_path
+  workdir_target = tmp_path / "workdir_target"
+  workdir_target.mkdir()
+
+  # Dynamically create a playbook YAML file
+  playbook_content = f"""# yaml-language-server: $schema=../../playbooks/playbook.schema.json
+name: "Tool Test with Workdir"
+working_dir: "{workdir_target.resolve()}"
+steps:
+  - user_input: "Hi, please activate tool-test-skill and create the file output.txt."
+    expected_outcome: "The agent confirms it has created the file."
+"""
+  playbook_path = tmp_path / "playbook_workdir.yaml"
+  playbook_path.write_text(playbook_content)
+
+  result = asyncio.run(
+      harness.run_hybrid_tuning_loop(str(playbook_path), log_dir=str(tmp_path),
+                                     skill_src=skill_dir))
+
+  assert result is True
+  # Verify that output.txt was created INSIDE workdir_target
+  output_file = workdir_target / "output.txt"
+  assert output_file.exists()
+  assert output_file.read_text().strip() == "Hello World"