
Ludovico Magnocavallo 81f72e8068 Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979 )

* initial version of a FAST pre-install skill

* first round of testing

* Update fast-0-org-setup-prereqs skill with improved UX and local path handling

- Add explicit lockout warning and stop condition if the user is not a member of the provided Admin Principal group.
- Streamline bootstrap project selection to only prompt for an override if the active gcloud project is rejected.
- Restrict dataset discovery strictly to the `fast/stages/0-org-setup/datasets/` directory.
- Improve location handling by referencing `defaults.schema.json` for Standard GCP and auto-configuring fixed regions for GCD.
- Add comprehensive `local_path` management: prompt for customization, create directories, move `defaults.yaml` to the local data folder, and symlink `0-org-setup.auto.tfvars` back to the stage directory.

* add testing scenarios, implement initial changes for scenario 2

* move skills

* move to a skills/fast subfolder

* Refactor fast-0-org-setup prereqs skill

* Add skill-turn-harness utility tool

* Use relative markdown links for skill references

* Use descriptive titles for markdown links in skill references

* Add descriptions to each phase in the prerequisites workflow map

* Use backslash for markdown line breaks in skill map

* Update README security warning to mention default .gitignore

* shebang

* Update fast prereqs skill rules to force sequential question flow and refine harness tool with proper ctrl+c handling and slugified log paths

* Move playbook-gcp-dev.yaml to fast/prerequisites/gcp-dev.yaml and update fast prerequisites

* docs(skill-turn-harness): detail autonomous pond testing approach

* docs(skill-turn-harness): add final_state_checks to pond architecture and update toc

* Refine fast prereqs SKILL and gcp-dev playbook to strictly align with one-question-at-a-time rule

* feat(skill-turn-harness): update playbook schema for autonomous persona mode

* feat(skill-turn-harness): implement autonomous persona testing mode and fallback logic

* docs(skill-turn-harness): document the three modes of testing and update ToC

* implement timeout, schema validation, configurable cli

* chore: remove accidentally committed log files

* chore: ignore logs directory

* feat(skill-harness): implement tool execution interception, configurable workspace, and modularized validation

* feat(skill-harness): add model configuration and update README

* fix(skill-harness): automatically inject -y flag to gemini commands

* docs(skill-harness): add TODO.md with analysis for skill environment dependencies

* feat(skill-harness): add working_dir support and clean up fixtures

- Implement working_dir in harness to run tests in specific directories.
- Rename test fixtures and playbooks to be more descriptive.
- Add E2E test for working_dir.
- Apply code quality improvements to harness.py (imports, linting).
- Update README with working directory considerations and usage notes.
- Update phase3-bootstrap-and-iam.md skill doc to add execution rule against creating temp scripts.

* fix: capture customer_id and respect relative paths

* Implement isolated temp workspace sandboxing with symlinks in test harness

* Configure GCD manual autonomous playbook and align Phase 3/4 steps order

* Fix linting and schema tests failures

- Add missing license headers to tools/skill-turn-harness files.

- Fix trailing spaces and newlines in playbooks.

- Ignore tools directory in schema tests workflow.

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Fix Python formatting with yapf

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Refactor skill-turn-harness to use Antigravity SDK

- Migrated harness from gemini-cli subprocesses to Antigravity SDK.
- Implemented real-time step streaming and console logging.
- Added color-coded terminal output (dark gray headers, blue inputs, pink outputs).
- Collapsed excessive newlines in streamed thoughts.
- Excluded harness codebase from workspace copy to prevent agent cheating.
- Enabled skills folder copy to resolve agent lookup loops.
- Added key validation and CLI --debug flag.

* Fix autonomous turn layout: print Turn ID before execution

- Moved the [Autonomous Turn X] header print to before running the agent turn.
- This groups the real-time thinking and tool calls under the correct Turn ID block, instead of displaying them before the label.

* Remove obsolete .log.md from prerequisites skill directory

2026-05-22 17:16:54 +00:00

5.3 KiB

Executable File


Architecture Document: Hybrid Python/CLI Test Harness

This document outlines the architecture for testing the Fabric FAST configuration skill. It uses a hybrid approach, executing the skill in its native CLI environment while maintaining deterministic control via a Python orchestration loop.

1. The Approach: Hybrid Isolation

To accurately test the skill in its target environment while ensuring the reliability of the test harness, the execution and evaluation layers are strictly separated:

Orchestrator (Python): A Python script acts as the absolute authority. It maintains the state machine, reads the playbook, injects inputs, captures outputs, and triggers evaluations.
Execution Target (Gemini CLI): The skill is run via the gemini CLI using Python's subprocess module. This ensures the test reflects the actual user environment. State is maintained across steps using the CLI's session management flags (e.g., --resume).
Evaluator (Gemini API): The semantic evaluation of the CLI's output is performed via direct API calls to Gemini 1.5 Flash. This avoids the unreliability of parsing free-form CLI text, because the API can be asked for JSON output that conforms to a schema derived from a Pydantic model.
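The boundary between the orchestrator and the execution target can be sketched as a thin subprocess wrapper that returns a structured result instead of raw text. This is an illustration only: `CliResult` and `run_cli` are hypothetical names that do not appear in the harness, and the example uses the Python interpreter as a stand-in command so that it runs without the gemini CLI installed.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class CliResult:
    """Outcome of one CLI invocation, kept separate from any evaluation logic."""
    stdout: str
    stderr: str
    returncode: Optional[int]  # None when the process was killed on timeout
    timed_out: bool = False

    @property
    def ok(self) -> bool:
        return not self.timed_out and self.returncode == 0


def run_cli(command: list, timeout: float = 30.0) -> CliResult:
    """Run a command, capture its output, and turn a hang into a normal result."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return CliResult(stdout="", stderr="timeout", returncode=None, timed_out=True)
    return CliResult(proc.stdout.strip(), proc.stderr.strip(), proc.returncode)


if __name__ == "__main__":
    # Stand-in for the real CLI: the current Python interpreter.
    print(run_cli([sys.executable, "-c", "print('hello')"]))
```

Returning a small result object rather than a sentinel string lets the orchestrator distinguish a crash, a timeout, and a genuine skill response without inspecting the text itself.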

2. The Execution Loop

The Python orchestrator executes the following rigid sequence for each step in a defined playbook:

Injection: Read the mocked user input and expected outcome from the playbook step.
Subprocess Execution: Invoke the Gemini CLI with the user input and the designated session_id. Capture stdout, check the exit code and stderr to detect crashes, and apply a timeout to detect hangs.
Prompt Assembly: Construct a strict evaluation prompt combining the exact playbook expectation with the raw string response captured from the CLI.
Stateless Evaluation: Call the Gemini API with the evaluation prompt, enforcing a structured output schema (Boolean Pass/Fail and Reasoning).
Verdict Enforcement: If the evaluator returns True, proceed to the next step. If False, immediately halt the loop, dump the interaction trace to a JSON file, and alert the developer.
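The control flow of this sequence can be sketched independently of the CLI and the API by passing both in as callables. This is a minimal illustration under stated assumptions, not the harness itself: `run_playbook`, `invoke_skill`, and `evaluate` are hypothetical names, and the fakes in the usage section replace the real subprocess and evaluator calls.

```python
from typing import Callable


def run_playbook(
    steps: list,
    invoke_skill: Callable[[str], str],
    evaluate: Callable[[str, str], tuple],
) -> tuple:
    """Run steps in order and stop at the first failed verdict.

    Returns (all_passed, trace). Writing the trace to disk is left to the caller.
    """
    trace = []
    for number, step in enumerate(steps, start=1):
        # Injection: mocked user input and expected outcome from the playbook
        user_input = step["user_input"]
        expected = step["expected_outcome"]
        # Subprocess execution is hidden behind invoke_skill
        response = invoke_skill(user_input)
        # Prompt assembly and stateless evaluation are hidden behind evaluate
        passed, reasoning = evaluate(expected, response)
        trace.append({
            "step": number,
            "input": user_input,
            "response": response,
            "passed": passed,
            "reasoning": reasoning,
        })
        # Verdict enforcement: halt immediately on the first failure
        if not passed:
            return False, trace
    return True, trace


if __name__ == "__main__":
    # Fakes stand in for the CLI and the evaluator API.
    demo_steps = [{"user_input": "hello", "expected_outcome": "hello"}]
    fake_skill = lambda text: f"echo: {text}"
    fake_eval = lambda expected, response: (expected in response, "substring check")
    print(run_playbook(demo_steps, fake_skill, fake_eval))
```

Keeping the loop free of I/O makes the halt-on-first-failure behaviour testable without network access or an installed CLI.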

3. Implementation Code

The following Python script implements the hybrid harness:

import json
import subprocess
import sys

from google import genai
from google.genai import types
from pydantic import BaseModel


# 1. Define the strict evaluator schema
class EvaluationResult(BaseModel):
    passed: bool
    reasoning: str


# The client picks up its API key from the environment
evaluator_client = genai.Client()


def invoke_skill_cli(user_input: str, session_id: str) -> str:
    # Requires the CLI to support a session resume flag for state
    command = ["gemini", "--resume", session_id, "-p", user_input]
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        print("⚠️ [CLI TIMEOUT]", file=sys.stderr)
        return "SYSTEM_ERROR: Timeout"
    if result.returncode != 0:
        print(f"⚠️ [CLI ERROR]: {result.stderr}", file=sys.stderr)
        return f"SYSTEM_ERROR: {result.stderr}"
    return result.stdout.strip()


def run_hybrid_tuning_loop(playbook_name: str, playbook_steps: list, session_id: str) -> bool:
    print(f"--- Tuning: {playbook_name} | Session: {session_id} ---")
    interaction_log = []

    def dump_trace() -> None:
        # Persist the interaction trace so the failure can be inspected later
        with open(f"{playbook_name}_failed.json", "w") as f:
            json.dump(interaction_log, f)

    for step_index, step in enumerate(playbook_steps):
        user_input, expected_outcome = step["user_input"], step["expected_outcome"]

        skill_response = invoke_skill_cli(user_input, session_id)
        if skill_response.startswith("SYSTEM_ERROR"):
            # A CLI crash or timeout is a failed run, not a success
            interaction_log.append({"step": step_index + 1, "input": user_input, "error": skill_response})
            dump_trace()
            return False

        eval_prompt = f"""
        OBJECTIVE: {expected_outcome}
        ACTUAL RESPONSE: {skill_response}
        Evaluate if the agent fulfilled the objective.
        """

        eval_response = evaluator_client.models.generate_content(
            model="gemini-1.5-flash",
            contents=eval_prompt,
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=EvaluationResult,
                temperature=0.0,
            ),
        )

        parsed_eval = json.loads(eval_response.text)
        interaction_log.append({"step": step_index + 1, "input": user_input, "evaluation": parsed_eval})

        if not parsed_eval["passed"]:
            print(f"❌ [FAILURE Step {step_index + 1}]: {parsed_eval['reasoning']}")
            dump_trace()
            return False

    print("✅ [SUCCESS]")
    return True
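The script reads `user_input` and `expected_outcome` from every playbook step, so a malformed step would only surface as a `KeyError` partway through a run. A pre-flight check along the following lines could catch that earlier. This is a sketch: `validate_playbook` is a hypothetical helper, and the example steps are illustrative wording, not taken from a real playbook.

```python
REQUIRED_KEYS = ("user_input", "expected_outcome")


def validate_playbook(steps: list) -> list:
    """Return human-readable problems; an empty list means the playbook is usable."""
    problems = []
    if not steps:
        problems.append("playbook has no steps")
    for number, step in enumerate(steps, start=1):
        if not isinstance(step, dict):
            problems.append(f"step {number}: expected a mapping, got {type(step).__name__}")
            continue
        for key in REQUIRED_KEYS:
            value = step.get(key)
            # Both fields must be non-empty strings for the loop to use them
            if not isinstance(value, str) or not value.strip():
                problems.append(f"step {number}: missing or empty '{key}'")
    return problems


# Illustrative steps only; the wording is not taken from a real playbook.
example_playbook = [
    {
        "user_input": "Start the prerequisites setup.",
        "expected_outcome": "The agent asks exactly one question.",
    },
    {
        "user_input": "Use my active project.",
        "expected_outcome": "The agent confirms the project or asks for an override.",
    },
]

if __name__ == "__main__":
    print(validate_playbook(example_playbook))
```

Running the check before the first CLI call keeps a typo in the playbook from consuming a session and an evaluator request.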

4. Critical Implementation Warnings

Session Data Persistence: The CLI likely persists session state to disk (e.g., in a local database or JSON file). If you reuse the same session_id for consecutive test runs without deleting the cached state, the skill inherits the context of the previous run, and steps that passed before can start failing for reasons unrelated to the skill. You must either generate a UUID for every run or build a cache-clearing mechanism into the Python script.
Context Window Discipline: The evaluation prompt is strictly limited to the current playbook objective and the immediate CLI response. Do not feed the entire CLI conversation history to the Evaluator API, as this significantly increases the risk of hallucinated grading.
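Both warnings can be addressed with two small helpers, sketched here under assumptions. `new_session_id` and `build_eval_prompt` are hypothetical names, and whether the CLI accepts a caller-chosen identifier for a new session is carried over from the script above rather than verified.

```python
import re
import uuid


def new_session_id(playbook_name: str) -> str:
    """Build a fresh identifier per run so no earlier context is inherited."""
    # Slugify the playbook name so the identifier is safe in file names and flags
    slug = re.sub(r"[^a-z0-9]+", "-", playbook_name.lower()).strip("-")
    return f"{slug or 'playbook'}-{uuid.uuid4().hex}"


def build_eval_prompt(expected_outcome: str, skill_response: str) -> str:
    """Assemble the evaluator prompt from the current step only.

    The function deliberately takes no history argument, so earlier turns
    cannot leak into the grading prompt.
    """
    return (
        f"OBJECTIVE: {expected_outcome}\n"
        f"ACTUAL RESPONSE: {skill_response}\n"
        "Evaluate if the agent fulfilled the objective."
    )


if __name__ == "__main__":
    print(new_session_id("GCP Dev"))
    print(build_eval_prompt("Ask one question.", "What is your organization ID?"))
```

Making the prompt builder's signature the only path to the evaluator is one way to enforce the context limit structurally instead of by convention.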
