Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979)

* initial version of a FAST pre-install skill

* first round of testing

* Update fast-0-org-setup-prereqs skill with improved UX and local path handling

- Add explicit lockout warning and stop condition if the user is not a member of the provided Admin Principal group.
- Streamline bootstrap project selection to only prompt for an override if the active gcloud project is rejected.
- Restrict dataset discovery strictly to the `fast/stages/0-org-setup/datasets/` directory.
- Improve location handling by referencing `defaults.schema.json` for Standard GCP and auto-configuring fixed regions for GCD.
- Add comprehensive `local_path` management: prompt for customization, create directories, move `defaults.yaml` to the local data folder, and symlink `0-org-setup.auto.tfvars` back to the stage directory.

* add testing scenarios, implement initial changes for scenario 2

* move skills

* move to a skills/fast subfolder

* Refactor fast-0-org-setup prereqs skill

* Add skill-turn-harness utility tool

* Use relative markdown links for skill references

* Use descriptive titles for markdown links in skill references

* Add descriptions to each phase in the prerequisites workflow map

* Use backslash for markdown line breaks in skill map

* Update README security warning to mention default .gitignore

* shebang

* Update fast prereqs skill rules to force sequential question flow and refine harness tool with proper ctrl+c handling and slugified log paths

* Move playbook-gcp-dev.yaml to fast/prerequisites/gcp-dev.yaml and update fast prerequisites

* docs(skill-turn-harness): detail autonomous pond testing approach

* docs(skill-turn-harness): add final_state_checks to pond architecture and update toc

* Refine fast prereqs SKILL and gcp-dev playbook to strictly align with one-question-at-a-time rule

* feat(skill-turn-harness): update playbook schema for autonomous persona mode

* feat(skill-turn-harness): implement autonomous persona testing mode and fallback logic

* docs(skill-turn-harness): document the three modes of testing and update ToC

* implement timeout, schema validation, configurable cli

* chore: remove accidentally committed log files

* chore: ignore logs directory

* feat(skill-harness): implement tool execution interception, configurable workspace, and modularized validation

* feat(skill-harness): add model configuration and update README

* fix(skill-harness): automatically inject -y flag to gemini commands

* docs(skill-harness): add TODO.md with analysis for skill environment dependencies

* feat(skill-harness): add working_dir support and clean up fixtures

- Implement working_dir in harness to run tests in specific directories.
- Rename test fixtures and playbooks to be more descriptive.
- Add E2E test for working_dir.
- Apply code quality improvements to harness.py (imports, linting).
- Update README with working directory considerations and usage notes.
- Update phase3-bootstrap-and-iam.md skill doc to add execution rule against creating temp scripts.

* fix: capture customer_id and respect relative paths

* Implement isolated temp workspace sandboxing with symlinks in test harness

* Configure GCD manual autonomous playbook and align Phase 3/4 steps order

* Fix linting and schema tests failures

- Add missing license headers to tools/skill-turn-harness files.

- Fix trailing spaces and newlines in playbooks.

- Ignore tools directory in schema tests workflow.

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Fix Python formatting with yapf

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Refactor skill-turn-harness to use Antigravity SDK

- Migrated harness from gemini-cli subprocesses to Antigravity SDK.
- Implemented real-time step streaming and console logging.
- Added color-coded terminal output (dark gray headers, blue inputs, pink outputs).
- Collapsed excessive newlines in streamed thoughts.
- Excluded harness codebase from workspace copy to prevent agent cheating.
- Enabled skills folder copy to resolve agent lookup loops.
- Added key validation and CLI --debug flag.

* Fix autonomous turn layout: print Turn ID before execution

- Moved the [Autonomous Turn X] header print to before running the agent turn.
- This groups the real-time thinking and tool calls under the correct Turn ID block, instead of displaying them before the label.

* Remove obsolete .log.md from prerequisites skill directory
Ludovico Magnocavallo
2026-05-22 19:16:54 +02:00
committed by GitHub
parent 1594a01c6f
commit 81f72e8068
32 changed files with 2653 additions and 1 deletions


@@ -274,7 +274,7 @@ jobs:
 - name: Run schema tests
 run: |
-pytest -vv --tb=line --junit-xml=test-results-raw.xml \
+pytest -vv --tb=line --junit-xml=test-results-raw.xml --ignore=tools \
 -k "(tests and schemas) or (fast and schema) or (examples and yaml)"
 - name: Create report

1
.gitignore vendored

@@ -32,5 +32,6 @@ node_modules
 fast/**/globals.auto.tfvars.json
 cloud_sql_proxy
 env/
+logs/*
 .jetskicli
 .agents


@@ -0,0 +1,54 @@
---
name: fast-0-org-setup-prereqs
description: Guides the user step-by-step through the prerequisites for the FAST 0-org-setup stage, supporting both Standard GCP and Google Cloud Dedicated (GCD) environments. Use when a user asks to prepare or run prerequisites for 0-org-setup or bootstrap the FAST landing zone.
---
# FAST 0-org-setup Prerequisites Guide
## Core Principles & Execution Rules
> [!CRITICAL]
> **Understanding Turn Boundaries:** You are running in a turn-based execution environment.
> - You receive one user message, then you can think and run tools.
> - Once you decide you need input from the user (e.g. to choose an environment or confirm a principal), you MUST output your question in a text response and **STOP execution immediately (do not call more tools)**.
> - You **CANNOT** receive the user's answer in the same turn.
> - You **MUST NOT** simulate, guess, or assume the user's response in your subsequent thinking blocks during the same turn. You must wait for the next turn to receive the actual input.
>
> **Do NOT Skip Steps or Make Assumptions:** You MUST NOT skip any phases or steps, even if you think they are redundant or if you find information on the system (like active credentials) that suggests a step is already complete. You MUST execute every step sequentially, in order, and wait for explicit user input/confirmation at each step boundary.
>
> **Strictly One Question at a Time:** You MUST NOT bundle multiple questions or steps together in a single response. Ask exactly one question, wait for the user's answer, and only then proceed to the next question or action.
>
> **Step-by-Step Execution:** Never implement a single "magical" flow. Go through each step one at a time, explaining context and asking for confirmation.
3. **Execution Choice:** Respect the user's execution preference (automatic via `run_shell_command` vs. manual copy/paste) throughout the entire workflow unless the user explicitly instructs you to change it. This preference will be gathered during Phase 1.
4. **File Modifications:** Always use `replace` or `write_file`. **Never** use opaque shell commands (like `sed`, `echo >>`, or `cat <<EOF >>`). Show proposed edits and ask for confirmation before applying them so the user can see what we're doing.
5. **YAML Validation:** Validate generated YAML using `yamllint -c .yamllint --no-warnings <file>`. If the command is not available, ask the user if they prefer to a) install it, b) have you install it (`pip install yamllint`), or c) skip validation.
6. **Resuming Mid-Flow:** If the user is resuming a previously interrupted session, ask them which Phase or Step they left off at, or assess the current state by reading previously generated files (e.g., `defaults.yaml`, `0-org-setup.auto.tfvars`). Read the corresponding reference file for the identified phase and resume execution directly from that point.
## Workflow Map
Guide the user through the following sequence strictly in order. **Before starting a phase, read its corresponding reference document to get the exact instructions, commands, and logic.**
### Phase 1: Environment & Authentication
*Description:* Determine the target environment (Standard GCP or GCD) and ensure the user is properly authenticated.\
*Reference: [Environment & Authentication](references/phase1-env-and-auth.md)*
- **Step 1:** Environment Assessment & Initialization (Standard vs GCD)
- **Step 2:** Authentication
### Phase 2: Admin Principal & Baseline Info
*Description:* Define the core administrative identity and gather essential baseline data like Organization and Billing IDs.\
*Reference: [Admin Principal & Baseline Info](references/phase2-admin-and-baseline.md)*
- **Step 3:** Admin Principal Definition (Group vs Single User)
- **Step 4:** Baseline Information Gathering (Org ID, Billing ID, Billing Access Scenarios)
### Phase 3: Bootstrap Project & IAM
*Description:* Assign the necessary organization-level IAM roles and set up a temporary project for API quota tracking.\
*Reference: [Bootstrap Project & IAM](references/phase3-bootstrap-and-iam.md)*
- **Step 5:** IAM Role Assignments
- **Step 6:** Bootstrap Project Setup (Creation and API enablement)
### Phase 4: Configuration & Wrap-up
*Description:* Generate the FAST dataset configuration, handle existing organization policies, and prepare for the final Terraform apply.\
*Reference: [Configuration & Wrap-up](references/phase4-config-and-wrapup.md)*
- **Step 7:** Configuration Generation (Datasets, defaults.yaml, local paths)
- **Step 8:** Organization Policy Import Check
- **Step 9:** Wrap-up & Apply


@@ -0,0 +1,67 @@
# FAST 0-org-setup Prerequisites - Testing Workflows
This document outlines the various testing workflows and combinations for the `fast-0-org-setup-prereqs` skill to ensure all branches of the skill's logic are covered.
## Core Variables & Decision Points
1. **Environment:** Standard GCP vs. Google Cloud Dedicated (GCD)
2. **GCD Universe (if applicable):** S3NS (France), Berlin (Germany), or Custom
3. **Execution Mode:** Auto (Gemini CLI runs commands) vs. Manual (Output commands for user)
4. **Authentication State:** Already authenticated vs. Needs authentication
5. **Admin Principal Approach:** Group (Approach A) vs. Single User (Approach B)
6. **Billing Account Access Level:** Billing Administrator vs. Billing User vs. No Access
7. **Bootstrap Project:** Pre-existing (with or without API check) vs. New project created by user
8. **Configuration / Dataset:** Standard GCP datasets vs. `classic-gcd` dataset
9. **Org Policy Import Check:** Pre-existing policies found vs. No policies found
## Comprehensive Test Scenarios
### Scenario 1: Standard GCP - The "Simple User" Flow
* **Environment:** Standard GCP
* **Execution Mode:** Manual
* **Authentication:** Needs authentication (Standard auth flow)
* **Admin Principal:** Single User (current authenticated user)
* **Information Gathering:** Direct input of Org ID and Billing ID
* **Billing Access:** Billing Administrator
* **Bootstrap Project:** New project (user creates, CLI outputs full API enable command)
* **Configuration:** Default local path, standard dataset, additional context added
* **Org Policies:** Some policies exist (to test tfvars update)
### Scenario 2: Standard GCP - The "Developer" Flow
* **Environment:** Standard GCP
* **Execution Mode:** Auto
* **Authentication:** Already authenticated
* **Admin Principal:** Group (`foo-owners@example.com`)
* **Information Gathering:** Uses "list" to fetch Org ID and Billing ID
* **Billing Access:** Billing User (requires disabling billing YAML)
* **Bootstrap Project:** Pre-existing project (auto-check and enable missing APIs)
* **Configuration:** Custom local path, standard dataset, no additional context
* **Org Policies:** Some policies exist (to test tfvars update)
#### Notes
### Scenario 3: GCD (S3NS) - Automated Setup
* **Environment:** Google Cloud Dedicated (GCD)
* **Universe:** S3NS (France)
* **Execution Mode:** Auto
* **Authentication:** Needs authentication (WIF login flow)
* **Admin Principal:** Group
* **Billing Access:** Billing Administrator
* **Bootstrap Project:** Pre-existing project (skip API check, enable all)
* **Configuration:** `classic-gcd` dataset (fixed), auto-mapped universe regions, providers.tf generated
* **Org Policies:** No policies exist
### Scenario 4: GCD (Custom) - Manual Setup with Restrictions
* **Environment:** Google Cloud Dedicated (GCD)
* **Universe:** Custom (manually input all 5 variables)
* **Execution Mode:** Manual
* **Authentication:** Already authenticated
* **Admin Principal:** Single User
* **Billing Access:** No Access (requires warning and manual continuation)
* **Bootstrap Project:** New project
* **Configuration:** `classic-gcd` dataset, custom local path
* **Org Policies:** Some policies exist


@@ -0,0 +1,56 @@
# Phase 1: Environment & Authentication
### Step 1: Environment Assessment & Initialization
> [!IMPORTANT]
> **Do NOT Automate Environment Choice**: You MUST explicitly ask the user to clarify their target environment (Standard GCP or GCD) and wait for their response. Do NOT assume or guess based on local config files or active credentials.
>
> **Do NOT Automate Command Execution Preference**: You MUST ask how they prefer to run commands (automatic vs manual) and wait for their response.
1. Ask the user to clarify their target environment: **Standard GCP** or **Google Cloud Dedicated (GCD)**. **Wait for their response.**
2. Once the environment is confirmed, ask how they prefer to run commands: Should you (Gemini CLI) run them automatically, or should you output them for manual execution? **Remember this preference for the rest of the workflow. Wait for their response.**
3. *If GCD is selected*, ask the user if they are working in one of the known universes: **S3NS (France)** or **Berlin (Germany)**.
- *If S3NS:* Pre-fill the following values:
- Universe Web Domain: `cloud.s3nscloud.fr`
- Universe API Domain: `s3nsapis.fr`
- Universe Name: `s3ns`
- Universe Prefix: `s3ns`
- Universe Region: `u-france-east1`
- *If Berlin:* Pre-fill the following values:
- Universe Web Domain: `cloud.berlin-build0.goog`
- Universe API Domain: `apis-berlin-build0.goog`
- Universe Name: `berlin`
- Universe Prefix: `eu0`
- Universe Region: `u-germany-northeast1`
- *If neither (Custom):* Gather the 5 universe-specific details manually from the user.
- *Action:* Present the final list of the 5 universe values to the user for review. Ask for explicit confirmation and offer them the opportunity to change any of the values before proceeding.
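
The universe pre-fill logic in Step 1 can be pictured as a simple lookup. The sketch below is illustrative only: the preset values are copied from the bullets above, while the function name `resolve_universe` and the dictionary key names are hypothetical and not part of the skill.

```python
# Preset values copied from the S3NS and Berlin bullets in Step 1.
# The key names (web_domain, api_domain, ...) are assumptions made for this sketch.
KNOWN_UNIVERSES = {
    "s3ns": {
        "web_domain": "cloud.s3nscloud.fr",
        "api_domain": "s3nsapis.fr",
        "name": "s3ns",
        "prefix": "s3ns",
        "region": "u-france-east1",
    },
    "berlin": {
        "web_domain": "cloud.berlin-build0.goog",
        "api_domain": "apis-berlin-build0.goog",
        "name": "berlin",
        "prefix": "eu0",
        "region": "u-germany-northeast1",
    },
}

REQUIRED_KEYS = ("web_domain", "api_domain", "name", "prefix", "region")


def resolve_universe(choice, custom=None):
  """Return the five universe values for a known or a custom universe."""
  key = choice.strip().lower()
  if key in KNOWN_UNIVERSES:
    # Copy so later user edits do not mutate the presets.
    return dict(KNOWN_UNIVERSES[key])
  custom = custom or {}
  missing = [k for k in REQUIRED_KEYS if not custom.get(k)]
  if missing:
    raise ValueError("missing custom universe values: " + ", ".join(missing))
  return {k: custom[k] for k in REQUIRED_KEYS}
```

Whatever the lookup returns is still only a proposal: the final *Action* above requires showing all five values to the user and waiting for confirmation.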
### Step 2: Authentication
1. Ask the user if they are already authenticated with Google Cloud using the correct principal.
- *If yes:* Run (or ask the user to run) `gcloud config list account --format="value(core.account)"` to retrieve the current authenticated principal. Show this principal to the user and explicitly ask them to confirm if this is the correct identity they want to use.
- *If they confirm:* Proceed directly to Phase 2 (Step 3).
- *If they do not confirm:* Proceed with the authentication steps below.
- *If no:* Proceed with the authentication steps below.
2. *Standard GCP:* Provide or execute the command:
```bash
gcloud auth login
gcloud auth application-default login
```
3. *GCD:* Automate or guide the user through WIF login. Ask for the workforce pool audience string first, then generate the configuration:
```bash
# (Use the gathered GCD variables to fill placeholders)
gcloud config configurations create <UNIVERSE_NAME>
gcloud config configurations activate <UNIVERSE_NAME>
gcloud config set universe_domain <UNIVERSE_API_DOMAIN>
gcloud iam workforce-pools create-login-config <AUDIENCE> \
--universe-cloud-web-domain="<UNIVERSE_WEB_DOMAIN>" \
--universe-domain="<UNIVERSE_API_DOMAIN>" \
--output-file="/tmp/wif-login-config-<UNIVERSE_NAME>.json" \
--activate
gcloud auth login --login-config=/tmp/wif-login-config-<UNIVERSE_NAME>.json --no-launch-browser
gcloud auth application-default login --login-config=/tmp/wif-login-config-<UNIVERSE_NAME>.json
```
4. Explicitly ask the user to confirm they have successfully authenticated before moving to the next phase.


@@ -0,0 +1,35 @@
# Phase 2: Admin Principal & Baseline Info
### Step 3: Admin Principal Definition
1. Explain the concept of the **Admin Principal**. This is the identity (or group of identities) that will be granted the necessary FAST roles to deploy the foundation and manage critical organization-level configurations and policies thereafter.
2. Determine the Admin Principal approach by asking the user to choose between two options:
- **Approach A (Preferred): Use a pre-created Group.**
- *Action:* Explain that using a group (e.g., `group:gcp-organization-admins@example.com`) is the standard and preferred way. **Crucially, clarify that the group provided MUST be a group that the user's current authenticated identity belongs to**, otherwise they will lock themselves out.
- *Action:* Ask the user to provide the group email address.
- *Action:* Explicitly ask the user to confirm that their current identity (the one they just authenticated with) is already a member of this group.
- *Action:* If the user answers "No" to the membership confirmation, **DO NOT PROCEED**. Inform the user that proceeding will lock them out. Ask them to either authenticate with an identity that *is* a member of the group (and restart the authentication step), or provide a different group that their current identity belongs to.
- **Approach B (Fallback): Use a Single User.**
- *Action:* Explain that this flow uses a single user as the sole GCP Org Admin, but more can be added later.
- *Action:* Run (or ask the user to run) `gcloud config list account --format="value(core.account)"` to retrieve their current authenticated principal.
- *Action:* Show the user their current principal and explicitly ask them to confirm this is the identity they want to use as the Admin Principal.
### Step 4: Baseline Information Gathering
1. Gather baseline information required for `0-org-setup`:
- Organization ID (and the associated Directory Customer ID, which is important for Standard GCP but not required for GCD)
- Billing Account ID (Mandatory for subsequent stages, even if not required for the GCD temporary project)
*Action:* When prompting the user for the Organization ID and Billing Account ID, explicitly instruct them in the prompt/question that they can leave the field blank (or type "list") to have you automatically run the relevant `gcloud` command (`gcloud organizations list --format="json"` or `gcloud beta billing accounts list`). Also, instruct them that they can type a keyword to filter the list.
*Action:* If the user leaves the field blank or types "list", run the `gcloud` command without filters. If the user types a keyword that is not a valid Organization ID (numeric) or Billing Account ID (e.g., `012345-6789AB-CDEF01`), run the `gcloud` command and use that keyword to filter the results using a **case-insensitive** regex match. Ensure the regex pattern is enclosed in single quotes within the filter argument (e.g., `--filter="displayName~'(?i)KEYWORD'"`). Do NOT use `*` wildcards in the filter.
*Action:* If the filtered `gcloud` command returns no results, inform the user and use the `ask_user` tool to ask if they want to provide a different keyword or fetch all items (run without a filter).
*Action:* Once you have results for Organizations (filtered or unfiltered), extract the Organization ID, the Display Name (domain), and the Directory Customer ID (found in `owner.directoryCustomerId` in the JSON output). Sort them alphabetically by the display name, and then output the sorted results as a clearly formatted numbered list in the chat. Then, use the `ask_user` tool (type: text) to ask the user to enter the number corresponding to their selection. Note the Organization ID, Domain, and Customer ID for Phase 4. For Billing Accounts, do the same sorting and prompting.
2. Determine the **Admin Principal's** access level to the provided Billing Account ID. Ask which of the following three scenarios applies to the Admin Principal (not necessarily the current user):
- **Scenario 1 (Billing Administrator):** The Admin Principal has `roles/billing.admin`.
- *Action:* Ask a follow-up question: "Is your billing account managed by the same organization where we are installing FAST, or outside of it? (You can check this in the Google Cloud Console by going to Billing -> using the organization picker on top -> checking if the account is listed under this organization)."
- *If Inside the Org:* Note that `roles/billing.admin` WILL be assigned at the Organization level in Step 5. Instruct the user that we will deactivate the billing factories path for now, but if account-level IAM also needs to be managed via FAST later, they can reactivate the path and use the billing YAML to do it.
- *If Outside the Org:* Note that `roles/billing.admin` WILL NOT be assigned at the Organization level in Step 5.
- **Scenario 2 (Billing User):** The Admin Principal has `roles/billing.user` but NOT admin rights.
- *Action:* Note that `roles/billing.admin` WILL NOT be assigned at the Organization level in Step 5. Either disable the billing YAML via the `factories_config` variable or comment it out, since the Admin Principal cannot control IAM on the account.
- *Action:* **Explain to the user:** The service accounts for IaC (and therefore the provider switch and subsequent stages, except for VPC-SC) will not be operative until the correct billing permissions have been assigned to them outside of FAST.
- **Scenario 3 (No Access):** The Admin Principal has absolutely no rights on the billing account.
- *Action:* **Clearly state:** This scenario is mostly used for development purposes, is strongly discouraged, and requires advanced Terraform skills and FAST knowledge to proceed.
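
The answer handling in Step 4 (blank or "list", a literal ID, or a keyword turned into a case-insensitive filter) and the sorting of organizations can be sketched as follows. This is a minimal illustration, not the skill's implementation: the function names are hypothetical, the billing ID pattern is inferred from the single example `012345-6789AB-CDEF01`, and the `name` / `displayName` fields are assumed to follow the usual `gcloud organizations list --format="json"` output. Only `owner.directoryCustomerId` is confirmed by the text above.

```python
import re

ORG_ID_RE = re.compile(r"^\d+$")
# Assumption: billing account IDs are three dash-separated groups of six hex chars.
BILLING_ID_RE = re.compile(r"^[0-9A-F]{6}-[0-9A-F]{6}-[0-9A-F]{6}$", re.IGNORECASE)


def classify_answer(answer):
  """Map a user answer to ("list", None), ("id", value) or ("filter", expr)."""
  text = answer.strip()
  if text == "" or text.lower() == "list":
    return ("list", None)
  if ORG_ID_RE.match(text) or BILLING_ID_RE.match(text):
    return ("id", text)
  # Case-insensitive regex match, single-quoted, with no "*" wildcards.
  return ("filter", "displayName~'(?i)" + text + "'")


def summarize_orgs(orgs):
  """Extract ID, domain and customer ID, sorted by display name."""
  rows = []
  for org in orgs:
    rows.append({
        "id": org["name"].split("/")[-1],
        "domain": org["displayName"],
        "customer_id": org.get("owner", {}).get("directoryCustomerId"),
    })
  return sorted(rows, key=lambda row: row["domain"].lower())
```

The returned filter expression is meant to be passed as `--filter="<expr>"`. A keyword that itself contains a single quote would break the quoting, so a real implementation would need to reject or escape it.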


@@ -0,0 +1,43 @@
# Phase 3: Bootstrap Project & IAM
### Step 5: IAM Role Assignments
1. Grant the following roles to the chosen Admin Principal at the Organization level. **CRITICAL:** Only include `roles/billing.admin` in this list if the user selected Scenario 1 (Billing Administrator) AND confirmed the billing account is managed Inside the Organization in Step 4.
**CRITICAL EXECUTION RULE:** Do NOT create temporary bash scripts (e.g., `assign_roles.sh`) to execute this loop. You MUST execute the `for` loop inline directly using the `run_shell_command` tool, or output the exact inline loop for the user to copy/paste.
```bash
# Roles to assign:
# [roles/billing.admin] <-- CONDITIONAL (See above)
# roles/logging.admin
# roles/iam.organizationRoleAdmin
# roles/orgpolicy.policyAdmin
# roles/resourcemanager.folderAdmin
# roles/resourcemanager.organizationAdmin
# roles/resourcemanager.projectCreator
# roles/resourcemanager.tagAdmin
# roles/owner
# Loop example for the user or tool execution:
for role in [ROLES_LIST]; do
gcloud organizations add-iam-policy-binding <ORG_ID> \
--member="<ADMIN_PRINCIPAL>" --role="$role" --condition=None
done
```
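
The conditional inclusion of `roles/billing.admin` can be made explicit with a small helper. The sketch below is one possible way to build the role list and the inline loop. The function names and the scenario encoding (an integer 1 to 3) are hypothetical, while the role names and the `gcloud` command are taken from the block above.

```python
BASE_ROLES = [
    "roles/logging.admin",
    "roles/iam.organizationRoleAdmin",
    "roles/orgpolicy.policyAdmin",
    "roles/resourcemanager.folderAdmin",
    "roles/resourcemanager.organizationAdmin",
    "roles/resourcemanager.projectCreator",
    "roles/resourcemanager.tagAdmin",
    "roles/owner",
]


def roles_to_assign(billing_scenario, billing_inside_org):
  """Add roles/billing.admin only for Scenario 1 with an in-org billing account."""
  roles = list(BASE_ROLES)
  if billing_scenario == 1 and billing_inside_org:
    roles.insert(0, "roles/billing.admin")
  return roles


def inline_loop(org_id, member, roles):
  """Render the single-line shell loop, so no temporary script is needed."""
  return ("for role in " + " ".join(roles) + "; do "
          "gcloud organizations add-iam-policy-binding " + org_id + " "
          '--member="' + member + '" --role="$role" --condition=None; '
          "done")
```

For example, `inline_loop("1234567890", "group:gcp-organization-admins@example.com", roles_to_assign(2, False))` yields a loop over the eight base roles only.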
### Step 6: Bootstrap Project Setup
1. Explain that a temporary bootstrap project is required to track API quota before organization policies are fully established.
2. Ask the user if they already have a suitable project they can use for this purpose.
- *If yes:* Ask if this project is already configured as the active project in `gcloud`. If the user does not know, run `gcloud config list project --format="value(core.project)"` to check for them.
- If it is already configured, fetch the Project ID using `gcloud config list project --format="value(core.project)"`. Explicitly ask the user to confirm if this fetched Project ID is the one they want to use. **Only if they answer "No" to this confirmation**, ask them to provide the correct Project ID.
- If it is not configured, ask the user to provide the Project ID.
- *If no:* Ask the user to use the Cloud Console to create a temporary project (must be linked to the billing account). Ask them to provide the new Project ID once created.
3. Once the Project ID is provided or fetched, ensure it is set as the default project. If it is not already set, run:
```bash
gcloud config set project <TEMP_PROJECT_ID>
```
4. Enable the required baseline APIs on the project:
- The required APIs are: `bigquery.googleapis.com`, `cloudbilling.googleapis.com`, `cloudresourcemanager.googleapis.com`, `essentialcontacts.googleapis.com`, `iam.googleapis.com`, `logging.googleapis.com`, `orgpolicy.googleapis.com`, `serviceusage.googleapis.com`.
- *If the project was pre-existing:* Ask the user if they want you to check which services are already enabled.
- *If yes:* Run `gcloud services list --enabled --format="value(config.name)"` to get the current list. Compute the delta between the enabled services and the required list. Only run `gcloud services enable <MISSING_APIS>` for the ones that are missing.
- *If no:* Run the full `gcloud services enable` command for all required APIs.
- *If the project is new:* Run the full `gcloud services enable` command for all required APIs.
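
The delta computation for a pre-existing project amounts to a set difference over the output of `gcloud services list --enabled --format="value(config.name)"`. A minimal sketch, with hypothetical function names and the required API list copied from above:

```python
REQUIRED_APIS = [
    "bigquery.googleapis.com",
    "cloudbilling.googleapis.com",
    "cloudresourcemanager.googleapis.com",
    "essentialcontacts.googleapis.com",
    "iam.googleapis.com",
    "logging.googleapis.com",
    "orgpolicy.googleapis.com",
    "serviceusage.googleapis.com",
]


def missing_apis(enabled_output):
  """Return required APIs absent from the one-per-line list of enabled ones."""
  enabled = {line.strip() for line in enabled_output.splitlines() if line.strip()}
  return [api for api in REQUIRED_APIS if api not in enabled]


def enable_command(apis):
  """Build the enable command, or None when nothing is missing."""
  if not apis:
    return None
  return "gcloud services enable " + " ".join(apis)
```

When the user declines the check, or the project is new, the same `enable_command` can simply be called with the full `REQUIRED_APIS` list.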


@@ -0,0 +1,54 @@
# Phase 4: Configuration & Wrap-up
### Step 7: Configuration Generation
1. **Explain Datasets:** Briefly explain to the user that FAST uses "datasets" (collections of YAML files) that fully describe the design, architecture, resources, and policies applied.
2. **Select the Dataset:** Ask the user to select the dataset they want to use for their landing zone design.
- *If GCD:* Explicitly state that the `classic-gcd` dataset must be used for GCD installations.
- *If Standard GCP:* Offer the available datasets and ask them to choose one. **Crucially, only search for available datasets within the `fast/stages/0-org-setup/datasets/` directory.** Do not search across the entire repository or other FAST stages. Provide a one-line description below each dataset when presenting the options (e.g., "classic: The standard FAST landing zone architecture", "hardened: A more restrictive, hardened landing zone architecture").
3. **Explain Defaults Configuration:** Explain to the user that we are starting the configuration of the `defaults.yaml` file, which drives the static configuration of the dataset by providing the Org ID, Billing ID, user-specified locations, and any static values they need to bring in from the outside (like additional IAM principals used in the YAML files to assign IAM roles).
4. **Determine Locations:** FAST uses a set of locations for different services.
- *If GCD:* The region is fixed based on the universe selected in Phase 1. Set the `logging` location to `global` and all other required locations to the Universe Region. Do not ask the user to choose; simply show them the configured locations.
- *If Standard GCP:* Check the `fast/stages/0-org-setup/schemas/defaults.schema.json` to identify the required location keys (e.g., `bq`, `gcs`, `logging`, `pubsub`). First, ask the user to provide a "base location" (e.g., `europe-west1`), explaining that submitting without answering will confirm the default value. Even if the user confirms the base location, you must then ask if they need to override the location for any individual services.
5. **Determine Local Path:** Explain to the user that FAST generates provider configurations and other files that need to be stored outside the repository. This is defined by the `output_files.local_path` setting. Propose a default path based on the chosen prefix (e.g., `~/fast-config/<prefix>`) and ask the user to confirm or provide a different path. **Crucially, if the user provides a relative path (e.g., `custom-fast-config` or `./custom-fast-config`), use it exactly as provided relative to the current working directory. Do not automatically prepend `~/` to their input.** Once confirmed, create this directory using `mkdir -p <LOCAL_PATH>`.
6. **Ask for Additional Context:** Ask the user if there are any other static values they want to bring in from outside to be referenced in the YAML files. Show them the available context keys (excluding `condition_vars`). Provide examples showing that prefixes are mandatory for IAM principals (e.g., `user:foo@example.com`, `group:bar@example.com`). For GCD, also show a `principalSet:` example. **Do not use shell commands like `echo` or `cat` to show this list; output it directly in your chat message or the `ask_user` prompt.**
7. **Create Local Directories and Copy Defaults:**
- Create the `data/0-org-setup/` and `providers/` directories inside the confirmed `local_path` (`mkdir -p <LOCAL_PATH>/data/0-org-setup/ <LOCAL_PATH>/providers/`).
- **Copy** (do not move) the `defaults.yaml` from the chosen dataset folder to `<LOCAL_PATH>/data/0-org-setup/defaults.yaml` using `cp`.
8. **Edit the Copied Defaults File:** Use the `replace` tool to edit the *copied* `<LOCAL_PATH>/data/0-org-setup/defaults.yaml` file.
- Populate `global.billing_account`, `global.organization.id`, `global.organization.domain`, and `global.organization.customer_id` (using the values gathered in Phase 2). Note that `customer_id` may not be present for GCD.
- Populate `context.iam_principals.gcp-organization-admins` (using the Admin Principal determined in Phase 2).
- Populate the service-specific locations gathered in the previous step into the `context.locations` block and map them appropriately in `projects.defaults.locations`.
- Populate `output_files.local_path` with the confirmed path.
- If additional context was provided, add it to the `context` block.
- *If GCD*, ensure the `overrides.universe` block is present with `domain`, `prefix`, and specific identity overrides.
9. Run YAML validation: `yamllint -c .yamllint --no-warnings <LOCAL_PATH>/data/0-org-setup/defaults.yaml`. Handle missing tool errors as described in the Core Principles.
10. **Create Configuration Files:**
- Use `write_file` to create `0-org-setup.auto.tfvars` inside the `local_path` (`<LOCAL_PATH>/0-org-setup.auto.tfvars`).
- In `0-org-setup.auto.tfvars`, set the `factories_config` variable. The `dataset` should point to the original dataset folder (e.g., `"datasets/classic"`), but the `paths.defaults` must point to the absolute path of the copied defaults file.
- *If GCD*, also: Create a temporary `0-org-setup-providers.tf` file containing the specific `universe_domain` configuration using `write_file` at `<LOCAL_PATH>/providers/0-org-setup-providers.tf`.
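
The local path rules in Step 7 (default under `~/fast-config/<prefix>`, relative input kept relative to the working directory, absolute path for `paths.defaults`) can be illustrated with `pathlib`. This is a sketch under stated assumptions: the function names are hypothetical, and treating an empty answer as acceptance of the default is an interpretation of "ask the user to confirm".

```python
from pathlib import Path


def resolve_local_path(answer, prefix, cwd):
  """Resolve the user's local_path answer without ever prepending ~/ to it."""
  text = answer.strip()
  if not text:
    # Assumption: an empty answer accepts the proposed default.
    return Path.home() / "fast-config" / prefix
  path = Path(text).expanduser()  # expands only an explicit leading "~"
  if path.is_absolute():
    return path
  # Relative input stays relative to the current working directory.
  return Path(cwd) / path


def defaults_path(local_path):
  """Location of the copied defaults file, as referenced by paths.defaults."""
  return Path(local_path) / "data" / "0-org-setup" / "defaults.yaml"
```

Both `custom-fast-config` and `./custom-fast-config` resolve to the same directory under the working directory, which matches the rule that relative input is used exactly as provided.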
### Step 8: Organization Policy Import Check
1. Explain that pre-existing organization policies can cause `409 Conflict` errors during the first apply if not imported.
2. Execute (or provide) the command to list current policies.
```bash
gcloud org-policies list --organization="<ORG_ID>" --format="value(constraint)"
```
3. **Update `0-org-setup.auto.tfvars`:** If any policies are returned, capture the output, format it as an HCL list in memory, and use the `replace` tool to append the `org_policies_imports` variable to the `0-org-setup.auto.tfvars` file. **ABSOLUTELY NEVER use shell redirection like `echo >>`, `awk >>`, or `cat <<EOF >>` to edit files.** Explain to the user that this tells Terraform to import these existing policies rather than attempting to recreate them.
### Step 9: Wrap-up & Apply
1. **Link Configuration:** Instruct the user to link the final configuration files to the stage directory using the `fast-links.sh` script. They should navigate to `fast/stages/0-org-setup/` and run:
```bash
../../fast-links.sh <LOCAL_PATH>
```
Explain that they should then copy and run the commands printed by the script to create the links.
**Note:** If they are not on GCD, they can ignore the provider file linking command printed by the script for the first run.
2. Remind the user that the prerequisite phase is complete.
3. Instruct them to run `terraform init` and `terraform apply`.
4. Remind them to delete the temporary project and reset their gcloud default project to `iac-0` after a successful apply. Provide the exact commands for them to copy and use later:
```bash
gcloud projects delete <TEMP_PROJECT_ID>
gcloud config set project <IAC_PROJECT_ID>
```

tools/skill-turn-harness/.gitignore

@@ -0,0 +1,6 @@
.env*
!*.env.example
!*.env.test
__pycache__/
.pytest_cache/
.ruff_cache/


@@ -0,0 +1,4 @@
[style]
based_on_style=google
indent_width=2
split_before_named_assigns=false


@@ -0,0 +1,216 @@
# Hybrid Python Test Harness for Antigravity Skills
## Overview
This project provides a hybrid test harness for developing and evaluating Antigravity skills. It addresses the "inner dev loop" problem by letting you test local, unpacked skills directly against the Antigravity engine (via the Python SDK), while an LLM evaluator grades the agent's behavior and returns a structured pass/fail verdict.
## Table of Contents
- [Prerequisites](#prerequisites)
- [How to Use](#how-to-use)
  - [Basic Usage](#basic-usage)
  - [Command Line Options](#command-line-options)
  - [Expected Output](#expected-output)
- [Testing Local Skills (Inner Dev Loop)](#testing-local-skills-inner-dev-loop)
- [Writing Playbooks](#writing-playbooks)
- [Running the Pytest Suite](#running-the-pytest-suite)
- [Writing Playbooks: Three Modes of Testing](#writing-playbooks-three-modes-of-testing)
The architecture relies on three main components:
- **Orchestrator (Python):** Drives the execution loop, reads YAML playbooks, and manages isolated workspaces for each test run to prevent session caching issues.
- **Execution Target (Antigravity SDK):** The skill is executed using the `google-antigravity` Python SDK, which manages the localharness engine in-process. This eliminates the dependency on a globally installed CLI.
- **Evaluator (Gemini API):** The semantic evaluation of the agent's output is performed via direct API calls to `gemini-2.5-flash` using `google-genai`. This bypasses brittle string-parsing and guarantees structured JSON output (Pass/Fail + Reasoning).
## Prerequisites
You can run the harness directly using `uv` (recommended), which will automatically handle downloading and running with the required dependencies:
```bash
uv run harness.py playbooks/my-playbook.yaml
```
Alternatively, ensure you have a Python virtual environment set up with the required dependencies:
```bash
pip install google-antigravity google-genai pydantic pyyaml click pytest
```
You also need your Gemini API key available in your environment, or stored in `~/.gemini/key.env`:
```bash
export GEMINI_API_KEY="your_api_key_here"
```
## How to Use
The main entry point is the `harness.py` CLI tool.
> [!NOTE]
> All commands in this guide assume you are running from the `tools/skill-turn-harness` directory. If running from the repository root, prefix paths accordingly (e.g., `python3 tools/skill-turn-harness/harness.py ...`).
### Basic Usage
To run a test, provide a YAML playbook:
```bash
python3 harness.py playbooks/my-playbook.yaml
```
### Command Line Options
- `playbook` (Required): The path to the YAML playbook defining the test steps.
- `--log-dir <path>` (Optional): The directory where the harness will write detailed Markdown logs, session JSONs, and JSON failure dumps. Defaults to `./logs`.
- `--skill-src <path>` (Optional): The path to a local, unpacked skill directory. See the "Testing Local Skills" section below for details.
- `--env-file <path>` (Optional): The path to a standard `.env` file containing key-value pairs (e.g. `MY_SECRET=123`). This is used for secure string substitution within your playbook steps.
- `--keep-workspace` (Optional): Preserve the temporary workspace directory (`/tmp/gemini_harness_*`) after execution to inspect files generated by the agent.
- `--agent-model <model>` (Optional): Override the model the agent uses (e.g., `gemini-2.5-pro`). Overrides playbook definition.
- `--evaluator-model <model>` (Optional): Override the model the test harness uses to grade and simulate (e.g., `gemini-2.5-flash`). Overrides playbook definition.
- `--debug` (Optional): Enable verbose debug logging for the SDK (e.g., WebSocket traffic).
⚠️ **Security Warning regarding Logs:**
If your playbooks require secrets (like API keys or passwords) via the `env` array, the harness substitutes them into the step inputs before sending them to the agent. Because the harness traces all inputs and outputs for debugging, **these substituted secrets will be written in plain text** to your `logs/` directory.
A default `.gitignore` is provided in the `logs/` directory to prevent committing these files, but care should still be taken to avoid leaking secrets into your repository.
### Expected Output
The harness executes the playbook steps, evaluates the responses, and streams the results to the console:
```text
--- Tuning: FAST Setup PoC | Workspace: /tmp/gemini_harness_abc123 ---
[Step 1] Input: Hi, please activate the fast-setup-poc skill and let's configure FAST.
[Step 1] Output: Hi, let's configure FAST. Please provide your Google Cloud Project ID.
✅ [PASS Step 1]: The agent greeted the user ('Hi'), confirmed it was configuring FAST, and asked for the Project ID. All parts of the objective were fulfilled.
...
✅ [SUCCESS] Playbook 'FAST Setup PoC' completed successfully.
📄 Markdown log saved to: logs/FAST_Setup_PoC_log.md
```
If a step fails, the harness halts immediately and dumps the full interaction trace to a JSON file (e.g., `logs/FAST_Setup_PoC_failed.json`) for debugging.
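The log file names in the sample output are derived from the playbook name. A minimal sketch of such a slug-based naming scheme is shown below; the helper names and the exact character rules are assumptions, chosen only to reproduce the `FAST_Setup_PoC_log.md` and `FAST_Setup_PoC_failed.json` names shown above.

```python
import re
from pathlib import Path


def slugify(name: str) -> str:
  """Collapse every run of non-alphanumeric characters into one underscore."""
  return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def log_paths(log_dir: Path, playbook_name: str) -> tuple[Path, Path]:
  """Return the (markdown log, failure dump) paths for a playbook."""
  slug = slugify(playbook_name)
  return log_dir / f"{slug}_log.md", log_dir / f"{slug}_failed.json"


if __name__ == "__main__":
  print(log_paths(Path("logs"), "FAST Setup PoC"))
```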
## Testing Local Skills (Inner Dev Loop)
When developing a complex skill (with multiple markdown files, prompt templates, or tools), you don't want to package and globally install it just to run a test.
The harness supports testing local skills directly using the `--skill-src` flag:
```bash
python3 harness.py playbooks/my-playbook.yaml --skill-src ./my-local-skill/
```
**How it works under the hood:**
The harness passes the skill path to the SDK's `LocalAgentConfig(skills_paths=[...])`. The Antigravity engine loads the skill dynamically for the duration of the session. Unlike the old CLI-based linking, this is completely isolated and does not modify your global environment.
## Workspace Management & Copying Behavior
By default, the test harness executes the agent in an isolated temporary workspace (e.g., `/tmp/gemini_harness_<hash>`) to prevent session caching and protect your repository from accidental file modifications.
### Copying vs. Symlinking
To ensure both safety and compatibility with local search tools (such as the agent's built-in `search_directory` / grep tool, which can crash when traversing directory symlinks), **the harness copies the configured playbook directories recursively instead of symlinking them**.
To keep the workspace lightweight and prevent the agent from "cheating" by reading the test definitions, the copy operation implements strict exclusion rules:
- **Excluded Dependencies/Cache**: `.terraform`, `.git`, `.venv`, `venv`, `__pycache__`, `.pytest_cache` are skipped. This reduces the copied size of directories like `fast/` from 1.4GB to a few megabytes, making workspace setup near-instant.
- **Excluded Harness**: The `skill-turn-harness` directory itself is **strictly excluded** from the copy. This prevents the agent under test from walking the workspace, reading the playbook YAML definitions, and "cheating" by peeking at the expected inputs/outcomes.
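The exclusion rules above map naturally onto `shutil.copytree` with an `ignore` callback. The sketch below is not the harness's actual code: the function name and the handling of top-level files are assumptions, and only the excluded directory names come from the list above.

```python
import shutil
from pathlib import Path

# Directory names skipped at any depth, taken from the exclusion list above.
EXCLUDED = (".terraform", ".git", ".venv", "venv", "__pycache__",
            ".pytest_cache", "skill-turn-harness")


def copy_into_workspace(repo_root: Path, workspace: Path,
                        link_paths: list[str]) -> None:
  """Copy each configured path into the workspace, honoring exclusions."""
  ignore = shutil.ignore_patterns(*EXCLUDED)
  for rel in link_paths:
    src, dst = repo_root / rel, workspace / rel
    if src.is_dir():
      # copytree creates dst and applies the ignore rule to every subdirectory.
      shutil.copytree(src, dst, ignore=ignore)
    elif src.is_file():
      dst.parent.mkdir(parents=True, exist_ok=True)
      shutil.copy2(src, dst)
```

Because `ignore_patterns` matches on entry names, `tools/skill-turn-harness` is skipped even when `tools` itself is copied.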
### Linking the `skills` Directory
If your autonomous playbook instructs the agent to "activate" or "inspect" the skill, the model may attempt to search the workspace for the skill's source files (like `SKILL.md`). For these playbooks, ensure you add `skills` to the playbook's `link_paths` so the agent can resolve the lookup locally:
```yaml
tmpdir:
  link_paths:
    - fast
    - modules
    - tools
    - skills # <-- Make sure to include this so the agent can find skill files
```
### Isolated Chat History
The harness configures the SDK to write raw session state directly to the configured `--log-dir` (under `log_dir/chats/`). This ensures that test execution conversations remain isolated and do not pollute your global Antigravity/Jetski desktop history.
## Writing Playbooks
Playbooks are written in YAML. For autocompletion and validation in VS Code, add the schema annotation to the top of your playbook.
If your playbook requires environment variables (e.g., secrets), declare them in the `env` array. You can then reference them in your `steps` using `${VAR_NAME}`. If a variable is declared but not found in the environment (or passed via `--env-file`), the harness will safely halt before execution.
To run the test in a specific directory (e.g., the repository root), specify `working_dir`. If omitted, a temporary isolated workspace is created.
```yaml
# yaml-language-server: $schema=../playbooks/playbook.schema.json
name: "My Test Playbook"
timeout: 120
agent_model: "gemini-2.5-pro"
evaluator_model: "gemini-2.5-flash"
working_dir: "." # Run in the directory where harness is executed
env:
  - MY_API_KEY
steps:
  - user_input: "Hi, activate my-skill and use this key: ${MY_API_KEY}"
    expected_outcome: "The agent should greet the user and acknowledge the key."
```
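The `${VAR_NAME}` substitution and the "halt if a declared variable is missing" rule can be sketched as below. The function names are hypothetical, and letting `--env-file` values take precedence over the process environment is an assumption, not documented behavior.

```python
from __future__ import annotations

import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(declared: list[str],
                env_file_values: dict[str, str] | None = None) -> dict[str, str]:
  """Collect values for declared variables, failing fast on missing ones."""
  # Assumption: values from the env file override the process environment.
  merged = {**os.environ, **(env_file_values or {})}
  missing = [name for name in declared if name not in merged]
  if missing:
    raise ValueError(f"Missing environment variables: {', '.join(missing)}")
  return {name: merged[name] for name in declared}


def substitute(text: str, values: dict[str, str]) -> str:
  """Replace ${NAME} for declared names; leave unknown placeholders as-is."""
  return _VAR.sub(lambda m: values.get(m.group(1), m.group(0)), text)
```

Only variables listed under `env` are substituted, so an undeclared `${...}` in a step passes through unchanged instead of silently pulling from the environment.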
## Running the Pytest Suite
This repository includes a `pytest` suite in the `test/` directory to test the harness itself.
To run the fast unit tests (which mock the CLI execution):
```bash
python3 -m pytest test/test_harness.py -m "not e2e" -v
```
To run the full End-to-End (E2E) test (which dynamically links the fixture skill and hits the real Gemini API):
```bash
python3 -m pytest test/test_harness.py -m "e2e" -v
```
## Writing Playbooks: Three Modes of Testing
The harness supports three modes of execution depending on how you structure your YAML playbook. The mode is inferred automatically based on the presence of the `steps` and/or `persona` keys.
### 1. Scripted Mode (Unit / Regression)
**Best for:** Ensuring the exact, rigid state machine of a skill hasn't broken.
You define a strict, sequential list of `steps`. The harness feeds the `user_input` and checks if the agent's response satisfies the `expected_outcome` via an LLM evaluation.
```yaml
name: "FAST Setup PoC - Scripted"
steps:
  - user_input: "Hi, let's configure FAST."
    expected_outcome: "The agent should greet the user and ask for the Project ID."
  - user_input: "my-super-project-123"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
```
### 2. Autonomous "Pond" Mode (E2E / Fuzz Testing)
**Best for:** Testing how the skill handles the messy reality of natural language and conversational drift.
Instead of providing a rigid script, you define a declarative **Persona** with a "Pond" of knowledge and explicit `success_criteria`. A secondary LLM agent acts as the simulated user, dynamically reading the agent's outputs, fishing data from the "pond," and generating the next input until the success criteria are met or the `max_turns` limit is reached.
```yaml
name: "FAST Setup PoC - Autonomous"
persona:
initial_user_input: "Hi, let's configure FAST."
context: >
You are a GCP developer. Your Project ID is my-project-123 and region is europe-west1.
Do not volunteer information until the agent explicitly asks for it.
max_turns: 10
success_criteria:
llm_checks:
- "The agent provided a final configuration summary containing the correct project_id and region."
tool_calls_contain:
run_shell_command:
- "gcloud organizations add-iam-policy-binding"
files_exist:
- "0-org-setup.auto.tfvars"
```
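The non-LLM parts of `success_criteria` (`files_exist` and `tool_calls_contain`) are plain final-state checks. The sketch below shows one way they could be evaluated. The function name and the shape of the recorded tool calls (`{"name": ..., "args": ...}`) are assumptions; `llm_checks` is left out because it needs the evaluator model.

```python
from pathlib import Path


def check_final_state(workspace: Path, criteria: dict,
                      tool_calls: list[dict]) -> list[str]:
  """Return a list of human-readable failures; empty means all checks passed."""
  failures = []
  for rel in criteria.get("files_exist", []):
    if not (workspace / rel).exists():
      failures.append(f"missing file: {rel}")
  for tool, needles in criteria.get("tool_calls_contain", {}).items():
    # Hypothetical record shape: one dict per tool call made by the agent.
    seen = [c.get("args", "") for c in tool_calls if c.get("name") == tool]
    for needle in needles:
      if not any(needle in args for args in seen):
        failures.append(f"{tool} was never called with: {needle}")
  return failures
```

Returning every failure rather than stopping at the first one keeps the log useful when several criteria are unmet at once.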
### 3. Hybrid Fallback Mode
**Best for:** Testing happy-path compliance while ensuring the agent can recover from unexpected deviations.
If a playbook defines **both** `steps` and a `persona`, the harness runs in Hybrid mode. It attempts to execute the rigid `steps` first. If the skill deviates or fails a step evaluation, instead of failing the test outright, the harness **falls back** to the autonomous persona. The simulated user takes over the conversation history and attempts to guide the agent back on track to meet the `success_criteria`. If successful, the test returns a `PASS WITH WARNINGS`.
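The mode inference described in this section reduces to checking which of the two keys is present. A minimal sketch, with a hypothetical function name and mode labels:

```python
def infer_mode(playbook: dict) -> str:
  """Infer the execution mode from the presence of `steps` and `persona`."""
  has_steps = bool(playbook.get("steps"))
  has_persona = bool(playbook.get("persona"))
  if has_steps and has_persona:
    return "hybrid"
  if has_persona:
    return "autonomous"
  if has_steps:
    return "scripted"
  raise ValueError("Playbook must define `steps`, `persona`, or both.")
```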


@@ -0,0 +1,40 @@
# Test Harness TODOs & Analysis
## The Problem: Skill Environment Dependencies
During E2E testing of the `fast-0-org-setup-prereqs` skill, the autonomous agent hallucinated that it needed to `git clone` the `cloud-foundation-fabric` repository.
This occurred because the test harness executes the Gemini CLI in a completely empty, isolated temporary workspace (`/tmp/gemini_harness_*`). However, the FAST skill *assumes* it is being executed from the root of the `cloud-foundation-fabric` repository because it needs to:
1. Read available datasets from `fast/stages/0-org-setup/datasets/*/defaults.yaml`.
2. Create a symbolic link from the generated `0-org-setup.auto.tfvars` file to `fast/stages/0-org-setup/0-org-setup.auto.tfvars`.
Because these directories did not exist in the isolated workspace, the agent failed to complete Phase 4 of the setup.
## Proposed Solutions for Next Session
We need a way to provide the agent with the expected repository structure during tests without compromising the safety and isolation of the test harness.
### Option 1: Add a `working_dir` Playbook Attribute
Allow the playbook to specify a `working_dir` (e.g., the actual path to the `cloud-foundation-fabric` repo) where the CLI should be executed, bypassing the temporary workspace creation.
**Risks & Impacts:**
* **Chat History Pollution:** The test's conversation will be saved to the global `~/.gemini/tmp/cloud-foundation-fabric/chats/` directory, mixing test runs with the developer's actual day-to-day CLI usage.
* **Session Retrieval:** The harness will need to be updated to find the *newest* `session-*.json` file in that directory, rather than assuming it's the only one.
* **File Modification Risk:** If the agent hallucinates or a test is poorly written, it could modify or delete real files in the repository instead of sandboxed test files.
* **Cleanup:** The harness cannot safely clean up the workspace after the test completes.
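The "Session Retrieval" impact above amounts to selecting the most recently modified `session-*.json` file. A minimal sketch, assuming a hypothetical helper name and that modification time is a good enough proxy for "newest":

```python
from __future__ import annotations

from pathlib import Path


def newest_session_file(chats_dir: Path) -> Path | None:
  """Return the most recently modified session file, or None if there is none."""
  sessions = chats_dir.glob("session-*.json")
  return max(sessions, key=lambda p: p.stat().st_mtime, default=None)
```

This still cannot tell a test session from one the developer started at the same moment, which is part of why a shared chats directory is risky.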
### Option 2: The "Symlink Sandbox" (Recommended)
Keep the isolated temporary workspace (`/tmp/gemini_harness_*`), but add a `symlink_paths` array to the playbook schema.
Before the test starts, the harness would dynamically create symbolic links from the real repository (e.g., `fast/`, `modules/`) into the temporary workspace.
**Benefits:**
* **Total Isolation:** The agent's chat history remains isolated in a temporary `.gemini/tmp/gemini_harness_*/` directory.
* **Safe Execution:** The agent sees the directory structure it expects (and can read the `defaults.yaml` files), but any new files it creates (like the `custom-fast-config` directory or the `0-org-setup.auto.tfvars` symlink) are created safely inside the temporary workspace.
* **Automatic Cleanup:** The entire workspace (including the symlinks and generated files) is safely deleted when the test finishes.
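Option 2 could be prototyped roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical and `symlink_paths` is the proposed (not yet existing) playbook field.

```python
import tempfile
from pathlib import Path


def build_symlink_sandbox(repo_root: Path,
                          symlink_paths: list[str]) -> tempfile.TemporaryDirectory:
  """Create a temporary workspace whose entries link back to the real repo."""
  sandbox = tempfile.TemporaryDirectory(prefix="gemini_harness_")
  for rel in symlink_paths:
    target = (repo_root / rel).resolve()
    link = Path(sandbox.name) / rel
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target, target_is_directory=target.is_dir())
  return sandbox
```

Calling `cleanup()` on the returned object removes the links without touching their targets. One caveat: a file the agent writes *inside* a linked directory (for example under `fast/`) lands in the real repository, so the isolation only covers paths created at the workspace root.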
## Next Steps
1. Decide between Option 1 (`working_dir`) and Option 2 (`symlink_paths`).
2. Implement the chosen solution in `harness.py`.
3. Update `playbook.schema.json` and `README.md`.
4. Re-run the `gcp-dev-autonomous.yaml` E2E test to verify Phase 4 completes successfully.


@@ -0,0 +1,42 @@
# Design Decisions: Test Harness Architecture
## Context
This document captures architectural decisions and considerations for the `harness.py` test harness.
## LangChain Integration Analysis
*Date: April 15, 2026*
We evaluated whether to integrate LangChain into the `harness.py` script. The script currently acts as a lightweight testing harness that uses `subprocess` to interact with the Gemini CLI and the native `google.genai` SDK for evaluation using structured outputs (Pydantic).
### Potential Benefits of LangChain
1. **Model-Agnostic Evaluators (Avoiding Self-Bias):**
Currently, the harness uses Gemini 2.5 Flash to evaluate the Gemini CLI. To avoid "self-preference bias", it is often best practice to use a different model family for evaluation. LangChain's `ChatModel` abstractions would allow swapping the evaluator model easily without rewriting API call logic.
2. **Built-in Evaluation Frameworks:**
LangChain provides a dedicated evaluation module (`langchain.evaluation`). Instead of custom prompts, we could leverage pre-built evaluators (like `CriteriaEvalChain`) that are prompt-engineered to reduce hallucinations and false positives.
3. **Observability and Tracing (LangSmith):**
Integration provides seamless access to LangSmith for logging evaluation runs, inspecting prompts, latency, token usage, and tracking pass/fail rates over time.
4. **Prompt Management:**
LangChain's `PromptTemplate` system offers robust handling for complex evaluation criteria (e.g., few-shot examples, dynamic context).
### Drawbacks and Limitations
1. **Overkill for Current Scope:**
The current script is lightweight and readable. LangChain is a heavy dependency that introduces complex abstractions (like LCEL/Runnables), adding bloat and a steeper learning curve.
2. **Native Structured Outputs are Sufficient:**
The native `google.genai` SDK already handles structured JSON outputs via `response_schema=EvaluationResult` efficiently and reliably. LangChain's structured output would merely wrap this existing capability.
3. **External Agent Execution:**
LangChain excels at managing agent memory, tools, and reasoning loops. Since our harness tests an external CLI tool via `subprocess.run`, LangChain cannot orchestrate the agent and is relegated strictly to the role of a grader.
### Conclusion & Recommendation
**Recommendation: Hold off on LangChain for now.**
The current architecture is elegant, dependency-light, and perfectly suited for its job. The native `google.genai` SDK handles the structured Pydantic evaluation flawlessly.
**When to reconsider LangChain:**
- We need to evaluate the CLI using non-Google models (e.g., Claude, GPT-4) to ensure unbiased grading.
- We require visual tracking of test runs, prompt versions, and token costs using LangSmith.


@@ -0,0 +1,92 @@
# **Architecture Document: Hybrid Python/CLI Test Harness**
This document outlines the architecture for testing the Fabric FAST configuration skill. It uses a hybrid approach, executing the skill in its native CLI environment while maintaining deterministic control via a Python orchestration loop.
## **1. The Approach: Hybrid Isolation**
To accurately test the skill in its target environment while ensuring the reliability of the test harness, the execution and evaluation layers are strictly separated:
* **Orchestrator (Python):** A Python script acts as the absolute authority. It maintains the state machine, reads the playbook, injects inputs, captures outputs, and triggers evaluations.
* **Execution Target (Gemini CLI):** The skill is run via the `gemini` CLI using Python's `subprocess` module. This ensures the test reflects the actual user environment. State is maintained across steps using the CLI's session management flags (e.g., `--resume`).
* **Evaluator (Gemini API):** The semantic evaluation of the CLI's output is performed via direct API calls to Gemini 1.5 Flash. This bypasses the string-parsing unreliability of a CLI and guarantees structured JSON output via Pydantic schemas.
## **2. The Execution Loop**
The Python orchestrator executes the following rigid sequence for each step in a defined playbook:
1. **Injection:** Read the mocked user input and expected outcome from the playbook step.
2. **Subprocess Execution:** Invoke the Gemini CLI with the user input and the designated `session_id`. Capture stdout and trap stderr to handle hangs or crashes.
3. **Prompt Assembly:** Construct a strict evaluation prompt combining the exact playbook expectation with the raw string response captured from the CLI.
4. **Stateless Evaluation:** Call the Gemini API with the evaluation prompt, enforcing a structured output schema (Boolean Pass/Fail and Reasoning).
5. **Verdict Enforcement:** If the evaluator returns True, proceed to the next step. If False, immediately halt the loop, dump the interaction trace to a JSON file, and alert the developer.
## **3. Implementation Code**
The following Python script implements the hybrid harness:
```python
import json
import subprocess
import sys

from google import genai
from google.genai import types
from pydantic import BaseModel


# 1. Define the strict evaluator schema.
class EvaluationResult(BaseModel):
  passed: bool
  reasoning: str


# The client picks up the API key from the environment.
evaluator_client = genai.Client()


def invoke_skill_cli(user_input: str, session_id: str) -> str:
  # Requires the CLI to support a session resume flag for state.
  command = ["gemini", "--resume", session_id, "-p", user_input]
  try:
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
      print(f"⚠️ [CLI ERROR]: {result.stderr}", file=sys.stderr)
      return f"SYSTEM_ERROR: {result.stderr}"
    return result.stdout.strip()
  except subprocess.TimeoutExpired:
    print("⚠️ [CLI TIMEOUT]", file=sys.stderr)
    return "SYSTEM_ERROR: Timeout"


def run_hybrid_tuning_loop(playbook_name: str, playbook_steps: list,
                           session_id: str) -> bool:
  print(f"--- Tuning: {playbook_name} | Session: {session_id} ---")
  interaction_log = []
  for step_index, step in enumerate(playbook_steps):
    user_input, expected_outcome = step["user_input"], step["expected_outcome"]
    skill_response = invoke_skill_cli(user_input, session_id)
    if skill_response.startswith("SYSTEM_ERROR"):
      # A CLI crash or timeout is a failed run; never fall through to SUCCESS.
      return False
    eval_prompt = f"""
    OBJECTIVE: {expected_outcome}
    ACTUAL RESPONSE: {skill_response}
    Evaluate if the agent fulfilled the objective.
    """
    eval_response = evaluator_client.models.generate_content(
        model="gemini-1.5-flash",
        contents=eval_prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=EvaluationResult,
            temperature=0.0,
        ),
    )
    parsed_eval = json.loads(eval_response.text)
    interaction_log.append({
        "step": step_index + 1,
        "input": user_input,
        "evaluation": parsed_eval,
    })
    if not parsed_eval["passed"]:
      print(f"❌ [FAILURE Step {step_index + 1}]: {parsed_eval['reasoning']}")
      # Dump the full interaction trace for debugging before halting.
      with open(f"{playbook_name}_failed.json", "w") as f:
        json.dump(interaction_log, f)
      return False
  print("✅ [SUCCESS]")
  return True
```
## **4. Critical Implementation Warnings**
* **Session Data Persistence:** The CLI likely persists session states to disk (e.g., in a local database or JSON file). If you reuse the same `session_id` for consecutive test runs without manually deleting the cache file, the skill will inherit the context of the previous run, causing immediate test failures. You must either generate a UUID for every run or build a cache-clearing mechanism into the Python script.
* **Context Window Discipline:** The evaluation prompt is strictly limited to the current playbook objective and the immediate CLI response. Do not feed the entire CLI conversation history to the Evaluator API, as this significantly increases the risk of hallucinated grading.
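The "generate a UUID for every run" advice can be sketched in a couple of lines. The helper name and the `harness-` prefix are illustrative assumptions.

```python
import uuid


def new_session_id(prefix: str = "harness") -> str:
  """Return a fresh session identifier so no run inherits earlier context."""
  return f"{prefix}-{uuid.uuid4()}"


if __name__ == "__main__":
  print(new_session_id())
```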

File diff suppressed because it is too large


@@ -0,0 +1,2 @@
*
!.gitignore


@@ -0,0 +1 @@
**/*.env


@@ -0,0 +1,78 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbook.schema.json
name: "FAST 0-org-setup Prereqs - GCD Custom Manual Flow (Autonomous)"
timeout: 300
agent_model: "gemini-3.5-flash"
evaluator_model: "gemini-3.5-flash"
tmpdir:
link_paths:
- fast
- modules
- tools
- skills
- .yamllint
- GEMINI.md
- AGENTS.md
- README.md
- FACTORIES.md
persona:
initial_user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
context: >
You are a GCP developer setting up FAST in a Google Cloud Dedicated (GCD) environment.
Your target environment is Google Cloud Dedicated (GCD).
You prefer the agent to output commands for you to execute manually.
Since the execution mode is manual, the agent will output commands for you to run. Whenever it does, pretend you executed them successfully and tell the agent to proceed (e.g., say "Done", "I ran it", or "OK" to continue).
You are using a Custom GCD Universe (not S3NS or Berlin). When asked about the universe, reply that it is "Custom". Do not volunteer the universe details all at once. Wait for the agent to ask for each attribute individually, and then provide:
- For Universe Web Domain: custom.cloud.domain
- For Universe API Domain: custom-apis.domain
- For Universe Name: custom-gcd
- For Universe Prefix: cust
- For Universe Region: u-custom-region1
Confirm the compiled list of 5 universe values is correct when presented.
You are NOT authenticated with Google Cloud. When the agent asks for the workforce pool audience string, provide '//iam.googleapis.com/locations/global/workforcePools/my-pool/providers/my-provider'. When the agent outputs WIF login commands, pretend you run them successfully and confirm you are authenticated.
You want to use a Single User for the Admin Principal (Approach B). When the agent asks you to run the command to get your current principal, provide 'principal://iam.googleapis.com/locations/global/workforcePools/my-pool/subject/my-user@custom.cloud.domain'.
When asked for Organization ID, provide the Org ID '1092874262642' directly (and state there is no domain).
When asked for Billing Account ID, provide "012345-6789AB-CDEF01".
Your access level to the billing account is Scenario 3 (No Access). Confirm you want to proceed despite the warnings.
You do not have a pre-existing project for the bootstrap project. When the agent instructs you to create one, tell it you created it and the Project ID is "my-custom-bootstrap-project".
Confirm the configuration dataset is 'classic-gcd'.
Your base location is automatically set to u-custom-region1.
Your local path for output files is custom-fast-config.
You do not have any additional static context values.
When the agent instructs you to run fast-links.sh, pretend you run it and it outputs the linking commands. Then pretend you run those linking commands successfully.
When asked to check for existing organization policies, pretend the command output returned "constraints/compute.disableSerialPortAccess".
Do not volunteer information unless explicitly asked. Answer only the question asked by the agent.
max_turns: 30
success_criteria:
llm_checks:
- "The agent explicitly provided the final wrap-up instructions containing the commands 'terraform init' and 'terraform apply'."
files_exist:
- "custom-fast-config/0-org-setup.auto.tfvars"
- "custom-fast-config/providers/0-org-setup-providers.tf"
- "custom-fast-config/data/0-org-setup/defaults.yaml"
files_contain:
"custom-fast-config/data/0-org-setup/defaults.yaml":
- "billing_account: 012345-6789AB-CDEF01"
- "id: 1092874262642"
- "domain: custom-apis.domain"
- "prefix: cust"
- "primary: u-custom-region1"
- "gcp-organization-admins: principal://iam.googleapis.com/locations/global/workforcePools/my-pool/subject/my-user@custom.cloud.domain"
"custom-fast-config/providers/0-org-setup-providers.tf":
- "universe_domain"
- "custom-apis.domain"
"custom-fast-config/0-org-setup.auto.tfvars":
- "org_policies_imports"


@@ -0,0 +1,62 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbook.schema.json
name: "FAST 0-org-setup Prereqs - Standard GCP Developer Flow (Autonomous)"
timeout: 300
agent_model: "gemini-3.5-flash"
evaluator_model: "gemini-3.5-flash"
tmpdir:
link_paths:
- fast
- modules
- tools
- skills
- .yamllint
- GEMINI.md
- AGENTS.md
- README.md
- FACTORIES.md
env:
- GROUP
persona:
initial_user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
context: >
You are a GCP developer setting up FAST.
Your target environment is Standard GCP.
You prefer the agent to execute commands automatically.
You are already authenticated with Google Cloud and your current identity is correct.
You want to use a Group for the Admin Principal (Approach A).
The group email is ${GROUP}. You confirm you are a member of this group.
When asked for Organization ID, provide the keyword "fast-test" to search. When the list is presented, select the option that corresponds to "01".
When asked for Billing Account ID, provide the keyword "fast" to search. When the list is presented, select the option for the "TI billing account".
Your access level to the billing account is Scenario 2 (Billing User).
You have a pre-existing project to use as the bootstrap project, and it is already set as the active project in gcloud. Confirm it is the correct project.
When asked about checking services, you want the agent to check which services are enabled.
You approve the IAM role assignments.
You want to use the 'classic' dataset.
Your base location is europe-west1, with no overrides.
Your local path for output files is custom-fast-config.
You do not have any additional static context values.
Do not volunteer information unless explicitly asked. Answer only the question asked by the agent.
max_turns: 30
success_criteria:
llm_checks:
- "The agent explicitly provided the final wrap-up instructions containing the commands 'terraform init' and 'terraform apply'."
tool_calls_contain:
run_shell_command:
- "gcloud organizations add-iam-policy-binding"
files_exist:
- "custom-fast-config/0-org-setup.auto.tfvars"
- "custom-fast-config/data/0-org-setup/defaults.yaml"


@@ -0,0 +1,97 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbook.schema.json
tmpdir:
  link_paths:
    - fast
    - modules
    - tools
    - skills
    - .yamllint
    - GEMINI.md
    - AGENTS.md
    - README.md
    - FACTORIES.md
env:
  - BILLING_KEYWORD
  - GROUP
  - ORG_KEYWORD
name: "FAST 0-org-setup Prereqs - Standard GCP Developer Flow"
steps:
  - user_input: "Hi, please activate the fast-0-org-setup-prereqs skill and guide me through the setup."
    expected_outcome: "The agent should confirm the guide's activation and ask the user about their target environment (e.g., Standard GCP vs GCD)."
  - user_input: "Standard GCP"
    expected_outcome: "The agent should acknowledge the environment and ask for the user's preference on how to execute commands (e.g., automatically vs manually)."
  - user_input: "Automatically"
    expected_outcome: "The agent should acknowledge the execution preference and ask about the user's current Google Cloud authentication status."
  - user_input: "Yes, I am already authenticated."
    expected_outcome: "The agent should verify the current authenticated principal (e.g., using gcloud) and ask the user to confirm if it is the correct identity."
  - user_input: "Yes, that is the correct identity."
    expected_outcome: "The agent should move to the Admin Principal step and ask the user to choose an approach (e.g., Group vs Single User)."
  - user_input: "Approach A, please."
    expected_outcome: "The agent should ask for the group email address."
  - user_input: "The group is ${GROUP}."
    expected_outcome: "The agent should explicitly ask the user to confirm that their current identity is already a member of this group."
  - user_input: "Yes, I confirm I am a member."
    expected_outcome: "The agent should ask the user to provide their Organization ID, offering to list them automatically."
  - user_input: "${ORG_KEYWORD}"
    expected_outcome: "The agent should list the matching organizations and ask the user to select one."
  - user_input: "1"
    expected_outcome: "The agent should acknowledge the selected Organization and ask the user to provide their Billing Account ID, offering to list them automatically."
  - user_input: "${BILLING_KEYWORD}"
    expected_outcome: "The agent should list the matching billing accounts and ask the user to select one."
  - user_input: "1"
    expected_outcome: "The agent should ask the user about their access level to the selected Billing Account (e.g., scenarios 1, 2, or 3)."
  - user_input: "Scenario 2 (Billing User)"
    expected_outcome: "The agent should note the limitations of this access level (no billing.admin role assigned) and propose the IAM role assignments to be made."
  - user_input: "Looks good, go ahead and assign the roles."
    expected_outcome: "The agent should execute the IAM role assignments and then ask if the user has a pre-existing project to use as the bootstrap project."
  - user_input: "Yes, I have a pre-existing project."
    expected_outcome: "The agent should ask if the pre-existing project is already set as the active project in gcloud."
  - user_input: "Yes, it is."
    expected_outcome: "The agent should fetch the current active Project ID, ask for confirmation, and offer to check/enable required APIs."
  - user_input: "Yes, that's the correct project. Yes, please check which services are enabled."
    expected_outcome: "The agent should check and enable necessary APIs, and then ask the user to select a configuration Dataset."
  - user_input: "I'll use the classic dataset."
    expected_outcome: "The agent should ask the user for a base location for the resources and if there are any overrides."
  - user_input: "europe-west1, no overrides."
    expected_outcome: "The agent should propose a local path for the output files and ask for confirmation."
  - user_input: "~/custom-fast-config"
    expected_outcome: "The agent should ask if the user wants to provide any additional static context values."
  - user_input: "No additional context."
    expected_outcome: "The agent should scaffold the local files (copying defaults, creating tfvars), validate them (e.g., yamllint), and then initiate the Organization Policy Import Check."
  - user_input: "Okay."
    expected_outcome: "The agent should process any existing org policies and provide the final wrap-up instructions for applying the Terraform."

@@ -0,0 +1,168 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the playbook."
},
"agent_model": {
"type": "string",
"description": "The model the Gemini CLI should use (e.g., gemini-2.5-pro)."
},
"evaluator_model": {
"type": "string",
"description": "The model the test harness uses to grade the test and simulate the user (e.g., gemini-2.5-flash)."
},
"tmpdir": {
"type": "object",
"description": "Configuration for running in a temporary isolated workspace with optional symlinks.",
"properties": {
"link_paths": {
"type": "array",
"description": "Relative paths to symlink from the host repository (current CWD) into the temporary workspace.",
"items": {
"type": "string"
}
}
},
"additionalProperties": false
},
"timeout": {
"type": "integer",
"description": "Timeout in seconds for each CLI invocation.",
"default": 60,
"minimum": 1
},
"env": {
"type": "array",
"description": "A list of environment variable names required by this playbook. These will be substituted in the steps and persona context.",
"items": {
"type": "string"
}
},
"steps": {
"type": "array",
"description": "The deterministic sequence of interactions between the user and the agent.",
"items": {
"type": "object",
"properties": {
"user_input": {
"type": "string",
"description": "The simulated input provided by the user."
},
"expected_outcome": {
"type": "string",
"description": "The expected response or behavior from the agent to evaluate against."
}
},
"required": [
"user_input",
"expected_outcome"
],
"additionalProperties": false
}
},
"persona": {
"type": "object",
"description": "Configuration for the autonomous LLM-simulated user.",
"properties": {
"initial_user_input": {
"type": "string",
"description": "The first input to send to the agent when starting in pure autonomous mode. Variables are interpolated."
},
"context": {
"type": "string",
"description": "Freeform instructions and knowledge base for the simulated user. Variables are interpolated."
},
"max_turns": {
"type": "integer",
"description": "The maximum number of conversation turns allowed in autonomous mode before forcing a failure.",
"default": 10,
"minimum": 1
},
"success_criteria": {
"type": "object",
"description": "The conditions that must be met for the autonomous flow to be considered complete and successful.",
"properties": {
"llm_checks": {
"type": "array",
"description": "Semantic checks evaluated by the LLM (e.g., 'The agent printed a final configuration summary').",
"items": {
"type": "string"
}
},
"flow_contains": {
"type": "array",
"description": "Literal strings that must appear somewhere in the combined CLI stdout.",
"items": {
"type": "string"
}
},
"files_exist": {
"type": "array",
"description": "A list of file paths (relative to the workspace) that must exist.",
"items": {
"type": "string"
}
},
"files_contain": {
"type": "object",
"description": "A mapping of file paths to a list of strings that must be found within them.",
"patternProperties": {
".*": {
"type": "array",
"items": {
"type": "string"
}
}
}
},
"tool_calls_contain": {
"type": "object",
"description": "A mapping of tool names to a list of strings that must be found within their arguments.",
"patternProperties": {
".*": {
"type": "array",
"items": {
"type": "string"
}
}
}
}
},
"additionalProperties": false
}
},
"required": [
"context",
"success_criteria"
],
"additionalProperties": false
}
},
"required": [
"name"
],
"anyOf": [
{
"required": ["steps"]
},
{
"required": ["persona"]
}
],
"if": {
"not": {
"required": ["steps"]
}
},
"then": {
"properties": {
"persona": {
"required": ["context", "success_criteria", "initial_user_input"]
}
}
},
"additionalProperties": false
}
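
The closing `anyOf` and `if`/`then` rules of this schema are easy to misread, so here is a minimal sketch of what they require, written as a plain Python function. `playbook_shape_errors` is a hypothetical helper used only for illustration; it is not part of the harness, which is not shown here and may well validate playbooks directly against the schema with the `jsonschema` package listed in the requirements.

```python
def playbook_shape_errors(playbook: dict) -> list[str]:
  """Return violations of the schema's top-level rules (illustrative only)."""
  errors = []
  if "name" not in playbook:
    errors.append("'name' is required")
  has_steps = "steps" in playbook
  has_persona = "persona" in playbook
  # anyOf: a playbook is scripted (steps), autonomous (persona), or both.
  if not (has_steps or has_persona):
    errors.append("one of 'steps' or 'persona' is required")
  if has_persona:
    required = ["context", "success_criteria"]
    # if/then: with no scripted steps, the persona must open the conversation.
    if not has_steps:
      required.append("initial_user_input")
    for key in required:
      if key not in playbook["persona"]:
        errors.append(f"persona.{key} is required")
  return errors
```

Under these rules a playbook that combines `steps` with a `persona` may omit `initial_user_input`, while a purely autonomous playbook may not.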

@@ -0,0 +1,6 @@
[pytest]
# Ensures the root directory is on the path so tests can import 'harness'.
# This allows the tool and its tests to be fully portable.
pythonpath = .
markers =
    e2e: mark a test as an end-to-end test that requires external APIs/CLI (deselect with '-m "not e2e"')

@@ -0,0 +1,6 @@
google-genai
pydantic
pyyaml
click
jsonschema
google-antigravity

@@ -0,0 +1,2 @@
MY_SECRET_ID=dummy-secret-12345
ANOTHER_VAR=europe-west1
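
For orientation, here is a minimal sketch of how a file like this can be parsed. The real `harness.load_env_file` is not shown in this section, so `parse_env_text` is a hypothetical stand-in; the only behaviour taken from the change itself is what `test_load_env_file` asserts: comment lines are skipped and a value keeps everything after the first `=`.

```python
def parse_env_text(text: str) -> dict[str, str]:
  """Parse KEY=VALUE lines, ignoring blanks and # comments (assumed format)."""
  values = {}
  for raw_line in text.splitlines():
    line = raw_line.strip()
    if not line or line.startswith("#") or "=" not in line:
      continue
    # Split on the first '=' only, so 'BAZ=qux=123' yields the value 'qux=123'.
    key, _, value = line.partition("=")
    values[key.strip()] = value.strip()
  return values
```

A loader built on this would then copy the resulting mapping into `os.environ`; whether existing variables are overwritten is a design choice this sketch leaves open.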

@@ -0,0 +1,25 @@
---
name: fast-setup-poc
description: 'A wizard to help users configure FAST (Fabric Architecture Setup Tool) step-by-step. Use this skill when asked to configure FAST, run the FAST wizard, or set up FAST.'
---
# FAST Setup Wizard
## Instructions
You are the FAST Setup Wizard. Your goal is to collect exactly 3 pieces of information from the user in this exact order:
1. Google Cloud Project ID
2. Preferred Region
3. Billing Account ID
Rules:
- Ask for exactly ONE piece of information at a time. Do not ask for the next piece until the user has provided the current one.
- Keep your responses extremely brief. Acknowledge the received information and ask the next question.
- For the Region, validate the user's input against the [supported regions](./references/extra_content.md). If invalid, ask again.
- Once all three pieces of information are collected, provide a final summary of the configuration.
- Do not execute any commands or write any files. Just collect the information and print the summary.
Example Workflow:
Wizard: "Hi, let's configure FAST. Please provide your Google Cloud Project ID."
User: "my-project-123"
Wizard: "Got it (my-project-123). Next, what is your preferred Region?"
...and so on.

@@ -0,0 +1,8 @@
# Supported Regions
For the FAST Setup Wizard, only the following regions are officially supported in this PoC:
- europe-west1
- us-central1
- asia-northeast1
Do not accept regions outside of this list.

@@ -0,0 +1,10 @@
---
name: tool-test-skill
description: 'A simple skill to test if the agent correctly executes tools.'
---
# Instructions
You are a simple file-creating agent. When the user asks you to create a file, you MUST use the `write_file` tool to create a file named `output.txt` in the current directory.
The content of the file must be exactly: `Hello World`
Once you have successfully executed the tool, tell the user that the file has been created.

@@ -0,0 +1,32 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "FAST Setup PoC - Autonomous"
env:
  - MY_SECRET_ID
persona:
  initial_user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
  context: >
    You are a GCP developer setting up FAST.
    Your Project ID is ${MY_SECRET_ID} and your region is europe-west1.
    Your Billing Account ID is 012345-6789AB-CDEF01.
    Do not volunteer information unless explicitly asked.
    If the agent asks for something not in your knowledge base, say you don't know.
  max_turns: 8
  success_criteria:
    llm_checks:
      - "The agent provided a final summary containing the correct project ID (${MY_SECRET_ID}), region (europe-west1), and billing account."
    flow_contains:
      - "012345-6789AB-CDEF01"

@@ -0,0 +1,32 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "Tool Call Test Autonomous"
timeout: 30
persona:
  initial_user_input: "Hi, please activate the tool-test-skill and create the file as instructed."
  context: |
    You are a user asking the agent to create a file.
    Wait for the agent to confirm the file has been created.
  max_turns: 3
  success_criteria:
    llm_checks:
      - "The agent confirmed the file was created."
    tool_calls_contain:
      write_file:
        - "output.txt"
        - "Hello World"
    files_exist:
      - "output.txt"
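
Unlike `llm_checks`, the `tool_calls_contain` and `files_exist` criteria in this playbook can be evaluated without a model. The sketch below shows one way such checks could work. Everything in it is an assumption made for illustration: `failed_checks` is a hypothetical name, tool calls are assumed to be available as `(tool_name, serialized_arguments)` pairs, and a required string is counted as found if it appears in the arguments of any single call to that tool. The harness implementation is not shown in this section and may differ.

```python
import os


def failed_checks(criteria: dict, tool_calls: list, workspace: str) -> list:
  """Return descriptions of deterministic criteria that were not met."""
  failures = []
  for tool, needles in criteria.get("tool_calls_contain", {}).items():
    # Collect the argument strings of every recorded call to this tool.
    arguments = [args for name, args in tool_calls if name == tool]
    for needle in needles:
      if not any(needle in args for args in arguments):
        failures.append(f"tool_calls_contain: {tool} never received {needle!r}")
  for rel_path in criteria.get("files_exist", []):
    # Paths are resolved relative to the workspace, as the schema describes.
    if not os.path.exists(os.path.join(workspace, rel_path)):
      failures.append(f"files_exist: {rel_path}")
  return failures
```

An empty list means all deterministic criteria passed; the semantic `llm_checks` would still need to be graded separately by the evaluator model.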

@@ -0,0 +1,28 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "FAST Setup PoC"
steps:
  - user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
    expected_outcome: "The agent should greet the user, confirm it is the FAST Setup Wizard, and ask for the Google Cloud Project ID."
  - user_input: "my-super-project-123"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
  - user_input: "europe-west1"
    expected_outcome: "The agent should acknowledge the Region and ask for the Billing Account ID."
  - user_input: "012345-6789AB-CDEF01"
    expected_outcome: "The agent should acknowledge the Billing Account ID and provide a final summary of the configuration containing all three pieces of information."

@@ -0,0 +1,30 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "FAST Setup PoC with Env"
env:
  - MY_SECRET_ID
steps:
  - user_input: "Hi, please activate the fast-setup-poc skill and let's configure FAST."
    expected_outcome: "The agent should greet the user, confirm it is the FAST Setup Wizard, and ask for the Google Cloud Project ID."
  - user_input: "${MY_SECRET_ID}"
    expected_outcome: "The agent should acknowledge the Project ID and ask for the preferred Region."
  - user_input: "europe-west1"
    expected_outcome: "The agent should acknowledge the Region and ask for the Billing Account ID."
  - user_input: "012345-6789AB-CDEF01"
    expected_outcome: "The agent should acknowledge the Billing Account ID and provide a final summary of the configuration containing all three pieces of information."
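
The `${MY_SECRET_ID}` placeholder in this playbook is replaced with the variable's value before the step is sent to the agent. A minimal sketch of that kind of substitution follows. `interpolate` is a hypothetical helper; the harness's own substitution code is not shown in this section, and the error message merely mirrors the one asserted in `test_parse_and_validate_env`.

```python
import re

# Matches ${NAME} placeholders; bare $NAME is deliberately left alone.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(text: str, env: dict[str, str]) -> str:
  """Replace every ${NAME} in text with env[NAME], failing on unknown names."""

  def replace(match: re.Match) -> str:
    name = match.group(1)
    if name not in env:
      raise ValueError(f"Missing required environment variables: {name}")
    return env[name]

  # A function replacement avoids backslash-escape surprises in the values.
  return _PLACEHOLDER.sub(replace, text)
```

Failing on an unknown name, rather than leaving the placeholder in place, keeps a misconfigured run from sending a literal `${...}` string to the agent.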

@@ -0,0 +1,311 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import json
import subprocess
import asyncio
from unittest.mock import patch, MagicMock, AsyncMock, PropertyMock
import pytest
from dataclasses import asdict
import harness
# --- Phase A: Data & Logging Unit Tests ---


def test_parse_and_validate_env(monkeypatch):
  playbook = {'env': ['TEST_KEY']}
  # Start from a known state in case TEST_KEY is set in the caller's shell.
  monkeypatch.delenv('TEST_KEY', raising=False)
  # Missing key raises error
  with pytest.raises(ValueError,
                     match='Missing required environment variables: TEST_KEY'):
    harness.parse_and_validate_env(playbook)
  # Present key succeeds
  monkeypatch.setenv('TEST_KEY', '123')
  result = harness.parse_and_validate_env(playbook)
  assert result['TEST_KEY'] == '123'


def test_step_data_serialization():
  step = harness.StepData(
      step_index=0,
      user_input='hello',
      expected_outcome='greet back',
      skill_response='hi',
      parsed_eval={
          'passed': True,
          'reasoning': 'ok'
      },
      is_system_error=False,
  )
  d = asdict(step)
  assert d['step_index'] == 0
  assert d['user_input'] == 'hello'
  assert d['expected_outcome'] == 'greet back'
  assert d['parsed_eval']['passed'] is True


def test_load_env_file(tmp_path):
  env_file = tmp_path / '.env'
  env_file.write_text('FOO=bar\n# comment\nBAZ=qux=123\n')
  harness.load_env_file(str(env_file))
  assert os.environ.get('FOO') == 'bar'
  assert os.environ.get('BAZ') == 'qux=123'


def test_markdown_logging(tmp_path):
  log_file = tmp_path / 'test_log.md'
  harness.init_markdown_log(str(log_file), 'Test Playbook')
  harness.log_step_to_markdown(
      md_log_path=str(log_file),
      step_index=0,
      user_input='input 1',
      expected_outcome='outcome 1',
      skill_response='response 1',
      parsed_eval={
          'passed': True,
          'reasoning': 'Good job'
      },
  )
  content = log_file.read_text()
  assert '# Interaction Log: Test Playbook' in content
  assert '## Step 1' in content
  assert '**User:**\n\ninput 1' in content
  assert '**Expected Outcome:**\n\noutcome 1' in content
  assert '**Agent:**\n\nresponse 1' in content
  assert '✅ PASS: Good job' in content


def test_dump_failed_log(tmp_path):
  interaction_log = [{'step': 1, 'error': 'test'}]
  harness.dump_failed_log(str(tmp_path), 'test-playbook-prefix',
                          interaction_log)
  failed_file = tmp_path / 'test-playbook-prefix_failed.json'
  assert failed_file.exists()
  data = json.loads(failed_file.read_text())
  assert len(data) == 1
  assert data[0]['error'] == 'test'
# --- Phase B: Execution Unit Tests (Mocked) ---


@patch('harness.genai.Client')
@patch('harness.Agent')
def test_run_hybrid_tuning_loop_mocked_success(mock_agent_class,
                                               mock_client_class, tmp_path):
  # Mock Conversation
  mock_conversation = MagicMock()
  mock_conversation.send = AsyncMock()

  async def mock_receive_steps():
    yield harness.agy_types.Step(type=harness.agy_types.StepType.TEXT_RESPONSE,
                                 source=harness.agy_types.StepSource.MODEL,
                                 target=harness.agy_types.StepTarget.USER,
                                 status=harness.agy_types.StepStatus.DONE,
                                 content="Mocked Agent Response")

  mock_conversation.receive_steps.return_value = mock_receive_steps()
  type(mock_conversation).last_response = PropertyMock(
      return_value="Mocked Agent Response")
  # Mock Agent
  mock_agent = MagicMock()
  mock_agent.conversation = mock_conversation
  mock_agent_class.return_value.__aenter__.return_value = mock_agent
  # Mock Evaluator
  mock_eval_client = MagicMock()
  mock_client_class.return_value = mock_eval_client
  mock_eval_response = MagicMock()
  mock_eval_response.text = '{"passed": true, "reasoning": "Mocked pass"}'
  mock_eval_client.models.generate_content.return_value = mock_eval_response
  # Playbook
  playbook_content = """
name: "Mocked Playbook"
steps:
  - user_input: "Hello"
    expected_outcome: "Greet"
"""
  playbook_file = tmp_path / "playbook.yaml"
  playbook_file.write_text(playbook_content)
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
  assert result is True
  mock_conversation.send.assert_called_once_with("Hello")
  mock_eval_client.models.generate_content.assert_called_once()


@patch('harness.genai.Client')
@patch('harness.Agent')
def test_run_hybrid_tuning_loop_mocked_timeout(mock_agent_class,
                                               mock_client_class, tmp_path):
  # Mock genai.Client
  mock_client_class.return_value = MagicMock()
  mock_conversation = MagicMock()
  mock_conversation.send = AsyncMock(side_effect=asyncio.TimeoutError())

  # Async generator that yields nothing: the agent never responds.
  async def empty_gen():
    if False:
      yield

  mock_conversation.receive_steps.return_value = empty_gen()
  mock_agent = MagicMock()
  mock_agent.conversation = mock_conversation
  mock_agent_class.return_value.__aenter__.return_value = mock_agent
  # Playbook
  playbook_content = """
name: "Mocked Playbook"
steps:
  - user_input: "Hello"
    expected_outcome: "Greet"
"""
  playbook_file = tmp_path / "playbook.yaml"
  playbook_file.write_text(playbook_content)
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
  assert result is False
  mock_conversation.send.assert_called_once_with("Hello")
  log_files = list(tmp_path.glob('*_log.md'))
  assert len(log_files) == 1
  content = log_files[0].read_text()
  assert 'SYSTEM_ERROR: Timeout' in content
# --- Phase C: E2E Test ---


@pytest.mark.e2e
def test_e2e_hybrid_tuning_loop(tmp_path):
  '''
  Runs the actual evaluation loop against the basic FAST Setup PoC skill.
  Uses tmp_path for log_dir so we don't pollute the actual workspace logs.
  '''
  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
  skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
  playbook_path = os.path.join(fixtures_dir,
                               'playbook_scripted_env_substitution.yaml')
  env_file_path = os.path.join(fixtures_dir, '.env.test')
  # Load env to prime the os.environ
  harness.load_env_file(env_file_path)
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
                                     skill_src=skill_dir))
  assert result is True
  # Verify the log file was created in the temporary directory
  log_files = list(tmp_path.glob('*_log.md'))
  assert len(log_files) == 1
  log_file = log_files[0]
  assert log_file.exists()
  content = log_file.read_text()
  assert '✅ PASS' in content
  # Verify substitution happened securely
  assert 'dummy-secret-12345' in content
  assert '${MY_SECRET_ID}' not in content


@pytest.mark.e2e
def test_e2e_autonomous_tuning_loop(tmp_path):
  '''
  Runs the autonomous evaluation loop against the basic FAST Setup PoC skill.
  '''
  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
  skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
  playbook_path = os.path.join(fixtures_dir,
                               'playbook_autonomous_conversation.yaml')
  env_file_path = os.path.join(fixtures_dir, '.env.test')
  harness.load_env_file(env_file_path)
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
                                     skill_src=skill_dir))
  assert result is True
  log_files = list(tmp_path.glob('*_log.md'))
  assert len(log_files) == 1
  content = log_files[0].read_text()
  # Check that the autonomous turns were logged
  assert '## Autonomous Turn 1' in content
  assert 'dummy-secret-12345' in content


@pytest.mark.e2e
def test_e2e_tool_calls_contain(tmp_path):
  '''
  Runs an autonomous evaluation loop to verify tool_calls_contain deterministic checks.
  '''
  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
  skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
  playbook_path = os.path.join(fixtures_dir,
                               'playbook_autonomous_tool_use.yaml')
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
                                     skill_src=skill_dir))
  assert result is True
  # Verify that the session JSON was saved
  session_files = list(tmp_path.glob('*_session.json'))
  assert len(session_files) == 1
  assert session_files[0].exists()


@pytest.mark.e2e
def test_e2e_working_dir(tmp_path):
  '''
  Runs an evaluation loop to verify working_dir functionality.
  '''
  fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
  skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
  # Create a specific subdirectory in tmp_path
  workdir_target = tmp_path / "workdir_target"
  workdir_target.mkdir()
  # Dynamically create a playbook YAML file
  playbook_content = f"""# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "Tool Test with Workdir"
working_dir: "{workdir_target.resolve()}"
steps:
  - user_input: "Hi, please activate tool-test-skill and create the file output.txt."
    expected_outcome: "The agent confirms it has created the file."
"""
  playbook_path = tmp_path / "playbook_workdir.yaml"
  playbook_path.write_text(playbook_content)
  result = asyncio.run(
      harness.run_hybrid_tuning_loop(str(playbook_path), log_dir=str(tmp_path),
                                     skill_src=skill_dir))
  assert result is True
  # Verify that output.txt was created INSIDE workdir_target
  output_file = workdir_target / "output.txt"
  assert output_file.exists()
  assert output_file.read_text().strip() == "Hello World"