Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979)

* initial version of a FAST pre-install skill
* first round of testing
* Update fast-0-org-setup-prereqs skill with improved UX and local path handling
  - Add explicit lockout warning and stop condition if the user is not a member of the provided Admin Principal group.
  - Streamline bootstrap project selection to only prompt for an override if the active gcloud project is rejected.
  - Restrict dataset discovery strictly to the `fast/stages/0-org-setup/datasets/` directory.
  - Improve location handling by referencing `defaults.schema.json` for Standard GCP and auto-configuring fixed regions for GCD.
  - Add comprehensive `local_path` management: prompt for customization, create directories, move `defaults.yaml` to the local data folder, and symlink `0-org-setup.auto.tfvars` back to the stage directory.
* add testing scenarios, implement initial changes for scenario 2
* move skills
* move to a skills/fast subfolder
* Refactor fast-0-org-setup prereqs skill
* Add skill-turn-harness utility tool
* Use relative markdown links for skill references
* Use descriptive titles for markdown links in skill references
* Add descriptions to each phase in the prerequisites workflow map
* Use backslash for markdown line breaks in skill map
* Update README security warning to mention default .gitignore
* shebang
* Update fast prereqs skill rules to force sequential question flow and refine harness tool with proper ctrl+c handling and slugified log paths
* Move playbook-gcp-dev.yaml to fast/prerequisites/gcp-dev.yaml and update fast prerequisites
* docs(skill-turn-harness): detail autonomous pond testing approach
* docs(skill-turn-harness): add final_state_checks to pond architecture and update toc
* Refine fast prereqs SKILL and gcp-dev playbook to strictly align with one-question-at-a-time rule
* feat(skill-turn-harness): update playbook schema for autonomous persona mode
* feat(skill-turn-harness): implement autonomous persona testing mode and fallback logic
* docs(skill-turn-harness): document the three modes of testing and update ToC
* implement timeout, schema validation, configurable cli
* chore: remove accidentally committed log files
* chore: ignore logs directory
* feat(skill-harness): implement tool execution interception, configurable workspace, and modularized validation
* feat(skill-harness): add model configuration and update README
* fix(skill-harness): automatically inject -y flag to gemini commands
* docs(skill-harness): add TODO.md with analysis for skill environment dependencies
* feat(skill-harness): add working_dir support and clean up fixtures
  - Implement working_dir in harness to run tests in specific directories.
  - Rename test fixtures and playbooks to be more descriptive.
  - Add E2E test for working_dir.
  - Apply code quality improvements to harness.py (imports, linting).
  - Update README with working directory considerations and usage notes.
  - Update phase3-bootstrap-and-iam.md skill doc to add execution rule against creating temp scripts.
* fix: capture customer_id and respect relative paths
* Implement isolated temp workspace sandboxing with symlinks in test harness
* Configure GCD manual autonomous playbook and align Phase 3/4 steps order
* Fix linting and schema tests failures
  - Add missing license headers to tools/skill-turn-harness files.
  - Fix trailing spaces and newlines in playbooks.
  - Ignore tools directory in schema tests workflow.

  TAG=agy CONV=1bb75453-c3e2-448b-bae9-8e332a068012
* Fix Python formatting with yapf

  TAG=agy CONV=1bb75453-c3e2-448b-bae9-8e332a068012
* Refactor skill-turn-harness to use Antigravity SDK
  - Migrated harness from gemini-cli subprocesses to Antigravity SDK.
  - Implemented real-time step streaming and console logging.
  - Added color-coded terminal output (dark gray headers, blue inputs, pink outputs).
  - Collapsed excessive newlines in streamed thoughts.
  - Excluded harness codebase from workspace copy to prevent agent cheating.
  - Enabled skills folder copy to resolve agent lookup loops.
  - Added key validation and CLI --debug flag.
* Fix autonomous turn layout: print Turn ID before execution
  - Moved the [Autonomous Turn X] header print to before running the agent turn.
  - This groups the real-time thinking and tool calls under the correct Turn ID block, instead of displaying them before the label.
* Remove obsolete .log.md from prerequisites skill directory
2026-05-22 19:16:54 +02:00
parent 1594a01c6f
commit 81f72e8068
32 changed files with 2653 additions and 1 deletions
--- a/tools/skill-turn-harness/docs/DESIGN.md
+++ b/tools/skill-turn-harness/docs/DESIGN.md
@@ -0,0 +1,42 @@
+# Design Decisions: Test Harness Architecture
+
+## Context
+
+This document captures architectural decisions and considerations for the `harness.py` test harness.
+
+## LangChain Integration Analysis
+
+*Date: April 15, 2026*
+
+We evaluated whether to integrate LangChain into the `harness.py` script. The script currently acts as a lightweight testing harness that uses `subprocess` to interact with the Gemini CLI and the native `google.genai` SDK for evaluation using structured outputs (Pydantic).
+
+### Potential Benefits of LangChain
+
+1. **Model-Agnostic Evaluators (Avoiding Self-Bias):**
+   Currently, the harness uses Gemini 2.5 Flash to evaluate the Gemini CLI. Grading with a model from the same family risks "self-preference bias", so a common practice is to use a different model family for evaluation. LangChain's chat model abstractions (built on `BaseChatModel`) would allow swapping the evaluator model without rewriting the API call logic.
+2. **Built-in Evaluation Frameworks:**
+   LangChain provides a dedicated evaluation module (`langchain.evaluation`). Instead of writing and maintaining custom grading prompts, we could use pre-built evaluators (like `CriteriaEvalChain`) whose prompts are designed to reduce hallucinations and false positives.
+3. **Observability and Tracing (LangSmith):**
+   Integration gives access to LangSmith for logging evaluation runs, inspecting prompts, latency, and token usage, and tracking pass/fail rates over time.
+4. **Prompt Management:**
+   LangChain's `PromptTemplate` system offers robust handling for complex evaluation criteria (e.g., few-shot examples, dynamic context).
+
+### Drawbacks and Limitations
+
+1. **Overkill for Current Scope:**
+   The current script is lightweight and readable. LangChain is a heavy dependency that introduces complex abstractions (like LCEL/Runnables), adding bloat and a steeper learning curve.
+2. **Native Structured Outputs are Sufficient:**
+   The native `google.genai` SDK already handles structured JSON outputs via `response_schema=EvaluationResult` efficiently and reliably. LangChain's structured output would merely wrap this existing capability.
+3. **External Agent Execution:**
+   LangChain excels at managing agent memory, tools, and reasoning loops. Since our harness tests an external CLI tool via `subprocess.run`, LangChain cannot orchestrate the agent and is relegated strictly to the role of a grader.
+
+### Conclusion & Recommendation
+
+**Recommendation: Hold off on LangChain for now.**
+
+The current architecture is small, has few dependencies, and covers what the harness needs today. The native `google.genai` SDK already handles the structured Pydantic evaluation without an extra abstraction layer.
+
+**When to reconsider LangChain:**
+
+- We need to evaluate the CLI using non-Google models (e.g., Claude, GPT-4) to reduce self-preference bias in grading.
+- We require visual tracking of test runs, prompt versions, and token costs using LangSmith.