hunfabric/tools/skill-turn-harness/docs/DESIGN.md
Ludovico Magnocavallo 81f72e8068 Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979)
2026-05-22 17:16:54 +00:00


Design Decisions: Test Harness Architecture

Context

This document captures architectural decisions and considerations for the harness.py test harness.

LangChain Integration Analysis

Date: April 15, 2026

We evaluated whether to integrate LangChain into the harness.py script. The script is currently a lightweight test harness: it drives the Gemini CLI through subprocess and grades the results with the native google.genai SDK, using structured outputs backed by Pydantic models.
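The run-then-grade flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in harness.py: the names `run_cli`, `run_turn`, `keyword_grader`, and `DEMO_COMMAND` are hypothetical, the `EvaluationResult` fields are assumed, a dataclass stands in for the Pydantic model, and a keyword check stands in for the google.genai grading call. Passing the prompt on stdin is also an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvaluationResult:
    # Assumed fields; the real Pydantic model in harness.py may differ.
    passed: bool
    reasoning: str


# A grader takes (transcript, criteria) and returns a structured verdict.
Grader = Callable[[str, str], EvaluationResult]

# Stand-in for the real CLI: a Python one-liner that echoes its stdin.
DEMO_COMMAND = [sys.executable, "-c", "import sys; print('echo: ' + sys.stdin.read())"]


def run_cli(command: Sequence[str], prompt: str, timeout: float = 60.0) -> str:
    """Run the tool under test once, feeding the prompt on stdin, and return stdout."""
    completed = subprocess.run(
        list(command),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if the tool hangs
        check=True,       # raises CalledProcessError on a non-zero exit code
    )
    return completed.stdout


def keyword_grader(transcript: str, criteria: str) -> EvaluationResult:
    """Stand-in grader: passes when the criteria text appears in the transcript."""
    found = criteria.lower() in transcript.lower()
    reasoning = "criteria found in output" if found else "criteria missing from output"
    return EvaluationResult(passed=found, reasoning=reasoning)


def run_turn(command: Sequence[str], prompt: str, criteria: str, grader: Grader) -> EvaluationResult:
    """Execute one turn against the external tool, then hand the output to the grader."""
    transcript = run_cli(command, prompt)
    return grader(transcript, criteria)
```

Calling `run_turn(DEMO_COMMAND, "hello project", "hello", keyword_grader)` runs the echo process and grades its output. In the real harness the grader would be the model-backed evaluation step rather than a keyword match.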

Potential Benefits of LangChain

  1. Model-Agnostic Evaluators (Avoiding Self-Bias): Currently, the harness uses Gemini 2.5 Flash to evaluate the Gemini CLI. To avoid "self-preference bias", it is often best practice to use a different model family for evaluation. LangChain's ChatModel abstractions would allow swapping the evaluator model easily without rewriting API call logic.
  2. Built-in Evaluation Frameworks: LangChain provides a dedicated evaluation module (langchain.evaluation). Instead of custom prompts, we could leverage pre-built evaluators (like CriteriaEvalChain) that are prompt-engineered to reduce hallucinations and false positives.
  3. Observability and Tracing (LangSmith): Integrating LangChain gives direct access to LangSmith for logging evaluation runs, inspecting prompts, latency, and token usage, and tracking pass/fail rates over time.
  4. Prompt Management: LangChain's PromptTemplate system offers robust handling for complex evaluation criteria (e.g., few-shot examples, dynamic context).

Drawbacks and Limitations

  1. Overkill for Current Scope: The current script is lightweight and readable. LangChain is a heavy dependency that introduces complex abstractions (like LCEL/Runnables), adding bloat and a steeper learning curve.
  2. Native Structured Outputs are Sufficient: The native google.genai SDK already handles structured JSON outputs via response_schema=EvaluationResult efficiently and reliably. LangChain's structured output would merely wrap this existing capability.
  3. External Agent Execution: LangChain excels at managing agent memory, tools, and reasoning loops. Since our harness tests an external CLI tool via subprocess.run, LangChain cannot orchestrate the agent and is relegated strictly to the role of a grader.
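Because the evaluator only ever acts as a grader, the boundary it has to satisfy is small. The sketch below shows one way that boundary could look in plain Python: a grader returns JSON, which is parsed into a typed result. Everything here is illustrative. `Grader`, `CannedGrader`, `parse_result`, and `grade_with` are hypothetical names, the `EvaluationResult` fields are assumed, and a dataclass stands in for the Pydantic model so the example needs only the standard library.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EvaluationResult:
    # Assumed fields; the real schema passed as response_schema may differ.
    passed: bool
    reasoning: str


class Grader(Protocol):
    """Anything that can turn a transcript plus criteria into a verdict."""

    def grade(self, transcript: str, criteria: str) -> EvaluationResult: ...


def parse_result(raw_json: str) -> EvaluationResult:
    """Validate a grader's JSON reply and convert it to a typed result."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("grader reply must be a JSON object")
    if not isinstance(data.get("passed"), bool):
        raise ValueError("'passed' must be a boolean")
    return EvaluationResult(passed=data["passed"], reasoning=str(data.get("reasoning", "")))


class CannedGrader:
    """Stand-in for a model-backed grader: always returns a fixed JSON verdict."""

    def __init__(self, raw_json: str) -> None:
        self._raw_json = raw_json

    def grade(self, transcript: str, criteria: str) -> EvaluationResult:
        return parse_result(self._raw_json)


def grade_with(grader: Grader, transcript: str, criteria: str) -> EvaluationResult:
    """The harness depends only on the Grader protocol, not on a specific model SDK."""
    return grader.grade(transcript, criteria)
```

A grader backed by a different model family would implement the same `grade` method, which is the part of the harness that a framework such as LangChain would otherwise abstract.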

Conclusion & Recommendation

Recommendation: Hold off on LangChain for now.

The current architecture is dependency-light and well suited to its job: the native google.genai SDK already handles the structured Pydantic evaluation without an extra abstraction layer.

When to reconsider LangChain:

  • We need to evaluate the CLI using non-Google models (e.g., Claude, GPT-4) to ensure unbiased grading.
  • We require visual tracking of test runs, prompt versions, and token costs using LangSmith.