Files
hunfabric/tools/skill-turn-harness/test/test_harness.py
Ludovico Magnocavallo 81f72e8068 Add FAST Prerequisites Skill and Gemini Skill Test Harness (#3979)
* initial version of a FAST pre-install skill

* first round of testing

* Update fast-0-org-setup-prereqs skill with improved UX and local path handling

- Add explicit lockout warning and stop condition if the user is not a member of the provided Admin Principal group.
- Streamline bootstrap project selection to only prompt for an override if the active gcloud project is rejected.
- Restrict dataset discovery strictly to the `fast/stages/0-org-setup/datasets/` directory.
- Improve location handling by referencing `defaults.schema.json` for Standard GCP and auto-configuring fixed regions for GCD.
- Add comprehensive `local_path` management: prompt for customization, create directories, move `defaults.yaml` to the local data folder, and symlink `0-org-setup.auto.tfvars` back to the stage directory.

* add testing scenarios, implement initial changes for scenario 2

* move skills

* move to a skills/fast subfolder

* Refactor fast-0-org-setup prereqs skill

* Add skill-turn-harness utility tool

* Use relative markdown links for skill references

* Use descriptive titles for markdown links in skill references

* Add descriptions to each phase in the prerequisites workflow map

* Use backslash for markdown line breaks in skill map

* Update README security warning to mention default .gitignore

* shebang

* Update fast prereqs skill rules to force sequential question flow and refine harness tool with proper ctrl+c handling and slugified log paths

* Move playbook-gcp-dev.yaml to fast/prerequisites/gcp-dev.yaml and update fast prerequisites

* docs(skill-turn-harness): detail autonomous pond testing approach

* docs(skill-turn-harness): add final_state_checks to pond architecture and update toc

* Refine fast prereqs SKILL and gcp-dev playbook to strictly align with one-question-at-a-time rule

* feat(skill-turn-harness): update playbook schema for autonomous persona mode

* feat(skill-turn-harness): implement autonomous persona testing mode and fallback logic

* docs(skill-turn-harness): document the three modes of testing and update ToC

* implement timeout, schema validation, configurable cli

* chore: remove accidentally committed log files

* chore: ignore logs directory

* feat(skill-harness): implement tool execution interception, configurable workspace, and modularized validation

* feat(skill-harness): add model configuration and update README

* fix(skill-harness): automatically inject -y flag to gemini commands

* docs(skill-harness): add TODO.md with analysis for skill environment dependencies

* feat(skill-harness): add working_dir support and clean up fixtures

- Implement working_dir in harness to run tests in specific directories.
- Rename test fixtures and playbooks to be more descriptive.
- Add E2E test for working_dir.
- Apply code quality improvements to harness.py (imports, linting).
- Update README with working directory considerations and usage notes.
- Update phase3-bootstrap-and-iam.md skill doc to add execution rule against creating temp scripts.

* fix: capture customer_id and respect relative paths

* Implement isolated temp workspace sandboxing with symlinks in test harness

* Configure GCD manual autonomous playbook and align Phase 3/4 steps order

* Fix linting and schema tests failures

- Add missing license headers to tools/skill-turn-harness files.

- Fix trailing spaces and newlines in playbooks.

- Ignore tools directory in schema tests workflow.

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Fix Python formatting with yapf

TAG=agy

CONV=1bb75453-c3e2-448b-bae9-8e332a068012

* Refactor skill-turn-harness to use Antigravity SDK

- Migrated harness from gemini-cli subprocesses to Antigravity SDK.
- Implemented real-time step streaming and console logging.
- Added color-coded terminal output (dark gray headers, blue inputs, pink outputs).
- Collapsed excessive newlines in streamed thoughts.
- Excluded harness codebase from workspace copy to prevent agent cheating.
- Enabled skills folder copy to resolve agent lookup loops.
- Added key validation and CLI --debug flag.

* Fix autonomous turn layout: print Turn ID before execution

- Moved the [Autonomous Turn X] header print to before running the agent turn.
- This groups the real-time thinking and tool calls under the correct Turn ID block, instead of displaying them before the label.

* Remove obsolete .log.md from prerequisites skill directory
2026-05-22 17:16:54 +00:00

312 lines
10 KiB
Python

# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import json
import subprocess
import asyncio
from unittest.mock import patch, MagicMock, AsyncMock, PropertyMock
import pytest
from dataclasses import asdict
import harness
# --- Phase A: Data & Logging Unit Tests ---
def test_parse_and_validate_env(monkeypatch):
playbook = {'env': ['TEST_KEY']}
# Missing key raises error
with pytest.raises(ValueError,
match='Missing required environment variables: TEST_KEY'):
harness.parse_and_validate_env(playbook)
# Present key succeeds
monkeypatch.setenv('TEST_KEY', '123')
result = harness.parse_and_validate_env(playbook)
assert result['TEST_KEY'] == '123'
def test_step_data_serialization():
step = harness.StepData(
step_index=0,
user_input='hello',
expected_outcome='greet back',
skill_response='hi',
parsed_eval={
'passed': True,
'reasoning': 'ok'
},
is_system_error=False,
)
d = asdict(step)
assert d['step_index'] == 0
assert d['user_input'] == 'hello'
assert d['expected_outcome'] == 'greet back'
assert d['parsed_eval']['passed'] is True
def test_load_env_file(tmp_path):
env_file = tmp_path / '.env'
env_file.write_text('FOO=bar\n# comment\nBAZ=qux=123\n')
harness.load_env_file(str(env_file))
assert os.environ.get('FOO') == 'bar'
assert os.environ.get('BAZ') == 'qux=123'
def test_markdown_logging(tmp_path):
log_file = tmp_path / 'test_log.md'
harness.init_markdown_log(str(log_file), 'Test Playbook')
harness.log_step_to_markdown(
md_log_path=str(log_file),
step_index=0,
user_input='input 1',
expected_outcome='outcome 1',
skill_response='response 1',
parsed_eval={
'passed': True,
'reasoning': 'Good job'
},
)
content = log_file.read_text()
assert '# Interaction Log: Test Playbook' in content
assert '## Step 1' in content
assert '**User:**\n\ninput 1' in content
assert '**Expected Outcome:**\n\noutcome 1' in content
assert '**Agent:**\n\nresponse 1' in content
assert '✅ PASS: Good job' in content
def test_dump_failed_log(tmp_path):
interaction_log = [{'step': 1, 'error': 'test'}]
harness.dump_failed_log(str(tmp_path), 'test-playbook-prefix',
interaction_log)
failed_file = tmp_path / 'test-playbook-prefix_failed.json'
assert failed_file.exists()
data = json.loads(failed_file.read_text())
assert len(data) == 1
assert data[0]['error'] == 'test'
# --- Phase B: Execution Unit Tests (Mocked) ---
@patch('harness.genai.Client')
@patch('harness.Agent')
def test_run_hybrid_tuning_loop_mocked_success(mock_agent_class,
mock_client_class, tmp_path):
# Mock Conversation
mock_conversation = MagicMock()
mock_conversation.send = AsyncMock()
async def mock_receive_steps():
yield harness.agy_types.Step(type=harness.agy_types.StepType.TEXT_RESPONSE,
source=harness.agy_types.StepSource.MODEL,
target=harness.agy_types.StepTarget.USER,
status=harness.agy_types.StepStatus.DONE,
content="Mocked Agent Response")
mock_conversation.receive_steps.return_value = mock_receive_steps()
type(mock_conversation).last_response = PropertyMock(
return_value="Mocked Agent Response")
# Mock Agent
mock_agent = MagicMock()
mock_agent.conversation = mock_conversation
mock_agent_class.return_value.__aenter__.return_value = mock_agent
# Mock Evaluator
mock_eval_client = MagicMock()
mock_client_class.return_value = mock_eval_client
mock_eval_response = MagicMock()
mock_eval_response.text = '{"passed": true, "reasoning": "Mocked pass"}'
mock_eval_client.models.generate_content.return_value = mock_eval_response
# Playbook
playbook_content = """
name: "Mocked Playbook"
steps:
- user_input: "Hello"
expected_outcome: "Greet"
"""
playbook_file = tmp_path / "playbook.yaml"
playbook_file.write_text(playbook_content)
import asyncio
result = asyncio.run(
harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
assert result is True
mock_conversation.send.assert_called_once_with("Hello")
mock_eval_client.models.generate_content.assert_called_once()
@patch('harness.genai.Client')
@patch('harness.Agent')
def test_run_hybrid_tuning_loop_mocked_timeout(mock_agent_class,
mock_client_class, tmp_path):
# Mock genai.Client
mock_client_class.return_value = MagicMock()
import asyncio
mock_conversation = MagicMock()
mock_conversation.send = AsyncMock(side_effect=asyncio.TimeoutError())
async def empty_gen():
if False:
yield
mock_conversation.receive_steps.return_value = empty_gen()
mock_agent = MagicMock()
mock_agent.conversation = mock_conversation
mock_agent_class.return_value.__aenter__.return_value = mock_agent
# Playbook
playbook_content = """
name: "Mocked Playbook"
steps:
- user_input: "Hello"
expected_outcome: "Greet"
"""
playbook_file = tmp_path / "playbook.yaml"
playbook_file.write_text(playbook_content)
result = asyncio.run(
harness.run_hybrid_tuning_loop(str(playbook_file), log_dir=str(tmp_path)))
assert result is False
mock_conversation.send.assert_called_once_with("Hello")
log_files = list(tmp_path.glob('*_log.md'))
assert len(log_files) == 1
content = log_files[0].read_text()
assert 'SYSTEM_ERROR: Timeout' in content
# --- Phase C: E2E Test ---
@pytest.mark.e2e
def test_e2e_hybrid_tuning_loop(tmp_path):
'''
Runs the actual evaluation loop against the basic FAST Setup PoC skill.
Uses tmp_path for log_dir so we don't pollute the actual workspace logs.
'''
fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
playbook_path = os.path.join(fixtures_dir,
'playbook_scripted_env_substitution.yaml')
env_file_path = os.path.join(fixtures_dir, '.env.test')
# Load env to prime the os.environ
harness.load_env_file(env_file_path)
result = asyncio.run(
harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
skill_src=skill_dir))
assert result is True
# Verify the log file was created in the temporary directory
log_files = list(tmp_path.glob('*_log.md'))
assert len(log_files) == 1
log_file = log_files[0]
assert log_file.exists()
content = log_file.read_text()
assert '✅ PASS' in content
# Verify substitution happened securely
assert 'dummy-secret-12345' in content
assert '${MY_SECRET_ID}' not in content
@pytest.mark.e2e
def test_e2e_autonomous_tuning_loop(tmp_path):
'''
Runs the autonomous evaluation loop against the basic FAST Setup PoC skill.
'''
fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
skill_dir = os.path.join(fixtures_dir, 'mock-conversation-skill')
playbook_path = os.path.join(fixtures_dir,
'playbook_autonomous_conversation.yaml')
env_file_path = os.path.join(fixtures_dir, '.env.test')
harness.load_env_file(env_file_path)
result = asyncio.run(
harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
skill_src=skill_dir))
assert result is True
log_files = list(tmp_path.glob('*_log.md'))
assert len(log_files) == 1
content = log_files[0].read_text()
# Check that the autonomous turns were logged
assert '## Autonomous Turn 1' in content
assert 'dummy-secret-12345' in content
@pytest.mark.e2e
def test_e2e_tool_calls_contain(tmp_path):
'''
Runs an autonomous evaluation loop to verify tool_calls_contain deterministic checks.
'''
fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
playbook_path = os.path.join(fixtures_dir,
'playbook_autonomous_tool_use.yaml')
result = asyncio.run(
harness.run_hybrid_tuning_loop(playbook_path, log_dir=str(tmp_path),
skill_src=skill_dir))
assert result is True
# Verify that the session JSON was saved
session_files = list(tmp_path.glob('*_session.json'))
assert len(session_files) == 1
assert session_files[0].exists()
@pytest.mark.e2e
def test_e2e_working_dir(tmp_path):
'''
Runs an evaluation loop to verify working_dir functionality.
'''
fixtures_dir = os.path.join(os.path.dirname(__file__), 'fixtures')
skill_dir = os.path.join(fixtures_dir, 'mock-tool-use-skill')
# Create a specific subdirectory in tmp_path
workdir_target = tmp_path / "workdir_target"
workdir_target.mkdir()
# Dynamically create a playbook YAML file
playbook_content = f"""# yaml-language-server: $schema=../../playbooks/playbook.schema.json
name: "Tool Test with Workdir"
working_dir: "{workdir_target.resolve()}"
steps:
- user_input: "Hi, please activate tool-test-skill and create the file output.txt."
expected_outcome: "The agent confirms it has created the file."
"""
playbook_path = tmp_path / "playbook_workdir.yaml"
playbook_path.write_text(playbook_content)
result = asyncio.run(
harness.run_hybrid_tuning_loop(str(playbook_path), log_dir=str(tmp_path),
skill_src=skill_dir))
assert result is True
# Verify that output.txt was created INSIDE workdir_target
output_file = workdir_target / "output.txt"
assert output_file.exists()
assert output_file.read_text().strip() == "Hello World"