Files
hunfabric/adrs/modules/20260323-compute-vm-refactoring.md
Ludovico Magnocavallo a4eb4d24fd Compute VM module refactor (#3805)
* add ad for compute-vm refactor

* Exclue nic_type from validated fields, add split of main.tf and template.tf

* boot disk

* fix examples and fixtures

* attached disks

* fix further examples and module-level tests

* remove extra file

* fix mig examples

* finish refactoring variables

* align fast and other modules

* refactor(compute-vm): align examples and ADR with the newly implemented interface

This commit addresses the remaining references of the `instance_type` and `confidential_compute` parameters in the testing environment and updates the ADR.

* feat(compute-vm): add network_performance_config to instance and templates

This change implements the usage of the `network_performance_tier` variable we added earlier into the actual Terraform resources.

---------

Co-authored-by: Wiktor Niesiobędzki <wiktorn@google.com>
2026-03-26 11:31:40 +00:00

24 KiB

Refactor compute-vm module variables and add new resource attributes

authors: Ludo Wiktor
date: Mar 23, 2026

Status

Draft

Context

The compute-vm module currently uses variable schemas that diverge from the modern standards adopted by newer Cloud Foundation Fabric modules. The current design of boot_disk and attached_disks uses different schemas, lacks polymorphic source structures, and utilizes lists instead of maps, causing for_each stability issues. Furthermore, several modern google_compute_instance attributes (e.g., queue_count, network_performance_config, advanced scheduling, SEV-SNP) are missing.

Decision

1. List vs. Map for Interfaces and Disks

  • Network Interfaces: The order of network interfaces is critical in GCP VMs (e.g., nic0 is the primary interface, nic1 is secondary, etc.). Terraform's google_compute_instance resource processes the network_interface blocks in the order they are defined. Using a map would lose this explicit ordering (since map keys are sorted alphabetically in Terraform), making it impossible to guarantee which interface becomes nic0. Therefore, network_interfaces must remain a list.
  • Attached Disks: While disks also have an implicit order when attached, their identity is more strongly tied to their device_name or name rather than their strict numerical index. The current approach of using a list and generating keys based on index ("disk-${i}") causes issues with for_each loops (like in google_compute_disk_resource_policy_attachment) when dynamic values or conditional creations are involved, leading to the "value of count cannot be computed" or "invalid for_each argument" errors. Switching attached_disks to a map(object({...})) where the key is the logical name or device_name solves these for_each stability issues and aligns with modern Fabric patterns.

2. Disk Refactoring Strategy

Disambiguating Disk "Names"

Disks in GCP and Terraform have several identifiers which often cause confusion. We will explicitly disambiguate them as follows:

  1. Map Key (The Identifier): In the new attached_disks map, the key itself will act as the primary logical identifier for the disk within the module's Terraform state.
  2. Device Name (device_name): This is the name exposed to the Guest OS (e.g., visible in /dev/disk/by-id/google-<device_name>).
    • Rule: We will default the device_name to the Map Key. Users can override it explicitly if needed, but the map key provides a safe, predictable default.
  3. Resource Name (name): This is the actual name of the google_compute_disk resource created in the GCP API.
    • Rule: To ensure uniqueness across a project, we will default the resource name to ${var.name}-${each.key} (the VM name hyphenated with the Map Key). Users can provide an explicit name attribute to override this (e.g., when attaching an existing disk or requiring a specific naming convention).

Unifying boot_disk and attached_disks Structures

Currently, boot_disk uses an initialize_params block (mirroring Terraform's native syntax), while attached_disks uses an options block and keeps size at the top level. We will align them to use a consistent schema:

  • Adopt initialize_params: Both boot_disk and attached_disks will use an initialize_params block for creation-specific attributes (size, type, image, architecture, provisioned iops/throughput). This clearly separates attributes used for creating a disk from attributes used for attaching a disk.
  • Top-level attributes: Attributes relevant to the attachment or lifecycle (e.g., source, auto_delete, mode) will live at the top level of the disk object.
  • Source Type Handling: For attached_disks, we will keep a mechanism to distinguish between creating from an image/snapshot vs. attaching an existing disk (e.g., keeping source_type or inferring it from the presence of initialize_params vs source).

Polymorphic source Object

To eliminate the confusing source (string) and source_type (string) variables, we will use a polymorphic source object. This pattern ensures mutual exclusivity and clearly defines the origin of the disk.

Type Definition:

source = optional(object({
  attach   = optional(string)
  disk     = optional(string)
  image    = optional(string)
  snapshot = optional(string)
}))

Validation: A validation rule will ensure that if source is provided, exactly one of its attributes is non-null. If source is omitted entirely (for attached disks), it implies creating a blank disk.

Updated Variable Structures:

variable "boot_disk" {
  type = object({
    architecture      = optional(string)
    auto_delete       = optional(bool, true)
    snapshot_schedule = optional(list(string))
    initialize_params = optional(object({
      size = optional(number, 10)
      type = optional(string, "pd-balanced")
      hyperdisk = optional(object({
        provisioned_iops       = optional(number)
        provisioned_throughput = optional(number) # in MiB/s
        storage_pool           = optional(string)
      }), {})
    }), {})
    source = optional(object({
      attach   = optional(string)
      disk     = optional(string)
      image    = optional(string)
      snapshot = optional(string)
    }), { image = "projects/debian-cloud/global/images/family/debian-11" })
    use_independent_disk = optional(object({
      name = optional(string)
    }))
  })
}

variable "attached_disks" {
  type = map(object({
    auto_delete = optional(bool, true) # applies only to vm templates
    device_name = optional(string)
    mode        = optional(string, "READ_WRITE")
    name        = optional(string)
    initialize_params = optional(object({
      replica_zone = optional(string)
      size         = optional(number, 10)
      type         = optional(string, "pd-balanced")
      hyperdisk = optional(object({
        provisioned_iops       = optional(number)
        provisioned_throughput = optional(number) # in MiB/s
        storage_pool           = optional(string)
      }), {})
    }), {})
    snapshot_schedule = optional(list(string))
    source = optional(object({
      attach   = optional(string)
      image    = optional(string)
      snapshot = optional(string)
    }), {})
  }))
}

Example Usage: Polymorphic Disks

Here is how the proposed boot_disk and attached_disks variables look in practice using the polymorphic source object.

1. Boot Disk Examples
# Default boot disk (from image)
boot_disk = {
  auto_delete = true
  source = {
    image = "projects/debian-cloud/global/images/family/debian-11"
  }
  initialize_params = {
    size = 20
    type = "pd-ssd"
  }
}

# Booting from an existing attached disk
boot_disk = {
  auto_delete = false
  source = {
    attach = "projects/my-project/zones/europe-west1-b/disks/my-existing-boot-disk"
  }
  # initialize_params are omitted/ignored when attaching
}
2. Attached Disks Examples
attached_disks = {
  # 1. Create a blank disk
  # The map key ("data-disk") is the primary identifier.
  data-disk = {
    auto_delete = false
    mode        = "READ_WRITE"
    initialize_params = {
      size = 100
      type = "pd-balanced"
    }
    # source is omitted entirely
  }

  # 2. Create a disk from a snapshot
  restored-data = {
    source = {
      snapshot = "projects/my-project/global/snapshots/my-snapshot"
    }
    initialize_params = {
      size = 500
      type = "pd-ssd"
    }
  }

  # 3. Attach an existing disk (overriding defaults)
  existing-backup = {
    device_name = "backup-mount" # Explicitly set device name for OS
    mode        = "READ_ONLY"
    source = {
      attach = "projects/my-project/zones/europe-west1-b/disks/my-existing-disk"
    }
  }
}

3. Feature and Options Grouping Strategy

Currently, compute-vm has a mix of top-level boolean toggles (confidential_compute, can_ip_forward, enable_display) and a catch-all options variable that houses advanced_machine_features, scheduling/spot configurations, and operational toggles.

To align with modern Fabric patterns, we will decompose these into logical *_config objects and structured variables. Crucially, almost all string field that maps to an API enum (e.g., provisioning_model, maintenance_interval) will include strict Terraform validation rules. nic_type will be an exception from this rule as the valid values depend on machine_type and there might be new types introduced in the future.

Proposed Groupings and Type Definitions

1. machine_type / Machine Configuration

Currently named instance_type, we will rename it to machine_type to align with GCP console and gcloud terminology. min_cpu_platform remains top-level. advanced_machine_features will be extracted from options into its own top-level block or kept flat.

2. scheduling_config (Replaces parts of options)

The google_compute_instance.scheduling block in Terraform handles spot instances, maintenance, and run durations. We will extract these from the current options variable into a dedicated scheduling_config object and add the missing modern attributes.

variable "scheduling_config" {
  description = "Scheduling configuration for the instance."
  type = object({
    automatic_restart    = optional(bool)   # Defaults to !spot
    maintenance_interval = optional(string) # NEW
    min_node_cpus        = optional(number) # NEW
    on_host_maintenance  = optional(string) # Defaults to MIGRATE or TERMINATE based on GPU/Spot
    provisioning_model   = optional(string) # "SPOT" or "STANDARD"
    termination_action   = optional(string)
    local_ssd_recovery_timeout = optional(object({ # NEW
      nanos   = optional(number)
      seconds = number
    }))
    max_run_duration = optional(object({
      nanos   = optional(number)
      seconds = number
    }))
    node_affinities = optional(map(object({
      values = list(string)
      in     = optional(bool, true)
    })), {})
  })
  default = {}
}

Example Usage:

scheduling_config = {
  provisioning_model          = "SPOT"
  termination_action   = "STOP"
  maintenance_interval        = "PERIODIC"
  node_affinities = {
    "compute.googleapis.com/node-group-name" = {
      values = ["my-node-group"]
    }
  }
}
3. confidential_compute (Updating to support SEV-SNP)

Currently a boolean. Since the Terraform block (confidential_instance_config) effectively only needs to know the type (SEV or SEV_SNP) when enabled, we will change this to a simple string to avoid a single-field object.

variable "confidential_compute" {
  description = "Confidential Compute configuration. Set to 'SEV' or 'SEV_SNP' to enable."
  type        = string
  default     = null # If null, feature is disabled
  validation {
    condition     = var.confidential_compute == null || contains(["SEV", "SEV_SNP"], coalesce(var.confidential_compute, "-"))
    error_message = "Allowed values are 'SEV' or 'SEV_SNP'."
  }
}

Example Usage:

confidential_compute = "SEV_SNP"
4. shielded_config

Remains an object, but we ensure its type signature uses strict optional() defaults mirroring current behavior.

variable "shielded_config" {
  description = "Shielded VM configuration of the instances."
  type = object({
    enable_secure_boot          = optional(bool, true)
    enable_vtpm                 = optional(bool, true)
    enable_integrity_monitoring = optional(bool, true)
  })
  default = null
}

Example Usage:

shielded_config = {
  enable_secure_boot = true
  enable_vtpm        = false
}
5. network_interfaces Enhancements

We will add the missing modern attributes directly to the existing list of objects:

  • queue_count
  • internal_ipv6_prefix_length
variable "network_interfaces" {
  description = "Network interfaces configuration. Use self links for Shared VPC, set addresses to null if not needed."
  type = list(object({
    network                     = string
    subnetwork                  = string
    alias_ips                   = optional(map(string), {})
    nat                         = optional(bool, false)
    network_tier                = optional(string)
    nic_type                    = optional(string)
    stack_type                  = optional(string)
    queue_count                 = optional(number) # NEW
    internal_ipv6_prefix_length = optional(number) # NEW
    addresses = optional(object({
      internal = optional(string)
      external = optional(string)
    }), null)
  }))
}

Example Usage:

network_interfaces = [{
  network                     = "my-vpc"
  subnetwork                  = "my-subnet"
  queue_count                 = 4
  internal_ipv6_prefix_length = 96
}]
6. network_performance_tier (NEW)

Since the network_performance_config block only contains a single field (total_egress_bandwidth_tier), we will implement it as a flat string variable to avoid unnecessary complex objects.

variable "network_performance_tier" {
  description = "Network performance total egress bandwidth tier."
  type        = string
  default     = null
  validation {
    condition     = var.network_performance_tier == null || contains(["DEFAULT", "TIER_1"], coalesce(var.network_performance_tier, "-"))
    error_message = "Allowed values are 'DEFAULT' or 'TIER_1'."
  }
}

Example Usage:

network_performance_tier = "TIER_1"
7. lifecycle_config (Replaces residual options)

Operational toggles will be grouped into a lifecycle_config object. key_revocation_action_type dictates whether the VM stops when its CMEK is revoked, which fits well within lifecycle management.

variable "lifecycle_config" {
  description = "Instance lifecycle and operational configurations."
  type = object({
    allow_stopping_for_update  = optional(bool, true)
    deletion_protection        = optional(bool, false)
    key_revocation_action_type = optional(string, "NONE")
    graceful_shutdown = optional(object({
      enabled           = optional(bool, false)
      max_duration_secs = optional(number)
    }))
  })
  default = {}
}

Example Usage:

lifecycle_config = {
  deletion_protection        = true
  allow_stopping_for_update  = false
  key_revocation_action_type = "STOP"
  graceful_shutdown = {
    enabled           = true
    max_duration_secs = 60
  }
}

4. Instance Groups and Policies

Instance Groups (group)

Currently, the module can only create an unmanaged instance group and add the VM to it. We will expand this to support adding the VM to an existing unmanaged instance group using the google_compute_instance_group_membership resource.

To manage this cleanly, we will update the group variable to support both modes:

variable "group" {
  description = "Instance group configuration. Set 'named_ports' to create a new unmanaged instance group, or provide an existing group self_link/id in 'membership' to join one."
  type = object({
    named_ports = optional(map(number))
    membership  = optional(string) # ID of an existing unmanaged group to join
  })
  default = null
}

Note: If named_ports is provided, a new group is created. If membership is provided, the VM joins the specified existing group. They are mutually exclusive.

Resource Policies (Snapshots and Schedules)

The module currently supports creating snapshot_schedules and an instance_schedule.

  • Snapshot Schedules: The existing snapshot_schedules variable is already well-structured using modern optionals. We will retain this structure. The primary refactoring here will be updating the attachment logic (google_compute_disk_resource_policy_attachment) to iterate over the new attached_disks map instead of the old list.
  • Instance Schedule: The instance_schedule variable is also well-structured using strict optionals and will be retained.
  • Placement Policies: The existing resource_policies list variable already allows attaching externally created placement policies (Collocated/Spread) or other custom policies. We will keep this as-is for flexibility, as placement policies are typically shared across multiple standalone VMs.

5. Templates (create_template) Strategy

Currently, create_template is an object type = object({ regional = optional(bool, false) }) that defaults to null. It creates either a google_compute_instance_template or google_compute_region_instance_template depending on the regional flag.

While this pattern is somewhat unusual in the Fabric codebase, we will keep the create_template variable structure but ensure it is strictly integrated with the new disk schemas.

Key Refactoring Points for Templates

  1. Disk Schema Alignment: The template.tf file currently maps the old options block to the template's disk block. This mapping will be updated to reflect the new initialize_params and polymorphic source blocks.
    • Constraint: Templates do not allow specifying source_image alongside disk_name or disk_size_gb in the same way standalone instances do (some fields are mutually exclusive).
    • Solution Map:
      • source.image -> source_image
      • source.snapshot -> source_snapshot
      • source.attach -> source (attaching an existing disk)
      • source == null -> creates a blank disk
  2. Attribute Parity: All newly refactored attributes (network_performance_tier, scheduling_config, updated confidential_compute, and network interface enhancements) will be mapped directly into the respective blocks within both regional and global template resources.
  3. Tags and Labels: No architectural change here, but we will ensure that tag_bindings_immutable continues to map correctly to resource_manager_tags.

TODO

Example tests will be adapted and run as part of each task iteration.

  • Task 1: Update variables.tf to implement the new disk structures (boot_disk and attached_disks), polymorphic source, and disambiguate disk names.
  • Task 2: Refactor variables.tf for feature grouping: rename instance_type to machine_type, add scheduling_config, lifecycle_config, network_performance_tier, and update confidential_compute.
  • Task 3: Add new attributes to network_interfaces (queue_count, internal_ipv6_prefix_length).
  • Task 4: Split template.tf into template-zonal.tf and template-regional.tf, extract instance.tf from main.tf to allow easy comparison of feature coverage.
  • Task 5: Expand the group variable to support the membership attribute.
  • Task 6: Update instance.tf and outputs.tf to consume the new variables (standalone VM implementation).
  • Task 7: Update tags.tf and resource-policies.tf to work with the new attached_disks map instead of a list.
  • Task 8: Update template-zonal.tf and template-regional.tf to align with the new disk schemas and map the new feature attributes.
  • Task 9: Run integration tests and regenerate documentation (python3 tools/tfdoc.py and YAML test files updates).
  • Task 10: Assess if disk-level encryption key overrides make sense, and if so implement them.

Addendum: Missing Disk Attributes

Based on a review of the latest terraform-provider-google documentation for google_compute_disk, google_compute_region_disk, and google_compute_instance disk attachments, the following attributes are currently missing from the proposed disk type definitions and should be considered for inclusion:

1. Metadata and Organization

  • description (string): An optional description of the disk resource.
  • labels (map(string)): Key/value pairs to label the disk.
  • params / resource_manager_tags (map(string)): Resource manager tags to be bound to the disk.
  • licenses (list(string)): Applicable license URIs to apply to the disk.

2. Encryption and Security

  • disk_encryption_key (object): Used to encrypt the disk with a customer-supplied (CSEK) or customer-managed (CMEK) key.
  • source_image_encryption_key (object): Required to decrypt the source image if it is protected by a CSEK/CMEK.
  • source_snapshot_encryption_key (object): Required to decrypt the source snapshot if it is protected by a CSEK/CMEK.
  • enable_confidential_compute (bool): Whether the disk uses confidential compute mode (supported on certain Hyperdisk SKUs).
  • disk_encryption_key_raw / kms_key_self_link: Required on the attached_disk block of google_compute_instance to mount an existing encrypted disk.

3. Advanced Disk Features & Hyperdisk

  • access_mode (string): Specifically for Hyperdisks (e.g., READ_WRITE_SINGLE, READ_WRITE_MANY, READ_ONLY_SINGLE).
  • multi_writer (bool): Indicates whether a persistent disk can be read/write attached to more than one instance.
  • physical_block_size_bytes (number): Allows specifying physical block size (usually 4096 or 16384).
  • guest_os_features (list(object)): Features to enable on the guest OS (e.g., UEFI_COMPATIBLE, SECURE_BOOT, MULTI_IP_SUBNET).
  • async_primary_disk (object): Primary disk configuration for asynchronous disk replication.

4. Source Creation Options

  • source_disk (string): Allows creating a new disk by cloning an existing google_compute_disk (supported by both zonal and regional disks).
  • source_instant_snapshot (string): Allows creating a disk from a Google Compute instant snapshot.
  • source_storage_object (string): Allows creating a disk directly from a GCS URI tarball/vmdk.
  • erase_windows_vss_signature (bool): Specifies whether the disk restored from a source snapshot should erase the Windows-specific VSS signature.
  • Note on Regional Disks: google_compute_region_disk does not support initialization directly from an image. The source.image attribute will only work for zonal disks.

5. Disk Lifecycle

  • create_snapshot_before_destroy (bool): If true, creates a snapshot of the disk before Terraform destroys it.
  • create_snapshot_before_destroy_prefix (string): A custom prefix for the snapshot name created prior to destruction.