Reorganize ADRs and new versioning ADR (#2642)

* Reorganize ADRs and new versioning ADR

* Workflow examples

* Fix ADR links

* Changes discussed with ludoo

* Fix image reference

* Update image

* Fix typo

* Complet decision section

---------

Co-authored-by: Ludovico Magnocavallo <ludomagno@google.com>
This commit is contained in:
Julio Castillo
2024-10-30 12:39:53 +01:00
committed by GitHub
parent dafb8d246d
commit f5d05b3c3f
12 changed files with 98 additions and 1 deletions

View File

@@ -1,36 +0,0 @@
# Remove initial gcloud commands needed to bootstrap
**authors:** [Ludo](https://github.com/ludoo)\
**date:** July 13, 2023
## Status
Rejected.
## Context
The initial `gcloud` commands that grant IAM roles to the user running `apply` for the first time, are sometimes seen as an extra hurdle and an unnecessary complication.
These are the roles in question
- `roles/logging.admin`
- `roles/owner`
- `roles/resourcemanager.organizationAdmin`
- `roles/resourcemanager.projectCreator`
One proposal we investigated was internalizing those IAM bindings in the actual Terraform code, either via bare resources or an additional organization module invocation, and depending subsequent resources on it.
On further investigation, this poses a few challenges
- the roles in question are managed authoritatively, and it would be best they remained so (e.g. to clear the Project Creator role, or ensure Organization Administrators match what is in the code)
- project creation depends on those roles, but this creates a cycle dependency as the service accounts created are also assigned those roles, and they cannot implicitly depend (via the project) on the same roles
Working around this issue would require a substantial amount of hoops and a lot of development effort. It would also result in potentially less safe and more fragile code.
## Decision
What we decided is to leave those external commands in place, as the hurdle is minimal and not worth the expense and risks of removing it.
## Consequences
Nothing changes due to this decision.

View File

@@ -1,85 +0,0 @@
# Add new service accounts for CI/CD with plan-only permissions
**authors:** [Ludo](https://github.com/ludoo) \
**date:** December 3, 2023
## Status
In development.
## Context
The current CI/CD workflows are inherently insecure, as the same service account is used to run `terraform plan` in PR checks, and `terraform apply` in merges.
The current repository configuration variable allows setting a branch which could be used to only allow using the service account in merges, but that only has the consequence of preventing PR checks to work so it's not working as desired.
## Proposal
The proposal is to create a separate "chain" of less privileged service accounts that can only run `plan`, used only when a repository configuration sets a branch for merges in the `cicd_repositories` variable.
### Use cases
#### Merge branch set in repository configuration
```hcl
cicd_repositories = {
bootstrap = {
branch = "main"
identity_provider = "github-example"
name = "example/bootstrap"
type = "github"
}
}
# tftest skip
```
When a merge branch is set as in the example above, the CI/CD workflow will have two separate flows:
- for PR checks, the OIDC token will be exchanged with credentials for the `plan`-only CI/CD service account, which can only impersonate the `plan`-only automation service account
- for merges, the current flow that enables credential exchange and impersonation of the `apply`-enabled service account will be used
#### No merge branch set in repository configuration
```hcl
cicd_repositories = {
bootstrap = {
identity_provider = "github-example"
name = "example/bootstrap"
type = "github"
}
}
# tftest skip
```
If no merge branch is set in the repository configuration as in the example above, the current behaviour will be preserved allowing exchange and impersonation of the `apply`-enabled service account from any branch.
### Implementation
No changes to variables will be needed other than a lightweight refactor with `optional`.
The following resource changes will need to be implemented:
- define the set of read-only roles for each stage
- create a new automation service account in each stage and assign the identified roles
- create a new CI/CD service account with `roles/iam.serviceAccountTokenCreator` on the new automation service account
- if a merge branch is set in the repository configuration
- grant `roles/iam.workloadIdentityUser` on the new CI/CD service account to the `principalSet:` matching any branch
- define a new provider file that impersonates the new automation service account and use it in the workflow for checks
- keep the existing token exchange via `principal:`, impersonation and provider file for the `apply` part of the workflow only matching the specified merge branch
- if a branch is not set the current behaviour will be kept
Implementation will modify in stages 0 and 1
- the `automation.tf` files
- any file where IAM roles are assigned to the automation service account
- the `cicd-*.tf` files
- the `templates/workflow-*.yaml` files to implement the new workflow logic
- the `outputs.tf` files to generate the additional provider files
## Decision
This has been surfaced a while ago and implementation was only pending actual time for development. Development has started.
## Consequences
Existing CI/CD workflows will need to be replaced when a merge branch is already defined in the repository configuration (unlikely to happen as the current workflow would not work).

View File

@@ -1,142 +0,0 @@
# Support for domain-less organizations
**authors:** [Ludo](https://github.com/ludoo) \
**reviewed by:** [Julio](https://github.com/juliocc) \
**date:** Feb 12, 2024
## Status
Implemented in [#2064](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/pull/2064).
## Context
The current FAST design assumes that operational groups come from the same Cloud Identity instance connected to the GCP organization.
While this approach has worked well in the past, there are already designs that cannot be easily mapped (for example groups coming from a separate CI), and the situation will only get worse once domain-less organizations start to be in wider use.
Removing the assumption that FAST logical principals (e.g. `gcp-organization-admins`) always map directly to groups is not entirely trivial, since FAST uses data from the `groups` variable in different places:
- to define authoritative IAM bindings via the module-level `group_iam` interface
- to define additive IAM bindings via the module-level `iam_bindings_additive` interface
- to set essential contacts at the folder and project level
This proposal removes the dependency from groups by allowing to pass in to FAST any principal type, while still trying to preserve the current default behaviour and code readability in IAM bindings.
## Proposal
### FAST variable type change and optional interpolation
The current `groups` variable was meant as a simple mapping between logical profile names used internally by FAST, and actual group names. The default case was furthermore made easier by interpolating the organization domain when no domain was specified, and adding the `group:` principal prefix for IAM bindings.
The new proposed variable maintains the legacy behaviour, but slightly changes it so that no interpolation happens if the variable attributes have a principal prefix. The variable type is also updated to use `optional`, so that individual logical profile / principal mappings can be specified without having to override the whole block.
```hcl
variable "groups" {
type = object({
gcp-billing-admins = optional(string, "gcp-billing-admins")
gcp-devops = optional(string, "gcp-devops")
gcp-network-admins = optional(string, "gcp-network-admins")
gcp-organization-admins = optional(string, "gcp-organization-admins")
gcp-security-admins = optional(string, "gcp-security-admins")
gcp-support = optional(string, "gcp-support")
})
nullable = false
default = {}
}
```
Passing in different principals is intuitive:
```hcl
groups = {
gcp-devops = "principalSet://iam.googleapis.com/locations/global/workforcePools/mypool/group/abc123"
gcp-organization-admins = "group:gcp-organization-admins@other.domain"
}
```
Internally, interpolation is fairly straightforward:
```hcl
locals {
groups = {
for k, v in var.group_principals : k => (
can(regex("^[a-zA-Z]+:", v))
? v
: "group:${v}@${var.organization.domain}"
)
}
}
```
### FAST IAM additive bindings and module interface change
FAST leverages the `group_iam` module-level interface to improve code readability for authoritative bindings, which is a primary goal of the framework. Introducing support for any principal type prevents us from using this interface, with a non-trivial impact on the overall readability of IAM roles in FAST.
This is an example use in the IaC project:
```hcl
# human (groups) IAM bindings
group_iam = {
(local.groups.gcp-devops) = [
"roles/iam.serviceAccountAdmin",
"roles/iam.serviceAccountTokenCreator",
]
(local.groups.gcp-organization-admins) = [
"roles/iam.serviceAccountTokenCreator",
"roles/iam.workloadIdentityPoolAdmin"
]
}
```
This proposal addresses the issue by changing the module-level interface to support different principal types. The original goal for `group_iam` -- to allow for better readability -- is preserved at the cost of the slight increase in verbosity due to having to specify the principal type.
The trade-off in verbosity seems acceptable as it makes the new interface more flexible, and allows using the interface for `principal:` and `principalSet:` types, which are becoming more and more important to support.
FAST code remains unchanged, as the `groups` local already contains a prefix for each principal, either interpolated or passed in by the user.
The module-level variable definition changes only its name and description:
```hcl
variable "iam_by_principals" {
description = "Authoritative IAM binding in {PRINCIPAL => [ROLES]} format. Principals need to be statically defined to avoid cycle errors. Merged internally with the `iam` variable."
type = map(list(string))
default = {}
nullable = false
}
```
Actual use is basically unchanged from the current `group_iam` interface:
```hcl
# current interface
group_iam = {
"app1-admins@example.org" = [
"roles/owner",
"roles/resourcemanager.folderAdmin",
"roles/resourcemanager.projectCreator"
]
}
# proposed interface
iam_by_principals = {
"group:app1-admins@example.org" = [
"roles/owner",
"roles/resourcemanager.folderAdmin",
"roles/resourcemanager.projectCreator"
]
"principalSet://iam.googleapis.com/locations/global/workforcePools/mypool/group/abc123": = [
"roles/owner",
"roles/resourcemanager.folderAdmin",
"roles/resourcemanager.projectCreator"
]
}
```
### FAST essential contacts
Having `group_principals` support different type of principals will make it impossible to use the same variable to set essential contacts, as the principal might not be a group.
This will require introduction of a new `essential_contacts` top-level variable keyed by folder/project (the individual contexts on which to set contacts), with the added benefit of being able to specify different and potentially multiple contacts compared to now.
## Decision
Rolled out.

View File

@@ -1,54 +0,0 @@
# Move organization policies to bootstrap stage
**authors:** [Julio](https://github.com/juliocc), [Ludo](https://github.com/ludoo), [Roberto](https://github.com/drebes) \
**date:** September 13, 2023
## Status
Implemented.
## Context
Three different requirements drive this proposal.
### Organization policies deployed at bootstrap time
Many organizations take security seriously, and would like to have organization policies (for example `iam.automaticIamGrantsForDefaultServiceAccounts`) deployed right from the beginning at bootstrap time. This is currently extremely cumbersome, as organization policies are managed in stage 1.
As an additional benefit, managing some or all organization policies in stage 0 will enable to turn off undesired resource configuration for the initial projects (for example `compute.skipDefaultNetworkCreation`).
### Simplify and limit delegation of Organization Policy Administrator role
Automation service accounts are currently assigned the Organization Policy Administrator role at the organization level, scoped via resource management tags. This is cumbersome as bindings are distributed between stage 0 that delegates role control to the stage 1 service account, and stage 1 that creates the automation service accounts, tags and folder bindings used for scoping.
A more secure way of doing this is via a dedicated resource management tag value hierarchy, and conditions on the organization policies that alter behaviour based on tags. This would allow centrally defining allowed exceptions to organization policies, and selectively granting access to specific exceptions to individual automation service accounts via tag values.
The project factory will need to retain scoped grants, to set policies that enforce lists of resources which would be too cumbersome to maintain in stage 0.
### Reduce stage 1 complexity to allow simpler creation of hierarchy templates
Stage 1 is currently too complex to allow easy cloning into different resource hierarchy templates, which are needed to account for all landing zone designs.
Removing complexity from stage 1 by moving organization policy and its related IAM to stage 0 will be an initial step towards stage 1 simplification.
## Proposal
The proposal is to
- move management of organization policies to stage 0
- move management of the `org_policies` tag key and associated values to stage 0
- remove delegated/conditional grants for the Organization Policy Administrator role from stage 0 and 1
The approach fattens stage 0 and lessens its decoupling role in the overall FAST design, but looks preferable compared to the complexity of splitting organization policy management between stage 0 and 1, or worse delegating control of specific policies to external commands run before stage 0.
## Decision
Decision is to implement this.
## Consequences
Organization policies and related tags will need to be moved from stage 1 to stage 0 state. One approach is to
- switch both states to local state
- use `terraform state mv -state-out` to temporarily move resources from stage 1 to stage 0
- push stage 0 and stage 1 state

View File

@@ -1,41 +0,0 @@
# IP ranges for network stages
**authors:** [Ludo](https://github.com/ludoo), [Roberto](https://github.com/drebes), [Julio](https://github.com/jccb) \
**date:** Sept 20, 2023
## Status
Implemented
## Context
Adding or changing subnets to networking stages is a mistake-prone process because there is no clear IP plan. The problem was made worse when we began supporting GKE, which requires secondary ranges and a large number of IP addresses for pods and services.
This was not an issue when there were only a few networking stages, but as FAST expands, it becomes more difficult to keep track of IP ranges for different regions and environments.
## Decision
We adopted an IP plan based on regions and environments with the following key points:
- Large ranges for the 3 environments we have out of the box (landing, dev, prod)
- Support for 2 regions
- Leave enough space to easily grow either the number of environments or regions
- Allocate large blocks from the CG-NAT range to use as secondary ranges, primarily for GKE pods and services.
The following table summarizes the agreed IP plan:
| | aggregate | landing | dev | prod |
|----------------------------|--------------:|-------------------------------------------------------------------:|--------------:|--------------:|
| Region 1, primary ranges | 10.64.0.0/12 | 10.64.0.0/16<br>Trusted: 10.64.0.0/17<br>Untrusted: 10.64.128.0/17 | 10.68.0.0/16 | 10.72.0.0/16 |
| Region 2, primary ranges | 10.80.0.0/12 | 10.80.0.0/16<br>Trusted: 10.80.0.0/17<br>Untrusted: 10.80.128.0/17 | 10.84.0.0/16 | 10.88.0.0/16 |
| Region 1, secondary ranges | 100.64.0.0/12 | 100.64.0.0/14 | 100.68.0.0/14 | 100.72.0.0/14 |
| Region 2, secondary ranges | 100.80.0.0/12 | 100.80.0.0/14 | 100.84.0.0/16 | 100.88.0.0/14 |
To allocate additional secondary ranges for GKE clusters:
- For the pods range, use the next available /16 in the secondary range of its region/environment pair.
- For the service range, use the next available /24 in the last /16 of its region/environment pair.
## Consequences
Default subnets for networking stages were updated to reflect the new ranges.

View File

@@ -1,3 +0,0 @@
# FAST architectural documents
This folder contains assorted bits of documentation used to log current architectural choices, or past decisions. Format is inspired by [Michael Nygard's decision record template](https://github.com/joelparkerhenderson/architecture-decision-record/blob/main/templates/decision-record-template-by-michael-nygard/index.md).