Deploy a batch system using Kueue
This tutorial shows you how to deploy a batch system using Kueue to perform Job queueing on Google Kubernetes Engine (GKE) using Terraform.
Jobs are applications that run to completion, such as machine learning, rendering, simulation, analytics, CI/CD, and similar workloads.
Kueue is a cloud-native Job scheduler that works with the default Kubernetes scheduler, the Job controller, and the cluster autoscaler to provide an end-to-end batch system. Kueue implements Job queueing: it decides when Jobs should wait and when they should start, based on quotas and a hierarchy for sharing resources fairly among teams.
Kueue has the following characteristics:
- It is optimized for cloud architectures, where resources are heterogeneous, interchangeable, and scalable.
- It provides a set of APIs to manage elastic quotas and manage Job queueing.
- It does not re-implement existing functionality such as autoscaling, pod scheduling, or Job lifecycle management.
- It has built-in support for the Kubernetes batch/v1.Job API.
- It can integrate with other job APIs.
- It refers to jobs defined with any API as Workloads, to avoid confusion with the specific Kubernetes Job API.
When working with Kueue, there are a few concepts that you need to be familiar with:
- ResourceFlavor: An object that you can define to describe what resources are available in a cluster. Typically, it is associated with the characteristics of a group of Nodes: availability, pricing, architecture, models, etc.
- ClusterQueue: A cluster-scoped resource that governs a pool of resources, defining usage limits and fair sharing rules.
- LocalQueue: A namespaced resource that groups closely related workloads belonging to a single tenant.
- Workload: An application that will run to completion. It is the unit of admission in Kueue. Sometimes referred to as a job.
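As a rough illustration of how the first three objects relate, the following sketch writes one manifest of each kind to a file that could then be applied with kubectl. The names cluster-queue, local-queue, and team-a match the ones used later in this tutorial; the `kueue.x-k8s.io/v1beta1` API version, the `default-flavor` name, the quota values, and the file name `kueue-objects.yaml` are assumptions for illustration only. The Terraform code in this tutorial creates its own versions of these objects, which may differ.

```shell
#!/bin/sh
set -eu

# Illustrative only: file name, flavor name, API version, and quotas are assumptions.
cat > kueue-objects.yaml <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}          # accept LocalQueues from every namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor       # must match the ResourceFlavor above
      resources:
      - name: cpu
        nominalQuota: 10
      - name: memory
        nominalQuota: 10Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a
  name: local-queue
spec:
  clusterQueue: cluster-queue    # the pool this tenant queue draws from
EOF

# Count the objects defined in the file.
grep -c '^kind:' kueue-objects.yaml
```

On a cluster where Kueue is already installed, such a file could be applied with `kubectl apply -f kueue-objects.yaml`. A second LocalQueue in the team-b namespace pointing at the same ClusterQueue would let both teams share the same quota.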
Objectives
This tutorial is for cluster operators and other users that want to implement a batch system on Kubernetes. In this tutorial, you set up a shared cluster for two tenant teams. Each team has their own namespace where they create Jobs and share the same global resources that are controlled with the corresponding quotas.
In this tutorial, you use Terraform code available in a Git repository to do the following:
- Create a GKE cluster.
- Create a namespace for Kueue (kueue-system).
- Create a namespace for each team running batch jobs in the cluster (team-a, team-b).
- Install Kueue in the namespace created for it.
- Create the ResourceFlavor.
- Create the ClusterQueue.
- Create a LocalQueue for each of the teams in the corresponding namespace.
- Create, for each team, a manifest for a sample Job associated with the corresponding LocalQueue.
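To give an idea of what the last item produces, here is a minimal sketch of a Job manifest tied to a LocalQueue. The `kueue.x-k8s.io/queue-name` label is how Kueue associates a Job with a queue; the file name `sample-job-team-a.yaml`, the container image, the command, and the resource requests are hypothetical and are not taken from the manifests generated by this tutorial.

```shell
#!/bin/sh
set -eu

# Illustrative only: file name, image, command, and requests are assumptions.
cat > sample-job-team-a.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  namespace: team-a
  generateName: sample-job-team-a-   # a unique suffix is added on each create
  labels:
    kueue.x-k8s.io/queue-name: local-queue   # LocalQueue that receives this Job
spec:
  suspend: true        # created suspended; Kueue unsuspends it on admission
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "sleep 30"]
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
EOF
```

Because the manifest uses `generateName`, it would be submitted with `kubectl create -f sample-job-team-a.yaml` rather than `kubectl apply`, so that each submission yields a new Job.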
To get started, click Start.
Select or create a project
Create the Autopilot GKE cluster
- Change to the autopilot-cluster directory.
    cd autopilot-cluster
- Create a new file terraform.tfvars in that directory.
    touch terraform.tfvars
- Open the file for editing.
- Paste the following content in the file and update any value as needed.
project_id = "<walkthrough-project-name/>"
cluster_name = "cluster"
cluster_create = {
  deletion_protection = false
}
region = "europe-west1"
vpc_create = {
  enable_cloud_nat = true
}
- Initialize the Terraform configuration.
    terraform init
- Apply the Terraform configuration.
    terraform apply
- Fetch the cluster credentials.
    gcloud container fleet memberships get-credentials cluster --project "<walkthrough-project-name/>"
- Check that the system pods are running.
    kubectl get pods -n kube-system
Install Kueue and create associated resources
- Change to the patterns/batch directory.
    cd ../batch
- Create a new file terraform.tfvars in that directory.
    touch terraform.tfvars
- Open the file for editing.
- Paste the following content in the file.
    credentials_config = { kubeconfig = { path = "~/.kube/config" } }
- Initialize the Terraform configuration.
    terraform init
- Apply the Terraform configuration.
    terraform apply
- Check that the Kueue pods are ready (use Ctrl-C to stop watching).
    kubectl get pods -n kueue-system -w
- Check the status of the ClusterQueue.
    kubectl get clusterqueue cluster-queue -o wide -w
- Check the status of the LocalQueue for each team.
    kubectl get localqueue -n team-a local-queue -o wide -w
    kubectl get localqueue -n team-b local-queue -o wide -w
Run jobs in the cluster
- Create Jobs for namespaces team-a and team-b every 10 seconds, each associated with the corresponding LocalQueue. Press Ctrl-C when you want to stop creating Jobs.
    ./create_jobs.sh job-team-a.yaml job-team-b.yaml 10
- Observe the workloads being queued up and admitted in the ClusterQueue, and nodes being brought up with GKE Autopilot.
    kubectl -n team-a get workloads
- Copy a Job name from the previous step and observe the admission status and events for that Job through the Workloads API.
    kubectl -n team-a describe workload JOB_NAME
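The contents of create_jobs.sh are not shown in this tutorial. As a rough idea of what periodic submission involves, the sketch below defines a hypothetical `submit_jobs` helper that creates each manifest once per round and sleeps between rounds. Its name, its argument order, the bounded number of rounds, and the `KUBECTL` override variable are all assumptions made for illustration; the real script takes the manifests first and the interval last, and runs until interrupted.

```shell
#!/bin/sh
# Hypothetical sketch; this is not the repository's create_jobs.sh.
# Usage: submit_jobs INTERVAL_SECONDS ROUNDS MANIFEST...
# The body runs in a subshell so its variables do not leak into the caller.
submit_jobs() (
  interval="$1"
  rounds="$2"
  shift 2
  i=0
  while [ "$i" -lt "$rounds" ]; do
    for manifest in "$@"; do
      # `create` rather than `apply`, so manifests using metadata.generateName
      # produce a new Job on every round. KUBECTL can be overridden for dry runs.
      "${KUBECTL:-kubectl}" create -f "$manifest"
    done
    i=$((i + 1))
    sleep "$interval"
  done
)
```

For example, `submit_jobs 10 6 job-team-a.yaml job-team-b.yaml` would submit both manifests six times, ten seconds apart. Setting `KUBECTL=echo` prints the commands instead of running them, which is a cheap way to check the loop before pointing it at a cluster.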
Destroy resources (optional)
- Change to the patterns/autopilot-cluster directory.
    cd ../autopilot-cluster
- Destroy the cluster with the following command.
    terraform destroy
Congratulations
You’re all set!