The rapid growth of AI and Machine Learning (ML) is driving demand for infrastructure that is scalable, flexible, and cost-efficient. Google Kubernetes Engine (GKE) is one of the most powerful solutions for running AI workloads, from model serving and inference to large-scale training.
In this article, I will walk through, step by step, how to build a fully AI-optimized GKE cluster, including GPU node setup, auto-scaling, security, and cost-saving tips. It is aimed at practitioners who want a production-ready setup, not just a PoC.
- Automation: easy deployments, scaling, and rolling updates.
- GPU/TPU support: native NVIDIA support saves setup time.
- Cost efficiency: autoscaler and preemptible nodes cut training/inference costs.
- Managed security: Workload Identity, private nodes, and network policies.
- Google Cloud integration: Stackdriver, BigQuery, GCS, IAM, etc.
- Regional vs Zonal: choose regional for a higher SLA.
- Separate node pools: keep GPU and CPU workloads apart.
- Autoscaling: enable it for resource efficiency.
- Workload Identity: secure, no service account keys.
- Private nodes: avoid public exposure.
- VPC-native: a more secure and scalable network.
a. Create the Main Cluster
export PROJECT_ID=your-project-id
export REGION=asia-southeast2
export CLUSTER_NAME=ai-optimized-gke

gcloud container clusters create $CLUSTER_NAME \
  --region $REGION \
  --release-channel regular \
  --enable-ip-alias \
  --enable-private-nodes \
  --enable-autoscaling \
  --enable-autoprovisioning \
  --min-cpu 4 \
  --max-cpu 64 \
  --min-memory 16 \
  --max-memory 512 \
  --enable-shielded-nodes \
  --workload-pool=$PROJECT_ID.svc.id.goog \
  --machine-type "e2-standard-8" \
  --num-nodes "1" \
  --enable-stackdriver-kubernetes \
  --addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --no-enable-basic-auth \
  --no-issue-client-certificate \
  --enable-master-authorized-networks \
  --master-authorized-networks 1.2.3.4/32 # (replace with your office IP)
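Once the cluster is created, you can fetch kubeconfig credentials and confirm the nodes have registered. This is a minimal sketch that assumes the `$CLUSTER_NAME` and `$REGION` variables exported above are still set:

```sh
# Fetch kubeconfig credentials for the new cluster
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION

# Verify that nodes are registered and Ready
kubectl get nodes -o wide
```

Note that because the control plane is restricted by master authorized networks, these commands must run from an IP listed in `--master-authorized-networks`.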
b. Add a GPU Node Pool (e.g. NVIDIA T4/A100/L4)
gcloud container node-pools create gpu-pool-t4 \
  --cluster $CLUSTER_NAME \
  --region $REGION \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --machine-type n1-standard-8 \
  --num-nodes 0 \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autoscaling \
  --node-labels workload=ai,gpu=t4 \
  --node-taints ai-gpu=true:NoSchedule \
  --scopes=cloud-platform
c. Add a High-Memory Node Pool (Optional)
gcloud container node-pools create high-mem-pool \
  --cluster $CLUSTER_NAME \
  --region $REGION \
  --machine-type n2-highmem-32 \
  --num-nodes 0 \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autoscaling \
  --node-labels workload=ai,mem=high \
  --node-taints ai-mem=true:NoSchedule
Run:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Tip: for Ubuntu nodes or GKE Autopilot, check the official GKE GPU documentation.
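After the installer DaemonSet is applied, you can check that its pods are running and that the GPUs show up as allocatable resources. A sketch, assuming the DaemonSet lands in `kube-system` with the label used in the upstream manifest:

```sh
# Check the driver-installer pods (label assumed from the upstream manifest)
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer

# Confirm GPUs are advertised as allocatable on the nodes
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"
```

If `nvidia.com/gpu` does not appear, the GPU node pool may still be at zero nodes — scale up a workload first so the autoscaler provisions a GPU node.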
- Workload Identity: enabled by default in the script above, far safer than service account keys.
- Private nodes: no node gets a public IP.
- Master Authorized Networks: the control plane is reachable only from specific IPs.
- Shielded nodes: prevent boot-level malware.
- Network Policy: isolate pod traffic, enforce zero trust.
- IAM audit: least privilege, use scoped GCP service accounts.
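As a starting point for the zero-trust item above, here is a minimal default-deny NetworkPolicy sketch. The `ai-inference` namespace is an illustrative name, not from the setup above; you would then add allow-rules for the traffic your workloads actually need:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-inference   # hypothetical namespace for AI workloads
spec:
  podSelector: {}           # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Note that NetworkPolicy only takes effect if network policy enforcement (e.g. Dataplane V2 or Calico) is enabled on the cluster.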
A simple example YAML for PyTorch inference on a GPU node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch
  template:
    metadata:
      labels:
        app: pytorch
    spec:
      containers:
      - name: pytorch
        image: pytorch/torchserve:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        gpu: t4
      tolerations:
      - key: "ai-gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
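The HorizontalPodAutoscaling addon enabled at cluster creation can scale this deployment automatically. A sketch of an HPA for the deployment above, using CPU utilization as a simple proxy (scaling on GPU utilization would require custom metrics via Prometheus):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pytorch-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pytorch-inference
  minReplicas: 1
  maxReplicas: 10          # bounded by the GPU node pool's max-nodes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Because each replica requests one GPU, scaling pods also drives the cluster autoscaler to add GPU nodes, up to the node pool's limit.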
- Autoscaler: enable it on every node pool.
- Preemptible GPUs: for training jobs that can tolerate interruption.
- Scale-to-zero: node pools can scale down to 0 when the cluster is idle.
- Watch your billing: set up alerts in GCP Billing.
- Review resources: delete unused node pools.
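The preemptible-GPU tip above can be implemented as a separate node pool so that interruptible training jobs land on cheap capacity while inference stays on on-demand nodes. A sketch; the pool name and taint key are illustrative:

```sh
# Preemptible T4 pool for interruptible training workloads
gcloud container node-pools create gpu-pool-t4-preempt \
  --cluster $CLUSTER_NAME \
  --region $REGION \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --machine-type n1-standard-8 \
  --preemptible \
  --enable-autoscaling \
  --num-nodes 0 \
  --min-nodes 0 \
  --max-nodes 10 \
  --node-taints ai-gpu-preempt=true:NoSchedule
```

Training jobs then opt in with a matching toleration; anything without it (e.g. the inference deployment) is kept off the preemptible nodes by the taint.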
- Stackdriver (Ops Agent): enabled by default, check logs and metrics in GCP.
- Prometheus + Grafana: for custom AI metrics.
- Node Problem Detector: checks node hardware health.
With this setup, you can run AI/ML workloads (LLM, NLP, Computer Vision models, etc.) on GKE in a secure, scalable, and cost-efficient way.
If you need YAML templates, Helm charts, or automation via Terraform — mention it in the comments!
Interested?
Bookmark, share, and follow for updates on DevOps, Kubernetes, and AI Engineering.
Bonus: Terraform Example (Partial)
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-pool-t4"
  cluster    = google_container_cluster.ai_optimized.name
  location   = google_container_cluster.ai_optimized.location
  node_count = 1

  node_config {
    machine_type = "n1-standard-8"
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    labels = {
      workload = "ai"
      gpu      = "t4"
    }
    taint {
      key    = "ai-gpu"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }
}