Building a Google Kubernetes Engine (GKE) Cluster Optimized for AI/ML | by Xb4sh | Jul, 2025

The growth of AI and Machine Learning (ML) is driving demand for infrastructure that is scalable, flexible, and cost-efficient. Google Kubernetes Engine (GKE) is one of the most powerful options for running AI workloads, from model serving and inference to large-scale training.

In this article, I share a step-by-step guide to building a complete AI-optimized GKE cluster, including GPU node setup, autoscaling, security, and cost-saving tips. It is aimed at practitioners who want a production-ready setup, not just a PoC.

Automation: Deployments, scaling, and rolling updates are easy.
GPU/TPU Support: Native NVIDIA support saves setup time.
Cost Efficiency: The autoscaler and preemptible nodes reduce training and inference costs.
Managed Security: Workload Identity, private nodes, and network policies.
Google Cloud Integration: Stackdriver (now Cloud Logging and Cloud Monitoring), BigQuery, GCS, IAM, and more.

Regional vs Zonal: Choose a regional cluster for a higher SLA.
Separate Node Pools: Keep GPU and CPU workloads in separate pools.
Autoscaling: Enable it for resource efficiency.
Workload Identity: Secure access without service account keys.
Private Nodes: Avoid public exposure.
VPC-Native: More secure and scalable networking.

a. Create the Main Cluster

export PROJECT_ID=your-project-id
export REGION=asia-southeast2
export CLUSTER_NAME=ai-optimized-gke

# --enable-autoscaling needs a node limit; the --min-nodes/--max-nodes values below are placeholders, not from the original.
# --logging/--monitoring stand in for the deprecated --enable-stackdriver-kubernetes flag.
gcloud container clusters create "$CLUSTER_NAME" \
  --region "$REGION" \
  --release-channel regular \
  --enable-ip-alias \
  --enable-private-nodes \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 3 \
  --enable-autoprovisioning \
  --min-cpu 4 \
  --max-cpu 64 \
  --min-memory 16 \
  --max-memory 512 \
  --enable-shielded-nodes \
  --workload-pool="$PROJECT_ID.svc.id.goog" \
  --machine-type "e2-standard-8" \
  --num-nodes "1" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM \
  --addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --no-enable-basic-auth \
  --no-issue-client-certificate \
  --enable-master-authorized-networks \
  --master-authorized-networks 1.2.3.4/32  # Replace with your office IP

b. Add a GPU Node Pool (e.g. NVIDIA T4/A100/L4)

gcloud container node-pools create gpu-pool-t4 \
  --cluster "$CLUSTER_NAME" \
  --region "$REGION" \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --machine-type n1-standard-8 \
  --num-nodes 0 \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autoscaling \
  --node-labels workload=ai,gpu=t4 \
  --node-taints ai-gpu=true:NoSchedule \
  --scopes=cloud-platform

c. Add a High-Memory Node Pool (Optional)

gcloud container node-pools create high-mem-pool \
  --cluster "$CLUSTER_NAME" \
  --region "$REGION" \
  --machine-type n2-highmem-32 \
  --num-nodes 0 \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autoscaling \
  --node-labels workload=ai,mem=high \
  --node-taints ai-mem=true:NoSchedule

Install the NVIDIA GPU driver on the GPU nodes by running:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Tip: For Ubuntu nodes or GKE Autopilot, check the official GKE GPU documentation.
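
Once the driver DaemonSet is running, a throwaway pod that calls nvidia-smi is one way to confirm that a GPU is actually schedulable. The sketch below only writes the manifest to disk. The pod name, the file name, and the nvidia/cuda image tag are assumptions made for illustration; the gpu=t4 label and the ai-gpu taint are the ones set on the GPU node pool in step b.

```shell
#!/bin/sh
# Write a one-shot pod manifest that requests one GPU and runs nvidia-smi.
# Assumptions (not from the article): pod name, file name, and CUDA image tag.
MANIFEST="${TMPDIR:-/tmp}/gpu-smoke-test.yaml"

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image tag; substitute any CUDA base image you trust.
      image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  # Match the label and taint set on gpu-pool-t4.
  nodeSelector:
    gpu: t4
  tolerations:
    - key: "ai-gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
EOF

echo "Wrote $MANIFEST"
echo "Next: kubectl apply -f $MANIFEST"
echo "Then: kubectl logs pod/gpu-smoke-test (after the pod has completed)"
```

Because the GPU pool starts at zero nodes, the pod will likely sit in Pending for a few minutes while the autoscaler adds a node. Delete the pod afterwards so the pool can scale back down.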

Workload Identity: Enabled by default in the script above, and much safer than service account keys.
Private Nodes: No node has a public IP.
Master Authorized Networks: The control plane is reachable only from specified IPs.
Shielded Nodes: Guard against boot-level malware.
Network Policy: Isolate pod traffic and implement zero trust.
IAM Audit: Apply least privilege and use narrowly scoped GCP service accounts.
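
Two of the items above can be sketched as manifests: a Kubernetes service account annotated for Workload Identity, and a default-deny ingress NetworkPolicy. This is a minimal sketch, not the author's configuration. The namespace, the service account names, and the file name are hypothetical, and the NetworkPolicy only takes effect if network policy enforcement (or Dataplane V2) is enabled on the cluster, which the create command in step a does not show.

```shell
#!/bin/sh
# Hypothetical names for illustration; none of these appear in the article
# except the PROJECT_ID placeholder.
NAMESPACE="${NAMESPACE:-default}"
KSA_NAME="${KSA_NAME:-inference-sa}"
PROJECT_ID="${PROJECT_ID:-your-project-id}"
GSA_EMAIL="${GSA_EMAIL:-inference-gsa@${PROJECT_ID}.iam.gserviceaccount.com}"
MANIFEST="${TMPDIR:-/tmp}/security-baseline.yaml"

cat > "$MANIFEST" <<EOF
# Kubernetes service account mapped to a Google service account (Workload Identity).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${KSA_NAME}
  namespace: ${NAMESPACE}
  annotations:
    iam.gke.io/gcp-service-account: ${GSA_EMAIL}
---
# Default deny: selects every pod in the namespace and allows no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF

echo "Wrote $MANIFEST"

# The Google service account must also let the Kubernetes one impersonate it:
#   gcloud iam service-accounts add-iam-policy-binding "$GSA_EMAIL" \
#     --role roles/iam.workloadIdentityUser \
#     --member "$WI_MEMBER"
WI_MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
echo "IAM member for the binding: $WI_MEMBER"
```

With a default-deny policy in place, each workload then needs an explicit allow policy for the traffic it should receive, which is where the zero-trust posture comes from.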

A simple example YAML for PyTorch inference on a GPU node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch
  template:
    metadata:
      labels:
        app: pytorch
    spec:
      containers:
        - name: pytorch
          # Tag from the original; check whether a GPU-enabled TorchServe tag is needed.
          image: pytorch/torchserve:latest
          resources:
            limits:
              nvidia.com/gpu: 1
      # Schedule onto the T4 pool and tolerate its taint.
      nodeSelector:
        gpu: t4
      tolerations:
        - key: "ai-gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
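
To roll the Deployment out and watch it become ready, something like the following could be used. It is a sketch under assumptions: the manifest is assumed to be saved as pytorch-inference.yaml (a file name not given in the article), and the helper prints the commands unless DRY_RUN=0 is set, so it is safe to run without cluster access.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to execute them against the cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumed file name for the Deployment manifest shown above.
MANIFEST_FILE="${MANIFEST_FILE:-pytorch-inference.yaml}"

deploy_and_wait() {
  run kubectl apply -f "$MANIFEST_FILE" || return 1
  # Generous timeout: the autoscaler may first have to add a GPU node
  # when the pool is scaled to zero.
  run kubectl rollout status deployment/pytorch-inference --timeout=15m || return 1
  # Show which node the pod landed on.
  run kubectl get pods -l app=pytorch -o wide
}

deploy_and_wait
```

If the rollout times out, `kubectl describe pod -l app=pytorch` usually shows whether the pod is waiting on a node scale-up, the GPU resource, or the image pull.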

Autoscaler: Enable it on every node pool.
Preemptible GPU: Use for training that can tolerate interruption.
Scale-to-zero: Node pools can scale down to 0, so an idle cluster does not pay for idle nodes in those pools.
Monitor Billing: Set up alerts in GCP Billing.
Review Resources: Delete unused node pools.
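
As a rough sketch of the preemptible GPU and resource review tips, the script below builds a second T4 pool on Spot VMs (the newer counterpart of preemptible VMs, requested with --spot) and lists the existing pools for review. The pool name gpu-pool-t4-spot is an assumption; the cluster name, region, labels, and taint reuse the values from earlier in the article. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

CLUSTER_NAME="${CLUSTER_NAME:-ai-optimized-gke}"
REGION="${REGION:-asia-southeast2}"

# gpu-pool-t4-spot is a hypothetical pool name; --spot requests Spot VMs,
# which can be reclaimed at any time, so use them only for interruptible jobs.
create_spot_gpu_pool() {
  run gcloud container node-pools create gpu-pool-t4-spot \
    --cluster "$CLUSTER_NAME" \
    --region "$REGION" \
    --spot \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --machine-type n1-standard-8 \
    --num-nodes 0 \
    --enable-autoscaling --min-nodes 0 --max-nodes 10 \
    --node-labels workload=ai,gpu=t4 \
    --node-taints ai-gpu=true:NoSchedule
}

# List node pools so unused ones can be spotted and removed.
list_pools() {
  run gcloud container node-pools list \
    --cluster "$CLUSTER_NAME" --region "$REGION"
}

create_spot_gpu_pool
list_pools
```

GKE labels Spot nodes with cloud.google.com/gke-spot=true, so training jobs can target them with a nodeSelector while latency-sensitive inference stays on the on-demand pool. An unused pool can then be removed with `gcloud container node-pools delete`.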

Cloud Logging and Cloud Monitoring (formerly Stackdriver): Enabled by default; check logs and metrics in the Google Cloud console.
Prometheus + Grafana: For custom AI metrics.
Node Problem Detector: Checks the hardware health of nodes.

With this setup, you can run AI/ML workloads (LLMs, NLP, Computer Vision, and so on) on GKE in a way that is secure, scalable, and cost-efficient.
If you need YAML templates, Helm charts, or automation via Terraform, mention it in the comments!

Interested?
Bookmark, share, and follow for updates on DevOps, Kubernetes, and AI Engineering.

Bonus: Terraform Example (Partial)

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool-t4"
  cluster  = google_container_cluster.ai_optimized.name
  location = google_container_cluster.ai_optimized.location

  # The original set node_count = 1; with autoscaling enabled,
  # initial_node_count is used here so Terraform does not fight the autoscaler.
  initial_node_count = 1

  node_config {
    machine_type = "n1-standard-8"

    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }

    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

    labels = {
      workload = "ai"
      gpu      = "t4"
    }

    taint {
      key    = "ai-gpu"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }
}
