
    Building a Google Kubernetes Engine (GKE) Cluster Optimized for AI/ML | by Xb4sh | Jul, 2025

    By Team_AIBS News | July 15, 2025 | 3 min read


    The growth of AI and Machine Learning (ML) is driving demand for infrastructure that is scalable, flexible, and cost-efficient. Google Kubernetes Engine (GKE) is one of the most capable platforms for running AI workloads, from model serving and inference to large-scale training.

    In this article, I walk through building a complete AI-optimized GKE cluster step by step, including GPU node setup, autoscaling, security, and cost-saving tips. It is aimed at practitioners who want a production-ready setup, not just a PoC.

    Why GKE for AI/ML:

    • Automation: Deployments, scaling, and rolling updates are straightforward.
    • GPU/TPU support: Native NVIDIA support saves setup time.
    • Cost efficiency: The autoscaler and preemptible nodes reduce training and inference costs.
    • Managed security: Workload Identity, private nodes, and network policies.
    • Google Cloud integration: Stackdriver (now Cloud Logging and Cloud Monitoring), BigQuery, GCS, IAM, and more.

    Architecture choices:

    • Regional vs. zonal: Choose regional for a higher availability SLA.
    • Separate node pools: Keep GPU and CPU workloads apart.
    • Autoscaling: Enable it for efficient resource use.
    • Workload Identity: Secure access without service account keys.
    • Private nodes: Avoid public exposure.
    • VPC-native: Networking that is more secure and scalable.

    a. Create the Main Cluster

    export PROJECT_ID=your-project-id
    export REGION=asia-southeast2
    export CLUSTER_NAME=ai-optimized-gke

    # Replace 1.2.3.4/32 with your office IP range.
    # Note: --enable-stackdriver-kubernetes is deprecated in recent gcloud
    # releases in favor of the --logging and --monitoring flags.
    gcloud container clusters create $CLUSTER_NAME \
      --region $REGION \
      --release-channel regular \
      --enable-ip-alias \
      --enable-private-nodes \
      --enable-autoscaling \
      --enable-autoprovisioning \
      --min-cpu 4 \
      --max-cpu 64 \
      --min-memory 16 \
      --max-memory 512 \
      --enable-shielded-nodes \
      --workload-pool=$PROJECT_ID.svc.id.goog \
      --machine-type "e2-standard-8" \
      --num-nodes "1" \
      --enable-stackdriver-kubernetes \
      --addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
      --no-enable-basic-auth \
      --no-issue-client-certificate \
      --enable-master-authorized-networks \
      --master-authorized-networks 1.2.3.4/32

    b. Add a GPU Node Pool (e.g., NVIDIA T4/A100/L4)

    gcloud container node-pools create gpu-pool-t4 \
      --cluster $CLUSTER_NAME \
      --region $REGION \
      --accelerator type=nvidia-tesla-t4,count=1 \
      --machine-type n1-standard-8 \
      --num-nodes 0 \
      --min-nodes 0 \
      --max-nodes 10 \
      --enable-autoscaling \
      --node-labels workload=ai,gpu=t4 \
      --node-taints ai-gpu=true:NoSchedule \
      --scopes=cloud-platform

    c. Add a High-Memory Node Pool (Optional)

    gcloud container node-pools create high-mem-pool \
      --cluster $CLUSTER_NAME \
      --region $REGION \
      --machine-type n2-highmem-32 \
      --num-nodes 0 \
      --min-nodes 0 \
      --max-nodes 10 \
      --enable-autoscaling \
      --node-labels workload=ai,mem=high \
      --node-taints ai-mem=true:NoSchedule

    Install the NVIDIA GPU drivers on the GPU nodes by running:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml

    Tip: For Ubuntu nodes or GKE Autopilot, check the official GKE GPU documentation.
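
    Once the driver DaemonSet is running, a short-lived pod is one way to confirm that a GPU is actually schedulable. The sketch below only writes a manifest that runs nvidia-smi on the T4 pool, reusing the gpu=t4 label and the ai-gpu taint from step b. The pod name gpu-smoke-test, the file name, and the CUDA image tag are assumptions chosen for illustration, not part of the original setup.

```shell
# Write a one-shot pod manifest that runs nvidia-smi on a T4 node.
# Pod name, file name, and image tag are illustrative assumptions.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  # Land on the GPU pool created in step b.
  nodeSelector:
    gpu: t4
  tolerations:
  - key: "ai-gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "wrote gpu-smoke-test.yaml"
```

    Apply it with kubectl apply -f gpu-smoke-test.yaml and read the result with kubectl logs gpu-smoke-test. Because the pool scales from zero, the pod may sit in Pending for a few minutes while a node is provisioned.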

    Security notes:

    • Workload Identity: Enabled by default in the script above, and far safer than service account keys.
    • Private nodes: No node has a public IP.
    • Master authorized networks: The control plane is reachable only from specified IPs.
    • Shielded nodes: Protect against boot-level malware.
    • Network Policy: Isolate pod traffic and enforce zero trust.
    • IAM audit: Apply least privilege and use narrowly scoped GCP service accounts.
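
    To make the Workload Identity item concrete, here is a minimal sketch of the Kubernetes side of the mapping. It assumes a Google service account already exists; the names inference-ksa and inference-gsa and the default namespace are hypothetical. The script writes an annotated ServiceAccount manifest and prints the IAM member string that the binding needs.

```shell
# Hypothetical names; adjust to your project.
PROJECT_ID="${PROJECT_ID:-your-project-id}"
NAMESPACE="default"
KSA_NAME="inference-ksa"   # Kubernetes service account
GSA_NAME="inference-gsa"   # Google service account (assumed to exist already)
GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Kubernetes service account annotated with the Google service account it maps to.
cat > inference-ksa.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${KSA_NAME}
  namespace: ${NAMESPACE}
  annotations:
    iam.gke.io/gcp-service-account: ${GSA_EMAIL}
EOF

# Member string used when granting roles/iam.workloadIdentityUser on the GSA.
WI_MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
echo "$WI_MEMBER"
```

    After kubectl apply -f inference-ksa.yaml, the binding can be granted with gcloud iam service-accounts add-iam-policy-binding "$GSA_EMAIL" --role roles/iam.workloadIdentityUser --member "$WI_MEMBER". A workload then opts in by setting serviceAccountName to the Kubernetes service account in its pod spec.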

    A simple example YAML for PyTorch inference on a GPU node:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: pytorch-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: pytorch
      template:
        metadata:
          labels:
            app: pytorch
        spec:
          containers:
          - name: pytorch
            image: pytorch/torchserve:latest
            resources:
              limits:
                nvidia.com/gpu: 1
          nodeSelector:
            gpu: t4
          tolerations:
          - key: "ai-gpu"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
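
    The Network Policy item from the security notes could look like the sketch below for this Deployment. It writes a policy that admits traffic to the app: pytorch pods only from pods carrying a role: inference-client label, on TCP port 8080 (the default TorchServe inference port). The policy name, the client label, and the file name are hypothetical. Note also that the cluster command in step a does not turn on network policy enforcement, so this only takes effect if enforcement is enabled separately (for example with --enable-network-policy or Dataplane V2).

```shell
# Write a NetworkPolicy that restricts ingress to the pytorch pods.
# Policy name, client label, and file name are illustrative assumptions.
cat > pytorch-netpol.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pytorch-allow-clients
spec:
  # Applies to the pods created by the pytorch-inference Deployment.
  podSelector:
    matchLabels:
      app: pytorch
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: inference-client
    ports:
    - protocol: TCP
      port: 8080
EOF
echo "wrote pytorch-netpol.yaml"
```

    Apply it with kubectl apply -f pytorch-netpol.yaml. Once a pod is selected by an Ingress policy, any ingress not matched by a rule is dropped, so health checks or load balancers that need access must be allowed explicitly.
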

    Cost-saving tips:

    • Autoscaler: Enable it on every node pool.
    • Preemptible GPUs: Use them for training that can tolerate interruption.
    • Scale-to-zero: Node pools can be scaled down to 0, so the cluster idles automatically.
    • Monitor billing: Set up alerts in GCP Billing.
    • Review resources: Delete unused node pools.

    Monitoring:

    • Stackdriver (Ops Agent): Enabled by default; check logs and metrics in GCP.
    • Prometheus + Grafana: For custom AI metrics.
    • Node Problem Detector: Checks the health of node hardware.
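
    For the interruptible-training tip, one option is a second T4 pool on Spot VMs, the newer successor to preemptible VMs. The sketch below mirrors the flags from step b and adds --spot. The pool name gpu-pool-t4-spot and the ai-gpu-spot taint are hypothetical, and the taint is deliberately different so the inference Deployment does not land on interruptible nodes. Since the command creates billable resources, the script prints it for review and runs it only when APPLY=1 is set.

```shell
# Sketch: a Spot variant of the T4 pool for training jobs that can be interrupted.
# Pool name and taint key are illustrative assumptions.
CLUSTER_NAME="${CLUSTER_NAME:-ai-optimized-gke}"
REGION="${REGION:-asia-southeast2}"

# Build the command as positional parameters so it can be printed before running.
set -- gcloud container node-pools create gpu-pool-t4-spot \
  --cluster "$CLUSTER_NAME" \
  --region "$REGION" \
  --spot \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --machine-type n1-standard-8 \
  --num-nodes 0 \
  --enable-autoscaling --min-nodes 0 --max-nodes 10 \
  --node-labels workload=ai-training,gpu=t4 \
  --node-taints ai-gpu-spot=true:NoSchedule

# Print the command for review; execute it only when APPLY=1.
echo "$*"
if [ "${APPLY:-0}" = "1" ]; then
  "$@"
fi
```

    Training pods would then tolerate ai-gpu-spot and should checkpoint regularly, since Spot nodes can be reclaimed at any time. GKE also labels these nodes with cloud.google.com/gke-spot=true, which can be used in a nodeSelector.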

    With this setup, you can run AI/ML workloads (LLMs, NLP, Computer Vision, and so on) on GKE in a secure, scalable, and cost-efficient way.
    If you need YAML templates, a Helm chart, or automation via Terraform, mention it in the comments!

    Interested?
    Bookmark, share, and follow for updates on DevOps, Kubernetes, and AI Engineering.

    Bonus: Terraform Example (Partial)

    resource "google_container_node_pool" "gpu_pool" {
      name     = "gpu-pool-t4"
      cluster  = google_container_cluster.ai_optimized.name
      location = google_container_cluster.ai_optimized.location
      # The provider docs advise against combining node_count with autoscaling;
      # initial_node_count is the alternative.
      node_count = 1

      node_config {
        machine_type = "n1-standard-8"
        guest_accelerator {
          type  = "nvidia-tesla-t4"
          count = 1
        }
        oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
        labels = {
          workload = "ai"
          gpu      = "t4"
        }
        taint {
          key    = "ai-gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }

      autoscaling {
        min_node_count = 0
        max_node_count = 10
      }
    }
