Implementing Horizontal Pod Autoscaling in Kubernetes

March 22, 2025
Kubernetes
Autoscaling
DevOps
Performance

At Vectra AI, I designed and implemented Kubernetes Ingress rules and horizontal pod autoscaling policies within AWS EKS. In this article, I'll share how to set up effective autoscaling for your applications.

What is Horizontal Pod Autoscaling?

Horizontal Pod Autoscaling (HPA) automatically increases or decreases the number of pod replicas in a workload such as a Deployment, based on observed metrics like CPU utilization or memory usage, so that the observed value stays close to a target you define.

Prerequisites

  • A running Kubernetes cluster
  • Metrics Server installed
  • An application deployment to scale

Setting Up Metrics Server

The Metrics Server collects resource metrics from the kubelets on each node and exposes them through the Kubernetes API server. Install the latest release with:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
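
Once the metrics-server deployment in kube-system is ready, the top subcommands should start returning data. A quick sanity check:

kubectl -n kube-system get deployment metrics-server
kubectl top nodes

If kubectl top nodes reports CPU and memory figures rather than an error, the HPA controller will be able to read resource metrics.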

Creating an HPA

Here's a basic HPA manifest that scales a deployment based on CPU usage:


      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: my-app-hpa
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: my-app
        minReplicas: 2
        maxReplicas: 10
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 50
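
Apply the manifest with kubectl apply -f (using whatever filename you saved it under), or create an equivalent HPA imperatively:

kubectl autoscale deployment my-app --cpu-percent=50 --min=2 --max=10

Either way, the HPA controller will keep average CPU utilization across the my-app pods near 50%, scaling between 2 and 10 replicas.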
      

Advanced Configurations

For more complex scenarios, you can:

  • Scale based on multiple metrics (CPU and memory)
  • Use custom metrics from your application
  • Configure scaling behavior and stabilization windows (sketched after this list)
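
As a rough sketch of the first and third points (the thresholds and policy values below are illustrative, not recommendations), the spec of the earlier manifest could be extended with a second resource metric and a behavior block. Custom application metrics additionally require a metrics adapter such as the Prometheus Adapter, which is beyond the scope of this article.

        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 50
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 70
        behavior:
          scaleUp:
            stabilizationWindowSeconds: 0    # react to spikes immediately
            policies:
            - type: Pods
              value: 4                       # add at most 4 pods per minute
              periodSeconds: 60
          scaleDown:
            stabilizationWindowSeconds: 300  # require 5 minutes of low load before shrinking
            policies:
            - type: Percent
              value: 50                      # remove at most half the pods per minute
              periodSeconds: 60

When multiple metrics are listed, the HPA computes a desired replica count for each and uses the highest, so the busiest resource wins.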

Testing Your HPA

Generate load on your application and watch the HPA in action:

kubectl get hpa my-app-hpa --watch
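
One simple way to generate that load is a throwaway busybox pod that requests the app in a tight loop (this assumes a Service named my-app exposing the deployment over HTTP; adjust the URL to match your setup):

kubectl run load-generator --rm -it --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://my-app; done"

Within a minute or two, the watch output should show the CPU target percentage climbing and REPLICAS increasing; stop the load generator and the HPA will scale back down after its stabilization window (five minutes by default).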

Best Practices

  • Set appropriate resource requests and limits on your containers (see the snippet after this list)
  • Choose scaling thresholds carefully
  • Consider application startup time when setting scaling policies
  • Monitor scaling events and adjust as needed
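
On the first point: Utilization targets are computed as a percentage of each container's requests, so the HPA above can do nothing for pods that don't declare a CPU request. A minimal sketch of the resources block inside the Deployment's container spec (the numbers are placeholders to tune for your workload):

      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi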

Conclusion

Implementing Horizontal Pod Autoscaling in Kubernetes significantly improves application resilience and scalability while optimizing resource usage. It's an essential tool for running production workloads efficiently.