Why is HPA request based?

Apoorva Jagtap
Apr 23, 2021


Those who have been playing around with Kubernetes or any of its flavors must have tried to configure HPA (Horizontal Pod Autoscaler) at least once, to avoid having to keep a continuous watch for overconsumption of resources whenever the application serves a high load.

By this time, I believe we all know how to configure HPA for a specific application. Working with multiple customers made me realize that there is still some confusion around how HPA calculates resource consumption and decides to scale up or down.

Honestly, the Kubernetes documentation covers all of the relevant details very precisely, but one might need to go through multiple documents to catch up on all the aspects of HPA. So, to avoid the hassle of hunting through the relevant docs, I will try to put all I’ve learned about HPA in one place.

Request Based:

  • As things currently stand, the CPU and memory utilization targets specified in an HPA (referred to here as targetCPUUtilization and targetMemoryUtilization; the actual fields are targetCPUUtilizationPercentage in autoscaling/v1 and averageUtilization in autoscaling/v2) are calculated as a percentage of requests.cpu and requests.memory, respectively. A detailed example could be checked here.
  • There have been several upstream discussions about changing HPA’s algorithm to be based on limits instead of requests.
  • However, one such issue is still open, as there is no agreement on which would be more efficient (limits or requests).

TargetPercentage within the range of 100?

  • When we talk about setting a utilization percentage, we are most likely to think that the target must be less than or equal to 100%.
  • However, this is *not* necessary in the case of HPA. I myself was a bit confused about the “why” and “how” of this part, but going through multiple discussions helped me understand it.
  • Because HPA currently uses resources.requests as its base to calculate and compare resource utilization, setting a target above 100% should not cause any problem as long as the resulting threshold (targetUtilization percent of the request) is less than or equal to resources.limits.
  • For example, deploy an application with resources.requests.cpu=200m and resources.limits.cpu="4" for each container. For this application, configure an HPA with targetCPUUtilization=300%. Now, each time the average consumption across all application pods reaches 300% of 200m (requests.cpu), i.e. 600m, HPA scales up by adding new pods.
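
The arithmetic in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names (parse_cpu_millicores, utilization_percent, scale_up_threshold_m) are made up for this post and are not part of any Kubernetes client library, and the parser handles only plain CPU quantities such as "200m" or "4".

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "200m" or "4" into millicores.

    Hypothetical helper: only handles the two simple forms used in this post.
    """
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def utilization_percent(avg_usage_m: float, request_m: float) -> float:
    """Utilization as HPA sees it: usage as a percentage of the request."""
    return 100.0 * avg_usage_m / request_m


def scale_up_threshold_m(request_m: float, target_percent: float) -> float:
    """Average usage (in millicores) at which the target percentage is reached."""
    return request_m * target_percent / 100


request_m = parse_cpu_millicores("200m")   # resources.requests.cpu
limit_m = parse_cpu_millicores("4")        # resources.limits.cpu
threshold_m = scale_up_threshold_m(request_m, 300)

print(threshold_m)              # average usage that corresponds to the 300% target
print(threshold_m <= limit_m)   # the threshold stays within the CPU limit
```

With requests.cpu=200m and a 300% target, the threshold works out to 600m, which is well below the 4000m limit, so a target above 100% is perfectly workable here.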

How many pods would scale up?

Kubernetes follows a simple algorithm to calculate the number of pods to scale up or down to, as specified here.
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
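
To get a feel for the formula, here is a minimal Python sketch of it. The function name desired_replicas is my own, and the sketch deliberately leaves out what the real controller adds on top (a tolerance around the target and the minReplicas/maxReplicas bounds), so treat it as the bare formula only.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float) -> int:
    """Bare HPA formula: ceil(currentReplicas * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * (current_metric / desired_metric))


# 3 pods averaging 900m CPU against a 600m target: scale up
print(desired_replicas(3, 900, 600))

# 4 pods averaging 300m CPU against a 600m target: scale down
print(desired_replicas(4, 300, 600))
```

In the first call the ratio is 1.5, so 3 pods become ceil(4.5) = 5. In the second the ratio is 0.5, so 4 pods become 2.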

What’s the waiting period?

While designing HPA, preventing constant fluctuation in the number of pods due to changing traffic/load was a crucial concern, and for this there is a scale-down stabilization window of 5 minutes by default.
HPA monitors the load on the application, and if the currentUtilization stays below the target for 5 minutes, it starts scaling down the pods. However, if the load rises above the target again within the stabilization window, HPA waits for another 5 minutes before scaling down.
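
One way to picture the stabilization window is as a short history of replica recommendations, where the highest recommendation seen inside the window wins. The Python below is a simplified sketch of that idea under my own assumptions: the class ScaleDownStabilizer and its method names are hypothetical, timestamps are plain seconds, and it is not the controller's actual code.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent recommendations and return the highest one in the window."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilized(self, now: int, recommendation: int) -> int:
        self._history.append((now, recommendation))
        # Forget recommendations that are older than the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # A scale-down only takes effect once no higher recommendation remains.
        return max(replicas for _, replicas in self._history)


s = ScaleDownStabilizer()
print(s.stabilized(0, 5))     # load is high: 5 replicas recommended
print(s.stabilized(60, 2))    # load dropped, but the 5 is still in the window
print(s.stabilized(250, 6))   # load spiked again: scale up right away
print(s.stabilized(400, 2))   # the spike at t=250 is still inside the window
print(s.stabilized(551, 2))   # window is clear of high values: scale down
```

Note how the spike at t=250 restarts the wait: the scale-down to 2 replicas only goes through once that spike is more than 300 seconds old.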

  • With later enhancements, the waiting period can be modified as required. Curious to learn about those with pseudocode? Check out here.