Managing updates for worker nodes in an Amazon Elastic Kubernetes Service (EKS) cluster is a critical task for maintaining security, stability, and performance. This article delves into various strategies for performing AWS server update rollouts on EKS worker nodes, comparing self-managed worker nodes with managed node groups and exploring different rollout mechanisms. We aim to provide a comprehensive overview to help you choose the best approach for your specific needs.
Self-Managed Worker Nodes: Rolling Update Options
When you manage your own worker nodes in EKS, you have several options for performing rolling updates. Let’s examine the most common ones:
Instance Refresh with Node Termination Handler
Amazon EC2 Auto Scaling’s Instance Refresh feature, combined with the node-termination-handler in queue mode, is one approach to updating self-managed nodes.
How it works: Instance Refresh automates the process of replacing instances in an Auto Scaling Group (ASG). When initiated, it replaces instances based on a configuration you define, such as a new launch template or instance type. The node-termination-handler is crucial here; in queue mode, it gracefully handles node terminations by detecting termination signals and cordoning and draining the node before actual termination.
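As a rough illustration, the sketch below starts an instance refresh on a worker-node ASG with boto3. The ASG name and preference values are placeholders, and the right settings (minimum healthy percentage, warmup time) depend on your workload.

```python
# Minimal sketch: starting an Instance Refresh on a worker-node ASG with boto3.
# The ASG name and preference values are illustrative, not prescriptive.
import boto3

autoscaling = boto3.client("autoscaling")

response = autoscaling.start_instance_refresh(
    AutoScalingGroupName="eks-worker-asg",  # hypothetical ASG name
    Preferences={
        "MinHealthyPercentage": 90,  # keep most capacity in service during the refresh
        "InstanceWarmup": 300,       # seconds before a new instance counts as healthy
    },
)
print("Instance refresh started:", response["InstanceRefreshId"])
```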
Pros:
- Automation: Instance Refresh automates the replacement of instances, reducing manual effort.
- Configuration Driven: Updates are driven by ASG configurations, ensuring consistency.
Cons:
- Terminate-Then-Create Behavior: A significant drawback of Instance Refresh is its default behavior of terminating instances before launching new ones. As highlighted in AWS documentation, this can lead to downtime, especially in single-instance ASGs.
"During an instance refresh, Amazon EC2 Auto Scaling takes a set of instances out of service, terminates them, and then launches a set of instances with the new configuration."
"Instances terminated before launch: When there is only one instance in the Auto Scaling group, starting an instance refresh can result in an outage because Amazon EC2 Auto Scaling terminates an instance and then launches a new instance."
This inherent behavior necessitates careful consideration of ASG size and application redundancy. For an instance refresh to operate without downtime, you likely need a minimum ASG size of two, and your applications need sufficient replicas to tolerate node unavailability during the update (a pre-flight check along these lines is sketched after this list).
- Potential for Downtime: The terminate-first approach can cause temporary capacity reduction and potential downtime if not properly planned with sufficient redundancy. The time it takes for a new instance to become ready (e.g., around 120 seconds for Amazon Linux) can be longer than the termination process, exacerbating this issue.
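Building on the ASG-size point above, here is a minimal pre-flight check, assuming a hypothetical ASG name, that refuses to start a refresh if the group cannot absorb a terminate-then-create replacement:

```python
# Minimal sketch: pre-flight check before starting an Instance Refresh.
# Verifies the ASG is large enough that terminating one instance first
# does not drop the group to zero usable capacity.
import boto3

autoscaling = boto3.client("autoscaling")

asg_name = "eks-worker-asg"  # hypothetical
asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name]
)["AutoScalingGroups"][0]

if asg["MinSize"] < 2 or asg["DesiredCapacity"] < 2:
    raise SystemExit(
        f"{asg_name}: MinSize={asg['MinSize']}, DesiredCapacity={asg['DesiredCapacity']} "
        "- an instance refresh here risks an outage; add capacity and replicas first."
    )
print(f"{asg_name} has enough capacity to tolerate a rolling refresh.")
```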
Use Cases: Instance Refresh can be suitable for non-critical environments or when downtime is acceptable. In production environments, it requires careful planning with adequately sized ASGs and application replicas to mitigate downtime risks.
eks-rolling-update Script
The eks-rolling-update script, available on GitHub, provides another method for rolling updates of self-managed worker nodes.
How it works: This script orchestrates rolling updates by iterating through nodes, cordoning and draining them, and then allowing new nodes to join the cluster.
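For illustration only, the sketch below shows the general create-then-terminate, cordon-and-drain pattern such a tool follows; it is not the script's actual code, the ASG and node names are placeholders, and a real drain would also skip DaemonSet pods and wait for replacement nodes to become Ready before evicting workloads.

```python
# Minimal sketch of the cordon-and-drain pattern, not eks-rolling-update itself:
# add replacement capacity first, then cordon and drain each outdated node.
import boto3
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
autoscaling = boto3.client("autoscaling")

ASG_NAME = "eks-worker-asg"                 # hypothetical
OLD_NODES = ["ip-10-0-1-23.ec2.internal"]   # nodes running the outdated launch template

# 1. Create-then-terminate: raise desired capacity before touching existing nodes.
asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=asg["DesiredCapacity"] + len(OLD_NODES),
)

for node in OLD_NODES:
    # 2. Cordon: mark the node unschedulable so no new pods land on it.
    v1.patch_node(node, {"spec": {"unschedulable": True}})

    # 3. Drain: evict each pod via the Eviction API (this respects PodDisruptionBudgets;
    #    a real drain would also skip DaemonSet and mirror pods).
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
    for pod in pods.items:
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction
        )
    # 4. The old instance can then be terminated and the ASG scaled back down (omitted).
```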
Pros:
- Create-Then-Terminate Approach: Unlike Instance Refresh, eks-rolling-update typically follows a create-then-terminate approach (depending on configuration and implementation details), which is generally safer for rolling updates as it ensures new capacity is available before removing old nodes.
Cons:
- Script Maintenance and Reliability: As a community-maintained script, its reliability and long-term maintenance can be a concern for production environments. Past experiences suggest potential issues that might require manual intervention, making fully unattended operation risky in some cases.
- Potential for Manual Intervention: While designed for automation, unforeseen issues during script execution could leave the cluster in an inconsistent state, demanding manual correction.
Use Cases: eks-rolling-update may be suitable for teams comfortable with community-supported tools and willing to invest time in testing and monitoring its behavior in their specific environment. However, for mission-critical production systems, the potential for instability and manual intervention should be carefully weighed.
Manual Rolling Updates (Generally Not Recommended)
While technically possible, manually performing rolling updates on worker nodes is generally discouraged due to its error-prone and time-consuming nature. It involves manually cordoning, draining, and terminating instances, then launching and configuring new ones. This method lacks automation and scalability, increasing the risk of human error and inconsistencies.
Managed Node Groups: AWS-Managed Rollouts
AWS Managed Node Groups offer a streamlined approach to managing worker nodes and their updates.
Advantages of Managed Node Groups:
- Simplified Management: AWS handles much of the underlying infrastructure management, reducing operational overhead.
- Integrated Updates: Managed node groups provide built-in rolling update capabilities, automatically triggered by events like AMI changes or template updates (an update can also be started explicitly, as sketched after this list).
- Feature Richness: While initially having limitations, managed node groups have evolved significantly and now offer most features required for production deployments, with the primary remaining limitation being the inability to scale to zero nodes.
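Although these rollouts are triggered automatically on configuration changes, you can also initiate one explicitly. A minimal sketch with boto3, using placeholder cluster, node group, and version values:

```python
# Minimal sketch: explicitly rolling a managed node group to a newer Kubernetes
# version / AMI release with boto3. All names and versions are placeholders.
import boto3

eks = boto3.client("eks")

update = eks.update_nodegroup_version(
    clusterName="prod-cluster",       # hypothetical
    nodegroupName="general-workers",  # hypothetical
    version="1.29",                   # target Kubernetes minor version
    force=False,                      # do not override PodDisruptionBudgets
)
print("Update started:", update["update"]["id"])
```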
Managed Node Group Update Behavior and Inefficiencies
AWS-managed rolling updates are automatically initiated when changes are detected. However, the update mechanism can exhibit inefficiencies, particularly in single-node ASGs.
The Issue: As documented and graphically illustrated in the AWS containers roadmap issue, the managed node update process can temporarily over-provision instances. For single-node ASGs, this can result in a significant temporary increase in the number of instances before scaling back down.
"Increments the Auto Scaling group maximum size and desired size by one up to twice the number of Availability Zones in the Region that the Auto Scaling group is deployed in."
In practice, for a single-node ASG, this behavior can lead to the ASG temporarily scaling up to as many as 7 instances before returning to the desired state of 1.
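If you want to observe this behavior in your own cluster, a simple polling loop like the following (the ASG name and interval are placeholders) records the peak number of in-service instances during an update:

```python
# Minimal sketch: watch how far an ASG over-provisions during a managed node
# group update by polling its in-service instance count.
import time
import boto3

autoscaling = boto3.client("autoscaling")
ASG_NAME = "eks-managed-ng-asg"  # hypothetical

peak = 0
for _ in range(60):  # poll for roughly 30 minutes
    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    in_service = sum(1 for i in asg["Instances"] if i["LifecycleState"] == "InService")
    peak = max(peak, in_service)
    print(f"desired={asg['DesiredCapacity']} in_service={in_service} peak={peak}")
    time.sleep(30)
```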
Consequences:
- Increased Update Time: The over-provisioning and subsequent scale-down process significantly lengthens the update time. For a single-node ASG, updates can take upwards of 30 minutes.
- Cost Inefficiency: During the update, you are billed for the additional instances, which can be substantial, especially with expensive instance types like GPU instances.
- Operational Overhead: The extended update times can delay deployments and increase the overall operational burden.
Acceptability: For multi-node ASGs (e.g., with a minimum of 3 nodes), this temporary over-provisioning might be acceptable in many production scenarios. However, for use cases involving single-node ASGs, especially in large clusters or with cost-sensitive applications, the inefficiency becomes a significant concern.
Optimization: Multiple ASGs per Availability Zone
One potential optimization to mitigate the impact of over-provisioning during managed node group updates is to create multiple ASGs, one for each Availability Zone (AZ), for the same worker role.
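One way to realize this with managed node groups is to pin each node group to a single subnet, which yields one ASG per AZ. A rough sketch with boto3, where the cluster name, role ARN, subnet IDs, and sizes are all placeholders:

```python
# Minimal sketch: one managed node group (and therefore one ASG) per Availability
# Zone for the same worker role. All identifiers below are hypothetical.
import boto3

eks = boto3.client("eks")

CLUSTER = "prod-cluster"
NODE_ROLE_ARN = "arn:aws:iam::123456789012:role/eks-worker"
SUBNETS_BY_AZ = {          # one subnet per AZ keeps each ASG zonal
    "us-east-1a": "subnet-0aaa",
    "us-east-1b": "subnet-0bbb",
    "us-east-1c": "subnet-0ccc",
}

for az, subnet in SUBNETS_BY_AZ.items():
    eks.create_nodegroup(
        clusterName=CLUSTER,
        nodegroupName=f"workers-{az}",
        nodeRole=NODE_ROLE_ARN,
        subnets=[subnet],  # pinning to a single subnet confines the ASG to one AZ
        scalingConfig={"minSize": 1, "maxSize": 3, "desiredSize": 1},
        instanceTypes=["m5.large"],
        labels={"role": "worker"},
    )
```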
Benefits:
- Reduced Instance Spikes: By distributing nodes across multiple ASGs, the temporary instance increase during updates is spread out, reducing the peak number of extra instances.
- Faster Updates: Updates can potentially be faster as each AZ’s ASG is updated more independently, reducing the serial nature of updates in a single large ASG.
Considerations:
- Increased Complexity: Managing multiple ASGs adds some complexity to infrastructure management.
- AZ Balancing: Careful planning is needed to ensure proper balancing of nodes across AZs and to maintain high availability.
Conclusion: Choosing the Right AWS Server Update Rollout Strategy
Selecting the optimal AWS server update rollout strategy for your EKS worker nodes depends on your specific requirements, risk tolerance, and operational capabilities.
- Managed Node Groups: For most production environments, managed node groups offer a compelling balance of simplicity and functionality. While the update mechanism can be inefficient in single-node ASG scenarios, the overall reduced management overhead and integrated updates make them a strong contender. Consider the AZ-based ASG optimization for mitigating update inefficiencies.
- Self-Managed Nodes with Instance Refresh: Instance Refresh can be viable for less critical environments or when cost optimization is paramount and downtime is tolerable. Careful planning for redundancy and understanding the terminate-first behavior are crucial for minimizing risks.
- Self-Managed Nodes with eks-rolling-update: This script provides more control and potentially a safer create-then-terminate approach. However, the reliance on a community-maintained tool and potential for instability should be carefully evaluated for production use.
- Manual Updates: Manual updates should be avoided in production due to their inherent risks and inefficiencies.
Ultimately, thorough testing and monitoring are essential for validating your chosen update strategy and ensuring smooth and reliable AWS server update rollouts in your EKS environment. Consider your application’s availability requirements, cost constraints, and operational expertise when making your decision.