Overview
What is Armada?
Armada is a multi-Kubernetes cluster batch job meta-scheduler designed to handle massive-scale workloads. Built on top of Kubernetes, Armada enables organizations to distribute millions of batch jobs per day across tens of thousands of nodes spanning multiple clusters, making it an ideal solution for high-throughput computational workloads.
Armada serves as middleware that transforms Kubernetes into a powerful batch processing platform while maintaining compatibility with service workloads. It addresses the fundamental limitations of running batch workloads at scale on Kubernetes by providing:
- Multi-cluster orchestration: Schedule jobs across many Kubernetes clusters seamlessly
- High-throughput queueing: Handle millions of queued jobs
- Advanced batch scheduling: Fair queuing, gang scheduling, preemption, and resource limits
- Enterprise-grade reliability: Secure, highly available components designed for production use
As a CNCF Sandbox project, Armada is actively maintained and used in production environments, including at G-Research where it processes millions of jobs daily.
Why Use Armada?
Kubernetes Limitations for Batch Workloads
Traditional Kubernetes faces several challenges when running batch workloads at scale:
- Single Cluster Scaling Limits: Scaling a single Kubernetes cluster beyond a certain size is challenging, typically maxing out at around 5,000-15,000 nodes depending on configuration.
- Storage Backend Constraints: etcd, Kubernetes' in-cluster storage backend, has performance limitations that make very high throughput difficult to achieve and can become a bottleneck for job queueing.
- Inadequate Batch Scheduling: The default kube-scheduler lacks essential batch scheduling features such as fair queuing, gang scheduling, and intelligent preemption.
Armada's Solution
Armada overcomes these limitations by:
- Distributing across multiple clusters: Manage thousands of nodes across many Kubernetes clusters
- Partial out-of-cluster scheduling: Leverage external storage backends (e.g., PostgreSQL and Redis) for high-throughput batch job queueing and scheduling
- Purpose-built batch scheduler: Include advanced scheduling features designed specifically for batch workloads
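The combination of an out-of-cluster queue and multi-cluster placement can be sketched in a few lines. This is a conceptual illustration only, not Armada's actual implementation: the queue, capacity figures, and best-fit policy are all assumptions made for the example.

```python
# Conceptual sketch (not Armada's real code): jobs wait in an external
# queue, and each job is placed on whichever cluster currently has the
# most free capacity, so the pool of clusters behaves like one big one.
from collections import deque

clusters = {"cluster-a": 10.0, "cluster-b": 6.0}  # free CPU per cluster (illustrative)
queue = deque([("job-1", 4.0), ("job-2", 5.0), ("job-3", 6.0)])

placements = {}
while queue:
    job_id, cpu = queue.popleft()
    # Consider only clusters with enough free capacity for this job.
    candidates = [c for c, free in clusters.items() if free >= cpu]
    if not candidates:
        queue.append((job_id, cpu))  # stays queued until capacity frees up
        break
    target = max(candidates, key=clusters.get)  # most-free-capacity first
    clusters[target] -= cpu
    placements[job_id] = target

print(placements)
```

Because the queue lives outside any one cluster, it can grow far beyond what a single cluster's etcd could hold, which is the point of the out-of-cluster design.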
Key Features and Benefits
Core Scheduling Features
Fair-Use Scheduling
- Maintains fair resource share over time across users and teams
- Based on dominant resource fairness principles
- Includes priority factors for different queues
- Inspired by HTCondor priority systems
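The core of dominant resource fairness can be shown in a short sketch. The cluster totals, usage figures, and the way the priority factor is applied here are illustrative assumptions, not Armada's exact formulas:

```python
# Sketch of dominant resource fairness (DRF): a user's "dominant share"
# is their largest fractional use of any single resource, and the
# scheduler favors whichever user currently has the smallest one.
CLUSTER = {"cpu": 100.0, "memory_gb": 400.0}  # illustrative totals

def dominant_share(usage):
    """Fraction of the cluster's most-contended resource this user holds."""
    return max(usage[r] / CLUSTER[r] for r in CLUSTER)

def next_user(usage_by_user, priority_factor=None):
    """Pick the user to schedule next: lowest dominant share wins.
    A per-queue priority factor (assumed semantics: higher factor means
    a larger fair share) scales the effective share downward."""
    priority_factor = priority_factor or {}
    return min(
        usage_by_user,
        key=lambda u: dominant_share(usage_by_user[u]) / priority_factor.get(u, 1.0),
    )

usage = {
    "team-a": {"cpu": 30.0, "memory_gb": 40.0},   # dominant share 0.30 (cpu)
    "team-b": {"cpu": 10.0, "memory_gb": 100.0},  # dominant share 0.25 (memory)
}
print(next_user(usage))  # team-b: smaller dominant share, so it goes next
```

Team-b uses more absolute memory, but its dominant share (0.25) is below team-a's (0.30), so DRF schedules team-b next; this is how fairness is maintained across heterogeneous resource demands.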
High Throughput Processing
- Handle millions of queued jobs simultaneously
- Efficient job submission and status tracking
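The submit-and-track lifecycle can be sketched with plain data structures. The function names, states, and in-memory storage below are illustrative stand-ins (Armada keeps this state in external backends such as PostgreSQL):

```python
# Sketch of submit/track semantics: jobs are appended to a named queue
# and their lifecycle states are tracked outside the cluster (here, in
# plain dicts for illustration).
import collections
import itertools

_ids = itertools.count(1)
queues = collections.defaultdict(collections.deque)
status = {}

def submit(queue_name, spec):
    """Enqueue a job spec and return its id; state starts as QUEUED."""
    job_id = f"job-{next(_ids)}"
    queues[queue_name].append((job_id, spec))
    status[job_id] = "QUEUED"
    return job_id

def lease_next(queue_name):
    """Hand the oldest queued job to an executor and mark it RUNNING."""
    job_id, spec = queues[queue_name].popleft()
    status[job_id] = "RUNNING"
    return job_id, spec

jid = submit("analytics", {"cpu": 2})
print(status[jid])   # QUEUED
lease_next("analytics")
print(status[jid])   # RUNNING
```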
Gang Scheduling
- Atomically schedule sets of related jobs
- Ensures all jobs in a group start together or not at all
- Critical for distributed computing frameworks like MPI
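The all-or-nothing property of gang scheduling can be sketched as a trial placement that is discarded unless every member fits. Node capacities and the first-fit policy here are assumptions for illustration:

```python
# Sketch of gang scheduling: a gang of jobs is placed only if every
# member fits; otherwise nothing is scheduled (all-or-nothing).
def try_schedule_gang(gang, free_nodes):
    """gang: list of per-job CPU demands; free_nodes: node -> free CPU.
    Returns a job-index -> node assignment, or None if the whole gang
    cannot be placed."""
    remaining = dict(free_nodes)  # trial copy; only committed on success
    assignment = {}
    for i, cpu in enumerate(gang):
        node = next((n for n, free in remaining.items() if free >= cpu), None)
        if node is None:
            return None  # one member does not fit: reject the entire gang
        remaining[node] -= cpu
        assignment[i] = node
    return assignment

nodes = {"node-1": 4.0, "node-2": 2.0}
print(try_schedule_gang([2.0, 2.0, 2.0], nodes))  # all three fit
print(try_schedule_gang([4.0, 4.0], nodes))       # second job cannot fit: None
```

This matters for frameworks like MPI, where a partially started job set would deadlock waiting for peers that never arrive.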
Intelligent Preemption
- Run urgent jobs in a timely fashion
- Balance resource allocation between users
- Configurable preemption policies
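One common preemption policy is to evict the lowest-priority running jobs until an urgent job fits. The sketch below illustrates that idea with assumed data shapes; it is not Armada's actual policy engine:

```python
# Sketch of priority-based preemption: to place an urgent job, evict
# strictly lower-priority running jobs (lowest priority first) until
# enough CPU is free.
def preempt_for(urgent_cpu, urgent_prio, running, free_cpu):
    """running: list of (job_id, cpu, priority). Returns the job ids to
    evict, or None if even full preemption would not free enough CPU."""
    evicted = []
    # Consider victims lowest-priority first, then smallest first.
    for job_id, cpu, prio in sorted(running, key=lambda j: (j[2], j[1])):
        if free_cpu >= urgent_cpu:
            break
        if prio < urgent_prio:
            evicted.append(job_id)
            free_cpu += cpu
    return evicted if free_cpu >= urgent_cpu else None

running = [("a", 2.0, 1), ("b", 4.0, 5), ("c", 2.0, 2)]
print(preempt_for(3.0, 10, running, free_cpu=0.0))  # evicts 'a', then 'c'
```

A real policy would also weigh fairness (who is over their fair share) rather than priority alone, but the shape of the decision is the same.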
Enterprise-Grade Operations
Massive Scale Support
- Utilize multiple Kubernetes clusters simultaneously
- Scale beyond single cluster limitations
- Add and remove clusters without service disruption
Advanced Resource Management
- Resource and job scheduling rate limits
- Detailed resource allocation controls
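Scheduling rate limits are commonly implemented as token buckets; the sketch below shows the general technique with illustrative parameters, not Armada's actual limiter:

```python
# Sketch of a token-bucket rate limiter, the usual mechanism behind
# "at most N jobs scheduled per second, with bursts up to B".
class TokenBucket:
    """Allows `rate_per_sec` sustained operations with bursts up to `burst`.
    Timestamps are caller-supplied so the logic stays deterministic."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate, self.burst = rate_per_sec, burst
        self.tokens, self.last = float(burst), now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, burst=5)
allowed = sum(bucket.allow(now=0.0) for _ in range(10))
print(allowed)  # 5: only the initial burst passes instantly
```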
Comprehensive Monitoring
- Detailed analytics via Prometheus integration
- Resource allocation and system behavior insights
- Automatic failure detection and node removal
Production-Ready Features
- Secure authentication and authorization
- High availability architecture
- Automatic node failure handling
Use Cases and Success Stories
High-Performance Computing (HPC)
- Machine Learning Training: Distribute large-scale ML training jobs across multiple clusters
- Scientific Computing: Run complex simulations and data analysis workloads
- Financial Modeling: Execute risk calculations and quantitative analysis at scale
Data Processing Pipelines
- ETL Workloads: Process large datasets with parallel batch jobs
- Data Analytics: Run distributed analytics jobs across multiple clusters
- Backup and Archival: Coordinate large-scale data movement operations
CI/CD and Development
- Build Systems: Distribute compilation and testing jobs
- Integration Testing: Run comprehensive test suites across multiple environments
- Deployment Automation: Coordinate complex deployment workflows
Production Deployment at G-Research
G-Research, a leading quantitative research company, uses Armada in production to:
- Process millions of jobs per day
- Manage tens of thousands of nodes
- Support diverse computational workloads
- Maintain high availability and performance
Comparison with Other Schedulers
vs. Native Kubernetes Scheduler
- Scale: Armada spans multiple clusters vs. single cluster limitation
- Throughput: Millions of jobs vs. thousands with native scheduler
- Batch Features: Purpose-built for batch vs. service-oriented design
- Fair Scheduling: Advanced fair-use policies vs. basic priority classes
vs. Traditional HPC Schedulers (SLURM, PBS)
- Container Native: Built for containerized workloads vs. traditional HPC
- Kubernetes Integration: Leverages Kubernetes ecosystem vs. isolated systems
- Cloud Ready: Designed for cloud and hybrid environments
- Modern APIs: REST/gRPC APIs vs. command-line interfaces
- Rich Client Support: Client libraries available for multiple languages (Go, Java, Scala, Python and .NET)
Next Steps
Ready to explore Armada? Continue with the rest of the documentation to install Armada and submit your first jobs.