Operator Guide
Complete guide for platform engineers and operators - from installation to production maintenance.
This guide covers everything you need to deploy, scale, and maintain Armada clusters in production environments. For information on using Armada to submit and manage jobs, see the User Guide.
Overview
As an operator, you're responsible for:
- Installing and configuring Armada components
- Setting up and managing multiple Kubernetes clusters
- Configuring authentication and authorization
- Monitoring system health and performance
- Scaling components to handle workload growth
- Troubleshooting issues and maintaining availability
Armada consists of several components that work together:
- Armada Server: The API server that accepts job submissions and manages queues
- Armada Scheduler: Determines when and where jobs should run
- Armada Executor: Runs in each Kubernetes cluster and executes jobs
- Lookout: Provides job monitoring and web UI
- Supporting services: Pulsar (message broker), PostgreSQL, and Redis
For a detailed explanation of how these components interact, see the Architecture documentation.
Local Installation
For local development and testing, you can use Kind (Kubernetes in Docker) or Minikube.
The easiest way to get started locally is using the Armada Operator, which automates the entire setup process. See the Getting Started guide for step-by-step instructions.
Note: Local installations are for development and testing only. Do not use them in production.
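For example, assuming Kind and kubectl are already installed, a disposable test cluster can be created with:
# Create a throwaway Kind cluster and point kubectl at it
kind create cluster --name armada-test
kubectl cluster-info --context kind-armada-test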
Production Installation
Prerequisites
Before installing Armada in production, ensure you have:
- Kubernetes cluster(s): At least one Kubernetes cluster for the control plane. Additional clusters can be added as worker clusters.
- Required dependencies:
  - Apache Pulsar: Message broker used by Armada components for event streaming
  - PostgreSQL: Relational database for storing job state and metadata
  - Redis: In-memory data store for caching and job queues
  - cert-manager: For managing TLS certificates (required for HTTPS ingress)
  - gRPC-compatible ingress controller: For exposing Armada's gRPC API
- Optional but recommended:
  - Prometheus: For metrics collection and monitoring
  - NGINX Ingress Controller: For exposing web services (Lookout UI)
Installation Methods
Armada can be installed using either Helm charts or the Armada Operator. Choose the method that best fits your infrastructure:
Using Helm Charts
Helm charts provide fine-grained control over the Armada deployment and are suitable for advanced configurations.
- Set the Armada version and fetch the charts:
  export ARMADA_VERSION=v1.2.3
  git clone https://github.com/armadaproject/armada.git --branch $ARMADA_VERSION
  cd armada
- Install the Armada Server:
  # Create a values file (server-values.yaml)
  helm install armada-server ./deployment/armada \
    --set image.tag=$ARMADA_VERSION \
    -f server-values.yaml
- Install the Armada Executor (repeat for each worker cluster):
  # Create a values file (executor-values.yaml)
  helm install armada-executor ./deployment/armada-executor \
    --set image.tag=$ARMADA_VERSION \
    -f executor-values.yaml
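After both installs complete, it is worth confirming that Helm reports the releases as deployed and that the pods come up (release names as used above):
helm status armada-server
helm status armada-executor
kubectl get pods -A | grep -i armada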
For detailed Helm chart configuration options, see the Helm Charts documentation.
Using Armada Operator
The Armada Operator provides a Kubernetes-native way to manage Armada deployments using Custom Resource Definitions (CRDs). This is the recommended approach for most users.
- Install the Armada Operator:
  helm repo add gresearch https://g-research.github.io/charts
  helm install armada-operator gresearch/armada-operator \
    --namespace armada-system \
    --create-namespace
- Install dependencies:
  # Install Pulsar, PostgreSQL, Redis, and Prometheus
  make install-armada-deps  # If using the operator repository
- Deploy Armada components:
  kubectl create namespace armada
  kubectl apply -n armada -f armada-crs.yaml
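The armada-crs.yaml file referenced above holds the Custom Resources the operator reconciles. As a rough sketch of its shape (the apiVersion, kind, and spec fields here are assumptions; the operator README documents the authoritative schema):
# Hypothetical minimal Custom Resource; check the Armada Operator README
# for the real CRD schema before applying.
apiVersion: install.armadaproject.io/v1alpha1
kind: ArmadaServer
metadata:
  name: armada-server
  namespace: armada
spec:
  replicas: 3
  image:
    repository: gresearch/armada-server
    tag: v1.2.3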
For detailed Operator setup instructions, see the Armada Operator README.
Configuration
Server Configuration
The Armada Server requires configuration for:
- Redis connection: Used for job queues and caching
- Pulsar connection: Used for event streaming
- PostgreSQL connection: Used for storing job metadata
- Authentication: Configure authentication methods (Basic Auth, OpenID Connect, Kerberos)
- Ingress: Configure hostnames and TLS certificates
Example server values file:
ingressClass: 'nginx'
clusterIssuer: 'letsencrypt-prod'
hostnames:
  - 'armada.example.com'
replicas: 3
applicationConfig:
  redis:
    masterName: 'mymaster'
    addrs:
      - 'redis-ha-announce-0.default.svc.cluster.local:26379'
      - 'redis-ha-announce-1.default.svc.cluster.local:26379'
      - 'redis-ha-announce-2.default.svc.cluster.local:26379'
    poolSize: 1000
  pulsar:
    URL: 'pulsar://pulsar-broker.default.svc.cluster.local:6650'
  postgres:
    connection:
      host: 'postgresql.default.svc.cluster.local'
      port: 5432
      user: 'postgres'
      dbname: 'armada'
  auth:
    anonymousAuth: false
    basicAuth:
      users:
        'admin':
          password: 'secure-password'
          groups: ['administrators']
Executor Configuration
Each executor must be configured with:
- Cluster ID: Unique identifier for the cluster
- Server URL: URL of the Armada Server
- Authentication: Credentials for authenticating with the server
- Kubernetes configuration: Settings for managing pods and nodes
Example executor values file:
applicationConfig:
  application:
    clusterId: 'production-cluster-1'
  apiConnection:
    armadaUrl: 'armada.example.com:443'
    basicAuth:
      username: 'executor-user'
      password: 'executor-password'
  kubernetes:
    minimumPodAge: 3m
    failedPodExpiry: 10m
    stuckPodExpiry: 3m
Note: By default, executors run on control plane nodes. For managed Kubernetes services where you cannot access the control plane, configure the executor to run on worker nodes:
nodeSelector: null
tolerations: []
For complete configuration options, see the Helm Charts documentation.
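Keeping plaintext credentials like the executor password above out of version-controlled values files is good practice. One general pattern, assuming your deployment tooling can reference it, is a Kubernetes Secret:
# Store executor credentials in a Secret rather than a values file;
# how the chart consumes it depends on your chart version and tooling.
kubectl create secret generic armada-executor-credentials \
  --namespace armada \
  --from-literal=username=executor-user \
  --from-literal=password='executor-password'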
Authentication and Security
Armada supports multiple authentication methods:
Basic Authentication
Basic authentication is simple but not recommended for production. Configure it in the server values:
applicationConfig:
  auth:
    basicAuth:
      users:
        'user1':
          password: 'password1'
          groups: ['teamA']
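To sanity-check these credentials from a client machine, armadactl can be pointed at the server. The configuration file shape below is an assumption and may differ across armadactl versions, so consult the client documentation:
# ~/.armadactl.yaml (illustrative; key names may vary by armadactl version)
armadaUrl: armada.example.com:443
basicAuth:
  username: 'user1'
  password: 'password1'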
OpenID Connect
For production environments, use OpenID Connect authentication:
applicationConfig:
  auth:
    openIdAuth:
      providerUrl: 'https://cognito-idp.region.amazonaws.com/user-pool-id'
      groupsClaim: 'cognito:groups'
Kubernetes Native Authentication
For enhanced security, use Kubernetes-native authentication where executors authenticate using their service account tokens. See the Kubernetes Native Auth implementation for setup instructions.
Permissions
Configure permissions using group mappings:
applicationConfig:
  auth:
    permissionGroupMapping:
      submit_any_jobs: ['administrators']
      create_queue: ['administrators', 'team-leads']
      cancel_any_jobs: ['administrators']
      watch_all_events: ['administrators']
      execute_jobs: ['armada-executor']
Scheduling Configuration
Configure scheduling behavior to optimize resource allocation:
applicationConfig:
  scheduling:
    queueLeaseBatchSize: 200
    minimumResourceToSchedule:
      memory: 100000000 # 100MB
      cpu: 0.25
    maximalClusterFractionToSchedule:
      memory: 0.25
      cpu: 0.25
    maximalResourceFractionPerQueue:
      memory: 0.25
      cpu: 0.25
For more details on scheduling configuration, see the Helm Charts documentation.
Monitoring and Observability
All Armada components expose metrics on /metrics endpoints that can be scraped by Prometheus.
Metrics Endpoints
- Server: :9000/metrics
- Executor: :9001/metrics
- Scheduler: :9000/metrics
- Lookout: :9000/metrics
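A quick way to eyeball these endpoints without Prometheus is a port-forward (deployment name and namespace as used elsewhere in this guide):
# Forward the server's metrics port and fetch a sample
kubectl port-forward -n armada deployment/armada-server 9000:9000 &
curl -s localhost:9000/metrics | head -n 20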
Prometheus Integration
Enable Prometheus monitoring when installing with Helm:
prometheus:
  enabled: true
  labels:
    app: armada
  scrapeInterval: 10s
This creates ServiceMonitor resources that Prometheus can automatically discover and scrape.
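To confirm discovery is working, check that the ServiceMonitor objects exist (this requires the Prometheus Operator CRDs; the namespace depends on where the charts created them):
kubectl get servicemonitors -n armada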
Key Metrics to Monitor
Monitor these metrics to ensure healthy operation:
- Queue metrics: Queue size, priority, resource usage
- Job metrics: Job submission rate, completion rate, failure rate
- Resource metrics: Available capacity, allocated resources
- API metrics: Request rates, latency (p95, p99)
- Executor metrics: Active jobs, pod states, reconciliation loops
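As a starting point for alerting on the queue metrics above, the sketch below fires when a queue backlog stays large. The metric name armada_queue_size and the queueName label are assumptions, so confirm the names against your deployment's /metrics output:
# Illustrative Prometheus alerting rule - verify metric and label names before use
groups:
  - name: armada
    rules:
      - alert: ArmadaQueueBacklog
        expr: armada_queue_size > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Armada queue {{ $labels.queueName }} has a sustained backlog'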
Logging
All components log to stdout and stderr. Use your Kubernetes logging solution (e.g., Fluentd, Loki) to collect and analyze logs.
Check component health using:
# Check pod status
kubectl get pods -n armada
# View logs
kubectl logs -n armada deployment/armada-server
kubectl logs -n armada deployment/armada-executor
# Check events
kubectl get events -n armada --sort-by='.lastTimestamp'
Scaling and High Availability
Scaling Components
Server Scaling
Scale the server horizontally by increasing replicas:
replicas: 3
The server is stateless and can be scaled horizontally. Use a load balancer in front of multiple server instances.
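For a one-off change without editing values files (assuming the server runs as a Deployment named armada-server in the armada namespace, as in the commands elsewhere in this guide):
kubectl scale deployment/armada-server -n armada --replicas=5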
Executor Scaling
Each executor manages one Kubernetes cluster. To add more clusters:
- Install a new executor in the target cluster
- Configure it with a unique clusterId
- Ensure it can reach the Armada Server
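For example, a second cluster's executor values would mirror the earlier example but with its own identifier:
applicationConfig:
  application:
    clusterId: 'production-cluster-2'  # must be unique across all executors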
Database Scaling
For high-availability deployments:
- PostgreSQL: Use a managed PostgreSQL service with automatic failover or set up PostgreSQL replication
- Redis: Use Redis HA (High Availability) with sentinel or a managed Redis service
- Pulsar: Use Pulsar with multiple brokers for high availability
Resource Limits
Configure resource limits to prevent any single queue from consuming all resources:
applicationConfig:
  scheduling:
    maximalResourceFractionPerQueue:
      memory: 0.25
      cpu: 0.25
    maximalResourceFractionToSchedulePerQueue:
      memory: 0.05
      cpu: 0.05
This ensures fair resource distribution across queues. For example, with 1,000 CPU cores of total capacity, these settings cap any single queue at 250 cores overall, with at most 50 cores newly scheduled to it in any one round.
High Availability Best Practices
- Multiple server replicas: Run at least 3 server replicas for redundancy
- Database backups: Regularly backup PostgreSQL and ensure point-in-time recovery
- Pulsar persistence: Configure Pulsar with persistent storage and replication
- Health checks: Configure Kubernetes liveness and readiness probes (see the sketch after this list)
- Graceful shutdown: Ensure proper termination grace periods for components
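The health-check item above, in generic Kubernetes pod-spec form, might look like the following. The /health path and port 8080 are assumptions, and whether the Helm values expose these fields depends on the chart version:
# Hypothetical probe configuration - endpoint path and port are assumptions
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10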
Troubleshooting
Common Issues
Jobs Not Scheduling
- Check executor connectivity: Verify executors can reach the server
  kubectl logs -n armada deployment/armada-executor | grep -i error
- Check queue configuration: Ensure queues exist and have valid priority factors
  armadactl get queues
- Check resource availability: Verify clusters have available resources
  kubectl top nodes
- Check scheduler logs: Look for scheduling errors
  kubectl logs -n armada deployment/armada-scheduler
Executor Not Receiving Jobs
- Verify authentication: Check executor credentials are correct
- Check cluster ID: Ensure cluster ID is unique and matches server configuration
- Check network connectivity: Verify the executor can reach the server endpoint (see the check below)
- Review executor logs: Look for authentication or connection errors
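A quick reachability test from inside the cluster (the hostname and port are the illustrative values from the executor example):
# One-off busybox pod that attempts a TCP connection to the server endpoint
kubectl run net-test -n armada --rm -it --restart=Never --image=busybox -- \
  nc -zv armada.example.com 443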
Database Connection Issues
- Check connection strings: Verify PostgreSQL connection settings
- Check network policies: Ensure pods can reach the database
- Check database status: Verify PostgreSQL is running and accessible (see the probe below)
- Review connection pool settings: Adjust pool size if needed
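Similarly, a throwaway client pod can confirm the database answers. The host, user, and database name below are the illustrative values from the server example; substitute your own password for the placeholder:
# One-off psql client; <password> is a placeholder
kubectl run psql-test -n armada --rm -it --restart=Never \
  --image=postgres:16 --env=PGPASSWORD=<password> -- \
  psql -h postgresql.default.svc.cluster.local -U postgres -d armada -c 'SELECT 1'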
Performance Issues
- Monitor metrics: Check Prometheus metrics for bottlenecks
- Review scheduling configuration: Adjust queueLeaseBatchSize and other parameters
- Check database performance: Monitor PostgreSQL query performance
- Review Pulsar throughput: Ensure message broker can handle load
Debugging Tips
- Enable verbose logging: Increase log levels in component configuration
- Use kubectl describe: Inspect pod events and conditions
  kubectl describe pod -n armada <pod-name>
- Check resource usage: Monitor CPU and memory usage
  kubectl top pods -n armada
- Review configuration: Validate YAML configurations
  kubectl get configmap -n armada -o yaml
Getting Help
If you encounter issues not covered here:
- GitHub Issues: Report bugs and request features at github.com/armadaproject/armada/issues
- Community Slack: Join discussions on CNCF Slack
- Documentation: Check the Architecture documentation for system design details
Additional Resources
- Architecture Overview - Understand how Armada components work
- User Guide - Learn how to submit and manage jobs
- Armada Operator - Kubernetes-native deployment option
- Helm Charts Documentation - Detailed Helm configuration reference
- GitHub Repository - Source code and issue tracker