Operator Guide
Complete guide for platform engineers and operators - from installation to production maintenance.
This guide covers everything you need to deploy, scale, and maintain Armada clusters in production environments. For information on using Armada to submit and manage jobs, see the User Guide.
Overview
As an operator, you're responsible for:
- Installing and configuring Armada components
- Setting up and managing multiple Kubernetes clusters
- Configuring authentication and authorization
- Monitoring system health and performance
- Scaling components to handle workload growth
- Troubleshooting issues and maintaining availability
Armada consists of several components that work together:
- Armada Server: The API server that accepts job submissions and manages queues
- Armada Scheduler: Determines when and where jobs should run
- Armada Executor: Runs in each Kubernetes cluster and executes jobs
- Lookout: Provides job monitoring and web UI
- Supporting services: Pulsar (message broker), PostgreSQL, and Redis
For a detailed explanation of how these components interact, see the Architecture documentation.
Local Installation
For local development and testing, you can use Kind (Kubernetes in Docker) or Minikube.
The easiest way to get started locally is using the Armada Operator, which automates the entire setup process. See the Getting Started guide for step-by-step instructions.
Note: Local installations are for development and testing only. Do not use them in production.
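For example, assuming Kind and kubectl are already installed, a disposable test cluster can be created with:
# Create a throwaway Kind cluster and point kubectl at it
kind create cluster --name armada-test
kubectl cluster-info --context kind-armada-test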
Production Installation
Prerequisites
Before installing Armada in production, ensure you have:
- Kubernetes cluster(s): At least one Kubernetes cluster for the control plane. Additional clusters can be added as worker clusters.
- Required dependencies:
  - Apache Pulsar: Message broker used by Armada components for event streaming
  - PostgreSQL: Relational database for storing job state and metadata
  - Redis: In-memory data store for caching and job queues
  - cert-manager: For managing TLS certificates (required for HTTPS ingress)
  - gRPC-compatible ingress controller: For exposing Armada's gRPC API
- Optional but recommended:
  - Prometheus: For metrics collection and monitoring
  - NGINX Ingress Controller: For exposing web services (Lookout UI)
Installation Methods
Armada can be installed using either Helm charts or the Armada Operator. Choose the method that best fits your infrastructure:
Using Helm Charts
Helm charts provide fine-grained control over the Armada deployment and are suitable for advanced configurations.
- Set the Armada version and fetch the charts:
  export ARMADA_VERSION=v1.2.3
  git clone https://github.com/armadaproject/armada.git --branch $ARMADA_VERSION
  cd armada
- Install the Armada Server:
  # Create a values file (server-values.yaml)
  helm install armada-server ./deployment/armada \
    --set image.tag=$ARMADA_VERSION \
    -f server-values.yaml
- Install the Armada Executor (repeat for each worker cluster):
  # Create a values file (executor-values.yaml)
  helm install armada-executor ./deployment/armada-executor \
    --set image.tag=$ARMADA_VERSION \
    -f executor-values.yaml
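After both installs complete, it is worth confirming that Helm reports the releases as deployed and that the pods come up (release names as used above):
helm status armada-server
helm status armada-executor
kubectl get pods -A | grep -i armada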
For detailed Helm chart configuration options, see the Helm Charts documentation.
Using Armada Operator
The Armada Operator provides a Kubernetes-native way to manage Armada deployments using Custom Resource Definitions (CRDs). This is the recommended approach for most users.
- Install the Armada Operator:
  helm repo add gresearch https://g-research.github.io/charts
  helm install armada-operator gresearch/armada-operator \
    --namespace armada-system \
    --create-namespace
- Install dependencies:
  # Install Pulsar, PostgreSQL, Redis, and Prometheus
  make install-armada-deps  # If using the operator repository
- Deploy Armada components:
  kubectl create namespace armada
  kubectl apply -n armada -f armada-crs.yaml
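The armada-crs.yaml file referenced above holds the Custom Resources the operator reconciles. As a rough sketch of its shape (the apiVersion, kind, and spec fields here are assumptions; the operator README documents the authoritative schema):
# Hypothetical minimal Custom Resource; check the Armada Operator README
# for the real CRD schema before applying.
apiVersion: install.armadaproject.io/v1alpha1
kind: ArmadaServer
metadata:
  name: armada-server
  namespace: armada
spec:
  replicas: 3
  image:
    repository: gresearch/armada-server
    tag: v1.2.3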
For detailed Operator setup instructions, see the Armada Operator README.
Configuration
Server Configuration
The Armada Server requires configuration for:
- Redis connection: Used for job queues and caching
- Pulsar connection: Used for event streaming
- PostgreSQL connection: Used for storing job metadata
- Authentication: Configure authentication methods (Basic Auth, OpenID Connect, Kerberos)
- Ingress: Configure hostnames and TLS certificates
Example server values file:
ingressClass: 'nginx'
clusterIssuer: 'letsencrypt-prod'
hostnames:
  - 'armada.example.com'
replicas: 3
applicationConfig:
  redis:
    masterName: 'mymaster'
    addrs:
      - 'redis-ha-announce-0.default.svc.cluster.local:26379'
      - 'redis-ha-announce-1.default.svc.cluster.local:26379'
      - 'redis-ha-announce-2.default.svc.cluster.local:26379'
    poolSize: 1000
  pulsar:
    URL: 'pulsar://pulsar-broker.default.svc.cluster.local:6650'
  postgres:
    connection:
      host: 'postgresql.default.svc.cluster.local'
      port: 5432
      user: 'postgres'
      dbname: 'armada'
  auth:
    anonymousAuth: false
    basicAuth:
      users:
        'admin':
          password: 'secure-password'
          groups: ['administrators']
Executor Configuration
Each executor must be configured with:
- Cluster ID: Unique identifier for the cluster
- Server URL: URL of the Armada Server
- Authentication: Credentials for authenticating with the server
- Kubernetes configuration: Settings for managing pods and nodes
Example executor values file:
applicationConfig:
  application:
    clusterId: 'production-cluster-1'
  apiConnection:
    armadaUrl: 'armada.example.com:443'
    basicAuth:
      username: 'executor-user'
      password: 'executor-password'
  kubernetes:
    minimumPodAge: 3m
    failedPodExpiry: 10m
    stuckPodExpiry: 3m
Note: By default, executors run on control plane nodes. For managed Kubernetes services where you cannot access the control plane, configure the executor to run on worker nodes:
nodeSelector: null
tolerations: []
For complete configuration options, see the Helm Charts documentation.
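Keeping plaintext credentials like the executor password above out of version-controlled values files is good practice. One general pattern, assuming your deployment tooling can reference it, is a Kubernetes Secret:
# Store executor credentials in a Secret rather than a values file;
# how the chart consumes it depends on your chart version and tooling.
kubectl create secret generic armada-executor-credentials \
  --namespace armada \
  --from-literal=username=executor-user \
  --from-literal=password='executor-password'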
Authentication and Security
Armada supports multiple authentication methods:
Basic Authentication
Basic authentication is simple but not recommended for production. Configure it in the server values:
applicationConfig:
  auth:
    basicAuth:
      users:
        'user1':
          password: 'password1'
          groups: ['teamA']
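To sanity-check these credentials from a client machine, armadactl can be pointed at the server. The configuration file shape below is an assumption and may differ across armadactl versions, so consult the client documentation:
# ~/.armadactl.yaml (illustrative; key names may vary by armadactl version)
armadaUrl: armada.example.com:443
basicAuth:
  username: 'user1'
  password: 'password1'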
OpenID Connect
For production environments, use OpenID Connect authentication:
applicationConfig:
  auth:
    openIdAuth:
      providerUrl: 'https://cognito-idp.region.amazonaws.com/user-pool-id'
      groupsClaim: 'cognito:groups'
Kubernetes Native Authentication
For enhanced security, use Kubernetes-native authentication where executors authenticate using their service account tokens. See the Kubernetes Native Auth implementation for setup instructions.
Permissions
Configure permissions using group mappings:
applicationConfig:
  auth:
    permissionGroupMapping:
      submit_any_jobs: ['administrators']
      create_queue: ['administrators', 'team-leads']
      cancel_any_jobs: ['administrators']
      watch_all_events: ['administrators']
      execute_jobs: ['armada-executor']
Scheduling Configuration
Configure scheduling behavior to optimize resource allocation:
applicationConfig:
  scheduling:
    queueLeaseBatchSize: 200
    minimumResourceToSchedule:
      memory: 100000000 # 100MB
      cpu: 0.25
    maximalClusterFractionToSchedule:
      memory: 0.25
      cpu: 0.25
    maximalResourceFractionPerQueue:
      memory: 0.25
      cpu: 0.25
For more details on scheduling configuration, see the Helm Charts documentation.
Monitoring and Observability
All Armada components expose metrics on /metrics endpoints that can be scraped by Prometheus.
Metrics Endpoints
- Server: :9000/metrics
- Executor: :9001/metrics
- Scheduler: :9000/metrics
- Lookout: :9000/metrics
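A quick way to eyeball these endpoints without Prometheus is a port-forward (deployment name and namespace as used elsewhere in this guide):
# Forward the server's metrics port and fetch a sample
kubectl port-forward -n armada deployment/armada-server 9000:9000 &
curl -s localhost:9000/metrics | head -n 20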
Prometheus Integration
Enable Prometheus monitoring when installing with Helm:
prometheus:
  enabled: true
  labels:
    app: armada
  scrapeInterval: 10s
This creates ServiceMonitor resources that Prometheus can automatically discover and scrape.
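To confirm discovery is working, check that the ServiceMonitor objects exist (this requires the Prometheus Operator CRDs; the namespace depends on where the charts created them):
kubectl get servicemonitors -n armada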
Key Metrics to Monitor
Monitor these metrics to ensure healthy operation:
- Queue metrics: Queue size, priority, resource usage
- Job metrics: Job submission rate, completion rate, failure rate
- Resource metrics: Available capacity, allocated resources
- API metrics: Request rates, latency (p95, p99)
- Executor metrics: Active jobs, pod states, reconciliation loops
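As a starting point for alerting on the queue metrics above, the sketch below fires when a queue backlog stays large. The metric name armada_queue_size and the queueName label are assumptions, so confirm the names against your deployment's /metrics output:
# Illustrative Prometheus alerting rule - verify metric and label names before use
groups:
  - name: armada
    rules:
      - alert: ArmadaQueueBacklog
        expr: armada_queue_size > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Armada queue {{ $labels.queueName }} has a sustained backlog'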
Logging
All components log to stdout and stderr. Use your Kubernetes logging solution (e.g., Fluentd, Loki) to collect and analyze logs.
Check component health using:
# Check pod status
kubectl get pods -n armada
# View logs
kubectl logs -n armada deployment/armada-server
kubectl logs -n armada deployment/armada-executor
# Check events
kubectl get events -n armada --sort-by='.lastTimestamp'
Scaling and High Availability
Scaling Components
Server Scaling
Scale the server horizontally by increasing replicas:
replicas: 3
The server is stateless and can be scaled horizontally. Use a load balancer in front of multiple server instances.
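For a one-off change without editing values files (assuming the server runs as a Deployment named armada-server in the armada namespace, as in the commands elsewhere in this guide):
kubectl scale deployment/armada-server -n armada --replicas=5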
Executor Scaling
Each executor manages one Kubernetes cluster. To add more clusters:
- Install a new executor in the target cluster
- Configure it with a unique clusterId
- Ensure it can reach the Armada Server
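For example, a second cluster's executor values would mirror the earlier example but with its own identifier:
applicationConfig:
  application:
    clusterId: 'production-cluster-2'  # must be unique across all executors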
Database Scaling
For high-availability deployments:
- PostgreSQL: Use a managed PostgreSQL service with automatic failover or set up PostgreSQL replication
- Redis: Use Redis HA (High Availability) with sentinel or a managed Redis service
- Pulsar: Use Pulsar with multiple brokers for high availability
Resource Limits
Configure resource limits to prevent any single queue from consuming all resources:
applicationConfig:
  scheduling:
    maximalResourceFractionPerQueue:
      memory: 0.25
      cpu: 0.25
    maximalResourceFractionToSchedulePerQueue:
      memory: 0.05
      cpu: 0.05
This ensures fair resource distribution across queues. For example, with 1,000 CPU cores of total capacity, these settings cap any single queue at 250 cores overall, with at most 50 cores newly scheduled to it in any one round.
High Availability Best Practices
- Multiple server replicas: Run at least 3 server replicas for redundancy
- Database backups: Regularly backup PostgreSQL and ensure point-in-time recovery
- Pulsar persistence: Configure Pulsar with persistent storage and replication
- Health checks: Configure Kubernetes liveness and readiness probes (see the sketch after this list)
- Graceful shutdown: Ensure proper termination grace periods for components
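The health-check item above, in generic Kubernetes pod-spec form, might look like the following. The /health path and port 8080 are assumptions, and whether the Helm values expose these fields depends on the chart version:
# Hypothetical probe configuration - endpoint path and port are assumptions
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10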
Troubleshooting
Common Issues
Jobs Not Scheduling
- Check executor connectivity: Verify executors can reach the server
  kubectl logs -n armada deployment/armada-executor | grep -i error
- Check queue configuration: Ensure queues exist and have valid priority factors
  armadactl get queues
- Check resource availability: Verify clusters have available resources
  kubectl top nodes
- Check scheduler logs: Look for scheduling errors
  kubectl logs -n armada deployment/armada-scheduler
Executor Not Receiving Jobs
- Verify authentication: Check executor credentials are correct
- Check cluster ID: Ensure cluster ID is unique and matches server configuration
- Check network connectivity: Verify the executor can reach the server endpoint (see the check below)
- Review executor logs: Look for authentication or connection errors
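A quick reachability test from inside the cluster (the hostname and port are the illustrative values from the executor example):
# One-off busybox pod that attempts a TCP connection to the server endpoint
kubectl run net-test -n armada --rm -it --restart=Never --image=busybox -- \
  nc -zv armada.example.com 443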
Database Connection Issues
- Check connection strings: Verify PostgreSQL connection settings
- Check network policies: Ensure pods can reach the database
- Check database status: Verify PostgreSQL is running and accessible (see the probe below)
- Review connection pool settings: Adjust pool size if needed
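Similarly, a throwaway client pod can confirm the database answers. The host, user, and database name below are the illustrative values from the server example; substitute your own password for the placeholder:
# One-off psql client; <password> is a placeholder
kubectl run psql-test -n armada --rm -it --restart=Never \
  --image=postgres:16 --env=PGPASSWORD=<password> -- \
  psql -h postgresql.default.svc.cluster.local -U postgres -d armada -c 'SELECT 1'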
Performance Issues
- Monitor metrics: Check Prometheus metrics for bottlenecks
- Review scheduling configuration: Adjust queueLeaseBatchSize and other parameters
- Check database performance: Monitor PostgreSQL query performance
- Review Pulsar throughput: Ensure message broker can handle load
Debugging Tips
- Enable verbose logging: Increase log levels in component configuration
- Use kubectl describe: Inspect pod events and conditions
  kubectl describe pod -n armada <pod-name>
- Check resource usage: Monitor CPU and memory usage
  kubectl top pods -n armada
- Review configuration: Validate YAML configurations
  kubectl get configmap -n armada -o yaml
Getting Help
If you encounter issues not covered here:
- GitHub Issues: Report bugs and request features at github.com/armadaproject/armada/issues
- Community Slack: Join discussions on CNCF Slack
- Documentation: Check the Architecture documentation for system design details
Additional Resources
- Architecture Overview - Understand how Armada components work
- User Guide - Learn how to submit and manage jobs
- Armada Operator - Kubernetes-native deployment option
- Helm Charts Documentation - Detailed Helm configuration reference
- GitHub Repository - Source code and issue tracker