Armada Job Service

Problem Description

Armada’s API is event-driven, which makes it hard to integrate with tools such as Apache Airflow that are written with the expectation that they can easily fetch the status of a running job. Having Airflow subscribe to the event stream to observe status does not scale, so we must implement a caching layer that exposes a friendlier API for querying individual jobs.

Proposed Change

Notes

Proposed Airflow Operator flow

  1. Create the job_set
  2. [do the work to schedule the job]
  3. Run a status polling loop that queries the job service

Alternative Options

Change Armada API

Armada could expose a direct endpoint for fetching the status of a running job. A previous iteration of Armada did provide such an endpoint, but it was found to be a bottleneck when scaling to a large number of jobs and/or users. The switch to an event API was made to alleviate this performance issue.

Change Airflow DAG API

Airflow could be modified to allow alternate forms of integration that work better with event-based systems. This is impractical because we do not have Airflow contributors on staff, and the time required to get such a change proposed, approved, and merged upstream is much too long and carries significant risk.

Data Access

Cons

API (impact/changes?)

Security Impact

The cache should use the same security model as armadactl. Airflow does not currently support multitenancy.

Documentation Impact

Use Cases

Airflow Operator

  1. User creates a DAG and assigns a job-set.
  2. DAG setup includes ArmadaPythonClient and JobServiceClient.
  3. Airflow operator takes both ArmadaPythonClient and JobServiceClient.
  4. Airflow operator submits the job via ArmadaPythonClient.
  5. Airflow operator polls JobServiceClient via GetJobStatus.
  6. Once Armada has a terminal event, the Airflow task is complete.
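The polling part of this flow could look roughly like the sketch below. It is illustrative only: `JobState`, `TERMINAL_STATES`, and `wait_for_terminal_state` are hypothetical names, and `get_job_status` is a stand-in callable for whatever JobServiceClient's GetJobStatus call ends up looking like. The set of states and the timeout behaviour are assumptions, not part of this proposal.

```python
import time
from enum import Enum
from typing import Callable


class JobState(Enum):
    # Hypothetical states; the real job service may report different ones.
    SUBMITTED = "submitted"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumption: these are the states that correspond to a terminal Armada event.
TERMINAL_STATES = {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED}


def wait_for_terminal_state(
    get_job_status: Callable[[str, str], JobState],
    job_set_id: str,
    job_id: str,
    poll_interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> JobState:
    """Poll get_job_status until the job reaches a terminal state.

    get_job_status stands in for the job service's GetJobStatus call.
    Raises TimeoutError if no terminal state is seen within timeout_s.
    """
    waited = 0.0
    while True:
        state = get_job_status(job_set_id, job_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(
                f"job {job_id} in job set {job_set_id} did not finish "
                f"within {timeout_s} seconds"
            )
        sleep(poll_interval_s)
        waited += poll_interval_s
```

An operator would call `wait_for_terminal_state` after submitting the job and mark the Airflow task as succeeded or failed based on the returned state. Passing `sleep` as a parameter is only there so the loop can be exercised in tests without real delays.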

Implementation Plan

I have a PR that implements this plan.

Subscription

After talking with Chris Martin, I found out that we will be implementing our own Redis cache for the events. We will not be using Pulsar, NATS, or JetStream to stream events.
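To make the caching idea concrete, here is a minimal sketch of how a cache could fold job-set events into a per-job status that GetJobStatus can read. Everything in it is an assumption for illustration: `JobEvent`, `JobStatusCache`, and the state strings are hypothetical, and a plain in-memory dict stands in for Redis so the example stays self-contained. Keeping terminal states sticky is a design choice of this sketch, not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class JobEvent:
    # Hypothetical, simplified shape of an Armada job-set event.
    job_set_id: str
    job_id: str
    state: str  # e.g. "queued", "running", "succeeded", "failed"


# Assumption: states after which a job's status no longer changes.
TERMINAL = {"succeeded", "failed", "cancelled"}


class JobStatusCache:
    """Keeps the latest known state per (job_set_id, job_id).

    A dict stands in for the Redis cache described above.
    """

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], str] = {}

    def apply(self, event: JobEvent) -> None:
        """Record an event, ignoring updates to already-terminal jobs."""
        key = (event.job_set_id, event.job_id)
        if self._latest.get(key) in TERMINAL:
            return  # sketch-level choice: terminal states are sticky
        self._latest[key] = event.state

    def get_job_status(self, job_set_id: str, job_id: str) -> Optional[str]:
        """Return the latest cached state, or None if the job is unknown."""
        return self._latest.get((job_set_id, job_id))
```

The service would call `apply` for each event it reads from the job set's event stream, and answer status queries from `get_job_status` without touching the stream.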

The logic for this service should be as follows:

Airflow Sequence Diagram

AirflowSequence

JobService Server Diagram

JobService