# Monitoring High Availability

> How to track the health and performance of your High Availability Namespaces.

Temporal Cloud offers several ways for you to track the health and performance of your
[High Availability](/cloud/high-availability) Namespaces.

## Detect a failover or an outage

With some [Worker deployment patterns](/cloud/high-availability/architecture-patterns), most notably [Active/Passive (Cold)](/cloud/high-availability/architecture-patterns#active-cold), detecting an outage is your responsibility: your Workflows make no progress until you detect it and bring up Workers in the new active region. Fast, reliable detection therefore directly determines your recovery time, so monitor for both failovers and outages as described in the following sections.

### Detect a failover

The clearest sign of a failover is a change in your Namespace's active region. When Temporal Cloud promotes the replica in the secondary region to active, the active region reported for the Namespace changes, which is a reliable, unambiguous signal that a failover occurred. To track failovers as they happen, look for the `FailoverNamespace` operation described in [Failover audit log](#failover-audit-log).

### Detect an outage

A failover is not the only signal worth watching. You may want to detect a regional outage directly, before or independently of a Namespace failover, so you can begin your own response. Watch for:

- **A spike in replication lag** between the primary and the replica. See [Monitoring replication](#monitoring-replication).
- **A drop in Workflow throughput**, such as a sudden decline in the rate of Workflows started, completed, or Tasks processed.
- **A spike in errors across your overall stack**, not just in Temporal, such as application errors, failed Activities, or connection failures.
- **A drop in throughput across your overall stack**, such as fewer requests reaching your services or fewer Activities executing.
- **Errors or failovers in other cloud systems you depend on**, such as databases, queues, or other regional services, which often signal a broader regional outage.
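
As an illustration of the throughput signals above, the following minimal sketch flags a sudden drop by comparing the current rate against a recent baseline. The function name, the 50% threshold, and the sample numbers are assumptions chosen for the example, not Temporal Cloud defaults. Feed it whatever rate you already collect, such as Workflows started per minute.

```python
from statistics import mean
from typing import Sequence


def throughput_dropped(
    baseline: Sequence[float],
    current: float,
    max_drop: float = 0.5,  # hypothetical threshold: alert on a drop of more than 50%
) -> bool:
    """Return True when `current` is more than `max_drop` below the baseline mean."""
    if not baseline:
        return False  # no history yet, so there is nothing to compare against
    base = mean(baseline)
    if base <= 0:
        return False  # an idle baseline cannot drop any further
    return current < base * (1.0 - max_drop)


# Illustrative numbers only: recent per-minute rates, then the latest reading.
recent_rates = [100.0, 110.0, 90.0]
print(throughput_dropped(recent_rates, 40.0))
print(throughput_dropped(recent_rates, 80.0))
```

A single threshold like this is easy to reason about but can be noisy for bursty workloads. Combining it with the other signals in the list reduces false alarms.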

## Replication status

You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud disables the
"Trigger a failover" option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to:

- **Data synchronization issues:** The replica fails to remain in sync with the primary due to network or performance
  problems.
- **Replication lag:** The replica falls behind the primary, causing it to be out of sync.
- **Network issues:** Loss of communication between the replica and the primary interrupts replication.
- **Failed health checks:** If the replica fails health checks, it's marked as unhealthy.

While any of these issues persists, the replica cannot be used as a failover target, which protects system stability and consistency.

## Monitoring replication

Temporal Cloud's High Availability features use asynchronous replication between the primary and the replica. Workflow
updates in the primary, along with associated History Events, are transmitted to the replica. Replication lag refers to
the transmission delay of Workflow updates and History Events from the primary to the replica.

> **💡 Tip:**
>
> Temporal Cloud strives to maintain a P95 replication lag of less than 1 minute. In this context,
> P95 means 95% of updates are processed faster than this limit.
>

Forcing a failover while replication lag is significant increases the likelihood of rolling back Workflow
progress, because the replica has not yet received the most recent updates. Always check the replication lag
metrics before initiating a failover.

Temporal Cloud emits replication lag [metrics](/cloud/metrics/openmetrics/metrics-reference#replication-metrics)
as pre-computed percentiles (p50, p95, p99) that are labeled with `temporal_namespace`.
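
A pre-failover check on these percentiles could look like the following minimal sketch. It assumes you have already scraped the lag values for one Namespace and converted them to seconds. The function name, the dictionary keys, and the 60-second cutoff are assumptions for the example. The cutoff mirrors the P95 target mentioned above and is not an official limit.

```python
from typing import Mapping


def lag_allows_failover(
    lag_seconds: Mapping[str, float],
    p95_limit: float = 60.0,  # assumption: reuse the 1-minute P95 target as the cutoff
) -> bool:
    """Return True only when the p95 replication lag is known and under the limit."""
    p95 = lag_seconds.get("p95")
    if p95 is None:
        return False  # treat missing data as unsafe rather than guessing
    return p95 < p95_limit


# Illustrative values only, keyed by percentile and expressed in seconds.
print(lag_allows_failover({"p50": 2.0, "p95": 12.0, "p99": 45.0}))
print(lag_allows_failover({"p50": 30.0, "p95": 300.0, "p99": 900.0}))
```

Treating a missing value as unsafe is a deliberate choice here: if the metric is unavailable, an operator should review the lag manually before forcing a failover.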

When a Namespace has a replica, you may notice that the Action count in `temporal_cloud_v1_total_action_count` is
twice what it was before you added the replica. This happens because Actions are replicated: they occur on both the
primary and the replica.

## Failover audit log

When Temporal triggers a failover, the [audit log](/cloud/audit-logs) records the details.

Look for `"operation": "FailoverNamespace"` in the logs.
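
If you export audit logs to your own tooling, a filter along these lines can surface failover entries. This is a minimal sketch that assumes the logs arrive as one JSON object per line. Only the `operation` field and its `FailoverNamespace` value come from this page. The `emit_time` field and the other operation name in the sample are hypothetical placeholders.

```python
import json
from typing import Dict, Iterable, Iterator


def failover_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield audit log entries whose operation is FailoverNamespace."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("operation") == "FailoverNamespace":
            yield entry


# Hypothetical sample input: field names other than "operation" are placeholders.
sample_lines = [
    '{"operation": "FailoverNamespace", "emit_time": "2024-01-01T00:00:00Z"}',
    '{"operation": "SomeOtherOperation"}',
    "not json",
]
events = list(failover_events(sample_lines))
print(len(events))
```

From here, each matching entry can be forwarded to your alerting system so that a failover pages the team responsible for bringing up Workers in the new active region.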