High Availability (HA)

Siren Alert supports High Availability (HA) for reporting on node clusters. If a cluster’s leader node fails, the leader’s responsibilities switch to another node so that the alerting system keeps running. Reporting is muted on all nodes except the leader, which prevents duplicate reports from being sent for each alert.

Configuring Elasticsearch-based High Availability

Siren Alert can use Elasticsearch as the coordination backend for high availability. This allows all nodes in the cluster to coordinate leader election and failover using Elasticsearch, without relying on a local file system or other external mechanisms.

To enable and configure Elasticsearch-based HA, follow these steps:

Prerequisites

If Siren Alert cluster mode was used in the past, remove sentinl.settings.cluster and any nested properties from the investigate.yml file.

Ensure that you have created a sirenalert user and that the following settings are present on each node. See the security setup for more details.

....
investigate_access_control:
  sirenalert:
    elasticsearch:
      username: sirenalert
      password: password
....

Setup

Set the following configuration in the investigate.yml file on each Investigate node in the cluster, and then restart each node.

....
investigate_core:
  instance_name: <unique name for each node in the cluster>
....

....
high_availability:
  enabled: true
  timeout_ms: 1000 # optional
  refresh_ms: 2000 # optional
  ttl_ms: 5000 # optional
  log_level: silent # optional
....

Ensure that you set investigate_core.instance_name to a unique name for each node.
If instance_name is not set, it defaults to a randomly generated UUIDv4, which makes later debugging much more difficult.

The defaults for each of the optional settings are shown above. They are sufficient for most installations, so these settings do not need to be added to the configuration.

+ timeout_ms:: A timeout (in milliseconds) for each individual request made by a node to Elasticsearch for HA coordination.
If a request takes longer than this value, it is considered failed.
This helps nodes quickly detect communication issues or failures.

+ refresh_ms:: The interval (in milliseconds) at which each node sends coordination requests to Elasticsearch.
A smaller value means nodes will check more frequently for leader status changes, resulting in faster leader election and failover.
However, setting this too low may cause excessive load on Elasticsearch due to frequent requests.
Adjust this value to balance responsiveness and system load.

+ ttl_ms:: The time-to-live (in milliseconds) for the elected leader node status in Elasticsearch.
If the leader does not update its status within this period, it is considered dead, and a new leader election will be triggered.
This ensures that leadership is transferred promptly if the current leader becomes unavailable.

+ log_level:: Controls the verbosity of HA-related logging. Possible values are silent, info, warn, error, or debug.
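
The interplay of refresh_ms and ttl_ms can be illustrated with a small simulation. The following Python sketch is not Siren Alert's implementation. It models a generic lease-based election under simplifying assumptions: `LeaseStore` is a hypothetical in-memory stand-in for the coordination document in Elasticsearch, both nodes poll on the same tick, and each node sends exactly one request per refresh interval. All class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseStore:
    """Hypothetical stand-in for the HA coordination document."""
    ttl_ms: int
    leader: Optional[str] = None
    expires_at_ms: int = 0

    def try_acquire(self, node: str, now_ms: int) -> bool:
        """Renew the lease if `node` already leads, or take it over once expired."""
        if self.leader == node or now_ms >= self.expires_at_ms:
            self.leader = node
            self.expires_at_ms = now_ms + self.ttl_ms
            return True
        return False


def simulate(refresh_ms: int, ttl_ms: int, leader_dies_at_ms: int,
             end_ms: int) -> Optional[int]:
    """Return the time (ms) at which node-b takes over from node-a, or None."""
    store = LeaseStore(ttl_ms=ttl_ms)
    for now_ms in range(0, end_ms + 1, refresh_ms):
        # node-a renews its lease only while it is alive.
        if now_ms < leader_dies_at_ms:
            store.try_acquire("node-a", now_ms)
        # node-b polls on the same interval and wins once the lease has expired.
        if store.try_acquire("node-b", now_ms):
            return now_ms
    return None


def requests_per_second(nodes: int, refresh_ms: int) -> float:
    """Coordination request rate, assuming one request per node per interval."""
    return nodes * 1000 / refresh_ms


# Default values from the configuration above; node-a fails at t = 5000 ms.
print(simulate(refresh_ms=2000, ttl_ms=5000, leader_dies_at_ms=5000, end_ms=20000))  # 10000
print(simulate(refresh_ms=1000, ttl_ms=5000, leader_dies_at_ms=5000, end_ms=20000))  # 9000
print(requests_per_second(nodes=3, refresh_ms=2000))  # 1.5
```

In this model, the takeover happens no later than ttl_ms plus one refresh interval after the leader's last successful renewal. Halving refresh_ms shortens that window slightly but doubles the request rate, which is the trade-off described for refresh_ms above.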

Notes

There is a very small chance that an alert is lost if it executes at the exact moment a new leader node is elected.
Do not manually delete or modify the special document in the .siren index that is used for HA.

For more information, see the main Siren Alert documentation or contact support.