Prod guidance content refined #802

Merged

merged 34 commits into `main` from `prod_guidance` on Mar 24, 2025
6b848d4
production guidance work started. WIP
eedugon Mar 14, 2025
7d68fd4
merging main
eedugon Mar 14, 2025
d36d4ec
prod guidance backup
eedugon Mar 17, 2025
4629acb
production guidance content refined
eedugon Mar 17, 2025
14fbae8
Merge remote-tracking branch 'origin/main' into prod_guidance
eedugon Mar 17, 2025
af62abd
fixed 3 links
eedugon Mar 17, 2025
101daa2
fixing anchor
eedugon Mar 17, 2025
85cdb75
fixing ec-ha links and anchors
eedugon Mar 17, 2025
6e295be
updated ha and scaling statement
eedugon Mar 17, 2025
01a16c7
merging main
eedugon Mar 18, 2025
0b02bd1
link to kibana reporting updated
eedugon Mar 18, 2025
7279799
Apply suggestions from code review
eedugon Mar 21, 2025
990d8e2
resilience sections and sub-sections improved
eedugon Mar 22, 2025
fc46692
Kibana production guidance updated per review comments
eedugon Mar 22, 2025
003d576
kibana prod guidance mappings and small changes
eedugon Mar 22, 2025
880ab0e
kibana prod guidance update
eedugon Mar 22, 2025
0e6233a
performance optimizations optimized
eedugon Mar 22, 2025
a2de239
optimizations titles updated
eedugon Mar 22, 2025
8d05cfd
minor update
eedugon Mar 22, 2025
2c64c47
es prod landing page almost finished
eedugon Mar 22, 2025
77cfa00
merging main
eedugon Mar 22, 2025
5741985
redirect after renaming es prod guidance file
eedugon Mar 22, 2025
1ebd0b2
almost done, landing pages finished and ECE ha doc updated
eedugon Mar 23, 2025
79aafa5
merged main
eedugon Mar 23, 2025
f4c6969
fixing multiple links
eedugon Mar 23, 2025
ec67fa2
pending link
eedugon Mar 23, 2025
4fe1121
final refinements
eedugon Mar 24, 2025
ebda352
Merge remote-tracking branch 'origin/main' into prod_guidance
eedugon Mar 24, 2025
1009d8e
link to fleet reference fixed
eedugon Mar 24, 2025
b1eaaa4
removed comment already implemented
eedugon Mar 24, 2025
a6a8130
Merge branch 'main' into prod_guidance
shainaraskas Mar 24, 2025
1497c9c
Apply suggestions from code review
eedugon Mar 24, 2025
842633d
final reviews implemented
eedugon Mar 24, 2025
2c10d13
Merge branch 'main' into prod_guidance
shainaraskas Mar 24, 2025
68 changes: 27 additions & 41 deletions deploy-manage/deploy/cloud-enterprise/ece-ha.md
@@ -1,74 +1,60 @@
---
navigation_title: High availability
applies_to:
deployment:
ece: all
mapped_pages:
- https://www.elastic.co/guide/en/cloud-enterprise/current/ece-ha.html
---

# High availability [ece-ha]
# High availability in ECE

Ensuring high availability in {{ece}} (ECE) requires careful planning and implementation across multiple areas, including availability zones, master nodes, replica shards, snapshot backups, and Zookeeper nodes.
Ensuring high availability (HA) in {{ece}} (ECE) requires careful planning and implementation across multiple areas, including availability zones, master nodes, replica shards, snapshot backups, and Zookeeper nodes.

This section describes key considerations and best practices to prevent downtime and data loss at both the ECE platform level and within orchestrated deployments.

## Availability zones [ece-ece-ha-1-az]

Fault tolerance for ECE is based around the concept of *availability zones*.

An availability zone contains resources available to an ECE installation that are isolated from other availability zones to safeguard against potential failure.

Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down ECE.

The main difference between ECE installations that include two or three availability zones is that three availability zones enable ECE to create clusters with a *tiebreaker*. If you have only two availability zones in total in your installation, no tiebreaker is created.
::::{note}
This section focuses on ensuring high availability at the ECE platform level. For deployment-level considerations, including resiliency, scaling, and performance optimizations for running {{es}} and {{kib}}, refer to the general [production guidance](/deploy-manage/production-guidance.md).
::::

We recommend that for each deployment you use at least two availability zones for production and three for mission-critical systems. Using more than three availability zones for a deployment is not required nor supported. Availability zones are intended for high availability, not scalability.
To maintain a minimum level of HA, you should deploy at least two ECE hosts for each role—**allocator, constructor, and proxy**—and at least three hosts for the **director** role, which runs ZooKeeper and requires a quorum to operate reliably.

::::{warning}
{{es}} clusters that are set up to use only one availability zone are not [highly available](/deploy-manage/production-guidance/availability-and-resilience.md) and are at risk of data loss. To safeguard against data loss, you must use at least two {{ece}} availability zones.
::::
In addition, to improve resiliency at the availability zone level, it’s recommended to deploy ECE across three availability zones, with at least two allocators per zone and spare capacity to accommodate instance failover and workload redistribution in case of failures.
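For illustration only, a rough sketch of joining an additional host to an existing installation with the allocator role pinned to a specific availability zone; the coordinator host, token, and zone name below are placeholders:

```sh
# Join an existing ECE installation as an allocator in a specific availability zone.
# All values shown are placeholders.
bash elastic-cloud-enterprise.sh install \
  --coordinator-host 192.168.1.10 \
  --roles-token 'TOKEN' \
  --roles "allocator" \
  --availability-zone ece-zone-1b
```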

::::{warning}
Increasing the number of zones should not be used to add more resources. The concept of zones is meant for High Availability (2 zones) and Fault Tolerance (3 zones), but neither will work if the cluster relies on the resources from those zones to be operational. The recommendation is to scale up the resources within a single zone until the cluster can take the full load (add some buffer to be prepared for a peak of requests), then scale out by adding additional zones depending on your requirements: 2 zones for High Availability, 3 zones for Fault Tolerance.
::::
All Elastic-documented architectures recommend using three availability zones with ECE roles distributed across all zones. Refer to [deployment scenarios](./identify-deployment-scenario.md) for examples of small, medium, and large installations.

Regardless of the resiliency level at the platform level, it’s important to also [configure your deployments for high availability](/deploy-manage/production-guidance/availability-and-resilience/resilience-in-ech.md).

## Master nodes [ece-ece-ha-2-master-nodes]
## Availability zones [ece-ece-ha-1-az]

Tiebreakers are used in distributed clusters to avoid cases of [split brain](https://en.wikipedia.org/wiki/Split-brain_(computing)), where an {{es}} cluster splits into multiple, autonomous parts that continue to handle requests independently of each other, at the risk of affecting cluster consistency and data loss. A split-brain scenario is avoided by making sure that a minimum number of [master-eligible nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md#master-node) must be present in order for any part of the cluster to elect a master node and accept user requests. To prevent multiple parts of a cluster from being eligible, there must be a [quorum-based majority](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md) of `(n/2)+1` nodes, where `n` is the number of master-eligible nodes in the cluster. The minimum number of master nodes to reach quorum in a two-node cluster is the same as for a three-node cluster: two nodes must be available.
Fault tolerance for ECE is based around the concept of *availability zones*.

When you create a cluster with nodes in two availability zones when a third zone is available, ECE can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: By tagging allocator hosts in ECE, you can create a cluster with eight nodes each in zones `ece-1a` and `ece-1b`, for a total of 16 nodes, and one tiebreaker node in zone `ece-1c`. This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.
An availability zone contains resources available to an ECE installation that are isolated from other availability zones to safeguard against potential failure.

By default, each node in an {{es}} cluster is a master-eligible node and a data node. In larger clusters, such as production clusters, it’s a good practice to split the roles, so that master nodes are not handling search or indexing work. When you create a cluster, you can specify to use dedicated [master-eligible nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md#master-node), one per availability zone.
Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down ECE.

::::{warning}
Clusters that have two or fewer master-eligible nodes are not [highly available](/deploy-manage/production-guidance/availability-and-resilience.md) and are at risk of data loss. You must have [at least three master-eligible nodes](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md).
::::{important}
Adding more availability zones should not be used as a way to increase processing capacity and performance. The concept of zones is meant for high availability (2 zones) and fault tolerance (3 zones), but neither will work if your deployments rely on the resources from those zones to be operational. Refer to [scaling considerations](/deploy-manage/production-guidance/scaling-considerations.md#scaling-and-fault-tolerance) for more information.
::::

## Replica shards [ece-ece-ha-3-replica-shards]
The main difference between ECE installations that include two or three availability zones is that three availability zones enable ECE to create {{es}} clusters with a [voting-only tiebreaker](/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles.md#voting-only-node) instance. If you have only two availability zones in your installation, no tiebreaker can be placed in a third zone, limiting the cluster’s ability to tolerate certain failures.

With multiple {{es}} nodes in multiple availability zones, you have the recommended hardware; the next thing to consider is having the recommended index replication. Each index, with the exception of searchable snapshot indexes, should have one or more replicas. Use the index settings API to find any indices with no replica:
## Tiebreaker master nodes

```sh
GET _all/_settings/index.number_of_replicas
```
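If that returns indices with a replica count of `0`, a sketch of raising it with the update index settings API (the index name is a placeholder):

```sh
PUT my-index/_settings
{
  "index.number_of_replicas": 1
}
```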
A tiebreaker is a lightweight voting-only node used in distributed clusters to help avoid split-brain scenarios, where the cluster could incorrectly split into multiple autonomous parts during a network partition.

::::{warning}
Indices with no replica, except for [searchable snapshot indices](/deploy-manage/tools/snapshot-and-restore/searchable-snapshots.md), are not highly available. You should use replicas to mitigate against possible data loss.
::::
When you create a cluster with nodes in two availability zones and a third zone is available, ECE can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: By [tagging allocator hosts](./ece-configuring-ece-tag-allocators.md) in ECE, you can create a cluster with eight nodes each in zones `ece-1a` and `ece-1b`, for a total of 16 nodes, and one tiebreaker node in zone `ece-1c`. This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.
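To confirm how roles, including the voting-only tiebreaker, are distributed across nodes in a running cluster, you can list node roles with the cat nodes API:

```sh
GET _cat/nodes?v&h=name,node.role,master
```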

Refer to [](../../reference-architectures.md) for information about {{es}} architectures.
## Zookeeper nodes

## Snapshot backups [ece-ece-ha-4-snapshot]
Make sure you have three Zookeepers—by default, on the Director host—for your ECE installation. Just as three {{es}} master nodes can form a quorum, three Zookeepers can form the quorum needed for high availability.

You should configure and use [{{es}} snapshots](/deploy-manage/tools/snapshot-and-restore.md). Snapshots provide a way to backup and restore your {{es}} indices. They can be used to copy indices for testing, to recover from failures or accidental deletions, or to migrate data to other deployments. We recommend configuring an [{{ece}}-level repository](../../tools/snapshot-and-restore/cloud-enterprise.md) to apply across all deployments. See [Work with snapshots](../../tools/snapshot-and-restore.md) for more guidance.
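As a minimal example, assuming a snapshot repository named `my_repository` is already registered, an on-demand snapshot can be triggered with:

```sh
PUT _snapshot/my_repository/my_snapshot?wait_for_completion=false
```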
Backing up the Zookeeper data directory is also recommended. Refer to [rebuilding a broken Zookeeper quorum](../../../troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum.md) for more guidance.
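As a sketch only, with a placeholder path standing in for the actual Zookeeper data directory on your director hosts, a dated archive could be created like this:

```sh
# The data directory path is a placeholder; adjust it to your installation.
tar -czf zookeeper-data-$(date +%F).tar.gz /path/to/zookeeper/data
```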

## Further considerations [ece-ece-ha-5-other]
## External resources accessibility

* Make sure you have three Zookeepers - by default, on the Director host - for your ECE installation. Just as three Elasticsearch master nodes can form a quorum, three Zookeepers can form the quorum for high availability purposes. Backing up the Zookeeper data directory is also recommended: refer to [](/troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum.md) for more guidance.
If you’re using a [private Docker registry server](ece-install-offline-with-registry.md) or hosting any [custom bundles and plugins](../../../solutions/search/full-text/search-with-synonyms.md) on a web server, make sure these resources are accessible from all ECE allocators, so they can continue to be accessed in the event of a network partition or zone outage.

* Make sure that if you’re using a [private Docker registry server](ece-install-offline-with-registry.md) or are using any [custom bundles and plugins](../../../solutions/search/full-text/search-with-synonyms.md) hosted on a web server, that these are available to all ECE allocators, so that they can continue to be accessed in the event of a network partition or zone outage.
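One way to sanity-check this, assuming a placeholder registry hostname and port, is to query the registry API from each allocator host and confirm it responds:

```sh
# Run from each allocator host; the hostname and port are placeholders.
curl -sIk https://my-registry.example.com:5000/v2/ | head -n 1
```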
## Other recommendations

* Don’t delete containers unless guided by Elastic Support or there’s public documentation explicitly describing this as required action. Otherwise, it can cause issues and you may lose access or functionality of your {{ece}} platform. See [](/troubleshoot/deployments/cloud-enterprise/troubleshooting-container-engines.md) for more information.
Avoid deleting containers unless explicitly instructed by Elastic Support or official documentation. Doing so may lead to unexpected issues or loss of access to your {{ece}} platform. For more details, refer to [](/troubleshoot/deployments/cloud-enterprise/troubleshooting-container-engines.md).

If in doubt, please [contact support for help](/troubleshoot/index.md#contact-us).
2 changes: 1 addition & 1 deletion deploy-manage/deploy/elastic-cloud/cloud-hosted.md
@@ -106,7 +106,7 @@ Of course, you can choose to follow your own path and use Elastic components ava

**Adjust the capacity and capabilities of your deployments for production**

There are a few things that can help you make sure that your production deployments remain available, healthy, and ready to handle your data in a scalable way over time, with the expected level of performance. Check [](/deploy-manage/production-guidance/plan-for-production-elastic-cloud.md).
There are a few things that can help you make sure that your production deployments remain available, healthy, and ready to handle your data in a scalable way over time, with the expected level of performance. Check [](/deploy-manage/production-guidance.md).

**Secure your environment**

@@ -65,7 +65,7 @@ You can also create a deployment using the [Elastic Cloud API](https://www.elast

To make sure you’re all set for production, consider the following actions:

* [Plan for your expected workloads](/deploy-manage/production-guidance/plan-for-production-elastic-cloud.md) and consider how many availability zones you’ll need.
* [Plan for your expected workloads](/deploy-manage/production-guidance.md) and consider how many availability zones you’ll need.
* [Create a deployment](/deploy-manage/deploy/elastic-cloud/create-an-elastic-cloud-hosted-deployment.md) on the region you need and with a hardware profile that matches your use case.
* [Change your configuration](/deploy-manage/deploy/elastic-cloud/ec-customize-deployment-components.md) by turning on autoscaling, adding high availability, or adjusting components of the Elastic Stack.
* [Add extensions and plugins](/deploy-manage/deploy/elastic-cloud/add-plugins-extensions.md) to use Elastic supported extensions or add your own custom dictionaries and scripts.
39 changes: 28 additions & 11 deletions deploy-manage/production-guidance.md
@@ -1,27 +1,44 @@
---
mapped_pages:
- https://www.elastic.co/guide/en/cloud/current/ec-best-practices-data.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html
applies_to:
deployment:
ess: all
ece: all
eck: all
self: all
---

# Production guidance [ec-best-practices-data]
% scope: the scope of this page is just a brief introduction to prod guidance at elastic stack level, links to ES and KIB,
# Production guidance

This section provides some best practices for managing your data to help you set up a production environment that matches your workloads, policies, and deployment needs.
Running the {{stack}} in production requires careful planning to ensure resilience, performance, and scalability. This section outlines best practices and recommendations for optimizing {{es}} and {{kib}} in production environments.

You’ll learn how to design highly available and resilient deployments, implement best practices for managing workloads, and apply performance optimizations to handle scaling demands efficiently.

## Plan your data structure, availability, and formatting [ec_plan_your_data_structure_availability_and_formatting]
For {{es}}, this includes strategies for fault tolerance, data replication, and workload distribution to maintain stability under load. For {{kib}}, you’ll explore how to deploy multiple Kibana instances within the same environment and make informed decisions about scaling horizontally or vertically based on the task manager metrics, which provide insights into background task execution and resource consumption.
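For example, assuming a locally reachable {{kib}} instance and placeholder credentials, the task manager health API can be queried like this:

```sh
# The URL and credentials are placeholders for your environment.
curl -s -u elastic:$ELASTIC_PASSWORD "http://localhost:5601/api/task_manager/_health"
```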

* Build a [data architecture](/manage-data/lifecycle/data-tiers.md) that best fits your needs. Your {{ech}} deployment comes with default hot tier {{es}} nodes that store your most frequently accessed data. Based on your own access and retention policies, you can add warm, cold, frozen data tiers, and automated deletion of old data.
* Make your data [highly available](/deploy-manage/tools.md) for production environments or otherwise critical data stores, and take regular [backup snapshots](tools/snapshot-and-restore.md).
* Normalize event data to better analyze, visualize, and correlate your events by adopting the [Elastic Common Schema](ecs://reference/ecs-getting-started.md) (ECS). Elastic integrations use ECS out-of-the-box. If you are writing your own integrations, ECS is recommended.
By following this guidance, you can ensure your {{stack}} deployment is robust, efficient, and prepared for production-scale workloads.

For detailed, component-specific guidance, refer to:
* [](./production-guidance/elasticsearch-in-production-environments.md)
* [](./production-guidance/kibana-in-production-environments.md)

## Optimize data storage and retention [ec_optimize_data_storage_and_retention]
## Deployment types

Once you have your data tiers deployed and you have data flowing, you can [manage the index lifecycle](/manage-data/lifecycle/index-lifecycle-management.md).
Production guidelines and concepts described in this section apply to all [deployment types](/deploy-manage/deploy.md#choosing-your-deployment-type)—including {{ech}}, {{ece}}, {{eck}}, and self-managed clusters—**except** {{serverless-full}}.

::::{tip}
[Elastic integrations](https://www.elastic.co/integrations) provide default index lifecycle policies, and you can [build your own policies for your custom integrations](/manage-data/lifecycle/index-lifecycle-management/tutorial-automate-rollover.md).
However, certain parts may be relevant only to self-managed clusters, as orchestration systems automate some of the configurations discussed here. Check the [badges](/get-started/versioning-availability.md#availability-of-features) on each document or section to confirm whether the content applies to your deployment type.

::::{note}
**{{serverless-full}}** projects are fully managed and automatically scaled by Elastic. Your project’s performance and general data retention are controlled by the [Search AI Lake settings](/deploy-manage/deploy/elastic-cloud/project-settings.md#elasticsearch-manage-project-search-ai-lake-settings).
::::

## Other Elastic products

If you are looking for production guidance for Elastic products other than {{es}} or {{kib}}, check out the following resources:

* [High availability on ECE orchestrator](/deploy-manage/deploy/cloud-enterprise/ece-ha.md)
* [APM scalability and performance](/troubleshoot/observability/apm/processing-performance.md)
* [Fleet server scalability](/reference/fleet/fleet-server-scalability.md)
* [Deploying and scaling Logstash](logstash://reference/deploying-scaling-logstash.md)