Fix welcome-to-elastic links #484

Merged on Sep 13, 2023
@@ -9,12 +9,12 @@ For more information on how to deploy {agent} on {k8s}, please review these page
[discrete]
== Observability at scale

This document summarizes some key factors and best practices for using https://www.elastic.co/guide/en/welcome-to-elastic/current/getting-started-kubernetes.html[Elastic {observability}] to monitor {k8s} infrastructure at scale. Users need to consider different parameters and adjust the {stack} accordingly. The following elements are affected as the size of the {k8s} cluster increases:
This document summarizes some key factors and best practices for using {estc-welcome-current}/getting-started-kubernetes.html[Elastic {observability}] to monitor {k8s} infrastructure at scale. Users need to consider different parameters and adjust the {stack} accordingly. The following elements are affected as the size of the {k8s} cluster increases:

- The amount of metrics being collected from several {k8s} endpoints
- The {agent} resources needed to cope with the high CPU and memory demands of internal processing
- The {es} resources needed due to the higher rate of metric ingestion
- The response times of the dashboard visualizations as more data is requested for a given time window

The document is divided into two main sections:

@@ -41,7 +41,7 @@ The {k8s} {observability} is based on https://docs.elastic.co/en/integrations/ku

The Controller Manager and Scheduler data streams are enabled only on the specific node where those components actually run, based on autodiscovery rules.
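
For illustration, the following is a minimal sketch of such a condition in a standalone {agent} policy. It assumes the control-plane Pods carry the usual `component` labels; the port, period, and dataset shown are illustrative assumptions rather than values taken from the official manifest.

[source,yaml]
------------------------------------------------
# Hypothetical policy snippet: collect Scheduler metrics only on the node
# whose kube-scheduler Pod matches the autodiscovery condition.
- data_stream:
    dataset: kubernetes.scheduler
    type: metrics
  metricsets:
    - scheduler
  hosts:
    - "https://0.0.0.0:10259"
  period: 10s
  condition: ${kubernetes.labels.component} == 'kube-scheduler'
------------------------------------------------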

The default manifest deploys {agent} as a DaemonSet, which results in an {agent} being deployed on every node of the {k8s} cluster.

Additionally, by default one agent is elected as **leader** (for more information visit <<kubernetes_leaderelection-provider>>). The {agent} Pod which holds the leadership lock is responsible for collecting the cluster-wide metrics in addition to its node's metrics.
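
As a rough sketch, cluster-wide data streams in a standalone policy are typically guarded by a condition on the <<kubernetes_leaderelection-provider>>, so only the current leader collects them. The dataset, host, and period below are illustrative assumptions.

[source,yaml]
------------------------------------------------
# Hypothetical policy snippet: only the Pod holding the leader lease
# collects this cluster-wide dataset from kube-state-metrics.
- data_stream:
    dataset: kubernetes.state_pod
    type: metrics
  metricsets:
    - state_pod
  hosts:
    - "kube-state-metrics:8080"
  period: 10s
  condition: ${kubernetes_leaderelection.leader} == true
------------------------------------------------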

@@ -58,7 +58,7 @@ The DaemonSet deployment approach with leader election simplifies the installati
[discrete]
=== Specifying resources and limits in Agent manifests

The resourcing of your Pods and their scheduling priority (check section <<agent-scheduling,Scheduling priority>>) are two topics that might be affected as the {k8s} cluster size increases.
The increasing demand for resources might result in under-resourced Elastic Agents in your cluster.

Based on our tests, we advise configuring only the `limits` section of the `resources` section in the manifest. In this way, the `requests` settings of the `resources` fall back to the specified `limits`. The `limits` value is the upper bound for your microservice process, meaning it can operate with fewer resources while protecting {k8s} from assigning a bigger share and guarding against possible resource exhaustion.
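
As a minimal sketch, assuming the stock DaemonSet manifest, the `elastic-agent` container would then declare only `limits`; the image tag and the values below are placeholders, not recommendations.

[source,yaml]
------------------------------------------------
# Hypothetical container spec: only limits are set, so Kubernetes
# defaults the requests of this container to the same values.
containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.8.0
    resources:
      limits:
        cpu: "500m"
        memory: "800Mi"
------------------------------------------------
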
@@ -76,11 +76,11 @@ Based on our https://github.com/elastic/elastic-agent/blob/main/docs/elastic-age

Sample Elastic Agent Configurations:
|===
| No of Pods in K8s Cluster | Leader Agent Resources | Rest of Agents
| 1000 | cpu: "1500m", memory: "800Mi" | cpu: "300m", memory: "600Mi"
| 3000 | cpu: "2000m", memory: "1500Mi" | cpu: "400m", memory: "800Mi"
| 5000 | cpu: "3000m", memory: "2500Mi" | cpu: "500m", memory: "900Mi"
| 10000 | cpu: "3000m", memory: "3600Mi" | cpu: "700m", memory: "1000Mi"
|===

> The above tests were performed with {agent} version 8.7 and a scraping period of `10sec` (the period setting for the {k8s} integration). Those numbers are just indicators and should be validated for each {k8s} environment and amount of workloads.
@@ -94,19 +94,19 @@ Although daemonset installation is simple, it can not accommodate the varying ag

- A dedicated {agent} deployment of a single Agent for collecting cluster-wide metrics from the apiserver

- Node-level {agent}s (no leader Agent) in a DaemonSet

- kube-state-metrics shards and {agent}s in the StatefulSet defined in the kube-state-metrics autosharding manifest

Each of these groups of {agent}s will have its own policy specific to its function and can be resourced independently in the appropriate manifest to accommodate its specific resource requirements.

Resource assignment considerations led us to the following alternative installation methods.

IMPORTANT: The main suggestion for large-scale clusters *is to install {agent} as a side container along with the `kube-state-metrics` shard*. The installation is explained in detail in https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes#kube-state-metrics-ksm-in-autosharding-configuration[{agent} with Kustomize in Autosharding].
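
A heavily simplified sketch of that layout is shown below, assuming the upstream kube-state-metrics autosharding StatefulSet; names, image tags, and the Elastic Agent configuration are assumptions, and the Kustomize output in the linked repository is the authoritative source.

[source,yaml]
------------------------------------------------
# Hypothetical StatefulSet excerpt: each replica runs one KSM shard plus an
# Elastic Agent side container that scrapes that shard over localhost.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
spec:
  serviceName: kube-state-metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
          args:
            - --pod=$(POD_NAME)              # enables autosharding
            - --pod-namespace=$(POD_NAMESPACE)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
        - name: elastic-agent               # policy and outputs omitted
          image: docker.elastic.co/beats/elastic-agent:8.8.0
          resources:
            limits:
              cpu: "1500m"
              memory: "1400Mi"
------------------------------------------------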

The following **alternative configuration methods** have been verified:

1. With `hostNetwork:false`
- {agent} as a side container within the KSM shard Pod
- For non-leader {agent} deployments that collect metrics per KSM shard
2. With `taints/tolerations` to isolate the {agent} DaemonSet Pods from the rest of the deployments (see the sketch after this list)
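
As a rough sketch of the second approach, assume the nodes reserved for {agent} are tainted with a hypothetical `dedicated=elastic-agent:NoSchedule` taint; the DaemonSet Pod spec then carries a matching toleration.

[source,yaml]
------------------------------------------------
# Hypothetical Pod spec excerpt: tolerate the dedicated-node taint so the
# Elastic Agent DaemonSet can schedule where other workloads cannot.
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "elastic-agent"
    effect: "NoSchedule"
------------------------------------------------
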
@@ -116,10 +116,10 @@ You can find more information in the document called https://github.com/elastic/
Based on our https://github.com/elastic/elastic-agent/blob/main/docs/elastic-agent-scaling-tests.md[{agent} scaling tests], the following table aims to assist users on how to configure their KSM sharding as the {k8s} cluster scales:
|===
| No of Pods in K8s Cluster | No of KSM Shards | Agent Resources
| 1000 | No sharding; can be handled with the default KSM config | limits: memory: 700Mi, cpu: 500m
| 3000 | 4 Shards | limits: memory: 1400Mi, cpu: 1500m
| 5000 | 6 Shards | limits: memory: 1400Mi, cpu: 1500m
| 10000 | 8 Shards | limits: memory: 1400Mi, cpu: 1500m
|===

> The tests above were performed with {agent} version 8.8 with TSDB enabled and a scraping period of `10sec` (for the {k8s} integration). Those numbers are just indicators and should be validated for each {k8s} policy configuration, along with the applications that the {k8s} cluster might include.
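
In the autosharding setup, the number of KSM shards follows the StatefulSet replica count, so moving between the rows above is mostly a matter of adjusting `replicas`. A minimal sketch for the 3000-Pod row, under that assumption:

[source,yaml]
------------------------------------------------
# Hypothetical patch: four StatefulSet replicas give four KSM shards,
# each paired with its own Elastic Agent side container.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
spec:
  replicas: 4
------------------------------------------------
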
@@ -152,7 +152,7 @@ Additionally, https://github.com/elastic/integrations/blob/main/docs/dashboard_g
[discrete]
=== Elastic Stack Configuration

The configuration of the Elastic Stack needs to be taken into consideration in large-scale deployments. In the case of Elastic Cloud deployments, the choice of the deployment's https://www.elastic.co/guide/en/cloud/current/ec-getting-started-profiles.html[{ecloud} hardware profile] is important.

For heavy processing and high ingestion-rate needs, the `CPU-optimised` profile is proposed.

@@ -161,7 +161,7 @@ For heavy processing and big ingestion rate needs, the `CPU-optimised` profile i
== Validation and Troubleshooting practices

[discrete]
=== Determine whether Agents are collecting as expected

After {agent} deployment, we need to verify that the Agent services are healthy and not restarting (stability), and that the collection of metrics continues at the expected rate (latency).

Expand Down Expand Up @@ -217,7 +217,7 @@ Components:
Healthy: communicating with pid '42462'
------------------------------------------------

A common problem as the {k8s} cluster size grows is that the agent process restarts because of a lack of CPU/memory resources. In the logs of the agent you can look for such restarts, for example:

[source,json]
------------------------------------------------
kubectl logs -n kube-system elastic-agent-qw6f4 | grep "kubernetes/metrics"

------------------------------------------------

You can verify the instantaneous resource consumption by running the `kubectl top pod` command, and identify whether the agents are close to the limits you have specified in your manifest.

[source,bash]
------------------------------------------------
@@ -261,7 +261,7 @@ Identify how many events have been sent to {es}:

[source,bash]
------------------------------------------------
kubectl logs -n kube-system elastic-agent-h24hh -f | grep -i state_pod
[output truncated ...]

"state_pod":{"events":2936,"success":2936}
@@ -282,5 +282,5 @@ Corresponding dashboards for `CPU Usage`, `Index Response Times` and `Memory Pre

== Relevant links

- https://www.elastic.co/guide/en/welcome-to-elastic/current/getting-started-kubernetes.html[Monitor {k8s} Infrastructure]
- {estc-welcome-current}/getting-started-kubernetes.html[Monitor {k8s} Infrastructure]
- https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring[Blog: Managing your {k8s} cluster with Elastic {observability}]