Fix welcome-to-elastic links #484

Merged on Sep 13, 2023
@@ -9,12 +9,12 @@ For more information on how to deploy {agent} on {k8s}, please review these page
[discrete]
== Observability at scale

This document summarizes some key factors and best practices for using https://www.elastic.co/guide/en/welcome-to-elastic/current/getting-started-kubernetes.html[Elastic {observability}] to monitor {k8s} infrastructure at scale. Users need to consider different parameters and adjust the {stack} accordingly. The following elements are affected as the size of the {k8s} cluster increases:
This document summarizes some key factors and best practices for using {estc-welcome-current}/getting-started-kubernetes.html[Elastic {observability}] to monitor {k8s} infrastructure at scale. Users need to consider different parameters and adjust the {stack} accordingly. The following elements are affected as the size of the {k8s} cluster increases:

- The amount of metrics being collected from several {k8s} endpoints
- The {agent} resources needed to cope with the high CPU and memory demands of internal processing
- The {es} resources needed due to the higher rate of metric ingestion
- The response times of the dashboard visualizations as more data is requested for a given time window

The document is divided into two main sections:

@@ -41,7 +41,7 @@ The {k8s} {observability} is based on https://docs.elastic.co/en/integrations/ku

The Controller Manager and Scheduler data streams are enabled only on the specific node where those components actually run, based on autodiscovery rules.
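
For illustration, the following is a minimal sketch of such a condition in a standalone {agent} policy. It assumes the control-plane Pods carry the usual `component` labels; the port, period, and dataset shown are illustrative assumptions rather than values taken from the official manifest.

[source,yaml]
------------------------------------------------
# Hypothetical policy snippet: collect Scheduler metrics only on the node
# whose kube-scheduler Pod matches the autodiscovery condition.
- data_stream:
    dataset: kubernetes.scheduler
    type: metrics
  metricsets:
    - scheduler
  hosts:
    - "https://0.0.0.0:10259"
  period: 10s
  condition: ${kubernetes.labels.component} == 'kube-scheduler'
------------------------------------------------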

The default manifest deploys {agent} as a DaemonSet, which results in an {agent} being deployed on every node of the {k8s} cluster.

Additionally, by default one agent is elected as **leader** (for more information visit <<kubernetes_leaderelection-provider>>). The {agent} Pod which holds the leadership lock is responsible for collecting the cluster-wide metrics in addition to its node's metrics.
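
As a rough sketch, cluster-wide data streams in a standalone policy are typically guarded by a condition on the <<kubernetes_leaderelection-provider>>, so only the current leader collects them. The dataset, host, and period below are illustrative assumptions.

[source,yaml]
------------------------------------------------
# Hypothetical policy snippet: only the Pod holding the leader lease
# collects this cluster-wide dataset from kube-state-metrics.
- data_stream:
    dataset: kubernetes.state_pod
    type: metrics
  metricsets:
    - state_pod
  hosts:
    - "kube-state-metrics:8080"
  period: 10s
  condition: ${kubernetes_leaderelection.leader} == true
------------------------------------------------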

@@ -58,7 +58,7 @@ The DaemonSet deployment approach with leader election simplifies the installati
[discrete]
=== Specifying resources and limits in Agent manifests

The resourcing of your Pods and their scheduling priority (check section <<agent-scheduling,Scheduling priority>>) are two topics that might be affected as the {k8s} cluster size increases.
The increasing demand for resources might result in under-resourced Elastic Agents in your cluster.

Based on our tests, we advise configuring only the `limits` section of the `resources` section in the manifest. In this way, the `requests` settings of the `resources` fall back to the specified `limits`. The `limits` value is the upper bound for your microservice process, meaning it can operate with fewer resources while protecting {k8s} from assigning a bigger share and guarding against possible resource exhaustion.
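
As a minimal sketch, assuming the stock DaemonSet manifest, the `elastic-agent` container would then declare only `limits`; the image tag and the values below are placeholders, not recommendations.

[source,yaml]
------------------------------------------------
# Hypothetical container spec: only limits are set, so Kubernetes
# defaults the requests of this container to the same values.
containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.8.0
    resources:
      limits:
        cpu: "500m"
        memory: "800Mi"
------------------------------------------------
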
@@ -76,11 +76,11 @@ Based on our https://github.com/elastic/elastic-agent/blob/main/docs/elastic-age

Sample Elastic Agent Configurations:
|===
| No of Pods in K8s Cluster | Leader Agent Resources | Rest of Agents
| 1000 | cpu: "1500m", memory: "800Mi" | cpu: "300m", memory: "600Mi"
| 3000 | cpu: "2000m", memory: "1500Mi" | cpu: "400m", memory: "800Mi"
| 5000 | cpu: "3000m", memory: "2500Mi" | cpu: "500m", memory: "900Mi"
| 10000 | cpu: "3000m", memory: "3600Mi" | cpu: "700m", memory: "1000Mi"
|===

> The above tests were performed with {agent} version 8.7 and a scraping period of `10sec` (the period setting for the {k8s} integration). Those numbers are just indicators and should be validated for each {k8s} environment and amount of workloads.
@@ -94,19 +94,19 @@ Although daemonset installation is simple, it can not accommodate the varying ag

- A dedicated {agent} deployment of a single Agent for collecting cluster-wide metrics from the apiserver

- Node-level {agent}s (no leader Agent) in a DaemonSet

- kube-state-metrics shards and {agent}s in the StatefulSet defined in the kube-state-metrics autosharding manifest

Each of these groups of {agent}s will have its own policy specific to its function and can be resourced independently in the appropriate manifest to accommodate its specific resource requirements.

Resource assignment considerations led us to the following alternative installation methods.

IMPORTANT: The main suggestion for large-scale clusters *is to install {agent} as a side container along with the `kube-state-metrics` shard*. The installation is explained in detail in https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes#kube-state-metrics-ksm-in-autosharding-configuration[{agent} with Kustomize in Autosharding].
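
A heavily simplified sketch of that layout is shown below, assuming the upstream kube-state-metrics autosharding StatefulSet; names, image tags, and the Elastic Agent configuration are assumptions, and the Kustomize output in the linked repository is the authoritative source.

[source,yaml]
------------------------------------------------
# Hypothetical StatefulSet excerpt: each replica runs one KSM shard plus an
# Elastic Agent side container that scrapes that shard over localhost.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
spec:
  serviceName: kube-state-metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
          args:
            - --pod=$(POD_NAME)              # enables autosharding
            - --pod-namespace=$(POD_NAMESPACE)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
        - name: elastic-agent               # policy and outputs omitted
          image: docker.elastic.co/beats/elastic-agent:8.8.0
          resources:
            limits:
              cpu: "1500m"
              memory: "1400Mi"
------------------------------------------------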

The following **alternative configuration methods** have been verified:

1. With `hostNetwork:false`
- {agent} as a side container within the KSM shard Pod
- For non-leader {agent} deployments that collect metrics per KSM shard
2. With `taints/tolerations` to isolate the {agent} DaemonSet Pods from the rest of the deployments (see the sketch after this list)
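
As a rough sketch of the second approach, assume the nodes reserved for {agent} are tainted with a hypothetical `dedicated=elastic-agent:NoSchedule` taint; the DaemonSet Pod spec then carries a matching toleration.

[source,yaml]
------------------------------------------------
# Hypothetical Pod spec excerpt: tolerate the dedicated-node taint so the
# Elastic Agent DaemonSet can schedule where other workloads cannot.
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "elastic-agent"
    effect: "NoSchedule"
------------------------------------------------
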
@@ -116,10 +116,10 @@ You can find more information in the document called https://github.com/elastic/
Based on our https://github.com/elastic/elastic-agent/blob/main/docs/elastic-agent-scaling-tests.md[{agent} scaling tests], the following table aims to assist users on how to configure their KSM sharding as the {k8s} cluster scales:
|===
| No of Pods in K8s Cluster | No of KSM Shards | Agent Resources
| 1000 | No sharding; can be handled with the default KSM config | limits: memory: 700Mi, cpu: 500m
| 3000 | 4 Shards | limits: memory: 1400Mi, cpu: 1500m
| 5000 | 6 Shards | limits: memory: 1400Mi, cpu: 1500m
| 10000 | 8 Shards | limits: memory: 1400Mi, cpu: 1500m
|===

> The tests above were performed with {agent} version 8.8 with TSDB enabled and a scraping period of `10sec` (for the {k8s} integration). Those numbers are just indicators and should be validated for each {k8s} policy configuration, along with the applications that the {k8s} cluster might include.
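
In the autosharding setup, the number of KSM shards follows the StatefulSet replica count, so moving between the rows above is mostly a matter of adjusting `replicas`. A minimal sketch for the 3000-Pod row, under that assumption:

[source,yaml]
------------------------------------------------
# Hypothetical patch: four StatefulSet replicas give four KSM shards,
# each paired with its own Elastic Agent side container.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kube-state-metrics
spec:
  replicas: 4
------------------------------------------------
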
@@ -152,7 +152,7 @@ Additionally, https://github.com/elastic/integrations/blob/main/docs/dashboard_g
[discrete]
=== Elastic Stack Configuration

The configuration of the Elastic Stack needs to be taken into consideration in large-scale deployments. In the case of Elastic Cloud deployments, the choice of the deployment's https://www.elastic.co/guide/en/cloud/current/ec-getting-started-profiles.html[{ecloud} hardware profile] is important.

For heavy processing and high ingestion-rate needs, the `CPU-optimised` profile is proposed.

@@ -161,7 +161,7 @@ For heavy processing and big ingestion rate needs, the `CPU-optimised` profile i
== Validation and Troubleshooting practices

[discrete]
=== Determine whether Agents are collecting as expected

After {agent} deployment, we need to verify that the Agent services are healthy and not restarting (stability), and that the collection of metrics continues at the expected rate (latency).

Expand Down Expand Up @@ -217,7 +217,7 @@ Components:
Healthy: communicating with pid '42462'
------------------------------------------------

A common problem as the {k8s} cluster size grows is that the agent process restarts because of a lack of CPU/memory resources. In the logs of the agent you can look for such restarts, for example:

[source,json]
------------------------------------------------
kubectl logs -n kube-system elastic-agent-qw6f4 | grep "kubernetes/metrics"

------------------------------------------------

You can verify the instantaneous resource consumption by running the `kubectl top pod` command, and identify whether the agents are close to the limits you have specified in your manifest.

[source,bash]
------------------------------------------------
@@ -261,7 +261,7 @@ Identify how many events have been sent to {es}:

[source,bash]
------------------------------------------------
kubectl logs -n kube-system elastic-agent-h24hh -f | grep -i state_pod
[output truncated ...]

"state_pod":{"events":2936,"success":2936}
@@ -282,5 +282,5 @@ Corresponding dashboards for `CPU Usage`, `Index Response Times` and `Memory Pre

== Relevant links

- https://www.elastic.co/guide/en/welcome-to-elastic/current/getting-started-kubernetes.html[Monitor {k8s} Infrastructure]
- {estc-welcome-current}/getting-started-kubernetes.html[Monitor {k8s} Infrastructure]
- https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring[Blog: Managing your {k8s} cluster with Elastic {observability}]