FAQ.md — 49 additions & 14 deletions
@@ -8,15 +8,15 @@ If you can't find help here, don't hesitate to open [an issue](https://github.co
  * [Is it for OpenShift only?](#is-it-for-openshift-only)
  * [Which version of Kubernetes / OpenShift is supported?](#which-version-of-kubernetes--openshift-is-supported)
  * How-to
- * [To run the eBPF agent](#to-run-the-ebpf-agent)
- * [To use IPFIX exports](#to-use-ipfix-exports)
- * [To get the OpenShift Console plugin](#to-get-the-openshift-console-plugin)
+ * [How do I visualize flows and metrics?](#how-do-i-visualize-flows-and-metrics)
  * [How can I make sure everything is correctly deployed?](#how-can-i-make-sure-everything-is-correctly-deployed)
  * Troubleshooting
  * [Everything seems correctly deployed but there isn't any flow showing up](#everything-seems-correctly-deployed-but-there-isnt-any-flow-showing-up)
  * [There is no Network Traffic menu entry in OpenShift Console](#there-is-no-network-traffic-menu-entry-in-openshift-console)
  * [I first deployed flowcollector, and then kafka. Flowlogs-pipeline is not consuming any flow from Kafka](#i-first-deployed-flowcollector-and-then-kafka-flowlogs-pipeline-is-not-consuming-any-flow-from-kafka)
+ * [I get a Loki error / timeout, when trying to run a large query, such as querying for the last month of data](#i-get-a-loki-error--timeout-when-trying-to-run-a-large-query-such-as-querying-for-the-last-month-of-data)
  * [I don't see flows from either the `br-int` or `br-ex` interfaces](#i-dont-see-flows-from-either-the-br-int-or-br-ex-interfaces)
+ * [I'm finding discrepancies in metrics](#im-finding-discrepancies-in-metrics)

  ## Q&A

@@ -28,25 +28,28 @@ And if something is not working as hoped with your setup, you are welcome to con

  ### Which version of Kubernetes / OpenShift is supported?

- It depends on which `agent` you want to use: `ebpf` or `ipfix`, and whether you want to get the OpenShift Console plugin.
+ All versions of Kubernetes since 1.22 should work, although there is no official support (best effort).

- ## How to
+ All versions of OpenShift currently supported by Red Hat are supported. Older versions, from 4.10 onward, should also work, although they are not officially supported (best effort).

- ### To run the eBPF agent
+ Some features depend on the Linux kernel version in use. It should be at least 4.18 (earlier versions have never been tested). More recent kernels (> 5.14) are better, both for agent feature completeness and for improved performance.

- What matters is the version of the Linux kernel: 4.18 or more is supported. Earlier versions are not tested.
+ ### How do I visualize flows and metrics?

- Other than that, there are no known restrictions on the Kubernetes version.
+ For OpenShift users, a visualization tool is integrated in the OpenShift console: just open the console in your browser, and you will see new menu items (such as Network Traffic under Observe) once NetObserv is installed and configured.

- ### To use CNI's IPFIX exports
+ Without OpenShift, you can still access the data (Loki logs, Prometheus metrics) in different ways:

- This feature has been deprecated and is not available anymore. Flows are now always generated by the eBPF agent.
+ - Querying Loki (or Prometheus) directly
+ - Using the Prometheus console
+ - Using and configuring Grafana

- Note that NetObserv itself is still able to export its enriched flows as IPFIX: that can be done by configuring `spec.exporters`.
+ All these options depend on how you installed these components.

- ### To get the OpenShift Console plugin
-
- OpenShift 4.10 or above is required.
+ If you feel ready for hacking, there is also a way to view the Test Console, used by the development team, which is similar to the OpenShift console plugin and can work without OpenShift. You need to:
+ - Build the console plugin in "standalone" mode: https://github.com/netobserv/network-observability-console-plugin?tab=readme-ov-file#standalone-frontend (you can just build the image, no need to run it locally).
+ - Configure the Operator to use this build: `kubectl set env deployment/netobserv-controller-manager -c "manager" RELATED_IMAGE_CONSOLE_PLUGIN="<your build image here>"`
+ - Configure the Operator to deploy the Test Console: in the `FlowCollector` YAML, set `spec.consolePlugin.advanced.env.TEST_CONSOLE` to `true` (see the sketch after this list).
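For reference, here is a minimal `FlowCollector` sketch for that last step, assuming the usual resource name `cluster` (only the relevant fields are shown):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  consolePlugin:
    advanced:
      env:
        # Tells the Operator to deploy the Test Console alongside the plugin backend
        TEST_CONSOLE: "true"
```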

  ### How can I make sure everything is correctly deployed?

@@ -174,6 +177,22 @@ This is a [known bug](https://github.com/segmentio/kafka-go/issues/1044) in one

  Please recreate the flowlogs-pipeline pods by either killing them manually or deleting and recreating the flow collector object.

+ ### I get a Loki error / timeout, when trying to run a large query, such as querying for the last month of data
+
+ There are several ways to mitigate this issue, although there is no silver bullet. As a rule of thumb, be aware that Prometheus is a better fit than Loki for queries on large time ranges.
+
+ With Loki queries, a first thing to understand is that, while Loki allows querying both indexed fields (aka. labels) and non-indexed fields, **queries that contain filters on labels will perform much better**. So, perhaps you can adapt your query to add an indexed filter: for instance, if you were querying for a particular Pod (which isn't indexed), add its Namespace to the query. The list of indexed fields [is documented here](https://docs.openshift.com/container-platform/4.15/observability/network_observability/json-flows-format-reference.html#network-observability-flows-format_json_reference), in the `Loki label` column.
+
+ Depending on what you are trying to get, you may also **consider querying Prometheus rather than Loki**. Queries on Prometheus are much faster than on Loki and should not struggle with large time ranges, hence they should be favored whenever possible. But Prometheus metrics do not contain as much information as flow logs in Loki, so whether or not you can do that really depends on the use case. When you use the NetObserv console plugin, it automatically tries to favor Prometheus over Loki if the query is compatible; else it falls back to Loki. If your query doesn't run against Prometheus, changing some filters or aggregations can make the switch. In the console plugin, you can force the use of Prometheus: incompatible queries will fail, and the error message displayed should help you figure out which labels you can try to change to make the query compatible (for instance, changing a filter or an aggregation from Resource/Pods to Owner).
+
+ If the data that you need isn't available as a Prometheus metric, you may also **consider using the [FlowMetrics API](https://github.com/netobserv/network-observability-operator/blob/main/docs/Metrics.md#custom-metrics-using-the-flowmetrics-api)** to create your own metric, as sketched below. You need to be careful about the metrics cardinality, as explained in this link.
+
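As an illustration only — the exact schema is described in the FlowMetrics documentation linked above, and the metric name, namespace and label choices here are hypothetical — a low-cardinality per-namespace byte counter could look like:

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: egress-per-namespace
  namespace: netobserv   # assumes NetObserv components run in the netobserv namespace
spec:
  # Typically exposed with a netobserv_ prefix, e.g. netobserv_egress_per_namespace_bytes_total
  metricName: egress_per_namespace_bytes_total
  type: Counter
  valueField: Bytes
  # Keep cardinality low: namespace-level labels only, no per-pod labels
  labels:
    - SrcK8S_Namespace
```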
+ If the problem persists, there are ways to **configure Loki to improve the query performance**. Some options depend on the installation mode you used for Loki (using the Operator and `LokiStack`, or `Monolithic` mode, or `Microservices` mode):
+
+ - In `LokiStack` or `Microservices` modes, try [increasing the number of querier replicas](https://loki-operator.dev/docs/api.md/#loki-grafana-com-v1-LokiComponentSpec).
+ - Increase the [query timeout](https://loki-operator.dev/docs/api.md/#loki-grafana-com-v1-QueryLimitSpec). You will also need to increase the NetObserv read timeout to Loki accordingly, in the `FlowCollector` `spec.loki.readTimeout`. A sketch of both settings follows this list.
+
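As a rough sketch only, assuming Loki was installed with the Operator in `LokiStack` mode (check the loki-operator API documentation linked above for the exact field names; only the tuning-related fields are shown, a real `LokiStack` also needs its size and storage settings):

```yaml
# LokiStack tuning: more queriers, longer query timeout
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki
  namespace: netobserv
spec:
  template:
    querier:
      replicas: 2
  limits:
    global:
      queries:
        queryTimeout: 3m
---
# Matching read timeout on the NetObserv side
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  loki:
    readTimeout: 3m
```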
  ### I don't see flows from either the `br-int` or `br-ex` interfaces

  [`br-ex` and `br-int` are virtual bridge devices](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_guide/bridge-mappings),
@@ -183,3 +202,19 @@ by the agent when it is processed by other interfaces (e.g. physical host or vir

  This means that, if you restrict the agent interfaces (using the `interfaces` or `excludeInterfaces`
  properties) to attach only to `br-int` and/or `br-ex`, you won't be able to see any flow.
+
+ ### I'm finding discrepancies in metrics
+
+ 1. NetObserv metrics (such as `netobserv_workload_ingress_bytes_total`) show *higher values* than cadvisor metrics (such as `container_network_receive_bytes_total`)
+
+ This can happen when traffic goes through Kubernetes Services: when a Pod talks to another Pod via a Service, two flows are generated, one against the Service and one against the Pod. To avoid counting this traffic twice, you can refine your PromQL to ignore traffic targeting Services, e.g.: `sum(rate(netobserv_workload_ingress_bytes_total{DstK8S_Namespace="my-namespace",SrcK8S_Type!="Service",DstK8S_Type!="Service"}[2m]))`
+
+ 2. NetObserv metrics (such as `netobserv_workload_ingress_bytes_total`) show *lower values* than cadvisor metrics (such as `container_network_receive_bytes_total`)
+
+ There are several possible causes:
+
+ - Sampling is being used. Check your `FlowCollector` `spec.agent.ebpf.sampling`: a value greater than 1 means that not all flows are sampled. NetObserv metrics aren't normalized automatically, but you can do so in your PromQL by multiplying by the sampling rate, for instance: `sum(rate(netobserv_workload_ingress_bytes_total{DstK8S_Namespace="my-namespace"}[2m])) * avg(netobserv_agent_sampling_rate > 0)`. Be aware that the higher the sampling value, the less accurate the metrics.
+
+ - Filters are configured in the agent, causing some of the traffic to be ignored. Check your `FlowCollector` `spec.agent.ebpf.flowFilter`, `spec.agent.ebpf.interfaces` and `spec.agent.ebpf.excludeInterfaces`, and make sure they don't filter out some of the traffic that you are looking at.
+
+ - Flows may also be dropped due to resource constraints. Monitor the eBPF agent health in the `NetObserv / Health` dashboard: there are graphs showing drops. Increasing `spec.agent.ebpf.cacheMaxSize` can help avoid these drops, at the cost of increased memory usage. A configuration sketch covering these settings follows.
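For illustration, here are the relevant `FlowCollector` fields mentioned above (the values are arbitrary examples, not recommendations):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    ebpf:
      # 1 = capture every flow; higher values trade accuracy for lower overhead
      sampling: 50
      # A larger per-agent cache reduces flow drops, at the cost of memory
      cacheMaxSize: 100000
      # Make sure interface filtering does not exclude the traffic you expect to see
      excludeInterfaces: ["lo"]
```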