
Netbird relay connection stale for some peers (workaround found) #3936

Open
Silex opened this issue Jun 6, 2025 · 6 comments
@Silex
Contributor

Silex commented Jun 6, 2025

Hello

I run self-hosted NetBird 0.45.1 with peers on versions 0.45.3 and 0.36.5. The peers are relayed due to CGNAT issues (one peer is a 5G router, the other is a Windows PC behind a corporate firewall). After a while the relay becomes "stale": the peers can no longer ping each other, yet the status still says connected:

$ netbird status -d

pictet-nvr1.netbird.stvs:
  NetBird IP: 100.70.94.175
  Public key: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://netbird.stvs.com:443
  Last connection update: 7 hours, 9 minutes ago
  Last WireGuard handshake: 7 hours, 10 minutes ago
  Transfer status (received/sent) 711.3 MiB/18.1 GiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 52.905573ms

$ wg show

peer: wNWlJ95DqnJMCdXX77gZwVLB4oDDInwp7DpACxy/SV4=
  endpoint: 127.0.0.1:38500
  allowed ips: 100.70.94.175/32
  latest handshake: 7 hours, 13 minutes, 32 seconds ago
  transfer: 711.28 MiB received, 18.11 GiB sent
  persistent keepalive: every 25 seconds

As you can see, the latest handshake is far too old. A simple workaround is to stop and start netbird, but that kills all other connections (the PC is connected to many routers). Another workaround is to remove the problematic router from its policy group and add it again to force an update, but having to do that manually is annoying.

I guess one could also use wg set to remove the offending peer, and netbird would then recreate the WireGuard peer? So maybe I can monitor the latest handshakes and "kill" the peers that are stuck?

Any ideas welcome.
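
A minimal sketch of the "monitor the latest handshakes" idea, for illustration only. It parses the output of `wg show <interface> latest-handshakes` (one line per peer: public key, then a Unix timestamp, with 0 meaning no handshake yet) and lists peers whose handshake is too old. The interface name `wt0`, the 180-second threshold, and the function names are assumptions, not something confirmed in this thread. The sketch only detects stale peers; what to do with them is a separate question.

```python
import subprocess

# Assumption: a healthy tunnel with keepalives re-handshakes every couple of
# minutes, so anything older than this is treated as stale. Tune as needed.
STALE_AFTER_SECONDS = 180


def read_handshakes(interface: str = "wt0") -> str:
    """Return raw `wg show <interface> latest-handshakes` output.

    The default interface name "wt0" is an assumption; check `wg show`.
    """
    result = subprocess.run(
        ["wg", "show", interface, "latest-handshakes"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_latest_handshakes(output: str) -> dict:
    """Parse lines of '<public key> <unix timestamp>' into a dict."""
    handshakes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        public_key, timestamp = parts
        handshakes[public_key] = int(timestamp)
    return handshakes


def find_stale_peers(handshakes: dict, now: int, max_age: int = STALE_AFTER_SECONDS) -> list:
    """Return public keys that never completed a handshake (timestamp 0)
    or whose last handshake is older than max_age seconds."""
    return sorted(
        key for key, ts in handshakes.items() if ts == 0 or now - ts > max_age
    )
```

Typical use would be `find_stale_peers(parse_latest_handshakes(read_handshakes()), int(time.time()))`, run periodically from cron or a scheduled task.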

@Silex
Contributor Author

Silex commented Jun 6, 2025

I found this, which is interesting, but it seems netbird already does the right thing:

https://www.reddit.com/r/WireGuard/comments/k3d1hc/latest_handshake_few_hours_ago/

@Silex
Contributor Author

Silex commented Jun 6, 2025

Just to clarify the setup:

Netbird runs on multiple 5G routers (Teltonika TRB500) and on multiple Windows servers. The connections are relayed due to CGNAT/firewall issues.

One of these servers records cameras that are served through the routers.

Almost every night, some of the routers' relayed connections become stale and the cameras become unreachable. Simply restarting netbird fixes the issue.

From the other servers, the connections to the routers are usually not stale, but it does happen from time to time.

The problematic server is a VM hosted by a different provider, so the network issues may be mainly due to that provider. My guess, though, is that it has more to do with the WireGuard tunnel not being correctly detected as broken (e.g. the 5G router's IP changed, the 5G connection glitched, etc.).

@Silex
Contributor Author

Silex commented Jun 6, 2025

Meh, I thought it was the WireGuard tunnel, but the problem seems deeper than that:

When peer is unreachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  transfer: 0 B received, 148 B sent
  persistent keepalive: every 25 seconds

When peer is reachable:

peer: 6kq3/G775aJK5slDq1OyEyLFK4TvyZiurx+OddRotVw=
  endpoint: 127.1.189.16:51820
  allowed ips: 100.70.189.16/32
  latest handshake: 28 seconds ago
  transfer: 796.04 KiB received, 247.33 KiB sent
  persistent keepalive: every 25 seconds

I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

The only thing that works at this point is netbird down/up, or editing the peer policy so that netbird "resets" the config.

Should I give 0.46.0 a try?
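
The netbird down/up workaround could be automated along these lines. This is a rough sketch under stated assumptions, not a fix: it takes a list of stale public keys (however obtained) and a list of peers that matter, and only bounces NetBird when the two overlap. `netbird down` and `netbird up` are the real CLI commands used in this thread; the function names, the `watched_peers` idea, and the 5-second pause are hypothetical. As noted earlier, a restart drops every other connection too.

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence


def run_command(argv: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(list(argv), check=True)


def restart_netbird(
    run: Callable[[Sequence[str]], None] = run_command,
    pause_seconds: float = 5.0,  # assumption: short pause between down and up
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Bounce the whole NetBird connection: `netbird down`, then `netbird up`."""
    run(["netbird", "down"])
    sleep(pause_seconds)
    run(["netbird", "up"])


def restart_if_stale(
    stale_peers: Iterable[str],
    watched_peers: Iterable[str],
    run: Callable[[Sequence[str]], None] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart only when a peer we care about is stale.

    Returns True if a restart was triggered, False otherwise.
    """
    affected = set(stale_peers) & set(watched_peers)
    if not affected:
        return False
    restart_netbird(run=run, sleep=sleep)
    return True
```

Passing `run` and `sleep` as parameters keeps the decision logic testable without touching the real service.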

@nazarewk
Contributor

nazarewk commented Jun 6, 2025

> I removed/recreated the peer using plain wg set commands but it does not reconnect the peer.

I'm pretty sure netbird uses an elaborate negotiation process to establish connectivity. I wouldn't expect wg set to have any chance of working unless the peer was directly reachable over the internet.

You can always try 0.46.0, but after looking briefly at the release notes, I don't see anything particularly relevant there.

@Silex
Contributor Author

Silex commented Jun 6, 2025

@nazarewk thanks.

I'm trying to find a workaround so that I only reset the stale peer instead of the whole netbird connection. Any ideas? Removing and re-adding the WireGuard peer seemed smart, but I guess it's a dead end.

@Silex
Contributor Author

Silex commented Jun 6, 2025

Hmm, forwarding UDP 51820 from the WAN to the peer does not seem to help establish a P2P connection. Any idea what to try?
