Commit 67d25ce

Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'
Petr Machata says:

====================
nexthop: Preparations for resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes an issue that packets from a single
flow suddenly end up arriving at a server that does not expect them, which
may lead to TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way causes the packets to arrive in wrong order.

Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature.
In the hash-threshold algorithm, the range of hashes associated with a
next hop must be continuous. With a hash table, mapping between the hash
table buckets and the individual next hops is arbitrary. Therefore when a
next hop is deleted the buckets that held it are simply reassigned to other
next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows are ideally
kept being forwarded to the same endpoints through the same paths as before
the next-hop group change.

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (this patchset)
3) Implementation of resilient next hop group
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
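
The hash-threshold selection described in the cover letter can be illustrated with a small userspace sketch. This is not the kernel's code: `struct ht_hop`, `ht_rebalance()` and `ht_select()` are hypothetical names, and the hash space is assumed to be only 120 values wide so that the ranges are easy to check by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type; not the kernel's struct nh_grp_entry. */
struct ht_hop {
	int id;               /* next-hop identifier */
	uint32_t weight;      /* relative share of the hash space */
	uint32_t upper_bound; /* last hash value (inclusive) owned by this hop */
};

/* Split [0, space) into contiguous ranges proportional to each weight. */
static void ht_rebalance(struct ht_hop *hops, size_t n, uint32_t space)
{
	uint64_t total = 0, acc = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(hops[i].weight > 0);
		total += hops[i].weight;
	}
	/* Every hop must own at least one hash value. */
	assert(n > 0 && space >= total);

	for (i = 0; i < n; i++) {
		acc += hops[i].weight;
		hops[i].upper_bound = (uint32_t)(acc * space / total) - 1;
	}
}

/* Return the ID of the first hop whose range contains the hash. */
static int ht_select(const struct ht_hop *hops, size_t n, uint32_t hash)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (hash <= hops[i].upper_bound)
			return hops[i].id;
	return -1; /* hash lies outside the configured space */
}
```

With five equally weighted hops over 120 hash values, hash 25 falls in the range of hop 2. After hop 3 is dropped and `ht_rebalance()` is run on the remaining four, the same hash resolves to hop 1, which is the reassignment the RFC 2992 figure shows.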
2 parents 4915a40 + 0bccf8e commit 67d25ce
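
For comparison, the bucket indirection that resilient hashing adds might look roughly like the following. Again this is only a sketch under stated assumptions: the 20-bucket table mirrors the figure in the cover letter, the names (`struct res_table`, `res_fill()`, `res_remove()`, `res_select()`) are made up for illustration, and handing freed buckets to the survivors in round-robin order is an assumption, not the policy of the eventual kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RES_NUM_BUCKETS 20 /* illustrative size, matches the figure */

/* Hypothetical bucket table: each bucket stores a next-hop ID. */
struct res_table {
	int bucket[RES_NUM_BUCKETS];
};

/* Selection is a plain modulo followed by a table lookup. */
static int res_select(const struct res_table *t, uint32_t hash)
{
	return t->bucket[hash % RES_NUM_BUCKETS];
}

/* Give each of the n hops an equal run of consecutive buckets. */
static void res_fill(struct res_table *t, const int *ids, size_t n)
{
	size_t i;

	assert(n > 0 && RES_NUM_BUCKETS % n == 0);
	for (i = 0; i < RES_NUM_BUCKETS; i++)
		t->bucket[i] = ids[i / (RES_NUM_BUCKETS / n)];
}

/*
 * Rewrite only the buckets that pointed at the removed hop, handing them
 * to the survivors in round-robin order (an assumed policy). Returns the
 * number of buckets that changed.
 */
static size_t res_remove(struct res_table *t, int gone,
			 const int *survivors, size_t n)
{
	size_t i, next = 0, moved = 0;

	assert(n > 0);
	for (i = 0; i < RES_NUM_BUCKETS; i++) {
		if (t->bucket[i] != gone)
			continue;
		t->bucket[i] = survivors[next++ % n];
		moved++;
	}
	return moved;
}
```

Removing hop 3 from a table filled with hops 1 to 5 touches only the four buckets that held it. A flow whose hash lands in any other bucket keeps its next hop, unlike in the hash-threshold case where range boundaries shift.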

4 files changed: +245 -116 lines changed

drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c

Lines changed: 41 additions & 13 deletions
@@ -4309,25 +4309,36 @@ static int mlxsw_sp_nexthop_obj_validate(struct mlxsw_sp *mlxsw_sp,
 	if (event != NEXTHOP_EVENT_REPLACE)
 		return 0;

-	if (!info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		return mlxsw_sp_nexthop_obj_single_validate(mlxsw_sp, info->nh,
 							    info->extack);
-	return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp, info->nh_grp,
-						   info->extack);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp,
+							   info->nh_grp,
+							   info->extack);
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		return -EOPNOTSUPP;
+	}
 }

 static bool mlxsw_sp_nexthop_obj_is_gateway(struct mlxsw_sp *mlxsw_sp,
 					    const struct nh_notifier_info *info)
 {
 	const struct net_device *dev;

-	if (info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		dev = info->nh->dev;
+		return info->nh->gw_family || info->nh->is_reject ||
+		       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
 		/* Already validated earlier. */
 		return true;
-
-	dev = info->nh->dev;
-	return info->nh->gw_family || info->nh->is_reject ||
-	       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	default:
+		return false;
+	}
 }

 static void mlxsw_sp_nexthop_obj_blackhole_init(struct mlxsw_sp *mlxsw_sp,
@@ -4410,11 +4421,22 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 				     struct mlxsw_sp_nexthop_group *nh_grp,
 				     struct nh_notifier_info *info)
 {
-	unsigned int nhs = info->is_grp ? info->nh_grp->num_nh : 1;
 	struct mlxsw_sp_nexthop_group_info *nhgi;
 	struct mlxsw_sp_nexthop *nh;
+	unsigned int nhs;
 	int err, i;

+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		nhs = 1;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		nhs = info->nh_grp->num_nh;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	nhgi = kzalloc(struct_size(nhgi, nexthops, nhs), GFP_KERNEL);
 	if (!nhgi)
 		return -ENOMEM;
@@ -4427,12 +4449,18 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 		int weight;

 		nh = &nhgi->nexthops[i];
-		if (info->is_grp) {
-			nh_obj = &info->nh_grp->nh_entries[i].nh;
-			weight = info->nh_grp->nh_entries[i].weight;
-		} else {
+		switch (info->type) {
+		case NH_NOTIFIER_INFO_TYPE_SINGLE:
 			nh_obj = info->nh;
 			weight = 1;
+			break;
+		case NH_NOTIFIER_INFO_TYPE_GRP:
+			nh_obj = &info->nh_grp->nh_entries[i].nh;
+			weight = info->nh_grp->nh_entries[i].weight;
+			break;
+		default:
+			err = -EINVAL;
+			goto err_nexthop_obj_init;
 		}
 		err = mlxsw_sp_nexthop_obj_init(mlxsw_sp, nh_grp, nh, nh_obj,
 						weight);
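
The reason for switching on a type instead of testing a boolean can be shown in isolation. The sketch below is a userspace stand-in with hypothetical `demo_*` names, not the kernel structures. It assumes a third type value is added later and shows that a `switch` with a `default` branch rejects it, where an `is_grp` test would silently treat it as a single next hop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the notifier types; names are illustrative. */
enum demo_info_type {
	DEMO_INFO_TYPE_SINGLE,
	DEMO_INFO_TYPE_GRP,
	DEMO_INFO_TYPE_FUTURE, /* assumed later addition */
};

struct demo_single {
	int gw_family;
};

struct demo_grp {
	unsigned int num_nh;
};

struct demo_info {
	enum demo_info_type type; /* selects the valid union member */
	union {
		const struct demo_single *nh;
		const struct demo_grp *nh_grp;
	};
};

/* Store the number of next hops in *nhs, or fail on an unknown type. */
static int demo_count_nexthops(const struct demo_info *info,
			       unsigned int *nhs)
{
	switch (info->type) {
	case DEMO_INFO_TYPE_SINGLE:
		*nhs = 1;
		return 0;
	case DEMO_INFO_TYPE_GRP:
		*nhs = info->nh_grp->num_nh;
		return 0;
	default:
		/* Unknown type: refuse rather than guess. */
		return -EINVAL;
	}
}
```

A caller that passes `DEMO_INFO_TYPE_FUTURE` gets `-EINVAL` back and `*nhs` is left untouched, so no union member is read under the wrong interpretation.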

drivers/net/netdevsim/fib.c

Lines changed: 14 additions & 9 deletions
@@ -860,23 +860,28 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,

 	nexthop = kzalloc(sizeof(*nexthop), GFP_KERNEL);
 	if (!nexthop)
-		return NULL;
+		return ERR_PTR(-ENOMEM);

 	nexthop->id = info->id;

 	/* Determine the number of nexthop entries the new nexthop will
 	 * occupy.
 	 */

-	if (!info->is_grp) {
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		occ = 1;
-		goto out;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		for (i = 0; i < info->nh_grp->num_nh; i++)
+			occ += info->nh_grp->nh_entries[i].weight;
+		break;
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		kfree(nexthop);
+		return ERR_PTR(-EOPNOTSUPP);
 	}

-	for (i = 0; i < info->nh_grp->num_nh; i++)
-		occ += info->nh_grp->nh_entries[i].weight;
-
-out:
 	nexthop->occ = occ;
 	return nexthop;
 }
@@ -972,8 +977,8 @@ static int nsim_nexthop_insert(struct nsim_fib_data *data,
 	int err;

 	nexthop = nsim_nexthop_create(data, info);
-	if (!nexthop)
-		return -ENOMEM;
+	if (IS_ERR(nexthop))
+		return PTR_ERR(nexthop);

 	nexthop_old = rhashtable_lookup_fast(&data->nexthop_ht, &info->id,
 					     nsim_nexthop_ht_params);
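
The move from returning `NULL` to returning `ERR_PTR()` lets `nsim_nexthop_create()` report more than one failure reason through its pointer return value. The userspace sketch below shows the idea. The three helpers are simplified stand-ins modeled on the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`, and `demo_create()` with its types is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins: encode a negative errno in the pointer value. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

enum demo_type { DEMO_SINGLE, DEMO_GRP, DEMO_OTHER };

struct demo_nexthop {
	unsigned int occ; /* number of entries the object occupies */
};

/* Hypothetical constructor with two distinguishable failure modes. */
static struct demo_nexthop *demo_create(enum demo_type type,
					unsigned int weight_sum)
{
	struct demo_nexthop *nh = calloc(1, sizeof(*nh));

	if (!nh)
		return ERR_PTR(-ENOMEM);

	switch (type) {
	case DEMO_SINGLE:
		nh->occ = 1;
		break;
	case DEMO_GRP:
		nh->occ = weight_sum;
		break;
	default:
		free(nh); /* do not leak on the unsupported path */
		return ERR_PTR(-EOPNOTSUPP);
	}
	return nh;
}
```

A caller checks the result with `IS_ERR()` and forwards `PTR_ERR()` unchanged, so an unsupported type surfaces as `-EOPNOTSUPP` instead of being reported as an allocation failure.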

include/net/nexthop.h

Lines changed: 12 additions & 2 deletions
@@ -66,7 +66,12 @@ struct nh_info {
 struct nh_grp_entry {
 	struct nexthop *nh;
 	u8 weight;
-	atomic_t upper_bound;
+
+	union {
+		struct {
+			atomic_t upper_bound;
+		} mpath;
+	};

 	struct list_head nh_list;
 	struct nexthop *nh_parent; /* nexthop of group with this entry */
@@ -109,6 +114,11 @@ enum nexthop_event_type {
 	NEXTHOP_EVENT_REPLACE,
 };

+enum nh_notifier_info_type {
+	NH_NOTIFIER_INFO_TYPE_SINGLE,
+	NH_NOTIFIER_INFO_TYPE_GRP,
+};
+
 struct nh_notifier_single_info {
 	struct net_device *dev;
 	u8 gw_family;
@@ -137,7 +147,7 @@ struct nh_notifier_info {
 	struct net *net;
 	struct netlink_ext_ack *extack;
 	u32 id;
-	bool is_grp;
+	enum nh_notifier_info_type type;
 	union {
 		struct nh_notifier_single_info *nh;
 		struct nh_notifier_grp_info *nh_grp;
