Skip to content

Commit 6a757c0

Browse files
Florian Westphalummakynes
authored andcommitted
netfilter: conntrack: allow insertion of clashing entries
This patch further relaxes the need to drop an skb due to a clash with an existing conntrack entry. Current clash resolution handles the case where the clash occurs between two identical entries (distinct nf_conn objects with same tuples), i.e.: Original Reply existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353 clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353 ... existing handling will discard the unconfirmed clashing entry and makes skb->_nfct point to the existing one. The skb can then be processed normally just as if the clash would not have existed in the first place. For other clashes, the skb needs to be dropped. This frequently happens with DNS resolvers that send A and AAAA queries back-to-back when NAT rules are present that cause packets to get different DNAT transformations applied, for example: -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353 -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353 In this case the A or AAAA query is dropped which incurs a costly delay during name resolution. This patch also allows this collision type: Original Reply existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353 clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.7:5353 In this case, clash is in original direction -- the reply direction is still unique. The change makes it so that when the 2nd colliding packet is received, the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed 1 second timeout and is inserted in the reply direction only. The entry is hidden from 'conntrack -L', it will time out quickly and it can be early dropped because it will never progress to the ASSURED state. To avoid special-casing the delete code path to special case the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is added so hlist_nulls_del() will work. Example: CPU A: CPU B: 1. 10.2.3.4:42 -> 10.8.8.8:53 (A) 2. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA) 3. Apply DNAT, reply changed to 10.0.0.6 4. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA) 5. Apply DNAT, reply changed to 10.0.0.7 6. confirm/commit to conntrack table, no collisions 7. commit clashing entry Reply comes in: 10.2.3.4:42 <- 10.0.0.6:5353 (A) -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42 10.2.3.4:42 <- 10.0.0.7:5353 (AAAA) -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42 The conntrack entry is deleted from table, as it has the NAT_CLASH bit set. In case of a retransmit from ORIGINAL dir, all further packets will get the DNAT transformation to 10.0.0.6. I tried to come up with other solutions but they all have worse problems. Alternatives considered were: 1. Confirm ct entries at allocation time, not in postrouting. a. will cause uneccesarry work when the skb that creates the conntrack is dropped by ruleset. b. in case nat is applied, ct entry would need to be moved in the table, which requires another spinlock pair to be taken. c. breaks the 'unconfirmed entry is private to cpu' assumption: we would need to guard all nfct->ext allocation requests with ct->lock spinlock. 2. Make the unconfirmed list a hash table instead of a pcpu list. Shares drawback c) of the first alternative. 3. Document this is expected and force users to rearrange their ruleset (e.g. by using "-m cluster" instead of "-m statistics"). nft has the 'jhash' expression which can be used instead of 'numgen'. Major drawback: doesn't fix what I consider a bug, not very realistic and I believe its reasonable to have the existing rulesets to 'just work'. 4. Document this is expected and force users to steer problematic packets to the same CPU -- this would serialize the "allocate new conntrack entry/nat table evaluation/perform nat/confirm entry", so no race can occur. Similar drawback to 3. Another advantage of this patch compared to 1) and 2) is that there are no changes to the hot path; things are handled in the udp tracker and the clash resolution path. Cc: [email protected] Cc: "Paul E. McKenney" <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Jozsef Kadlecsik <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
1 parent bb89abe commit 6a757c0

File tree

4 files changed

+108
-7
lines changed

4 files changed

+108
-7
lines changed

include/linux/rculist_nulls.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -145,6 +145,13 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
145145
}
146146
}
147147

148+
/* after that hlist_nulls_del will work */
149+
static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
150+
{
151+
n->pprev = &n->next;
152+
n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
153+
}
154+
148155
/**
149156
* hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
150157
* @tpos: the type * to use as a loop cursor.

include/uapi/linux/netfilter/nf_conntrack_common.h

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,15 @@ enum ip_conntrack_status {
9797
IPS_UNTRACKED_BIT = 12,
9898
IPS_UNTRACKED = (1 << IPS_UNTRACKED_BIT),
9999

100+
#ifdef __KERNEL__
101+
/* Re-purposed for in-kernel use:
102+
* Tags a conntrack entry that clashed with an existing entry
103+
* on insert.
104+
*/
105+
IPS_NAT_CLASH_BIT = IPS_UNTRACKED_BIT,
106+
IPS_NAT_CLASH = IPS_UNTRACKED,
107+
#endif
108+
100109
/* Conntrack got a helper explicitly attached via CT target. */
101110
IPS_HELPER_BIT = 13,
102111
IPS_HELPER = (1 << IPS_HELPER_BIT),
@@ -110,7 +119,8 @@ enum ip_conntrack_status {
110119
*/
111120
IPS_UNCHANGEABLE_MASK = (IPS_NAT_DONE_MASK | IPS_NAT_MASK |
112121
IPS_EXPECTED | IPS_CONFIRMED | IPS_DYING |
113-
IPS_SEQ_ADJUST | IPS_TEMPLATE | IPS_OFFLOAD),
122+
IPS_SEQ_ADJUST | IPS_TEMPLATE | IPS_UNTRACKED |
123+
IPS_OFFLOAD),
114124

115125
__IPS_MAX_BIT = 15,
116126
};

net/netfilter/nf_conntrack_core.c

Lines changed: 74 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -940,11 +940,71 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb,
940940
return NF_DROP;
941941
}
942942

943+
/**
944+
* nf_ct_resolve_clash_harder - attempt to insert clashing conntrack entry
945+
*
946+
* @skb: skb that causes the collision
947+
* @repl_idx: hash slot for reply direction
948+
*
949+
* Called when origin or reply direction had a clash.
950+
* The skb can be handled without packet drop provided the reply direction
951+
* is unique or there the existing entry has the identical tuple in both
952+
* directions.
953+
*
954+
* Caller must hold conntrack table locks to prevent concurrent updates.
955+
*
956+
* Returns NF_DROP if the clash could not be handled.
957+
*/
958+
static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx)
959+
{
960+
struct nf_conn *loser_ct = (struct nf_conn *)skb_nfct(skb);
961+
const struct nf_conntrack_zone *zone;
962+
struct nf_conntrack_tuple_hash *h;
963+
struct hlist_nulls_node *n;
964+
struct net *net;
965+
966+
zone = nf_ct_zone(loser_ct);
967+
net = nf_ct_net(loser_ct);
968+
969+
/* Reply direction must never result in a clash, unless both origin
970+
* and reply tuples are identical.
971+
*/
972+
hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[repl_idx], hnnode) {
973+
if (nf_ct_key_equal(h,
974+
&loser_ct->tuplehash[IP_CT_DIR_REPLY].tuple,
975+
zone, net))
976+
return __nf_ct_resolve_clash(skb, h);
977+
}
978+
979+
/* We want the clashing entry to go away real soon: 1 second timeout. */
980+
loser_ct->timeout = nfct_time_stamp + HZ;
981+
982+
/* IPS_NAT_CLASH removes the entry automatically on the first
983+
* reply. Also prevents UDP tracker from moving the entry to
984+
* ASSURED state, i.e. the entry can always be evicted under
985+
* pressure.
986+
*/
987+
loser_ct->status |= IPS_FIXED_TIMEOUT | IPS_NAT_CLASH;
988+
989+
__nf_conntrack_insert_prepare(loser_ct);
990+
991+
/* fake add for ORIGINAL dir: we want lookups to only find the entry
992+
* already in the table. This also hides the clashing entry from
993+
* ctnetlink iteration, i.e. conntrack -L won't show them.
994+
*/
995+
hlist_nulls_add_fake(&loser_ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
996+
997+
hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
998+
&nf_conntrack_hash[repl_idx]);
999+
return NF_ACCEPT;
1000+
}
1001+
9431002
/**
9441003
* nf_ct_resolve_clash - attempt to handle clash without packet drop
9451004
*
9461005
* @skb: skb that causes the clash
9471006
* @h: tuplehash of the clashing entry already in table
1007+
* @hash_reply: hash slot for reply direction
9481008
*
9491009
* A conntrack entry can be inserted to the connection tracking table
9501010
* if there is no existing entry with an identical tuple.
@@ -963,10 +1023,18 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb,
9631023
* exactly the same, only the to-be-confirmed conntrack entry is discarded
9641024
* and @skb is associated with the conntrack entry already in the table.
9651025
*
1026+
* Failing that, the new, unconfirmed conntrack is still added to the table
1027+
* provided that the collision only occurs in the ORIGINAL direction.
1028+
* The new entry will be added after the existing one in the hash list,
1029+
* so packets in the ORIGINAL direction will continue to match the existing
1030+
* entry. The new entry will also have a fixed timeout so it expires --
1031+
* due to the collision, it will not see bidirectional traffic.
1032+
*
9661033
* Returns NF_DROP if the clash could not be resolved.
9671034
*/
9681035
static __cold noinline int
969-
nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
1036+
nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h,
1037+
u32 reply_hash)
9701038
{
9711039
/* This is the conntrack entry already in hashes that won race. */
9721040
struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
@@ -987,6 +1055,10 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
9871055
if (ret == NF_ACCEPT)
9881056
return ret;
9891057

1058+
ret = nf_ct_resolve_clash_harder(skb, reply_hash);
1059+
if (ret == NF_ACCEPT)
1060+
return ret;
1061+
9901062
drop:
9911063
nf_ct_add_to_dying_list(loser_ct);
9921064
NF_CT_STAT_INC(net, drop);
@@ -1101,7 +1173,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
11011173
return NF_ACCEPT;
11021174

11031175
out:
1104-
ret = nf_ct_resolve_clash(skb, h);
1176+
ret = nf_ct_resolve_clash(skb, h, reply_hash);
11051177
dying:
11061178
nf_conntrack_double_unlock(hash, reply_hash);
11071179
local_bh_enable();

net/netfilter/nf_conntrack_proto_udp.c

Lines changed: 16 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,18 @@ static bool udp_error(struct sk_buff *skb,
8181
return false;
8282
}
8383

84+
static void nf_conntrack_udp_refresh_unreplied(struct nf_conn *ct,
85+
struct sk_buff *skb,
86+
enum ip_conntrack_info ctinfo,
87+
u32 extra_jiffies)
88+
{
89+
if (unlikely(ctinfo == IP_CT_ESTABLISHED_REPLY &&
90+
ct->status & IPS_NAT_CLASH))
91+
nf_ct_kill(ct);
92+
else
93+
nf_ct_refresh_acct(ct, ctinfo, skb, extra_jiffies);
94+
}
95+
8496
/* Returns verdict for packet, and may modify conntracktype */
8597
int nf_conntrack_udp_packet(struct nf_conn *ct,
8698
struct sk_buff *skb,
@@ -116,8 +128,8 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
116128
if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
117129
nf_conntrack_event_cache(IPCT_ASSURED, ct);
118130
} else {
119-
nf_ct_refresh_acct(ct, ctinfo, skb,
120-
timeouts[UDP_CT_UNREPLIED]);
131+
nf_conntrack_udp_refresh_unreplied(ct, skb, ctinfo,
132+
timeouts[UDP_CT_UNREPLIED]);
121133
}
122134
return NF_ACCEPT;
123135
}
@@ -198,8 +210,8 @@ int nf_conntrack_udplite_packet(struct nf_conn *ct,
198210
if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
199211
nf_conntrack_event_cache(IPCT_ASSURED, ct);
200212
} else {
201-
nf_ct_refresh_acct(ct, ctinfo, skb,
202-
timeouts[UDP_CT_UNREPLIED]);
213+
nf_conntrack_udp_refresh_unreplied(ct, skb, ctinfo,
214+
timeouts[UDP_CT_UNREPLIED]);
203215
}
204216
return NF_ACCEPT;
205217
}

0 commit comments

Comments
 (0)