-
Notifications
You must be signed in to change notification settings - Fork 22
Faster isin comparisons #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
fd9c3d7
f1afbeb
ae54396
f33dcea
14a8800
c524e74
e6a6666
2fc7525
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -14,7 +14,7 @@ | |||
delegated_method) | ||||
from ._utils import combine, pack, unpack | ||||
from .common import _U8_MAX, _IPv4_MAX | ||||
from .parser import _to_ipaddress_pyint | ||||
from .parser import _to_ipaddress_pyint, _as_ip_object | ||||
|
||||
# ----------------------------------------------------------------------------- | ||||
# Extension Type | ||||
|
@@ -286,34 +286,94 @@ def packed(self): | |||
return self.data.tobytes() | ||||
|
||||
def isin(self, other): | ||||
""" | ||||
"""Check whether elements of 'self' are in 'other'. | ||||
|
||||
Comparison is done elementwise. | ||||
|
||||
Parameters | ||||
---------- | ||||
other : str or sequences | ||||
For ``str`` 'other', the argument is attempted to | ||||
be converted to an :class:`ipaddress.IPv4Network` or | ||||
a :class:`ipaddress.IPv6Network` or an :class:`IPArray`. | ||||
If all those conversions fail, a TypeError is raised. | ||||
|
||||
For a sequence of strings, the same conversion is attempted. | ||||
You should not mix networks with addresses. | ||||
|
||||
Finally, other may be an ``IPArray`` of addresses to compare to. | ||||
|
||||
Returns | ||||
------- | ||||
contained : ndarray | ||||
A 1-D boolean ndarray with the same length as self. | ||||
|
||||
Examples | ||||
-------- | ||||
Comparison to a single network | ||||
|
||||
>>> s = IPArray(['192.168.1.1', '255.255.255.255']) | ||||
>>> s.isin('192.168.1.0/24') | ||||
array([ True, False]) | ||||
|
||||
Comparison to many networks | ||||
>>> s.isin(['192.168.1.0/24', '192.168.2.0/24']) | ||||
array([ True, False]) | ||||
|
||||
Comparison to many IP Addresses | ||||
|
||||
>>> s.isin(['192.168.1.1', '192.168.1.2', '255.255.255.1']]) | ||||
array([ True, False]) | ||||
""" | ||||
if isinstance(other, str) or not isinstance(other, | ||||
collections.Sequence): | ||||
box = (isinstance(other, str) or | ||||
not isinstance(other, (IPArray, collections.Sequence))) | ||||
if box: | ||||
other = [other] | ||||
|
||||
networks = [] | ||||
for net in other: | ||||
try: | ||||
networks.append(ipaddress.IPv4Network(net)) | ||||
except ValueError: | ||||
networks.append(ipaddress.IPv6Network(net)) | ||||
addresses = [] | ||||
|
||||
if not isinstance(other, IPArray): | ||||
for net in other: | ||||
net = _as_ip_object(net) | ||||
if isinstance(net, (ipaddress.IPv4Network, | ||||
ipaddress.IPv6Network)): | ||||
networks.append(net) | ||||
if isinstance(net, (ipaddress.IPv4Address, | ||||
ipaddress.IPv6Address)): | ||||
addresses.append(ipaddress.IPv6Network(net)) | ||||
else: | ||||
addresses = other | ||||
|
||||
# Flatten all the addresses | ||||
addresses = IPArray(addresses) # TODO: think about copy=False | ||||
|
||||
# TODO: perf | ||||
pyips = self.to_pyipaddress() | ||||
mask = np.zeros(len(self), dtype='bool') | ||||
for network in networks: | ||||
for i, ip in enumerate(pyips): | ||||
if ip in network: | ||||
mask[i] = True | ||||
mask |= self._isin_network(network) | ||||
|
||||
# no... we should flatten this. | ||||
mask |= self._isin_addresses(addresses) | ||||
return mask | ||||
|
||||
def _isin_network(self, other): | ||||
# type: (Union[ipaddress.IPv4Network,ipaddress.IPv6Network]) -> ndarray | ||||
"""Check whether an array of addresses is contained in a network.""" | ||||
# A network is bounded below by 'network_address' and | ||||
# above by 'broadcast_address'. | ||||
# IPArray handles comparisons between arrays of addresses, and NumPy | ||||
# handles broadcasting. | ||||
net_lo = type(self)([other.network_address]) | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will this fail on attempt to compare IP4 and IP6? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Once we're in an IPArray, we don't care about IPv4 vs. IPv6. Everything is 128 bits. |
||||
net_hi = type(self)([other.broadcast_address]) | ||||
|
||||
return (net_lo <= self) & (self <= net_hi) | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Implemented by us in cyberpandas/cyberpandas/ip_array.py Line 185 in 52e28f7
Amazing what a night of sleep can do to change your perspective on a problem :) |
||||
|
||||
def _isin_addresses(self, other): | ||||
"""Check whether elements of self are present in other.""" | ||||
from pandas.core.algorithms import isin | ||||
# TODO(factorize): replace this | ||||
return isin(self, other) | ||||
|
||||
def setitem(self, indexer, value): | ||||
"""Set the 'value' inplace. | ||||
""" | ||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no way to short-circuit here? Or is that something for a future cythonized method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why loop to create
networks
and then loop again?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, I don't think there's anything to short-circut. We may not know whether an element of
self
is in any of theother
networks until the last network.Some context, I expected
self
to be large andnetworks
to be small. So I'm OK looping overnetworks
.