-
Notifications
You must be signed in to change notification settings - Fork 16
Add Cython bindings for arrow::ArrayBuilder classes #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add a simple github actions setup for the test suite?
import numpy as np | ||
|
||
|
||
cdef class Int32Builder(_Weakrefable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_Weakrefable
is this private? Is there a public class we can use instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_Weakrefable
is c-defined in pyarrow.lib
. It just adds the ability for users to weakref this class. Strictly speaking, we don't need it at the moment in this implementation but payarrow
's classes are all Weakrefable so I included it here for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we're relying on a private api so let's not use _Weakrefable at all.:
cdef class _Weakrefable:
cdef object __weakref__
Adding the __weakref__
attribute allows these Python objects to be used with weakref.ref()
. Do we expect to weakref() these classes? If not, then we shouldn't bother. If it turns out we need to support this then we can define _Weakrefable ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm ok I see your point. Removed.
module.libraries.extend(pa.get_libraries()) | ||
module.library_dirs.extend(pa.get_library_dirs()) | ||
if os.name == 'posix': | ||
module.extra_compile_args.append('-std=c++11') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment linking to https://arrow.apache.org/docs/python/extending.html#example so we know where this pattern comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
|
||
setup( | ||
name='pymongoarrow', | ||
version=get_pymongoarrow_version(), | ||
packages=find_packages(), | ||
ext_modules=get_extension_modules(), | ||
setup_requires=['cython >= 0.29']) | ||
setup_requires=['cython >= 0.29', 'pyarrow >= 3', 'numpy >= 1.16.6']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me see if I have this right. So we need pyarrow+numpy to build+install the wheel but the wheel can be installed and used even without pyarrow+numpy? Or are we missing a necessary install_requires
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are missing the install_requires
- we will depend on pymongo and pyarrow (pyarrow has a dependency on numpy). Pandas would need to be installed manually if the user intends to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add install_requires=['pyarrow >= 3']
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bindings/python/test/test_builder.py
Outdated
# limitations under the License. | ||
from unittest import TestCase | ||
|
||
from pyarrow import Array, int32, int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing my comment above, since we use pyarrow in the test suite, should we add it to test_requires
or install_requires
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a TODO do add these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add arrow to install_requires then we don't need test_requires
at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, this is correct. Eventually, we will likely need pandas for our tests but the package wouldn't really depend on it... We can add a test_requires for pandas when that happens.
return self.builder.get().length() | ||
|
||
cdef shared_ptr[CInt64Builder] unwrap(self): | ||
return self.builder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm these classes are so similar. There must be a way to share the implementation, right? I found this blog about it https://martinralbrecht.wordpress.com/2017/07/23/adventures-in-cython-templating/
But for now I don't think we should use any of these "templating" options and instead keep it simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I tried a bunch of different things for sharing code here since so much of it is common and since Cython does support class inheritance. Unfortunately, the scoping rules enforced on cdef
-ined variables by Cython are causing problems when I try to share code between these classes. For now I have given up on this front. We can make this more sophisticated after 0.1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update - I did a bit of updating... Some more optimizations will be possible in a subsequent PR.
elif isinstance(value, int): | ||
self.builder.get().Append(value) | ||
else: | ||
raise TypeError('Int32Builder only accepts integer objects') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling isinstance(value, int)
, can we rely on CInt32Builder to raise an error when the value is not supported? If so it would have better performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't going to be using this method to actually construct the arrays from BSON streams. That code will be pure-C/C++. These wrapper classes will help us create the necessary builders etc from Python code instead of needing to do this via C++ code which is rather complex due to scoping rules enforced by Cython.
See the unwrap
method in all these classes - this will eventually be used by some Cython code to access the underlying builders directly to construct arrays. The need for an unwrap method is also why the underlying builder classes are cdef
ined as shared_ptr
instances instead of unique_ptr
instances (so that they can be unwrapped and a new reference to them can be passed to pure-C code somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a GH actions script in a subsequent commit or PR.
import numpy as np | ||
|
||
|
||
cdef class Int32Builder(_Weakrefable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm ok I see your point. Removed.
elif isinstance(value, int): | ||
self.builder.get().Append(value) | ||
else: | ||
raise TypeError('Int32Builder only accepts integer objects') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't going to be using this method to actually construct the arrays from BSON streams. That code will be pure-C/C++. These wrapper classes will help us create the necessary builders etc from Python code instead of needing to do this via C++ code which is rather complex due to scoping rules enforced by Cython.
See the unwrap
method in all these classes - this will eventually be used by some Cython code to access the underlying builders directly to construct arrays. The need for an unwrap method is also why the underlying builder classes are cdef
ined as shared_ptr
instances instead of unique_ptr
instances (so that they can be unwrapped and a new reference to them can be passed to pure-C code somewhere).
return self.builder.get().length() | ||
|
||
cdef shared_ptr[CInt64Builder] unwrap(self): | ||
return self.builder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update - I did a bit of updating... Some more optimizations will be possible in a subsequent PR.
module.libraries.extend(pa.get_libraries()) | ||
module.library_dirs.extend(pa.get_library_dirs()) | ||
if os.name == 'posix': | ||
module.extra_compile_args.append('-std=c++11') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
|
||
setup( | ||
name='pymongoarrow', | ||
version=get_pymongoarrow_version(), | ||
packages=find_packages(), | ||
ext_modules=get_extension_modules(), | ||
setup_requires=['cython >= 0.29']) | ||
setup_requires=['cython >= 0.29', 'pyarrow >= 3', 'numpy >= 1.16.6']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bindings/python/test/test_builder.py
Outdated
# limitations under the License. | ||
from unittest import TestCase | ||
|
||
from pyarrow import Array, int32, int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, this is correct. Eventually, we will likely need pandas for our tests but the package wouldn't really depend on it... We can add a test_requires for pandas when that happens.
|
||
|
||
class TestDate64Builder(TestCase): | ||
def test_simple(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test the 's' unit here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Adds wrappers for
ArrayBuilder
APIs.In order to run this:
pyarrow
and runpython -c "import pyarrow as pa; pa.create_library_symlinks()
python -m unittest test/test_builder.py
TODO
Add builders for other supported types- will be added in subsequent PRsinstall_requires
andtest_requires
sections tosetup.py