Skip to content

Add Cython bindings for arrow::ArrayBuilder classes #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 9, 2021

Conversation

prashantmital
Copy link
Contributor

@prashantmital prashantmital commented Feb 5, 2021

Adds wrappers for ArrayBuilder APIs.

In order to run this:

  • first refer to the description of Cython bindings for libbson #1 .
  • then, install pyarrow and run python -c "import pyarrow as pa; pa.create_library_symlinks()
  • to run the tests do python -m unittest test/test_builder.py

TODO

  • Add builders for other supported types - will be added in subsequent PRs
  • Add install_requires and test_requires sections to setup.py

Copy link
Collaborator

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a simple github actions setup for the test suite?

import numpy as np


cdef class Int32Builder(_Weakrefable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_Weakrefable is this private? Is there a public class we can use instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_Weakrefable is c-defined in pyarrow.lib. It just adds the ability for users to weakref this class. Strictly speaking, we don't need it at the moment in this implementation but payarrow's classes are all Weakrefable so I included it here for consistency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we're relying on a private api so let's not use _Weakrefable at all.:

cdef class _Weakrefable:
    cdef object __weakref__

Adding the __weakref__ attribute allows these Python objects to be used with weakref.ref(). Do we expect to weakref() these classes? If not, then we shouldn't bother. If it turns out we need to support this then we can define _Weakrefable ourselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ok I see your point. Removed.

module.libraries.extend(pa.get_libraries())
module.library_dirs.extend(pa.get_library_dirs())
if os.name == 'posix':
module.extra_compile_args.append('-std=c++11')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment linking to https://arrow.apache.org/docs/python/extending.html#example so we know where this pattern comes from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



setup(
name='pymongoarrow',
version=get_pymongoarrow_version(),
packages=find_packages(),
ext_modules=get_extension_modules(),
setup_requires=['cython >= 0.29'])
setup_requires=['cython >= 0.29', 'pyarrow >= 3', 'numpy >= 1.16.6'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I have this right. So we need pyarrow+numpy to build+install the wheel but the wheel can be installed and used even without pyarrow+numpy? Or are we missing a necessary install_requires?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing the install_requires - we will depend on pymongo and pyarrow (pyarrow has a dependency on numpy). Pandas would need to be installed manually if the user intends to use that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add install_requires=['pyarrow >= 3']?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# limitations under the License.
from unittest import TestCase

from pyarrow import Array, int32, int64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing my comment above, since we use pyarrow in the test suite, should we add it to test_requires or install_requires?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO do add these.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add arrow to install_requires then we don't need test_requires at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, this is correct. Eventually, we will likely need pandas for our tests but the package wouldn't really depend on it... We can add a test_requires for pandas when that happens.

return self.builder.get().length()

cdef shared_ptr[CInt64Builder] unwrap(self):
return self.builder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm these classes are so similar. There must be a way to share the implementation, right? I found this blog about it https://martinralbrecht.wordpress.com/2017/07/23/adventures-in-cython-templating/

But for now I don't think we should use any of these "templating" options and instead keep it simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I tried a bunch of different things for sharing code here since so much of it is common and since Cython does support class inheritance. Unfortunately, the scoping rules enforced on cdef-ined variables by Cython are causing problems when I try to share code between these classes. For now I have given up on this front. We can make this more sophisticated after 0.1.0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - I did a bit of updating... Some more optimizations will be possible in a subsequent PR.

elif isinstance(value, int):
self.builder.get().Append(value)
else:
raise TypeError('Int32Builder only accepts integer objects')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of calling isinstance(value, int), can we rely on CInt32Builder to raise an error when the value is not supported? If so it would have better performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't going to be using this method to actually construct the arrays from BSON streams. That code will be pure-C/C++. These wrapper classes will help us create the necessary builders etc from Python code instead of needing to do this via C++ code which is rather complex due to scoping rules enforced by Cython.

See the unwrap method in all these classes - this will eventually be used by some Cython code to access the underlying builders directly to construct arrays. The need for an unwrap method is also why the underlying builder classes are cdefined as shared_ptr instances instead of unique_ptr instances (so that they can be unwrapped and a new reference to them can be passed to pure-C code somewhere).

Copy link
Contributor Author

@prashantmital prashantmital left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a GH actions script in a subsequent commit or PR.

import numpy as np


cdef class Int32Builder(_Weakrefable):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ok I see your point. Removed.

elif isinstance(value, int):
self.builder.get().Append(value)
else:
raise TypeError('Int32Builder only accepts integer objects')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't going to be using this method to actually construct the arrays from BSON streams. That code will be pure-C/C++. These wrapper classes will help us create the necessary builders etc from Python code instead of needing to do this via C++ code which is rather complex due to scoping rules enforced by Cython.

See the unwrap method in all these classes - this will eventually be used by some Cython code to access the underlying builders directly to construct arrays. The need for an unwrap method is also why the underlying builder classes are cdefined as shared_ptr instances instead of unique_ptr instances (so that they can be unwrapped and a new reference to them can be passed to pure-C code somewhere).

return self.builder.get().length()

cdef shared_ptr[CInt64Builder] unwrap(self):
return self.builder
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - I did a bit of updating... Some more optimizations will be possible in a subsequent PR.

module.libraries.extend(pa.get_libraries())
module.library_dirs.extend(pa.get_library_dirs())
if os.name == 'posix':
module.extra_compile_args.append('-std=c++11')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



setup(
name='pymongoarrow',
version=get_pymongoarrow_version(),
packages=find_packages(),
ext_modules=get_extension_modules(),
setup_requires=['cython >= 0.29'])
setup_requires=['cython >= 0.29', 'pyarrow >= 3', 'numpy >= 1.16.6'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# limitations under the License.
from unittest import TestCase

from pyarrow import Array, int32, int64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, this is correct. Eventually, we will likely need pandas for our tests but the package wouldn't really depend on it... We can add a test_requires for pandas when that happens.



class TestDate64Builder(TestCase):
def test_simple(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test the 's' unit here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@prashantmital prashantmital merged commit dedb1e0 into main Feb 9, 2021
@prashantmital prashantmital deleted the feat/arrow-builders branch February 9, 2021 00:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants