Skip to content

Ensure that data_p_ is typed with the underlying type of data_ #267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 17, 2023

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Mar 8, 2022

This PR attempts to force data_p_ to always be a pointer to the "underlying" type of SEXP data_.

Currently, this is not the case. data_p_ currently has type T*, which means that for various vectors it maps to:

  • integers: int*
  • doubles: double*
  • logicals: r_bool* (rather than int*)
  • raws: uint8_t* (rather than Rbyte*)
  • list: not used, but would be SEXP*
  • strings: not used, but would be r_string* (rather than SEXP*)
  • complexes: not implemented (yet!, but it is built on top of this here Implement complex support DavisVaughan/cpp11#2)

The logicals and raws cases make me a little nervous. When data_p_ is set for these classes, it is done so like:

data_p_ = reinterpret_cast<r_bool*>(LOGICAL(data))

I think we can avoid this by instead ensuring that data_p_ has type int* here, and is just set to LOGICAL(data). Then we rely on casts from r_bool to int and vice versa when setting and getting elements.

I think it definitely cleaned up logicals.hpp and raws.hpp nicely, as some intermediate static_cast() and reinterpret_cast() calls were removed.

I'm also not entirely sure how this reinterpret_cast() is currently working, but I assume it works because r_bool just has 1 variable member, int value_, which is the same size as an int that we'd get from the int*? Nevertheless, this new approach feels safer and easier to understand to me.


The general idea of this PR is to introduce a compile time "underlying" type that we can deduce from T. The default is just T, which works for int and double, but this is overridden for r_bool and uint8_t.

We then use this underlying type for the buffer (buf_) and for the pointer type.


This PR should have other benefits too. I am working on adding complexes, and I think I will use an r_complex intermediate type which will also define a custom underlying type, like:

namespace traits {
template <>
struct get_underlying_type<r_complex> { using type = Rcomplex; };
}  // namespace traits

Additionally, for read-only lists and character vectors, we could set the data_p_ to DATAPTR(), like we do in vctrs https://github.com/r-lib/vctrs/blob/7260d31a31b87ece16c7bbda3521e91d1c0f242f/src/vctrs-core.h#L97. This could provide much faster access, since we don't have to go through VECTOR_ELT(), and should be safe for read-only access. That would require the get_underlying_type<r_string> I've added here (it currently goes unused).

@DavisVaughan DavisVaughan force-pushed the feature/data-pointer branch 2 times, most recently from b3c65cc to 84ff462 Compare March 8, 2022 17:01
struct get_underlying_type {
using type = T;
};
} // namespace traits
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traits namespace may be overkill, I'm not sure.

I have seen this idea used elsewhere though. One of those places is Rcpp:
https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/traits/storage_type.h

Comment on lines -47 to +48
LOGICAL_GET_REGION(data_->data_, pos, length_, reinterpret_cast<int*>(buf_.data()));
LOGICAL_GET_REGION(data_->data_, pos, length_, buf_.data());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf_ now has type int* here already, so no need to cast it

Comment on lines -40 to +41
return reinterpret_cast<r_bool*>(LOGICAL(data));
return LOGICAL(data);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the motivating example. get_p() now has a return type of int* for logicals, which is exactly what LOGICAL() returns, so no need for the cast

Comment on lines -575 to +581
return buf_[pos_ - block_start_];
return static_cast<T>(buf_[pos_ - block_start_]);
} else {
return data_->data_p_[pos_];
return static_cast<T>(data_->data_p_[pos_]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need these, the implicit casts to T will kick in as needed (for T = r_bool and T = uint8_t) but it feels like good practice to be explicit?

return is_altrep_ ? static_cast<r_bool>(LOGICAL_ELT(data_, pos)) : data_p_[pos];
return is_altrep_ ? LOGICAL_ELT(data_, pos) : data_p_[pos];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOGICAL_ELT() and data_p_[] now return the same type (an int). So in theory the right thing to do would be to wrap the whole ternary condition in static_cast<r_bool>, but I'm relying on the implicit cast to r_bool instead - this seems to be what we do elsewhere.

I am not sure what best practice is here, rely on implicit casts to the return value, or be explicit with static_cast<r_bool>? I think I'd rather see the explicit cast, so there is no guessing involved. I'll probably do a follow up PR to try this out everywhere.

Copy link
Member Author

@DavisVaughan DavisVaughan Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently it is a Core Guideline to avoid implicit conversions where possible http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators

I resonate strongly with this! However, I will leave this as is for now since we use implicit casts in many cases already.

It would be even more apparent that we should be using an explicit static_cast<r_bool> if we had made the "r_bool from int" constructor explicit, i.e. explicit r_bool(int value) in r_bool.hpp, which would have forced us to use a static cast here. I feel like this would be better practice, maybe we should look into this. (Note that I did this for explicit r_complex(Rcomplex value) in the complex PR and really like the safety of it)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that explicit is better.

Previously, a `reinterpret_cast()` was used before assigning to `data_` so that:
- For raws, `RAW()` was reinterpreted from `Rbyte* -> uint8_t*`
- For logicals, `LOGICAL()` was reinterpreted from `int* -> r_bool*`
@DavisVaughan DavisVaughan force-pushed the feature/data-pointer branch from 84ff462 to d6b750b Compare March 9, 2022 15:15
Copy link
Collaborator

@romainfrancois romainfrancois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can look into the explicit stuff in a follow up PR.

@romainfrancois romainfrancois merged commit 971164d into r-lib:main May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants