[clang] Better bitfield access units #65742
Note that changing the memory accesses performed by clang (load or store) is an ABI change at IR level because of UB. Also, please have a look at the existing
The primary issue with over-wide bitfield accesses is ensuring we don't cause data races. C++ [intro.memory]p3: "A memory location is either an object of scalar type that is not a bit-field or a maximal sequence of adjacent bit-fields all having nonzero width. Two or more threads of execution can access separate memory locations without interfering with each other." So that imposes an upper limit: we can't store any byte that might overlap another field. There isn't really a lower limit: the ABI rule is that we access as many bytes as possible, but we aren't actually obligated to access those bytes; accesses that don't actually read or write a field aren't observable.

The advantage of exposing the wide accesses to the optimizer is that it allows memory optimizations, like CSE or DSE, to reason about the entire bitfield as a single unit, and easily eliminate accesses. Reducing the size of bitfield accesses in the frontend is basically throwing away information. The disadvantage is, as you note here, that sometimes codegen doesn't manage to clean up the resulting code well.

I guess my primary question here is, instead of making clang try to guess which accesses will be optimal, can we improve the way the LLVM code generator handles the patterns currently generated by clang? I'm not exactly against changing the IR generated by clang, but changing the IR like this seems likely to improve some cases at the expense of others.
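To illustrate the memory-location constraint being described, here is a minimal sketch (the struct and field names are illustrative, not from the discussion):

```c++
// `a` and `b` are adjacent non-zero-width bit-fields, so they form one
// memory location; `c` is a separate memory location.  A store that
// updates `b` must not write the byte holding `c`, otherwise it could
// race with another thread that only touches `c`.
struct S {
  int a : 4;
  int b : 4;
  char c;
};
```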
Could you clarify why UB is a consideration here?
There's some weirdness with poison values; see https://reviews.llvm.org/D129541 etc. This is mostly an IR modeling thing; the machine instructions work fine, we just need to make sure we don't accidentally poison IR values that aren't supposed to be poison.
That flag's behaviour is unchanged here -- it continues to not merge access units. My contention here is that (a) merging units can be advantageous, but (b) the existing algorithm has some deficiencies.
Certainly, but could you clarify why changing the access patterns could do that? My understanding of poison is that should it cause a user-visible state change, that's indicative of the original program's UB, and the languages in question don't guarantee anything at that point -- nasal demons and all. That the observed behaviour of a UB-exhibiting program changes is not a defect. What am I missing?
Hm, I've been thinking of it the opposite way round -- merging bitfields is throwing away information (about where cuts might be). And it's unclear to me how CSE or DSE could make use of a merged access unit to eliminate accesses -- it would seem to me that a merged access unit accessed at a single point would make it look like the whole unit was live? (Can you point me at an example of the analysis you describe happening?) That the simple x86 example I showed doesn't show (complete) undoing of the merging suggests it is hard for CSE and DSE to do the analysis you indicate. DSE did work there, to undo the merge, but there's no dead load elimination happening. But, that DSE is merely undoing the gluing that the front end did -- if we didn't glue, then it would always happen.
My contention is that the current algorithm both (a) fails to merge some mergeable access units and (b) inappropriately merges some access units, especially on strict-alignment machines.
In the testcases that were changed I did not encounter one that generated worse code (an ARM testcase showed better code due to, IIRC, not merging more than a register width; as with the x86 case, it didn't eliminate unnecessary loads before, whereas with this change those are gone). I would be very interested in seeing cases that degrade, though.
In essence, late optimizations can make accesses narrower, but they can't make them wider. The information about cuts is generally recoverable by analyzing the masking instructions. When I refer to CSE/DSE, I'm mostly talking about keeping values in registers for longer. They don't know anything about individual fields in bitfields. If we split bitfields too narrowly, we end up with extra memory accesses when accessing multiple adjacent fields. And if you have accesses which overlap (access some, but not all, of the same memory), CSE/DSE get much worse. Having a bitfield unit that can't be loaded in a single memory access probably doesn't have much benefit if there's a natural boundary we can use to split.
Depending on how we fix the interaction between poison and bitfields, there could be an invariant that accessing any bitfield has to freeze all adjacent bitfield values. If we do that, then the set of "adjacent" values actually matters. There's probably some way we can make this work without actually introducing memory accesses, though; probably not worth worrying about too much here.
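As a rough illustration of the register-reuse point (the field widths are assumed, not taken from any testcase in the patch): if `a` and `b` below share one access unit, the function can be a single load, two in-register updates, and a single store; split into two units, it becomes two load/store pairs.

```c++
struct S {
  unsigned a : 8;
  unsigned b : 8; // occupies the byte after `a`
};

// With a merged 16-bit access unit the optimizer can keep the whole unit
// in a register across both field updates.
void bump(S &s) {
  s.a += 1;
  s.b += 1;
}
```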
Thanks for your comments.
To be clear, by 'accesses which overlap' you mean accesses to bitfields that share parts of a byte? I agree -- that's an unchanged part of the algorithm: bitfields that share parts of a single byte must be in the same access unit. Or do you mean an access to a non-bitfield adjacent to a bitfield? How does that square with the memory model you pointed to earlier? Or is it that if you see both such accesses, it must be safe to merge them, or there's some UB already happening? But anyway, this algorithm doesn't affect that -- in both the before and after cases something outside of bitfieldAccumulation must be merging them.
I think we're agreeing here. The current algorithm generates access units that can require multiple accesses, even on a relaxed-alignment machine (I refer back to the x86 example). This replacement tries much harder not to do that -- it only merges access units if that demonstrably lowers the number of accesses needed, compared with accessing the original set of units separately. The motivating case we had was a sequence of four byte-sized loads to fetch an unaligned 32-bit access unit, one byte of that being fiddled with, and then that one byte being written back.
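A hedged sketch of the kind of layout described in that motivating case (the real structure isn't shown here; the packed attribute and field widths are my assumptions): on a strict-alignment target, the 32-bit access unit holding `a`, `b` and `c` starts at byte offset 1, so updating just `b` ends up fetching the whole misaligned unit with byte-sized loads before writing the changed byte back.

```c++
struct __attribute__((packed)) S {
  char tag;        // pushes the bit-fields to byte offset 1
  unsigned a : 8;
  unsigned b : 8;
  unsigned c : 16; // a, b and c together span 32 bits, misaligned as a unit
};

void set_b(S &s, unsigned v) { s.b = v; } // touches only one byte of the unit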
Ping? Anything more I can do to progress this?
time for another ping, I think ...
XCore: yes, the current setting of the new flag, the default false value, is right for XCore. Misaligned loads can generate a function call. We do not have benchmarks as such, but comparing the code generated for the test files touched here, there is an improvement. In particular bitfield-ir.cpp is a big improvement. In bitfield.cpp, two misaligned loads remain. One disappears if I remove the struct packed attribute. The other remains: N0::read31(). This is not a problem, and is not a degradation; it's a function call intended for the purpose; just mentioning for information.
I'm planning to take a closer look at this when I have more time. Sorry I haven't been more responsive here. One very brief note: in the comments in the code, you might want to distinguish between the semantic width of a bitfield (i.e. the C standard notion of a "memory location", which has ABI significance), vs. the accesses we choose to generate (we don't need to generate no-op reads/writes).
Indeed, the naming here is unfortunately overloaded. SysV psABIs use 'Storage Unit' (at least x86 does, and IIRC others follow the same nomenclature). But Clang also uses 'StorageInfo' and 'Storage$FOO' in its bitfield structures, which is unfortunate. I've used 'Access Unit' to describe the latter in this patch. If you meant something else, please clarify (or have I missed some places?).
With this pull request, I tried to build the linux kernel and the kernel bpf selftest and run the selftest, and didn't find any issues. So presumably the implementation is correct. I tried a particular kernel bpf selftest:
@rjmccall here's a refactoring along the lines you suggested. Once one stops worrying about rescanning when the next access unit fails to accumulate into the current one, things get simpler. The compiler should be able to turn the boolean conditional flow within the loop body into direct control flow. Would you like the API change I introduced in the previous patch (having
Thank you, the structure looks great! Mostly style and clarity comments from here.
    Barrier = false;
    if (Field->isZeroLengthBitField(Context) &&
        (Context.getTargetInfo().useZeroLengthBitfieldAlignment() ||
         Context.getTargetInfo().useBitFieldTypeAlignment()))
As I mentioned in a previous review, please consider zero-length bit-fields to be barriers unless we literally can't because they're not char-aligned. In this case, we know that the zero-length bit-field is char-aligned, so we can honor it as a barrier, and you don't need this target condition.
Please add a comment to the case above this, something like this:
This can be a zero-length bit-field if the target doesn't align them to a
char boundary. This is non-conformant, but we don't have a choice but
to cross the bit-field.
> As I mentioned in a previous review, please consider zero-length bit-fields to be barriers unless we literally can't because they're not char-aligned. In this case, we know that the zero-length bit-field is char-aligned, so we can honor it as a barrier, and you don't need this target condition.
Oh, I thought you were concerned about a darwin-specific ABI behaviour (which I then noticed was different in this area).
Your request wouldn't match the existing behaviour, AFAICT. For instance:
struct X {
  unsigned a;
  unsigned b : 8;
  unsigned : 0;
  unsigned c : 8;
} x;
Both x86_64-linux and aarch64-darwin with -fno-bitfield-type-align place those two bitfields in a single access unit:
BitFields:[
<CGBitFieldInfo Offset:0 Size:8 IsSigned:0 StorageSize:16 StorageOffset:4 VolatileOffset:0 VolatileStorageSize:0 VolatileStorageOffset:0>
<CGBitFieldInfo Offset:8 Size:0 IsSigned:0 StorageSize:16 StorageOffset:4 VolatileOffset:0 VolatileStorageSize:0 VolatileStorageOffset:0>
<CGBitFieldInfo Offset:8 Size:8 IsSigned:0 StorageSize:16 StorageOffset:4 VolatileOffset:0 VolatileStorageSize:0 VolatileStorageOffset:0>
]>
Do you have a case where those bitfields are currently placed in separate access units (when the zero length is not a barrier)?
From the C++ std, [class.bit]: "As a special case, an unnamed bit-field with a width of zero specifies alignment of the next bit-field at an allocation unit boundary."
But for ABIs that reject using zero-length bitfields as such barriers, we've already stepped outside the language (so why separate the access units if such a bitfield happens to permit it?) Should I be looking at some other document? (The x86_64 psABI seems very vague about this, just mentioning bitfields may share storage units with other members.)
Are you taking the opportunity to alter the observed behaviour?
The most important standard quote to me is this, from C++ [intro.memory]p3 (which matches the definition in C):
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width.
Ultimately, this means that concurrent stores to bit-fields on opposite sides of a zero-width bit-field are not allowed to interfere with each other. This is the primary reason zero-width bit-fields exist, given how implementation-defined everything about their actual layout is. We obviously cannot guarantee that if the bit-fields overlap within a byte, but I do think we need to be making that guarantee whenever we can. And yes, I would like to take this opportunity to alter the observed behavior.
It appears there are no existing tests for the current behaviour of such non-barrier zero-sized bitfields -- altering the patch in the way you describe results in no test failures beyond those in the newly added CodeGen/bitfield-access-pad.c. (Altering trunk in a similar way results in only similar failures in the baseline version of that new test.)

Also, the two tests for -ffine-grained-bitfield-access (CodeGenCXX/finegrain-bitfield-access.cpp & CodeGenCXX/finegrain-bitfield-type.cpp) do not have tests containing such zero-length fields falling on byte boundaries with -fno-bitfield-type-align. (I.e. patterns that would be combined now and with this diff, except for the -ffine-grained-bitfield-access disabling that.)
The behaviour here seems a little undertested.
For the record, I'd live with the behaviour you suggest, but i think that change belongs in a different patch (either before or after this patch).
I'm a little confused by your reticence here. If this were, broadly, an NFC patch, then I'd completely understand not wanting to fix a random bug that we found while working on it. But this is definitely not an NFC patch; you're changing our generated code patterns in very significant ways, including in ways that break apart access units in exactly the way that would happen if we considered these to be barriers. Given that, I don't understand why it'd only be okay to change code patterns in ways that aren't bug fixes, especially since the change strictly simplifies the logic in your patch.
#85174
@rjmccall here is a rebase and update, which I think addresses all your comments. I did make some material changes though:
The two bitfields are placed into a single 32-bit access unit, rather than separate 16-bit ones. Either behaviour is OK with the AAPCS. (The previous algorithm would have placed them into a single 24-bit unit if they abutted at a byte boundary; no change in that behaviour now.)
we'd generate an llvm struct of
It may be that the later pass that was adjusting bitfield accesses to natural sizes can be simplified or deleted. I've not investigated.
Yeah, we don't need to care about the actual bit offset of the zero-width bit-field as long as we honor the non-interference properties across it. I'll take a look at the patch, thanks.
Sorry to push another update, but I realized the LimitOffset computation could be sunk to the point of use, and therefore not computed in all cases.
This looks great, thanks! I have a few editorial nits, and there's an assertion that seems off, but otherwise this looks ready to go.
These new tests verify clang's choice of types to access bitfields. This is in preparation for the next patch, which changes that algorithm. With these tests already in place it'll be easier to see what has changed.
@rjmccall thanks, here is an update with those changes. I've collapsed it to the 3 commits mentioned earlier
Okay. Thanks for putting up with such a protracted review! I really like where we've ended up.
LGTM. Please wait a few days before merging to give other reviewers a chance to give final feedback.
Verify bitfield access units.
Promote ARM & AArch64's HasUnaligned to TargetInfo and set for all targets.
Reimplement bitfield access unit computation for SysV ABIs. Considers expense of unaligned and non-power-of-2 accesses.
Ping?
Committed, expect tailclipping cleanup PR soon
This failure is caused by the patch
Fixes stack overflow in clang/test/SemaTemplate/instantiation-depth-default.cpp after llvm/llvm-project#65742
Reimplement the SysV (non-Microsoft) bitfield access unit computation. For avoidance of doubt, this is not an ABI change; what changes is the memory accesses Clang emits to access bitfields.
This is a stack of two commits, to make it easier to see the change in behaviour. The first commit is a set of new tests that verify the placement of access units under the current algorithm. The second commit is the change, and alters those tests as required. The guts of the change is in clang/lib/CodeGen/CGRecordLayoutBuilder.cpp.

The SysV ABI describes placing bitfields in 'storage units', which may end up overlapping each other or other data members (for instance 'int bf : 4;' may be in a 32-bit storage unit that overlaps a plain 'char c;' field). LLVM doesn't represent structures at that level, and Clang must create a non-overlapping representation of the structure. That's what's happening here. To avoid confusion, I'm using the phrase 'access unit' to refer to the integer type Clang uses.
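A small example of that distinction (my own illustration, not taken from the patch):

```c++
// Under the SysV layout rules `bf` lives in a 32-bit storage unit that
// also covers the byte holding `c`.  Clang's LLVM-level layout cannot
// overlap members, so it instead gives `bf` a one-byte access unit and
// keeps `c` as a separate member.
struct S {
  int bf : 4;
  char c;
};
```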
There are two aspects to choosing access units:
bitfields that share bits of a single byte must be in the same access unit -- you can't touch only one of those bitfields. This computation is unchanged.
adjacent bitfields that occupy distinct bytes, might want to share an access unit. One wants to do this if that makes a better access size. This computation is changed.
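For example (field widths chosen purely for illustration):

```c++
struct S {
  unsigned a : 4; // bits 0-3 of byte 0
  unsigned b : 4; // bits 4-7: shares byte 0 with `a`, so it must share the
                  // access unit (the first aspect, unchanged)
  unsigned c : 8; // occupies its own byte; whether to merge it with the
                  // a/b unit is the cost decision this patch changes
};
```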
The current algorithm computes both of these in a single pass, determining whether the next bitfield extends the current access unit by requiring it to exactly abut the previous one, along with a few other constraints. But there are some issues with this:
a) If there's even a single bit of padding between the two, no concatenation occurs. This seems suboptimal.
b) Because of the above-mentioned overlapping, bitfield access units might not start on an aligned boundary, and some architectures do not support unaligned accesses.
c) Likewise, the access unit might not be able to be a natural 2^n size, and the adjacent bytes may be occupied (either by other data members or, with C++, by a derived class occupying tail-padding bytes). We must emit exactly a (say) 24-bit access (to avoid aliasing issues?) (example below).
d) Concatenating access units can lead to an unaligned larger access, and again, that's problematic for a strict-alignment ISA. (Consider struct C { char a : 8; char b : 8; };: C has alignment 1, but a 2-byte access unit would naturally want 2-byte alignment.)

My understanding is that larger access units were desired, to reduce the total number of accesses, and there was an assertion that the access would be narrowed if only some of the bytes changed.
That's not what we've observed, and it's demonstrable on x86_64:
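The structure itself isn't reproduced here; based on the description that follows (a and b filling three bytes, with c in the fourth), a plausible reconstruction is:

```c++
struct B {
  unsigned a : 16; // bytes 0-1
  unsigned b : 8;  // byte 2
  char c;          // byte 3: prevents a 4-byte access covering a and b
};

void increment_b(B &x) { x.b += 1; } // the access pattern discussed below
```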
Here, B::a and B::b could be placed into a 3-byte access unit, but we can't use 4-byte accesses due to the 4th byte containing c. The code we end up generating for an increment of B::b loads both B::a and B::b, sticks them together, increments B::b and then stores only the latter. I.e. we've reduced the memory accesses, but we're doing more work than necessary. Without the B::c field, there's 1 byte of padding, and a regular 4-byte access can be used (and is). x86_64 is not unique in this behaviour.

I'm of the opinion that one cannot do these in a single pass -- concatenating access units fundamentally requires knowing how large both units are, and you cannot tell that just by looking at the first bitfield of the second access unit.
That's what this patch does: split the computation into two separate phases. We first determine which bitfields must be in a single access unit, and then determine which of those access units could be concatenated. We pay attention to the storage size of the integral type -- an i24 naturally requires 4 bytes of storage. When generating the access units we note barriers such as :0 bitfields (on ABIs that care), and the starting location of the next non-bitfield or the structure tail-padding.

The concatenation only occurs if:
a) The larger access is naturally aligned, or unaligned access is cheap.
b) The larger access storage size doesn't expand into a forbidden zone
c) The larger access is not bigger than a register.
d) Neither of the two access units contain a volatile bitfield.
(Note that if an access unit's storage size naturally overlaps a subsequent access unit, we'll merge the two, provided the above conditions hold. A partial overlap can occur with packed structures and is handled too.)
ETA: in essence, it only merges access units when the number of memory accesses to get at the merged unit is less than the number of accesses for the original unmerged set of access units.
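A hedged sketch of that decision, not the actual CGRecordLayoutBuilder code; all the names below (limitOffset, registerSize, cheapUnalignedAccess, ...) are illustrative placeholders for the quantities described above:

```c++
#include <cstdint>

// Storage needed for a span of bytes: round up to a power of two,
// e.g. a 3-byte span needs 4 bytes of storage (an i24 occupies 4 bytes).
static uint64_t storageBytes(uint64_t sizeBytes) {
  uint64_t s = 1;
  while (s < sizeBytes)
    s *= 2;
  return s;
}

// Would merging the access units spanning [begin, mergedEnd) be acceptable
// under rules (a)-(d) above?  Offsets are in bytes from the start of the
// record.
static bool mayConcatenate(uint64_t begin, uint64_t mergedEnd,
                           uint64_t limitOffset, uint64_t registerSize,
                           bool cheapUnalignedAccess, bool anyVolatile) {
  uint64_t storage = storageBytes(mergedEnd - begin);
  if (anyVolatile)                   // (d) volatile bit-fields block merging
    return false;
  if (storage > registerSize)        // (c) must not exceed a register
    return false;
  if (begin + storage > limitOffset) // (b) storage must not spill into the
    return false;                    //     next member or tail padding
  if (!cheapUnalignedAccess && begin % storage != 0)
    return false;                    // (a) naturally aligned, or unaligned
                                     //     access is cheap
  return true;
}
```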
In order to know the ending barrier location, we adjust the call to accumulateBitFields to pass that in. There's a bit of reworking of accumulateFields to make that happen, and as an added bonus we can skip [[no_unique_address]] empty members that would otherwise split a bitfield run. (I suspect that never happens in practice though.)

An example of the change this makes is that the above x86_64 case now places the two bitfields in separate access units, because i16 and i8 can be better accessed than an i24. If there's no B::c member they're placed in an i32 access unit, which is effectively the same as now. (I see similar changes for the arm32 target.)
I added a new Clang TargetInfo hook to specify whether the target has cheap unaligned memory accesses or not. That's used by the new access unit concatenation code. LLVM has a CodeGen hook to tell you this (TargetLoweringBase::allowsMisalignedMemoryAccesses), but that's not really accessible from where we are in Clang. AFAICT creating a TargetLoweringBase requires an llvm MachineFunction, and we don't have one when laying out records. The new hook defaults to false, which is in keeping with it being 'a 32-bit RISC platform, like PPC or SPARC'. That is at odds with the current access unit algorithm, which behaves as if unaligned access has no cost. I went through the target architectures' implementations of allowsMisalignedMemoryAccesses as best I could, to override that default. It's possible I missed cases. For the record, the following targets set the flag to true (possibly under control of specific flags): AArch64, ARM, LoongArch, Mips, PPC, SystemZ, VE, WebAssembly & X86.
Clang already has a target hook for register size, which actually reports pointer size -- a reasonable guess, but incorrect for Loongson64, which only supports an ILP32 ABI. But that's an orthogonal problem. (I didn't check for similar cases, for instance a 32-bit ABI on a mips64 cpu.)

The performance data obtained from our internal benchmarks for an out-of-tree strict-alignment ILP32 target shows a code size reduction of 0.4% to 0.75%, and a performance increase of 0.2% to 1.2%. Some benchmarks showed no change; they do not appear to be bitfield-centric.
The above-mentioned x86_64 case now places the two bitfields in different, naturally sized access units (i16 and i8), and generates the following assembly:
One of the ARM bitfield tests is likewise simplified.