-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Split type objects #4903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split type objects #4903
Conversation
Here's how I estimated which fields were used most. The sed takes the ranges of lines within an mp_obj_type_t definition, and the awk prints the first field of each. Here I've manually removed the lines which are just noise, so your local results will differ.
This also doesn't count exception types or other types declared via macros; dan had noted that a LOT of types are exceptions. It's possible that setting the split differently could improve space a little bit. I left |
OK, here's one way to work towards compatibility:
In this case, the namedtuple way of defining the extended fields would still be different ... So the other alternative would be to provide a new type I also think we probably don't want to carry this as a delta from micropython if we can avoid it, so working on adding this upstream first might be a better idea. This code can still serve as a good proof of concept, though. |
@tannewt has been talking for years about shrinking these type objects. I misunderstood, it's not pointers, just an optional part. I think I did suggest that orally at one point. 😄 |
Yeah, nice! I had this idea for a long time as well, and is part of the reason for adding the I wanted to do a full analysis of how frequently certain slots were used, and then write some code to search for the optimal separation into "standard" and "full" types. I also thought about having multiple flags, or even a count, to say how many extended slots there were. Eg sort the slots on how frequently they are used and then the count says how many are supported, roughly something like this: #define BINARY_OP_THRESHOLD (4)
#define mp_type_get_binary_op(t) ((t->flags & 0xff) > BINARY_OP_THRESHOLD ? t->ext->binary_op : NULL) |
To support fully variable (not just 2 sizes), we can set aside some bits in
Anyhow, I think the below design does this. One critical part of this is a good way to find the right thing to put in The next best thing I found was to use The end of Which, I think, gets you to a declaration that looks like this, assuming the last 4 fields are among the optional ones, and the rest are always present: const mp_obj_type_t mp_type_bytes = {
{ &mp_type_type },
.flags = MP_TYPE_FLAG_EXT(MP_BIT_BINARY_OP | MP_BIT_SUBSCR | MP_BIT_GETITER | MP_BIT_BUFFER_P),
.name = MP_QSTR_bytes,
.print = str_print,
.make_new = bytes_make_new,
.locals_dict = (mp_obj_dict_t *)&str8_locals_dict,
MP_FIELD_BINARY_OP = mp_obj_str_binary_op,
MP_FIELD_SUBSCR = bytes_subscr,
MP_FIELD_GETITER = mp_obj_new_bytes_iterator,
MP_FIELD_GETBUFFER = mp_obj_str_get_buffer,
}; One thing that is unfortunately missing is automatic checking between the I think there are probably better names for things but I wanted to get these ideas written down without worrying too much about the perfect names. |
Very cool, and good to know that Not sure if useful, but another approach I had considered for fully variable is to add an indirection via an index table (i.e. all types would pay for a fixed N-byte table, but then only for the slots they use, where N ~= 12). Sketch below:
There's a bit of a subtle trick there where slot zero must be
I don't have a good feel for how this will compare performance-wise with the bitwise ops to extract the relevant info out of flags. My suspicion is that just finding a good cutoff point between "short" and "full" is probably sufficient. |
It could be easiest to decide on a small (say 5-10) set of allowed variations of // Only with print, make_new, and a locals dict.
MP_DEFINE_TYPE_V1(
mp_type_simple,
{& my_type_type },
MP_QSTR_simple,
simple_print,
simple_make_new,
(mp_obj_dict_t *)&simple_locals_dict,
);
// Lots but not all slots filled.
MP_DEFINE_TYPE_V4(
mp_type_bytes,
{ &mp_type_type },
MP_QSTR_bytes,
str_print,
bytes_make_new,
(mp_obj_dict_t *)&str8_locals_dict,
NULL, // unary_op not implemented for this type
mp_obj_str_binary_op,
bytes_subscr,
mp_obj_new_bytes_iterator,
mp_obj_str_get_buffer
); |
My preference would be to have as few options as possible to reduce complexity. |
@jepler How are you feeling about the imminence of some version of this? Even a simple version frees up a lot of flash we could use at the moment. |
This (& my calculator project) are my main targets for next week! |
There's a difference in the level of support for static initialization of flexible array members in gcc vs clang. typedef struct flex {
int i;
struct {
int a, b, c;
} j[];
} flex_t;
flex_t f1 = {.i = 3};
flex_t f2 = {.i = 4, .j[0].a = 1, };
flex_t f3 = {.i = 4, .j = {{.a = 1,}}, }; Styles f1 and f3 are supported in gcc and clang; style f2 is supported in gcc but not in clang. Therefore, we must select style f3 when extended fields are being initialized. |
Will this work on older compilers (eg older gcc, v4.8)? If it doesn't then we'll need a compile-time option to disable this feature. |
https://godbolt.org/z/h8coM6MPv This is set up to test it on current gcc and clang, as well as gcc 4.1.2, the oldest available. Not in that link but I also pushed clang back to 3.0.0, the oldest available, and it accepted the trick too. The "f3" style being the only one supported means I only see how to enable a single static split, none of the more sophisticated things we conjectured about. |
By comparing the address of the initial 'name' field instead of the addresses of the objects themselves, a small amount of type safety is added back, vs just casting to void. In the event that some other kind of object is passed in as 't', which happens to have a 'name' field of the right type, the construct would be (undesirably) accepted but it would almost certainly evaluate to false at runtime.
This appears to work with clang versions at least since 3.0.
.. These files can be created if the developer runs "./build.sh" or "./build-cp.sh" inside extmod/ulab.
I think it would be really great, if this proposal ended up in |
Our ultimate plan will be to offer this upstream to micropython! |
51781b8
to
a19a48f
Compare
In CircuitPython, I think we decided to always use this size savings. However, the preprocessor CAN disable this feature, with macros something like #if !defined(MP_TYPE_FLAG_EXTENDED)
#define MP_TYPE_CALL call
#define mp_type_get_call_slot(t) t->call
#define MP_TYPE_FLAG_EXTENDED (0)
#define MP_TYPE_EXTENDED_FIELDS(...) __VA_ARGS__
#endif which will be used in ulab for compatibility with old circuitpython, micropython, and circuitpython with this change. |
Here's my own stab at a script to analyze slot usage: https://gist.github.com/jepler/5cb392d43c16845645b49296c27e0850 The resulting spreadsheet for trinket_m0: A further iteration tried all the combinations and found that my split was best, again for trinket_m0: Furthermore, the data savings was estimated at 1952 bytes, vs a measured flash usage change of 1784 bytes so code growth must have been on the order of 168 bytes. |
Using @jimmo 's spreadsheet I found a slightly better split (based on PYBv1.1) that puts the protocol slot in the base set. That saves an extra 92 bytes (on PYBv1.1 at least). Using a 3-way split (base + ext1 + ext2) of the following form:
saves an extra 672 bytes compared to @jepler 's scheme (on PYBv1.1). A 4-way split can do a little better again, but it's diminishing returns (saves 108 bytes more). |
To re-cap some estimated size savings, now shifting to pybv11 (I think all these figures were from pyb types): est. savings, exclusive of code size increase:
|
I did some experiments on the generated code for the different approaches. I have three branches that add an extra slot to mp_obj_type_t in three different ways:
They all add a Here's the baseline disassembly:
And for the split (note extra function to extract the field)
And for the index:
The compiler does a very nice job of that! It also gets an added benefit for the fast path that it doesn't have to do the second check (if the slot index is zero, it doesn't have to check if the slot itself is NULL). (Edit: updated the disassembly, there was an error in mp_do_extra). |
As long as there are at most 15 optional slots, can you use bitfields:4 instead of uint8_t for the slot indices? |
I think so! I do wonder if you end up doing a lot more work to extract the bits out for a relatively small saving, but I will test it out. |
This is different than trinket_m0; I think we use protocols less heavily than you. Is there a better way for "3 tier" to look than STATIC const mp_obj_type_t uctypes_struct_type = {
{ &mp_type_type },
.flags = MP_TYPE_FLAG_EXTENDED2 /* (implies EXTENDED1) */ ,
.name = MP_QSTR_struct,
.print = uctypes_struct_print,
.make_new = uctypes_struct_make_new,
.attr = uctypes_struct_attr,
MP_TYPE_EXTENDED_FIELDS1(
.subscr = uctypes_struct_subscr,
.unary_op = uctypes_struct_unary_op,
MP_TYPE_EXTENDED_FIELDS2(
.buffer_p = { .get_buffer = uctypes_get_buffer },
))
}; |
Here's the disassembly for this approach (see code at https://github.com/jimmo/micropython/tree/obj-slots-index-bits)
Note there was an error in the disassembly I posted for the index approach in the earlier comment. I have updated that comment and the branch. |
Here's the corresponding disassembly for M0 (rp2040) Baseline:
Split
Index
Index-with-bitfield
|
Code size, per access of type object field (@jimmo's code, counted manually by me)
|
Looking again at 3-tier, based on trinket_m0 types, I found the best saved 2152 bytes, or an additional 200 bytes compared to 1952 with the two-tier system. My optimal 3-tier was:
code (runs for quite awhile) & output (just the few lines at top and bottom) in gist: https://gist.github.com/254a06c86fa956bcc3cd87fb0940ce97 The code's a bit of a mess so I am not super-confident in the results. Another iteration of the script computed these statistics & estimates:
considering that the size of the best 2-tier system was 4156 and the original is 6480, jimmo's slotting idea looks real tasty for our most constrained boards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this! I hope we don't fill the space up too quickly. :-)
FWIW here's some (rough) timing data on the different approaches. This is milliseconds, for 1000000 iterations of calling (str was a bit of a bad choice in hindsight, because it uses a different path in mp_obj_get_type). Baseline:
Split
Index:
Index (bits)
At some stage I will implement the index approach for MicroPython and run the full performance test suite. |
The principle is that the "less frequently used" fields of a type object are placed in a flexible array member; this flexible array either has 0 (hopefully the usual case) or 1 member. The new flag indicates whether the flexible portion of the structure exists.
Fields that reside there (plus 2 that I decided NOT to move after starting) are now hidden behind accessors in most cases; where the verification had already been performed somewhere else (you could tell because there was no NULL-pointer checking), it is accessed directly via the flexible member
ext[0]
.I selected an arbitrary split based on estimates of how frequently each field was used... [I had a grep, but I lost it]
This gives back 1784 bytes of firmware size on trinket_m0.
The unix port also builds and passes basic, but many other builds will have problems because their type objects need to be converted.
We would need a plan for how ulab could be compatible with both CP and MP since it both needs to declare types with the extended fields, and also query an optional field.
I know @dhalbert has suggested a change like this, but I couldn't find the issue to tie it to. Credit goes to him for the concept.