Skip to content

Split type objects #4903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jul 12, 2021
Merged

Split type objects #4903

merged 26 commits into from
Jul 12, 2021

Conversation

jepler
Copy link

@jepler jepler commented Jun 19, 2021

The principle is that the "less frequently used" fields of a type object are placed in a flexible array member; this flexible array either has 0 (hopefully the usual case) or 1 member. The new flag indicates whether the flexible portion of the structure exists.

Fields that reside there (plus 2 that I decided NOT to move after starting) are now hidden behind accessors in most cases; where the verification had already been performed somewhere else (you could tell because there was no NULL-pointer checking), it is accessed directly via the flexible member ext[0].

I selected an arbitrary split based on estimates of how frequently each field was used... [I had a grep, but I lost it]

This gives back 1784 bytes of firmware size on trinket_m0.

The unix port also builds and passes basic, but many other builds will have problems because their type objects need to be converted.

We would need a plan for how ulab could be compatible with both CP and MP since it both needs to declare types with the extended fields, and also query an optional field.

I know @dhalbert has suggested a change like this, but I couldn't find the issue to tie it to. Credit goes to him for the concept.

@jepler
Copy link
Author

jepler commented Jun 20, 2021

Here's how I estimated which fields were used most. The sed takes the ranges of lines within an mp_obj_type_t definition, and the awk prints the first field of each. sort | uniq -c counts unique instances of each one, then sort -n puts them in numeric order.

Here I've manually removed the lines which are just noise, so your local results will differ.

$ git grep -l 'mp_obj_type_t .*= {'  | xargs sed -n '/mp_obj_type_t.* = {/,/;/p' | awk '{print $1}' | sort | uniq -c | sort -n
      4 .base
      7 .parent
     11 .buffer_p
     17 .call
     17 .flags
     23 .binary_op
     24 .attr
     24 .subscr
     32 .protocol
     34 .iternext
     49 .unary_op
     54 .getiter
     79 .print
    147 .make_new
    170 .locals_dict
    227 .name

This also doesn't count exception types or other types declared via macros; dan had noted that a LOT of types are exceptions.

It's possible that setting the split differently could improve space a little bit. I left attr in the basic struct so that the attribute reference path was not complicated by additional checks; I think my rationale around .parent was similar, though .parent is probably not as performance-critical.

@jepler
Copy link
Author

jepler commented Jun 20, 2021

OK, here's one way to work towards compatibility:

  • Change the declaration of an extended slot to look like
     const mp_obj_type_t mp_type_bytes = {
         { &mp_type_type },
         .flags = MP_TYPE_FLAG_FULL,
         .name = MP_QSTR_bytes,
         .print = str_print,
         .make_new = bytes_make_new,
         .locals_dict = (mp_obj_dict_t *)&str8_locals_dict,
         .ext[0].binary_op = mp_obj_str_binary_op,
         .ext[0].subscr = bytes_subscr,
         .ext[0].getiter = mp_obj_new_bytes_iterator,
         .ext[0].buffer_p = { .get_buffer = mp_obj_str_get_buffer },
     };
    which is just the same to the compiler
  • Now, introduce a macro for filling the slot:
    #define BINARY_OP_SLOT(x) .ext[0].binary_op = x
    An implementation that doesn't want/have split types can just #define BINARY_OP_SLOT(x) binary_op = x instead
  • MP_TYPE_FLAG_FULL of course would be defined to expand to 0 when the implementation doesn't support split types
  • Of course, in this case the accessors would be provided as inlines or macros that just return the field unconditionally

In this case, the namedtuple way of defining the extended fields would still be different ... So the other alternative would be to provide a new type mp_obj_fulltype_t which has all the fields directly in them, and when there is no split they would simply be the same types via typedef. Maybe that's preferable. (you still need the flag bit of course, to get RTTI-lite). Maybe mp_obj_shorttype_t and MP_TYPE_FLAG_SHORT would be better for allowing piecemeal migration.

I also think we probably don't want to carry this as a delta from micropython if we can avoid it, so working on adding this upstream first might be a better idea. This code can still serve as a good proof of concept, though.

@dhalbert
Copy link
Collaborator

dhalbert commented Jun 21, 2021

I know @dhalbert has suggested a change like this, but I couldn't find the issue to tie it to. Credit goes to him for the concept.

@tannewt has been talking for years about shrinking these type objects. I'm not sure I actually ever mentioned the technique you are using. In days of old I saw complicated union-like types when passing arguments with varying parameters, and I mentioned that (lotsa flag bits to indicate what was being passed) but not the split technique with pointers you have adopted.

I misunderstood, it's not pointers, just an optional part. I think I did suggest that orally at one point. 😄

@tannewt
Copy link
Member

tannewt commented Jun 21, 2021

@dpgeorge @jimmo any thoughts?

@dpgeorge
Copy link

Yeah, nice! I had this idea for a long time as well, and is part of the reason for adding the .flags member to mp_obj_type_t.

I wanted to do a full analysis of how frequently certain slots were used, and then write some code to search for the optimal separation into "standard" and "full" types. I also thought about having multiple flags, or even a count, to say how many extended slots there were. Eg sort the slots on how frequently they are used and then the count says how many are supported, roughly something like this:

#define BINARY_OP_THRESHOLD (4)
#define mp_type_get_binary_op(t) ((t->flags & 0xff) > BINARY_OP_THRESHOLD ? t->ext->binary_op : NULL)

@jepler
Copy link
Author

jepler commented Jun 23, 2021

To support fully variable (not just 2 sizes), we can set aside some bits in type->flags as suggested. It would be nice if we could

  • allow builds that always make all type objects full size (binary size vs performance trade-off)
  • have some freedom to re-order fields, or even move them between the mandatory and optional parts of the type object (with optional→mandatory being easy and →optional requiring more work)

Anyhow, I think the below design does this.

One critical part of this is a good way to find the right thing to put in flags. If there were a "compile-time variadic max macro", that would be great, you'd just write .flags = MAGIC_MAX(MP_TYPE_FIELD1, MP_TYPE_FIELD5, MP_TYPE_FIELD2). Or we could provide MAX2 .. MAXN macros where N was the total number of variadic fields.

The next best thing I found was to use __builtin_clz() which seems to be a compile-time constant all the way back to the oldest gcc on godbolt.org, 4.1.2. So then you just let .flags = 32-__builtin_clz(MP_BIT_FIELD1 | MP_BIT_FIELD5 | MP_BIT_FIELD2) (and .flags gets 5 since the first (highest) set bit is 5). There's a slight complication when the bits are all 0, but this can be hidden in a macro such as MP_TYPE_FLAG_EXT. This also breaks down when you pass 32 optional fields, but that's nonsense.

The end of mp_obj_type_t becomes a flexible array void *ext[];.

Which, I think, gets you to a declaration that looks like this, assuming the last 4 fields are among the optional ones, and the rest are always present:

 const mp_obj_type_t mp_type_bytes = {
     { &mp_type_type },
     .flags = MP_TYPE_FLAG_EXT(MP_BIT_BINARY_OP | MP_BIT_SUBSCR | MP_BIT_GETITER | MP_BIT_BUFFER_P),
     .name = MP_QSTR_bytes,
     .print = str_print,
     .make_new = bytes_make_new,
     .locals_dict = (mp_obj_dict_t *)&str8_locals_dict,
     MP_FIELD_BINARY_OP = mp_obj_str_binary_op,
     MP_FIELD_SUBSCR = bytes_subscr,
     MP_FIELD_GETITER = mp_obj_new_bytes_iterator,
     MP_FIELD_GETBUFFER = mp_obj_str_get_buffer,
 };

One thing that is unfortunately missing is automatic checking between the MP_FIELD and MP_BITs. I haven't run across any cunning ways to check that. (something that goes around parsing elf files? ugh, well, you could probably do that)

I think there are probably better names for things but I wanted to get these ideas written down without worrying too much about the perfect names.

@jimmo
Copy link

jimmo commented Jun 23, 2021

Very cool, and good to know that clz can be resolved at compile time on const inputs. There's a significant win here for ROM so keen to investigate, and would be great to ensure that CPy and MPy take the same approach.

Not sure if useful, but another approach I had considered for fully variable is to add an indirection via an index table (i.e. all types would pay for a fixed N-byte table, but then only for the slots they use, where N ~= 12). Sketch below:

#define SLOT_MAKE_NEW (0)
...
#define SLOT_BINARY_OP (4)
...
#define SLOT_MAX_SLOT   (8)

typedef struct {
  // the regular stuff (base, flags, name, etc)
  uint8_t slot_idx[MAX_SLOT];
  void* slots[];
} obj_type_t;

#define TYPE_MAKE_NEW(t) ((mp_make_new_fun_t)t->slots[0])
#define TYPE_BINARY_OP(t) ((mp_binary_op_fun_t)(t->slot_idx[SLOT_BINARY_OP] == 0 ? NULL : t->slots[t->slot_idx[SLOT_BINARY_OP]]))

const obj_type_t my_obj = {
  .name = ... etc,
  .slot_idx = { [SLOT_MAKE_NEW] = 0, [SLOT_BINARY_OP] = 1},
  .slots = { &my_type_make_new, &my_type_binary_op }
};

There's a bit of a subtle trick there where slot zero must be make_new in order to make the comparison to zero work (but almost all types have a make_new). Alternatively you could always require that slot zero is NULL, in which case no comparison is required and all the accessors become simply:

#define TYPE_FOO_OP(t) ((mp_foo_op_fun_t)(t->slots[t->slot_idx[SLOT_FOO_OP]]))

I don't have a good feel for how this will compare performance-wise with the bitwise ops to extract the relevant info out of flags. My suspicion is that just finding a good cutoff point between "short" and "full" is probably sufficient.

@dpgeorge
Copy link

It could be easiest to decide on a small (say 5-10) set of allowed variations of mp_obj_type_t and then provide macros to initialise types. Eg:

// Only with print, make_new, and a locals dict.
MP_DEFINE_TYPE_V1(
    mp_type_simple,
    {& my_type_type },
    MP_QSTR_simple,
    simple_print,
    simple_make_new,
    (mp_obj_dict_t *)&simple_locals_dict,
);

// Lots but not all slots filled.
MP_DEFINE_TYPE_V4(
    mp_type_bytes,
    { &mp_type_type },
    MP_QSTR_bytes,
    str_print,
    bytes_make_new,
    (mp_obj_dict_t *)&str8_locals_dict,
    NULL, // unary_op not implemented for this type
    mp_obj_str_binary_op,
    bytes_subscr,
    mp_obj_new_bytes_iterator,
    mp_obj_str_get_buffer
);

@tannewt
Copy link
Member

tannewt commented Jun 23, 2021

My preference would be to have as few options as possible to reduce complexity. clz is a neat trick but I'd like it to be simple to describe how it works. Simple is also easier to implement.

@dhalbert
Copy link
Collaborator

dhalbert commented Jul 2, 2021

@jepler How are you feeling about the imminence of some version of this? Even a simple version frees up a lot of flash we could use at the moment.

@jepler
Copy link
Author

jepler commented Jul 3, 2021

This (& my calculator project) are my main targets for next week!

@jepler
Copy link
Author

jepler commented Jul 7, 2021

There's a difference in the level of support for static initialization of flexible array members in gcc vs clang.

typedef struct flex {
   int i;
   struct {
       int a, b, c;
   } j[];
} flex_t;

flex_t f1 = {.i = 3};
flex_t f2 = {.i = 4, .j[0].a = 1, };
flex_t f3 = {.i = 4, .j = {{.a = 1,}}, };

Styles f1 and f3 are supported in gcc and clang; style f2 is supported in gcc but not in clang. Therefore, we must select style f3 when extended fields are being initialized.

@dpgeorge
Copy link

dpgeorge commented Jul 7, 2021

There's a difference in the level of support for static initialization of flexible array members in gcc vs clang.

Will this work on older compilers (eg older gcc, v4.8)? If it doesn't then we'll need a compile-time option to disable this feature.

@jepler
Copy link
Author

jepler commented Jul 7, 2021

https://godbolt.org/z/h8coM6MPv

This is set up to test it on current gcc and clang, as well as gcc 4.1.2, the oldest available. Not in that link but I also pushed clang back to 3.0.0, the oldest available, and it accepted the trick too.

The "f3" style being the only one supported means I only see how to enable a single static split, none of the more sophisticated things we conjectured about.

jepler added 7 commits July 7, 2021 08:31
By comparing the address of the initial 'name' field instead of the
addresses of the objects themselves, a small amount of type safety is
added back, vs just casting to void.

In the event that some other kind of object is passed in as 't',
which happens to have a 'name' field of the right type, the construct
would be (undesirably) accepted but it would almost certainly evaluate
to false at runtime.
This appears to work with clang versions at least since 3.0.
.. These files can be created if the developer runs
"./build.sh" or "./build-cp.sh" inside extmod/ulab.
@v923z
Copy link

v923z commented Jul 7, 2021

We would need a plan for how ulab could be compatible with both CP and MP since it both needs to declare types with the extended fields, and also query an optional field.

I think it would be really great, if this proposal ended up in micropython proper. However, I just want to re-iterate my position with regards to ulab: if micropython needs more time, this fact shouldn't block development on the circuitpython side, and I am more than happy to accommodate temporary patches, if by doing so we can guarantee compatibility over the whole spectrum.

@jepler
Copy link
Author

jepler commented Jul 7, 2021

Our ultimate plan will be to offer this upstream to micropython!

@jepler jepler force-pushed the split-type-objects branch from 51781b8 to a19a48f Compare July 7, 2021 20:50
@jepler
Copy link
Author

jepler commented Jul 12, 2021

Is there an intention to add a compile-time config macro for this feature, so that ports that have the space can opt out of this to improve performance?

In CircuitPython, I think we decided to always use this size savings. However, the preprocessor CAN disable this feature, with macros something like

#if !defined(MP_TYPE_FLAG_EXTENDED)
#define MP_TYPE_CALL call
#define mp_type_get_call_slot(t) t->call
#define MP_TYPE_FLAG_EXTENDED (0)
#define MP_TYPE_EXTENDED_FIELDS(...) __VA_ARGS__
#endif

which will be used in ulab for compatibility with old circuitpython, micropython, and circuitpython with this change.

@jepler
Copy link
Author

jepler commented Jul 12, 2021

Size savings by board, sorted by savings:
image
The smallest savings is 1340 bytes, on fomu. The biggest change of 3472 bytes was on electronut_labs_blip. The average is 2554 and the median is 2792.

@jepler
Copy link
Author

jepler commented Jul 12, 2021

Here's my own stab at a script to analyze slot usage: https://gist.github.com/jepler/5cb392d43c16845645b49296c27e0850
.. it works on an elf file

The resulting spreadsheet for trinket_m0:
https://docs.google.com/spreadsheets/d/1zcEAkoOfHaMzsCcB9Z1x8MyClU_jt_xDHfFMPpE-rbs/edit?usp=sharing

A further iteration tried all the combinations and found that my split was best, again for trinket_m0:
https://gist.github.com/d480e10a509c0af8de69bdb247df9f4a

Furthermore, the data savings was estimated at 1952 bytes, vs a measured flash usage change of 1784 bytes so code growth must have been on the order of 168 bytes.

@dpgeorge
Copy link

Using @jimmo 's spreadsheet I found a slightly better split (based on PYBv1.1) that puts the protocol slot in the base set. That saves an extra 92 bytes (on PYBv1.1 at least).

Using a 3-way split (base + ext1 + ext2) of the following form:

base = (print, make_new, attr, parent, locals_dict)
ext1 = (call, unary_op, getiter, iternext, protocol)
ext2 = (binary_op, subscr, buffer_p)

saves an extra 672 bytes compared to @jepler 's scheme (on PYBv1.1).

A 4-way split can do a little better again, but it's diminishing returns (saves 108 bytes more).

@jepler
Copy link
Author

jepler commented Jul 12, 2021

To re-cap some estimated size savings, now shifting to pybv11 (I think all these figures were from pyb types):

est. savings, exclusive of code size increase:

scheme savings
perfect 5216
jimmo-index-table 3284
damien-4tier 3284
damien-3tier 3176
damien-2tier 2596
jepler-2tier 2504

@jimmo
Copy link

jimmo commented Jul 12, 2021

I did some experiments on the generated code for the different approaches.

I have three branches that add an extra slot to mp_obj_type_t in three different ways:

They all add a void mp_do_extra(mp_obj_t t); to runtime.h that invokes the function in this slot, and defines a function in this slot on list.

Here's the baseline disassembly:

0802c748 <mp_do_extra>:
 802c748:       b510            push    {r4, lr}
 802c74a:       f000 fd5d       bl      802d208 <mp_obj_get_type>
 802c74e:       6bc3            ldr     r3, [r0, #60]   ; 0x3c
 802c750:       b113            cbz     r3, 802c758 <mp_do_extra+0x10>
 802c752:       e8bd 4010       ldmia.w sp!, {r4, lr}
 802c756:       4718            bx      r3
 802c758:       bd10            pop     {r4, pc}

And for the split (note extra function to extract the field)

0802c748 <mp_do_extra>:
 802c748:       b510            push    {r4, lr}
 802c74a:       f000 fd5d       bl      802d208 <mp_obj_get_type>
 802c74e:       f001 f876       bl      802d83e <mp_type_getextra>
 802c752:       b110            cbz     r0, 802c75a <mp_do_extra+0x12>
 802c754:       e8bd 4010       ldmia.w sp!, {r4, lr}
 802c758:       4700            bx      r0
 802c75a:       bd10            pop     {r4, pc}

0802d83e <mp_type_getextra>:
 802d83e:       8883            ldrh    r3, [r0, #4]
 802d840:       f013 0380       ands.w  r3, r3, #128    ; 0x80
 802d844:       bf14            ite     ne
 802d846:       6bc0            ldrne   r0, [r0, #60]   ; 0x3c
 802d848:       4618            moveq   r0, r3
 802d84a:       4770            bx      lr

And for the index:

0802c748 <mp_do_extra>:
 802c748:       b510            push    {r4, lr}
 802c74a:       f000 fd61       bl      802d210 <mp_obj_get_type>
 802c74e:       f890 303d       ldrb.w  r3, [r0, #61]   ; 0x3d
 802c752:       b12b            cbz     r3, 802c760 <mp_do_extra+0x18>
 802c754:       3310            adds    r3, #16
 802c756:       e8bd 4010       ldmia.w sp!, {r4, lr}
 802c75a:       f850 3023       ldr.w   r3, [r0, r3, lsl #2]
 802c75e:       4718            bx      r3
 802c760:       bd10            pop     {r4, pc}

The compiler does a very nice job of that! It also gets an added benefit for the fast path that it doesn't have to do the second check (if the slot index is zero, it doesn't have to check if the slot itself is NULL). (Edit: updated the disassembly, there was an error in mp_do_extra).

@jepler
Copy link
Author

jepler commented Jul 12, 2021

As long as there are at most 15 optional slots, can you use bitfields:4 instead of uint8_t for the slot indices?

@jimmo
Copy link

jimmo commented Jul 12, 2021

As long as there are at most 15 optional slots, can you use bitfields:4 instead of uint8_t for the slot indices?

I think so! I do wonder if you end up doing a lot more work to extract the bits out for a relatively small saving, but I will test it out.

@jepler
Copy link
Author

jepler commented Jul 12, 2021

Using @jimmo 's spreadsheet I found a slightly better split (based on PYBv1.1) that puts the protocol slot in the base set. That saves an extra 92 bytes (on PYBv1.1 at least).

This is different than trinket_m0; I think we use protocols less heavily than you.

Is there a better way for "3 tier" to look than

STATIC const mp_obj_type_t uctypes_struct_type = {
    { &mp_type_type },
    .flags = MP_TYPE_FLAG_EXTENDED2 /* (implies EXTENDED1) */ ,
    .name = MP_QSTR_struct,
    .print = uctypes_struct_print,
    .make_new = uctypes_struct_make_new,
    .attr = uctypes_struct_attr,
    MP_TYPE_EXTENDED_FIELDS1(
        .subscr = uctypes_struct_subscr,
        .unary_op = uctypes_struct_unary_op,
        MP_TYPE_EXTENDED_FIELDS2(
            .buffer_p = { .get_buffer = uctypes_get_buffer },
        ))
};

@jimmo
Copy link

jimmo commented Jul 12, 2021

As long as there are at most 15 optional slots, can you use bitfields:4 instead of uint8_t for the slot indices?

I think so! I do wonder if you end up doing a lot more work to extract the bits out for a relatively small saving, but I will test it out.

Here's the disassembly for this approach (see code at https://github.com/jimmo/micropython/tree/obj-slots-index-bits)

0802c748 <mp_do_extra>:
 802c748:       b510            push    {r4, lr}
 802c74a:       f000 fd65       bl      802d218 <mp_obj_get_type>
 802c74e:       f890 303c       ldrb.w  r3, [r0, #60]   ; 0x3c
 802c752:       f013 0ff0       tst.w   r3, #240        ; 0xf0
 802c756:       d008            beq.n   802c76a <mp_do_extra+0x22>
 802c758:       f343 1303       sbfx    r3, r3, #4, #4
 802c75c:       b25b            sxtb    r3, r3
 802c75e:       3310            adds    r3, #16
 802c760:       e8bd 4010       ldmia.w sp!, {r4, lr}
 802c764:       f850 3023       ldr.w   r3, [r0, r3, lsl #2]
 802c768:       4718            bx      r3
 802c76a:       bd10            pop     {r4, pc}

Note there was an error in the disassembly I posted for the index approach in the earlier comment. I have updated that comment and the branch.

@jimmo
Copy link

jimmo commented Jul 12, 2021

Here's the corresponding disassembly for M0 (rp2040)

Baseline:

100178e4 <mp_do_extra>:
100178e4:       b510            push    {r4, lr}
100178e6:       f7f5 f99f       bl      1000cc28 <mp_obj_get_type>
100178ea:       6bc3            ldr     r3, [r0, #60]   ; 0x3c
100178ec:       2b00            cmp     r3, #0
100178ee:       d000            beq.n   100178f2 <mp_do_extra+0xe>
100178f0:       4798            blx     r3
100178f2:       bd10            pop     {r4, pc}

Split

100178f4 <mp_do_extra>:
100178f4:       b510            push    {r4, lr}
100178f6:       f7f5 f997       bl      1000cc28 <mp_obj_get_type>
100178fa:       f7f5 fcb9       bl      1000d270 <mp_type_getextra>
100178fe:       2800            cmp     r0, #0
10017900:       d000            beq.n   10017904 <mp_do_extra+0x10>
10017902:       4780            blx     r0
10017904:       bd10            pop     {r4, pc}

1000d270 <mp_type_getextra>:
1000d270:       8881            ldrh    r1, [r0, #4]
1000d272:       0003            movs    r3, r0
1000d274:       2280            movs    r2, #128        ; 0x80
1000d276:       0008            movs    r0, r1
1000d278:       4010            ands    r0, r2
1000d27a:       4211            tst     r1, r2
1000d27c:       d000            beq.n   1000d280 <mp_type_getextra+0x10>
1000d27e:       6bd8            ldr     r0, [r3, #60]   ; 0x3c
1000d280:       4770            bx      lr

Index

100178e4 <mp_do_extra>:
100178e4:       b510            push    {r4, lr}
100178e6:       f7f5 f99f       bl      1000cc28 <mp_obj_get_type>
100178ea:       0003            movs    r3, r0
100178ec:       333d            adds    r3, #61 ; 0x3d
100178ee:       781b            ldrb    r3, [r3, #0]
100178f0:       2b00            cmp     r3, #0
100178f2:       d003            beq.n   100178fc <mp_do_extra+0x18>
100178f4:       3310            adds    r3, #16
100178f6:       009b            lsls    r3, r3, #2
100178f8:       581b            ldr     r3, [r3, r0]
100178fa:       4798            blx     r3
100178fc:       bd10            pop     {r4, pc}

Index-with-bitfield

100178ec <mp_do_extra>:
100178ec:       b510            push    {r4, lr}
100178ee:       f7f5 f99b       bl      1000cc28 <mp_obj_get_type>
100178f2:       0003            movs    r3, r0
100178f4:       333c            adds    r3, #60 ; 0x3c
100178f6:       781b            ldrb    r3, [r3, #0]
100178f8:       220f            movs    r2, #15
100178fa:       0019            movs    r1, r3
100178fc:       4391            bics    r1, r2
100178fe:       d005            beq.n   1001790c <mp_do_extra+0x20>
10017900:       b25b            sxtb    r3, r3
10017902:       111b            asrs    r3, r3, #4
10017904:       3310            adds    r3, #16
10017906:       009b            lsls    r3, r3, #2
10017908:       581b            ldr     r3, [r3, r0]
1001790a:       4798            blx     r3
1001790c:       bd10            pop     {r4, pc}

@jepler
Copy link
Author

jepler commented Jul 12, 2021

Code size, per access of type object field (@jimmo's code, counted manually by me)

style m4 m0
baseline 18 16
2-tier 34 36
uint8_t index table 26 26
bitfield index table 36 34

@jepler
Copy link
Author

jepler commented Jul 12, 2021

Looking again at 3-tier, based on trinket_m0 types, I found the best saved 2152 bytes, or an additional 200 bytes compared to 1952 with the two-tier system.

My optimal 3-tier was:

tier fields
base (base flags name) print make_new attr parent
first getiter iternext protocol locals_dict
second call unary_op binary_op subscr buffer_p

code (runs for quite awhile) & output (just the few lines at top and bottom) in gist: https://gist.github.com/254a06c86fa956bcc3cd87fb0940ce97

The code's a bit of a mess so I am not super-confident in the results.

Another iteration of the script computed these statistics & estimates:

# types 108
# filled slots 364
Slot print#0 filled 63 times
Slot make_new#1 filled 72 times
Slot call#2 filled 10 times
Slot unary_op#3 filled 25 times
Slot binary_op#4 filled 14 times
Slot attr#5 filled 37 times
Slot subscr#6 filled 11 times
Slot getiter#7 filled 30 times
Slot iternext#8 filled 18 times
Slot bufer_p#9 filled 5 times
Slot protocol#10 filled 8 times
Slot parent#11 filled 30 times
Slot locals_dict#12 filled 41 times
optimum 2320
slotted 2752

considering that the size of the best 2-tier system was 4156 and the original is 6480, jimmo's slotting idea looks real tasty for our most constrained boards.

@jepler
Copy link
Author

jepler commented Jul 12, 2021

@tannewt @dhalbert I did the suggested re-namings. As discussed in the meetings I'd like to merge this now into CP knowing we may have to re-do it if upstream does even more/better than we did here, but it needs a fresh review.

Copy link
Member

@tannewt tannewt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this! I hope we don't fill the space up too quickly. :-)

@tannewt tannewt merged commit 9fdecac into adafruit:main Jul 12, 2021
@jepler
Copy link
Author

jepler commented Jul 13, 2021

@dpgeorge @jimmo really appreciate your feedback on this, let me know if I can pitch in on the implementation in micropython.

@jimmo
Copy link

jimmo commented Jul 13, 2021

FWIW here's some (rough) timing data on the different approaches. This is milliseconds, for 1000000 iterations of calling mp_do_extra with the code in those branches (but with the print removed from list_extra). Calling it for three different types (testing the "extended, has the slot" (list), "extended, doesn't have the slot" (dict) and "not-extended" (str) paths).

(str was a bit of a bad choice in hindsight, because it uses a different path in mp_obj_get_type).

Baseline:

str: 214
dict: 203
list: 202

Split

str: 262
dict: 257
list: 256

Index:

str: 214
dict: 208
list: 226

Index (bits)

str: 214
dict: 208
list: 250

At some stage I will implement the index approach for MicroPython and run the full performance test suite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants