Skip to content

Add extra delta latency on common tickers lp ticker speed greentea test #12880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

rajkan01
Copy link
Contributor

@rajkan01 rajkan01 commented Apr 28, 2020

Summary of changes

lp ticker speed test cases from common_tickers test suite were failed while running greentea test on bare metal profile with a small c library on NUCLEO_F303RE target with ARMC6 toolchain but works with bare metal profile with std c library.

Below table shows the execution time of lp ticker failed APIs in micro second on different profiles:

Ticker API Full profile Bare metal profile with std C lib Bare metal profile with small C lib
clear_interrupt 10 10 51
disable_interrupt 10 10 51

As it is a problem with small C library (microlib) slow division limitation and this target slow clock behaviour on deep sleep as like similar issue faced with NUCLEO_L073RZ Cortex-M0 target on deep sleep green tea test

As a workaround, I have added the additional extra delta latency of 25us on expected execution time calculation to allow clear_interrupt and disable_interrupt test cases to pass.

Impact of changes

With these changes, lp ticker speed greentea test cases are passed.

Migration actions required

None.

Documentation

None.


Pull request type

[x] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[] Feature update (New feature / Functionality change / New API)
[] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[] No Tests required for this change (E.g docs only update)
[x] Covered by existing mbed-os tests (Greentea or Unittest)
[] Tests / results supplied as part of this PR

Reviewers

cc @MarceloSalazar
@evedon


@rajkan01 rajkan01 changed the title Add the extra us delta latency on lp ticker speed greentea test Add the extra delta latency on common tickers lp ticker speed greentea test Apr 28, 2020
@rajkan01 rajkan01 changed the title Add the extra delta latency on common tickers lp ticker speed greentea test Add extra delta latency on common tickers lp ticker speed greentea test Apr 28, 2020
@ciarmcom ciarmcom requested review from evedon, MarceloSalazar and a team April 28, 2020 21:00
@ciarmcom
Copy link
Member

@rajkan01, thank you for your changes.
@evedon @MarceloSalazar @ARMmbed/mbed-os-hal @ARMmbed/mbed-os-test @ARMmbed/mbed-os-maintainers please review.

@MarceloSalazar
Copy link

@ARMmbed/team-st-mcd please review as it's related to our conversations in #12221

Copy link
Collaborator

@jeromecoutant jeromecoutant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that STM32F303 is a cortex M4,
so maybe #12221 scope has to be updated ?

@kjbracey
Copy link
Contributor

kjbracey commented Apr 29, 2020

The problem is the test exists for a reason. The moment total timer interrupt execution time exceeds 80us, you're losing serial data (at 115200, if no FIFO, which is common, and true here, I think). A platform failing this test is likely to have comms issues.

Copy link
Contributor

@0xc0170 0xc0170 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add details from the PR description to the commit to explain reasoning for this extra time for microlib.

@@ -54,6 +54,11 @@ extern "C" {

#define MAX_FUNC_EXEC_TIME_US 20
#define DELTA_FUNC_EXEC_TIME_US 5
#if defined(__MICROLIB)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont have small std rather but specific one (microlib) only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microlib slow division causes this issue of slow execution and perfectly works with newlib-nano

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be an option to disable microlib on this platform, pending resolution of the speed issue?

Copy link
Contributor Author

@rajkan01 rajkan01 Apr 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evedon @MarceloSalazar please comment on your view to disable microlib on this platform

Copy link

@MarceloSalazar MarceloSalazar Apr 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on adding specific target dependencies as the code becomes a bit of a mess and it's hard to maintain/scale to support other targets.
But we can detect the situation that prevents code from being executed and skip accordingly (for now). Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant not support use of microlib on this target - I believe it can be flagged as unsupported in targets.json?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a question for @jeromecoutant on how he thinks this could/should be handled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I will "unsupport" some specific targets, before we understand the real reason!
Please correct first #12221

Copy link
Contributor

@kjbracey kjbracey Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach would be to optimise HAL_GetTick. Once the OS is running, it is going via the generic ticker_read_us, which is a fully-generalised routine to get microseconds.

And I believe it's only ever used to get small timing delays. A suggestion would be to use us_ticker_read directly, as you do when mbed_sdk_initted is false. I don't see what the if condition is gaining you there - why can't you always use the "uninited" code?

I'm busy space/speed optimising ticker_read_us, but still, you can just bypass it.

And then the next trick is to optimise the division for locality. When speed matters, you know little time has elapsed. So divide the difference by 1000, not the total. The standard divide routines should terminate fast for small inputs, but if they don't you could code them yourself. And it's a 32-bit division, not a 64-by-32 one.

(Were you using ticker_read_us to get a properly wrapping 32-bit tick by dividing down the 64-bit microseconds? I guess the non-initted code doesn't give you that because it's dividing down 32-bits, but the below code deals with it).

Something like

 uint32_t prev_time, remainder;
 uint32_t new_time = us_ticker_read();
 uint32_t elapsed_time = ((new_time - prev_time) & US_TICKER_MASK) + remainder;
 uint32_t elapsed_ticks = elapsed_time / 1000;
 // XXX compiler should terminate above line fast. Probably worth helping it with
 //     if (elapsed_time < 1000) { elapsed_ticks = 0; }
 //     else if (elapsed_time < 2000) { elapsed_ticks = 1; }
 //     else { elapsed_ticks = elapsed_ticks / 1000; }
 // or even do the division totally manually?
 //     remainder = elapsed_time;
 //     elapsed_ticks = 0;
 //     while (elapsed_time >= 1000) {
 //         elapsed_ticks++; remainder -= 1000;
 //     }
 // Or hybrid - do the loop if elapsed_time < 65536? (Which would be always if 16-bit)
 remainder = elapsed_time % 1000;
 // XXX compiler should optimise the above line when immediately following `/ 1000`
 // if it doesn't, or not immediately following `/ 1000`  use this instead:
 //     remainder = elapsed_time - elapsed_ticks * 1000;
 // could special-case remainder inside the ifs suggested above
 return prev_time += elapsed_ticks;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Kevin suggestion, HAL_GetTick API source has optimized to achieve better performance on all profile and test results on one-time execution as below

Ticker API HAL_GetTick without optimization on bare metal profile with small C lib HAL_GetTick with optimization on bare metal profile with small C lib
clear_interrupt 51us 5us
disable_interrupt 51us 5us

@rajkan01 rajkan01 force-pushed the lpticker_speed_test branch from fccf649 to 140e6b2 Compare April 29, 2020 10:56
@0xc0170 0xc0170 self-requested a review April 29, 2020 11:06
@0xc0170 0xc0170 removed the request for review from a team April 29, 2020 11:06
@mergify mergify bot dismissed 0xc0170’s stale review April 29, 2020 11:42

Pull request has been modified.

@evedon
Copy link
Contributor

evedon commented Apr 29, 2020

The problem is the test exists for a reason. The moment total timer interrupt execution time exceeds 80us, you're losing serial data (at 115200, if no FIFO, which is common, and true here, I think). A platform failing this test is likely to have comms issues.

In that case I think we shoud disable microlib for this platform in targets.json by removing "small" for the ARM toolchain and not proceed with the extra latency delay.

"supported_c_libs": {
            "arm": ["std"],
            "gcc_arm": ["std", "small" ],
            "iar": ["std" ]
        }

And we should increase the scope #12221

uint32_t elapsed_ticks = 0;
uint32_t elapsed_time = (((new_time - prev_time) & US_TICKER_MASK) + prev_tick_remainder);
prev_time = new_time;
if (elapsed_time < 1000) { // elapsed time less than 1ms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two special cases seem to me to be pointless code bloat. I don't see how they're going to be noticeably faster than just going into the while loop. We generally want to optimise for space, except where we absolutely have to.

They'd make sense if you didn't have the while loop and the alternative was the standard C library division.

@@ -130,8 +131,9 @@ void init_16bit_timer(void)
__HAL_TIM_DISABLE_IT(&TimMasterHandle, TIM_IT_CC1);

// Used by HAL_GetTick()
total_ticks = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why any of these need to be initialised here.

And the fact they weren't in the 32-bit case would seem to confirm that.

// Variables also reset in us_ticker_init()
uint32_t total_ticks = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make these static - see other comment.

Just to potentially save 2 bytes of RAM, I think I'd have the type of prev_time conditional on TIM_MST_BIT_WIDTH <= 16

} else if (elapsed_time < 2000) { // elapsed time less than 2ms
elapsed_ticks = 1;
prev_tick_remainder = elapsed_time - 1000;
} else if (elapsed_time < 65536) { // For 16-bit ticker max value of 2^16 = 65536
Copy link
Contributor

@kjbracey kjbracey May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The maximum elapsed_time value here is actually 65535+999, but I don't think that's a relevant factor for the threshold.

We just want to not spend a huge time going round the while loop, when the C division would be faster.

That crossover point would be pretty low for most C libraries, but maybe not microlib.

I had a vague idea the compiler could optimise out the test for a 16-bit timer if it was set high though - a space saving rather than a speed saving.

Maybe that would work, if we help it:

if (TIM_MST_BIT_WIDTH <= 16 || elapsed_time < 65536)

That will basically force the compiler to leave out the /% case when 16-bit. Going 65 times round the loop is probably fine. (Will almost never be that high anyway).

I'm not sure which (if any) compilers could figure out the maximum possible value of elapsed_time and eliminate the test when a 16-bit timer couldn't possibly reach the threshold.

@evedon
Copy link
Contributor

evedon commented May 4, 2020

@rajkan01 It would be cleaner to raise a new PR for the improvment to HAL_GetTick (and close this PR)

- Add the extra us delta latency on lp ticker speed test in particularly on clear_interrupt and disable_interrupt APIs test case expected execution time calculation due to issue with microlib slow division and target slow clock issue.
@rajkan01 rajkan01 force-pushed the lpticker_speed_test branch from 24c1111 to b5e14b4 Compare May 4, 2020 14:38
@rajkan01
Copy link
Contributor Author

rajkan01 commented May 4, 2020

Improvement of HAL_GetTick API is done in the #12915 and closing this PR

@rajkan01 rajkan01 closed this May 4, 2020
@mergify mergify bot removed the needs: work label May 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants