STM targets are extremely inefficient at toggling GPIOs between input and output #3299

Sissors · 2016-11-19T15:24:40Z

Note: This is just a template, so feel free to use/remove the unnecessary things

Description

Type: Bug (they are so slow it is a bug and not an enhancement imo)
Priority: Major

This is related to for example: https://developer.mbed.org/questions/69344/Detection-problem-of-a-DS1820-on-a-NUCLE/, especially the slower STM ones can't handle a OneWire sensor using the mbed code. I just spend some time modifying the mbed code, after which it didn't work, I spend alot of time walking through everything, and in the end ended up with the same code that now did work (so I screwed something up in the first try). Conclusion: If I make the toggling between input and output a normal speed, they work properly.

In the past I already noticed that the mbed STM lib is simply horrible at switching to input/output, where my FastIO lib was 100 times faster at this than the mbed lib. So what is the situation? Like pretty much any other MCU, also STM targets have a few registers to set the direction of a pin. Switching the direction is either clearing or setting a single bit in a register, so: read a register, modify bit, write register back. Takes a few clock cycles, lets say with mbed overhead 10 clock cycles, which is fine.

Now look at the current code: https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/gpio_api.c#L71, pin_function is called. (NOTE: when this is optimized, do realise that the gpio_init function does not actually init the pin, this is currently done by this gpio_dir function, so the pin_function call should be moved to the gpio_init code!).

pin_function (https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/pinmap.c#L92) does a bunch of calculations regarding pin settings we don't care about, since we just wanted to switch between input and output.
Next step it enables the GPIO clock, every single time you switch your pin direction. Finally it calls HAL_GPIO_Init.

HAL_GPIO_Init (https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L188) starts with some asserts, after which it has a while loop. Not a clue what is being looped exactly, but it seems a bit excessive for toggling a pin direction. In this loop we first start by setting the alternate function of the pin. Next on line 224-227 the actual direction of the pin is set. This would make us happy, but this isn't the end. The next step the output speed and things like that are being set. Followed by the interrupt mode (of course no interrupt is active, so the section is skipped, but yet another check).

In the end, this is all a bit excessive when the only requirement is a single bit that needs to be toggled, see for example: https://developer.mbed.org/users/Sissors/code/FastIO/file/45b32f07e790/Devices/FastIO_NUCLEO_F030.h (scroll down).

I would change it myself, but since STM seems to maintain their mbed libs fairly well (kudos for that), and since I only have a few devices myself, it seems to me to be more something for them to do :). Unless you want to send me every STM board you have ;).

LMESTM · 2016-11-21T08:31:57Z

@Sissors : thanks for the detailed description. We will have a look at it and see if any enhancement to the current implementation can be done. Please also feel free to propose enhancements and we would review and organize the non regression tests for the complete list of supported boards when needed.

hagibr · 2016-11-23T12:40:00Z

As I know, you don't have to toggle between input and output in this case. Just configure it as output with open drain, you can read the input data register and it will reflect the state of the pin.

Configure the pin as output with open drain (you are counting on the pull-up resistor of this pin)
When you are using as output, when you want to write a '0', what you are actually doing is pulling the pin down by closing the output transistor.
When you want to write an '1', you are actually opening the output transistor and letting the pull-up resistor pull the pin up.
When you want to read as input, you must set the output register as '1' and let the slave device pull the pin up/down.
I hope it help you someway. I've already worked with ds18b20 sensors with STM32F103, without mbed OS and I know it will work this way I've explained.

Sissors · 2016-11-23T13:01:33Z

Those are very good points, however I do have some counter points:

Opendrain is indeed the best, but not all mbed MCUs support it. So since it is possible I might actually do just that, but then I do need target dependent macros in the library. (I just made myself a software I2C lib, which isn't as timing critical, but there I could do the same).
A reason not to do it in the library, is that it is sometimes used in push-pull configuration. This is not the original intended way to do OneWire, but it is a known way to get better performance out of the bus. Of course I could also switch between PushPull and OpenDrain, but I don't know how long that mode switch takes :D. (And again it depends on the target if OpenDrain is available)
While for the specific DS1820 lib I think you got a very good point, in general there is no reason why toggling a single bit in a register takes hunderds of clock cycles.

odinthenerd · 2016-12-01T17:02:51Z

shouldn't mbed provide a "mux" function which does all the turning on of power clock and multiplexing whatnot? Also why is the function which is doing something as trivial as changing one bit (at least it should be that trivial) in its own translation unit? This stops the compiler from inlining the function (at least without LTO). Also why are you asserting on the state of input parameters which are arguably always known at compile time? Why no static checking, would be much more safe and efficient. This kind of thing really should be using a split header strategy rather than a split backed.

Plus isn't there a race condition here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L224

what if two threads configure two different pins and one interrupts the other.

oh and here too
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L236

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L242

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L249

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L267

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L275

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L284

and here
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F0/device/stm32f0xx_hal_gpio.c#L292

Anyway I see at least most of the fault at ARM for trying to oversimplify things and cutting corners in API design. See my full rant here http://odinthenerd.blogspot.de/2016/12/rant-on-mbed-design-decisions.html

Sissors · 2016-12-01T17:40:44Z

There is protection at higher levels: https://github.com/ARMmbed/mbed-os/blob/master/drivers/DigitalInOut.h#L80 (I do wonder how much overhead that gives).

While I can see your point, and agree on at least some of them, I do seriously wonder if you can keep the API accessible if you would make everything 100% safe and efficient for every possible MCU architecture. If you are going to make an automotive grade embedded system, then blindly using mbed API is a bad idea. But not every system is that critical. Improvements I am all for, and also sometimes the API does need an overhaul, even if it breaks older things. But its primary goal should be that it is accessible and platform independent, otherwise there is no reason to use the API at all.

odinthenerd · 2016-12-01T18:27:07Z

Ok I stand corrected, looks like that is safe if you always use the critical section. Is there documentation keeping the user from calling the wrong function? And oh my god the overhead. This means that although 95% of all cores can change direction in atomic manor the vendors have no way of actually taking advantage of that since the OS will lock any way?

Direction should really should be a bit banded / dedicated register write, three or four instructions. This is like 2 orders of magnitude less efficient, which was your initial point, so I guess we agree on that ;)

To my point on APIs: you have all the CMSIS SVD data, you know what bit fields there are and what operations on them are ATOMIC or not. Why not create a TMP DSL to express register interactions, allow the vendor to translate the desired IO action to concrete SFR actions expressing themselves in said DSL and then use TMP to decide at compile time if what the vendor did was atomic or not and lock accordingly if needed. You could also move almost all your asserting to compile time and have guaranteed efficiency and safety.

odinthenerd · 2016-12-03T22:26:00Z

Picking up the conversation about API improvements: wat do you think about allowing vendor code to add an extra parameter to their functions in order to declare them atomic and then use tag dispatch to either lock or not depending on what the the vendor code specifies, something like this could be done without SFINAE or templates:

void input(){
        do_input(&gpio_dir);
    }
    void do_input(void (*)(gpio_t *, PinDirection, IsAtomic)) {
        gpio_dir(&gpio, PIN_INPUT);
    }
    void do_input(void (*)(gpio_t *, PinDirection)) {
        core_util_critical_section_enter();
        gpio_dir(&gpio, PIN_INPUT);
        core_util_critical_section_exit();
    }

On compilers: are you guys moving to require support for at lest C++98 templates? I think a lazy evaluated and compile time optimizing io lib could be written in C++98 template meta programming. If you were to require C++11 we could just use kvasir.io as the foundation but as I understand it that is not coming for a while.

Sissors · 2017-01-08T17:20:52Z

@odinthenerd , you mean like the FastIO libs? They are more efficient indeed, although some cases (like here toggling between input and output) can be alot better even in the current setup. Personally I dislike the requirement to use the driver libs of manufacturers. If it works, fine, but don't use workarounds just to be able to use those driver libs.

Now back to my 'problem', since yet again someone had issues with it, I decided to add specifically for STM devices OpenDrain mode (why not make it default? Because PushPull outputs allow larger distances). So coding a bit, trying it out, first it didn't work, on K22F it did work, but eventually it also worked on the STM. This is mainly because apparently the DS1820 has a higher drive strength than my F030 (at least pulling it down while the F030 is pulling it up): There is one small problem with using OpenDrain mode on STM devices, it isn't implemented...

Here is the relevant code: https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F4/pinmap.c#L164, it only sets pull resistors, it never enables OpenDrain mode. So test program where open drain output is set at '1', results in 26mA current if I pull it low externally through my multimeter. If I manually set the registers correct it is 0.

Btw gpio->OTYPER |= (uint32_t)(1 << pin_index); is the required function to enable OpenDrain mode.

0xc0170 · 2017-01-09T14:05:45Z

at do you think about allowing vendor code to add an extra parameter to their functions in order to declare them atomic

Looks promising.

On compilers: are you guys moving to require support for at lest C++98 templates? I think a lazy evaluated and compile time optimizing io lib could be written in C++98 template meta programming. If you were to require C++11 we could just use kvasir.io as the foundation but as I understand it that is not coming for a while.

C++98 is the default one currently.

There is one small problem with using OpenDrain mode on STM devices, it isn't implemented...

Here is the relevant code: https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/TARGET_STM32F4/pinmap.c#L164, it only sets pull resistors, it never enables OpenDrain mode. So test program where open drain output is set at '1', results in 26mA current if I pull it low externally through my multimeter. If I manually set the registers correct it is 0.

@bcostm @adustm @LMESTM @jeromecoutant

LMESTM · 2017-01-09T14:08:31Z

@0xc0170 I know - this is in my to do list since a while. Should be one of my next tasks

LMESTM · 2017-01-26T14:50:12Z

hello @Sissors @odinthenerd
I've been preparing a basic rework of pins and gpios of STM targets. https://github.com/LMESTM/mbed/tree/dev_stm32_gpio_pins_rework
Goal is to have a separate access to IRQ / GPIOs / PINs registers, and also to have a common code base. The rework should improve a bit performances and solve few issues, like offering OpenDrain mode capability.
This is under test before a PR will be sent - but any preliminary feedback is welcome.

LMESTM · 2017-01-26T16:02:55Z

@Sissors

The FastIO bench results with this branch.
You may want to run them as well and update your page once it's merged.

F410RB

Starting test bench
Verifying basic behavior
Basic behavior verified

Measuring fixed write pattern speed
Standard took 7.52 cycles, FastIO took 2.02 cycles, which is 27%
Standard took 78ns, FastIO took 21ns

Measuring variable write pattern speed
Standard took 11.01 cycles, FastIO took 6.50 cycles, which is 59%
Standard took 115ns, FastIO took 68ns

Measuring read speed
Standard took 5.00 cycles, FastIO took 5.01 cycles, which is 100%
Standard took 52ns, FastIO took 52ns

Measuring toggling using operators speed
Standard took 18.52 cycles, FastIO took 4.99 cycles, which is 27%
Standard took 193ns, FastIO took 52ns

Measuring switching between input and output
Standard took 100.62 cycles, FastIO took 5.02 cycles, which is 5%
Standard took 1048ns, FastIO took 52ns

L030R8

Starting test bench
Verifying basic behavior
Basic behavior verified

Measuring fixed write pattern speed
Standard took 9.04 cycles, FastIO took 3.78 cycles, which is 42%
Standard took 188ns, FastIO took 79ns

Measuring variable write pattern speed
Standard took 14.03 cycles, FastIO took 6.01 cycles, which is 43%
Standard took 292ns, FastIO took 125ns

Measuring read speed
Standard took 7.02 cycles, FastIO took 7.03 cycles, which is 100%
Standard took 146ns, FastIO took 146ns

Measuring toggling using operators speed
Standard took 23.06 cycles, FastIO took 8.03 cycles, which is 35%
Standard took 480ns, FastIO took 167ns

Measuring switching between input and output
Standard took 133.36 cycles, FastIO took 6.04 cycles, which is 5%
Standard took 2778ns, FastIO took 126ns

Sissors · 2017-01-26T17:33:21Z

At least on first glance to me it looks good. Any idea why switching between input and output, while having a very nice improvement in speed, isn't faster yet? Looking at the code I don't see anything weird really. Although of course FastIO simply has the advantage of being a template class what you can never get from the (current) mbed API.

Generally I only update the table when I got a new device, but I'll make an exception for you and update it once this is merged in the main repository, and removing the footnote under the table.

Edit: Btw STMs continuous support of their code base is at least by me really appreciated.

LMESTM · 2017-01-27T08:21:21Z

At least on first glance to me it looks good. Any idea why switching between input and output, while having a very nice improvement in speed, isn't faster yet?

From what I quickly tested, there are 2 major root cause for this. The MBED API not being a template class indeed, and the critical sections protection in input()/output() of the class.

Generally I only update the table when I got a new device,

Sure I undertsand. And do you retest all devices when introducing a new device ?
Critical sections were introduced in June , 2016. Any device being tested before this date would definitely show better results.

but I'll make an exception for you and update it once this is merged in the main repository, and removing the footnote under the table.

Thanks !

odinthenerd · 2017-01-27T14:33:38Z

From what I quickly tested, there are 2 major root cause for this. The MBED API not being a template class indeed, and the critical sections protection in input()/output() of the class.

I have not had time to make a detailed proposal of how to eliminate the need for the critical section and looking at my schedule I will not have time in the near future either but if someone else wants to take up the cause here is what needs to be done:

in a new headder add a conditional version and a tag for every use case of a critical section

struct if_atomic{
    static void critical_section_enter(){
        core_util_critical_section_enter(); //default uses critical section
    };
    static void critical_section_exit(){
        core_util_critical_section_exit();
    };
};

struct io_input_is_atomic{};
struct io_output_is_atomic{};

struct is_atomic{    //derive from this class if you operation does not need to be in a critical section
    static void critical_section_enter(){};
    static void critical_section_exit(){};
}

make input() (and other functions which are applicable) call the conditional enter and exit functions

        if_atomic<io_input_is_atomic>::critical_section_enter();
        gpio_dir(&gpio, PIN_INPUT);
        if_atomic<io_input_is_atomic>::critical_section_exit();
    }

if a vendor has made an atomic implementation they can specialize the traits calls to turn off the critical section

struct if_atomic<io_input_is_atomic> : is_atomic {}; //base class provides empty stubs which do not lock

This is all theoretical implementation from my head so expect typeos ;) If anything is unclear feel free to contact me (holmes.odin@gmail.com or @odinthenerd on here or twitter)

adustm · 2017-05-05T16:07:43Z

Hello,
@LMESTM did a complete gpio rework.

@Sissors Do you agree to close this issue now ?
Cheers
Armelle

Sissors · 2017-05-05T16:51:49Z

Yes, it seems to be much improved now :)

adustm · 2017-05-09T16:24:48Z

Thank you for your answer ! @Sissors
Would there be any chance that you update the values in your FastIO code here https://developer.mbed.org/users/Sissors/code/FastIO/file/45b32f07e790/Devices/FastIO_NUCLEO_F030.h ? I'd be interested into the new values :)
Cheers
Armelle

Sissors · 2017-05-09T17:36:25Z

I did promise to do that :).

It is updated now. I verified that indeed it now meets what @LMESTM posted: #3665 (comment), so the table now reflects this also. A 4-5x increase in input/output switching is nothing to sneeze at :). Although there is still plenty of reason to use FastIO if more is required ;).

On a completely unrelated note: Who broke the automatic clock setup code? My F401 board does not get a clock from the ST-Link module. No idea why, I probably broke something. But with old mbed it immediatly starts up on its internal oscillator. (Which I don't know if you are still only ones who implemented automatic clock selection during runtime, but for sure you were the first ones, and it is really nice). With new mbed it takes roughly 10 seconds before it finally decides to use its internal oscillator. So it still works, but someone put the timeout value higher again :).

Sissors changed the title ~~STM targets are horribly inefficient at toggling GPIOs between input and output~~ STM targets are extremely inefficient at toggling GPIOs between input and output Nov 19, 2016

0xc0170 added the type: question label Nov 21, 2016

LMESTM mentioned this issue Jan 31, 2017

Dev stm32 gpio pins rework #3665

Merged

sg- added the devices: st label Feb 3, 2017

Sissors closed this as completed May 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STM targets are extremely inefficient at toggling GPIOs between input and output #3299

STM targets are extremely inefficient at toggling GPIOs between input and output #3299

Sissors commented Nov 19, 2016 •

edited

Loading

LMESTM commented Nov 21, 2016

hagibr commented Nov 23, 2016

Sissors commented Nov 23, 2016

odinthenerd commented Dec 1, 2016

Sissors commented Dec 1, 2016

odinthenerd commented Dec 1, 2016

odinthenerd commented Dec 3, 2016

Sissors commented Jan 8, 2017 •

edited

Loading

0xc0170 commented Jan 9, 2017

LMESTM commented Jan 9, 2017

LMESTM commented Jan 26, 2017

LMESTM commented Jan 26, 2017

Sissors commented Jan 26, 2017 •

edited

Loading

LMESTM commented Jan 27, 2017

odinthenerd commented Jan 27, 2017 •

edited

Loading

adustm commented May 5, 2017

Sissors commented May 5, 2017

adustm commented May 9, 2017

Sissors commented May 9, 2017

STM targets are extremely inefficient at toggling GPIOs between input and output #3299

STM targets are extremely inefficient at toggling GPIOs between input and output #3299

Comments

Sissors commented Nov 19, 2016 • edited Loading

Description

LMESTM commented Nov 21, 2016

hagibr commented Nov 23, 2016

Sissors commented Nov 23, 2016

odinthenerd commented Dec 1, 2016

Sissors commented Dec 1, 2016

odinthenerd commented Dec 1, 2016

odinthenerd commented Dec 3, 2016

Sissors commented Jan 8, 2017 • edited Loading

0xc0170 commented Jan 9, 2017

LMESTM commented Jan 9, 2017

LMESTM commented Jan 26, 2017

LMESTM commented Jan 26, 2017

Sissors commented Jan 26, 2017 • edited Loading

LMESTM commented Jan 27, 2017

odinthenerd commented Jan 27, 2017 • edited Loading

adustm commented May 5, 2017

Sissors commented May 5, 2017

adustm commented May 9, 2017

Sissors commented May 9, 2017

Sissors commented Nov 19, 2016 •

edited

Loading

Sissors commented Jan 8, 2017 •

edited

Loading

Sissors commented Jan 26, 2017 •

edited

Loading

odinthenerd commented Jan 27, 2017 •

edited

Loading