Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BCM270X_DT: mz61581: Revert to spi-bcm2708 #1077

Merged
merged 2 commits into from
Jul 19, 2015

Conversation

notro
Copy link
Contributor

@notro notro commented Jul 19, 2015

The MZ61581 display does not work with spi-bcm2835 and software chip selects. It works before the commit: spi: bcm2835: transform native-cs to gpio-cs on first spi_setup

Revert to spi-bcm2708 until the cause has been detected and the issue resolved.

Also increase reliability by adding a patch for missing display controller reset.
This fbtft patch is now in linux-next and will probably land in 4.3.

notro added 2 commits July 19, 2015 17:40
When an init sequence is present in the Device Tree,
fbtft_init_display_dt() is used to initialize the display.
Add missing reset function call and activation of
chip select for parallel bus.

Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
The MZ61581 display does not work with spi-bcm2835 and software
chip select. It works before the commit:
spi: bcm2835: transform native-cs to gpio-cs on first spi_setup

Revert to spi-bcm2708 until the cause has been detected and the
issue resolved.

Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
@popcornmix
Copy link
Collaborator

Okay with me. @pelwell ?

Ideally we would like to drop spi-bcm2708 (and other drivers that have an upstream alternative that appears fully functional), but I guess that won't happen for a while (perhaps rpi-4.2.y kernel tree?)

pelwell added a commit that referenced this pull request Jul 19, 2015
BCM270X_DT: mz61581: Revert to spi-bcm2708
@pelwell pelwell merged commit 5ef4952 into raspberrypi:rpi-4.0.y Jul 19, 2015
@notro
Copy link
Contributor Author

notro commented Jul 19, 2015

I can inform the manufacturer of the display that spi-bcm2708 probably will be removed the next time the Linux version is bumped on the official Pi kernel. That will give them some time to find out why it doesn't work with software controlled chip select.

@pelwell
Copy link
Contributor

pelwell commented Jul 19, 2015

From what I remember, the upstream spi-bcm2835 doesn't have the auto-conversion patch. If the default is to always use cs-gpio in the dtb, and I'm not saying it should be, then there will need to be a way to defeat that with an overlay.

@notro
Copy link
Contributor Author

notro commented Jul 19, 2015

@msperl Can spi-bcm2835 still be used with hw chip selects? Also with DMA?

@msperl
Copy link
Contributor

msperl commented Jul 19, 2015

As explained there are situations when the CS gets asserted/de-asserted for short periods (a few usec) of time when the control register is written to.
So any of those speed-optimizations will not work (as they will touch the register while the transfer is still in progress) - same with DMA.
That is why that auto-conversion from native to gpio-CS was put in place until CS-gpios gets added to the device tree...

Does the issue also happen when the dt already sets up pins 8 and 7 correctly as output and the spi driver node contains "CS-gpios = <&gpio 8>, <&gpio 7>;"?

@notro
Copy link
Contributor Author

notro commented Jul 19, 2015

Does the issue also happen when the dt already sets up pins 8 and 7 correctly as output and the spi driver node contains "CS-gpios = <&gpio 8>, <&gpio 7>;"?

Yes. I tried that when I was trying to understand the spi cs gpio code.

@notro notro deleted the mz61581 branch July 19, 2015 19:34
@msperl
Copy link
Contributor

msperl commented Jul 19, 2015

If I had such a device then I could look into the why comparing the behavior on the bus using a logic-analyzer...

Maybe it is related to a pull-up issue in the hw?

The only difference I have seen is that the Native gpio has slightly less time between CS and clock-start.

So nothing else comes to my mind that could explain this except maybe a different interpretation of a spi-transfer sequence with regards to chip-selects in the framework compared to the 2708 driver.

Another thing: running at slower clocks solves the issue by chance?

@msperl
Copy link
Contributor

msperl commented Jul 19, 2015

Out of curiosity: why not use a "downgrade to spi-bcm2708" device-tree overlay for this specific device?

But obviously we also need to understand the issue and figure out why it does not work...

@pelwell
Copy link
Contributor

pelwell commented Jul 19, 2015

Have you read this Pull Request? 8-)

@notro
Copy link
Contributor Author

notro commented Jul 19, 2015

The only difference I have seen is that the Native gpio has slightly less time between CS and clock-start.

Ok, that's a place to start, adding a small delay after the gpio set in spi_set_cs().
I will try that when I'm done with some firmware API patches I'm working on. Thanks.
The display has a CPLD in front of the display controller, enabling it to max out the SPI bus at 128MHz.

Another thing: running at slower clocks solves the issue by chance?

No it doesn't.

Out of curiosity: why not use a "downgrade to spi-bcm2708" device-tree overlay for this specific device?

This PR does that :-)

@msperl
Copy link
Contributor

msperl commented Jul 20, 2015

As said: if you can have a look at the clocks and CS with a logic-analyzer then look into what is the difference on the bus-level (I can do this at up to 200MHz sample-rates if someone borrows me that device)...

But as said: I actually guess that the implementation behavior is slightly different for the drivers with regards to cs_change and the corresponding delays (actually between spi-bcm2835 and the spi-framework standard implementation)

@msperl
Copy link
Contributor

msperl commented Jul 20, 2015

One thing: is there any data that you read back from the device or can I also run this "blind" (similar to the st7739r I own).
Because then I can look at the SPI bus without a device attached and compare the waveforms...

@notro
Copy link
Contributor Author

notro commented Jul 21, 2015

None of the FBTFT drivers rely on anything read back. I think that only one driver tries to read the device code if possible.

@msperl
Copy link
Contributor

msperl commented Jul 21, 2015

One idea: maybe the switching the gpio 7/8 to output does not remove the internal pull-up which would shift the low level up higher than with the alt2 mode (which would also control the gpio pull-ups I assume) - the cpld may run 1.8v internally but may be 3v3 tolerant, so with low guaranteed to say 0.2 vcc that would mean 0.3v as low limit and with the pull-up we may not get to that level.

So the test could be:

  • Configure gpios 7/8 as output in the device tree and disable pull-ups.
  • configure gpio-CS in spi.

If it works that way then we know it is the pull-ups and you got your way around it (and then I would be actually be surprised that the gpio code allows to change to output without disabling pull-ups).

@notro
Copy link
Contributor Author

notro commented Jul 22, 2015

Neither the DT change, nor adding delay around CS helped. Also tried going down to 4MHz.

For reference, this is what I tried (tested ok on a rpi-display).
I also tried without changing pull.

diff --git a/arch/arm/boot/dts/bcm2708-rpi-b-plus.dts b/arch/arm/boot/dts/bcm2708-rpi-b-plus.dts
index 3ad2e0d..8faea0c 100644
--- a/arch/arm/boot/dts/bcm2708-rpi-b-plus.dts
+++ b/arch/arm/boot/dts/bcm2708-rpi-b-plus.dts
@@ -10,7 +10,8 @@
 &gpio {
        spi0_pins: spi0_pins {
                brcm,pins = <7 8 9 10 11>;
-               brcm,function = <4>; /* alt0 */
+               brcm,function = <1 1 4 4 4>; /* out out alt0 alt0 alt0 */
+               brcm,pull = <0 0 1 1 1>; /* none none down down down */
        };

        i2c0_pins: i2c0 {
@@ -46,6 +47,8 @@
        pinctrl-names = "default";
        pinctrl-0 = <&spi0_pins>;

+       cs-gpios = <&gpio 8 1>, <&gpio 7 1>;
+
        spidev@0{
                compatible = "spidev";
                reg = <0>;      /* CE0 */
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 50910d8..7d579a0 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -464,8 +464,11 @@ static void spi_set_cs(struct spi_device *spi, bool enable)
        if (spi->mode & SPI_CS_HIGH)
                enable = !enable;

-       if (spi->cs_gpio >= 0)
+       if (spi->cs_gpio >= 0) {
+udelay(10);
                gpio_set_value(spi->cs_gpio, !enable);
+udelay(10);
+       }
        else if (spi->master->set_cs)
                spi->master->set_cs(spi, !enable);
 }

Thanks for the suggestions.
I can ask the manufacturer if he can send you a display. Will you look at it with your analyzer if he does?

@msperl
Copy link
Contributor

msperl commented Jul 22, 2015

I would try to figure it out...
But I will try first to run the device "blind" when I get back to my analyzer (as soon as it gets a bit cooler - maybe Sunday or Monday...)

@msperl
Copy link
Contributor

msperl commented Jul 22, 2015

I will share my findings with you and you can compare the measurements yourself as well...

popcornmix added a commit to Hexxeh/rpi-firmware that referenced this pull request Jul 24, 2015
kernel: BCM2835_V4L2: Add support for V4L2_EXPOSURE_METERING_MATRIX
See: raspberrypi/linux#1068

kernel: dmaengine: bcm2708-dmaengine: Fix memory leak when stopping a running transfer
See: raspberrypi/linux#1072

kernel: BCM270X_DT: mz61581: Revert to spi-bcm2708
See: raspberrypi/linux#1077

kernel: bcm2708/2835-i2s: Fix for PCM register ranges in device trees
See: raspberrypi/linux#1079

kernel: bcm2835-sdhost: Add the ERASE capability
See: raspberrypi/linux#1076

kernel: bcm2835-sdhost: Ignore CRC7 for MMC CMD1

kernel: BCM270X_DT: Add unit address to gpio node name

kernel: spi-bcm2835: merge upstream patches allowing DMA transfers
See: raspberrypi/linux#1085

kernel: BCM270X_DT: Use i2c_arm for rtc and bmp085 overlays

kernel: BCM2708_DT: CM dtparams for audio, watchdog and RNG

firmware: video_decode: Don't wait for a valid timestamp to output frames
See: raspberrypi/firmware#451
popcornmix added a commit to raspberrypi/firmware that referenced this pull request Jul 24, 2015
kernel: BCM2835_V4L2: Add support for V4L2_EXPOSURE_METERING_MATRIX
See: raspberrypi/linux#1068

kernel: dmaengine: bcm2708-dmaengine: Fix memory leak when stopping a running transfer
See: raspberrypi/linux#1072

kernel: BCM270X_DT: mz61581: Revert to spi-bcm2708
See: raspberrypi/linux#1077

kernel: bcm2708/2835-i2s: Fix for PCM register ranges in device trees
See: raspberrypi/linux#1079

kernel: bcm2835-sdhost: Add the ERASE capability
See: raspberrypi/linux#1076

kernel: bcm2835-sdhost: Ignore CRC7 for MMC CMD1

kernel: BCM270X_DT: Add unit address to gpio node name

kernel: spi-bcm2835: merge upstream patches allowing DMA transfers
See: raspberrypi/linux#1085

kernel: BCM270X_DT: Use i2c_arm for rtc and bmp085 overlays

kernel: BCM2708_DT: CM dtparams for audio, watchdog and RNG

firmware: video_decode: Don't wait for a valid timestamp to output frames
See: #451
@msperl
Copy link
Contributor

msperl commented Jul 27, 2015

Now that I run the logic-analyzer I see that you use:

                                spi-cpha;
                                spi-cpol;

And then the transfer on this changes the initial settings...
So the SPI clock and out level only get set during the first transfer....

Would it be possible to set the first transfer to a dummy command and see if it works then?

I would guess that just running:

rmmod fb_s6d02a1; insmod fb_s6d02a1

would also do the trick...

If it works then, then it is actually a bug in the spi-framework... (in that case I will explain why later - essentially something that has bitten me/us with CS-polarity before)

@msperl
Copy link
Contributor

msperl commented Jul 27, 2015

Please make sure that you only test with a single spi-device!
So do not load the driver for the touch display for this test!

@notro
Copy link
Contributor Author

notro commented Jul 27, 2015

Reloading the module revealed a bug in fbtft which results in using a pointer to free'd memory. I thought I had fixed that issue. I need to build a kernel.

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

To summarize the issue - as I see it:
The problem is that with CS-gpio set the chip-select gets asserted before the spi-mode is set up.
This results CS being asserted and only then the clock is set up for the requested polarity.
This is not an issue as long as we always run in mode 0 (which is set by default) or as long as all devices use the same spi mode and the first transfer is "ignored".

That is why I was asking if setting up a single dummy transfer solves the issue.

Why does this not happen with native-CS? The reason is that in that case the polarities and chip-selects are set at the same time when writing the CS register.

As for a fix it will require a patch, that moves the setup of polarity to master->prepare_master to set polarity prior to setting gpio-CS.

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

@notro: sent you a (yet unverified) patch that should solve the polarity issue by moving the polarity settings to run before the chip-select not after.

The patch applies cleanly against upstream (4.2-rc3), but there may be a reject if you apply it against the foundation kernel because of handle_err does not exist in 4.0

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

Now I have checked and I believe that the problem should be solved by the patch!
Here the first transfer:

Failing when SCK changes the polarity WHILE CS is asserted:
screen shot 2015-07-28 at 15 48 01
Working eith SCK changing polarity while CS is NOT asserted:
screen shot 2015-07-28 at 15 48 21

Please tell me that it works as expected with a real device and I will upstream it...

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

Sent the patch upstream - when it gets accepted I will create a merge request

@notro
Copy link
Contributor Author

notro commented Jul 28, 2015

Spot on Martin!

Your patch solved the problem as did reloading the driver (I had to apply the patch by hand on rpi-4.0.y).

I tried enabling DMA, but I only got polling transfers. When I disabled polling, I got DMA transfers for pixel data.
However the throughput is not as good as the old spi dma driver:

spi-bcm2835:
[   90.238356] fb_s6d02a1 spi0.0: Display update: 8482 kB/s (35.367 ms), fps=14 (70.009 ms)

spi-bcm2708:
[   56.904256] fb_s6d02a1 spi0.0: Display update: 12273 kB/s (24.443 ms), fps=19 (50.048 ms)

Ref: https://github.com/notro/fbtft/wiki/Performance#mz61581-pi-ext

@notro
Copy link
Contributor Author

notro commented Jul 28, 2015

Just a note: If you try the performance output on any other than rpi-4.0.y, you need to use debug=35, because of a fbtft reset patch recently applied there. If not you get a white display.

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

I can try to run the same performance tests as you to see where/when we waste time with the logic-analyzer (especially the gaps)...

One note: please try to use a bigger DMA buffer than 4k - 32k for example should work and compare it then... (I guess the spi framework mapping of pages is quite expensive and then dmaengine on top - the original dma driver did not have that but did everything on its own)

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

As a note: when cleaning you run 51.3us between each 4k transfer with the default settings and 29.3us are spent with cs high. so there is still some interrupts (double maybe?) that run unneccesarily and stall the flow... (see my comment to your upstream patch)

And we run 0.358s for a full screen update at 8MHz.

@notro
Copy link
Contributor Author

notro commented Jul 28, 2015

32k still gives me polling transfers. unsigned long overflows in the equation:

[    9.239777] bcm2835_spi_transfer_one: xfer_time_us= 32768 * 9 * 1000000 / 125000000 = 22

Removing polling, 32k gives comparable values:

[    8.648498] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=33, spi0.0 at 128 MHz

[   75.739626] fb_s6d02a1 spi0.0: Display update: 11314 kB/s (26.514 ms), fps=19 (50.385 ms)

fps=50:

[    8.717523] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=50, spi0.0 at 128 MHz

[   94.849416] fb_s6d02a1 spi0.0: Display update: 11406 kB/s (26.300 ms), fps=25 (39.996 ms)

The fps parameter is a misnomer but fps=50 translates to a HZ/fps=2 jiffies delay between updates.

What upstream patch are you referring to?

@msperl
Copy link
Contributor

msperl commented Jul 28, 2015

So we need to fix the calculation as well (at least make xfer_time_us long long)

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

Here the calculation patch (spaces instead of tabs, so does not apply cleanly)

diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index c9357bb..e7874a6 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -480,7 +480,7 @@ static int bcm2835_spi_transfer_one_poll(struct spi_master *
                                         struct spi_device *spi,
                                         struct spi_transfer *tfr,
                                         u32 cs,
-                                        unsigned long xfer_time_us)
+                                        unsigned long long xfer_time_us)
 {
        struct bcm2835_spi *bs = spi_master_get_devdata(master);
        unsigned long timeout;
@@ -531,7 +531,8 @@ static int bcm2835_spi_transfer_one(struct spi_master *maste
 {
        struct bcm2835_spi *bs = spi_master_get_devdata(master);
        unsigned long spi_hz, clk_hz, cdiv;
-       unsigned long spi_used_hz, xfer_time_us;
+       unsigned long spi_used_hz;
+       unsigned long long xfer_time_us;
        u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);

        /* set clock */
@@ -573,9 +574,10 @@ static int bcm2835_spi_transfer_one(struct spi_master *mast
        bs->rx_len = tfr->len;

        /* calculate the estimated time in us the transfer runs */
-       xfer_time_us = tfr->len
+       xfer_time_us = (unsigned long long)tfr->len
                * 9 /* clocks/byte - SPI-HW waits 1 clock after each byte */
-               * 1000000 / spi_used_hz;
+               * 1000000;
+       do_div(xfer_time_us, spi_used_hz);

        /* for short requests run polling*/
        if (xfer_time_us <= BCM2835_SPI_POLLING_LIMIT_US)

That should fix the polling issue...

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

set out a patch for this for upstream as well...

@popcornmix
Copy link
Collaborator

I'm not sure xfer_time_us needs to be long long. 32-bit unsigned can handle over an hour.
I think you just need the cast in the calculation.
There is a big divide in the calculation so result is not going to be > 32-bits.

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

the bigger problem is that if I cast and cast back then we get issues with the 64/64 bit divide is not in eapi...
hence long long all the way and do_div();
No feedback on those 2 patches for upstream yet - and I would only merge when merged upstream...

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

@notro: as for speed: there does not seem to be a difference between 120MHz and 25MHz - at least from your dmesg reporting:

[  165.813848] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=33, spi0.0 at 25 MHz
[  123.074458] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=33, spi0.0 at 120 MHz

I will have a look at the output lines with the logic-analyzer to see where the "delay" can get found,
but I guess it may be related to interrupt latencies, and as we run interrupts for tx and rx-DMA while we only request rx-dma as an irq source that is some unnecessary overhead introduced by the dmaengine implementation - the original dma driver did not trigger tx-dma, so it may be faster because of that

@notro
Copy link
Contributor Author

notro commented Jul 29, 2015

The fps numbers are used confusingly in fbtft. It first started out as a way to say how often the display should be updated. So the fps DT property and module argument is reflected in the fps value reported when the driver loads: fps=33. This is the default value.

Later when displays with bigger resolutions came along, fbtft couldn't always keep up with the value asked for, so I added a debug option to measure the real fps. This fps value is printed to the log for each frame sent to the display when debug=32, or for this particular display debug=35 has to be used because of a bug (fixed in rpi-4.0.y) in fbtft.

https://github.com/notro/fbtft/wiki/FPS was written to try and clear up this confusion, but this has not been done in the code.

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

here now some timings:

  • dmesg:
[   74.699015] spi spi0.1: setting up native-CS1 as GPIO 7
[   74.699871] ads7846 spi0.1: touchscreen, irq 484
[   74.700501] input: ADS7846 Touchscreen as /devices/platform/soc/3f204000.spi/spi_master/spi0/spi0.1/input/input0
[   74.735725] fbtft: module is from the staging directory, the quality is unknown, you have been warned.
[   74.739504] fb_s6d02a1: module is from the staging directory, the quality is unknown, you have been warned.
[   74.743648] spi spi0.0: setting up native-CS0 as GPIO 8
[   74.744022] fbtft_of_value: width = 320
[   74.744043] fbtft_of_value: height = 480
[   74.744058] fbtft_of_value: buswidth = 8
[   74.744074] fbtft_of_value: debug = 3
[   74.744087] fbtft_of_value: rotate = 270
[   74.744101] fbtft_of_value: fps = 30
[   74.744114] fbtft_of_value: txbuflen = 32768
[   74.744977] fb_s6d02a1 spi0.0: fbtft_request_one_gpio: 'reset-gpios' = GPIO15
[   74.745033] fb_s6d02a1 spi0.0: fbtft_request_one_gpio: 'dc-gpios' = GPIO25
[   74.745124] fb_s6d02a1 spi0.0: fbtft_request_one_gpio: 'led-gpios' = GPIO18
[   74.745267] fb_s6d02a1 spi0.0: fbtft_verify_gpios()
[   74.745286] fb_s6d02a1 spi0.0: fbtft_init_display_dt()
[   74.745304] fb_s6d02a1 spi0.0: fbtft_reset()
[   74.865525] fb_s6d02a1 spi0.0: init: write_register: B0 00
[   74.865597] fb_s6d02a1 spi0.0: init: write_register: 11
[   74.865621] fb_s6d02a1 spi0.0: init: msleep(255)
[   75.131132] fb_s6d02a1 spi0.0: init: write_register: B3 02 00 00 00
[   75.131192] fb_s6d02a1 spi0.0: init: write_register: C0 13 3B 00 02 00 01 00 43
[   75.131234] fb_s6d02a1 spi0.0: init: write_register: C1 08 16 08 08
[   75.131271] fb_s6d02a1 spi0.0: init: write_register: C4 11 07 03 03
[   75.131308] fb_s6d02a1 spi0.0: init: write_register: C6 00
[   75.131355] fb_s6d02a1 spi0.0: init: write_register: C8 03 03 13 5C 03 07 14 08 00 21 08 14 07 53 0C 13 03 03 21 00
[   75.131395] fb_s6d02a1 spi0.0: init: write_register: 35 00
[   75.131426] fb_s6d02a1 spi0.0: init: write_register: 36 A0
[   75.131456] fb_s6d02a1 spi0.0: init: write_register: 3A 55
[   75.131487] fb_s6d02a1 spi0.0: init: write_register: 44 00 01
[   75.131521] fb_s6d02a1 spi0.0: init: write_register: D0 07 07 1D 03
[   75.131559] fb_s6d02a1 spi0.0: init: write_register: D1 03 30 10
[   75.131597] fb_s6d02a1 spi0.0: init: write_register: D2 03 14 04
[   75.131627] fb_s6d02a1 spi0.0: init: write_register: 29
[   75.131650] fb_s6d02a1 spi0.0: init: write_register: 2C
[   75.131673] fb_s6d02a1 spi0.0: set_var()
[   75.173576] fb_s6d02a1 spi0.0: Display update: 7167 kB/s (41.850 ms), fps=0 (0.000 ms)
[   75.173595] fb_s6d02a1 spi0.0: fbtft_register_backlight()
[   75.174103] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=33, spi0.0 at 120 MHz
[   75.174122] fb_s6d02a1 spi0.0: fbtft_backlight_update_status: polarity=0, power=0, fb_blank=0
  • /etc/interrupts:
           CPU0       CPU1       CPU2       CPU3
 16:          0          0          0          0   ARMCTRL  16  bcm2708_fb dma
 24:         96          0          0          0   ARMCTRL  24  DMA IRQ
 25:       1407          0          0          0   ARMCTRL  25  DMA IRQ
 26:         10          0          0          0   ARMCTRL  26  DMA IRQ
 27:         10          0          0          0   ARMCTRL  27  DMA IRQ
 32:      34186          0          0          0   ARMCTRL  32  dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1
 49:          0          0          0          0   ARMCTRL  49  3f200000.gpio:bank0
 50:          0          0          0          0   ARMCTRL  50  3f200000.gpio:bank1
 65:         20          0          0          0   ARMCTRL  65  3f00b880.mailbox
 66:          2          0          0          0   ARMCTRL  66  VCHIQ doorbell
 75:          1          0          0          0   ARMCTRL  75
 80:          0          0          0          0   ARMCTRL  80  3f204000.spi
 83:          6          0          0          0   ARMCTRL  83  uart-pl011
 84:       3321          0          0          0   ARMCTRL  84  mmc0
 99:       2183       2219       1315       1881   ARMCTRL  99  arch_timer
484:          0          0          0          0  pinctrl-bcm2835   4  ads7846
FIQ:              usb_fiq
IPI0:          0          0          0          0  CPU wakeup interrupts
IPI1:          0          0          0          0  Timer broadcast interrupts
IPI2:       2980       3009       2612       3528  Rescheduling interrupts
IPI3:         12         12         14         16  Function call interrupts
IPI4:          1          3          3          2  Single function call interrupts
IPI5:          0          0          0          0  CPU stop interrupts
IPI6:          0          0          0          0  IRQ work interrupts
IPI7:          0          0          0          0  completion interrupts
Err:          0
  • so:
    • 10 RX-DMA-IRQ
    • 10 TX-DMA-IRQ
    • 0 SPI-IRQ
  • basic pattern:
    screen shot 2015-07-29 at 15 03 26
  • 41531 us for the 10 32k Transfers (CS-Start to CS-end)
  • measurement for "gaps" between 32kb transfers:
    • Graph:3
      screen shot 2015-07-29 at 15 08 09
    • 249us from clock end to clock start (B1-D2)
    • 30us from clock end to CS-deassert (B1-C1)
    • 212us from CS-deassert to CS-assert (C1-D1)
    • 7us from CS-asset to CS (D1-D2)
  • a full Transfer of 32k takes:
    • 4194us clock running
    • 4453us from start clock to start clock
    • so an overhead of 259us or 6%
  • when sampling at 500MHz I got 62.5MHz clock

So I came to the conclusion that I need to configure at 125MHz and see if the clock then works as expected.

So with 125MHz I get now the expected 125MHz clock rate as well, but the fps is still fixed to 33.
But here some more measurements:

  • 32k transfers take now:
    • 2097us clock running
    • 2476us from start clock to start clock
    • so an overhead of 379us or 16%

All this would not indicate that we run faster, but maybe that reporting can not do better estimation of the FPS.
If I run mplayer with BigBuckBunny I get the following pattern for updates:

  • screen shot 2015-07-29 at 16 08 44
  • 23374us for the transfer itself (first CS down to last CS up)
  • 50000us between first CS-down and next first CS-down

This means we could fit another full screen update in between, which would bring us to 25ms or 40fps, but that means that this will be about as high as we can go - the big question is: how much other overhead do we have before we can run any updates.

Hope this gives you some insight into what the driver is actually doing...

@notro
Copy link
Contributor Author

notro commented Jul 29, 2015

[   75.174103] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 32 KiB DMA buffer memory, fps=33, spi0.0 at 120 MHz

The fps=33 here is not the measured frame rate. It's a misnomer that really means: we have a delay of HZ/fps=100/33=3 jiffies from the time of the first page touched in videomem until the update kernel thread is woken.
We will get this 33 fps rate if the time to update the display is smaller that what we have with our current display. A 176x220 display would have yielded this framerate for instance.

I tried fps=100 (1 jiffy delay) but got the same measured 25 fps as with fps=50 (2 jiffies).
So this display needs 2 jiffies to get the videomem dirty pages in order between updates.

This is how fbtft works:

  • Userspace writes to videomemory
  • fbdev deferred io collects page writes and schedules a thread to wake up in HZ/fps jiffies.
  • fbtft is woken and turns the changed pages into a start and end line on the display.
  • This part of video memory is broken up into txbuflen chunks
  • Because the Pi is little endian and the controller interface is big endian, each rgb565 pixel is byte swapped (hence the need for a tx buffer). This is a fixed time between each SPI transfer.

Now SPI takes the buffer, allocates a rx buffer, dma maps buffers and sends them to dma engine.
I don't know how much time is spent on the rx buffer part.

Looking at your numbers I wonder why the overhead goes from 259us to 379us when the speed is doubled. Shouldn't these numbers be more or less the same?

Ideally, fbtft should have dynamically switched to a continous update mode when it detects that all of videomem is updated several times in a row. This could have shaved off the 2 jiffies delay.

I tried increasing txbuflen to 64k, but that gave me some USB errors, and a lousy measured framerate:

[    8.828145] graphics fb1: fb_s6d02a1 frame buffer, 480x320, 300 KiB video memory, 64 KiB DMA buffer memory, fps=50, spi0.0 at 128 MHz

[   66.592321] fb_s6d02a1 spi0.0: Display update: 2506 kB/s (119.647 ms), fps=0 (-628294215.249 ms)
[   66.733336] fb_s6d02a1 spi0.0: Display update: 2485 kB/s (120.690 ms), fps=7 (140.097 ms)
[   66.882983] fb_s6d02a1 spi0.0: Display update: 2299 kB/s (130.443 ms), fps=7 (139.891 ms)
[   67.029706] fb_s6d02a1 spi0.0: Display update: 2359 kB/s (127.151 ms), fps=6 (150.011 ms)
[   67.163600] fb_s6d02a1 spi0.0: Display update: 2494 kB/s (120.249 ms), fps=7 (140.793 ms)
[   67.311376] Transfer to device 4 endpoint 0x1 frame 1657 failed - FIQ reported NYET. Data may have been lost.
[   67.314969] fb_s6d02a1 spi0.0: Display update: 2265 kB/s (132.411 ms), fps=7 (139.203 ms)
[   67.450592] fb_s6d02a1 spi0.0: Display update: 2541 kB/s (118.009 ms), fps=6 (150.010 ms)
[   67.465342] fb_s6d02a1 spi0.0: Display update: 9901 kB/s (2.083 ms), fps=7 (130.686 ms)
[   67.602961] fb_s6d02a1 spi0.0: Display update: 2493 kB/s (120.309 ms), fps=51 (19.392 ms)
[   67.750837] fb_s6d02a1 spi0.0: Display update: 2337 kB/s (128.272 ms), fps=7 (139.910 ms)
[   67.886339] fb_s6d02a1 spi0.0: Display update: 2437 kB/s (123.042 ms), fps=7 (140.730 ms)
[   68.033103] fb_s6d02a1 spi0.0: Display update: 2298 kB/s (130.469 ms), fps=7 (139.334 ms)

Really nice to have access to an analyzer like this. Much easier to make descisions with access to hard facts. I have returned to working on fbtft now, especially the io path, so this is a helpful walkthrough for me.

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

@notro: Why does it vary?

I have only taken a single example to show the "situation" and did not go over it taking 10s of 100s of samples.

Well, there are lots of factors here - some of them have to do with scheduling and also I know that the first time a code-path is used it is about 70% slower - this has to do with the code being in the different CPU caches - the more often it runs the quicker. So whenever a "thread" is switched CPU everything is slower as it needs to fill the caches...

If you want I can run some more samples and I can send you a link to download the data - the SW itself you can get from the producer of the logic analyzer and you can read the files and analyze them...

But I guess that this is better happening in a different discussion or offline.

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

note that the patch for the "time estimate" has been accepted already.
the "bugfix" for alternate modes is still pending - you may want to acknowledge it or send "tested-by" to help maybe...

@notro
Copy link
Contributor Author

notro commented Jul 29, 2015

Thanks, no need to run more samples. I don't know much about cpu caches and process/thread switching time. Have you tried running the queue with realtime priority (rt=true), what's your experience with that?

@msperl
Copy link
Contributor

msperl commented Jul 29, 2015

Rt does not make a huge difference any longer on a multi-CPU system.

Also the designs of running the spi message pump inline for spi_sync calls when there is no activity makes this somewhat redundant - you would need to run the calling thread with rt priority to make it work.

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this pull request Feb 27, 2017
kernel: BCM2835_V4L2: Add support for V4L2_EXPOSURE_METERING_MATRIX
See: raspberrypi/linux#1068

kernel: dmaengine: bcm2708-dmaengine: Fix memory leak when stopping a running transfer
See: raspberrypi/linux#1072

kernel: BCM270X_DT: mz61581: Revert to spi-bcm2708
See: raspberrypi/linux#1077

kernel: bcm2708/2835-i2s: Fix for PCM register ranges in device trees
See: raspberrypi/linux#1079

kernel: bcm2835-sdhost: Add the ERASE capability
See: raspberrypi/linux#1076

kernel: bcm2835-sdhost: Ignore CRC7 for MMC CMD1

kernel: BCM270X_DT: Add unit address to gpio node name

kernel: spi-bcm2835: merge upstream patches allowing DMA transfers
See: raspberrypi/linux#1085

kernel: BCM270X_DT: Use i2c_arm for rtc and bmp085 overlays

kernel: BCM2708_DT: CM dtparams for audio, watchdog and RNG

firmware: video_decode: Don't wait for a valid timestamp to output frames
See: raspberrypi#451
anholt pushed a commit to anholt/linux that referenced this pull request Oct 9, 2017
Our fbdev setup requires the device to be awake for access
through the GTT. If one boots without connected displays and
later plugs one in, we won't have any runtime PM references when
the fbdev setup runs. Explicitly grab a runtime PM reference during
the fbdev setup to avoid the following spew:

[   62.518435] RPM wakelock ref not held during HW access
[   62.518459] ------------[ cut here ]------------
[   62.518546] WARNING: CPU: 3 PID: 37 at ../drivers/gpu/drm/i915/intel_drv.h:1800 i915_vma_pin_iomap+0x144/0x150 [i915]
[   62.518585] Modules linked in: i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_gtt agpgart netconsole nls_iso8859_1 nls_cp437 vfat fat efi_pstore coretemp hwmon intel_rapl x86_pkg_temp_thermal e1000e efivars ptp pps_core video evdev ip_tables x_tables ipv6 autofs4
[   62.518741] CPU: 3 PID: 37 Comm: kworker/3:1 Not tainted 4.13.0-rc7-skl+ raspberrypi#1077
[   62.518770] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
[   62.518827] Workqueue: events i915_hotplug_work_func [i915]
[   62.518853] task: ffff88046c00dc00 task.stack: ffffc90000184000
[   62.518896] RIP: 0010:i915_vma_pin_iomap+0x144/0x150 [i915]
[   62.518919] RSP: 0018:ffffc90000187cc8 EFLAGS: 00010292
[   62.518942] RAX: 000000000000002a RBX: ffff880460044000 RCX: 0000000000000006
[   62.518969] RDX: 0000000000000006 RSI: ffffffff819c3e6f RDI: ffffffff819f1c0e
[   62.518996] RBP: ffffc90000187cd8 R08: ffff88046c00e4f0 R09: 0000000000000000
[   62.519022] R10: ffff8804669ca800 R11: 0000000000000000 R12: ffff880461d20000
[   62.519049] R13: ffffc90000187d48 R14: ffff880461d20000 R15: ffff880460044000
[   62.519076] FS:  0000000000000000(0000) GS:ffff88047ed80000(0000) knlGS:0000000000000000
[   62.519107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.519130] CR2: 000056478ae213f0 CR3: 0000000002c0f000 CR4: 00000000003406e0
[   62.519156] Call Trace:
[   62.519190]  intelfb_create+0x176/0x360 [i915]
[   62.519216]  __drm_fb_helper_initial_config_and_unlock+0x1c7/0x3c0 [drm_kms_helper]
[   62.519251]  drm_fb_helper_hotplug_event.part.18+0xac/0xc0 [drm_kms_helper]
[   62.519282]  drm_fb_helper_hotplug_event+0x1a/0x20 [drm_kms_helper]
[   62.519324]  intel_fbdev_output_poll_changed+0x1a/0x20 [i915]
[   62.519352]  drm_kms_helper_hotplug_event+0x27/0x30 [drm_kms_helper]
[   62.519395]  i915_hotplug_work_func+0x24e/0x2b0 [i915]
[   62.519420]  process_one_work+0x1d3/0x6d0
[   62.519440]  worker_thread+0x4b/0x400
[   62.519458]  ? schedule+0x4a/0x90
[   62.519475]  ? preempt_count_sub+0x97/0xf0
[   62.519495]  kthread+0x114/0x150
[   62.519511]  ? process_one_work+0x6d0/0x6d0
[   62.519530]  ? kthread_create_on_node+0x40/0x40
[   62.519551]  ret_from_fork+0x27/0x40
[   62.519569] Code: c4 78 e6 e0 0f ff e9 08 ff ff ff 80 3d d5 bc 0c 00 00 0f 85 0b ff ff ff 48 c7 c7 d8 50 32 a0 c6 05 c1 bc 0c 00 01 e8 9d 78 e6 e0 <0f> ff e9 f1 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00 0f b6 87 98
[   62.519771] ---[ end trace 5fbe271f991a58ae ]---

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170901195456.6386-1-ville.syrjala@linux.intel.com
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants