Increasing SD card write performance RP2040 #269
Comments
Hi Morio, the changes in the PIO are just an attempt to increase clock speeds and are not essential. I moved the clock to a separate SM using an interrupt. I left the old clock in the code and rerouted it to pin 22 for debugging, but it can be removed. The main contribution to the write performance is omitting the CMD12 in the stopTransmission function while a continuous write is in progress. At the moment I solve this with the writeStart function, which I call before writing a large chunk of data. It sets the multi_write variable and the boundary of the write with the multi_write_end variable. In my application I only write large continuous files of around 1 GB, but for ZuluSCSI the way the data is written is quite different. I can imagine that keeping track of when a continuous write is happening and when it is ending could be done internally, without defining it beforehand. This would make it harder to issue the pre-erase command, but from some tests this does not seem to affect performance that much. I merged some changes from the SD card cache branch of ZuluSCSI; performance does not change for continuous writes, but I haven't tested smaller writes. |
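The writeStart approach described above can be sketched as a small state holder that knows the announced sector range and reports when the open multi-block write has to be closed with CMD12. This is a minimal sketch, not the code from the repository: `MultiWriteWindow`, `writeNeedsStop` and the field names are hypothetical, and the real driver would issue CMD25/CMD12 where this sketch only returns a flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker for an announced continuous write (illustrative names).
struct MultiWriteWindow {
    bool active = false;      // set by writeStart(), cleared when the range ends
    uint32_t nextSector = 0;  // sector the next write is expected to start at
    uint32_t endSector = 0;   // one past the last sector of the announced range

    // Announce a continuous write of `count` sectors beginning at `first`.
    void writeStart(uint32_t first, uint32_t count) {
        assert(count > 0);
        active = true;
        nextSector = first;
        endSector = first + count;
    }

    // Record a write of `count` sectors at `sector`.
    // Returns true when the open multi-block write must be terminated with
    // CMD12: no range was announced, the write falls outside the range or is
    // not contiguous, or this write completes the range.
    bool writeNeedsStop(uint32_t sector, uint32_t count) {
        if (!active || sector != nextSector || sector + count > endSector) {
            active = false;
            return true;
        }
        nextSector = sector + count;
        if (nextSector == endSector) {
            active = false;
            return true;
        }
        return false;  // stay in write mode, skip CMD12
    }
};
```

With a range of 16 sectors announced at sector 100, a first write of 8 sectors returns false (CMD12 skipped) and the second write of 8 sectors returns true because it completes the range.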
I am the author of SdFat and am interested in adding an SDIO mode for RP2040. I looked at all the versions of SDIO for RP2040 that have evolved from this demo on the Raspberry Pi GitHub site. None of the existing implementations provide improved performance for the way most applications use SD cards. Before I start doing yet another version, I decided to see if you are interested in developing a version that is fast for all sizes of transfers, not just very large transfers. First let me describe how SD cards have evolved. The first SD cards were 8 or 16 MB FAT12 and truly had 512 byte flash sectors. Now cards have huge flash pages and emulate 512 byte sectors. There are large RAM caches in modern cards. Here are two definitions from the SD standard for how flash is managed in a card.
Here are sizes of AU and RU for different classes of cards. Every time you do a single-block transfer or end a multiple-block transfer, an RU is programmed. Eventually flash becomes fragmented and data must be moved. Here is a description from the standard.
This means that many things in your implementation only apply to twenty-year-old cards, like pre-erase commands. I don't want to lose the above with an edit error, so I will start another post to explain the two modes for card access in SdFat. |
SdFat has two modes for SD card access. I have two modes since it is not always possible to implement or use the faster mode. The fast mode attempts to use the largest read or write transfer possible. For SPI this requires a dedicated SPI interface. For SDIO I have only been able to implement the fast mode on the NXP processors in Teensy 3.x and 4.x boards. I think there is an implementation for a few STM32 chips, but it's not possible to use it with the standard STM32 board support package. The fast mode on SDIO has the unfortunate name FIFO_SDIO since I couldn't get DMA to work in this mode on NXP, so DMA_SDIO is the slow mode. The fast version depends on implementing these member functions for multi-block transfers:
Here is an example of the difference for 512 byte SPI transfers on Pico with Earle Philhower's package at 133 MHz. The slow SHARED_SPI mode:
The fast DEDICATED_SPI mode:
Even the Teensy 4.1 SDIO is slow for 512 byte DMA transfers:
Here is Teensy 4.1 FIFO_SDIO for 512 byte transfers at 50 MHz SD clock. People have obtained faster rates by overclocking.
I am hoping there is a way to implement the faster mode on RP2040. I have tried on ESP32 and gave up. The SDIO controller or board support packages for other MCUs have stopped me. Let me know if you are interested. Edit: Another problem with SDIO for SD cards is the requirement of 32-bit alignment for most controllers. This means tmp buffers and memcpy. Depending on how a file is written, you can't solve the problem with buffer alignment. Here is one case; there are many more. If you write 511 bytes in the first write, then in the next write one byte will be moved to the cache and the cache will be written. Now the remaining data in the second write is not 32-bit aligned. All these problems make fast SDIO for apps difficult. Use of direct access to the FIFO eliminated the 32-bit alignment and DMA problems on the Teensy NXP MCU. |
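The 511-byte case can be made concrete with a little bookkeeping. The sketch below is a simplified model of a single 512-byte sector cache, not SdFat's actual code; `WriteSplit` and `splitWrite` are hypothetical names. It splits one write call into the bytes that finish the cached sector, the whole sectors that could be sent directly from the user buffer, and the leftover tail, and reports whether the direct part starts on a 4-byte boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// How one write() call is divided between the sector cache and direct I/O.
struct WriteSplit {
    size_t toCacheHead;  // bytes that complete the partially filled cached sector
    size_t directBytes;  // whole sectors that could go straight from the user buffer
    size_t toCacheTail;  // leftover bytes that start a new cached sector
    bool directAligned;  // true if the direct part starts on a 4-byte boundary
};

// filePos: current file position, bufAddr: address of the user buffer,
// len: number of bytes in this write call.
WriteSplit splitWrite(size_t filePos, uintptr_t bufAddr, size_t len) {
    assert(len > 0);
    const size_t kSector = 512;
    WriteSplit s{};
    const size_t offset = filePos % kSector;
    if (offset != 0) {
        // The cached sector is partly filled; top it up first.
        s.toCacheHead = std::min(len, kSector - offset);
    }
    const size_t rest = len - s.toCacheHead;
    s.directBytes = (rest / kSector) * kSector;
    s.toCacheTail = rest - s.directBytes;
    s.directAligned = ((bufAddr + s.toCacheHead) % 4) == 0;
    return s;
}
```

For a 2048-byte write at file position 511 from a 4-byte-aligned buffer, one byte goes to the cache, 1536 bytes are candidates for direct transfer, and they start at an odd address, so a controller that needs 32-bit alignment cannot take them without a copy.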
Here is a good test of whether an SDIO implementation will provide an improvement over dedicated SPI for most apps. Try a test with 511 byte file read/write. Dedicated SPI doesn't degrade much on RP2040. This size causes all data to be copied to/from the internal SdFat cache.
Here is what the SPI transfer looks like for a read. There is a bit of a gap between bytes and the clock seems near 24 MHz. |
Hey @greiman, thanks for all the information. It clearly explains why sending the CMD12 in the middle of an AU creates such latency spikes. I tried my code with a smaller buffer size and smaller file size.
In the case above the read commands are still being terminated using CMD12, but I am not quite sure if this causes the speed difference. It seems likely that SDIO on the RP2040 would be able to be faster than the SPI mode, although maybe not as fast as the FIFO mode from the NXP MCUs. At the moment I am mostly interested in high speeds on large files, as I am reading out a linear CCD from a scanner and writing it directly to an SD card. The ADC of the CCD can generate up to 40 MB/s, although I am running it at 10 MB/s at the moment. The sensor is used to make a scanning camera that produces images of around 100 MP in raw 48-bit TIFF format, so that needs a lot of throughput on the SD card. I can take a look at implementing these functions, although I am no expert, as the code I am using is mostly not written by me. |
I use multi-block writes always. I just don't terminate the write. The implementation with infinite transfers is as fast or faster for large writes. Unless you have buffer for a whole AU there will be increased latency occasionally. People with Teensy 4.1 no longer use the DMA mode even with large writes. To achieve max SD performance you need to write at least 4MB as a single multi-block transfer. SdFat does this and if you do writes of multiples of 512 bytes there are no memcpy calls. |
I assume you end the write using the sync function, which does send the CMD12, and that this automatically gets called when a sector is written that does not start where the last write finished. |
I only use the sync() function when I am switching between read and write mode, the transfer is not contiguous, or file close is called. If multiple files are open, large transfers are required for high performance since interleaved access will cause sync() to be called. Most MCU SDIO controllers use CMD12. You can't send a stop token like SPI. |
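The rule described here, where no range is announced up front and the driver decides on each access whether the open transfer can continue, might look like the sketch below. It is an illustration of the decision only, under the assumption that the driver remembers the current mode and the next expected sector; `TransferState`, `needsSync` and `access` are hypothetical names, not SdFat functions.

```cpp
#include <cassert>
#include <cstdint>

enum class Mode { Idle, Read, Write };

// What the driver remembers about the currently open multi-block transfer.
struct TransferState {
    Mode mode = Mode::Idle;
    uint32_t nextSector = 0;  // sector that would continue the open transfer
};

// True if the open transfer must be ended (CMD12 on SDIO) before this access.
bool needsSync(const TransferState& st, Mode requested, uint32_t sector) {
    if (st.mode == Mode::Idle) return false;  // nothing is open
    if (st.mode != requested) return true;    // switching between read and write
    return sector != st.nextSector;           // access is not contiguous
}

// Apply an access of `count` sectors and report whether a sync was required.
bool access(TransferState& st, Mode requested, uint32_t sector, uint32_t count) {
    assert(requested != Mode::Idle && count > 0);
    const bool synced = needsSync(st, requested, sector);
    st.mode = requested;
    st.nextSector = sector + count;
    return synced;
}
```

Two files written alternately produce a non-contiguous sector on every call, so each access pays for a sync; that is why interleaved access only performs well with large transfers.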
That makes sense; for the write part it seems doable then. I was wondering about the read part: the Teensy has a FIFO which is filled with data when the readstart is called, if I read correctly. So when the read function gets called it grabs the data from the FIFO and returns it. This seems to work a lot faster with small buffer sizes, as I assume the overhead of initializing and ending a read is causing the low write speeds seen in my benchmark. Perhaps it is possible to mimic this on the RP2040 with a small internal buffer and a way to pace the reading of the sectors as the buffer gets full. |
Yes this is why read is fast on Teensy. I was afraid of this on RP2040. This is also why I have not implemented SDIO on STM32 chips. Only recent STM32 chips have a large FIFO so you can't pause reads. Attempts to use SDIO on most MCUs have not resulted in improved performance for typical Arduino users. On Teensy it only takes about 5 μs to fill the FIFO and the same if the read FIFO is full. Big overlap of I/O and processing is possible. |
I can imagine that most of the time the speed of the SD card is not that crucial anyway when using an Arduino. After searching through the non-simplified SD card spec, it seems you can pause the read operation by disabling the clock signal in between blocks. This is specified in "4.12.5.2 Read Block Gap"; I am not sure if this only works for UHS or also for the slower SD protocol. I guess this is possible on the RP2040, as we have full control over the clock signal and can stop/start it any time we want. |
Yes, it does not work for most MCU SDIO controllers. |
I really hope there will be an improved RP2040 SDIO driver. I hate to take on maintenance of another custom driver. I can't get Arduino to improve any SPI drivers. They won't accept any changes to the standard Arduino SPI API. |
I wrote some code that should work similarly to your write functions. I don't have the Pi Pico with SD card here, so I will test it when I get home and see how it manages with the 511 byte buffer size. For the reading part I am a bit uncertain what the right approach would be, as the current implementation writes the received data directly to the dst buffer using DMA. When introducing a FIFO buffer in software, the DMA would write to that location, but it would mean that the data has to be transferred from the software FIFO to the dst buffer, basically doubling the memory access. Maybe I am overthinking it, but I would imagine that a neater solution with some kind of queue could be possible, where initially the data would be read from the FIFO, but as the FIFO gets empty the requested transfers would go into a queue and the DMA would be configured to write directly to the addresses in the queue, skipping the FIFO. The other hurdle would be stopping and starting the clock at the right time, as the timing seems quite crucial. I will have to see if I can attach my logic analyzer to get the timing right. Can I ask which logic analyzer you are using? I only have a cheap 24 MHz one from AliExpress, so I probably have to reduce the clock speed quite a bit. Benchmark with new code for 511 bytes:
with the 1024 byte buffer:
Somehow the performance decreased a bit, but at least the 511 byte write is still faster than the SPI mode. The performance loss seems related to the fact that WriteSectors calls WriteData in a loop and WriteData has some initialization, so there is some overhead. Moving the initialization to WriteStart and allowing writes to be queued would reduce the overhead a bit. |
Looks good for write. I expected read to be difficult.
I was also thinking a software FIFO would be a possible solution for read. I expect memcpy will be needed for read and write. Users read/write data that is not 32-bit aligned. Also unless all read/write calls are for a multiple of four bytes it is possible that part of the transfer completes a sector in the cache and the remainder is not 32-bit aligned. This is not a problem with NXP since Teensy is Cortex M4 or M7. I can transfer data in 32-bit chunks between the FIFO and user buffers in a loop. At 600 MHz this takes about 5 μs.
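The loop that moves 32-bit words from a FIFO into a user buffer of arbitrary alignment can be sketched as follows. This is an illustration only: a plain array stands in for the hardware FIFO register, and `drainWords` is a hypothetical name. Going through `memcpy` for each word avoids unaligned word stores, which matters on the RP2040 because its Cortex-M0+ cores do not support unaligned accesses, while a Cortex-M4 or M7 can store words to unaligned addresses directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy `words` 32-bit words from `fifo` into `dst`, which may have any alignment.
// `fifo` is an ordinary array here; on real hardware each read would come
// from the controller's FIFO register instead.
void drainWords(const uint32_t* fifo, size_t words, uint8_t* dst) {
    assert(fifo != nullptr && dst != nullptr);
    for (size_t i = 0; i < words; ++i) {
        const uint32_t w = fifo[i];
        std::memcpy(dst + 4 * i, &w, sizeof w);  // safe for unaligned dst
    }
}
```

A 512-byte sector is 128 such words, so the per-sector cost is one short loop rather than a second full buffer copy.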
I recently bought a high end Saleae Logic Pro 8. It can do 500 MS/s digital 100 MHz bandwidth. Expensive but I really like it. I have a 200 MHz 2GS/s mixed signal scope but I rarely use it after buying the Saleae.
The NXP has a 512 byte FIFO and stops the clock at the end of a block. Really simple and reliable to use. The numbers you are getting look good. I will offer both SPI and SDIO on RP2040. For most apps SPI will be fine. After some thought I decided that even if SDIO only offers high speed for big transfers that are 32-bit aligned, it will be valuable for sophisticated users. Some users just pre-allocate a contiguous file and write by doing raw writes. I make it easy with this member function:
They then can do calls like this:
|
I'm not completely sure, but I think I have an application like that. For audio, I always read 1024 bytes at a time (but the exact size could be reconfigured). Or does that not apply when using the Arduino SDFat & File classes? Thank you so much for looking into SDIO! |
Just back from vacation, so I haven't taken a deep look into what was written above. But this branch may also be of interest, raising SDIO clock rate to 42 MHz: In ZuluSCSI there are other factors currently limiting speed to around 10 MB/s, so I haven't focused on improving SDIO performance further. |
I have already implemented something like that in: https://github.com/juico/pico-sdio-example/blob/78bbb70ba843d1cb6095a4c64a30250720d1bb26/src/sdio/sd_card_sdio.cpp#L267 I also tried adding the code to get the cache working (from the cache branch), but I get the feeling that writing to the extended registers does not work completely; perhaps I messed something up myself. The bit enabling the cache is read as a 0 after writing a 1 to it, so that is a bit weird. Did you end up getting the cache to work?
I already assumed something like that. But it seems that for writing the solution can be quite simple, and it does yield lower variation in write latency for the bigger files and faster write speeds when smaller chunks are written. The branch https://github.com/juico/pico-sdio-example/tree/fast_test shows the infinite write mode without specifying how many bytes are to be written. It does not pre-erase the sectors, but it seems that this does not matter that much in terms of write speed. It is perhaps not the neatest implementation, but hopefully you can see the mechanism. |
Pre-erase is not used in modern cards. See my post above about AUs and RUs. AUs are not related to file system allocation. AUs are about flash management, the mapping of emulated blocks to physical flash, and wear leveling. Modern cards maintain a pre-erased cache of AUs. AUs can be as large as 64 MB in large SDXC cards. Understanding modern card flash management is important to get performance. Small writes cause incredible amounts of data copying and flash wear. Small reads cause excessive re-reads of huge flash pages into internal RAM buffers in the card. This depends on the card's buffering strategy. Some cards define different strategies by AU. If small writes were used, on read the card will optimize for small reads. The amount of RAM buffering and cache policy varies by card class/product. I am now trying to get a basic SDIO driver working with this Arduino board package. I just took files from this repository and got them to compile with SdFat. That was simple, but when I tried to init an SD it was not reliable. I think it is a clocking problem. I need to look with a logic analyzer. I just tried the default 133 MHz CPU speed. I plan to start over with just code from this repository for a test with the board package so I don't mix in any SdFat code. I will then try to understand any problems. I need to decide which GPIOs to use with Earle Philhower's Arduino Pico package. I probably should offer options. Any suggestions which GPIOs to use? |
Only A2 class SD cards support the SD-card cache. But it didn't seem to help much for write performance in my tests. |
All cards have internal RAM buffers and an internal cache policy. A2 cards expose an API. I worked with Teensy developers to find the best SDs for multi-track audio recording. At the time CANVAS Go! Plus 256GB provided the best performance. I just let the card use its default management. This card is capable of recording 16 streams to 16 open files with a max latency of 2022 μs.
What you are doing is fairly simple, so just using big transfers and, if possible, writing the file as a single multi-block transfer should work with most cards. Actually the transfer size doesn't matter much. It's the infinite multi-block write that matters. I just put the card in write mode and write GB size files as a single transfer. The Teensy Audio library is an impressive accomplishment - pro audio at low cost. It has graphical design tools so you just draw your recording setup and it generates the code. |
Here is single file performance for 512 byte writes with a CANVAS Go! Plus 256GB clocked at 100 MHz:
|
The only crucial thing about the GPIOs is that the D0-D3 pins are mapped to ascending, neighboring GPIO pins, as this is required for the way the PIO operates when sending parallel data.
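One way to make that constraint hard to violate is to configure only the D0 pin and derive the other three from it. The helpers below are a hypothetical sketch, not part of any of the libraries mentioned in this thread; the only hardware fact they rely on is that the RP2040 numbers its user GPIOs 0 to 29.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint8_t kMaxGpio = 29;  // RP2040 user GPIOs are numbered 0..29

// Derive D0..D3 from the D0 pin so they are consecutive by construction.
std::array<uint8_t, 4> dataPinsFromBase(uint8_t d0) {
    assert(d0 + 3 <= kMaxGpio);
    std::array<uint8_t, 4> pins{};
    for (size_t i = 0; i < pins.size(); ++i) {
        pins[i] = static_cast<uint8_t>(d0 + i);
    }
    return pins;
}

// Check a hand-written pin list: D0..D3 must be four ascending neighbors.
bool dataPinsAreConsecutive(const std::array<uint8_t, 4>& d) {
    for (size_t i = 1; i < d.size(); ++i) {
        if (d[i] != d[i - 1] + 1) return false;
    }
    return true;
}
```

A board definition could then carry only the D0 number plus the CLK and CMD pins, and a pin list supplied by a user can be rejected early instead of failing at card init.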
Yes, I am using an A2 card and the registers show that the cache can be enabled. But after writing a 1 to byte[260] of the performance register, the cache does not seem to be enabled. As I copied the code from the cache branch of ZuluSCSI, I was just wondering if it showed the cache as enabled in your case, as you also didn't notice any performance increase. This is what I see during startup:
I don't expect a lot of improvement in throughput with the cache enabled, but for my application I would like to get the data off the RP2040 as quickly as possible. So I am mostly worried about sudden latency spikes during writes that might cause the buffer capturing the data to fill up, as stopping the acquisition is not really possible.
At the moment it seems like the current solution works, so I will not really focus on getting the cache working, but I was just curious. |
When I discovered how adaptive high end cards are, I just let the card decide. Did you see above what the Canvas Go can do? Over 40 MB/sec with 512 byte transfers. It needs only the internal SdFat cache and a 512 byte user buffer.
I think I will just offer the DMA solution that I can put together from existing code. I will also look into a higher SPI clock. The board package currently limits the rate to less than 25 MHz. On other boards I can get over 4 MB/sec SPI.
Yes that's how I started using Pico before there was Arduino support. |
Those are some impressive speeds you are getting. I have not been able to push past 67.5 MHz, which resulted in around 29 MB/s. Hopefully the RP2040 can reach 100 MHz on SDIO; at the moment I am probably limited by my breadboard setup. For large transfers I do seem to have a large latency spike, probably at the end of the write, but that is not really an issue. |
Thanks that helps. That's the config I was setting up with the same GPIOs for my next test. Same short wire to the SD. I am definitely going to play with this for a while before releasing anything for Arduino. Too many people use SdFat and Pico. I would be overwhelmed with issues if it isn't solid for beginners. Here is proof that card policy matters. |
One last thing: a counterfeit card kills performance. Here is a bad Evo Select card at 100 MHz:
Here is a $10 real Evo Select. For one file almost any card works.
|
When you switch to High Speed mode, do you change how you read card output? I have always thought the change in valid time was strange. Default speed: High Speed: Clock is low for most of the Card Output valid. |
I leave the timing the same. I tried switching the PIO code when going into high speed mode, but I removed it, as with the default timings it seems to work fast enough. Maybe with some better measurements the timings could be improved a bit. From the timing diagrams one would say to read the data a bit earlier in high speed mode. If you could get it working, it would be nice to see the difference in card output on the logic analyzer. |
Your comments and examples have been very helpful. I am making progress with SDIO and Earle Philhower's Arduino board package. I see why it gets 1.3k stars. There have been over 90 contributors to the repo. I think it is mostly used with PlatformIO. So you get existing Arduino libraries, Pico SDK features, and PlatformIO. The first SDIO break-out I tried had weak pull-ups and the code I was using had the default internal pull-downs enabled. So the lines were at about 1.3V and noise caused failures. I changed to weak internal pull-ups. I am testing with three popular SD breakouts to make sure it will work for most users. They vary from no pull-ups to 10K pull-ups. I will need to offer a number of pin configurations. The above package supports over 50 boards. I will pick a few popular boards and try to make adapting simple. pioasm is in the package. |
You might also want to check out https://github.com/carlk3/no-OS-FatFS-SD-SPI-RPi-Pico/tree/sdio, which is also based on the ZuluSCSI SDIO code base but is released as a standalone library. I have messed around a bit with disabling the clock during read commands. At the moment it does not use a FIFO or anything; it just starts the clock again when it needs more data. I can't say it is working well, as the timings are probably messed up. It somehow works when I print some text after the clock is disabled. Without this it does not work, and I don't have the proper gear at the moment to measure the timing of the signal. Perhaps you could check it out if you want to. Without the print command after disabling the clock it gives a checksum error, so it does receive some data but not the correct data. As the print function itself consumes a lot of time, I can't really test if performance is increased. You can find the code in the https://github.com/juico/pico-sdio-example/tree/clock_test tree. |
I downloaded your clock_test tree and will look at it. I am now doing tests to see what constraints the Arduino board support package presents. I need to refresh my understanding of the RP2040 SDK also. I started using the RP2040 with the SDK as soon as it was released and was one of the people to discover this fault in the ADC INL. ENOB about 8.6 which killed use in my project. I am also considering PIO SPI. There is a hard limit of 24 MHz with the board support package, since SPI is shared. If I could get close to 50 MHz PIO SPI, I would have over 4 MB/sec for small read/write transfers. I am also looking for any RP2040 boards with builtin SDIO. Sparkfun has one but the wiring is wrong for PIO. I am on the list to get this Adafruit board. The Adafruit board looks good for Arduino users who started with Uno and need more performance. I want to rewrite the basic low level PIO for SDIO to look more like the hardware SDIO controllers I have used on STM32, NXP and other chips. If I can get PIO SPI to go fast, I may just limit SDIO to reliable large transfers, at least for a first release. |
I got a version of full-duplex PIO SPI to work fairly fast but I don't need full-duplex. I found this so I may be able to get more than 50 MHz write for a SD in SPI mode.
|
I finished the PIO SPI and get good performance for 512 byte buffers.
I am now writing new SDIO PIO code. I decided not to use DMA so I can overlap transfer with CRC calculation. I must allow for stalls to prevent FIFO overruns. Here is the time to receive and checksum the MBR sector of an SD. 512 bytes in 17.3 μs is about 29 MB/sec, so I should get very good speed for small buffers. Here are the state machines:
I should get close to 25 MB/sec for 512 byte buffers. |
@greiman That looks good for the SPI, probably more than enough for most use cases. Have you been able to get your hands on the new Adafruit board yet? The PIO version seems promising as well, with it automatically stopping the clock. Not using the DMA would allow for an easier overlap of the CRC calculations, but I wonder if the slight advantage is worth keeping the CPU busy while reading. The overlapping CRC calculation could also be triggered by the end of a DMA transfer. I haven't checked how many cycles it costs the RP2040 to calculate the CRC, but if this is fast enough it could be started after the data is transferred and be finished before the end of the CRC transfer. |
I have an Adafruit Metro but have not tried it yet. I have been using a Pico for development. I may try the Metro soon. I have a first version of SDIO working but wasted a lot of time looking for a timing problem. I tested over 50 SD cards and found three that were flaky. Finally I realized they were the only cards I have with proprietary DDR208 clocking. I looked at the errors with a logic analyzer and discovered the slightest spike on SCK caused the card to clock out an extra nibble on read. I have the wires piled together on a breadboard and data-out from the card may be putting spikes on SCK. I will try the Metro and try to make a Pico setup that allows the UHS-I DDR208 cards to run. Here are first results at 250 MHz. For 512 byte transfers.
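For reference, the checksum in question is the SD data CRC16, which uses the CCITT polynomial x^16 + x^12 + x^5 + 1 with an initial value of 0. The sketch below is a plain bit-by-bit version over a byte stream, written for clarity rather than speed, and `crc16Ccitt` is a name chosen here. It matches the single-line case; in 4-bit SDIO mode each of the four data lines carries its own CRC16, so a real driver has to compute the checksum per line (or use a rearranged parallel form), which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC16 with polynomial 0x1021 and initial value 0 (CRC-16/XMODEM),
// processing the most significant bit of each byte first.
uint16_t crc16Ccitt(const uint8_t* data, size_t len) {
    assert(data != nullptr || len == 0);
    uint16_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        crc = static_cast<uint16_t>(crc ^ (static_cast<uint16_t>(data[i]) << 8));
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x8000) {
                crc = static_cast<uint16_t>((crc << 1) ^ 0x1021);
            } else {
                crc = static_cast<uint16_t>(crc << 1);
            }
        }
    }
    return crc;
}
```

A table-driven or word-at-a-time variant would be the usual next step if the per-sector cost of this loop turns out to be too high to hide behind the transfer.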
For 8192 byte transfers:
|
This might just be a flash QSPI speed problem. You can try changing the flash boot stage 2 to |
I am slowly making progress. I have been testing many apps and about 60 SD cards. I finally have most DDR208 cards working. I will try DMA to see if it is more reliable. These cards seem to be very sensitive to noise on SCK. They are extremely fast so I suspect they see glitches in clock when an occasional stall happens. Here is a result with a DDR208 card that in DDR208 mode is capable of 180MB/sec read and 130 MB/sec write.
|
Hey, I don't personally own a ZuluSCSI device but am using the 4-bit SDIO implementation of ZuluSCSI. I ran into some limitations in terms of write speed and got stuck around 10 MB/s, and the latency of every write seemed inconsistent, with spikes every 2 or 3 multi-block writes. The SD spec states that sending a CMD12 after a multi-block write can take quite some time. So I tried to change the code a bit so it can enter a multi-block write mode where it pre-erases the number of blocks that will be written and sets an internal state such that the CMD12 is not sent after the multi_block_write function.
This dramatically changed the performance, with latencies that are very consistent throughout the write procedure. The initial or final block write does have a higher latency, but for my application this is not a problem. The write speeds I have been able to reach are around 26-28 MB/s (with a Samsung 128GB Evo Plus) with the clock speed of the SD card at 62.5 MHz.
Commit #253 does mention that it is possible to use stopTrans tokens, but I haven't used them in my implementation, which can be found at: https://github.com/juico/pico-sdio-example. The current code is a bit messy. I also tried to implement some commands to change to HS mode using CMD6 and changed the PIO code to try to achieve higher frequencies, although I am not convinced this helped a lot. As I am just using an SD card to micro SD card adapter soldered to some headers that are plugged into a breadboard, I can imagine that higher speeds could be possible on a proper PCB.
Not sure if the write/read performance is a bottleneck for ZuluSCSI, but if needed there seems to be a lot of headroom left.
Edit: I completely missed your high speed and cache enabled branch; I might give those a try as well. I ran into issues with clk_div 3 and a frequency of 150 MHz, but with clk_div 4 I can reach 250 MHz, resulting in a higher SD card frequency.
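The clock figures in this post can be tied together with a little arithmetic. A minimal sketch, assuming the SD clock is simply the system clock divided by clk_div (which matches 250 MHz / 4 = 62.5 MHz) and that a 4-bit bus moves 4 bits per clock; the function names are made up for the example. The result is a raw ceiling that ignores CRC bits, start and end bits, command overhead and card busy time, so measured throughput will be lower.

```cpp
#include <cassert>

// SD clock derived from the system clock, assuming a plain integer divider.
double sdClockHz(double sysClockHz, unsigned clkDiv) {
    assert(clkDiv > 0);
    return sysClockHz / clkDiv;
}

// Upper bound on payload rate in bytes per second for the given bus width.
// Protocol overhead and card busy time are not included.
double busLimitBytesPerSec(double sdClkHz, unsigned busWidthBits = 4) {
    assert(busWidthBits > 0);
    return sdClkHz * busWidthBits / 8.0;
}
```

At 62.5 MHz the ceiling works out to 31.25 MB/s, so the reported 26-28 MB/s is already fairly close to what the 4-bit bus can carry at that clock; at 150 MHz with clk_div 3 the SD clock would be 50 MHz and the ceiling 25 MB/s.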