mimxrt: Fix SD card deadlock and implement proper timeout handling.#18451
mimxrt: Fix SD card deadlock and implement proper timeout handling.#18451andrewleech wants to merge 1 commit into
Conversation
0565ceb to
ed0899e
Compare
|
Code size report: |
ed0899e to
d955ec9
Compare
|
I tried the PR with a little test program at a Teensy 4.1, which does not have a card detect pin, and a MIMXRT1020_EVK with a card detect pin. Both fail the test at line 26 and no data is written. The master branch code passes. Edit: This fault seems to be caused by the situation, that a previous call (here write sd) was not finished at the time of calling Test script: import os
import random
from machine import SDCard
from time import ticks_us
import gc
sdcard = SDCard(1)
def run(size_mult=1):
gc.collect()
size = random.randint(1, 1 << 16)
data = bytearray(size)
print("size = ", size)
os.chdir("/sdcard")
print(os.getcwd())
print(os.listdir())
ticks_start_us = ticks_us()
mode = "wb"
f = open("test_file.txt" , mode)
for _ in range(size_mult):
print("loop", _)
f.write(data)
f.close()
ticks_end_us = ticks_us()
data_size_mb = (len(data) * size_mult)/(1024.0*1024.0)
duration = (ticks_end_us - ticks_start_us)/1000000.0
data_rate = data_size_mb / duration
print()
print("Block Length :", len(data))
print("Total duration : {:.1f} s".format(duration))
print("Transmitted data: {:.1f} Mb".format(data_size_mb))
print("Data Rate : {:.1f} MB/s".format(data_rate))
return data_rate
run() |
Thanks @robert-hh now I realise I only tested by scanning the filesystem tree, I didn't actually read/write to files on the card itself, sorry! My intention was to keep the commands blocking overall, but be able to timeout and raise an OSError when an issue occurs. As great as it would be to have non-blocking IO that requires much broader support up and down the entire vfs stack (including filesystem libraries) to guard against re-entrancy / buffer conflicts etc. I'll figure out a proper fix to ensure the operation is complete before exiting the function. |
|
Thanks for sharing your test script @robert-hh - I've fixed a few more issues I found and confirmed that VFAT mkfs works correctly now, as do basic read/write along with a few permutations of your script eg. >>> run(200)
size = 43873
/sdcard
['test.txt', 'test_file.txt']
loop 0
loop 1
loop 2
loop 3
loop 4
loop 5
<snip>
loop 197
loop 198
loop 199
Block Length : 43873
Total duration : 16.9 s
Transmitted data: 8.4 Mb
Data Rate : 0.5 MB/s
0.495017336424209664
>>> |
|
Thank you for improving the implementation. I changed the test script into reading back the data and verifying the content. At that point I get verification errors, always at the start of a 512 byte sector. Updated test script: import os
import random
from machine import SDCard
from time import ticks_us, sleep_ms
import gc
sdcard = SDCard(1)
def run(size_mult=5):
gc.collect()
size = random.randint(1, 1 << 16)
data = bytearray(size)
for i in range(size):
data[i] = i & 0xff
print("size = ", size)
os.chdir("/sdcard")
print(os.getcwd())
print(os.listdir())
ticks_start_us = ticks_us()
mode = "wb"
print("Write")
f = open("test_file.txt" , mode)
for _ in range(size_mult):
print("loop", _)
f.write(data)
f.close()
ticks_end_us = ticks_us()
data_size_mb = (len(data) * size_mult)/(1024.0*1024.0)
duration = (ticks_end_us - ticks_start_us)/1000000.0
data_rate = data_size_mb / duration
print("Read back")
f = open("test_file.txt" , "rb")
for _ in range(size_mult):
print("loop", _)
data = f.read(size) # scan up
# for i in range(size - 1, -1, -1): # scan down
for i in range(size):
if data[i] != i & 0xff:
offset = _ * size + i
print("data mismatch at 0x%x, %d %d" % (offset, data[i], i & 0xff))
break
f.close()
print()
print("Block Length :", len(data))
print("Total duration : {:.1f} s".format(duration))
print("Transmitted data: {:.1f} Mb".format(data_size_mb))
print("Data Rate : {:.1f} MB/s".format(data_rate))
return data_rate
run() |
1eac664 to
26d49ef
Compare
Very interesting, I had run some basic tests with reading back data without fail. I've run your updated script a number of times and don't get any failures on my rt1176 board. I then flashed up a seeed arch mix (rt1052) and don't see it there either; I wonder if there are timing or realted issues that relate to the particular chip and/or sd card. I did find some extra potential dma cache issues which I've tried to address, I'd be grateful if you can test this too. This build still passes your test on both my boards. and I tried a few more tests of my own (Interleaved read/write to different files simultaneously, attempt to test with different buffer alignments, some larger read/writes) |
a0194e2 to
354cfe9
Compare
|
The last version of your PR still fails. Until now, I used only the Teensy 4.1 for the test. I tried it with a MIMXRT1020EVK, Arch MIX and MIMXRT1170_EVK. The test passes at these boards. The major difference between especially the Teensy and the other boards is SDRAM. Teensy does not have SDRAM, the other boards have. So I disabled SDRAM at the Arch Mix (mpconfigboard.mk, flag MICROPY_HW_SDRAM_AVAIL). Then, the Arch Mix board shows the same error. |
|
@robert-hh thanks, that's really helpful! I should be able to reproduce that now. |
354cfe9 to
6234090
Compare
|
Hi @robert-hh building with SDRAM disabled did indeed reliably reproduce the issue for me... and I have so far been unable to fix it with the non-blocking api from the SDK! I say reliably; it failed every time I ran your script but in different locations each time, there's a timing related / intermittent nature to it certainly. As best I can figure the interrupt-driven path has cache coherency issues: by the time the ISR signals completion and control returns to the waiting code, speculative cache line fills can cause stale data to be read despite efforts to add explicit cache invalidation. Tried various mitigations (DSB barriers, volatile flags, double invalidation) but I got corruption every time. For now I've reverted to the USDHC_TransferBlocking approach, which actually shouldn't be as unsafe as I'd originally thought; I now realise there are hardware timeouts in the peripheral. I also thought the timeout loop in sdcard.c around this call was superflous, but now realise it was more related to waiting while the card is busy (from previous transfer) before starting a new one. As such this PR's been simplified somewhat to keep the original change to stop if blocking when no card is present, keep a couple of safe M7 cache checks and try to make the transfer code a bit clearer with regard to waiting while the card is busy. |
|
Thanks. I will retest that version. During the initial set-up of this port @alphaFred and me spent quite a bit of time to get sd card handling reliable. We tried the non-blocking approach too, but then selected the blocking variant. I cannot recall why. |
|
Tested with both Teensy 4.1 (No SDRAM) and MIMXRT1020EVK (SDRAM). Both work. I just glanced over the changes in the code. They seem good. |
|
Thanks @robert-hh I had a few more attempts at getting it working with the non blocking SDK API but just could not figure out how to prevent the corruption. I did briefly think we could keep both implementations, The only reason I can think of to push for the non blocking API would be that while it's waiting it can use mp_event_wait() which lets scrolled tasks run. Probably not a big deal, but a tangible benefit. What do you think? |
Do you mean blocking or non-blocking? In any case, SD card access has to be reliable. |
|
Yeah absolutely the reliability is critical. USDHC_TransferNonBlocking would mean we can control the "waiting for transfer" loop and have a scheduler yield in it. I'm not sure how much real world benefit that would actually provide to be fair. But yeah, without knowing what's different about the cpu / memory / dma caching that I assume is causing the problems it's not a viable path forward. I've tried to reference the drivers from zephyr and a NXP / FreeRTOS based one that use USDHC_TransferNonBlocking but can't figure out what they're doing differently to mine (other than using threads rather than just blocking). Most of my testing was with the newer SDK yeah, still shows the problem. |
6234090 to
c09c2d7
Compare
Replace interrupt-based transfer with polling (USDHC_TransferBlocking) and add timeouts throughout to prevent deadlocks. Add ERR050396 workaround to disable cacheable AXI transactions for USDHC, and explicit D-cache management around DMA transfers. Add sdcard_wait_ready polling after writes to ensure programming completes. Guard sdcard_power_off against sending commands when no card is present. Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
c09c2d7 to
b3aaf9b
Compare
Summary
Fixes two bugs in the mimxrt SD card driver:
Issue 1: Deadlock on missing card
On a board with no SD Card Detect pin, when attempting to mount an SD card filesystem in
_boot.pywith no card inserted the system would deadlock insdcard_power_off(). The function unconditionally sent GO_IDLE_STATE command even when the card was never successfully initialized. Since the USDHC driver'sUSDHC_WaitCommandDone()has an infinite loop waiting for command completion, this caused a permanent hang when no card was present to respond.This deadlock occurs in mimxrt/machine_sdcard.c during the initial ioctl /
case MP_BLOCKDEV_IOCTL_INITwhich tries tosdcard_power_on()and if that fails, runssdcard_power_off()which calls down tosdcard_cmd_go_idle_state()which deadlocks when there's no card available.Issue 2: Ineffective timeout in blocking transfers
The
sdcard_transfer_blocking()function had a timeout loop that checked if the card was busy before callingUSDHC_TransferBlocking(). However, this timeout was ineffective becauseUSDHC_TransferBlocking()contains infinite loops inUSDHC_WaitCommandDone()andUSDHC_TransferDataBlocking()with no timeout mechanism. Once inside the SDK blocking call, the timeout loop couldn't abort the operation.This PR uses the non-blocking USDHC API from the SDk instead, matching the pattern already used in
ports/mimxrt/sdio.c.Testing
Tested on PHYBOARD-RT1170 with: