Skip to content

mimxrt: Fix SD card deadlock and implement proper timeout handling.#18451

Open
andrewleech wants to merge 1 commit into
micropython:masterfrom
andrewleech:mimx_sdcard_timeouts
Open

mimxrt: Fix SD card deadlock and implement proper timeout handling.#18451
andrewleech wants to merge 1 commit into
micropython:masterfrom
andrewleech:mimx_sdcard_timeouts

Conversation

@andrewleech
Copy link
Copy Markdown
Contributor

Summary

Fixes two bugs in the mimxrt SD card driver:

  • Deadlock when no card is present and ineffective timeout in transfer operations.
  • The timeout fix uses the non-blocking USDHC API with interrupt-driven callbacks, following the pattern used by sdio.c.

Issue 1: Deadlock on missing card

On a board with no SD Card Detect pin, when attempting to mount an SD card filesystem in _boot.py with no card inserted the system would deadlock in sdcard_power_off(). The function unconditionally sent GO_IDLE_STATE command even when the card was never successfully initialized. Since the USDHC driver's USDHC_WaitCommandDone() has an infinite loop waiting for command completion, this caused a permanent hang when no card was present to respond.

This deadlock occurs in mimxrt/machine_sdcard.c during the initial ioctl / case MP_BLOCKDEV_IOCTL_INIT which tries to sdcard_power_on() and if that fails, runs sdcard_power_off() which calls down to sdcard_cmd_go_idle_state() which deadlocks when there's no card available.

Issue 2: Ineffective timeout in blocking transfers

The sdcard_transfer_blocking() function had a timeout loop that checked if the card was busy before calling USDHC_TransferBlocking(). However, this timeout was ineffective because USDHC_TransferBlocking() contains infinite loops in USDHC_WaitCommandDone() and USDHC_TransferDataBlocking() with no timeout mechanism. Once inside the SDK blocking call, the timeout loop couldn't abort the operation.

This PR uses the non-blocking USDHC API from the SDk instead, matching the pattern already used in ports/mimxrt/sdio.c.

Testing

Tested on PHYBOARD-RT1170 with:

  • No SD card inserted (previously deadlocked, now returns error gracefully)
  • SD card insertion/removal during operation
  • Normal read/write operations with SD card present
  • All operations now have working timeouts

@andrewleech andrewleech changed the title mimxrt: Fix SD card deadlock and implement proper timeout handling mimxrt: Fix SD card deadlock and implement proper timeout handling. Nov 21, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 21, 2025

Code size report:

Reference:  all: Bump version to 1.27.0. [78ff170]
Comparison: mimxrt/sdcard: Improve robustness of sdcard driver. [merge of c09c2d7]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:  +312 +0.083% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@robert-hh
Copy link
Copy Markdown
Contributor

robert-hh commented Nov 22, 2025

I tried the PR with a little test program at a Teensy 4.1, which does not have a card detect pin, and a MIMXRT1020_EVK with a card detect pin. Both fail the test at line 26 and no data is written. The master branch code passes.

Edit: This fault seems to be caused by the situation, that a previous call (here write sd) was not finished at the time of calling close(). If I add a short delay in the test code e.g. before f.close(), smaller writes pass and larger writes fail. Changing to non-blocking operation seems requiring to wait at every action of read/write/close until it is finished, since VFS is not aware of the non-blocking behavior. Which makes it effectively blocking.

Test script:

import os
import random
from machine import SDCard
from time import ticks_us
import gc

sdcard = SDCard(1)
def run(size_mult=1):
    gc.collect()
    size = random.randint(1, 1 << 16)
    data = bytearray(size)
    print("size = ", size)

    os.chdir("/sdcard")
    print(os.getcwd())
    print(os.listdir())

    ticks_start_us = ticks_us()
    mode = "wb"
    f = open("test_file.txt" , mode)
    for _ in range(size_mult):
        print("loop", _)
        f.write(data)
    f.close()
    ticks_end_us = ticks_us()

    data_size_mb = (len(data) * size_mult)/(1024.0*1024.0)
    duration = (ticks_end_us - ticks_start_us)/1000000.0
    data_rate = data_size_mb / duration

    print()
    print("Block Length    :", len(data))
    print("Total duration  : {:.1f} s".format(duration))
    print("Transmitted data: {:.1f} Mb".format(data_size_mb))
    print("Data Rate       : {:.1f} MB/s".format(data_rate))

    return data_rate

run()

@andrewleech
Copy link
Copy Markdown
Contributor Author

Edit: This fault seems to be caused by the situation, that a previous call (here write sd) was not finished at the time of calling close(). If I add a short delay in the test code e.g. before f.close(), smaller writes pass and larger writes fail. Changing to non-blocking operation seems requiring to wait at every action of read/write/close until it is finished, since VFS is not aware of the non-blocking behavior. Which makes it effectively blocking.

Thanks @robert-hh now I realise I only tested by scanning the filesystem tree, I didn't actually read/write to files on the card itself, sorry!

My intention was to keep the commands blocking overall, but be able to timeout and raise an OSError when an issue occurs. As great as it would be to have non-blocking IO that requires much broader support up and down the entire vfs stack (including filesystem libraries) to guard against re-entrancy / buffer conflicts etc.

I'll figure out a proper fix to ensure the operation is complete before exiting the function.

@andrewleech
Copy link
Copy Markdown
Contributor Author

Thanks for sharing your test script @robert-hh - I've fixed a few more issues I found and confirmed that VFAT mkfs works correctly now, as do basic read/write along with a few permutations of your script eg.

>>> run(200)
size =  43873
/sdcard
['test.txt', 'test_file.txt']
loop 0
loop 1
loop 2
loop 3
loop 4
loop 5
<snip>
loop 197
loop 198
loop 199

Block Length    : 43873
Total duration  : 16.9 s
Transmitted data: 8.4 Mb
Data Rate       : 0.5 MB/s
0.495017336424209664
>>>

@robert-hh
Copy link
Copy Markdown
Contributor

robert-hh commented Nov 25, 2025

Thank you for improving the implementation. I changed the test script into reading back the data and verifying the content. At that point I get verification errors, always at the start of a 512 byte sector. Updated test script:

import os
import random
from machine import SDCard
from time import ticks_us, sleep_ms
import gc

sdcard = SDCard(1)
def run(size_mult=5):
    gc.collect()
    size = random.randint(1, 1 << 16)
    data = bytearray(size)
    for i in range(size):
        data[i] = i & 0xff
    print("size = ", size)

    os.chdir("/sdcard")
    print(os.getcwd())
    print(os.listdir())

    ticks_start_us = ticks_us()
    mode = "wb"
    print("Write")
    f = open("test_file.txt" , mode)
    for _ in range(size_mult):
        print("loop", _)
        f.write(data)
    f.close()
    ticks_end_us = ticks_us()

    data_size_mb = (len(data) * size_mult)/(1024.0*1024.0)
    duration = (ticks_end_us - ticks_start_us)/1000000.0
    data_rate = data_size_mb / duration

    print("Read back")
    f = open("test_file.txt" , "rb")
    for _ in range(size_mult):
        print("loop", _)
        data = f.read(size)  # scan up
        # for i in range(size - 1, -1, -1):  # scan down
        for i in range(size):
            if data[i] != i & 0xff:
                offset = _ * size + i
                print("data mismatch at 0x%x, %d %d" % (offset, data[i], i & 0xff))
                break
    f.close()


    print()
    print("Block Length    :", len(data))
    print("Total duration  : {:.1f} s".format(duration))
    print("Transmitted data: {:.1f} Mb".format(data_size_mb))
    print("Data Rate       : {:.1f} MB/s".format(data_rate))

    return data_rate

run()

@andrewleech
Copy link
Copy Markdown
Contributor Author

andrewleech commented Nov 26, 2025

reading back the data and verifying the content

Very interesting, I had run some basic tests with reading back data without fail. I've run your updated script a number of times and don't get any failures on my rt1176 board. I then flashed up a seeed arch mix (rt1052) and don't see it there either; I wonder if there are timing or realted issues that relate to the particular chip and/or sd card.

I did find some extra potential dma cache issues which I've tried to address, I'd be grateful if you can test this too.

This build still passes your test on both my boards. and I tried a few more tests of my own (Interleaved read/write to different files simultaneously, attempt to test with different buffer alignments, some larger read/writes)

test_sdcard.py

@andrewleech andrewleech force-pushed the mimx_sdcard_timeouts branch 2 times, most recently from a0194e2 to 354cfe9 Compare November 26, 2025 06:07
@robert-hh
Copy link
Copy Markdown
Contributor

The last version of your PR still fails. Until now, I used only the Teensy 4.1 for the test. I tried it with a MIMXRT1020EVK, Arch MIX and MIMXRT1170_EVK. The test passes at these boards. The major difference between especially the Teensy and the other boards is SDRAM. Teensy does not have SDRAM, the other boards have. So I disabled SDRAM at the Arch Mix (mpconfigboard.mk, flag MICROPY_HW_SDRAM_AVAIL). Then, the Arch Mix board shows the same error.

@andrewleech
Copy link
Copy Markdown
Contributor Author

@robert-hh thanks, that's really helpful! I should be able to reproduce that now.

@andrewleech
Copy link
Copy Markdown
Contributor Author

Hi @robert-hh building with SDRAM disabled did indeed reliably reproduce the issue for me... and I have so far been unable to fix it with the non-blocking api from the SDK!

I say reliably; it failed every time I ran your script but in different locations each time, there's a timing related / intermittent nature to it certainly.

As best I can figure the interrupt-driven path has cache coherency issues: by the time the ISR signals completion and control returns to the waiting code, speculative cache line fills can cause stale data to be read despite efforts to add explicit cache invalidation. Tried various mitigations (DSB barriers, volatile flags, double invalidation) but I got corruption every time.

For now I've reverted to the USDHC_TransferBlocking approach, which actually shouldn't be as unsafe as I'd originally thought; I now realise there are hardware timeouts in the peripheral. I also thought the timeout loop in sdcard.c around this call was superflous, but now realise it was more related to waiting while the card is busy (from previous transfer) before starting a new one.

As such this PR's been simplified somewhat to keep the original change to stop if blocking when no card is present, keep a couple of safe M7 cache checks and try to make the transfer code a bit clearer with regard to waiting while the card is busy.

@robert-hh
Copy link
Copy Markdown
Contributor

Thanks. I will retest that version. During the initial set-up of this port @alphaFred and me spent quite a bit of time to get sd card handling reliable. We tried the non-blocking approach too, but then selected the blocking variant. I cannot recall why.
I did not try your non-blocking variant with NXP-Lib V2.16. Maybe it changes the behavior.

@robert-hh
Copy link
Copy Markdown
Contributor

Tested with both Teensy 4.1 (No SDRAM) and MIMXRT1020EVK (SDRAM). Both work. I just glanced over the changes in the code. They seem good.

@andrewleech
Copy link
Copy Markdown
Contributor Author

Thanks @robert-hh I had a few more attempts at getting it working with the non blocking SDK API but just could not figure out how to prevent the corruption.

I did briefly think we could keep both implementations, #ifdef sdram enabled, but that does complicate things a fair bit.

The only reason I can think of to push for the non blocking API would be that while it's waiting it can use mp_event_wait() which lets scrolled tasks run. Probably not a big deal, but a tangible benefit. What do you think?

@robert-hh
Copy link
Copy Markdown
Contributor

The only reason I can think of to push for the non blocking API would be that while it's waiting it can use mp_event_wait() which lets scrolled tasks run.

Do you mean blocking or non-blocking? In any case, SD card access has to be reliable.
Did you try the non-blocking variant with Library 2.16? Maybe the problem has been fixed meanwhile.

@andrewleech
Copy link
Copy Markdown
Contributor Author

andrewleech commented Dec 7, 2025

Yeah absolutely the reliability is critical.
The existing use of USDHC_TransferBlocking does seem to be reliable at least.

USDHC_TransferNonBlocking would mean we can control the "waiting for transfer" loop and have a scheduler yield in it. I'm not sure how much real world benefit that would actually provide to be fair.

But yeah, without knowing what's different about the cpu / memory / dma caching that I assume is causing the problems it's not a viable path forward. I've tried to reference the drivers from zephyr and a NXP / FreeRTOS based one that use USDHC_TransferNonBlocking but can't figure out what they're doing differently to mine (other than using threads rather than just blocking).

Most of my testing was with the newer SDK yeah, still shows the problem.

Replace interrupt-based transfer with polling (USDHC_TransferBlocking)
and add timeouts throughout to prevent deadlocks. Add ERR050396
workaround to disable cacheable AXI transactions for USDHC, and
explicit D-cache management around DMA transfers. Add sdcard_wait_ready
polling after writes to ensure programming completes. Guard
sdcard_power_off against sending commands when no card is present.

Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants