How to avoid end of life from NAND correctable errors

Thom Denholm - TUXERA

Flash media is fabulous for most use cases, but heavy reads can cause correctable errors. Linux flash file systems actually shorten the life of the media when dealing with these errors. How does this change with multiple bits per cell, including recent QLC NAND? What other sorts of media management can help get the most lifetime out of your flash media based device?
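
One rough way to see why more bits per cell means more correctable errors: a cell that stores n bits must tell apart 2^n threshold-voltage states. The sketch below assumes, purely for illustration, a fixed voltage window split into evenly spaced states. Real devices place their read thresholds differently, so treat the numbers as relative trends, not device data.

```python
# Simplified model (an assumption, not device data): a cell holding `bits` bits
# must distinguish 2**bits threshold-voltage states, and a fixed voltage window
# is divided evenly among them.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def states(bits: int) -> int:
    """Number of threshold-voltage states a cell must tell apart."""
    return 2 ** bits


def relative_margin(bits: int) -> float:
    """Gap between adjacent states, as a fraction of the whole voltage window."""
    return 1.0 / (states(bits) - 1)


if __name__ == "__main__":
    for name, bits in CELL_TYPES.items():
        print(f"{name}: {states(bits):2d} states, margin {relative_margin(bits):.3f} of the window")
```

Under this model a QLC cell keeps about one fifteenth of the SLC margin, so a much smaller shift in stored charge (from read disturb, retention loss, or wear) is enough to flip a bit that ECC must then correct.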

This talk will cover these problems and their impacts in detail, from flash file systems to SSDs and other NAND flash-based media. While we can't speak to what the firmware in your devices is doing, we have excellent knowledge of what it should be doing, and we will also detail the sorts of conversations a system designer should have with their flash media vendors.
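
The abstract does not spell out a specific policy, so the following is only a hypothetical sketch of one approach: treat a correctable error as a signal rather than a failure, and relocate ("refresh") the data only once the number of corrected bits approaches the ECC's limit. The function name, the 0.75 ratio, and the return labels are all assumptions made for illustration, not a kernel or vendor default.

```python
import math
from typing import Optional


def read_action(corrected_bits: Optional[int], ecc_strength: int,
                threshold_ratio: float = 0.75) -> str:
    """Hypothetical policy for the result of reading one ECC codeword.

    corrected_bits:  bitflips the ECC engine fixed, or None if it could not correct.
    ecc_strength:    maximum bitflips the ECC can correct per codeword.
    threshold_ratio: assumed tuning knob, not a vendor or kernel default.
    """
    if corrected_bits is None:
        return "uncorrectable"  # ECC could not recover the data
    threshold = max(1, math.ceil(ecc_strength * threshold_ratio))
    if corrected_bits >= threshold:
        return "refresh"  # copy data to another block; the old one will cost an erase
    return "ok"  # ECC fixed it transparently; no erase cycle spent


if __name__ == "__main__":
    for bits in (0, 1, 5, 6, None):
        print(bits, read_action(bits, ecc_strength=8))
```

With `ecc_strength=8` and the default ratio, reads needing 1 to 5 corrections return "ok" and only 6 or more trigger "refresh". Setting `threshold_ratio` very low would refresh on nearly every correctable error, spending erase cycles the media may not need to spend.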

Score: 3 | 6 days ago | 1 reply

To answer the most frequently asked question: yes, I would be happy to send slides to whoever wants them. May I instead recommend the whitepaper I wrote on this topic? Available here - https://www.tuxera.com/nand-correctable-errors/

Score: 0 | 4 days ago | no reply yet

Slides now uploaded and available at the link on the left.

Score: 2 | 4 days ago | 1 reply

Hi Mr. Denholm! Thanks so much for the talk! I never thought I'd see Galois fields at 10:16 (I guess they show up through the generator polynomials used in error-correcting codes; I've never been really good at cryptography, haha)

Score: 1 | 4 days ago | no reply yet

You are very welcome! The math of protecting a lot of bits with a handful of additional bits is fascinating, and could easily be a separate talk. What I'd like to know more about is RAID, which is indistinguishable from magic for me at this point.

Score: 5 | 4 days ago | 1 reply

I have an ARM single-board computer running Linux with ext3 file systems on both microSD and eMMC. Where would error correction be handled in those situations? In the controller inside the eMMC chip and inside the microSD card?

My understanding is that eMMC is supposed to be more reliable, but I am not sure why.

Score: 0 | 4 days ago | no reply yet

Hi Drew,
Both microSD and eMMC have NAND memory and firmware. Between those are controllers - internal on the eMMC, either internal or external on the microSD. The firmware works with the controller to detect and correct errors.
The interface for the eMMC is more robust than that of the microSD, and allows for more options to control the device. One example is power loss notification, so that a design can stop writing to the media.
Additionally, most eMMC vendors provide a "pseudo-SLC" mode for their MLC chips. This stores only a single bit in each cell, which halves the storage but increases the lifetime and robustness. The microSD vendors give no options like that - the interface doesn't define any.
Those are two examples of how eMMC can be more robust than microSD, but overall reliability depends on the design. The microSD has advantages too - it can be removed or replaced, while eMMC is fixed onto the board.
Tuxera is a member of JEDEC (which manages the eMMC specification) and is a board member of the SD Association.
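
The "halves the storage" figure for pseudo-SLC on MLC follows from simple arithmetic that extends to denser cells: pseudo-SLC keeps one bit where the native mode keeps n, so roughly 1/n of the native capacity remains. Whether a particular eMMC part offers this mode on TLC or QLC NAND is a question for the vendor; the 48 GiB device below is hypothetical.

```python
GIB = 2 ** 30


def pslc_capacity(native_bytes: int, bits_per_cell: int) -> int:
    """Capacity when every cell stores 1 bit instead of bits_per_cell bits.

    Simplified: ignores over-provisioning, spare area, and vendor-specific
    rounding, which real pseudo-SLC configurations will include.
    """
    if bits_per_cell < 1:
        raise ValueError("bits_per_cell must be at least 1")
    return native_bytes // bits_per_cell


if __name__ == "__main__":
    for name, bits in (("MLC", 2), ("TLC", 3), ("QLC", 4)):
        print(name, pslc_capacity(48 * GIB, bits) // GIB, "GiB in pseudo-SLC mode")
```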

Score: 4 | 4 days ago | 1 reply

Does the Linux MTD driver need to know the maximum lifetime specs for a given NAND chip?

Score: 2 | 4 days ago | no reply yet

Interesting, my answer to this question seems to have been "discarded" :)
My understanding of MTD is that it wouldn't do anything with the lifetime specs if it had them. It is a fairly simple interface, and doesn't do any sort of wear leveling or bad block management - those are all left to the flash driver or flash file system running on top of MTD.
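
As a rough illustration of work that falls to the layer above MTD, here is a hypothetical dynamic wear-leveling allocator: it hands out the free block with the fewest erases and retires blocks whose erase fails. The class and method names are invented for this sketch, and static wear leveling (moving cold data off lightly worn blocks) is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class WearLevelingAllocator:
    """Hypothetical block allocator for a flash layer running on top of MTD.

    Tracks per-block erase counts and a set of retired (bad) blocks.
    """
    erase_counts: list[int]
    free: set[int] = field(default_factory=set)
    bad: set[int] = field(default_factory=set)

    def allocate(self) -> int:
        """Return the usable free block with the fewest erases (ties: lowest index)."""
        usable = self.free - self.bad
        if not usable:
            raise RuntimeError("no usable free blocks left")
        block = min(usable, key=lambda b: (self.erase_counts[b], b))
        self.free.discard(block)
        return block

    def release(self, block: int, erase_ok: bool = True) -> None:
        """Record an erase; a block whose erase failed is retired instead of reused."""
        self.erase_counts[block] += 1
        if erase_ok:
            self.free.add(block)
        else:
            self.bad.add(block)


if __name__ == "__main__":
    alloc = WearLevelingAllocator(erase_counts=[5, 1, 3, 0], free={0, 1, 2, 3}, bad={3})
    print(alloc.allocate())  # least-erased usable block
```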

Score: 3 | 4 days ago | 2 replies

Does Linux send Trim commands to an SSD, or just to raw NAND chips?

Score: 2 | 4 days ago | no reply yet

Related to this, if Trim commands are not used at all, a serious performance drop occurs once the system has been completely filled for the first time. We noted this in my blog post.
SSDs likely have some form of firmware garbage collection to mitigate this, but they lack the file system's knowledge of which data blocks are still in use.

Score: 1 | 4 days ago | no reply yet

In Linux, Trim commands are issued by the file system, and each file system configures them differently. For example, ext4 issues Trim commands only when mounted with the discard option; that setting is simply on or off.
Another option, used for better performance, is an external daemon that performs batches of discards on a timed basis, balancing the lifetime benefit of discards against the performance benefit of scheduling them.

Score: 0 | 7 days ago | no reply yet

Hi folks, I am online and ready to answer any questions you might have!