UBI - Unsorted Block Images

Table of contents

  1. Big red note
  2. Overview
  3. Power-cuts tolerance
  4. Kernel source code
  5. Mailing list
  6. User-space tools
  7. UBI headers
  8. UBI volume table
  9. Minimum flash input/output unit
  10. NAND flash sub-pages
  11. UBI headers position
  12. Flash space overhead
  13. Saving erase counters
  14. How a UBI flasher should work
  15. Marking eraseblocks as bad
  16. Scalability issues
  17. Reserved blocks for bad block handling (only for NAND chips)
  18. Volume auto-resize
  19. UBI operations
    1. LEB un-map
    2. LEB map
    3. Volume update
    4. Atomic LEB change
  20. Fastmap
  21. R/O block devices on top of UBI volumes
  22. UBI stress testing
  23. More documentation

Big red note

People are often confused about UBI, which is why this section was created. Please, realize that:

Please, do not be confused. Read here for more information about how raw flash devices are different from FTL devices.

Overview

UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for raw flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.

In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and transparent error handling.

A UBI volume is a set of consecutive logical eraseblocks (LEBs). Each logical eraseblock is dynamically mapped to a physical eraseblock (PEB). This mapping is managed by UBI and is hidden from users and higher-level software. UBI is the base mechanism which provides global wear-leveling, per-physical eraseblock erase counters, and the ability to transparently move data from more worn-out physical eraseblocks to less worn-out ones.

The UBI volume size is specified when a volume is created, but may later be changed (volumes are dynamically re-sizable). There are user-space tools which may be used to manipulate UBI volumes.

There are 2 types of UBI volumes: dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC-32 checksums, while dynamic volumes are read-write and the upper layers (e.g., a file-system) are responsible for ensuring data integrity.

UBI is aware of bad eraseblocks (i.e. portions of flash which wear out over time) and frees upper-level software from having to handle bad eraseblocks itself. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it with a good physical eraseblock. UBI moves the data from newly discovered bad physical eraseblocks to good ones. The result is that users of UBI volumes do not notice I/O errors since UBI takes care of them transparently.

NAND flashes are also susceptible to bit-flip errors which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks which have bit-flips to other physical eraseblocks. This process is called scrubbing. Scrubbing is done transparently in the background and is hidden from upper layers.

Here is a short list of UBI's main features:

Here is a comparison of MTD partitions and UBI volumes. They are somewhat similar because:

But UBI volumes have the following advantages over MTD partitions:

UBI also provides a block device that allows regular, block-oriented file systems to be mounted on top of an UBI volume. This is possible because UBI handles bad-blocks transparently.

There is an additional driver called gluebi which emulates MTD devices on top of UBI volumes. This looks a little strange, because UBI works on top of an MTD device, then gluebi emulates other MTD devices on top, but this actually works and makes it possible for existing software (e.g., JFFS2) to run on top of UBI volumes. However, new software may benefit from the advanced UBI features and let UBI solve many issues which the flash technology imposes.

Power-cuts tolerance

Both UBI and UBIFS are designed with tolerance to power-cuts in mind.

UBI has an internal debugging infrastructure that can emulate power failures for testing. The advantage of the emulation is that it emulates power failures at the critical points where control data structures are written to the device, whereas the probability of interrupting the system at those precise moments with physical power-cut testing is rather low.

Kernel source code

UBI has been part of the mainline Linux kernel since version 2.6.22. The UBI git tree may be found at:

https://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs.git/

Mailing list

You are welcome to send feedback, bug reports, patches, etc. to the MTD mailing list.

User-space tools

UBI user-space tools, as well as other MTD user-space tools, are available from the following git repository:

http://git.infradead.org/mtd-utils.git

This section provides information about how to compile the whole mtd-utils repository tree. You should find the UBI tools under the ubi-utils sub-directory.

The repository contains the following UBI tools:

All UBI tools support an "-h" option which prints basic usage information.

Note, the ubiattach and ubidetach tools won't work if the kernel version is less than 2.6.25, because corresponding UBI features did not exist in these older kernels.

UBI headers

UBI stores 2 small 64-byte headers at the beginning of each non-bad physical eraseblock:

This is why logical eraseblocks are smaller than physical eraseblocks: the headers take some flash space.

All UBI headers are protected by a CRC-32 checksum. Please, refer to the drivers/mtd/ubi/ubi-media.h file in the Linux kernel for more information about the headers' contents.

When UBI attaches an MTD device, it has to scan it, read all headers, check the CRC-32 checksums, and store erase counters and the logical-to-physical eraseblock mapping information in RAM. Please, refer to this section for information about scalability issues related to this.

After UBI has erased a PEB, it increments the erase counter value and writes it to the EC header. This means that PEBs always have a valid EC header, except for the short period of time after the erasure and before the EC header is written. Should an unexpected reboot happen during this short period of time, the EC header is lost or becomes corrupted. In this case UBI writes a new EC header with an average erase counter just after the MTD device scanning is done.
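The recovery step described above can be sketched as follows. This is only an illustration, not UBI's implementation: the struct peb_scan type and the assign_mean_ec() function are hypothetical names, and the sketch assumes that "average" means the arithmetic mean over the PEBs whose EC headers were read successfully.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PEB scan result; not a real UBI data structure. */
struct peb_scan {
        int ec_valid;   /* non-zero if a valid EC header was read */
        uint64_t ec;    /* erase counter, meaningful only if ec_valid */
};

/*
 * Give every PEB whose EC header was lost the mean erase counter of the
 * PEBs with valid headers. Returns the mean that was used (0 if no PEB
 * had a valid header).
 */
uint64_t assign_mean_ec(struct peb_scan *pebs, size_t count)
{
        uint64_t sum = 0, valid = 0, mean;
        size_t i;

        assert(pebs != NULL || count == 0);

        for (i = 0; i < count; i++) {
                if (pebs[i].ec_valid) {
                        sum += pebs[i].ec;
                        valid++;
                }
        }
        mean = valid ? sum / valid : 0;

        /* This models the new EC header written after scanning is done. */
        for (i = 0; i < count; i++) {
                if (!pebs[i].ec_valid) {
                        pebs[i].ec = mean;
                        pebs[i].ec_valid = 1;
                }
        }
        return mean;
}
```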

The VID header is written to the PEB when UBI associates it with an LEB. Let's consider what happens to the headers during some UBI operations.

UBI maintains two per-PEB headers because it needs to write different information to flash at different moments of time:

When the EC header is written to a PEB, UBI does not yet know the volume ID nor the LEB number to which this PEB will be associated. This is why UBI needs to do two separate write operations and to have two separate headers.

UBI volume table

The volume table is an on-flash data structure which contains information about each volume on this UBI device. The volume table is an array of volume table records. Each record contains the following information:

Each record describes one UBI volume. The record index in the volume table array corresponds to the volume ID it describes. I.e., UBI volume 0 is described by record 0 in the volume table, and so on. The total number of records in the volume table is limited by the LEB size, and cannot be greater than 128. This means that UBI devices cannot have more than 128 volumes.

Every time a UBI volume is created, removed, re-sized, re-named or updated, the corresponding volume table record is changed. UBI maintains two copies of the volume table for reasons of reliability and power-cut tolerance.

Implementation details

Internally, the volume table resides in a special-purpose UBI volume which is called the layout volume. This volume consists of 2 LEBs - one for each copy of the volume table. The layout volume is an "internal" UBI volume, and users do not see it nor access it. When reading or writing the layout volume, UBI uses the same mechanisms which are used for normal user volumes.

UBI uses the following algorithm when updating a volume table record:

When attaching the MTD device, UBI makes sure that the 2 volume table copies are equivalent. If they are not equivalent, which may be caused by an unclean reboot, UBI picks the one from LEB0 and copies it to LEB1 of the layout volume (because, according to the algorithm specified above, LEB0 is the one that is updated first and therefore considered to have the most up-to-date information). If one of the volume table copies is corrupted, UBI restores it from the other volume table copy.
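The attach-time check described above might look roughly like the sketch below. The names (reconcile_vtbl, enum vtbl_action) are hypothetical, the validity flags stand in for whatever corruption check the caller performs, and treating "both copies corrupted" as a fatal condition is an assumption of this sketch, since the text does not say what happens in that case.

```c
#include <assert.h>
#include <string.h>

enum vtbl_action {
        VTBL_OK,          /* both copies valid and identical */
        VTBL_COPY_0_TO_1, /* LEB0 copy wins and is copied to LEB1 */
        VTBL_COPY_1_TO_0, /* LEB0 copy corrupted, restored from LEB1 */
        VTBL_FATAL        /* both copies corrupted (assumed outcome) */
};

/*
 * leb0/leb1 hold the two volume table copies, leb0_ok/leb1_ok tell
 * whether each copy passed the caller's corruption check.
 */
enum vtbl_action reconcile_vtbl(unsigned char *leb0, int leb0_ok,
                                unsigned char *leb1, int leb1_ok,
                                size_t size)
{
        assert(leb0 != NULL && leb1 != NULL && size > 0);

        if (!leb0_ok && !leb1_ok)
                return VTBL_FATAL;
        if (!leb0_ok) {
                memcpy(leb0, leb1, size);
                return VTBL_COPY_1_TO_0;
        }
        /* LEB0 is updated first, so it is the more up-to-date copy. */
        if (!leb1_ok || memcmp(leb0, leb1, size) != 0) {
                memcpy(leb1, leb0, size);
                return VTBL_COPY_0_TO_1;
        }
        return VTBL_OK;
}
```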

Minimum flash input/output unit

UBI uses an abstract model of flash. In short, from UBI's point of view the flash (or MTD device) consists of eraseblocks, which may be good or bad. Each good eraseblock may be read from, written to, or erased. Good eraseblocks may also be marked as bad.

Flash reads and writes may only be done in multiples of the minimum input/output unit size, which depends on the flash type.

The minimum I/O unit size is a very important characteristic of the MTD device. It affects many things, e.g.:

NAND flash sub-pages

As mentioned earlier, all UBI I/O is performed in multiples of the minimum I/O unit size, which is equivalent to the NAND device's page size (in the case of NAND flash). However, some SLC NAND flashes allow for smaller I/O units, which are called sub-pages in MTD terminology. Not all NAND devices have sub-pages.

If the NAND flash supports sub-pages, then ECC codes can be calculated on a per-sub-page basis, instead of a per-page basis. In this case it becomes possible to read and write sub-pages independently.

However, even though the NAND chip may support sub-pages, the NAND controller of your SoC might not. If the flash is managed by a controller which calculates ECC codes only on a per-page basis, then it is impossible to do I/O in sub-page chunks. This is the case, for example, for the OLPC XO-1 laptop: its NAND chip supports sub-pages, but the NAND controller does not.

Note, the phrase "sub-page" is an MTD term, but this is also referred to as "NOP" which stands for "number of partial programs". NOP1 NAND flashes have no sub-pages: UBI treats them as NANDs with a sub-page size equal to the NAND page size. NOP2 NAND flashes have 2 sub-pages (half a NAND page each), and NOP4 flashes have 4 sub-pages (a quarter of a NAND page each).

UBI utilizes sub-pages to reduce flash space overhead. This overhead is reduced if sub-pages can be used (see here). Consider a NAND flash with 128KiB eraseblocks and 2048-byte pages. If it does not have sub-pages, UBI puts the VID header at physical offset 2048, so the LEB size becomes 124KiB (128KiB minus one NAND page which stores the EC header and minus another NAND page which stores the VID header). Conversely, if the NAND flash does have sub-pages, UBI puts the VID header at physical offset 512 (the second sub-page), so the LEB size becomes 126KiB (128KiB minus one NAND page which is used for storing both UBI headers). See this section for more information about where the UBI headers are stored.
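The arithmetic in the example above can be reproduced with a small sketch. The names (struct ubi_layout, ubi_layout_for, HDR_SIZE) are made up for illustration. The sketch assumes that the sub-page size is the page size divided by the NOP value, that the VID header starts at the first sub-page boundary after the 64-byte EC header, and that LEB data starts at the first minimum I/O unit boundary after the VID header.

```c
#include <assert.h>

#define HDR_SIZE 64u  /* size of the EC header and of the VID header */

/* Hypothetical result type: where things end up inside one PEB. */
struct ubi_layout {
        unsigned int vid_hdr_offset; /* offset of the VID header */
        unsigned int data_offset;    /* offset of the first LEB data byte */
        unsigned int leb_size;       /* bytes left for LEB data */
};

/* Round x up to a multiple of a, where a is a power of two. */
static unsigned int align_up(unsigned int x, unsigned int a)
{
        return (x + a - 1) & ~(a - 1);
}

struct ubi_layout ubi_layout_for(unsigned int peb_size,
                                 unsigned int min_io_size,
                                 unsigned int nop)
{
        struct ubi_layout l;
        unsigned int subpage;

        assert(nop > 0 && min_io_size % nop == 0);
        subpage = min_io_size / nop;
        assert(subpage != 0 && (subpage & (subpage - 1)) == 0);
        assert((min_io_size & (min_io_size - 1)) == 0);

        l.vid_hdr_offset = align_up(HDR_SIZE, subpage);
        l.data_offset = align_up(l.vid_hdr_offset + HDR_SIZE, min_io_size);
        l.leb_size = peb_size - l.data_offset;
        return l;
}
```

With a 128KiB eraseblock and 2048-byte pages, a NOP value of 1 yields a 124KiB LEB, and a NOP value of 4 (512-byte sub-pages) yields a 126KiB LEB, matching the figures in the text.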

Sub-pages are only used by UBI internally, and only for storing the headers. The UBI API does not allow users to perform I/O to sub-page units. One of the reasons for this is that sub-page writes may be slow. To write a sub-page, the driver may actually write the whole NAND page, but put 0xFF bytes in the sub-pages which are not relevant to this operation. If this is the case, writing 4 sub-pages will be 4 times slower than writing the whole NAND page at once. Thus, UBI does use sub-pages for the headers, but this trick does not extend to the UBI API.

UBI headers position

The EC header always resides at offset 0 and takes 64 bytes. The VID header resides at the next available minimum I/O unit or sub-page, and also takes 64 bytes. For example:

Flash space overhead

UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:

Let's introduce symbols:

The UBI overhead is (B + 4) * SP + O * (P - B - 4) bytes, i.e., this number of bytes will not be accessible to users. O is different for different flashes:

N.B.: the formula above counts bad blocks as a UBI overhead. The real UBI overhead is: (B - BB + 4) * SP + O * (P - B - 4).
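The two formulas can be evaluated with a few lines of C. The symbol definitions are not reproduced in this text, so the meanings used here are assumptions: P is taken to be the total number of PEBs, B the number of PEBs set aside for bad block handling, BB the number of bad blocks, SP the PEB size in bytes, and O the per-PEB header overhead in bytes. The function name is hypothetical. Passing BB = 0 gives the first formula, a non-zero BB gives the second.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Evaluate (B - BB + 4) * SP + O * (P - B - 4).
 * The parameter meanings are assumptions, see the paragraph above.
 * The sketch requires bb <= b and b + 4 <= p to avoid underflow.
 */
uint64_t ubi_overhead_bytes(uint64_t p, uint64_t b, uint64_t bb,
                            uint64_t sp, uint64_t o)
{
        assert(bb <= b);
        assert(b + 4 <= p);

        return (b - bb + 4) * sp + o * (p - b - 4);
}
```

For instance, with P = 1024, B = 20, BB = 0, SP = 131072 and O = 4096, the function returns 7241728 bytes, which is roughly 5% of a 128MiB partition.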

Saving erase counters

When working with UBI, it is important to realize that UBI stores erase counters on the flash media. Namely, each physical eraseblock has an EC (erase counter) header which stores the number of times this physical eraseblock has been erased (see here). It is important not to lose the erase counters, which means the tools you use to erase the flash and to write the UBI images have to be UBI-aware. The mtd-utils repository contains the ubiformat utility which does things properly.

How a UBI flasher should work

The following is a list of what a UBI flasher program has to do when erasing the flash or when writing UBI images.

In practice the input UBI image is usually shorter than the flash, so the flasher has to flash the used PEBs properly, and erase the unused PEBs properly.

Note, when writing a UBI image, it does not matter where eraseblocks from the input UBI image are written. For example, the first input eraseblock may be written to the first PEB, or to the second one, or to the last one.

Also note, if you create a flasher to write UBI images at the time of production (i.e., to new flash, only once), then the flasher does not have to change the EC headers of the input UBI image, because this is new flash and each PEB has a zero erase counter anyway. This means the production-line flasher may be simpler.

If your UBI image contains a UBIFS file system, and your flash is NAND, you may have to drop 0xFF bytes from the end of your input PEB data. This is very important, although not required for all NAND flashes. Sometimes a failure to do this may result in very unpleasant problems which might be difficult to debug later on. So we recommend always doing this.

The reason for this is that UBIFS treats NAND pages which contain only 0xFF bytes (let's refer to them as empty NAND pages) as free. For example, suppose the first NAND page of a PEB has some data, the second one is empty, the third one also has some data, the fourth one and the rest of NAND pages are empty as well. In this case UBIFS will treat all NAND pages starting from the fourth one as free, and will write data there. If the flasher program has already written 0xFF's to these pages, then any new UBIFS data will cause a second write. However, many NAND flashes require NAND pages to be written only once, even if the data contains only 0xFF bytes.

To put it differently, writing 0xFF bytes may have side-effects. What the flasher has to do is to drop all empty NAND pages from the end of the PEB buffer before writing it. It is not necessary to drop all empty NAND pages, just the last ones. This means that the flasher does not have to scan the whole buffer for 0xFF's. It is enough to scan the buffer from the end, and stop on the first non-0xFF byte. This is much faster. Here is the code from UBI which does the right thing:

/**
 * ubi_calc_data_len - calculate how much real data is stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}

This function is called before writing the buf buffer to the PEB. The purpose of this function is to drop 0xFF's from the end and prevent the situation described above. The ubi->min_io_size is the minimal I/O unit size, which is equivalent to the NAND page size.
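A flasher that runs outside the kernel has no struct ubi_device, so it needs its own variant of this check. The following is a minimal user-space sketch of the same idea, with a hypothetical function name, assuming the input buffer covers a whole number of NAND pages and the caller knows the page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of buf have to be written to the PEB once the
 * empty (all 0xFF) NAND pages at the end are dropped. The result is a
 * multiple of page_size and is 0 if the whole buffer is empty.
 */
size_t trimmed_write_len(const uint8_t *buf, size_t len, size_t page_size)
{
        size_t used = len;

        assert(buf != NULL || len == 0);
        assert(page_size > 0 && len % page_size == 0);

        /* Scan from the end and stop on the first non-0xFF byte. */
        while (used > 0 && buf[used - 1] == 0xFF)
                used--;

        /* Round up to whole pages: flash is written in page units. */
        return (used + page_size - 1) / page_size * page_size;
}
```

Empty pages in the middle of the buffer are still written, which matches the text: only the empty pages at the end need to be dropped.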

By the way, we experienced similar problems with JFFS2. The JFFS2 images generated by the mkfs.jffs2 program were padded to the physical eraseblock size and were later flashed to our NAND. The flasher did not bother to skip empty NAND pages. When JFFS2 was mounted, it wrote to those NAND pages, and the writes did not fail. But later we observed weird ECC errors. It took a while to find out the problem. In other words, this is also relevant to JFFS2 images.

An alternative to this approach is to enable the "free space fixup" option when generating the UBIFS file system using mkfs.ubifs. This will allow your flasher to not have to worry about 0xFF bytes at the end of PEBs, which is particularly useful if you need to use an industrial flash programmer to write a UBI image. More information is available here.

Marking eraseblocks as bad

This section is relevant for NAND flashes as well as other flashes which exhibit bad eraseblocks. UBI marks physical eraseblocks as bad in the following 2 scenarios:

  1. an eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
  2. the erase operation failed with EIO error, in which case the eraseblock is marked as bad immediately.

The torturing is done in the background for the purpose of detecting whether the physical eraseblock is actually bad. The write failure could have occurred for one of many reasons, including bugs in the driver or in the upper level stuff like the file system (e.g., the FS mistakenly writes many times to the same NAND page). During the torturing UBI does the following:

The eraseblock is not marked as bad if it survives the torture test. However, a bit-flip during the torture test is a good reason to mark the eraseblock as bad. Please, refer to the torture_peb() function for detailed information.

Scalability issues

Unfortunately, UBI performance scales linearly with flash size. UBI initialization time is directly proportional to the number of physical eraseblocks on the flash. This means that the larger the flash, the more time it takes for UBI to initialize (i.e., to attach the MTD device). Note: Starting with Linux v3.7 UBI offers an optional and experimental feature called "fastmap", which allows attaching in nearly constant time, see Fastmap. The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:

Here are some figures:

Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.

Implementation details

In general, UBI needs three tables to operate:

The volume table is maintained on-flash. It changes only when UBI volumes are created, deleted, or re-sized. These are rare and not time-critical operations, so UBI can afford slow and simple volume table management.

The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods should be fast and efficient.

UBI could maintain the EBA and EC tables on the flash media, but this would inevitably involve journaling, journal replay, journal commit, etc. In other words, this would introduce a lot of complexity. But UBI would be logarithmically scalable in this case.

One of the UBI requirements was simplicity of the on-flash format, because UBI authors had to read UBI volumes from the boot-loader and they had very tight constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.

Therefore UBI does not maintain the EBA and EC tables on the flash media. Instead, it builds them in RAM each time it attaches the MTD device. This means that UBI has to scan the entire flash and read the EC and VID headers from each PEB in order to build the in-RAM EC and EBA tables.

The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the overhead is 1.5%-3% of flash space in case of a NAND flash with 2KiB NAND page and a 128KiB eraseblock). The advantages of this simplicity are a simple binary format as well as robustness.

Nonetheless, someday we might see a "UBI2" which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash formats, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.

Reserved blocks for bad block handling (only for NAND chips)

It is well-known that NAND chips have some number of physical eraseblocks marked as bad by the manufacturer. During the lifetime of the NAND device, other bad blocks may appear. Nonetheless, manufacturers usually guarantee that the first few physical eraseblocks are not bad and that the total number of bad PEBs will not exceed a certain number. For example, a 256MiB (2048 128KiB PEBs) Samsung OneNAND chip is guaranteed to have not more than 40 bad 128KiB PEBs during its endurance lifetime. This is a very common value for NAND devices: 20 bad PEBs per 1024 PEBs, which is about 2% of flash size.

This ratio of 20/1024 is the default number of blocks that UBI reserves for a UBI device. The reservation is calculated from the size of the whole chip, so if there are 2 UBI devices on a 4096 PEB NAND, 80 PEBs will be reserved for each UBI device. This may appear to be a waste of space, but, given that bad blocks can appear anywhere on the NAND flash, and are not equally distributed on the whole device, it's the safer way. So instead of using several UBI devices on a NAND flash, it's more space-efficient to use only one UBI device which contains several UBI volumes.

The default value of 20 PEBs reserved per 1024 PEBs is a kernel config option. For each UBI device, this value can be adjusted via a kernel parameter or a ubiattach parameter (since kernel 3.7).
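The reservation rule can be written down as a short calculation. This is a sketch under assumptions: the function name is hypothetical, and rounding up for chips whose PEB count is not a multiple of 1024 is a choice made here, not something the text states.

```c
#include <assert.h>

/*
 * Number of PEBs reserved for bad block handling on one UBI device,
 * given the PEB count of the whole NAND chip and the "per 1024" limit
 * (20 by default according to the text).
 */
unsigned int reserved_bad_pebs(unsigned int chip_pebs, unsigned int per1024)
{
        unsigned long long scaled;

        assert(per1024 <= 1024);

        /* Round up so that a partial group of 1024 PEBs is not ignored. */
        scaled = (unsigned long long)chip_pebs * per1024 + 1023;
        return (unsigned int)(scaled / 1024);
}
```

For the 4096 PEB chip from the example this gives 80 PEBs per UBI device, so two UBI devices on that chip reserve 160 PEBs in total, while a single UBI device with two volumes reserves only 80.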

Volume auto-resize

When a UBI image is to be flashed during production, one should specify exact sizes for all volumes (the sizes are stored in the UBI volume table). However, in practice, in the embedded world, we like to have one read-only volume for the root file system and one read/write volume for however much space is left (logs, user data, etc.). If the size of the root file system is fixed, the size of the second one can vary from one product to another (given different flash sizes).

This is the purpose of the auto-resize flag. If the volume has the auto-resize flag enabled, its size will expand to fill the remaining unused space when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize flag and the volume is not re-sized anymore. The auto-resize flag is stored in the volume table and only one volume may be marked as auto-resize.
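As a toy model of this behaviour, the sketch below grows the single flagged volume by all unused PEBs and then clears the flag, so a second run changes nothing. The struct and function names are hypothetical, and sizes are counted in PEBs for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified volume table record. */
struct toy_vol {
        unsigned int reserved_pebs; /* current volume size in PEBs */
        int autoresize;             /* non-zero if the flag is set */
};

/*
 * Expand the auto-resize volume (if any) into the unused space.
 * Returns the index of the resized volume, or -1 if no volume had the
 * flag set.
 */
int apply_autoresize(struct toy_vol *vols, size_t count,
                     unsigned int *avail_pebs)
{
        size_t i;

        assert(vols != NULL && avail_pebs != NULL);

        for (i = 0; i < count; i++) {
                if (!vols[i].autoresize)
                        continue;
                vols[i].reserved_pebs += *avail_pebs;
                *avail_pebs = 0;
                vols[i].autoresize = 0; /* the volume is resized only once */
                return (int)i;
        }
        return -1;
}
```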

UBI operations

LEB un-map

The LEB un-map operation is implemented by the ubi_leb_unmap() UBI kernel API function. And starting from kernel version 2.6.29 the un-map operation is available to user-space programs via the UBI_IOCEBUNMAP ioctl command. The ioctl should be called for UBI volume character devices.

The LEB un-map operation:

UBI returns all 0xFF bytes when an un-mapped LEB is read, so the un-map operation may be considered as a very fast erase operation. But there is one aspect of which UBI programmers have to be aware:

Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in the case of unclean reboots: if a reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device at the next bootup. Indeed, UBI will scan the MTD device and find the P which refers to L, and it will add this mapping information to the EBA table.

However, once you write any data to L, or map it using the LEB map operation, it gets mapped to a new PEB and the old contents are gone forever, because even in the case of an unclean reboot UBI would pick the newer mapping for L.

Implementation details

This section describes how UBI distinguishes between older and newer versions of an LEB in the case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P1, which means UBI schedules P1 for erasure. Then we write some data to L, which means that UBI finds another PEB P2, maps L to P2, and writes the data to P2. If an unclean reboot happens before P1 is physically erased, but after the write operation, we end up with 2 PEBs (P1 and P2) mapped to the same LEB L.

To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is incremented each time a PEB is mapped to a LEB and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches MTD devices, it initializes the global sequence number variable to the highest value found in the existing VID headers plus one.

In the above situation, UBI simply selects a PEB with the highest sequence number (P2) and drops the PEB with the lower sequence number (P1).
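The selection rule can be illustrated with a small sketch. The struct vid_info type is a hypothetical subset of what a VID header carries, and the function name is made up. The sketch covers only the simple case described above (pick the highest sequence number) and also shows how the global counter would be re-initialized after scanning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the information found in one VID header. */
struct vid_info {
        int peb;        /* PEB the header was read from */
        int vol_id;     /* volume ID the PEB belongs to */
        int lnum;       /* LEB number the PEB is mapped to */
        uint64_t sqnum; /* global sequence number at mapping time */
};

/*
 * Among all scanned headers, find the PEB holding the newest version of
 * LEB (vol_id, lnum). Returns the PEB number, or -1 if the LEB is not
 * mapped. *next_sqnum receives the value for the global counter:
 * highest sequence number found plus one.
 */
int pick_newest_peb(const struct vid_info *found, size_t count,
                    int vol_id, int lnum, uint64_t *next_sqnum)
{
        uint64_t max_sqnum = 0, best_sqnum = 0;
        int best_peb = -1;
        size_t i;

        assert(next_sqnum != NULL);
        assert(found != NULL || count == 0);

        for (i = 0; i < count; i++) {
                if (found[i].sqnum > max_sqnum)
                        max_sqnum = found[i].sqnum;
                if (found[i].vol_id != vol_id || found[i].lnum != lnum)
                        continue;
                if (best_peb < 0 || found[i].sqnum > best_sqnum) {
                        best_peb = found[i].peb;
                        best_sqnum = found[i].sqnum;
                }
        }
        *next_sqnum = max_sqnum + 1;
        return best_peb;
}
```

In the P1/P2 scenario, both headers refer to L, P2 carries the larger sequence number, so P2 is returned and P1 can be scheduled for erasure.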

Note, the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when the unclean reboot happens during an atomic LEB change operation. In this case it is not enough to just pick the newer PEB, it is also necessary to make sure the data reached the new PEB.

LEB map

The LEB map operation maps a previously un-mapped logical eraseblock (LEB) to a physical eraseblock (PEB). For example, if the operation is run for LEB A, UBI will find an appropriate PEB, write a VID header to the PEB, and amend the in-memory EBA table. The VID header will now refer to LEB A. After this operation all I/O to LEB A will actually go to the mapped PEB.

The LEB map operation is available via the ubi_leb_map() UBI kernel API function, or via the UBI_IOCEBMAP volume character device ioctl command. However, this ioctl interface is available only starting from kernel version 2.6.29.

One of the functions of the LEB map operation is to make sure old LEB contents are removed. As was explained in this section, when an LEB is un-mapped, the corresponding PEB is not erased immediately. If an unclean reboot happens, the LEB may become mapped to the same PEB again, after the UBI attaches the MTD device. So, if you map the LEB immediately after un-mapping it, you are guaranteed that the old LEB contents are deleted. In other words, the LEB is guaranteed to contain only 0xFF bytes after the map operation returns, even in case of an unclean reboot.

Please, use the LEB map operation sparingly. Do not use it unless it is really needed, because mapped LEBs add more overhead on the UBI wear-leveling sub-system compared to un-mapped LEBs. Indeed, if an LEB is un-mapped, there is no PEB which contains this LEB's data, and the wear-leveling sub-system does not have to move any data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB, there is one more PEB for the wear-leveling sub-system to care about, and one more LEB to re-map to another PEB if the erase counter of the current PEB becomes too low (then the LEB is re-mapped to a PEB with a higher erase counter and the old PEB is used for other operations).

Volume update

The volume update operation is useful for device software updates. The operation changes the contents of the whole UBI volume with new contents. But if it gets interrupted in the middle of the update, the volume goes into the "corrupted" state and further I/O on the volume ends up with an EBADF error. The only way to get the volume back to the normal state is to start a new volume update operation and finish it.

Because an interrupted update leaves the volume in the "corrupted" state, software can detect the interruption and re-start the update, for example with the help of a "mirror" volume which has the same contents, or by showing a dialog window which informs the user about the problem and requests re-flashing. In contrast, it is difficult to detect interrupted updates when using raw MTD partitions.

The volume update operation is available via the user-space UBI interface and not available via the UBI kernel API. To update a volume, you first have to call the UBI_IOCVOLUP ioctl on the corresponding UBI volume character device node and pass it a pointer to a 64-bit value containing the length of the new volume contents in bytes. Then this number of bytes has to be written to the volume character device node. Once the last byte has been sent to the character device node, the update operation is finished. Conceptually, the sequence (in pseudo-code) is:

fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCVOLUP, &image_size);  /* image_size is a 64-bit byte count */
write(fd, buf, image_size);            /* may be split into several write() calls */
close(fd);

See include/mtd/ubi-user.h for more details. Bear in mind, the old contents of the volume are not preserved if the update is interrupted. Also, you do not have to write all the new data in one go. It is OK to call the write() function an arbitrary number of times and pass arbitrary amounts of data each time. The operation will be finished after all the data has been written. If the last write operation contains more bytes than UBI expects, the extra is ignored.
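Since the data may be sent in arbitrary pieces, the write step is usually a loop. The helper below is a generic POSIX sketch with a hypothetical name. It contains no UBI-specific calls and assumes the caller has already issued the UBI_IOCVOLUP ioctl on fd when it is used for a volume update.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Send len bytes from buf to fd, at most chunk bytes per write() call.
 * Short writes are continued and EINTR is retried.
 * Returns 0 on success, -1 on error (errno is set by write()).
 */
int write_in_chunks(int fd, const void *buf, size_t len, size_t chunk)
{
        const char *p = buf;

        assert(chunk > 0);
        assert(buf != NULL || len == 0);

        while (len > 0) {
                size_t n = len < chunk ? len : chunk;
                ssize_t ret = write(fd, p, n);

                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;
                }
                p += ret;
                len -= (size_t)ret;
        }
        return 0;
}
```

The total passed to the helper should equal the length given to the ioctl, because the update finishes only when that many bytes have arrived.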

A special case of the volume update operation is what we call volume truncation, which is done by the same ioctl command when the data length is zero. In this case the volume is wiped out and will contain all 0xFF bytes (all LEBs will be un-mapped).

Note, the /sys/class/ubi/ubiX_X/corrupted sysfs file reflects the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK and "1\n" if it is corrupted (i.e. if a volume update was started but was not completed).
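A program can check this file before using a volume. The sketch below reads the first character of the file and maps it to a status. The function name is hypothetical, and the path is passed in by the caller rather than constructed, so nothing is assumed about how device and volume numbers appear in it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a "corrupted" sysfs file.
 * Returns 1 if the volume is corrupted, 0 if it is OK, and -1 if the
 * file cannot be read or does not start with '0' or '1'.
 */
int ubi_vol_is_corrupted(const char *sysfs_path)
{
        FILE *f;
        int c;

        assert(sysfs_path != NULL);

        f = fopen(sysfs_path, "r");
        if (f == NULL)
                return -1;
        c = fgetc(f);
        fclose(f);

        if (c == '0')
                return 0;
        if (c == '1')
                return 1;
        return -1;
}
```

A result of 1 means an update was started but never completed, so the caller should start a new volume update rather than try to read the volume.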

The volume update operation is not atomic: the previous contents of the volume are not preserved if the update is interrupted. However, UBI does provide atomic volume updates by means of the volume re-name operation.

Volume updates are implemented with the help of update markers. Once the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update marker flag for the volume in the corresponding record of the UBI volume table. At this point the volume is wiped, and UBI waits for the user to send the data. The update marker is cleared only when all the data has been sent and written to the flash successfully. If the update is interrupted (e.g., unclean reboot, crash of the update application, etc.), the update marker stays set and the volume is treated as "corrupted" until a later update operation completes successfully.

Atomic LEB change

The atomic LEB change operation changes the contents of an LEB atomically, so that the old contents are preserved should the operation be interrupted. In other words, the LEB will always contain either the old contents or the new contents. This functionality is available via the ubi_leb_change() kernel API call.

The user-space interface for this operation was added in kernel version 2.6.25. Its functionality is available to user-space via the UBI_IOCEBCH ioctl command. You have to pass a pointer to a properly-filled request object of struct ubi_leb_change_req type. This object stores the LEB number to change and the length of the new contents. Then you have to write the specified number of bytes to the volume character device. Note the similarity to the volume update operation. Conceptually, the sequence (in pseudo-code) is:

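struct ubi_leb_change_req req;

memset(&req, 0, sizeof(req));          /* zero the unused fields */
req.lnum = lnum_to_change;             /* LEB number to change */
req.bytes = data_len;                  /* length of the new contents */
fd = open("/dev/my_volume", O_RDWR);   /* volume character device */
ioctl(fd, UBI_IOCEBCH, &req);          /* start the atomic LEB change */
write(fd, data_buf, data_len);         /* send the new contents */
close(fd);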

If, for some reason, the user does not write the specified number of bytes to the file descriptor before closing the file, the operation is cancelled and the old contents of the LEB are preserved.

Similarly to the volume update operation, it does not matter how many times the write() function is called and how much data it passes to the UBI volume each time. The atomic LEB change operation finishes only once the last data byte has arrived.
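Because the operation only finishes once the last byte has arrived, user-space code usually wants a loop that keeps calling write() until the whole buffer has been sent. The following is a generic POSIX sketch of such a loop. The helper name write_all() is an assumption made for illustration and is not part of the UBI interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write exactly `len` bytes from `buf` to `fd`, retrying after short
 * writes and after EINTR. Returns 0 on success and -1 on error, with
 * errno describing the failure.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, try again */
            return -1;
        }
        if (n == 0) {
            errno = EIO;        /* no progress: give up rather than spin */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

If write_all() fails, closing the file descriptor without sending the remaining bytes cancels the atomic LEB change, as described above.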

The atomic LEB change operation might be very useful for file-systems. For example, UBIFS uses this functionality when it commits the file-system index. This behaviour could also be used to create an FTL layer on top of UBI (see here for a description of the idea).

Keep in mind that the atomic LEB change operation calculates the CRC-32 checksum of the new data, so it has some overhead compared to the "LEB erase" + "LEB write" sequence. The volume update operation does not calculate the data's CRC-32 checksum, so updating a whole volume is faster than atomically changing each of its eraseblocks. Use the atomic LEB change operation only when atomicity is really needed.

Implementation details

Suppose UBI has to change a logical eraseblock L which is mapped to a physical eraseblock P1. UBI always keeps one free PEB reserved for the atomic LEB change operation; call it P2. Before the operation, P1 stores the current contents of the LEB L and P2 is free (it contains only the EC header and 0xFF bytes). The new data is written to P2, not to P1, so should anything go wrong, the old contents of the LEB are preserved.

When the operation finishes, UBI un-maps L from P1, maps it to P2, and schedules P1 for erasure. If the operation is interrupted, L continues to be mapped to P1 and P2 is scheduled for erasure.

If an unclean reboot happens halfway through the atomic LEB change operation, UBI has to preserve the L -> P1 mapping and erase P2 when it attaches the MTD device on the next boot. But if an unclean reboot happens just after the atomic LEB change operation finishes, but before P1 is physically erased, UBI has to preserve the L -> P2 mapping and erase P1. In both cases UBI finds two PEBs which claim to belong to L, and it has to tell the two cases apart.

To resolve situations like that, UBI calculates the CRC-32 checksum of the new contents of the LEB before it is written to the flash, and stores it in the VID header (together with the data length). When UBI finds two PEBs P1 and P2 mapped to the same LEB L during initialization, it selects the one with the higher sequence number (P2) only if the data CRC-32 checksum is correct (which means that all data has been written to the flash media); otherwise it selects the PEB with the lower sequence number (P1). Of course, UBI has to read the LEB contents in order to verify the CRC-32 checksum.
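The selection rule can be sketched as follows. This is an illustrative model only: struct peb_info and select_peb() are invented names, the fields are a simplified view of what the VID header provides, and crc32_calc() is a plain bitwise CRC-32 which is not guaranteed to be bit-identical to the checksum UBI stores on flash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a PEB found during attach (illustrative fields). */
struct peb_info {
    uint64_t sqnum;       /* sequence number from the VID header */
    uint32_t data_crc;    /* data CRC-32 stored in the VID header */
    size_t data_len;      /* data length stored in the VID header */
    const uint8_t *data;  /* LEB contents read back from the flash */
};

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Pick the PEB which should stay mapped to the LEB. */
const struct peb_info *select_peb(const struct peb_info *a,
                                  const struct peb_info *b)
{
    assert(a != NULL && b != NULL);

    const struct peb_info *newer = a->sqnum > b->sqnum ? a : b;
    const struct peb_info *older = newer == a ? b : a;

    /* The newer copy wins only if all of its data reached the flash. */
    if (crc32_calc(newer->data, newer->data_len) == newer->data_crc)
        return newer;
    return older;
}
```

Note that the CRC-32 has to be computed over the whole data area of the newer PEB, which is the read overhead mentioned above.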

Fastmap

Fastmap is an experimental and optional UBI feature which can be enabled by setting CONFIG_MTD_UBI_FASTMAP to 'y'. Once enabled, UBI evaluates the module parameter "fm_autoconvert". If it is set to 1 (the default is 0), UBI automatically enables fastmap for any attached image. This means UBI creates a new internal volume holding the fastmap data, so that the next time the image is attached the fast attach mode can be used.

In the default configuration UBI will use the information stored in this fastmap volume to accelerate the attach procedure. If you want to test fastmap, set fm_autoconvert to 1 and attach a volume.

The following settings are possible:

CONFIG_MTD_UBI_FASTMAP | fm_autoconvert | Result
n                      | 0              | fastmap is completely disabled
y                      | 0              | UBI will use the fastmap data if it exists on an image, but will not install a fastmap on images that don't already have it
y                      | 1              | UBI will use the fastmap data if it exists on an image, and a fastmap is automatically created on all attached images

Backwards compatibility

The fastmap on-disk data structure makes use of delete-compatible volumes, so fastmap-enabled images are fully backwards compatible with UBI implementations which do not support fastmap. Such a kernel will remove the fastmap volumes and continue with scanning. This applies not only to kernel versions up to v3.6, but also to v3.7 and later with this option disabled.

Technical design

An on-disk fastmap contains all the information required to attach the whole image, including: all erase counter values, a list of all PEBs and their state, a list of all volumes and their current EBA, etc. To avoid too many writes of the fastmap, it also contains a list of PEBs which may have changed and need a full scan while attaching. This list is called the "fastmap pool" and has a fixed size of 5% of the total number of PEBs. By design UBI needs to write the fastmap data only if the pool contains no free PEBs. Without the pool, it would have to write the fastmap each time the EBA of a volume changed.
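To make the benefit of the pool concrete, here is a small back-of-the-envelope model. It is a sketch under simplifying assumptions: the function names are invented, every EBA change is assumed to consume exactly one PEB from the pool, and the exact rounding or clamping of the pool size in a real implementation may differ.

```c
#include <assert.h>

/* Pool size as described above: 5% of all PEBs (integer arithmetic). */
unsigned long fastmap_pool_size(unsigned long total_pebs)
{
    return total_pebs * 5 / 100;
}

/*
 * Count the fastmap writes needed for `eba_changes` mapping changes when
 * the fastmap is rewritten (and the pool refilled) only once the pool
 * has no free PEBs left.
 */
unsigned long fastmap_writes(unsigned long eba_changes,
                             unsigned long pool_size)
{
    unsigned long free_in_pool = pool_size;
    unsigned long writes = 0;

    assert(pool_size > 0);
    for (unsigned long i = 0; i < eba_changes; i++) {
        if (free_in_pool == 0) {
            writes++;                   /* pool exhausted: new fastmap */
            free_in_pool = pool_size;
        }
        free_in_pool--;                 /* this change takes one PEB */
    }
    return writes;
}
```

In this model a device with 4096 PEBs gets a pool of 204 PEBs, and 1000 EBA changes cause only 4 fastmap writes, whereas writing the fastmap on every change would need 1000.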

A fastmap consists of a super-block (also known as an anchor PEB) and payload data which can live on any PEB. The anchor PEB has to be located within the first 64 PEBs on the MTD device. It contains pointers to the remaining PEBs which carry the actual fastmap data. On modern NAND chips the whole fastmap fits into a single PEB. Hence, the anchor PEB points to itself. After loading the fastmap data, the UBI attach information structure is created from it.

The attach process works as follows:

  1. UBI tries to find the fastmap anchor PEB. If no anchor PEB is found, UBI performs a traditional full scan
  2. It follows the pointers stored in the anchor PEB and reads the fastmap payload data
  3. Then it performs a traditional scan only on PEBs in the pool instead of all PEBs

If UBI detects that the fastmap data is invalid or corrupt, it automatically falls back to scanning mode and performs a full scan. UBI detects invalid fastmap data by means of a CRC-32 checksum and consistency checks of the internal UBI structures.
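The attach steps and the fallback rule can be summarized in a small decision function. This is a simplified model with invented names, not the kernel implementation: it only counts how many PEBs need a traditional scan, and it ignores the cost of reading the anchor and the fastmap payload.

```c
#include <assert.h>
#include <stdbool.h>

/* The anchor PEB must be located within the first 64 PEBs. */
#define ANCHOR_SEARCH_LIMIT 64

/*
 * Returns how many PEBs need a traditional scan during attach.
 * anchor_peb: index of the anchor PEB, or -1 if the image has no fastmap.
 * fastmap_ok: outcome of the CRC-32 and consistency checks.
 */
unsigned int pebs_to_scan(int anchor_peb, bool fastmap_ok,
                          unsigned int total_pebs, unsigned int pool_size)
{
    assert(pool_size <= total_pebs);

    /* Step 1: no anchor within the search window means a full scan. */
    if (anchor_peb < 0 || anchor_peb >= ANCHOR_SEARCH_LIMIT)
        return total_pebs;

    /* Step 2: payload is read; invalid data falls back to a full scan. */
    if (!fastmap_ok)
        return total_pebs;

    /* Step 3: only the PEBs in the pool are scanned. */
    return pool_size;
}
```

For a device with 4096 PEBs and a pool of 204 PEBs, a valid fastmap reduces the scan from 4096 PEBs to 204 in this model.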

The fastmap data is written to the device each time the fastmap pool is exhausted (i.e. no free PEBs are available), the volume layout changes, or the image is detached. The fastmap has to be written at detach time because otherwise all erase counter modifications since the last fastmap write would be lost.

Overhead

A fastmap-enabled UBI will reserve enough PEBs to carry two complete fastmaps. In practice on modern NAND chips two PEBs are reserved for fastmap.

There is also some runtime overhead. In order to guarantee that the new fastmap is valid and consistent, UBI needs to make sure that all I/O which would cause EBA changes is blocked while attaching. Depending on the specific flash chips, this can take up to one second. Therefore, fastmap only makes sense on fast and large flash devices where a full scan would otherwise take too long. For example, on 4 GiB NAND chips a full scan takes several seconds, whereas a fast attach needs less than one second.

Notes

Enabling fastmap does not guarantee that every attach process will be done in optimal time. In some situations a full scan is still needed. This can happen in two cases: (i) if an unexpected reboot occurs while a fastmap is being written to the flash, or (ii) if UBI runs out of PEBs while writing the fastmap. The latter case can happen if a large number of I/O errors occur while writing and UBI cannot find enough usable PEBs.

R/O block devices on top of UBI volumes

UBI allows the creation of block devices on top of UBI volumes with the following limitations:

Despite these limitations, a block device is still very useful for the purpose of mounting read-only, regular file systems on top of UBI volumes. Take, for example, squashfs, which can be used as a lightweight read-only rootfs on top of a NAND device. In this case, the UBI layer will take care of low-level details such as bit-flip handling and wear-levelling.

Usage

Creating and destroying block devices on a UBI volume is somewhat similar to attaching MTD devices to UBI. You can either use the block UBI module parameter or use the "ubiblock" user-space tool.

In order to create a block device at bootup time (e.g. to mount the rootfs on such a block device) you can specify the block parameter as a kernel boot argument:

ubi.mtd=5 ubi.block=0,0 root=/dev/ubiblock0_0

There are several ways of specifying a volume:

If you've built UBI as a module you can use the following parameters at module load time:

$ modprobe ubi mtd=/dev/mtd5 block=/dev/ubi0_0

A block device can also be created/removed dynamically at runtime, using the ubiblock user-space tool:

$ ubiblock --create /dev/ubi0_0
$ ubiblock --remove /dev/ubi0_0

UBI stress testing

If enabled at configure time (before building the code), mtd-utils includes user-space tools that can be used to stress test the UBI stack. This is useful if you want to test the stability and correctness of your particular UBI stack implementation.

Example: running various UBI tests:

$ flash_erase /dev/mtd3 0 0
$ ubiattach --mtdn 3
$ /usr/libexec/mtd-utils/runubitests.sh /dev/ubi0

More documentation

Unfortunately, no complete, up-to-date design documents exist for UBI. But there is an old UBI design document which has some out-of-date information which might still be of limited use: ubidesign.pdf.

There is also a PowerPoint UBI presentation available: ubi.ppt. Note, this document contains a lot of animations, so be sure to view it in "slide show" mode (F5 key) so that the animations will be played.

More information may be found in the FAQ section.

And of course just reading the UBI interface C header files (which are well commented) may help: include/mtd/ubi-user.h contains the user-space interface definition (namely, it defines UBI ioctl commands and the associated data structures), include/linux/mtd/ubi.h defines the kernel API, and drivers/mtd/ubi/kapi.c contains comments for each kernel API function (just above the body of the function).
