UBI - Unsorted Block Images
Table of contents
- Big red note
- Overview
- Power-cuts tolerance
- Kernel source code
- Mailing list
- User-space tools
- UBI headers
- UBI volume table
- Minimum flash input/output unit
- NAND flash sub-pages
- UBI headers position
- Flash space overhead
- Saving erase counters
- How UBI flasher should work
- Marking eraseblocks as bad
- Scalability issues
- Reserved blocks for bad block handling (only for NAND chips)
- Volume auto-resize
- UBI operations
- Fastmap
- R/O block devices on top of UBI volumes
- UBI stress testing
- More documentation
Big red note
People are often confused about UBI, which is why this section was created. Please, realize that:
- UBI is not a Flash Translation Layer (FTL), and it has nothing to do with FTL;
- UBI works with bare flashes, and it does not work with consumer flashes like MMC, RS-MMC, eMMC, SD, mini-SD, micro-SD, CompactFlash, MemoryStick, USB flash drive, etc.; instead, UBI works with raw flash devices, which are mostly found in embedded devices like mobile phones, etc.
Please, do not be confused. Read here for more information about how raw flash devices are different from FTL devices.
Overview
UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for raw flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.
In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and transparent error handling.
An UBI volume is a set of consecutive logical eraseblocks (LEBs). Each logical eraseblock is dynamically mapped to a physical eraseblock (PEB). This mapping is managed by UBI and is hidden from users and higher-level software. UBI is the base mechanism which provides global wear-leveling, per-physical eraseblock erase counters, and the ability to transparently move data from more worn-out physical eraseblocks to less worn-out ones.
The UBI volume size is specified when a volume is created, but may later be changed (volumes are dynamically re-sizable). There are user-space tools which may be used to manipulate UBI volumes.
There are 2 types of UBI volumes: dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC-32 checksums, while dynamic volumes are read-write and the upper layers (e.g., a file-system) are responsible for ensuring data integrity.
Static volumes are typically used for the kernel, initramfs, and dtb. Larger static volumes may incur a significant penalty when opening, as the CRC-32 needs to be calculated at this time. If you are looking to use static volumes for anything besides the kernel, initramfs, or dtb you are likely doing something wrong and would be better off using a dynamic volume instead.
UBI is aware of bad eraseblocks (i.e. portions of flash which wear out over time) and frees upper-level software from having to handle bad eraseblocks itself. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it with a good physical eraseblock. UBI moves the data from newly discovered bad physical eraseblocks to good ones. The result is that users of UBI volumes do not notice I/O errors since UBI takes care of them transparently.
NAND flashes are also susceptible to bit-flip errors which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks which have bit-flips to other physical eraseblocks. This process is called scrubbing. Scrubbing is done transparently in the background and is hidden from upper layers.
Here is a short list of UBI's main features:
- UBI provides volumes which may be dynamically created, removed, or re-sized;
- UBI implements wear-leveling across the entire flash device (i.e., you might think you're continuously writing/erasing the same logical eraseblock of an UBI volume, but UBI will spread this to all physical eraseblocks of the flash chip);
- UBI transparently handles bad physical eraseblocks;
- UBI minimizes the chances of losing data by means of scrubbing.
Here is a comparison of MTD partitions and UBI volumes. They are somewhat similar because:
- both consist of eraseblocks - logical eraseblocks in the case of UBI volumes, and physical eraseblocks in the case of MTD partitions;
- both support three basic operations: read, write, and erase.
But UBI volumes have the following advantages over MTD partitions:
- UBI implements wear-leveling, so users do not have to care about this at all, which means the upper level software may be simpler;
- UBI handles bad eraseblocks, which also leads to simpler upper-level software;
- UBI volumes are dynamic in a sense that they may be created, removed or re-sized dynamically, while MTD partitions are static;
- UBI handles bit-flips which again makes the upper level software simpler;
- UBI provides a volume update operation which makes it easy to detect interrupted software updates and recover;
- UBI provides an atomic logical eraseblock change operation which allows changing the contents of a logical eraseblock without losing the data if an unclean reboot happens during the operation; this might be very useful for the upper-level software (e.g., for a file-system);
- UBI has an un-map operation, which just un-maps a logical eraseblock from the physical eraseblock, schedules the physical eraseblock for erasure, and returns; this is very quick and frees upper level software from implementing their own mechanisms to defer erasures (e.g., JFFS2 has to implement such mechanisms).
UBI also provides a block device that allows regular, block-oriented file systems to be mounted on top of an UBI volume. This is possible because UBI handles bad-blocks transparently.
There is an additional driver called gluebi which emulates MTD devices on top of UBI volumes. This looks a little strange, because UBI works on top of an MTD device, then gluebi emulates other MTD devices on top, but this actually works and makes it possible for existing software (e.g., JFFS2) to run on top of UBI volumes. However, new software may benefit from the advanced UBI features and let UBI solve many issues which the flash technology imposes.
Power-cuts tolerance
Both UBI and UBIFS are designed with tolerance to power-cuts in mind.
UBI has an internal debugging infrastructure that can emulate power failures for testing. The advantage of the emulation is that it emulates power failures at the critical points where control data structures are written to the device, whereas the probability of interrupting the system at those precise moments with physical power-cut testing is rather low.
Kernel source code
UBI has been part of the mainline Linux kernel since version 2.6.22. The UBI git tree may be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs.git/
Mailing list
You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list.
User-space tools
UBI user-space tools, as well as other MTD user-space tools, are available from the following git repository:
http://git.infradead.org/mtd-utils.git
This section provides information about how to compile the whole mtd-utils repository tree. You should find the UBI tools under the ubi-utils sub-directory.
The repository contains the following UBI tools:
- ubinfo - provides information about UBI devices and volumes found in the system;
- ubiattach - attaches MTD devices (which describe raw flash) to UBI, which creates the corresponding UBI devices;
- ubidetach - detaches MTD devices from UBI devices (the opposite of what ubiattach does);
- ubimkvol - creates UBI volumes on UBI devices;
- ubirmvol - removes UBI volumes from UBI devices;
- ubiblock - manages block interfaces for UBI volumes. See here for more information;
- ubiupdatevol - updates UBI volumes; this tool uses the UBI volume update feature which leaves the volume in "corrupted" state if the update was interrupted; additionally, this tool may be used to wipe out UBI volumes;
- ubicrc32 - calculates the CRC-32 checksum of a file with the same initial seed as UBI would use;
- ubinize - generates UBI images;
- ubiformat - formats empty flash, erases flash and preserves erase counters, flashes UBI images to MTD devices;
- mtdinfo - reports information about MTD devices found in the system.
All UBI tools support an "-h" option which prints basic usage information.
Note, the ubiattach and ubidetach tools won't work if the kernel version is less than 2.6.25, because corresponding UBI features did not exist in these older kernels.
UBI headers
UBI stores 2 small 64-byte headers at the beginning of each non-bad physical eraseblock:
- erase counter header (or EC header) which contains the erase counter of the physical eraseblock (PEB) plus other information;
- volume identifier header (or VID header) which stores the volume ID and the logical eraseblock (LEB) number to which this PEB belongs.
This is why logical eraseblocks are smaller than physical eraseblocks - the headers take some flash space.
All UBI headers are protected by a CRC-32 checksum. Please, refer to the drivers/mtd/ubi/ubi-media.h file in the linux kernel for more information about the header's contents.
When UBI attaches an MTD device, it has to scan it, read all headers, check the CRC-32 checksums, and store erase counters and the logical-to-physical eraseblock mapping information in RAM. Please, refer to this section for information about scalability issues related to this.
After UBI has erased a PEB, it increments the erase counter value and writes it to the EC header. This means that PEBs always have a valid EC header, except for the short period of time after the erasure and before the EC header is written. Should an unexpected reboot happen during this short period of time, the EC header is lost or becomes corrupted. In this case UBI writes a new EC header with an average erase counter just after the MTD device scanning is done.
The VID header is written to the PEB when UBI associates it with an LEB. Let's consider what happens to the headers during some UBI operations.
- The LEB un-map operation simply un-maps the LEB from the PEB and schedules the PEB for erasure. When the PEB is erased, the EC header is written immediately. The VID header is not written.
- The LEB map operation or a write operation to an un-mapped LEB makes UBI find an appropriate PEB and writes the VID header to it (the EC header must already be there). Note, the write operation to an already-mapped LEB just writes the data straight to the PEB and does not change the UBI headers.
UBI maintains two per-PEB headers because it needs to write different information to flash at different moments of time:
- after a PEB is erased, the EC header is written immediately, which minimizes the probability of losing the erase counter due to unexpected reboots;
- when UBI associates a PEB with an LEB, the VID header is written to the PEB.
When the EC header is written to a PEB, UBI does not yet know the volume ID nor the LEB number to which this PEB will be associated. This is why UBI needs to do two separate write operations and to have two separate headers.
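The two-step header lifecycle described above can be sketched as a tiny in-RAM model. This is only an illustration under simplifying assumptions: `struct peb_model`, `peb_erase()` and `peb_map()` are hypothetical names, not kernel structures or functions, and real flash I/O, CRC handling and error paths are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-RAM model of one PEB; not a kernel structure. */
struct peb_model {
    uint64_t ec;       /* erase counter stored in the EC header */
    bool has_ec_hdr;   /* EC header present on flash */
    bool has_vid_hdr;  /* VID header present on flash */
    int vol_id;        /* meaningful only when has_vid_hdr is true */
    int lnum;          /* meaningful only when has_vid_hdr is true */
};

/* Erase the PEB, then immediately write a fresh EC header. */
static void peb_erase(struct peb_model *peb)
{
    peb->has_ec_hdr = false;   /* the erasure wipes both headers */
    peb->has_vid_hdr = false;
    peb->ec += 1;              /* the counter was kept in RAM meanwhile */
    peb->has_ec_hdr = true;    /* first write: EC header */
}

/* Associate the PEB with an LEB: a second, separate write (VID header). */
static void peb_map(struct peb_model *peb, int vol_id, int lnum)
{
    /* The EC header must already be there; the VID header must not. */
    assert(peb->has_ec_hdr && !peb->has_vid_hdr);
    peb->vol_id = vol_id;
    peb->lnum = lnum;
    peb->has_vid_hdr = true;
}
```

In this model the volume ID and LEB number only become known in `peb_map()`, which is why they cannot be part of the header written in `peb_erase()`.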
UBI volume table
The volume table is an on-flash data structure which contains information about each volume on this UBI device. The volume table is an array of volume table records. Each record contains the following information:
- volume size;
- volume name;
- volume type (dynamic or static);
- volume alignment;
- update marker (set on a volume when an update is initiated and cleared when successfully completed);
- auto-resize flag;
- CRC-32 checksum for this record.
Each record describes one UBI volume. The record index in the volume table array corresponds to the volume ID it describes, i.e., UBI volume 0 is described by record 0 in the volume table, and so on. The total number of records in the volume table is limited by the LEB size, and cannot be greater than 128. This means that UBI devices cannot have more than 128 volumes.
Every time an UBI volume is created, removed, re-sized, re-named or updated, the corresponding volume table record is changed. UBI maintains two copies of the volume table for reasons of reliability and power-cut tolerance.
Implementation details
Internally, the volume table resides in a special-purpose UBI volume which is called the layout volume. This volume consists of 2 LEBs - one for each copy of the volume table. The layout volume is an "internal" UBI volume, and users do not see it nor access it. When reading or writing the layout volume, UBI uses the same mechanisms which are used for normal user volumes.
UBI uses the following algorithm when updating a volume table record:
- Prepare an in-memory buffer with the new volume table contents.
- Un-map LEB0 of the layout volume.
- Write the new volume table to LEB0.
- Un-map LEB1 of the layout volume.
- Write the new volume table to LEB1.
- Flush the UBI work queue to make sure the PEBs corresponding to the un-mapped LEBs are erased.
When attaching the MTD device, UBI makes sure that the 2 volume table copies are equivalent. If they are not equivalent, which may be caused by an unclean reboot, UBI picks the one from LEB0 and copies it to LEB1 of the layout volume (because, according to the algorithm specified above, LEB0 is the one that is updated first and therefore considered to have the most up-to-date information). If one of the volume table copies is corrupted, UBI restores it from the other volume table copy.
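The attach-time decision described above can be sketched as a small pure function. This is a hedged illustration, not the kernel's implementation: `enum vtbl_action` and `vtbl_check()` are hypothetical names, and the "valid" flags stand in for whatever CRC-32 checks the real code performs on each copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for the attach-time volume table check. */
enum vtbl_action {
    VTBL_OK,            /* both copies valid and identical */
    VTBL_COPY_0_TO_1,   /* LEB0 wins; rewrite LEB1 from it */
    VTBL_COPY_1_TO_0,   /* LEB0 is corrupted; restore it from LEB1 */
    VTBL_UNRECOVERABLE  /* both copies are corrupted */
};

/* Decide what to do with the two volume table copies of the layout volume. */
static enum vtbl_action vtbl_check(const void *leb0, bool leb0_valid,
                                   const void *leb1, bool leb1_valid,
                                   size_t len)
{
    assert(leb0 != NULL && leb1 != NULL);

    if (!leb0_valid && !leb1_valid)
        return VTBL_UNRECOVERABLE;
    if (!leb0_valid)
        return VTBL_COPY_1_TO_0;
    if (!leb1_valid)
        return VTBL_COPY_0_TO_1;
    /* Both valid but different: LEB0 is updated first, so it is newer. */
    if (memcmp(leb0, leb1, len) != 0)
        return VTBL_COPY_0_TO_1;
    return VTBL_OK;
}
```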
Minimum flash input/output unit
UBI uses an abstract model of flash. In short, from UBI's point of view the flash (or MTD device) consists of eraseblocks, which may be good or bad. Each good eraseblock may be read from, written to, or erased. Good eraseblocks may also be marked as bad.
Flash reads and writes may only be done in multiples of the minimum input/output unit size, which depends on the flash type.
- NOR flashes usually have a minimum I/O unit size of 1 byte, because NOR flashes usually allow reading and writing single bytes (in fact, it is even possible to change individual bits).
- Some NOR flashes may have other minimum I/O unit sizes, e.g. 16 or 32 bytes in the case of ECC'd NOR flashes.
- NAND flashes usually have minimum I/O sizes of 512, 2048 or 4096 bytes, which corresponds to their page size. NAND flashes store per-page ECC codes in the OOB area, which means that whole NAND pages have to be written at once to calculate the ECC, and whole NAND pages have to be read at once to check the ECC.
The minimum I/O unit size is a very important characteristic of the MTD device. It affects many things, e.g.:
- the physical position of the VID header depends on the minimum I/O unit size, which means that the LEB size also depends on it; generally, the larger the minimum I/O unit size, the smaller the LEB size, and therefore the greater the UBI flash space overhead;
- all writes to LEBs should be aligned to the minimum I/O unit size, and should be multiples of the minimum I/O unit size; this does not apply to reads, but bear in mind that on the MTD level all reads are done in multiples of the minimum I/O unit size anyway; this is just hidden from users by buffering the read data and copying only the requested amount of bytes to the user buffer.
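The write-alignment rule from the list above fits in a one-line check. This is a minimal sketch; `leb_write_is_aligned()` is a hypothetical helper written for this illustration, not part of the UBI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a write of `len` bytes at `offset` within an LEB respects
 * the rule above: both values must be multiples of the minimum I/O unit.
 */
static bool leb_write_is_aligned(uint32_t offset, uint32_t len,
                                 uint32_t min_io_size)
{
    assert(min_io_size > 0);
    return offset % min_io_size == 0 && len % min_io_size == 0;
}
```

With a 2048-byte NAND page, a 4096-byte write at offset 2048 passes the check, while a 1000-byte write at offset 0 does not. With a 1-byte minimum I/O unit (typical NOR), every write passes.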
NAND flash sub-pages
As mentioned earlier, all UBI I/O is performed in multiples of the minimum I/O unit size, which is equivalent to the NAND device's page size (in the case of NAND flash). However, some SLC NAND flashes allow for smaller I/O units, which are called sub-pages in MTD terminology. Not all NAND devices have sub-pages.
- MLC NANDs do not have sub-pages (at least as of April 2009).
- SLC NANDs usually do have sub-pages. E.g., 512-byte NAND pages usually consist of 2x256-byte sub-pages, and 2048-byte NAND pages usually consist of 4x512-byte sub-pages.
- SLC OneNAND chips with 2048-byte NAND pages have 4x512-byte sub-pages.
If the NAND flash supports sub-pages, then ECC codes can be calculated on a per-sub-page basis, instead of a per-page basis. In this case it becomes possible to read and write sub-pages independently.
However, even though the NAND chip may support sub-pages, the NAND controller of your SoC might not. If the flash is managed by a controller which calculates ECC codes only on a per-page basis, then it is impossible to do I/O in sub-page chunks. E.g., this is the case for the OLPC XO-1 laptop - its NAND chip supports sub-pages, but the NAND controller does not.
Note, the phrase "sub-page" is an MTD term, but this is also referred to as "NOP" which stands for "number of partial programs". NOP1 NAND flashes have no sub-pages - UBI treats them as NANDs with a sub-page size equivalent to the NAND page size. NOP2 NAND flashes have 2 sub-pages (half a NAND page each), and NOP4 flashes have 4 sub-pages (a quarter of a NAND page each).
UBI utilizes sub-pages to reduce flash space overhead. This overhead is reduced if sub-pages can be used (see here). Consider a NAND flash with 128KiB eraseblocks and 2048-byte pages. If it does not have sub-pages, UBI puts the VID header at physical offset 2048, so the LEB size becomes 124KiB (128KiB minus one NAND page which stores the EC header and minus another NAND page which stores the VID header). Conversely, if the NAND flash does have sub-pages, UBI puts the VID header at physical offset 512 (the second sub-page), so the LEB size becomes 126KiB (128KiB minus one NAND page which is used for storing both UBI headers). See this section for more information about where the UBI headers are stored.
Sub-pages are only used by UBI internally, and only for storing the headers. The UBI API does not allow users to perform I/O to sub-page units. One of the reasons for this is that sub-page writes may be slow. To write a sub-page, the driver may actually write the whole NAND page, but put 0xFF bytes in the sub-pages which are not relevant to this operation. If this is the case, writing 4 sub-pages will be 4 times slower than writing the whole NAND page at once. Thus, UBI does use sub-pages for the headers, but this trick does not extend to the UBI API.
UBI headers position
The EC header always resides at offset 0 and takes 64 bytes. The VID header resides at the next available minimum I/O unit or sub-page and also takes 64 bytes. For example:
- in the case of NOR flash, which has a 1-byte minimum I/O unit, the VID header resides at offset 64;
- in the case of a NAND flash which does not have sub-pages, the VID header resides at the second NAND page;
- in the case of a NAND flash which has sub-pages, the VID header resides at the second sub-page.
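The placement rules above can be turned into a short calculation. The sketch below is an illustration under stated assumptions: `struct hdr_layout`, `hdr_layout_calc()` and the `hdr_io_size` parameter (the sub-page size if sub-pages are usable, otherwise the minimum I/O unit size) are names invented here, and the data area is assumed to start at the first minimum I/O unit after the VID header.

```c
#include <assert.h>
#include <stdint.h>

#define HDR_SIZE 64u  /* each UBI header takes 64 bytes, as stated above */

/* Hypothetical result type for this illustration. */
struct hdr_layout {
    uint32_t vid_hdr_offset;  /* where the VID header starts */
    uint32_t data_offset;     /* where LEB data starts */
    uint32_t leb_size;        /* PEB size minus the header area */
};

/* Round x up to the next multiple of a (a need not be a power of two). */
static uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) / a * a;
}

static struct hdr_layout hdr_layout_calc(uint32_t peb_size,
                                         uint32_t min_io_size,
                                         uint32_t hdr_io_size)
{
    struct hdr_layout l;

    assert(min_io_size > 0 && hdr_io_size > 0 && hdr_io_size <= min_io_size);
    /* The VID header goes to the next unit after the 64-byte EC header. */
    l.vid_hdr_offset = align_up(HDR_SIZE, hdr_io_size);
    /* Data must start on a minimum I/O unit boundary after the VID header. */
    l.data_offset = align_up(l.vid_hdr_offset + HDR_SIZE, min_io_size);
    l.leb_size = peb_size - l.data_offset;
    return l;
}
```

For a 128KiB PEB with 2048-byte pages this gives a VID header offset of 2048 and a 124KiB LEB without sub-pages, and an offset of 512 and a 126KiB LEB with 512-byte sub-pages, matching the figures given in the sub-pages section. For NOR with a 1-byte unit the VID header lands at offset 64.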
Flash space overhead
UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:
- 2 PEBs are used to store the volume table;
- 1 PEB is reserved for wear-leveling purposes;
- 1 PEB is reserved for the atomic LEB change operation;
- some amount of PEBs are reserved for bad PEB handling; this is applicable for NAND flash but not for NOR flash; the amount of reserved PEBs is configurable and is equal to 20 blocks per 1024 blocks by default;
- UBI stores the EC and VID headers at the beginning of each PEB; the number of bytes used for these purposes depends on the flash type and is explained below.
Let's introduce symbols:
- W - total number of physical eraseblocks on the flash chip (NB: the entire chip, not the MTD partition);
- P - total number of physical eraseblocks on the MTD partition;
- SP - physical eraseblock size;
- SL - logical eraseblock size;
- BB - number of bad blocks on the MTD partition;
- BR - number of PEBs reserved for bad PEB handling (it is 20 * W/1024 for NAND by default, and 0 for NOR and other flash types which do not have bad PEBs);
- B - MAX(BR,BB);
- O - the overhead related to storing EC and VID headers in bytes, i.e. O = SP - SL.
The UBI overhead is (B + 4) * SP + O * (P - B - 4) i.e., this amount of bytes will not be accessible for users. O is different for different flashes:
- in the case of NOR flash, which has a 1-byte minimum I/O unit, O is 128 bytes;
- in the case of a NAND flash which does not have sub-pages (e.g., MLC NAND), O is 2 NAND pages, i.e. 4KiB in the case of 2KiB NAND pages and 1KiB in the case of 512-byte NAND pages;
- in the case of a NAND flash which has sub-pages, UBI optimizes its on-flash layout and puts the EC and VID headers at the same NAND page, but different sub-pages; in this case O is only one NAND page;
- for other flashes the overhead should be 2 minimum I/O units if the minimum I/O unit size is greater than or equal to 64 bytes, and 2 times 64 bytes aligned to the minimum I/O unit size if the minimum I/O unit size is less than 64 bytes.
N.B.: the formula above counts bad blocks as a UBI overhead. The real UBI overhead is: (B - BB + 4) * SP + O * (P - B - 4).
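As a worked example, the two formulas can be evaluated directly. The sketch below uses the symbols defined above as field names; the `struct ubi_space` type and the function names are hypothetical, and BR is passed in by the caller rather than derived from W.

```c
#include <assert.h>
#include <stdint.h>

/* Field names follow the symbols introduced above. */
struct ubi_space {
    uint64_t P;   /* PEBs on the MTD partition */
    uint64_t SP;  /* physical eraseblock size in bytes */
    uint64_t SL;  /* logical eraseblock size in bytes */
    uint64_t BB;  /* bad blocks on the MTD partition */
    uint64_t BR;  /* PEBs reserved for bad PEB handling */
};

/* B = MAX(BR, BB) */
static uint64_t effective_b(const struct ubi_space *s)
{
    return s->BR > s->BB ? s->BR : s->BB;
}

/* (B + 4) * SP + O * (P - B - 4): counts bad blocks as overhead. */
static uint64_t overhead_with_bad(const struct ubi_space *s)
{
    uint64_t B = effective_b(s), O = s->SP - s->SL;

    assert(s->P >= B + 4);
    return (B + 4) * s->SP + O * (s->P - B - 4);
}

/* (B - BB + 4) * SP + O * (P - B - 4): the "real" overhead. */
static uint64_t overhead_real(const struct ubi_space *s)
{
    uint64_t B = effective_b(s), O = s->SP - s->SL;

    assert(s->P >= B + 4);
    return (B - s->BB + 4) * s->SP + O * (s->P - B - 4);
}
```

For a 1024-PEB partition with 128KiB PEBs, 126KiB LEBs, 5 bad blocks and 20 reserved PEBs, the first formula gives 5193728 bytes and the second gives 4538368 bytes.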
Saving erase counters
When working with UBI, it is important to realize that UBI stores erase counters on the flash media. Namely, each physical eraseblock has an EC (erase counter) header which stores the amount of times this physical eraseblock has been erased (see here).
It is important not to lose the erase counters, which means the tools you use to erase the flash and to write the UBI images have to be UBI-aware. The mtd-utils repository contains the ubiformat utility which does things properly.
How a UBI flasher should work
The following is a list of what a UBI flasher program has to do when erasing the flash or when writing UBI images.
- First, scan the flash and collect the erase counters. Namely, read the EC header from each PEB, check the CRC-32 checksum of the header, and save the erase counter in RAM. It is not necessary to read the VID headers. Bad PEBs should be skipped.
- Next, calculate the average erase counter. This will be used for PEBs with corrupted or missing EC headers. Such PEBs may occur due to unexpected reboots, but there shouldn't be too many of them.
- If the intention is to just erase the flash, then each PEB has to be erased and a proper EC header has to be written at the beginning of the PEB. The EC header should contain the updated erase counter. Bad PEBs should be skipped. For NAND flashes, in the case of I/O errors while erasing or writing, the PEB should be marked as bad (see here for more information on how UBI marks PEBs as bad).
- If the intention is to flash an UBI image, then the flasher should do the following for each non-bad PEB.
  - Read the contents of this PEB from the UBI image (PEB size bytes) into a buffer.
  - Strip minimum I/O units full of 0xFF bytes from the end of the buffer (the details are given below).
  - Erase the PEB.
  - Change the EC header in the buffer - put the new erase counter value there and re-calculate the CRC-32 checksum.
  - Write the buffer to the physical eraseblock.
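The erase counter bookkeeping in the first two steps can be sketched as follows. This is a simplified illustration, not the code of any real flasher: `struct scanned_peb`, `mean_ec()` and `next_ec()` are hypothetical names, and the choice to write the plain average (without incrementing it) to PEBs whose EC header was missing or corrupted is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PEB scan result kept in RAM by the flasher. */
struct scanned_peb {
    bool bad;       /* bad PEBs are skipped entirely */
    bool ec_valid;  /* EC header was read and its CRC-32 matched */
    uint64_t ec;    /* erase counter, meaningful if ec_valid */
};

/* Average erase counter over all good PEBs with a valid EC header. */
static uint64_t mean_ec(const struct scanned_peb *pebs, size_t n)
{
    uint64_t sum = 0, count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!pebs[i].bad && pebs[i].ec_valid) {
            sum += pebs[i].ec;
            count++;
        }
    }
    return count ? sum / count : 0;
}

/* Erase counter to put into the new EC header after erasing this PEB. */
static uint64_t next_ec(const struct scanned_peb *peb, uint64_t mean)
{
    assert(!peb->bad);
    return peb->ec_valid ? peb->ec + 1 : mean;
}
```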
In practice the input UBI image is usually shorter than the flash, so the flasher has to flash the used PEBs properly, and erase the unused PEBs properly.
Note, when writing an UBI image, it does not matter where eraseblocks from the input UBI image are written. For example, the first input eraseblock may be written to the first PEB, or to the second one, or to the last one.
Also note, if you create a flasher to write UBI images at the time of production (i.e., new flash, only once), then the flasher does not have to change the EC headers of the input UBI image, because this is new flash and each PEB has zero erase counter anyway. This means the production-line flasher may be simpler.
If your UBI image contains a UBIFS file system, and your flash is NAND, you may have to drop 0xFF bytes from the end of your input PEB data. This is very important, although not required for all NAND flashes. Sometimes a failure to do this may result in very unpleasant problems which might be difficult to debug later on. So we recommend always doing this.
The reason for this is that UBIFS treats NAND pages which contain only 0xFF bytes (let's refer to them as empty NAND pages) as free. For example, suppose the first NAND page of a PEB has some data, the second one is empty, the third one also has some data, the fourth one and the rest of NAND pages are empty as well. In this case UBIFS will treat all NAND pages starting from the fourth one as free, and will write data there. If the flasher program has already written 0xFF's to these pages, then any new UBIFS data will cause a second write. However, many NAND flashes require NAND pages to be written only once, even if the data contains only 0xFF bytes.
To put it differently, writing 0xFF bytes may have side-effects. What the flasher has to do is to drop all empty NAND pages from the end of the PEB buffer before writing it. It is not necessary to drop all empty NAND pages, just the last ones. This means that the flasher does not have to scan the whole buffer for 0xFF's. It is enough to scan the buffer from the end, and stop on the first non-0xFF byte. This is much faster. Here is the code from UBI which does the right thing:
/**
 * calc_data_len - calculate how much real data are stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}
This function is called before writing the buf buffer to the PEB. The purpose of this function is to drop 0xFF's from the end and prevent the situation described above. The ubi->min_io_size is the minimal I/O unit size, which is equivalent to the NAND page size.
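A user-space flasher cannot use the kernel's struct ubi_device or the ALIGN macro directly. The following is a minimal standalone adaptation of the same idea, assuming the caller passes the minimum I/O unit size explicitly; `trim_trailing_ff()` is a hypothetical name chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of `buf` must actually be written to the PEB:
 * trailing 0xFF bytes are dropped, and the result is rounded up to a
 * whole number of minimum I/O units.
 */
static size_t trim_trailing_ff(const uint8_t *buf, size_t length,
                               size_t min_io_size)
{
    size_t used = length;

    /* The input is expected to be a whole number of I/O units. */
    assert(min_io_size > 0 && length % min_io_size == 0);

    /* Scan from the end and stop at the first non-0xFF byte. */
    while (used > 0 && buf[used - 1] == 0xFF)
        used--;

    /* Round up to the minimum I/O unit size. */
    return (used + min_io_size - 1) / min_io_size * min_io_size;
}
```

With 2048-byte pages and an 8192-byte buffer whose last non-0xFF byte is at index 4100, the function returns 6144, so only the first three pages are written and the fourth is left untouched.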
By the way, we experienced similar problems with JFFS2. The JFFS2 images generated by the mkfs.jffs2 program were padded to the physical eraseblock size and were later flashed to our NAND. The flasher did not bother to skip empty NAND pages. When JFFS2 was mounted, it wrote to those NAND pages, and the writes did not fail. But later we observed weird ECC errors. It took a while to find out the problem. In other words, this is also relevant to JFFS2 images.
An alternative to this approach is to enable the "free space fixup" option when generating the UBIFS file system using mkfs.ubifs. This will allow your flasher to not have to worry about 0xFF bytes at the end of PEBs, which is particularly useful if you need to use an industrial flash programmer to write a UBI image. More information is available here.
Marking eraseblocks as bad
This section is relevant for NAND flashes as well as other flashes which exhibit bad eraseblocks. UBI marks physical eraseblocks as bad in the following 2 scenarios:
- an eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
- the erase operation failed with an EIO error, in which case the eraseblock is marked as bad immediately.
The torturing is done in the background for the purpose of detecting whether the physical eraseblock is actually bad. The write failure could have occurred for one of many reasons, including bugs in the driver or in the upper level stuff like the file system (e.g., the FS mistakenly writes many times to the same NAND page). During the torturing UBI does the following:
- erase the eraseblock;
- read it back and make sure it contains only 0xFF bytes;
- write test pattern bytes;
- read the eraseblock back and check the pattern;
- and so on for several patterns (0xA5, 0x5A, 0x00).
The eraseblock is not marked as bad if it survives the torture test. However, a bit-flip during the torture test is a good reason to mark the eraseblock as bad. Please, refer to the torture_peb() function for detailed information.
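The torture sequence can be illustrated with a simulation on an in-memory buffer. This is only a sketch of the erase/verify/write/verify loop listed above, not the kernel's torture_peb(): the `struct sim_peb` model with an optional stuck bit, the tiny eraseblock size, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PEB_SIZE 4096u  /* hypothetical, kept small for illustration */

/* Simulated eraseblock; a stuck bit models a defective cell. */
struct sim_peb {
    uint8_t data[SIM_PEB_SIZE];
    bool has_stuck_bit;
    size_t stuck_byte;   /* index of the byte holding the stuck bit */
    uint8_t stuck_mask;  /* bit(s) that always read back as 0 */
};

/* Model both erase (value 0xFF) and a full-block pattern write. */
static void sim_fill(struct sim_peb *peb, uint8_t value)
{
    assert(!peb->has_stuck_bit || peb->stuck_byte < SIM_PEB_SIZE);
    memset(peb->data, value, sizeof(peb->data));
    if (peb->has_stuck_bit)
        peb->data[peb->stuck_byte] &= (uint8_t)~peb->stuck_mask;
}

/* Read the block back and verify every byte. */
static bool sim_check(const struct sim_peb *peb, uint8_t value)
{
    for (size_t i = 0; i < sizeof(peb->data); i++)
        if (peb->data[i] != value)
            return false;
    return true;
}

/* Return true if the simulated PEB survives all patterns. */
static bool sim_torture(struct sim_peb *peb)
{
    static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };

    for (size_t i = 0; i < sizeof(patterns); i++) {
        sim_fill(peb, 0xFF);                 /* erase */
        if (!sim_check(peb, 0xFF))           /* must be all 0xFF */
            return false;
        sim_fill(peb, patterns[i]);          /* write the test pattern */
        if (!sim_check(peb, patterns[i]))    /* read back and compare */
            return false;
    }
    return true;
}
```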
Scalability issues
Unfortunately, UBI initialization time scales linearly with flash size: it is directly proportional to the number of physical eraseblocks on the flash. This means that the larger the flash, the more time it takes for UBI to initialize (i.e., to attach the MTD device).
Note: Starting with Linux v3.7 UBI offers an optional and experimental feature called "fastmap", which allows attaching in nearly constant time, see Fastmap.
The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:
- UBI scans the MTD device when attaching - it reads the EC and VID headers from every single PEB; the headers are small (64 bytes each), so this means reading 128 bytes from each PEB in the case of NOR flash or one or two NAND pages in the case of NAND flash (this depends on whether the NAND flash supports sub-pages or not); in any case this is much less than JFFS2 needs to read when it mounts MTD devices, so UBI attaches MTD devices many times faster than JFFS2 would mount a file system on the same MTD device;
- UBI calculates the CRC-32 checksum of each EC and VID header, which consumes CPU, although this is usually minor compared to the flash I/O overhead.
Here are some figures:
- a 256MiB OneNAND flash found in Nokia N800 devices attaches in less than 1 sec; the flash does support sub-pages so UBI only has to read the first 2KiB NAND page of each PEB while scanning;
- a 1GiB NAND flash found in OLPC XO-1 devices attaches in about 2 seconds; the flash is an SLC NAND and supports sub-pages, but the Cafe controller which is used in the laptop does not allow sub-page writes, so UBI has to read two 2KiB NAND pages from each PEB.
Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.
Implementation details
In general, UBI needs three tables to operate:
- volume table which contains per-volume information, like volume size, type, etc;
- eraseblock association (EBA) table which contains the logical-to-physical eraseblock mapping information; for example, when reading an LEB, UBI first looks up the table to find the corresponding PEB number, then reads from this PEB;
- erase counters (EC) table which contains the erase counter value for each physical eraseblock; the UBI wear-leveling sub-system uses this table when it needs to find, for example, a highly worn-out PEB.
The volume table is maintained on-flash. It changes only when UBI volumes are created, deleted, or re-sized, which are rare and not time-critical operations, so UBI can afford slow and simple volume table management.
The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods should be fast and efficient.
UBI could maintain the EBA and EC tables on the flash media, but this would inevitably involve journaling, journal replay, journal commit, etc. In other words, this would introduce a lot of complexity. But UBI would be logarithmically scalable in this case.
One of the UBI requirements was simplicity of the on-flash format, because UBI authors had to read UBI volumes from the boot-loader and they had very tight constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.
Therefore UBI does not maintain the EBA and EC tables on the flash media. Instead, it builds them in RAM each time it attaches the MTD device. This means that UBI has to scan the entire flash and read the EC and VID headers from each PEB in order to build the in-RAM EC and EBA tables.
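Building the in-RAM tables from the scanned headers can be sketched as a single pass over the PEBs. This is a deliberately simplified illustration: the type and function names and the tiny fixed limits are hypothetical, and the sketch ignores the case where two PEBs claim the same LEB, which a real implementation has to resolve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VOLS 4        /* hypothetical small limits for illustration */
#define MAX_LEBS 8
#define MAX_PEBS 16
#define PEB_UNMAPPED (-1)

/* What the scan learned about one PEB from its two headers. */
struct scan_entry {
    uint64_t ec;   /* from the EC header */
    bool has_vid;  /* false if the PEB is free (no VID header) */
    int vol_id;    /* from the VID header */
    int lnum;      /* from the VID header */
};

struct ram_tables {
    int eba[MAX_VOLS][MAX_LEBS];  /* EBA table: LEB -> PEB number */
    uint64_t ec[MAX_PEBS];        /* EC table: PEB -> erase counter */
};

static void build_tables(const struct scan_entry *scan, size_t n_pebs,
                         struct ram_tables *t)
{
    assert(n_pebs <= MAX_PEBS);

    for (int v = 0; v < MAX_VOLS; v++)
        for (int l = 0; l < MAX_LEBS; l++)
            t->eba[v][l] = PEB_UNMAPPED;

    for (size_t p = 0; p < n_pebs; p++) {
        t->ec[p] = scan[p].ec;
        if (!scan[p].has_vid)
            continue;  /* free PEB: contributes only to the EC table */
        assert(scan[p].vol_id >= 0 && scan[p].vol_id < MAX_VOLS);
        assert(scan[p].lnum >= 0 && scan[p].lnum < MAX_LEBS);
        t->eba[scan[p].vol_id][scan[p].lnum] = (int)p;
    }
}

/* Reading an LEB starts with this lookup. */
static int leb_to_peb(const struct ram_tables *t, int vol_id, int lnum)
{
    return t->eba[vol_id][lnum];
}
```

The loop visits every PEB exactly once, which is the linear attach cost discussed in this section.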
The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the headers take roughly 1.5%-3% of flash space on a NAND flash with 2KiB NAND pages and 128KiB eraseblocks, depending on whether sub-pages can be used). The advantages are a simple on-flash format and robustness.
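The overhead figure can be checked with a little arithmetic. The sketch below assumes that the EC and VID headers share the first NAND page when sub-pages are usable and occupy one full page each otherwise; the function name is made up for illustration.

```c
#include <stddef.h>

/* Percentage of each PEB consumed by the EC and VID headers. */
static double header_overhead_percent(size_t peb_size, size_t page_size,
                                      int has_subpages)
{
    /* One page when both headers fit into sub-pages, two pages otherwise. */
    size_t header_bytes = has_subpages ? page_size : 2 * page_size;

    return 100.0 * (double)header_bytes / (double)peb_size;
}
```

For a 128KiB PEB with 2KiB pages this gives 2048/131072 = 1.5625% with sub-pages and 4096/131072 = 3.125% without.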
Nonetheless, someday we might see a "UBI2" which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash formats, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.
Reserved blocks for bad block handling (only for NAND chips)
It is well-known that NAND chips have some amount of physical eraseblocks marked as bad by the manufacturer. During the lifetime of the NAND device, other bad blocks may appear. Nonetheless, manufacturers usually guarantee that the first few physical eraseblocks are not bad and that the total number of bad PEBs will not exceed a certain number. For example, a 256MiB (2048 128KiB PEBs) Samsung OneNAND chip is guaranteed to have not more than 40 bad 128KiB PEBs during its endurance lifetime. This is a very common value for NAND devices: 20 bad PEBs per 1024 PEBs, which is about 2% of flash size.
This ratio of 20/1024 is the default number of blocks that UBI reserves for a UBI device. The reservation is computed from the size of the whole flash chip, not from the size of the MTD partition the UBI device sits on. This means that if there are 2 UBI devices on a 4096 PEB NAND, 80 PEBs will be reserved for each UBI device. This may appear to be a waste of space, but, given that bad blocks can appear anywhere on the NAND flash and are not equally distributed over the whole device, it is the safer way. So instead of using several UBI devices on a NAND flash, it is more space-efficient to use only one UBI device which contains several UBI volumes.
The default value of 20 PEB reserved per 1024 PEB is a kernel config option. For each UBI device, this value can be adjusted via a kernel parameter or an ubiattach parameter (since kernel 3.7).
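As a rough illustration of the arithmetic, the reservation for a given chip size can be computed as follows. The function and parameter names are hypothetical, and rounding up is an assumption of this sketch rather than something stated here.

```c
/*
 * PEBs reserved for bad block handling on one UBI device.
 * chip_pebs is the PEB count of the whole flash chip; per1024 is the
 * configured "bad PEBs per 1024 PEBs" limit (20 by default).
 */
static unsigned reserved_bad_pebs(unsigned chip_pebs, unsigned per1024)
{
    unsigned long long scaled = (unsigned long long)chip_pebs * per1024;

    /* Integer ceiling of chip_pebs * per1024 / 1024. */
    return (unsigned)((scaled + 1023) / 1024);
}
```

With the default of 20, a 4096 PEB chip yields 80 reserved PEBs per UBI device and a 2048 PEB chip yields 40, matching the figures above.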
Volume auto-resize
When a UBI image is to be flashed during production, one should specify exact sizes for all volumes (the sizes are stored in the UBI volume table). However, in practice, in the embedded world, we like to have one read only volume for the root file system and one read/write volume for however much space is left (logs, user data, etc.). If the size of the root file system is fixed, the size of the second one can vary from one product to another (given different flash sizes).
This is the purpose of the auto-resize flag. If the volume has the auto-resize flag enabled, its size will expand to fill the remaining unused space when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize flag and the volume is not re-sized anymore. The auto-resize flag is stored in the volume table and only one volume may be marked as auto-resize.
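The behaviour can be modelled in a few lines. This is only a sketch with made-up names; it ignores the PEBs UBI keeps for its own needs and simply hands every free PEB to the flagged volume.

```c
#include <stddef.h>

/* Hypothetical, simplified volume table record. */
struct volume {
    int reserved_pebs;  /* current volume size in PEBs */
    int autoresize;     /* non-zero if the auto-resize flag is set */
};

/*
 * Grow the auto-resize volume (if any) by all free PEBs and clear its flag.
 * Returns the index of the resized volume, or -1 if no volume was flagged.
 */
static int apply_autoresize(struct volume *vols, size_t count, int *free_pebs)
{
    for (size_t i = 0; i < count; i++) {
        if (!vols[i].autoresize)
            continue;
        vols[i].reserved_pebs += *free_pebs;
        *free_pebs = 0;
        vols[i].autoresize = 0; /* the volume is never re-sized again */
        return (int)i;
    }
    return -1;
}
```

Calling the function a second time does nothing, which mirrors the one-shot nature of the flag.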
UBI operations
LEB un-map
The LEB un-map operation is implemented by the ubi_leb_unmap() UBI kernel API function. Starting from kernel version 2.6.29, the un-map operation is also available to user-space programs via the UBI_IOCEBUNMAP ioctl command. The ioctl should be called for UBI volume character devices.
The LEB un-map operation:
- first un-maps the LEB from the corresponding PEB;
- then schedules the PEB for erasure and returns; it does not wait for the erasure of the PEB to be finished; the PEB is instead erased by the UBI background thread.
UBI returns all 0xFF bytes when an un-mapped LEB is read, so the un-map operation may be considered a very fast erase operation. But there is one aspect of which UBI programmers have to be aware:
Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in the case of unclean reboots: if a reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device at the next bootup. Indeed, UBI will scan the MTD device and find the P which refers to L, and it will add this mapping information to the EBA table.
However, once you write any data to L, or map it using the LEB map operation, it gets mapped to a new PEB and the old contents are gone forever, because even in the case of an unclean reboot UBI would pick the newer mapping for L.
Implementation details
This section describes how UBI distinguishes between older and newer versions of an LEB in the case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P1, which means UBI schedules P1 for erasure. Then we write some data to L, which means that UBI finds another PEB P2, maps L to P2, and writes the data to P2. If an unclean reboot happens before P1 is physically erased, but after the write operation, we end up with 2 PEBs (P1 and P2) mapped to the same LEB L.
To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is incremented each time a PEB is mapped to a LEB and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches MTD devices, it initializes the global sequence number variable to the highest value found in the existing VID headers plus one.
In the above situation, UBI simply selects a PEB with the highest sequence number (P2) and drops the PEB with the lower sequence number (P1).
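A minimal sketch of the two rules just described, using hypothetical names: the global counter restarts one above the largest sequence number seen on flash, and of two PEBs claiming the same LEB the one with the larger sequence number wins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one VID header seen while attaching. */
struct vid_seen {
    int32_t pnum;    /* PEB the header was read from */
    uint64_t sqnum;  /* sequence number stored in the header */
};

/* Value the global sequence number starts from after attaching. */
static uint64_t initial_global_sqnum(const struct vid_seen *vids, size_t n)
{
    uint64_t max = 0;

    for (size_t i = 0; i < n; i++)
        if (vids[i].sqnum > max)
            max = vids[i].sqnum;
    return max + 1;
}

/* Of two PEBs mapped to the same LEB, return the "younger" one. */
static int32_t newer_peb(struct vid_seen a, struct vid_seen b)
{
    return a.sqnum > b.sqnum ? a.pnum : b.pnum;
}
```

In the P1/P2 scenario above, P2 was written after P1, so it carries the larger sequence number and is the PEB that newer_peb() returns.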
Note, the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when the unclean reboot happens during an atomic LEB change operation. In this case it is not enough to just pick the newer PEB, it is also necessary to make sure the data reached the new PEB.
LEB map
The LEB map operation maps a previously un-mapped logical eraseblock (LEB) to a physical eraseblock (PEB). For example, if the operation is run for LEB A, UBI will find an appropriate PEB, write a VID header to the PEB, and amend the in-memory EBA table. The VID header will now refer to LEB A. After this operation all I/O to LEB A will actually go to the mapped PEB.
The LEB map operation is available via the ubi_leb_map() UBI kernel API function, or via the UBI_IOCEBMAP volume character device ioctl command. However, this ioctl interface is available only starting from kernel version 2.6.29.
One of the functions of the LEB map operation is to make sure old LEB contents are removed. As explained in the LEB un-map section, when an LEB is un-mapped, the corresponding PEB is not erased immediately. If an unclean reboot happens, the LEB may become mapped to the same PEB again after UBI attaches the MTD device. So, if you map the LEB immediately after un-mapping it, you are guaranteed that the old LEB contents are deleted. In other words, the LEB is guaranteed to contain only 0xFF bytes after the map operation returns, even in case of an unclean reboot.
Please, use the LEB map operation sparingly. Do not use it unless it is really needed, because mapped LEBs add more overhead on the UBI wear-leveling sub-system compared to un-mapped LEBs. Indeed, if an LEB is un-mapped, there is no PEB which contains this LEB's data, and the wear-leveling sub-system does not have to move any data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB, there is one more PEB for the wear-leveling sub-system to care about, and one more LEB to re-map to another PEB if the erase counter of the current PEB becomes too low relative to the others (then the LEB is re-mapped to a PEB with a higher erase counter and the old PEB is used for other operations).
Volume update
The volume update operation is useful for device software updates. The operation replaces the contents of the whole UBI volume with new contents. But if it gets interrupted in the middle of the update, the volume goes into the "corrupted" state and further I/O on the volume ends up with an EBADF error. The only way to get the volume back to the normal state is to start a new volume update operation and finish it.
The volume update operation makes it possible for software to detect interrupted updates and recover from them, for example by re-starting the update with the help of a "mirror" volume which has the same contents, or by showing a dialog window which informs the user about the problem and requests re-flashing. In contrast, it is difficult to detect interrupted updates when using raw MTD partitions.
The volume update operation is available via the user-space UBI interface and not available via the UBI kernel API. To update a volume, you first have to call the UBI_IOCVOLUP ioctl on the corresponding UBI volume character device node and pass it a pointer to a 64-bit value containing the length of the new volume contents in bytes. Then this number of bytes has to be written to the volume character device node. Once the last byte has been sent to the character device node, the update operation is finished. Conceptually, the sequence (in pseudo-code) is:
fd = open("/dev/my_volume");
ioctl(fd, UBI_IOCVOLUP, &image_size);
write(fd, buf, image_size);
close(fd);
See include/mtd/ubi-user.h for more details. Bear in mind, the old contents of the volume are not preserved if the update is interrupted.
Also, you do not have to write all the new data in one go. It is OK to call the write() function an arbitrary number of times and pass arbitrary amounts of data each time. The operation will be finished after all the data has been written. If the last write operation contains more bytes than UBI expects, the extra is ignored.
A special case of the volume update operation is what we call volume truncation, which is done by the same ioctl command when the data length is zero. In this case the volume is wiped out and will contain all 0xFF bytes (all LEBs will be un-mapped).
Note, the /sys/class/ubi/ubiX_Y/corrupted sysfs file reflects the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK and "1\n" if it is corrupted (i.e. if a volume update was started but was not completed).
The volume update operation does not preserve the volume's previous contents if the update is interrupted; it is not atomic. However, UBI does provide atomic volume updates by means of the volume re-name operation.
Volume updates are implemented with the help of update markers. Once the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update marker flag for the volume in the corresponding record of the UBI volume table. At this point the volume is wiped, and UBI waits for the user to send the data. The update marker is cleared only when all the data has been sent and has been written to the flash successfully. If the update is interrupted (e.g., unclean reboot, crash of the update application, etc.), the update marker is not cleared and the volume is treated as "corrupted" until a later update operation completes successfully.
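The update marker logic can be modelled as a small state machine. The sketch below uses hypothetical names and keeps everything in memory. It also assumes that a zero-length update (truncation) has nothing to wait for and so leaves the marker clear, which is not spelled out in this section.

```c
#include <stdint.h>

/* Hypothetical in-memory model of one volume's update state. */
struct vol_update {
    int upd_marker;    /* 1 while an update has started but not finished */
    int64_t expected;  /* bytes announced when the update was started */
    int64_t received;  /* bytes accepted so far */
};

/* Models the update ioctl: set the marker and wait for data. */
static void start_update(struct vol_update *v, int64_t bytes)
{
    v->expected = bytes;
    v->received = 0;
    /* Assumption: a truncation (bytes == 0) completes immediately. */
    v->upd_marker = bytes > 0;
}

/*
 * Models one write() call. Returns the number of bytes accepted;
 * bytes beyond the announced length are ignored.
 */
static int64_t feed_update(struct vol_update *v, int64_t count)
{
    int64_t left = v->expected - v->received;

    if (count > left)
        count = left;
    v->received += count;
    if (v->upd_marker && v->received == v->expected)
        v->upd_marker = 0; /* all data arrived: the update is complete */
    return count;
}
```

If the device reboots while upd_marker is still 1, the next attach finds the marker set and the volume is reported as corrupted.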
Atomic LEB change
The atomic LEB change operation changes the contents of an LEB atomically, so that the old contents are preserved should the operation be interrupted. In other words, the LEB will always contain either the old contents or the new contents. This functionality is available via the ubi_leb_change() kernel API call.
The user-space interface for this operation was added in kernel version 2.6.25. Its functionality is available to user-space via the UBI_IOCEBCH ioctl command. You have to pass a pointer to a properly-filled request object of struct ubi_leb_change_req type. This object stores the LEB number to change and the length of the new contents. Then you have to write the specified number of bytes to the volume character device. Note the similarity to the volume update operation. Conceptually, the sequence (in pseudo-code) is:
struct ubi_leb_change_req req;
req.lnum = lnum_to_change;
req.bytes = data_len;
fd = open("/dev/my_volume");
ioctl(fd, UBI_IOCEBCH, &req);
write(fd, data_buf, data_len);
close(fd);
If, for some reason, the user does not write the specified number of bytes to the file descriptor before closing the file, the operation is cancelled and the old contents of the LEB are preserved.
Similarly to the volume update operation, it does not matter how many times the write() function is called and how much data it passes to the UBI volume each time. The atomic LEB change operation finishes only once the last data byte has arrived.
The atomic LEB change operation might be very useful for file-systems, for example UBIFS uses this functionality when it commits the file-system index. This behaviour could also be used to create an FTL layer on top of UBI (see here for a description of the idea).
Keep in mind that the atomic LEB change operation calculates the CRC-32 checksum of the new data, so it has some overhead compared to the "LEB erase" + "LEB write" sequence. The volume update operation does not calculate the data's CRC-32 checksum, so it is faster to update the volume than it is to atomically change all its eraseblocks. Keep this overhead in mind and be sure to only use this operation if/when atomicity is really needed.
Implementation details
Suppose UBI has to change a logical eraseblock L which is mapped to a physical eraseblock P1. First of all, UBI always has one free PEB reserved for the atomic LEB change operation, let it be P2. Before the operation, P1 stores the current contents of the LEB L and P2 is free (it contains only the EC header and 0xFF bytes). The new data is written to P2, not to P1, so should anything go wrong, the old contents of the LEB are maintained.
When the operation finishes, UBI un-maps L from P1, maps it to P2, and schedules P1 for erasure. If the operation is interrupted, L continues to be mapped to P1 and P2 is scheduled for erasure.
If an unclean reboot happens half way through the atomic LEB change operation, it is obvious that UBI has to preserve the L -> P1 mapping and erase P2 when it attaches the MTD device on the next reboot. But if an unclean reboot happens just after the atomic LEB change operation finishes, but before P1 is physically erased, it is obvious that UBI has to preserve the L -> P2 mapping and erase P1.
To resolve situations like that, UBI calculates the CRC-32 checksum of the new contents of the LEB before it is written to the flash, and stores it in the VID header (together with the data length). When UBI finds 2 PEBs P1 and P2 mapped to the same LEB L during the initialization, it selects the one with the higher sequence number (P2) only if the data CRC-32 checksum is correct (which means that all data has been written to the flash media); otherwise it selects the PEB with the lower sequence number (P1). Of course, UBI has to read the LEB contents in order to verify the CRC-32 checksum.
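The selection rule can be sketched as follows. The structure and function names are hypothetical, and the checksum is a plain bitwise CRC-32 in the common IEEE 802.3 form; the exact CRC parameters UBI uses may differ, so treat that part as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form). */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical view of one PEB that claims to hold LEB L. */
struct peb_copy {
    uint64_t sqnum;       /* sequence number from the VID header */
    uint32_t data_size;   /* data length recorded in the VID header */
    uint32_t data_crc;    /* CRC-32 recorded in the VID header */
    const uint8_t *data;  /* bytes actually found in the PEB */
};

/* Pick the newer copy only if its data matches the recorded checksum. */
static const struct peb_copy *select_copy(const struct peb_copy *a,
                                          const struct peb_copy *b)
{
    const struct peb_copy *newer = a->sqnum > b->sqnum ? a : b;
    const struct peb_copy *older = newer == a ? b : a;

    if (crc32_ieee(newer->data, newer->data_size) == newer->data_crc)
        return newer; /* all new data reached the flash */
    return older;     /* the write was interrupted: keep the old contents */
}
```

Reading and checksumming the data is the extra attach-time cost mentioned above; it is paid only when two PEBs claim the same LEB.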
Fastmap
Fastmap is an experimental and optional UBI feature, which can be enabled by setting CONFIG_MTD_UBI_FASTMAP to 'y'. Once enabled UBI evaluates the module parameter "fm_autoconvert". If it is set to 1 (default is 0) UBI automatically enables fastmap for any attached image. This means UBI creates a new internal volume with the fastmap data such that next time the image is attached, the fast attach mode can be used.
In the default configuration UBI will use the information stored in this fastmap volume to accelerate the attach procedure. If you want to test fastmap, set fm_autoconvert to 1 and attach an image.
The following settings are possible:
CONFIG_MTD_UBI_FASTMAP | fm_autoconvert | Result |
---|---|---|
n | 0 | fastmap is completely disabled |
y | 0 | UBI will use the fastmap data if it exists on an image, but will not install a fastmap on images that don't already have it |
y | 1 | UBI will use the fastmap data if it exists on an image, and a fastmap is automatically created on all attached images |
Backwards compatibility
The fastmap on-disk data structure makes use of delete compatible volumes, therefore fastmap-enabled images are fully backwards compatible with UBI implementations which do not support fastmap. The kernel will remove the fastmap volumes and continue with scanning. This includes not only kernel version v3.6- but also v3.7+ with this option disabled.
Technical design
An on-disk fastmap contains all the information required to attach the whole image, including all erase counter values, a list of all PEBs and their state, a list of all volumes and their current EBA, etc. To avoid too many writes of the fastmap, it also contains a list of PEBs which may have changed and need a full scan while attaching. This list is called the "fastmap pool" and has a fixed size of 5% of the total number of PEBs. By design UBI needs to write the fastmap data only if the pool contains no free PEBs. Without the pool, it would have to write the fastmap each time the EBA of a volume changed.
A fastmap consists of a super-block (also known as an anchor PEB) and payload data which can live on any PEB. The anchor PEB has to be located within the first 64 PEBs on the MTD device. It contains pointers to the remaining PEBs which carry the actual fastmap data. On modern NAND chips the whole fastmap fits into a single PEB. Hence, the anchor PEB points to itself. After loading the fastmap data, the UBI attach information structure is created from it.
The attach process works as follows:
- UBI tries to find the fastmap anchor PEB; if no anchor PEB is found, UBI performs a traditional full scan
- It follows the pointers stored in the anchor PEB and reads the fastmap payload data
- Then it performs a traditional scan only on PEBs in the pool instead of all PEBs
If UBI detects that the fastmap data is invalid or corrupt it automatically falls back to scanning mode and performs a full scan. Using a CRC32 checksum and consistency checks of the internal UBI structures UBI is able to detect whether the fastmap data is invalid or not.
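The attach decision can be summarised in a short sketch. All names are hypothetical, the input is a pre-computed flag per PEB rather than real flash reads, and truncating the 5% pool size to a whole number is an assumption of the sketch.

```c
#include <stddef.h>

#define FM_ANCHOR_SEARCH_LIMIT 64 /* anchor must be within the first 64 PEBs */

/* Pool size as described above: 5% of all PEBs (truncated). */
static size_t fastmap_pool_size(size_t peb_count)
{
    return peb_count * 5 / 100;
}

/* Returns the PEB number of the anchor, or -1 if none is found. */
static int find_anchor(const int *is_anchor, size_t peb_count)
{
    size_t limit = peb_count < FM_ANCHOR_SEARCH_LIMIT
                       ? peb_count : FM_ANCHOR_SEARCH_LIMIT;

    for (size_t pnum = 0; pnum < limit; pnum++)
        if (is_anchor[pnum])
            return (int)pnum;
    return -1;
}

/*
 * Number of PEBs that must be scanned while attaching.
 * fastmap_valid stands for the CRC and consistency checks passing.
 */
static size_t pebs_to_scan(const int *is_anchor, size_t peb_count,
                           int fastmap_valid)
{
    if (find_anchor(is_anchor, peb_count) < 0 || !fastmap_valid)
        return peb_count; /* fall back to a full scan */
    return fastmap_pool_size(peb_count); /* scan only the pool */
}
```

On a 4096 PEB device this reduces the scan from 4096 PEBs to 204 whenever a valid fastmap is found.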
The fastmap data is written to the device whenever the fastmap pool becomes full (i.e. no free PEBs are available), the volume layout changes, or the image is detached. If you are wondering why the fastmap data needs to be written at detach time, it is because otherwise all erase counter modifications since the last fastmap write would be lost.
Overhead
A fastmap-enabled UBI will reserve enough PEBs to carry two complete fastmaps. In practice on modern NAND chips two PEBs are reserved for fastmap.
There is also some runtime overhead. In order to guarantee that a newly written fastmap is valid and consistent, UBI needs to make sure that all I/O which would cause EBA changes is blocked while the fastmap is being written. Depending on the specific flash chips, this can take up to one second. Therefore, fastmap only makes sense on fast and large flash devices where a full scan would otherwise take too long. For example: on 4GiB NAND chips a full scan takes several seconds, whereas a fast attach needs less than one second.
Notes
Enabling fastmap does not guarantee that every attach process will be done in optimal time. In some situations a full scan is still needed. This can happen in two cases: (i) if an unexpected reboot occurs while a fastmap is being written to the flash or (ii) UBI runs out of PEBs while writing the fastmap. The latter case can happen if a massive amount of I/O errors happen while writing, and UBI cannot find enough usable PEBs.
R/O block devices on top of UBI volumes
UBI allows the creation of block devices on top of UBI volumes with the following limitations:
- Read-only operation.
- Serialized I/O operation, but keep in mind the NAND driver core already serializes all I/O too.
Despite these limitations, a block device is still very useful for the purpose of mounting read-only, regular file systems on top of UBI volumes. Take, for example, squashfs, which can be used as a lightweight read-only rootfs on top of a NAND device. In this case, the UBI layer will take care of low-level details such as bit-flip handling and wear-levelling.
Usage
Creating and destroying block devices on a UBI volume is somewhat similar to attaching MTD devices to UBI. You can either use the block UBI module parameter or use the "ubiblock" user-space tool.
In order to create a block device at bootup time (e.g. to mount the rootfs on such a block device) you can specify the block parameter as a kernel boot argument:
ubi.mtd=5 ubi.block=0,0 root=/dev/ubiblock0_0
There are several ways of specifying a volume:
Using the UBI volume path:
ubi.block=/dev/ubi0_0
Using the UBI device, and the volume name:
ubi.block=0,rootfs
Using both the UBI device number and the UBI volume number:
ubi.block=0,0
If you've built UBI as a module you can use the following parameters at module load time:
$ modprobe ubi mtd=/dev/mtd5 block=/dev/ubi0_0
A block device can also be created/removed dynamically at runtime, using the ubiblock user-space tool:
$ ubiblock --create /dev/ubi0_0
$ ubiblock --remove /dev/ubi0_0
UBI stress testing
If enabled when configuring (right before building the code), mtd-utils includes user-space tools that can be used to stress test the UBI stack. This is useful if you want to test the stability and correctness of your particular UBI stack implementation.
Example: running various UBI tests:
$ flash_erase /dev/mtd3 0 0
$ ubiattach --mtdn 3
$ /usr/libexec/mtd-utils/runubitests.sh /dev/ubi0
More documentation
Unfortunately, no complete, up-to-date design documents exist for UBI. But there is an old UBI design document which has some out-of-date information which might still be of limited use: ubidesign.pdf.
There is also a PowerPoint UBI presentation available: ubi.ppt. Note, this document contains a lot of animations, so be sure to view it in "slide show" mode (F5 key) so that the animations will be played.
More information may be found in the FAQ section.
And of course just reading the UBI interface C header files (which are well commented) may help: include/mtd/ubi-user.h contains the user-space interface definition (namely, it defines UBI ioctl commands and the associated data structures), include/linux/mtd/ubi.h defines the kernel API, and drivers/mtd/ubi/kapi.c contains comments for each kernel API function (just above the body of the function).