Friday, September 12, 2008

ext filesystem online resizing.

The notes below are based on my understanding of the code, and most of the material is taken directly from the paper "Online ext2 and ext3 Filesystem Resizing", presented by Andreas Dilger at OLS 2002. Any errors below are mine, and you may want to read the full paper for a proper, detailed explanation.

Primary operations involved in resizing :-
a) Increase the total number of blocks in the primary and backup superblocks.
b) Increase the count of free blocks in the primary superblock and in the group descriptor of the affected group.
c) Increase the number of reserved filesystem blocks.
d) Remount with the "-o remount,resize=<newsize>" option.
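
The counter updates in steps a) to c) can be sketched as follows. This is only a toy model in Python, not kernel code: the field names mirror the ext2 superblock fields, the ToySuperblock class and grow() helper are made up for illustration, and keeping the reserved share proportional to the new size is my assumption rather than something the paper prescribes.

```python
from dataclasses import dataclass


@dataclass
class ToySuperblock:
    """Hypothetical stand-in for the counters kept in the superblock."""
    s_blocks_count: int       # total blocks in the filesystem
    s_free_blocks_count: int  # blocks currently free
    s_r_blocks_count: int     # blocks reserved for the superuser


def grow(sb: ToySuperblock, added_blocks: int) -> ToySuperblock:
    """Apply steps a) to c) for `added_blocks` new blocks."""
    if added_blocks <= 0:
        raise ValueError("only growing is supported")
    old_total = sb.s_blocks_count
    sb.s_blocks_count += added_blocks        # step a
    sb.s_free_blocks_count += added_blocks   # step b
    # Step c. Assumption: keep the reserved share proportional to the size.
    sb.s_r_blocks_count = sb.s_r_blocks_count * sb.s_blocks_count // old_total
    return sb
```

In the real filesystem the same free-block change also has to be applied to the group descriptor of the group that received the blocks.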

NOTE: You cannot shrink the filesystem online. Filesystems in general discourage shrinking, and most of them do not support it, because it can lead to stale/invalid NFS file handles if a client has a file open when the filesystem is shrunk.

Case 1: Adding blocks to the end of the last partially filled block group.

To increase the free block count and update the block bitmap, a fake inode is created that spans the newly added blocks at the end of the last block group. Because these blocks already belong to the fake inode, there is no risk of anyone else allocating them. To make the blocks available to the filesystem, all that remains is to free them with ext2_free_blocks(), which updates the relevant counts and the block bitmap. The inode exists only in memory and has just enough fields filled in to satisfy ext2_free_blocks().
The next step is to update the backup superblocks. Since the kernel does not access the backup superblocks, userspace can modify them directly. If the resize is done via mount, the backup superblocks are not updated at that point, but they will be set to the proper values the next time e2fsck runs.
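
The fake-inode trick can be illustrated with a small Python toy. The names ToyFs, free_blocks_range() and extend_last_group() are hypothetical and only mimic the idea: the new blocks first appear as "in use", and the ordinary free path then does all the bookkeeping.

```python
class ToyFs:
    """Toy model of one block group: a bitmap plus a free-block counter."""

    def __init__(self, bitmap):
        self.bitmap = list(bitmap)            # True means "block in use"
        self.free_blocks = self.bitmap.count(False)

    def free_blocks_range(self, start, count):
        """Stand-in for the normal free path: clear bits, bump the counter."""
        for blk in range(start, start + count):
            if not self.bitmap[blk]:
                raise ValueError(f"block {blk} is already free")
            self.bitmap[blk] = False
        self.free_blocks += count


def extend_last_group(fs, added):
    """Add `added` blocks at the end of the group and release them."""
    start = len(fs.bitmap)
    # The new blocks start out marked as used, as if owned by the fake inode,
    # so nothing else can allocate them in the meantime.
    fs.bitmap.extend([True] * added)
    # "Deleting" the fake inode: the regular free routine updates the
    # bitmap and the counts, so no special-case accounting is needed.
    fs.free_blocks_range(start, added)
    return start
```

The point of the design is that no new accounting code is required; the existing free path is reused as-is.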

Case 2: Adding new block groups after a full last block group.

Main steps required to handle this case are :-
a) Create new block bitmaps and inode bitmaps for each added block group and update them appropriately to reflect the available blocks and inodes within the group.
b) Add the new entry to the group descriptor table. Increase the number of groups in the filesystem so that the inode/block allocation routines know that there are new groups available with free resources.

To avoid doing all of this in the kernel, the new resources are set up from userspace before the kernel becomes aware of the new groups. As long as the kernel is unaware of the new groups, it cannot allocate anything from them, and since they are added at the very end of the filesystem, it is safe to do these operations from userspace.
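
A rough Python sketch of this "prepare first, publish last" ordering is shown below. Everything here (prepare_group(), ToyFs, stage_descriptor(), publish()) is a hypothetical toy, and the metadata_blocks parameter is an assumed count of blocks taken up by the group's own metadata.

```python
def prepare_group(blocks_per_group, inodes_per_group, metadata_blocks):
    """Userspace side: build bitmaps and a descriptor for one new group."""
    # The group's own metadata blocks are marked as used from the start.
    block_bitmap = [i < metadata_blocks for i in range(blocks_per_group)]
    inode_bitmap = [False] * inodes_per_group
    desc = {
        "free_blocks": blocks_per_group - metadata_blocks,
        "free_inodes": inodes_per_group,
    }
    return block_bitmap, inode_bitmap, desc


class ToyFs:
    """Toy view of what the allocator can see."""

    def __init__(self, gdt):
        self.gdt = list(gdt)
        self.groups_count = len(self.gdt)

    def visible_free_blocks(self):
        # The allocator only looks at the first groups_count descriptors.
        return sum(d["free_blocks"] for d in self.gdt[: self.groups_count])

    def stage_descriptor(self, desc):
        self.gdt.append(desc)                 # written, but not yet visible

    def publish(self):
        self.groups_count = len(self.gdt)     # the new group becomes usable
```

Until publish() runs, the staged group is invisible to allocation, which is why it can be written without racing against the allocator.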

Case 3 : Adding a new block group in new group descriptor block.

This case is more complex, because the backup superblocks and group descriptors sit at the start of a block group (so that e2fsck can easily find them after corruption), followed by the block and inode bitmaps. Adding a new group descriptor block would overwrite these bitmaps, so they have to be moved to accommodate the new block. The relocation itself is not a serious problem, since the gdt stores the location of the bitmaps: all we need to do is move the blocks and then update the gdt with their new locations. However, this movement of blocks cannot be done while the filesystem is mounted, to avoid any mess within the kernel.
Such cases therefore require unmounting the filesystem for an offline step (commonly known as filesystem preparation), which has the advantage that the changes made are compatible with older kernels. There is, however, an alternative that avoids taking the filesystem offline, if the user is willing to break this compatibility.

Andreas explains this case nicely in his paper, and I suggest that those who are interested read section 3.4.
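
Whether a resize falls into Case 3 comes down to simple arithmetic on the group descriptor table. The Python sketch below assumes the classic 32-byte ext2/ext3 group descriptor; the helper names gdt_blocks() and needs_new_gdt_block() are made up for illustration.

```python
GROUP_DESC_SIZE = 32  # bytes per group descriptor in ext2/ext3


def gdt_blocks(groups, block_size):
    """Number of blocks needed to hold `groups` group descriptors."""
    per_block = block_size // GROUP_DESC_SIZE
    return -(-groups // per_block)  # ceiling division


def needs_new_gdt_block(old_groups, new_groups, block_size):
    """True if growing to `new_groups` groups needs another gdt block."""
    return gdt_blocks(new_groups, block_size) > gdt_blocks(old_groups, block_size)
```

For example, with 4096-byte blocks one gdt block holds 128 descriptors, so growing from 128 to 129 groups is Case 3, while growing from 100 to 128 groups is still Case 2.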


The strategy above is for ext2. Resizing ext3 in the same way poses certain problems :-

a) Writing directly to the device is unsafe, since there is a journalling layer whose job is to make sure that everything is consistent. Bypassing it can lead to on-disk corruption of the filesystem.
b) Unlike ext2, ext3 may hold copies of data in the journal, so what userspace reads from the device may differ from what is actually current. (See paper)
c) Since ext3 resizing is done through the journal, the resize may have made it into the journal just before the system crashes. On reboot the journal is replayed, but because the superblock is read before the replay, the new size is not picked up. This case is detected by comparing the filesystem size before and after journal replay and then updating the appropriate data structures.
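
The check described in c) can be sketched roughly as below. This is a Python toy, not the ext3 mount path: the superblock is a plain dict, journal records are modelled as dicts of fields to overwrite, and mount_with_replay() is a hypothetical name.

```python
def mount_with_replay(on_disk_sb, journal_records):
    """Replay the journal and report whether the filesystem size changed.

    Returns the superblock after replay and a flag that tells the caller
    whether size-dependent in-memory structures must be rebuilt.
    """
    # The superblock is read first, before the journal is replayed.
    size_before = on_disk_sb["s_blocks_count"]

    sb = dict(on_disk_sb)
    for record in journal_records:   # replay committed updates in order
        sb.update(record)

    # Compare the size seen before replay with the size after replay.
    resized = sb["s_blocks_count"] != size_before
    return sb, resized
```

If the flag comes back true, a resize was committed to the journal before the crash, and anything derived from the old size has to be recomputed.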
