Devil's den: September 2008

Thursday, September 18, 2008

Easy way to convert negative integers to hexadecimal

If you came here looking for magic, sorry. I had trouble representing negative integers in hex: every time, I had to convert the number to binary, take the 2's complement and then convert back to hex :-( so I was looking for a quicker way to do the job. I found this trick on the net and thought it was pretty easy. You just need to keep the following mapping in mind (or on paper :-) though it is pretty simple).

0 1 2 3 4 5 6 7
F E D C B A 9 8

That's all. The table is just the hex digits 0-F laid out clockwise; more technically, each digit in the upper row is the bitwise complement of the one below it (each pair adds up to F).

Now to convert any number, say -7218, follow the steps below :-

a) Convert 7218 to hex = 0x1C32
b) Swap each digit with its counterpart from the mapping = 0xE3CD
c) Add 1 to the result, i.e. move the last digit one step clockwise in the table, so D becomes E = 0xE3CE. As a 32-bit value, -7218 = 0xFFFFE3CE (don't forget to fill it with 'F's on the left; without them, 0x0000E3CE is the positive number 65536-7218 = 58318).

NOTE : If the last digit is 'F', adding 1 wraps it to '0' and you then add 1 to the next digit to the left (much like a carry).
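
The steps above can be sketched in C, mostly as a way to check the trick against a known answer. The names flip_digit and negate_hex are made up for this sketch, and it assumes a 32-bit result written as 8 upper-case hex digits. Padding with zeros before flipping is what produces the leading 'F's.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Complement one hex digit using the table: 0<->F, 1<->E, ..., 7<->8. */
static char flip_digit(char c)
{
    static const char digits[] = "0123456789ABCDEF";
    const char *p = strchr(digits, c);

    assert(p != NULL && c != '\0');   /* expects an upper-case hex digit */
    return digits[15 - (p - digits)];
}

/*
 * Write the 32-bit hex form of -n into out (8 digits + NUL),
 * using only the digit table and a carry.
 */
static void negate_hex(uint32_t n, char out[9])
{
    int i;

    assert(n > 0);
    /* step a: plain hex, zero-padded to 8 digits */
    snprintf(out, 9, "%08lX", (unsigned long)n);

    /* step b: swap every digit with its counterpart */
    for (i = 0; i < 8; i++)
        out[i] = flip_digit(out[i]);

    /* step c: add 1 to the last digit, carrying left while a digit wraps */
    for (i = 7; i >= 0; i--) {
        if (out[i] == 'F') {
            out[i] = '0';             /* wrap, keep carrying */
        } else {
            out[i] = (out[i] == '9') ? 'A' : out[i] + 1;
            break;
        }
    }
}
```

Calling negate_hex(7218, buf) leaves "FFFFE3CE" in buf, which matches the worked example.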

Friday, September 12, 2008

ext filesystem online resizing.

The notes below are based on my understanding of the code, and most of the material is taken directly from the paper presented by Andreas Dilger at OLS 2002, "Online ext2 and ext3 Filesystem Resizing". Any errors below are mine, and you may want to read the full paper for a proper, detailed explanation.

Primary operations involved in resizing :-
a) Increase the total number of blocks in the primary and backup superblocks.
b) Increase the count of free blocks in the primary superblock and in the group descriptor for the affected group.
c) Increase the number of reserved filesystem blocks.
d) Remount with the "-o remount,resize=<newsize>" option.

NOTE: You cannot shrink the filesystem. Shrinking is generally discouraged and most filesystems don't support it, because it can lead to stale/invalid NFS file handles if a client has a file open when the filesystem is shrunk.

Case 1: Adding blocks to the end of the last partially filled block group.

To increase the free block count and update the block bitmaps, a fake inode is created which spans the newly added blocks at the end of the last block group. Since these blocks are already in use by the fake inode, there is no risk of anyone else using them. To make the blocks available to the filesystem, all that needs to be done is to free them with ext2_free_blocks(). While they are freed, the appropriate counts and block bitmaps are updated. This inode exists only in memory and has just the fields that ext2_free_blocks() requires.
The next step is to update the backup superblocks. Since the kernel does not access the backup superblocks, userspace can modify them directly. If the resize is done via mount, the backup superblocks are not updated, but they will be set to the proper values the next time e2fsck is run.

Case 2: Adding new block groups at the end of full block group.

Main steps required to handle this case are :-
a) Create new block bitmaps and inode bitmaps for each added block group and update them to reflect the available blocks and inodes within the group.
b) Add the new entries to the group descriptor table and increase the number of groups in the filesystem, so that the inode/block allocation routines know that there are new groups with free resources.

To avoid doing all this in the kernel, the new groups are set up from userspace before the kernel becomes aware of them. As long as the kernel is unaware of the new groups it cannot allocate anything from them, and since they are added at the end, it is safe to do these operations from userspace.

Case 3 : Adding a new block group in new group descriptor block.

This situation is more complex, because the backup superblocks and group descriptors sit at the start of a block group (so that e2fsck can easily find them in case of corruption), followed by the inode and block bitmaps. Adding a new group descriptor block would overwrite these bitmaps, so they need to be moved to make room for it. The relocation itself is not a serious problem, since the gdt stores the location of the bitmaps: all we need to do is move the blocks and then record the new locations in the gdt. But blocks cannot be moved while the filesystem is mounted, to avoid confusing the kernel.
Such cases need the filesystem to be unmounted (commonly known as filesystem preparation), and they have the advantage that the changes made are compatible with older kernels. However, there is an alternative that does not require taking the filesystem offline, if the user is willing to break this compatibility.

Andreas explains this case nicely in his paper; I suggest reading section 3.4 if you are interested.
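
A rough way to tell the three cases apart is plain arithmetic on block counts. The sketch below is only an illustration, not the actual resize code: the enum and function names are made up, and it assumes default ext2 geometry (8 * block_size blocks per group, 32-byte group descriptors, first data block 0). A real resize may go through more than one case; this returns the hardest one involved.

```c
#include <assert.h>
#include <stdint.h>

enum resize_case {
    RESIZE_NONE,          /* new size is not larger than the old one */
    RESIZE_LAST_GROUP,    /* case 1: only grows the last, partial group */
    RESIZE_NEW_GROUPS,    /* case 2: new groups fit in existing gdt blocks */
    RESIZE_NEW_GDT_BLOCK  /* case 3: needs an extra group descriptor block */
};

/* ceil(a / b) for b > 0 */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

static enum resize_case classify_resize(uint64_t old_blocks,
                                        uint64_t new_blocks,
                                        uint32_t block_size)
{
    /* Assumed geometry: one bitmap block per group, 32-byte descriptors. */
    uint64_t blocks_per_group = 8ULL * block_size;
    uint64_t desc_per_block = block_size / 32;
    uint64_t old_groups, new_groups;

    assert(block_size >= 1024 && old_blocks > 0);

    if (new_blocks <= old_blocks)
        return RESIZE_NONE;

    old_groups = div_round_up(old_blocks, blocks_per_group);
    new_groups = div_round_up(new_blocks, blocks_per_group);

    if (new_groups == old_groups)
        return RESIZE_LAST_GROUP;

    /* Same number of descriptor blocks before and after? */
    if (div_round_up(new_groups, desc_per_block) ==
        div_round_up(old_groups, desc_per_block))
        return RESIZE_NEW_GROUPS;

    return RESIZE_NEW_GDT_BLOCK;
}
```

With 4096-byte blocks this gives 32768 blocks per group and 128 descriptors per gdt block, so under these assumptions the third case is first reached when the filesystem grows past 128 groups.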

The above strategy is for ext2; resizing ext3 the same way poses certain problems :-

a) Writing directly to the device is unsafe, since there is a journalling layer whose job is to make sure that everything is consistent. Bypassing it might lead to on-disk corruption of the filesystem.
b) Unlike ext2, ext3 may hold copies of data made by journalling, so what you read from userspace may differ from what is actually there. (See paper)
c) Since ext3 resizing goes through the journal, the resize may have made it to the journal when the system crashes. On reboot the journal is replayed, but since the superblock is read first, the new size is not picked up. This case is detected by comparing the filesystem size before and after journal replay and then updating the appropriate data structures.

Sunday, September 7, 2008

ioctls - An easy interface to talk to kernel

There are many *good* resources on the net if you want to know about ioctls. One of them is "man ioctl".
If you want the list of ioctls, one quick way is "man ioctl_list". What follows is a layman's explanation and may not please "strict" technical people.

a) What are ioctls ?
ioctls are like a Swiss army knife: one interface to do a lot of things. A simple way for userspace to talk to the kernel.
Among the available options (syscalls, procfs, sysfs, debugfs .. etc.), ioctls are probably the easiest if you want to talk to kernel space and also get your code into the mainline kernel. One reason is that they are easy to implement, and since there is no single standard, everyone follows their own convention (though this is discouraged).

To implement an ioctl, all you need to do is tell the kernel that you will be sending some *codenumber* from userspace and that, depending on the codenumber, it has to do something (execute a function). Much like switch-case or RPC.

b) How to implement ioctls ?
Depending on how you want to talk to the kernel, an ioctl can have an in argument, an out argument, both or none, and you need to tell the kernel which, so that it can take the appropriate action. Remember that you have to send a *codenumber* to the kernel, so this information must be embedded in it along with the functionality you want.

ioctls are declared using the standard macros defined in asm/ioctl.h. The macros are of the form _IO(type,nr) and {_IOR,_IOW,_IOWR}(type,nr,size). Note that R and W are from the user's perspective (same as read & write). The last parameter, 'size', is actually the datatype to be transferred; the macro takes its sizeof.

An example of an ioctl that changes the version of a file in the kernel :-
#define EXMPL_IOC_CHANGE_VERSION _IOW('f',7,long)

This gets expanded into a *codenumber* by the macros below :-
#define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOC(dir,type,nr,size) \
(((dir) << _IOC_DIRSHIFT) | /* DIRSHIFT == 30 */ \
((type) << _IOC_TYPESHIFT) | /* TYPESHIFT == 8 */ \
((nr) << _IOC_NRSHIFT) | /* NRSHIFT == 0 */ \
((size) << _IOC_SIZESHIFT)) /* SIZESHIFT == 16 */
# define _IOC_WRITE 1U

Based on the arguments passed, this creates a unique *codenumber* which your module needs to understand. _IOC_TYPECHECK is just a macro that lets the compiler catch invalid uses of the size argument. The ioctl number 'nr' (together with 'type') should be chosen so that it doesn't conflict with others (that is one reason why people hate ioctls). Some devices use their major number for this.
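
To see how the fields are packed, here is a small userspace sketch that rebuilds the encoding with the shift values quoted above and then pulls the fields back out. The EX_ names are made up so that they don't clash with the real macros, and the field widths (8 bits nr, 8 bits type, 14 bits size, 2 bits dir) are an assumption that matches the common layout; some architectures differ.

```c
#include <assert.h>

/* Hypothetical EX_ names mirroring the layout quoted above. */
#define EX_IOC_NRSHIFT    0
#define EX_IOC_TYPESHIFT  8
#define EX_IOC_SIZESHIFT  16
#define EX_IOC_DIRSHIFT   30
#define EX_IOC_WRITE      1U

#define EX_IOC(dir, type, nr, size) \
    (((unsigned int)(dir)  << EX_IOC_DIRSHIFT)  | \
     ((unsigned int)(type) << EX_IOC_TYPESHIFT) | \
     ((unsigned int)(nr)   << EX_IOC_NRSHIFT)   | \
     ((unsigned int)(size) << EX_IOC_SIZESHIFT))

/* 'datatype' is a type name; only its size ends up in the codenumber. */
#define EX_IOW(type, nr, datatype) \
    EX_IOC(EX_IOC_WRITE, (type), (nr), sizeof(datatype))

/* Decoders; assumed widths: 8 bits nr, 8 bits type, 14 bits size, 2 bits dir. */
static unsigned int ex_ioc_nr(unsigned int cmd)
{
    return (cmd >> EX_IOC_NRSHIFT) & 0xFFu;
}

static unsigned int ex_ioc_type(unsigned int cmd)
{
    return (cmd >> EX_IOC_TYPESHIFT) & 0xFFu;
}

static unsigned int ex_ioc_size(unsigned int cmd)
{
    return (cmd >> EX_IOC_SIZESHIFT) & 0x3FFFu;
}

static unsigned int ex_ioc_dir(unsigned int cmd)
{
    return (cmd >> EX_IOC_DIRSHIFT) & 0x3u;
}
```

With a 4-byte datatype, EX_IOW('f', 7, int) comes out as 0x40046607: direction 1 in the top two bits, size 4, type 'f' (0x66) and number 7. With an 8-byte long the size field becomes 8, giving 0x40086607.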

Once the ioctl is defined, you need to make the kernel understand it. This is done by installing your ioctl handler in the kernel as below :-


struct file_operations my_file_operations = {
              .read = my_read_func,
              .write = my_write_func,
              ........
              /* the handler below has the unlocked_ioctl signature (no inode argument) */
              .unlocked_ioctl = my_ioctl_handler,
              ........
              ........
};

and then define my_ioctl_handler() which performs the actual action.


long my_ioctl_handler(struct file *filp, unsigned int cmd, unsigned long arg) {
       struct inode *inode = filp->f_dentry->d_inode;
       long version;   /* matches the 'long' in _IOW('f',7,long) */
       switch(cmd) {
        ...........
       case EXMPL_IOC_CHANGE_VERSION :
               /* arg is a userspace pointer; copy the value in safely */
               if (get_user(version, (long __user *) arg))
                       return -EFAULT;
               inode->i_version = version;
               return 0;
       ............
       default :
               return -ENOTTY;   /* unknown codenumber */
       }
}

And then in userspace, you need to invoke your ioctl as :-


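#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>

int main(void) {
  int fd;
  long version = 123;
  fd = open("myfile", O_RDWR);
  if (fd < 0)
    exit(1);
  /* _IOW: pass a pointer to the data the kernel should read */
  if (ioctl(fd, EXMPL_IOC_CHANGE_VERSION, &version))
      fprintf(stderr, "Some error in ioctl ...\n");
  ........
}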

As you can see from the example above, ioctls are a very easy interface for asking the kernel to do something, and they are widely used in device drivers. Read Documentation/ioctl-numbers.txt for more on them.
