Saturday, July 2, 2011

Journalling layer in ext3 (jbd)

Terminology
Journal handle - A pointer that represents a single atomic filesystem operation. It tracks all the modifications made as part of that operation.
Transaction - A single atomic sequence of events that guarantees filesystem consistency. It can consist of a single handle, or of multiple handles batched together for efficiency.
Transaction commit - Flushing the in-memory contents of the transaction to the appropriate blocks in the journal, then writing a commit record to the journal on disk.
Transaction checkpoint - Flushing the contents of the journal to their actual locations on disk. This is done periodically to make journal space reusable.
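
To make the difference between commit and checkpoint concrete, here is a minimal userspace sketch. It is a toy model, not jbd code: toy_journal, toy_commit() and toy_checkpoint() are made-up names, and the model assumes a commit costs one journal block per modified block plus one block for the commit record.

```c
#include <assert.h>

/* Toy model only: these are NOT jbd structures. */
struct toy_journal {
    int capacity;    /* total blocks in the journal area */
    int in_memory;   /* blocks modified in memory, not yet in the journal */
    int in_journal;  /* blocks committed to the journal, not yet checkpointed */
};

/* Commit: copy the modified blocks into the journal plus one commit record.
 * Returns 0 on success, -1 if the journal has no room left. */
int toy_commit(struct toy_journal *j)
{
    int need = j->in_memory + 1;   /* +1 for the commit record (assumption) */

    if (j->in_journal + need > j->capacity)
        return -1;                 /* a checkpoint is needed first */
    j->in_journal += need;
    j->in_memory = 0;
    return 0;
}

/* Checkpoint: write journal contents to their home locations, which makes
 * the journal space reusable. Returns the number of blocks freed. */
int toy_checkpoint(struct toy_journal *j)
{
    int freed = j->in_journal;

    j->in_journal = 0;
    return freed;
}
```

In this model a commit that does not fit fails until toy_checkpoint() has released the space, which is the reason checkpoints have to happen periodically.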

Typically, journalling a filesystem operation consists of the following three steps:

a) Starting a handle - journal_start(). We need to specify how many filesystem blocks this operation can potentially modify, so that the journal can guarantee enough space to write the whole operation.
The number of blocks required is the total of everything that may change: data blocks, metadata blocks, quota blocks if any, and so on. As an example, see EXT3_DATA_TRANS_BLOCKS. These are called the buffer credits of the handle.
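
The credit bookkeeping can be pictured with a small userspace sketch. Everything here is hypothetical (toy_transaction, toy_handle, toy_start(), toy_use_credit(), toy_stop()); it only illustrates the idea of reserving blocks up front and handing back what was not used, and it simply fails where the real journal would have to make room.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel's transaction_t / handle_t. */
struct toy_transaction {
    int free_blocks;              /* journal blocks not yet reserved */
};

struct toy_handle {
    struct toy_transaction *txn;
    int buffer_credits;           /* blocks this handle may still modify */
};

/* Reserve nblocks credits for one operation. Returns 0, or -1 if the
 * request cannot be satisfied (the toy model just gives up here). */
int toy_start(struct toy_transaction *t, struct toy_handle *h, int nblocks)
{
    if (nblocks <= 0 || nblocks > t->free_blocks)
        return -1;
    t->free_blocks -= nblocks;
    h->txn = t;
    h->buffer_credits = nblocks;
    return 0;
}

/* Consume one credit when a block is actually modified. */
int toy_use_credit(struct toy_handle *h)
{
    if (h->buffer_credits == 0)
        return -1;                /* operation under-estimated its needs */
    h->buffer_credits--;
    return 0;
}

/* Finish the operation and return the unused credits. */
void toy_stop(struct toy_handle *h)
{
    h->txn->free_blocks += h->buffer_credits;
    h->buffer_credits = 0;
    h->txn = NULL;
}
```

Over-estimating the credits is harmless in this model because toy_stop() gives the remainder back; under-estimating is what must be avoided.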

b) After getting a handle, the next step is to associate the blocks to be modified with the handle, so that the journal knows it has to write them. This is done via journal_get_write_access(handle, bh), which tells the journal that this buffer is going to be modified. A buffer that is of interest to the journalling layer has BH_JBD set on it and a non-zero b_count. At this point a "journal_head" is attached to the buffer. A journal_head can only be part of one transaction.

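journal_get_write_access(handle, bh) {
    jh = journal_add_journal_head(bh);    /* attach (or reuse) a journal_head and take a reference */
    rc = do_get_write_access(handle, jh, 0);
    journal_put_journal_head(jh);         /* drop the reference taken above */
    return rc;
}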

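journal_add_journal_head(bh) {
    /* simplified: the next four lines run only if bh has no journal_head yet */
    jh = journal_alloc_journal_head();
    set_buffer_jbd(bh);
    bh->b_private = jh;
    jh->b_bh = bh;
    jh->b_jcount++;                       /* one more user of this journal_head */
    return jh;
}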
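
The attach/reference/release pattern in the pseudocode above can be tried out in userspace with the sketch below. The types and functions (toy_bh, toy_jh, toy_add_journal_head(), toy_put_journal_head()) are made up for illustration, a plain int stands in for the BH_JBD flag, and locking is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_bh;

/* Toy stand-in for journal_head. */
struct toy_jh {
    struct toy_bh *b_bh;      /* back pointer to the buffer */
    int b_jcount;             /* number of current users */
    void *b_transaction;      /* non-NULL while owned by a transaction */
};

/* Toy stand-in for buffer_head. */
struct toy_bh {
    int jbd_flag;             /* plays the role of BH_JBD */
    struct toy_jh *b_private;
};

/* Attach a journal head on first use, otherwise reuse the existing one.
 * Either way the caller gets one reference. Returns NULL on allocation failure. */
struct toy_jh *toy_add_journal_head(struct toy_bh *bh)
{
    struct toy_jh *jh = bh->b_private;

    if (!bh->jbd_flag) {
        jh = calloc(1, sizeof(*jh));
        if (jh == NULL)
            return NULL;
        bh->jbd_flag = 1;
        bh->b_private = jh;
        jh->b_bh = bh;
    }
    jh->b_jcount++;
    return jh;
}

/* Drop one reference; detach and free once nobody uses the journal head
 * and no transaction owns it. */
void toy_put_journal_head(struct toy_jh *jh)
{
    struct toy_bh *bh = jh->b_bh;

    assert(jh->b_jcount > 0);
    if (--jh->b_jcount == 0 && jh->b_transaction == NULL) {
        bh->jbd_flag = 0;
        bh->b_private = NULL;
        free(jh);
    }
}
```

Two callers touching the same buffer share one toy_jh in this sketch; it goes away only after both have dropped their reference and no transaction holds it.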

PS: A buffer is already part of a transaction if its journal_head's b_transaction or b_next_transaction is set. Most of the time only b_transaction is set. b_next_transaction is set when the buffer is still being committed by the previous transaction and we are changing it for the current one; it tells the journal that this buffer is going into the next transaction. In this case a copy-on-write is performed and the frozen copy is stored in jh->b_frozen_data.
NB: A buffer's b_transaction is set only if the buffer is part of the running or committing transaction, not if it merely resides on some other list such as the checkpoint list.
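
The ownership check and the copy-on-write step can be sketched as follows. This is a simplified userspace model: toy_txn, toy_jh, toy_in_transaction() and toy_get_write_access() are hypothetical, the block size is tiny, and the model freezes the buffer whenever an older transaction owns it, which is cruder than what the real code decides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

struct toy_txn { int tid; };

/* Toy stand-in for journal_head plus the buffer contents. */
struct toy_jh {
    struct toy_txn *b_transaction;       /* current owner, if any */
    struct toy_txn *b_next_transaction;  /* owner after the commit finishes */
    char *b_frozen_data;                 /* snapshot for the committing owner */
    char data[TOY_BLOCK_SIZE];           /* live buffer contents */
};

/* A buffer belongs to some transaction if either pointer is set. */
int toy_in_transaction(const struct toy_jh *jh)
{
    return jh->b_transaction != NULL || jh->b_next_transaction != NULL;
}

/* Call before 'running' modifies the buffer. Returns 0, or -1 on ENOMEM. */
int toy_get_write_access(struct toy_jh *jh, struct toy_txn *running)
{
    if (jh->b_transaction == NULL) {
        jh->b_transaction = running;     /* first user: just take ownership */
        return 0;
    }
    if (jh->b_transaction == running)
        return 0;                        /* already ours */

    /* Owned by an older transaction: keep a frozen copy for it,
     * then queue the buffer for the running transaction. */
    if (jh->b_frozen_data == NULL) {
        jh->b_frozen_data = malloc(TOY_BLOCK_SIZE);
        if (jh->b_frozen_data == NULL)
            return -1;
        memcpy(jh->b_frozen_data, jh->data, TOY_BLOCK_SIZE);
    }
    jh->b_next_transaction = running;
    return 0;
}
```

After toy_get_write_access() the caller can overwrite data freely; the older transaction would write out b_frozen_data instead of the live contents.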

c) Stop the handle - journal_stop(): As the name suggests, journal_stop() marks the completion of an operation with respect to the journal. It returns any unused buffer credits to the transaction, drops the appropriate references and frees the handle.
If the filesystem requested this operation in sync mode, the transaction must also start committing to the journal once the handle completes. However, the current code has an optimization here: it decides whether it is beneficial to start writing to disk immediately or, based on the operation rate, to wait for some time and let other operations join in.

See the following code as an example.

int journal_stop(handle_t *handle)
{
    ...............
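    /* trans_time: age of the running transaction;
     * commit_time: the journal's average commit time */
    if (trans_time < commit_time) {
        /* transaction is still young: sleep for one commit_time
         * before going on to the commit */
        ktime_t expires = ktime_add_ns(ktime_get(),
                commit_time);
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
    }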
    ...............
}
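
The decision in that snippet boils down to one comparison. Below is a minimal sketch of it as a pure function, assuming trans_time is how long the current transaction has been open and commit_time is a recent average commit duration, both in nanoseconds. toy_batch_wait_ns() is a made-up name, not something from jbd.

```c
#include <assert.h>
#include <stdint.h>

/* How long a sync operation should sleep before pushing the commit.
 * Returns 0 when the commit should start right away. */
uint64_t toy_batch_wait_ns(uint64_t trans_time_ns, uint64_t commit_time_ns)
{
    /* A transaction younger than one average commit is worth holding
     * open a bit longer so that other operations can join it. */
    if (trans_time_ns < commit_time_ns)
        return commit_time_ns;
    return 0;
}
```

For example, with a 500 ns average commit time a transaction that is 100 ns old waits another 500 ns, while one that is already 900 ns old commits immediately.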
Each journal_start()/journal_stop() pair, i.e. each handle, represents one atomic filesystem operation. Some operations are atomic in themselves but still not sufficient to leave the filesystem in a consistent state. An example is a write that requires a quota update. Such an operation needs nested journal handles to be atomic as a whole.

A typical sequence would be:
a) Start journal handle for write
b) Start journal handle for quota update
c) Stop journal handle for quota update
d) Stop journal handle for write.

It is only after step (d) that the operation can be committed to disk.
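
One way to get this behaviour is reference counting: a nested start hands back the handle that is already open and bumps a counter, and only the outermost stop finishes the operation. jbd's handle_t has an h_ref field for this, but the code below is only a userspace toy. toy_handle, toy_journal_start() and toy_journal_stop() are hypothetical, and a static variable stands in for the per-task "current handle".

```c
#include <assert.h>
#include <stddef.h>

struct toy_handle {
    int h_ref;        /* nesting depth */
    int committable;  /* set once the outermost stop has run */
};

static struct toy_handle toy_storage;   /* the one handle of this toy task */
static struct toy_handle *toy_current;  /* NULL when no operation is open */

/* Start an operation, or nest inside the one already open. */
struct toy_handle *toy_journal_start(void)
{
    if (toy_current != NULL) {
        toy_current->h_ref++;           /* nested: reuse the open handle */
        return toy_current;
    }
    toy_storage.h_ref = 1;
    toy_storage.committable = 0;
    toy_current = &toy_storage;
    return toy_current;
}

/* Returns 1 if this was the outermost stop, 0 for a nested stop. */
int toy_journal_stop(struct toy_handle *h)
{
    assert(h == toy_current && h->h_ref > 0);
    if (--h->h_ref > 0)
        return 0;                       /* inner stop: operation still open */
    h->committable = 1;                 /* step (d): now safe to commit */
    toy_current = NULL;
    return 1;
}
```

Mapped onto the sequence above, steps (a) and (b) both call toy_journal_start(), step (c) is a stop that returns 0, and step (d) is the stop that returns 1.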

NB: A buffer is a "journalled" buffer only if it has a journal_head attached to it.


A journal transaction keeps several lists on which buffers of interest can reside. A buffer ends up on one of these lists depending on its flag/state. Below is the list to buffer state mapping. See the function __journal_file_buffer() for how buffers are moved across lists.

List type => buffer state flag
transaction->t_sync_datalist => BJ_SyncData
transaction->t_buffers => BJ_Metadata
transaction->t_forget => BJ_Forget
transaction->t_iobuf_list => BJ_IO
transaction->t_shadow_list => BJ_Shadow
transaction->t_log_list => BJ_LogCtl
transaction->t_reserved_list => BJ_Reserved
transaction->t_locked_list => BJ_Locked
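
The mapping can be written down as a lookup plus a "refile" helper. This is a userspace sketch with made-up names (toy_jlist, toy_txn, toy_jh, toy_list_name(), toy_file_buffer()); per-list counters stand in for the real linked lists, and the enum values are not meant to match the kernel's BJ_* constants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy buffer states, one per list in the table above, plus "on no list". */
enum toy_jlist {
    TOY_NONE, TOY_SYNCDATA, TOY_METADATA, TOY_FORGET, TOY_IO,
    TOY_SHADOW, TOY_LOGCTL, TOY_RESERVED, TOY_LOCKED, TOY_TYPES
};

struct toy_txn { int count[TOY_TYPES]; };   /* buffers per list */
struct toy_jh  { enum toy_jlist b_jlist; }; /* list the buffer is on */

/* Name of the transaction list that holds buffers in the given state. */
const char *toy_list_name(enum toy_jlist state)
{
    static const char *const names[TOY_TYPES] = {
        NULL, "t_sync_datalist", "t_buffers", "t_forget", "t_iobuf_list",
        "t_shadow_list", "t_log_list", "t_reserved_list", "t_locked_list"
    };

    return (state > TOY_NONE && state < TOY_TYPES) ? names[state] : NULL;
}

/* Move a buffer to another list: leave the old one, join the new one. */
void toy_file_buffer(struct toy_txn *t, struct toy_jh *jh, enum toy_jlist to)
{
    assert(to > TOY_NONE && to < TOY_TYPES);
    if (jh->b_jlist == to)
        return;                         /* already on the right list */
    if (jh->b_jlist != TOY_NONE)
        t->count[jh->b_jlist]--;
    t->count[to]++;
    jh->b_jlist = to;
}
```

A buffer is on at most one list at a time in this model, so moving it, say, from the reserved list to the metadata list decrements one counter and increments the other.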