Saturday, July 2, 2011

Committing a journal transaction in jbd

A journal transaction commit consists of 8 phases. The transaction's state transitions are noted below in each phase.

The main function that performs the commit is journal_commit_transaction(). When we decide to commit, the transaction is in the running state (T_RUNNING).

Lock the transaction against new updates. ===> T_LOCKED
---> Wait for any existing handles in the transaction to complete their updates.
---> Discard buffers from the reserved list (t_reserved_list). If a buffer is part of the next transaction, it is transferred to the appropriate list of the next transaction; otherwise it is dropped from the journal's lists.
---> Drop written-back buffers from the checkpoint list (t_checkpoint_list). Unless the buffers belong to the running or committing transaction, the corresponding transaction is also freed.

Phase 1 start

---> Change transaction state to T_FLUSH
---> Switch the revoke tables.
---> The running transaction now becomes the committing transaction, so at this point there is no running transaction.

Phase 2 start

/* Flushing starts now */
---> Data buffers are flushed first. (t_sync_datalist)
---> Write out the revoke records from the revoke hash list to descriptor blocks in the journal.
---> Change transaction state to T_COMMIT

Phase 3 start

---> Flush metadata buffers (present on t_buffers list). See journal_write_metadata_buffer()

Phase 4 start

---> Wait for the IO submitted above. Wait for the metadata buffers which are on t_iobuf_list. The dummy buffer heads created for the metadata buffers are released. Each original metadata buffer, which was put on the shadow list, is released from it and put on the t_forget list.

Phase 5 start

---> Wait for the submitted revoke record and descriptor buffers to be written out. This is done by waiting for the buffers on t_log_list.

Phase 6 start

---> Change transaction state to T_COMMIT_RECORD
---> IO for data is complete now. Write the commit record to the journal.

Phase 7 start

---> Walk the transaction's t_forget list and get rid of buffers until there are none left on it. As each buffer is examined, we check whether it was on the checkpoint list of a previous transaction. If it was, it is removed from there and, if required (in case it is dirty), transferred to the checkpoint list of the committing transaction. See __journal_insert_checkpoint()

Phase 8 start

---> We are done committing the transaction now.
---> Change transaction state to T_FINISHED
---> Set committing transaction = NULL.
---> Calculate average commit time for future use.
---> Setup the checkpointing transaction.
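
The state transitions listed above can be sketched as a small state machine. This is only a toy model under stated assumptions: the struct layouts and the toy_* names are made up for illustration, only the states named in this post are modelled, and all of the actual IO done in each phase is left out.

```c
#include <assert.h>
#include <stddef.h>

/* States as named in the text; the kernel's own enum may differ. */
enum toy_tstate {
    T_RUNNING,
    T_LOCKED,
    T_FLUSH,
    T_COMMIT,
    T_COMMIT_RECORD,
    T_FINISHED
};

/* Hypothetical, stripped-down stand-ins for the transaction and journal. */
struct toy_transaction {
    enum toy_tstate state;
};

struct toy_journal {
    struct toy_transaction *running;     /* accepts new handles */
    struct toy_transaction *committing;  /* being written to the log */
};

/* Move a transaction to the next state in commit order. */
void toy_advance(struct toy_transaction *t)
{
    assert(t->state != T_FINISHED);
    t->state = (enum toy_tstate)(t->state + 1);
}

/* Walk the running transaction through the whole commit sequence. */
void toy_commit(struct toy_journal *j)
{
    struct toy_transaction *t = j->running;

    assert(t != NULL && t->state == T_RUNNING);
    toy_advance(t);        /* T_LOCKED: no new updates */
    toy_advance(t);        /* T_FLUSH (phase 1) */
    j->committing = t;     /* running becomes committing */
    j->running = NULL;
    toy_advance(t);        /* T_COMMIT (phase 2) */
    toy_advance(t);        /* T_COMMIT_RECORD (phase 6) */
    toy_advance(t);        /* T_FINISHED (phase 8) */
    j->committing = NULL;
}
```

Calling toy_commit() on a journal whose running transaction is in T_RUNNING leaves that transaction in T_FINISHED, with both the running and committing pointers cleared.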

Journalling layer in ext3 (jbd)

Journal handle - Represents a single atomic filesystem operation. It tracks all the modifications done as part of that operation.
Transaction - A single atomic sequence of events which guarantees filesystem consistency. It can consist of a single handle, or of multiple handles batched together for efficiency.
Transaction commit - Flushing the in-memory contents of the transaction to the appropriate blocks in the journal, followed by writing a commit record to the journal on disk.
Transaction checkpoint - Flushing the contents from the journal to their actual location on disk. This is done periodically to make journal space reusable.

Typically, journalling a filesystem operation consists of the following three steps :-

a) Starting a handle - journal_start(). We need to specify how many fs blocks this op can potentially modify. This ensures that there is enough space in the journal to write the contents of this operation completely.
The number of blocks required is the total number of blocks the op may touch: the data blocks which are going to change, metadata blocks, quota blocks if any, etc. As an example see EXT3_DATA_TRANS_BLOCKS. These are called the buffer credits of the handle.
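
The credit accounting described above can be sketched as follows. This is a toy model, not the jbd code: the toy_* types, their field names and the "fail instead of wait" behaviour are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; field names are illustrative only. */
struct toy_transaction {
    int outstanding_credits;  /* blocks reserved by live handles */
    int max_buffers;          /* journal space usable by one transaction */
};

struct toy_handle {
    struct toy_transaction *transaction;
    int buffer_credits;       /* blocks this handle may still dirty */
};

/* Reserve nblocks credits; returns 0 on success, -1 if they do not fit. */
int toy_journal_start(struct toy_transaction *t, struct toy_handle *h,
                      int nblocks)
{
    if (nblocks < 0 || t->outstanding_credits + nblocks > t->max_buffers)
        return -1;            /* a real journal would wait here instead */
    t->outstanding_credits += nblocks;
    h->transaction = t;
    h->buffer_credits = nblocks;
    return 0;
}

/* Consume one credit for a journalled block; -1 when none are left. */
int toy_use_credit(struct toy_handle *h)
{
    if (h->buffer_credits == 0)
        return -1;
    h->buffer_credits--;
    return 0;
}

/* Hand the unused credits back to the transaction. */
void toy_journal_stop(struct toy_handle *h)
{
    assert(h->transaction != NULL);
    h->transaction->outstanding_credits -= h->buffer_credits;
    h->buffer_credits = 0;
    h->transaction = NULL;
}
```

With a transaction limit of 10, a handle that reserves 8 credits and uses only 1 leaves 1 credit outstanding once it is stopped; a second request for 3 credits made while the first handle is still open is refused.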

b) After getting a handle, the next step is to associate the modified blocks with the journal handle, so that the journal knows it has to write these blocks to the journal. This is done via journal_get_write_access(handle, bh), which tells the journal that this buffer is going to be modified. A buffer which is of interest to the journalling layer has BH_JBD set on it and has a non-zero b_count. At this point a "journal_head" is attached to the buffer. A journal_head can only be part of one transaction.

journal_get_write_access(handle, bh) {
    ...
    do_get_write_access(handle, jh, 0);
}

journal_add_journal_head(bh) {
    jh = journal_alloc_journal_head();
    bh->b_private = jh;
    jh->b_bh = bh;
    ...
}

PS: A buffer is already part of a transaction if its journal_head's b_transaction or b_next_transaction is set. Most of the time, only b_transaction is set. b_next_transaction is set in case the buffer is still being committed by the previous transaction while we are changing it for the current transaction. b_next_transaction tells the journal that this buffer is going into the next transaction. In this case a copy-on-write is performed and the frozen copy is stored in jh->b_frozen_data.
NB : A buffer's b_transaction will only be set if it is part of the running or committing transaction, and not if it resides on some other list like the checkpoint list.
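
The b_transaction / b_next_transaction / b_frozen_data interplay can be sketched as below. This is a simplified toy model: the toy_* types and the function are hypothetical, the block size is arbitrary, and the real code takes many more conditions (locking, list state, credits) into account before freezing a buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16   /* arbitrary size for the sketch */

struct toy_transaction {
    int tid;
};

/* Hypothetical journal_head look-alike that carries the block data inline. */
struct toy_journal_head {
    struct toy_transaction *b_transaction;
    struct toy_transaction *b_next_transaction;
    char *b_frozen_data;    /* snapshot for the committing transaction */
    char data[TOY_BLOCK_SIZE];
};

/* Declare intent to modify the buffer in the running transaction. */
int toy_get_write_access(struct toy_journal_head *jh,
                         struct toy_transaction *running,
                         struct toy_transaction *committing)
{
    assert(running != NULL);

    /* Already filed for the running transaction: nothing to do. */
    if (jh->b_transaction == running || jh->b_next_transaction == running)
        return 0;

    /* Still owned by the committing transaction: freeze the old contents
     * so the commit writes them, then queue the buffer for the next one. */
    if (committing != NULL && jh->b_transaction == committing) {
        if (jh->b_frozen_data == NULL) {
            jh->b_frozen_data = malloc(TOY_BLOCK_SIZE);
            if (jh->b_frozen_data == NULL)
                return -1;
            memcpy(jh->b_frozen_data, jh->data, TOY_BLOCK_SIZE);
        }
        jh->b_next_transaction = running;
        return 0;
    }

    /* Not part of any live transaction yet. */
    jh->b_transaction = running;
    return 0;
}
```

After the call, the caller can modify data freely: the committing transaction keeps writing the frozen copy, while the new contents belong to the running transaction.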

c) Stop the handle - journal_stop() : As the name suggests, journal_stop() marks the completion of an op with respect to the journal. It returns any leftover unused buffer credits to the transaction, drops the appropriate references and frees the handle.
If the filesystem requested this op in sync mode, we also need to start committing the transaction to the journal when the handle completes. However, the current code has some optimizations to figure out whether it is beneficial to start writing to disk immediately or, based on the op rate, to wait for some time and let another op do it.

See the following code as example.

int journal_stop(handle_t *handle)
{
    ...
    if (trans_time < commit_time) {
        ktime_t expires = ktime_add_ns(ktime_get(), ...);
        schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
    }
    ...
}

Each journal_start()/journal_stop() pair, i.e. each handle, covers one atomic filesystem operation. Some fs operations may be atomic in themselves but still not sufficient to leave the filesystem in a consistent state. An example of such an op is a write which requires a quota update. Nested journal handles are required to make such a compound op atomic.

Typical sequence would be
a) Start journal handle for write
b) Start journal handle for quota update
c) Stop journal handle for quota update
d) Stop journal handle for write.

It is only after step (d) that the op can be committed to disk.
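
One way to model nested handles is a reference count on the handle: an inner start reuses the handle already in progress, and only the outermost stop completes the op. The sketch below assumes exactly that; the toy_* types are made up, and the per-task pointer is a stand-in for however the kernel tracks the current handle.

```c
#include <assert.h>
#include <stddef.h>

struct toy_handle {
    int h_ref;                        /* nesting depth */
};

/* Hypothetical per-task state. */
struct toy_task {
    struct toy_handle *journal_info;  /* handle in progress, if any */
    struct toy_handle storage;        /* avoids dynamic allocation here */
    int completed_ops;                /* ops now eligible for commit */
};

/* Start a handle, or nest inside the one already in progress. */
struct toy_handle *toy_journal_start(struct toy_task *task)
{
    if (task->journal_info != NULL) {
        task->journal_info->h_ref++;
        return task->journal_info;
    }
    task->storage.h_ref = 1;
    task->journal_info = &task->storage;
    return task->journal_info;
}

/* Stop a handle; only the outermost stop completes the op. */
void toy_journal_stop(struct toy_task *task)
{
    struct toy_handle *h = task->journal_info;

    assert(h != NULL && h->h_ref > 0);
    if (--h->h_ref > 0)
        return;                       /* inner stop: outer op still open */
    task->journal_info = NULL;
    task->completed_ops++;
}
```

In the write plus quota update sequence, steps (a) and (b) return the same handle, step (c) only drops the reference count, and step (d) is the one that marks the op as complete.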

NB : A buffer is a "journalled" buffer, only if it has a journal head attached to it.

A journal transaction consists of various lists on which buffers of interest can reside. A buffer ends up on one of the lists depending on the flag/state it has. Below is the mapping from list to buffer state. See the function __journal_file_buffer() to see how buffers are moved across lists.

List type => buffer state flag
transaction->t_sync_datalist =>  BJ_SyncData
transaction->t_buffers => BJ_Metadata
transaction->t_forget => BJ_Forget
transaction->t_iobuf_list => BJ_IO
transaction->t_shadow_list => BJ_Shadow
transaction->t_log_list => BJ_LogCtl
transaction->t_reserved_list => BJ_Reserved
transaction->t_locked_list => BJ_Locked
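
The mapping above amounts to a switch from buffer state to list head. The sketch below mirrors that idea with toy types: the list and state names are taken from the table, but the enum values, the singly linked lists and the toy_* functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Buffer states from the table above; numeric values are arbitrary here. */
enum toy_jlist {
    BJ_None, BJ_SyncData, BJ_Metadata, BJ_Forget, BJ_IO,
    BJ_Shadow, BJ_LogCtl, BJ_Reserved, BJ_Locked
};

/* Hypothetical journal_head look-alike on a singly linked list. */
struct toy_jh {
    struct toy_jh *next;
    enum toy_jlist jlist;
};

struct toy_transaction {
    struct toy_jh *t_sync_datalist;
    struct toy_jh *t_buffers;
    struct toy_jh *t_forget;
    struct toy_jh *t_iobuf_list;
    struct toy_jh *t_shadow_list;
    struct toy_jh *t_log_list;
    struct toy_jh *t_reserved_list;
    struct toy_jh *t_locked_list;
};

/* Map a buffer state to the list head it belongs on, or NULL. */
struct toy_jh **toy_list_for(struct toy_transaction *t, enum toy_jlist jlist)
{
    switch (jlist) {
    case BJ_SyncData: return &t->t_sync_datalist;
    case BJ_Metadata: return &t->t_buffers;
    case BJ_Forget:   return &t->t_forget;
    case BJ_IO:       return &t->t_iobuf_list;
    case BJ_Shadow:   return &t->t_shadow_list;
    case BJ_LogCtl:   return &t->t_log_list;
    case BJ_Reserved: return &t->t_reserved_list;
    case BJ_Locked:   return &t->t_locked_list;
    default:          return NULL;
    }
}

/* File a buffer on the list matching its new state; -1 if there is none. */
int toy_file_buffer(struct toy_transaction *t, struct toy_jh *jh,
                    enum toy_jlist jlist)
{
    struct toy_jh **list = toy_list_for(t, jlist);

    assert(jh != NULL);
    if (list == NULL)
        return -1;
    jh->jlist = jlist;
    jh->next = *list;
    *list = jh;
    return 0;
}
```

Filing a buffer as BJ_Metadata puts it at the head of t_buffers; asking for BJ_None returns an error because no list corresponds to it.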

Checkpointing transactions in journal (jbd)

Journal checkpointing : jbd/checkpoint.c
The main functions involved in doing journal checkpointing are :-
a) log_do_checkpoint
b) __process_buffer
c) __flush_batch
d) __wait_cp_io

log_do_checkpoint picks up the first transaction on the checkpoint list and then iterates over all the buffers present in the transaction by calling __process_buffer on each of them. As it traverses, it accumulates the buffers in a local array so that the disk writes can be batched. As part of processing, it also moves each buffer from checkpoint_list to checkpoint_io_list to indicate that IO is pending on it.
Once the array is full, or there are no more buffers to process, __flush_batch is called to send those buffers to disk for writing.

After the buffers are submitted to disk, __wait_cp_io() is called to wait for the write of each buffer to complete. Once they are clean, the buffers are removed from the checkpoint_io_list. After all its buffers are freed, the transaction itself is freed.
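
The batching pattern described above can be sketched like this. It is a toy model: the batch size, the toy_* names and the in-memory "write" (simply clearing a dirty flag) are assumptions for illustration, not the code in jbd/checkpoint.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_BATCH 4   /* arbitrary batch size for the sketch */

struct toy_buffer {
    int dirty;
    int on_io_list;      /* 1 while a write is considered pending */
};

struct toy_batch {
    struct toy_buffer *bufs[TOY_NR_BATCH];
    int count;
    int flushes;         /* number of batches submitted so far */
};

/* Submit the accumulated buffers; the "write" just cleans them. */
void toy_flush_batch(struct toy_batch *b)
{
    int i;

    if (b->count == 0)
        return;
    for (i = 0; i < b->count; i++)
        b->bufs[i]->dirty = 0;
    b->count = 0;
    b->flushes++;
}

/* Queue one buffer for writing and flush when the array is full. */
void toy_process_buffer(struct toy_batch *b, struct toy_buffer *buf)
{
    assert(b->count < TOY_NR_BATCH);
    buf->on_io_list = 1;
    b->bufs[b->count++] = buf;
    if (b->count == TOY_NR_BATCH)
        toy_flush_batch(b);
}

/* Checkpoint n buffers; returns how many batches were submitted. */
int toy_do_checkpoint(struct toy_buffer *list, int n)
{
    struct toy_batch batch = { { NULL }, 0, 0 };
    int i;

    for (i = 0; i < n; i++)
        if (list[i].dirty)
            toy_process_buffer(&batch, &list[i]);
    toy_flush_batch(&batch);      /* leftover partial batch */

    for (i = 0; i < n; i++)       /* stand-in for waiting on the IO */
        list[i].on_io_list = 0;
    return batch.flushes;
}
```

With a batch size of 4, checkpointing 6 dirty buffers submits one full batch and one partial batch.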

* Helper function to clear all the clean buffers from the checkpoint lists.
__journal_clean_checkpoint_list : Traverses the transactions on the checkpoint transactions list (j_checkpoint_transactions) and frees memory by walking each transaction's checkpoint list (t_checkpoint_list), one at a time.

Friday, July 1, 2011

Journal (jbd) revoke mechanism

Journal revoke :- jbd/revoke.c

Revoke is a method of preventing journal replay from corrupting the filesystem: it stops the replay from writing the old journalled contents of a freed block over the newer contents of that block. For example, consider the following sequence of steps when the filesystem is mounted in metadata-only journalling mode.

a) A metadata block 'B' is journalled and contents are copied to journal.
b) Later 'B' gets freed
c) 'B' is now used to write contents of user data, this is not journalled.

Now if we crash and replay, we need to avoid replaying the contents of block 'B' in journal over the user contents.

Revoke mechanism:- During the commit of a transaction, all the blocks which were revoked are recorded in the journal. This record of revoked blocks is used during journal recovery: the journal is scanned for revoked blocks before any op is replayed. If a transaction logs the block after the last revoke record for that block, those ops are safe to replay. Ops in transactions which appear before the revoke record aren't replayed. The basic idea is that you don't want to replay ops on a block which may have been freed. Also note that if there are multiple revoke records for a block in the journal, we only need to worry about the latest one, i.e. the one with the highest transaction id.
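
The replay rule above (keep only the latest revoke per block, and replay a logged block only if it comes from a later transaction) can be sketched as follows. This is a toy model: a small linear table stands in for the hash, and all toy_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REVOKES 32

typedef unsigned int toy_tid_t;

struct toy_revoke_record {
    unsigned long blocknr;
    toy_tid_t sequence;          /* latest transaction revoking the block */
};

/* A linear table stands in for the revoke hash in this sketch. */
struct toy_revoke_table {
    struct toy_revoke_record rec[TOY_MAX_REVOKES];
    int count;
};

/* Wraparound-tolerant "x is later than y" on transaction ids. */
int toy_tid_gt(toy_tid_t x, toy_tid_t y)
{
    return (int)(x - y) > 0;
}

struct toy_revoke_record *toy_find(struct toy_revoke_table *t,
                                   unsigned long blocknr)
{
    int i;

    assert(t != NULL);
    for (i = 0; i < t->count; i++)
        if (t->rec[i].blocknr == blocknr)
            return &t->rec[i];
    return NULL;
}

/* Record a revoke, keeping only the highest transaction id per block. */
int toy_set_revoke(struct toy_revoke_table *t, unsigned long blocknr,
                   toy_tid_t seq)
{
    struct toy_revoke_record *r = toy_find(t, blocknr);

    if (r != NULL) {
        if (toy_tid_gt(seq, r->sequence))
            r->sequence = seq;
        return 0;
    }
    if (t->count == TOY_MAX_REVOKES)
        return -1;
    t->rec[t->count].blocknr = blocknr;
    t->rec[t->count].sequence = seq;
    t->count++;
    return 0;
}

/* Returns 1 if a block logged in transaction seq must NOT be replayed. */
int toy_test_revoke(struct toy_revoke_table *t, unsigned long blocknr,
                    toy_tid_t seq)
{
    struct toy_revoke_record *r = toy_find(t, blocknr);

    if (r == NULL)
        return 0;
    return !toy_tid_gt(seq, r->sequence);
}
```

If block 100 is revoked in transactions 5 and 7, a copy of it logged in transaction 4 or 7 is skipped, while a copy logged in transaction 8 is replayed.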

From file jbd/revoke.c.
 * We can get interactions between revokes and new log data within a
 * single transaction:
 * Block is revoked and then journaled:
 *   The desired end result is the journaling of the new block, so we
 *   cancel the revoke before the transaction commits.
 * Block is journaled and then revoked:
 *   The revoke must take precedence over the write of the block, so we
 *   need either to cancel the journal entry or to write the revoke
 *   later in the log than the log block.  In this case, we choose the
 *   latter: journaling a block cancels any revoke record for that block
 *   in the current transaction, so any revoke for that block in the
 *   transaction must have happened after the block was journaled and so
 *   the revoke must take precedence.
 * Block is revoked and then written as data:
 *   The data write is allowed to succeed, but the revoke is _not_
 *   cancelled.  We still need to prevent old log records from
 *   overwriting the new data.  We don't even need to clear the revoke
 *   bit here.

There are two hash tables to store the revoked entries: one for the running transaction and one for the committing transaction (if any). New entries are always logged into the table pointed to by journal->j_revoke, which is the one corresponding to the running transaction. You can think of it as a double buffering mechanism. The tables are switched during each commit by kjournald. Access to the hash table entries is protected by j_revoke_lock.
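
The double buffering of the revoke tables can be sketched as below. It is a toy model: a table is reduced to an entry counter, locking is left out entirely, and apart from the j_revoke pointer mentioned above every name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A table is reduced to an entry counter for the sketch. */
struct toy_revoke_table {
    int nr_entries;
};

struct toy_journal {
    struct toy_revoke_table tables[2];
    struct toy_revoke_table *j_revoke;  /* table of the running transaction */
};

void toy_init_revoke(struct toy_journal *j)
{
    j->tables[0].nr_entries = 0;
    j->tables[1].nr_entries = 0;
    j->j_revoke = &j->tables[0];
}

/* New revokes always land in the running transaction's table. */
void toy_revoke(struct toy_journal *j)
{
    j->j_revoke->nr_entries++;
}

/* Flip tables at commit; returns the table the commit must write out. */
struct toy_revoke_table *toy_switch_revoke_table(struct toy_journal *j)
{
    struct toy_revoke_table *old = j->j_revoke;

    j->j_revoke = (old == &j->tables[0]) ? &j->tables[1] : &j->tables[0];
    assert(j->j_revoke->nr_entries == 0);  /* drained by the last commit */
    return old;
}

/* Stand-in for writing the records to the journal: just empty the table. */
void toy_write_revoke_records(struct toy_revoke_table *t)
{
    t->nr_entries = 0;
}
```

After the switch, revokes issued by the new running transaction do not disturb the entries the commit is still writing out.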

Important functions:-
Initialize revoke hash : journal_init_revoke()
Inserts in hash : insert_revoke_hash()
Find in hash : find_revoke_record()
Transfer the in-memory revoke table to ondisk journal :    journal_write_revoke_records()

NB: You need to revoke a block before freeing it in the bitmap, and not vice versa, to prevent races.

The buffer head maintains two sets of flags to indicate the revoke status of a buffer.
a) RevokeValid : The revoke status of this buffer is known and can be trusted. If this is not set, we can't say much about the buffer and need to search for it in the hash.
b) Revoke{set/clear} : These flags only make sense when the above is set. They tell whether the block is revoked or not.
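
The two flags can be read as a cached answer plus a validity bit. The sketch below assumes that reading: the flag values, the toy_* names and the fake hash lookup (which treats block 42 as the only revoked block) are all made up, and caching the lookup result in the flags is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REVOKE_VALID 0x1u   /* cached revoke status can be trusted */
#define TOY_REVOKED      0x2u   /* meaningful only when VALID is set */

struct toy_buffer {
    unsigned long blocknr;
    unsigned int flags;
};

/* Hypothetical slow path standing in for the hash search;
 * here block 42 is the only revoked block. */
int toy_hash_lookup(unsigned long blocknr)
{
    return blocknr == 42;
}

/* Answer "is this block revoked?", consulting the hash only if needed. */
int toy_buffer_revoked(struct toy_buffer *b)
{
    assert(b != NULL);
    if (!(b->flags & TOY_REVOKE_VALID)) {
        if (toy_hash_lookup(b->blocknr))
            b->flags |= TOY_REVOKED;
        else
            b->flags &= ~TOY_REVOKED;
        b->flags |= TOY_REVOKE_VALID;   /* status is known from now on */
    }
    return (b->flags & TOY_REVOKED) != 0;
}
```

Once the valid bit is set, the answer comes from the flags alone and the hash is not searched again.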

Sunday, June 26, 2011

Journal recovery in jbd

Journal recovery :- jbd/recovery.c

Journal recovery is quite simple. It basically consists of the steps below.

a) Readahead journal blocks in memory.

b) Do the first pass (PASS_SCAN) to see whether we need a recovery and, if so, which transactions need to be replayed. This pass also checks that the journal is valid, along with other sanity checks. After the scan pass, an in-core data structure describing the journal (struct recovery_info) is populated with the information required for the recovery.

c) Do second pass (PASS_REVOKE). This traverses all the revoke block types and builds the incore hash of block numbers which are revoked. This ensures that we don't replay ops corresponding to these blocks when we do the actual replay.

d) Do the third and final pass (PASS_REPLAY), which does the actual job of replaying the journal and copies the data from the journal to the real filesystem. Replaying an op simply consists of reading the corresponding block from the filesystem, copying the contents from the journal into the buffer and then marking the buffer dirty, so that it is written back to its actual location in the filesystem.

NB: Steps (b), (c) and (d) are done through a common function do_one_pass().

e) Once the replay is complete, throw away the in-memory revoke hash.

f) Sync the blockdevice.

g) Once the recovery is done, journal_reset() is called to set up the in-memory fields of the journal, and the journal is ready for business again.

--> do_one_pass(PASS_SCAN)
--> do_one_pass(PASS_REVOKE)
--> do_one_pass(PASS_REPLAY)
--> journal_clear_revoke()
--> sync_blockdev() and journal_reset()
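
The three passes can be sketched over a toy in-memory log. Everything here is an assumption made for illustration: the record layout, the toy_* names, the plain array standing in for the filesystem, and transaction ids that start at 1 and never wrap. Only the pass names come from the text above.

```c
#include <assert.h>
#include <string.h>

#define TOY_FS_BLOCKS 8

enum toy_pass { PASS_SCAN, PASS_REVOKE, PASS_REPLAY };
enum toy_rec_type { TOY_DATA, TOY_REVOKE, TOY_COMMIT };

/* Hypothetical log record; a real journal uses on-disk block formats. */
struct toy_rec {
    enum toy_rec_type type;
    unsigned int tid;   /* transaction id, starting at 1 */
    int blocknr;        /* unused for TOY_COMMIT */
    int value;          /* payload, TOY_DATA only */
};

struct toy_recovery_info {
    int end;                              /* records [0, end) are committed */
    unsigned int revoked[TOY_FS_BLOCKS];  /* highest revoking tid, 0 = none */
};

/* One walk over the log; what it does depends on the pass. */
void toy_do_one_pass(const struct toy_rec *log, int n,
                     struct toy_recovery_info *info, enum toy_pass pass,
                     int *fs)
{
    int limit = (pass == PASS_SCAN) ? n : info->end;
    int i;

    for (i = 0; i < limit; i++) {
        const struct toy_rec *r = &log[i];

        assert(r->type == TOY_COMMIT ||
               (r->blocknr >= 0 && r->blocknr < TOY_FS_BLOCKS));
        switch (r->type) {
        case TOY_COMMIT:   /* scan: remember the last complete transaction */
            if (pass == PASS_SCAN)
                info->end = i + 1;
            break;
        case TOY_REVOKE:   /* revoke: keep the latest revoke per block */
            if (pass == PASS_REVOKE && r->tid > info->revoked[r->blocknr])
                info->revoked[r->blocknr] = r->tid;
            break;
        case TOY_DATA:     /* replay: copy unless revoked at or after tid */
            if (pass == PASS_REPLAY && r->tid > info->revoked[r->blocknr])
                fs[r->blocknr] = r->value;
            break;
        }
    }
}

void toy_recover(const struct toy_rec *log, int n, int *fs)
{
    struct toy_recovery_info info;

    memset(&info, 0, sizeof(info));
    toy_do_one_pass(log, n, &info, PASS_SCAN, fs);
    toy_do_one_pass(log, n, &info, PASS_REVOKE, fs);
    toy_do_one_pass(log, n, &info, PASS_REPLAY, fs);
    /* info, including the revoke table, is discarded on return */
}
```

For a log in which transaction 1 journals blocks 3 and 4, transaction 2 revokes block 3, and transaction 3 journals block 5 but never commits, recovery restores only block 4.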