Friday, July 1, 2011

Journal (jbd) revoke mechanism

Journal revoke :- jbd/revoke.c
------------------

Revoke is a method of preventing journal from corrupting filesystem by not replaying ops and overwriting the contents of a deleted block on a newer block. For example consider the following sequence of steps when the filesystem is mounted in metadata only journalling mode.

a) A metadata block 'B' is journalled and contents are copied to journal.
b) Later 'B' gets freed
c) 'B' is now used to write contents of user data, this is not journalled.


Now if we crash and replay, we need to avoid replaying the contents of block 'B' in journal over the user contents.

Revoke mechanism:- During commits of a transaction all the blocks which are revoked are stored in journal. This record of revoked blocks is used during journal recovery and journal is scanned for the revoked blocks before any ops is replayed. If there are transactions for the block after the last revoke record of a block, these ops are safe to replay. Any transactions which appear before the revoke record aren't replayed. The basic idea is that you don't want to replay ops corresponding to a block which may have been freed. Also note that if there are multiple revoke records corresponding to a block in a journal, we only need to worry about the latest record ie...one with highest transaction id.

From file jbd/revoke.c.
 * We can get interactions between revokes and new log data within a
 * single transaction:
 *
 * Block is revoked and then journaled:
 *   The desired end result is the journaling of the new block, so we
 *   cancel the revoke before the transaction commits.
 *
 * Block is journaled and then revoked:
 *   The revoke must take precedence over the write of the block, so we
 *   need either to cancel the journal entry or to write the revoke
 *   later in the log than the log block.  In this case, we choose the
 *   latter: journaling a block cancels any revoke record for that block
 *   in the current transaction, so any revoke for that block in the
 *   transaction must have happened after the block was journaled and so
 *   the revoke must take precedence.
 *
 * Block is revoked and then written as data:
 *   The data write is allowed to succeed, but the revoke is _not_
 *   cancelled.  We still need to prevent old log records from
 *   overwriting the new data.  We don't even need to clear the revoke
 *   bit here.

 There are two hash tables to store the revoked entries. These two tables are required one for the running transaction and one for the committing transaction (if any). As you can guess new entries are always logged into the revoke table pointed by current journal->j_revoke pointer which points to the one corresponding to running transaction. You can think of it as a double buffering mechanism. These tables are switched alternately during the commit from kjounald. Access to these hash table entries is protected by the j_revoke_lock.

Important functions:-
Initialize revoke hash : journal_init_revoke()
Inserts in hash : insert_revoke_hash()
Find in hash : find_revoke_record()
Transfer the in-memory revoke table to ondisk journal :    journal_write_revoke_records()

NB: Note that you need to revoke a block before freeing it in bitmap and not the viceversa to prevent races.

The buffer heads maintains two set of flags to indicate the revoke status of a buffer.
a) RevokeValid : The revoke status of this buffer is known and can be trusted. If this is not set we can't say much about the buffer and need to search for it in hash.
b) Revoke{set/clear} : These flags make sense when above is set. They tell whether the block is revoked or not.

No comments: