<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3103156096521938896</id><updated>2011-07-07T14:38:04.451-07:00</updated><category term='interacting with kernel'/><category term='ext3 write'/><category term='ext2'/><category term='ext3'/><category term='fsdb'/><category term='ioctl'/><category term='UFS copying from userspace'/><category term='online resizing'/><category term='vfs write'/><category term='negative integers'/><category term='mcopy'/><category term='UFS'/><category term='journalling'/><category term='ufsutils'/><category term='ext filesystem resizing'/><category term='hexadecimal conversion'/><title type='text'>Devil's den</title><subtitle type='html'>This blog mostly contains my random scribbling of linux notes. 
I try to keep it a little structured so that it's useful if anyone else happens to read it.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>11</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-3472636566725290142</id><published>2011-07-02T02:38:00.000-07:00</published><updated>2011-07-02T02:38:27.520-07:00</updated><title type='text'>Committing a journal transaction in jbd</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;br /&gt;A journal transaction commit consists of &lt;i&gt;&lt;b&gt;8 phases&lt;/b&gt;&lt;/i&gt;, with the transaction's state transitions noted below for each phase.&lt;br /&gt;&lt;br /&gt;The main function which does the journal commit is &lt;i style="color: red;"&gt;&lt;b&gt;journal_commit_transaction().&lt;/b&gt;&lt;/i&gt; When we decide to commit the transaction, it is in the running state. &lt;b&gt;&lt;i&gt;(T_RUNNING)&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Lock the transaction against new updates. 
===&amp;gt; &lt;b&gt;&lt;i&gt;T_LOCKED&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;---&amp;gt; Wait for any existing handles in the transaction to complete their updates.&lt;br /&gt;---&amp;gt; Discard buffers from the reserved list (&lt;b&gt;&lt;i&gt;t_reserved_list&lt;/i&gt;&lt;/b&gt;).&lt;br /&gt;&lt;br /&gt;If a buffer is part of the next transaction, it is transferred to the appropriate list of the next transaction; otherwise it is dropped from the journal's lists.&lt;br /&gt;---&amp;gt; Drop written-back buffers from the checkpoint list (t_checkpoint_list). Unless the buffers belong to the running or committing transaction, the corresponding transaction will also be freed up.&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 1 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Change transaction state to &lt;b&gt;T_FLUSH&lt;/b&gt;&lt;br /&gt;---&amp;gt; Switch the revoke tables.&lt;br /&gt;---&amp;gt; At this point there is no running transaction; it has become the committing transaction.&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 2 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;/* Flushing starts now */&lt;br /&gt;---&amp;gt; Data buffers are flushed first. (&lt;i&gt;&lt;b&gt;t_sync_datalist&lt;/b&gt;&lt;/i&gt;)&lt;br /&gt;---&amp;gt; Write out revoke records from the revoke hash list and flush them to the descriptor blocks in the journal.&lt;br /&gt;---&amp;gt; Change transaction state to &lt;b&gt;T_COMMIT&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 3 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Flush metadata buffers (present on the &lt;b&gt;t_buffers&lt;/b&gt; list). See &lt;b&gt;&lt;i&gt;journal_write_metadata_buffer()&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 4 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Wait for all the buffers submitted for IO above. Wait for metadata buffers which are present on &lt;b&gt;t_iobuf_list&lt;/b&gt;. 
The dummy buffer heads created for metadata buffers are released. The original metadata buffer which was put on the shadow list is released from it and put on the &lt;b&gt;t_forget&lt;/b&gt; list.&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 5 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Wait for the submitted revoke record and descriptor buffers to complete and be written out. This is done by waiting for buffers on &lt;b&gt;t_log_list.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="color: red;"&gt;Phase 6 start&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Change transaction state to &lt;i&gt;&lt;b&gt;T_COMMIT_RECORD&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;---&amp;gt; IO for data is complete now. Write the commit record in the journal.&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 7 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; Walk the transaction's &lt;i&gt;&lt;b&gt;t_forget&lt;/b&gt;&lt;/i&gt; list to get rid of buffers till there are no more buffers on it. As each buffer is examined, we check whether it was on the checkpoint io list of a previous transaction. If it was, it's removed and, if required (in case it's dirty), transferred to the checkpoint list of the committing transaction. 
See &lt;i&gt;&lt;b&gt;__journal_insert_checkpoint()&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b style="color: red;"&gt;Phase 8 start&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;---&amp;gt; We are done committing the transaction now.&lt;br /&gt;---&amp;gt; Change transaction state to &lt;b&gt;&lt;i&gt;T_FINISHED&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;---&amp;gt; Set &lt;b&gt;committing transaction = NULL.&lt;/b&gt;&lt;br /&gt;---&amp;gt; Calculate average commit time for future use.&lt;br /&gt;---&amp;gt; Set up the checkpointing transaction.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-3472636566725290142?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/3472636566725290142/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=3472636566725290142' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/3472636566725290142'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/3472636566725290142'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2011/07/committing-journal-transaction-in-jbd.html' title='Committing a journal transaction in jbd'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-6359582257615980438</id><published>2011-07-02T01:58:00.000-07:00</published><updated>2011-07-02T01:58:41.882-07:00</updated><title type='text'>Journalling layer in ext3 
(jbd)</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;u&gt;&lt;b&gt;Terminology&lt;/b&gt;&lt;/u&gt;&lt;br /&gt;&lt;b&gt;Journal handle&lt;/b&gt; - A handle represents a single atomic filesystem operation. It tracks all the modifications done as part of that one atomic operation.&lt;br /&gt;&lt;b&gt;Transaction &lt;/b&gt;- A single atomic sequence of events which guarantees filesystem consistency. It can consist of a single handle or multiple handles for batching efficiency.&lt;br /&gt;&lt;b&gt;Transaction commit&lt;/b&gt; - Flushing the in-memory contents of the journal to the appropriate blocks in the journal, along with writing a commit record to the journal on disk.&lt;br /&gt;&lt;b&gt;Transaction checkpoint &lt;/b&gt;- Flushing the contents from the journal to their actual location on disk. This is done periodically to make journal space reusable.&lt;br /&gt;&lt;br /&gt;Typically, journalling a filesystem operation consists of the following three steps :-&lt;br /&gt;&lt;br /&gt;a) Starting a handle -&lt;b&gt;&lt;i&gt; journal_start()&lt;/i&gt;&lt;/b&gt;. We need to specify how many fs blocks this op can potentially modify. This is required to ensure that there is enough space in the journal to completely write the contents of this operation.&lt;br /&gt;The number of blocks required is the total number of blocks, including the data which is going to change, metadata blocks, quota blocks if any, etc. As an example see &lt;i&gt;EXT3_DATA_TRANS_BLOCKS&lt;/i&gt;. These are called buffer credits for the handle.&lt;br /&gt;&lt;br /&gt;b) After getting a handle, the next step is to associate the modified blocks with the journal handle, so that the journal knows that it has to write these blocks to the journal. This is done via the API &lt;b&gt;&lt;i&gt;journal_get_write_access(handle, bh)&lt;/i&gt;&lt;/b&gt;, which tells the journal that this buffer is going to be modified. 
A buffer which is of interest to the journalling layer has &lt;i&gt;&lt;b&gt;BH_JBD&lt;/b&gt;&lt;/i&gt; set on it and has a non-zero &lt;b&gt;&lt;i&gt;b_count&lt;/i&gt;&lt;/b&gt;. At this point a &lt;i&gt;&lt;b&gt;"journal_head"&lt;/b&gt;&lt;/i&gt; is attached to the buffer. A journal_head can be part of only one transaction.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;journal_get_write_access(handle, bh) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; /* attach a journal_head to bh and take a reference on it */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; jh = journal_add_journal_head(bh);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; do_get_write_access(handle, jh, 0);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; /* drop the reference taken above */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; journal_put_journal_head(jh);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;journal_add_journal_head(bh) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; jh = journal_alloc_journal_head();&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; set_buffer_jbd(bh);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; bh-&amp;gt;b_private = jh;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; jh-&amp;gt;b_bh = bh;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; jh-&amp;gt;b_jcount++;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return jh;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;PS: A buffer is already part of a transaction if its journal_head's &lt;i&gt;&lt;b&gt;b_transaction&lt;/b&gt;&lt;/i&gt; or &lt;b&gt;&lt;i&gt;b_next_transaction&lt;/i&gt;&lt;/b&gt; is set. Most of the time, only b_transaction is set. b_next_transaction will be set in case the buffer is being committed as part of the previous transaction and we are changing it for the current transaction. b_next_transaction tells the journal that this buffer is going into the next transaction. 
In this case a copy on write is performed and the frozen copy is stored in jh-&amp;gt;b_frozen_data.&lt;br /&gt;&lt;b&gt;NB : &lt;/b&gt;A buffer's b_transaction will only be set if it's part of the running or committing transaction, and not if it resides on some other list such as the checkpoint list.&lt;br /&gt;&lt;br /&gt;c) Stop the handle - &lt;b&gt;&lt;i&gt;journal_stop()&lt;/i&gt;&lt;/b&gt; : As the name suggests, journal stop marks the completion of an op with respect to the journal. It returns any leftover unused buffer credits to the transaction, drops the appropriate references and frees the handle pointer.&lt;br /&gt;If the filesystem requested this op in sync mode, we also need to start committing the transaction to the journal on completion of the handle. However, in the current code there are some optimizations built around this to figure out whether it is beneficial to start writing to disk immediately or, based on the op rate, to wait for some time and let another op do it.&lt;br /&gt;&lt;br /&gt;See the following code as an example.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;int journal_stop(handle_t *handle)&lt;br /&gt;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ...............&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (trans_time &amp;lt; commit_time) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; ktime_t expires = ktime_add_ns(ktime_get(),&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; commit_time);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; set_current_state(TASK_UNINTERRUPTIBLE);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; schedule_hrtimeout(&amp;amp;expires, HRTIMER_MODE_ABS);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ...............&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;Each journal_start/stop pair, i.e. each handle, consists 
of one atomic filesystem operation. Some fs operations may be atomic in themselves but still not sufficient to leave the filesystem in a consistent state. An example of such an op is a write which requires a quota update. Nested journal handles are required to make such an op atomic.&lt;br /&gt;&lt;br /&gt;A typical sequence would be&lt;br /&gt;a) Start journal handle for write&lt;br /&gt;b) Start journal handle for quota update&lt;br /&gt;c) Stop journal handle for quota update&lt;br /&gt;d) Stop journal handle for write.&lt;br /&gt;&lt;br /&gt;It's only after step (d) that the op can be committed to disk.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;NB :&lt;/b&gt; A buffer is a &lt;b&gt;"journalled"&lt;/b&gt; buffer only if it has a journal head attached to it.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;A journal transaction consists of various lists where buffers of interest can reside. A buffer ends up on one of the lists depending on what flag/state it has. Below is the buffer state to list mapping. 
See the function &lt;b&gt;__journal_file_buffer()&lt;/b&gt; to see how buffers are moved across lists.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;List type =&amp;gt; buffer state flag&lt;/b&gt;&lt;br /&gt;transaction-&amp;gt;t_sync_datalist =&amp;gt;&amp;nbsp; BJ_SyncData&lt;br /&gt;transaction-&amp;gt;t_buffers =&amp;gt; BJ_Metadata&lt;br /&gt;transaction-&amp;gt;t_forget =&amp;gt; BJ_Forget&lt;br /&gt;transaction-&amp;gt;t_iobuf_list =&amp;gt; BJ_IO&lt;br /&gt;transaction-&amp;gt;t_shadow_list =&amp;gt; BJ_Shadow&lt;br /&gt;transaction-&amp;gt;t_log_list =&amp;gt; BJ_LogCtl&lt;br /&gt;transaction-&amp;gt;t_reserved_list =&amp;gt; BJ_Reserved&lt;br /&gt;transaction-&amp;gt;t_locked_list =&amp;gt; BJ_Locked&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-6359582257615980438?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/6359582257615980438/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=6359582257615980438' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6359582257615980438'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6359582257615980438'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2011/07/journalling-layer-in-ext3-jbd.html' title='Journalling layer in ext3 (jbd)'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-8429698626727092095</id><published>2011-07-02T01:41:00.000-07:00</published><updated>2011-07-02T01:42:40.753-07:00</updated><title type='text'>checkpointing transactions in journal (jbd)</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Journal checkpointing : &lt;b&gt;jbd/checkpoint.c&lt;/b&gt;&lt;br /&gt;----------------------&lt;br /&gt;The main functions involved in doing journal checkpointing are :-&lt;br /&gt;&lt;i&gt;&lt;b&gt;a) log_do_checkpoint&lt;br /&gt;b) __process_buffer&lt;br /&gt;c) __flush_batch&lt;br /&gt;d) __wait_cp_io&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;log_do_checkpoint&lt;/i&gt;&lt;/b&gt; picks up the first transaction on the checkpoint list and then iterates over all the buffers present in the transaction by calling &lt;b&gt;&lt;i&gt;__process_buffer &lt;/i&gt;&lt;/b&gt;on each of them. As it traverses, it keeps accumulating them in a local array for batching of disk writes. As part of processing it also moves the buffer from checkpoint_list to checkpoint_io_list to indicate that io is pending on these buffers.&lt;br /&gt;Once the array is full or we have no more buffers to process &lt;b&gt;&lt;i&gt;__flush_batch&lt;/i&gt;&lt;/b&gt; is called to send those buffers to disk for writing.&lt;br /&gt;&lt;br /&gt;After the buffers are submitted to disk,&lt;b&gt;&lt;i&gt; __wait_cp_io() &lt;/i&gt;&lt;/b&gt;is called to wait on each of the buffers for write to complete. After they get cleaned they are removed from the checkpoint_io_list. 
After all the buffers are freed, the transaction itself is freed.&lt;br /&gt;&lt;br /&gt;* Helper function to clear all the clean buffers from the checkpoint list.&lt;br /&gt;&lt;i&gt;&lt;b&gt;__journal_clean_checkpoint_list&lt;/b&gt;&lt;/i&gt; : Traverses the transactions in the checkpoint transactions list (&lt;b&gt;&lt;i&gt;j_checkpoint_transactions&lt;/i&gt;&lt;/b&gt;) and frees memory by walking each transaction's checkpoint list (&lt;i&gt;&lt;b&gt;t_checkpoint_list&lt;/b&gt;&lt;/i&gt;), one transaction at a time.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-8429698626727092095?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/8429698626727092095/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=8429698626727092095' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/8429698626727092095'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/8429698626727092095'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2011/07/checkpointing-transactions-in-journal.html' title='checkpointing transactions in journal (jbd)'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-4301978854723807370</id><published>2011-07-01T20:18:00.000-07:00</published><updated>2011-07-01T20:18:48.829-07:00</updated><title type='text'>Journal (jbd) revoke mechanism</title><content type='html'>&lt;div dir="ltr" style="text-align: 
left;" trbidi="on"&gt;Journal revoke :-&lt;b&gt; jbd/revoke.c&lt;/b&gt;&lt;br /&gt;------------------&lt;br /&gt;&lt;br /&gt;Revoke is a mechanism that keeps journal replay from corrupting the filesystem: it stops the old journalled contents of a freed block from being replayed over the newer contents of that block. For example, consider the following sequence of steps when the filesystem is mounted in metadata-only journalling mode.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;b&gt;a) A metadata block 'B' is journalled and its contents are copied to the journal.&lt;br /&gt;b) Later 'B' gets freed.&lt;br /&gt;c) 'B' is now used to write user data; this is not journalled.&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Now if we crash and replay, we need to avoid replaying the contents of block 'B' in the journal over the user data.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;u&gt;Revoke mechanism:-&lt;/u&gt;&lt;/b&gt; During the commit of a transaction, all the blocks which were revoked are recorded in the journal. These revoke records are used during journal recovery: the journal is scanned for revoked blocks before any op is replayed. If there are transactions for the block after the last revoke record of that block, these ops are safe to replay. Any transactions which appear before the revoke record aren't replayed. The basic idea is that you don't want to replay ops corresponding to a block which may have been freed. Also note that if there are multiple revoke records corresponding to a block in a journal, we only need to worry about the latest record, i.e. the one with the highest transaction id.&lt;br /&gt;&lt;br /&gt;From file &lt;i&gt;jbd/revoke.c&lt;/i&gt;. 
&lt;br /&gt;&lt;pre&gt;&amp;nbsp;* We can get interactions between revokes and new log data within a&lt;br /&gt;&amp;nbsp;* single transaction:&lt;br /&gt;&amp;nbsp;*&lt;br /&gt;&amp;nbsp;* Block is revoked and then journaled:&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; The desired end result is the journaling of the new block, so we&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; cancel the revoke before the transaction commits.&lt;br /&gt;&amp;nbsp;*&lt;br /&gt;&amp;nbsp;* Block is journaled and then revoked:&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; The revoke must take precedence over the write of the block, so we&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; need either to cancel the journal entry or to write the revoke&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; later in the log than the log block.&amp;nbsp; In this case, we choose the&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; latter: journaling a block cancels any revoke record for that block&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; in the current transaction, so any revoke for that block in the&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; transaction must have happened after the block was journaled and so&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; the revoke must take precedence.&lt;br /&gt;&amp;nbsp;*&lt;br /&gt;&amp;nbsp;* Block is revoked and then written as data:&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; The data write is allowed to succeed, but the revoke is _not_&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; cancelled.&amp;nbsp; We still need to prevent old log records from&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; overwriting the new data.&amp;nbsp; We don't even need to clear the revoke&lt;br /&gt;&amp;nbsp;*&amp;nbsp;&amp;nbsp; bit here.&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&amp;nbsp;There are two hash tables to store the revoked entries. These two tables are required one for the running transaction and one for the committing transaction (if any). 
As you can guess, new entries are always logged into the revoke table pointed to by the current &lt;b&gt;journal-&amp;gt;j_revoke&lt;/b&gt; pointer, which points to the one corresponding to the running transaction. You can think of it as a double buffering mechanism. These tables are switched alternately during the commit from kjournald. Access to these hash table entries is protected by the&lt;b&gt; j_revoke_lock.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;u&gt;Important functions:-&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;Initialize revoke hash : &lt;i&gt;journal_init_revoke()&lt;/i&gt;&lt;br /&gt;Insert in hash : &lt;i&gt;insert_revoke_hash()&lt;/i&gt;&lt;br /&gt;Find in hash : &lt;i&gt;find_revoke_record()&lt;/i&gt;&lt;br /&gt;Transfer the in-memory revoke table to the on-disk journal : &lt;i&gt;journal_write_revoke_records()&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;NB:&lt;/b&gt; Note that you need to revoke a block before freeing it in the bitmap, and not vice versa, to prevent races.&lt;br /&gt;&lt;br /&gt;The buffer head maintains two sets of flags to indicate the revoke status of a buffer.&lt;br /&gt;a) &lt;i&gt;&lt;b&gt;RevokeValid &lt;/b&gt;&lt;/i&gt;: The revoke status of this buffer is known and can be trusted. If this is not set we can't say much about the buffer and need to search for it in the hash.&lt;br /&gt;b) &lt;i&gt;&lt;b&gt;Revoke{set/clear}&lt;/b&gt;&lt;/i&gt; : These flags are meaningful only when the above is set. 
They tell whether the block is revoked or not.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-4301978854723807370?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/4301978854723807370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=4301978854723807370' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/4301978854723807370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/4301978854723807370'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2011/07/journal-jbd-revoke-mechanism.html' title='Journal (jbd) revoke mechanism'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-2506295452226151434</id><published>2011-06-26T23:05:00.000-07:00</published><updated>2011-07-02T01:31:03.110-07:00</updated><title type='text'>Journal recovery in jbd</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Journal recovery :-&lt;b&gt; jbd/recovery.c&lt;/b&gt;&lt;br /&gt;------------------&lt;br /&gt;&lt;br /&gt;Journal recovery is quite simple. It basically consists of the steps below.&lt;br /&gt;&lt;br /&gt;a) Read ahead journal blocks into memory.&lt;br /&gt;&lt;br /&gt;b) Do the first pass (&lt;b&gt;PASS_SCAN&lt;/b&gt;) to see if we need a recovery. 
If so, this pass also works out which transactions need to be replayed and runs sanity checks such as whether the journal is valid. After the scan pass, an in-core data structure about the journal (struct recovery_info) is populated with the information required for the recovery.&lt;br /&gt;&lt;br /&gt;c) Do the second pass (&lt;b&gt;PASS_REVOKE&lt;/b&gt;). This traverses all the revoke blocks in the journal and builds the in-core hash of block numbers which are revoked. This ensures that we don't replay ops corresponding to these blocks when we do the actual replay.&lt;br /&gt;&lt;br /&gt;d) Do the third/final pass (&lt;b&gt;PASS_REPLAY&lt;/b&gt;), which actually does the job of replaying the journal and copies the data from the journal to the real filesystem. Replaying an op simply consists of reading the corresponding block from the filesystem, copying the contents from the journal into the buffer and then marking the buffer dirty so that it is written back to its actual location in the filesystem.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;NB:&lt;/b&gt; Steps (b), (c) and (d) are done through a common function &lt;i&gt;&lt;b&gt;do_one_pass().&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;e) Once the replay is complete, throw away the in-memory revoke hash.&lt;br /&gt;&lt;br /&gt;f) Sync the block device.&lt;br /&gt;&lt;br /&gt;g) Once the recovery is done, journal_reset() is called to set up the in-memory fields of the journal, and the journal is ready for business again.&lt;br /&gt;&lt;br /&gt;journal_recover()&lt;br /&gt;|&lt;br /&gt;--&amp;gt; do_one_pass(PASS_SCAN)&lt;br /&gt;|&lt;br /&gt;--&amp;gt; do_one_pass(PASS_REVOKE)&lt;br /&gt;|&lt;br /&gt;--&amp;gt; do_one_pass(PASS_REPLAY)&lt;br /&gt;|&lt;br /&gt;--&amp;gt; journal_clear_revoke()&lt;br /&gt;|&lt;br /&gt;--&amp;gt; sync_blockdev() and journal_reset()&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-2506295452226151434?l=mkatiyar.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/2506295452226151434/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=2506295452226151434' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/2506295452226151434'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/2506295452226151434'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2011/06/journal-recovery-in-jbd.html' title='Journal recovery in jbd'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-6975834879743960808</id><published>2010-08-23T22:52:00.000-07:00</published><updated>2010-08-23T22:52:12.492-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UFS'/><category scheme='http://www.blogger.com/atom/ns#' term='ufsutils'/><category scheme='http://www.blogger.com/atom/ns#' term='fsdb'/><category scheme='http://www.blogger.com/atom/ns#' term='mcopy'/><category scheme='http://www.blogger.com/atom/ns#' term='UFS copying from userspace'/><title type='text'>mcopy support in fsdb for ufsutils</title><content type='html'>I have been looking for something like mcopy support in ufsutils so that I could just copy any files from userspace to the UFS filesystem. This is primarily important because by default kernel is not configured/compiled to have readwrite support for UFS. 
Either you need to have your own custom compiled kernel or something else so that you can mount the UFS filesystem and then do a copy.&lt;br /&gt;&lt;br /&gt;Below is a patch (quick, ugly and dirty) which I wrote for myself and which might be useful to you too. It is definitely not the best written, but it does its job. In order to keep things simple, even though I wanted to add a new command "addfile" (or something similar) to fsdb, I have removed the other commands from this fsdb and just made it work like mcopy, because I needed it to be integrated with my shell scripts.&lt;br /&gt;&lt;br /&gt;Copy the file fsdb.c into the folder ufsutils-7.0/fsdb.ufs/ and then recompile fsdb. If you have a newer version, you will want to merge the changes into its fsdb.c; just grep for "MANISH" in the file to find them. Recompile and enjoy!&lt;br /&gt;&lt;br /&gt;A typical invocation would be something like&lt;br /&gt;&lt;br /&gt;$ ./fsdb myufs_file_system local_file_to_copy destination_path_in_FS&lt;br /&gt;e.g.&lt;br /&gt;&lt;br /&gt;$ ./fsdb myfs testfile.c /boot/dummyfile.c&lt;br /&gt;&lt;br /&gt;Download &lt;a href="https://docs.google.com/leaf?id=0B7IdXzyV1V5PYWNmYzNmOTgtMWYxMy00MzI4LTg5NGMtNTM3YjA5OWFkYmFk&amp;hl=en&amp;authkey=CM__krwP"&gt;fsdb.c&lt;/a&gt; here&lt;br /&gt;&lt;br /&gt;Always verify your FS with fsck after running this :-). 
I'm not responsible if you lose your data.&lt;br /&gt;&lt;br /&gt;* I think it has a small bug while doing truncation of files which are less than size 48K.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-6975834879743960808?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/6975834879743960808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=6975834879743960808' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6975834879743960808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6975834879743960808'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2010/08/mcopy-support-in-fsdb-for-ufsutils.html' title='mcopy support in fsdb for ufsutils'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-6478920444198731217</id><published>2009-05-15T03:11:00.000-07:00</published><updated>2009-05-15T03:13:14.618-07:00</updated><title type='text'>Compatibility flags in ext filesystems</title><content type='html'>Ever wondered how do ext filesystems are compatible with each other in a way that most of the times they can be mounted seamlessly even though they have ondisk different structures, or in other words how does ext ensures that the ondisk changes are compatible across different releases ?&lt;br /&gt;This is possible due to three important fields present in ext superblock 
collectively known as compatibility bitmaps. We will discuss each of them with an example.&lt;br /&gt;&lt;br /&gt;Usually when a new feature is added to ext, a bit is assigned in the superblock to indicate whether the filesystem has that feature or not. The choice of which compatibility bitmap holds this bit is of utmost importance and depends on the nature of the enhancement.&lt;br /&gt;&lt;br /&gt;Compatible features (s_feature_compat), e.g. EXT3_FEATURE_COMPAT_HAS_JOURNAL (the filesystem has a journal): a kernel which doesn't understand such a feature can still safely mount the filesystem read-write.&lt;br /&gt;&lt;br /&gt;Read-only compatible features (s_feature_ro_compat), e.g. EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER (the filesystem has sparse superblocks, i.e. backups of the superblock are kept only in block groups 0, 1 and powers of 3, 5 and 7): a kernel which doesn't understand such a feature can still read the filesystem correctly, so it may mount it, but only in read-only mode.&lt;br /&gt;&lt;br /&gt;Incompatible features (s_feature_incompat), e.g. EXT4_FEATURE_INCOMPAT_EXTENTS: these are the features which an older kernel won't be able to interpret, so it should refuse to mount the filesystem if it sees an incompat bit it doesn't know. Extent maps are one such example: since they change the way block pointers are stored on disk, older kernels which know only the direct/indirect block format cannot read them properly and thus can't serve data when requested.&lt;br /&gt;&lt;br /&gt;In order to introduce a new feature, its compatibility class has to be decided first, and then its bit is added to the appropriate compatibility bitmap. 
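&lt;br /&gt;&lt;br /&gt;The mount-time decision described above can be sketched in a few lines of Python. This is only a toy model and not kernel code: each bitmap is modelled as a set of feature names instead of a bit mask, and the function name mount_decision(), the feature names and the "known" sets are made up for illustration.&lt;br /&gt;&lt;pre&gt;
```python
# Toy model of the ext feature check done at mount time.
# Assumption: feature bitmaps are modelled as Python sets of names,
# not as the real bit masks stored in the superblock.

def mount_decision(fs_incompat, fs_ro_compat, known_incompat, known_ro_compat):
    """Return how a kernel that knows only the 'known_*' features may mount."""
    # Any incompat feature the kernel does not know: refuse to mount.
    if fs_incompat - known_incompat:
        return "refuse"
    # Unknown ro_compat feature: reading is safe, writing is not.
    if fs_ro_compat - known_ro_compat:
        return "read-only"
    # Unknown compat features never restrict the mount, so they are
    # not even passed in here.
    return "read-write"


# Hypothetical old kernel: knows FILETYPE, but neither extents
# nor sparse superblocks.
old_incompat = {"FILETYPE"}
old_ro_compat = set()

print(mount_decision({"EXTENTS"}, set(), old_incompat, old_ro_compat))
print(mount_decision(set(), {"SPARSE_SUPER"}, old_incompat, old_ro_compat))
print(mount_decision({"FILETYPE"}, set(), old_incompat, old_ro_compat))
```
&lt;/pre&gt;The three calls print refuse, read-only and read-write. The set difference plays the role of masking the on-disk bits with the complement of the supported bits.&lt;br /&gt;&lt;br /&gt;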
See the macro EXT3_FEATURE_RO_COMPAT_SUPP and others to get an idea.&lt;br /&gt;&lt;br /&gt;There have been talks and suggestions about adding these compatibility flags on per inode basis rather than the filesystem to achieve maximum backwards compatibility, so that only fewer files are non-compatible.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-6478920444198731217?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/6478920444198731217/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=6478920444198731217' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6478920444198731217'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/6478920444198731217'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2009/05/compatibility-flags-in-ext-filesystems.html' title='Compatibility flags in ext filesystems'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-2172320425045948561</id><published>2009-03-05T22:03:00.000-08:00</published><updated>2009-03-15T22:59:20.737-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ext3 write'/><category scheme='http://www.blogger.com/atom/ns#' term='journalling'/><category scheme='http://www.blogger.com/atom/ns#' term='vfs write'/><category scheme='http://www.blogger.com/atom/ns#' term='ext2'/><category 
scheme='http://www.blogger.com/atom/ns#' term='ext3'/><title type='text'>ext3 write call stack</title><content type='html'>&lt;a href="http://nngfs.pbwiki.com/f/ext3.png"&gt;&lt;img src="http://nngfs.pbwiki.com/f/ext3.png" height=300 width=500/&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-2172320425045948561?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/2172320425045948561/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=2172320425045948561' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/2172320425045948561'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/2172320425045948561'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2009/03/ext3-write-call-stack.html' title='ext3 write call stack'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-7124386988538478520</id><published>2008-09-18T23:13:00.001-07:00</published><updated>2008-09-20T05:53:59.332-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hexadecimal conversion'/><category scheme='http://www.blogger.com/atom/ns#' term='negative integers'/><title type='text'>Easy way to convert negative integers to hexadecimal</title><content type='html'>If you came here looking for some magic......sorry .... I had some issues representing negative integers in hex. 
Every time I had to convert them to binary, take the 2's complement and then convert back to hex :-( so I was looking for a way to do the job quickly. I got this trick from the net and found it pretty easy. You just need to keep the following mapping in mind (or on paper :-) though it is pretty simple).&lt;br /&gt;&lt;br /&gt;0 1 2 3 4 5 6 7&lt;br /&gt;F E D C B A 9 8&lt;br /&gt;&lt;br /&gt;That's all. The above table is just the hex digits 0 to F written in clockwise direction, or more technically each digit in the upper row is the bitwise complement of the one below it (the two always add up to F).&lt;br /&gt;&lt;br /&gt;Now to convert any number, say -7218, follow the steps below :-&lt;br /&gt;&lt;br /&gt;a) Convert 7218 to hex = 0x1C32&lt;br /&gt;b) Replace each digit with its partner from the mapping = 0xE3CD&lt;br /&gt;c) Add 1 to the result, i.e. move the last digit one step clockwise in the table, so D becomes E = 0xE3CE. As a 32-bit value this is -7218 = 0xFFFFE3CE (don't forget to fill it with 'F's on the left; without them, 0xE3CE reads as the positive number 65536-7218 = 58318)&lt;br /&gt;&lt;br /&gt;NOTE : If the last digit is 'F', one more step takes it around to '0', and you then move the next digit to its left one step as well. 
(Much like carry over).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-7124386988538478520?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/7124386988538478520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=7124386988538478520' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/7124386988538478520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/7124386988538478520'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2008/09/easy-w.html' title='Easy way to convert negative integers to hexadecimal'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-4277075091408422712</id><published>2008-09-12T23:03:00.000-07:00</published><updated>2008-09-15T01:25:00.800-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ext filesystem resizing'/><category scheme='http://www.blogger.com/atom/ns#' term='online resizing'/><category scheme='http://www.blogger.com/atom/ns#' term='ext2'/><category scheme='http://www.blogger.com/atom/ns#' term='ext3'/><title type='text'>ext filesystem online resizing.</title><content type='html'>Below notes are based on my understanding of code and most of the stuff is copied directly from the paper presented by &lt;span style="font-style:italic;"&gt;Andreas Dilger&lt;/span&gt; in OLS-2002 "&lt;span 
style="font-weight:bold;"&gt;&lt;span style="font-style:italic;"&gt;Online ext2 and ext3 Filesystem Resizing&lt;/span&gt;&lt;/span&gt;". So if you find any errors below they are due to me, and you may want to read the &lt;a href="http://edgyu.excess.org/ols/2002/Andreas%20E%20Dilger%20-%20Online%20Resizing%20with%20ext2%20and%20ext3.pdf"&gt;full paper&lt;/a&gt; for a proper, detailed explanation.&lt;br /&gt;&lt;br /&gt;Primary operations involved in resizing :-&lt;br /&gt;a) Increase the total number of blocks in the primary and backup superblocks.&lt;br /&gt;b) Increase the count of free blocks in the primary superblock and in the group descriptor of the affected group.&lt;br /&gt;c) Increase the number of reserved filesystem blocks.&lt;br /&gt;d) Remount with the "&lt;span style="font-weight:bold;"&gt;-o remount,resize=&amp;lt;newsize&amp;gt;&lt;/span&gt;" option.&lt;br /&gt;&lt;br /&gt;NOTE: You cannot shrink the filesystem online. Shrinking is in general discouraged by filesystems and most of them don't support it, because it might lead to stale/invalid NFS file handles if a client has a file open while you shrink the filesystem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Case 1: Adding blocks to the end of the last partially filled block group.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In order to increase the free blocks count and update the block bitmaps, a fake inode is created which spans the newly added blocks at the end of the last block group. Since these blocks are already owned by the fake inode, there is no risk of anyone else allocating them. To make the blocks available to the filesystem, all that needs to be done is to release this inode's blocks via &lt;span style="font-style:italic;"&gt;ext2_free_blocks()&lt;/span&gt;. While the blocks are being freed, the appropriate counts and block bitmaps get updated. 
This inode exists only in memory and carries just the fields required by &lt;span style="font-style:italic;"&gt;ext2_free_blocks()&lt;/span&gt;.&lt;br /&gt;Next, the backup superblocks have to be updated. Since the backup superblocks are not accessed by the kernel, they can be modified directly from userspace. If the resize is done via mount, the backup superblocks are not updated at that point, but they will be set to the proper values the next time e2fsck is run.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Case 2: Adding new block groups at the end of a full block group.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Main steps required to handle this case are :-&lt;br /&gt;a) Create new block bitmaps and inode bitmaps for each added block group and update them to reflect the blocks and inodes available within the group.&lt;br /&gt;b) Add the new entry to the group descriptor table. Increase the number of groups in the filesystem so that the inode/block allocation routines know that there are new groups available with free resources.&lt;br /&gt;&lt;br /&gt;In order to avoid doing all this from the kernel, the new groups are set up from userspace before the kernel becomes aware of them. As long as the kernel is unaware of the new groups it cannot allocate anything from them, and since they are added beyond the current end of the filesystem, it is safe to do these operations from userspace.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Case 3: Adding a new block group in a new group descriptor block.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This situation is a bit more complex, because the backup superblocks and group descriptors are situated at the start of a block group (so as to be easily identifiable by e2fsck in case of corruption), followed by the inode and block bitmaps. Adding a new group descriptor block would mean overwriting these bitmaps, and thus they need to be moved to accommodate the new block. 
This relocation is not a serious problem since the gdt stores the location of the bitmaps. So all we need to do is move the blocks and then record their new location in the gdt. But this movement of blocks cannot be done while the filesystem is mounted, to avoid any mess within the kernel. &lt;br /&gt;Such cases need an unmount of the filesystem (commonly known as filesystem preparation), and they have the advantage that the changes made are compatible with older kernels. However, there is an alternative which doesn't require taking the filesystem offline, if the user is ready to break this compatibility.&lt;br /&gt;&lt;br /&gt;Andreas nicely explains this case in his paper, and I would suggest that those who are interested read section 3.4.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The above strategy is for ext2; resizing ext3 the same way poses certain problems :-&lt;br /&gt;&lt;br /&gt;a) Writing directly to the device is unsafe, since there is a journalling layer whose job is to make sure that everything is consistent. If you bypass it, you might cause on-disk corruption in the filesystem.&lt;br /&gt;b) Unlike ext2, ext3 may hold copies of data in the journal, so what you read from userspace may differ from what is actually there. (See paper)&lt;br /&gt;c) Since ext3 resizing is done through the journal, the resize may have made it to the journal just before the system crashed. In such a case the journal will be replayed during reboot; however, since the superblock is read before the replay, the new size will not be picked up. 
This case is detected by comparing filesystem size before and after journal replay and then update the appropriate datastructures.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-4277075091408422712?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/4277075091408422712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=4277075091408422712' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/4277075091408422712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/4277075091408422712'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2008/09/ext-filesystem-online-resizing.html' title='ext filesystem online resizing.'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3103156096521938896.post-7567374872637581683</id><published>2008-09-07T23:18:00.001-07:00</published><updated>2008-09-13T11:18:22.370-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interacting with kernel'/><category scheme='http://www.blogger.com/atom/ns#' term='ioctl'/><title type='text'>ioctls - An easy interface to talk to kernel</title><content type='html'>There are many *good* resources available on net if you want to know about ioctls. One of them is "man ioctl"&lt;br /&gt;If you want to know about the list of ioctls, one quick way is "man ioctl_list". 
What follows below is a layman's explanation and may not be liked by "strict" technical persons.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;a) What are ioctls ?&lt;/span&gt;&lt;br /&gt;ioctls are like a Swiss knife: one interface to do a lot of things, and a simple way for userspace to talk to the kernel.&lt;br /&gt;Among the available options (syscalls, procfs, sysfs, debugfs, etc.), ioctls are probably the easiest one if you want to talk to kernel space as well as get your code into the mainline kernel. One of the reasons is that they are easy to implement and, due to the lack of a single standard, everyone follows their own convention (though that is discouraged).&lt;br /&gt;&lt;br /&gt;To implement an ioctl, all you need to do is tell the kernel: "I will be sending some *codenumber* from userspace, and depending on the codenumber you have to do something (execute a function)". Much like switch-case or RPC.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;b) How to implement ioctls ?&lt;/span&gt;&lt;br /&gt;Depending on how you want to talk to the kernel, an ioctl can have an in argument, an out argument, both or none, and you need to tell the kernel which so that it can take the appropriate action. Remember that all you send to the kernel is a *codenumber*, so this information must be embedded in it along with the functionality that you desire.&lt;br /&gt;&lt;br /&gt;ioctls are declared using the standard macros defined in asm/ioctl.h. The macros are of the form _IO(type,nr) and {_IOR,_IOW,_IOWR}(type,nr,size). Note that R and W are from the user's perspective (same as read and write): _IOR means userspace reads data from the kernel, _IOW means it writes data to the kernel. The last parameter 'size' is actually the datatype that needs to be transferred; the macro takes its sizeof(). 
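&lt;br /&gt;&lt;br /&gt;How these macros pack direction, type, number and size into a single *codenumber* can be sketched in Python (used here only because it is easy to run from userspace). This is an illustration and not kernel code: the field layout is assumed to be the common asm-generic one, which some architectures change, and the helper names ioc(), iow() and decode() are mine.&lt;br /&gt;&lt;pre&gt;
```python
import struct

# Assumed asm-generic field layout (e.g. x86): nr in bits 0-7,
# type in bits 8-15, size in bits 16-29, dir in bits 30-31.
IOC_NRSHIFT = 0
IOC_TYPESHIFT = 8
IOC_SIZESHIFT = 16
IOC_DIRSHIFT = 30

IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2


def ioc(direction, type_char, nr, size):
    """Pack the four fields into one codenumber, like the _IOC() macro."""
    # Multiplying by 2**n is the same as shifting left by n bits.
    return (direction * 2**IOC_DIRSHIFT
            | size * 2**IOC_SIZESHIFT
            | ord(type_char) * 2**IOC_TYPESHIFT
            | nr * 2**IOC_NRSHIFT)


def iow(type_char, nr, size):
    """Userspace writes 'size' bytes to the kernel, like _IOW()."""
    return ioc(IOC_WRITE, type_char, nr, size)


def decode(code):
    """Split a codenumber back into (direction, type_char, nr, size)."""
    direction = code // 2**IOC_DIRSHIFT % 4
    size = code // 2**IOC_SIZESHIFT % 2**14
    type_char = chr(code // 2**IOC_TYPESHIFT % 256)
    nr = code // 2**IOC_NRSHIFT % 256
    return direction, type_char, nr, size


# struct.calcsize("l") is the native sizeof(long) of this machine.
code = iow("f", 7, struct.calcsize("l"))
print(hex(code))
print(decode(code))
```
&lt;/pre&gt;On a machine with an 8-byte long, iow("f", 7, 8) gives 0x40086607: 0x40000000 for the write direction, 0x00080000 for the size, 0x6600 for the type 'f' and 0x07 for the number.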
&lt;br /&gt;&lt;br /&gt;An example ioctl which changes the version of a file in the kernel is :-&lt;br /&gt;#define EXMPL_IOC_CHANGE_VERSION   _IOW('f',7,long)&lt;br /&gt;&lt;br /&gt;This gets expanded into a *codenumber* by the macros below :-&lt;br /&gt;#define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))&lt;br /&gt;#define _IOC(dir,type,nr,size) \&lt;br /&gt; (((dir)  &lt;&lt; _IOC_DIRSHIFT)  | /* DIRSHIFT == 30 */ \&lt;br /&gt;  ((type) &lt;&lt; _IOC_TYPESHIFT) | /* TYPESHIFT == 8 */ \&lt;br /&gt;  ((nr)   &lt;&lt; _IOC_NRSHIFT)   | /* NRSHIFT == 0 */ \&lt;br /&gt;  ((size) &lt;&lt; _IOC_SIZESHIFT))  /* SIZESHIFT == 16 */&lt;br /&gt;# define _IOC_WRITE 1U&lt;br /&gt;&lt;br /&gt;Based on the arguments passed, this creates a unique *codenumber* which your module needs to understand. _IOC_TYPECHECK is just a macro which lets the compiler catch invalid uses of the size argument. The 'type' and 'nr' values should be chosen such that the resulting ioctl doesn't conflict with others (that is one reason why people hate ioctls). Some devices use their major number as 'type'.&lt;br /&gt;&lt;br /&gt;Once I have defined my ioctl, I need to make the kernel understand it. 
This is done by installing your ioctl handler in the kernel as below :-&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;struct file_operations my_file_operations = {&lt;br /&gt;              .read = my_read_func,&lt;br /&gt;              .write = my_write_func,&lt;br /&gt;              ........&lt;br /&gt;              /* a (filp, cmd, arg) handler returning long goes in .unlocked_ioctl */&lt;br /&gt;              .unlocked_ioctl = my_ioctl_handler,&lt;br /&gt;              ........&lt;br /&gt;              ........&lt;br /&gt;};&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;and then define my_ioctl_handler() which performs the actual action.&lt;pre&gt;&lt;br /&gt;long my_ioctl_handler(struct file *filp, unsigned int cmd, unsigned long arg) {&lt;br /&gt;       struct inode *inode = filp-&gt;f_dentry-&gt;d_inode;&lt;br /&gt;       long version;&lt;br /&gt;       switch(cmd) {&lt;br /&gt;        ...........&lt;br /&gt;       case EXMPL_IOC_CHANGE_VERSION : &lt;br /&gt;                        /* arg is a userspace pointer to the new version */&lt;br /&gt;                        if (get_user(version, (long __user *) arg))&lt;br /&gt;                                return -EFAULT;&lt;br /&gt;                        inode-&gt;i_version = version;&lt;br /&gt;                        return 0;&lt;br /&gt;       ............&lt;br /&gt;       }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;And then in userspace, you need to invoke your ioctl as :-&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;#include &amp;lt;fcntl.h&amp;gt;&lt;br /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;#include &amp;lt;sys/ioctl.h&amp;gt;&lt;br /&gt;&lt;br /&gt;int main(void) {&lt;br /&gt;  int fd;&lt;br /&gt;  long version = 123;&lt;br /&gt;  fd = open("myfile",O_RDWR);&lt;br /&gt;  if(fd&lt;0)&lt;br /&gt;    exit(1);&lt;br /&gt;  /* _IOW: pass a pointer to the data which the kernel should read */&lt;br /&gt;  if(ioctl(fd,EXMPL_IOC_CHANGE_VERSION,&amp;amp;version))&lt;br /&gt;      fprintf(stderr,"Some error in ioctl ...\n");&lt;br /&gt;  ........&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;As you can see from the example above, ioctls are a very easy interface for asking the kernel to do something, and they are widely used in device drivers. 
Read Documentation/ioctl-numbers.txt for more on them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3103156096521938896-7567374872637581683?l=mkatiyar.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mkatiyar.blogspot.com/feeds/7567374872637581683/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3103156096521938896&amp;postID=7567374872637581683' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/7567374872637581683'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3103156096521938896/posts/default/7567374872637581683'/><link rel='alternate' type='text/html' href='http://mkatiyar.blogspot.com/2008/09/ioctls-easy-interface-to-talk-to-kernel.html' title='ioctls - An easy interface to talk to kernel'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
