The ReiserFS filesystem

This document describes the ReiserFS filesystem structures on disk. It was created while writing a ReiserFS reader for Windows NT. I was unsatisfied with the documentation available over at the official URL, www.namesys.com, so I wrote my own. But, since this is my first contact with ReiserFS, it does contain errors, so if in doubt, please consult the original spec first.

This document is copyrighted by Gerson Kurz and licensed by the GPL.

What you need to know before you start

It is assumed that you know the partition layout, that is:

Start of the partition in bytes
Size of the partition in bytes
That the partition is indeed a reiserfs partition

The following typedefs will be used:

__u16 - 16-bit unsigned integer
__u32 - 32-bit unsigned integer
__u64 - 64-bit unsigned integer

The Superblock

The Superblock is the root information for the ReiserFS filesystem. It lies at a fixed offset from the start of the partition at

REISERFS_DISK_OFFSET_IN_BYTES = (64*1024) = 65536

This superblock is defined in reiserfs_fs_sb.h as the following structure:

struct reiserfs_super_block
{
  __u32 s_block_count;
  __u32 s_free_blocks;                  /* free blocks count    */
  __u32 s_root_block;                   /* root block number    */
  __u32 s_journal_block;                /* journal block number    */
  __u32 s_journal_dev;                  /* journal device number  */

  /* Since journal size is currently a #define in a header file, if 
  ** someone creates a disk with a 16MB journal and moves it to a 
  ** system with 32MB journal default, they will overflow their journal 
  ** when they mount the disk.  s_orig_journal_size, plus some checks
  ** while mounting (inside journal_init) prevent that from happening
  */

/* great comment Chris. Thanks.  -Hans */

  __u32 s_orig_journal_size;
  __u32 s_journal_trans_max ;           /* max number of blocks in a transaction.  */
  __u32 s_journal_block_count ;         /* total size of the journal. can change over time  */
  __u32 s_journal_max_batch ;           /* max number of blocks to batch into a trans */
  __u32 s_journal_max_commit_age ;      /* in seconds, how old can an async commit be */
  __u32 s_journal_max_trans_age ;       /* in seconds, how old can a transaction be */
  __u16 s_blocksize;                    /* block size           */
  __u16 s_oid_maxsize;                  /* max size of object id array, see get_objectid() commentary  */
  __u16 s_oid_cursize;                  /* current size of object id array */
  __u16 s_state;                        /* valid or error       */
  char s_magic[12];                     /* reiserfs magic string indicates that file system is reiserfs */
  __u32 s_hash_function_code;           /* indicate, what hash function is being use to sort names in a directory*/
  __u16 s_tree_height;                  /* height of disk tree */
  __u16 s_bmap_nr;                      /* amount of bitmap blocks needed to address each block of file system */
  __u16 s_version;                      /* I'd prefer it if this was a string,
                                           something like "3.6.4", and maybe
                                           16 bytes long mostly unused. We
                                           don't need to save bytes in the
                                           superblock. -Hans */
  __u16 s_reserved;
  __u32 s_inode_generation;
  char s_unused[124] ;                  /* zero filled by mkreiserfs */
} __attribute__ ((__packed__));

(For those not familiar with the GCC compiler,"__attribute__ ((__packed__))" just means that it is 1-byte aligned).

The information in s_blocksize is vital. All ReiserFS data is organized in blocks, and each block has this size (in bytes). When, in the rest of this document, I speak of "blocks", I mean this kind of blocks (and not whatever your disk information says a blocksize is. Furthermore, I will assume the define REISERFS_BLOCKSIZE points to this member.

Note: s_magic should contain either "ReIsErFs" or "ReIsEr2Fs".

Used/Free Blocks Bitmap

The Filesystem needs to know which blocks are used or free. Free blocks usually are unformatted, that is: the block header structure described below is missing or invalid. For this purpose, ReiserFS uses a bitmap. It consists of a series of blocks, where each byte represents 8 free/used bits. For example, the following bytes:

FF FF FF C7

where 0xC7 is 11000111 dual, means that the first 8+8+8+3 = blocks are used ("1"), then three free blocks ("0"), then two more used blocks ("1"). Note that the bit interpretation may vary with the endianess of your system. On a linux system, you can use the command

debugreiserfs -b <name of reiserfs partition>

to find out how reiserfs interprets the bitmap.

How many bitmap blocks are there?

You can find out how many bitmap blocks by looking at the superblock structure

reiserfs_super_block.s_bmap_nr - amount of bitmap blocks

Finding the first bitmap block

The first bitmap block is one block higher than the superblock. So, to calculate the first bitmap block, you have

REISERFS_DISK_OFFSET_IN_BYTES + REISERFS_BLOCKSIZE

for the first bitmap block offset in bytes .

Finding the next bitmap blocks

The remaining bitmap blocks other than the first are calculated using this formula:

REISERFS_BLOCKSIZE * REISERFS_BLOCKSIZE * REISERFS_BLOCKSIZE * 8 * n

Where n is >= 1. Don't ask me why. So, for example, if the blocksize is 4096, the second bitmap block is located at 4096*4096*8*1 = 134217728 offset in bytes, which would be block 32768.

The b-tree

Note: This section can just be an informal introduction. Try searching the web for "btree" for more technical specifications.

A btree is a linked structure that looks like a regular tree. There is a parent block, and it has child blocks, and they in turn have child blocks and so on.

What is distinguishing from "normal" trees is that each block in the tree is identified by a key. A key can be anything that is unique and can be compared with other keys; in our examples it will be just one integer, in ReiserFS it is a series of four integers. What is important is that you have a defined sort ordering for the keys - no two keys can be the same, and any two keys can be compared and sorted.

Here is an example. Lets imagine that the keys are decimal numbers, and they are compared by typical integer comparison. The root block has the following keys

100

200

This means, it needs three child blocks, one for keys in the decimal range 0..99, one for 100..199 and one for 200..max. Lets look at an imaginary set of child blocks:

105

130

170

175

180

205

209

280

Here, the first three elements are children of the block 100, the next five ones of the block 200, and the last three are for anything above 200. You should note that it is not necessary that all keys are used - in the example only three keys are used for the range 0..99. This type of ordering goes on for more levels (I think the default for reiserfs is four tree levels).

Leaf nodes are nodes that do not have any childs. In ReiserFS, their data format is different from internal nodes (see section on "Leaf Nodes" below)

How to find an item by key

Again, no algorithm here, just an informal note. You start from the top node. You look for the section that is larger than or equal to the desired key. If you got that, you go to its child node, and so on, until you reach a leaf node.

Keys used by ReiserFS

Note: This section is a bit lacking, since I don't quite understand the concept used fully.

Well, there are two types of keys, each consisting of four elements:

( directory-id, object-id, offset, type )

Each of these is just a decimal number. The main difference is that in "normal" keys, each of the four numbers is a 32-bit integer; but there are some keys where "offset" is encoded as a 60-bit number, and type as a 4-bit number. Since I have not exactly found out what "offset" means, I cannot tell you why this difference exists.

The key structure is defined like this:

struct offset_v1 {
    __u32 k_offset;
    __u32 k_uniqueness;
};

struct offset_v2 {
    __u64 k_offset: 60;
    __u64 k_type: 4;
};

struct key {
    __u32 k_dir_id;    /* packing locality: by default parent directory object id */
    __u32 k_objectid;  /* object identifier */
    union {
	struct offset_v1 k_offset_v1;
	struct offset_v2 k_offset_v2;                                       
    } u;
} ;

So, when interpreted as version 1, you have four 32-bit integers:

( k_dir_id, k_object_id, u.k_offset_v1.k_offset, u.k_offset_v1.k_uniqueness )

When interpreted as version 2, you have four integers like this:

( k_dir_id, k_object_id, u.k_offset_v2.k_offset, u.k_offset_v2.k_type )

It is best practice to expand this key to a "cpu key" in memory, that looks something like this (my definition):

typedef struct reiserfs_cpu_key
{
    __u32 k_dir_id;
    __u32 k_objectid;
    __u64 k_offset;
    __u32 k_type;
} REISERFS_CPU_KEY, *LPREISERFS_CPU_KEY;

This way it is easier to combine keys of the two types.

How to distinguish key types

Well, I have to admit, I don't know for sure. It seems like this: Normally, all keys are of type 1. However, for Item headers an ih_version of 2 specifies keys of type 2. I will explain more on this in the section on "Leaf Nodes" below, but for now just think that most keys on-disk are of type 1, and in memory should always be expanded to a cpu-key.

How to compare keys

You need to be able to compare keys to search through the btree. Well, think of the keys as an array of integers. Examine each integer sequentially, and use the normal "less-than" relation.

Here is a C function that compares two cpu keys, which makes it pretty clear:

int CompareCpuKeys(REISERFS_CPU_KEY* a, REISERFS_CPU_KEY *b)
{
    // compare 1. integer
    if( a->k_dir_id < b->k_dir_id )
        return -1;
    if( a->k_dir_id > b->k_dir_id )
        return 1;
    
    // compare 2. integer
    if( a->k_objectid < b->k_objectid )
        return -1;
    if( a->k_objectid > b->k_objectid )
        return 1;
    
    // compare 3. integer
    if( a->k_offset < b->k_offset )
        return -1;
    if( a->k_offset > b->k_offset )
        return 1;
    
    // compare 4. integer
    if( a->k_type < b->k_type )
        return -1;
    if( a->k_type > b->k_type )
        return 1;
	return 0;
}

Block handling

The whole concept of accessing disk data in ReiserFS is based on blocks. They is not necessarily any relation between a reiserfs-block and the familiar concept of "blocks/sectors/cylinders" - blocks on a reiserfs partition are the size specified in the superblock.

There are three types of blocks:

Formatted internal blocks: This is the block type used for the btree data. It contains keys and the associated disk blocks. It is described in detail in the section "Internal nodes" below
Formatted leaf blocks: This is the block type used for file information. It is described in detail in the section "Leaf nodes" below.
Unformatted blocks: An unformatted block just contains the raw file data. You cannot strictly recognize an unformatted block by looking at it, - an unformatted block is any block that is not used by the btree structure of reiserfs.

Strictly speaking, there is one more block type, the superblock, but that is a special case. (see above).

Standard block header

Each formatted block has a block header. It is defined as follows:

struct block_head {       
  __u16 blk_level;        /* Level of a block in the tree. 1: leaf; 2 and higher: internal;  */
  __u16 blk_nr_item;      /* Number of keys/items in a block. */
  __u16 blk_free_space;   /* Block free space in bytes. */
  __u16 blk_reserved;     /* dump this in v4/planA */
  struct key  blk_right_delim_key; /* used for Leaf nodes only */
};

You can see if a formatted block is a leaf node by looking at blk_level.

Internal nodes

An internal node is an element in the btree. As such, it just contains pointers to other nodes that are children of itself.

Block Header

Key 1

Key 2

...

Key n

Ptr. 1

Ptr. 2

...

Ptr. n+1

Free Space

The array of keys just contains the normal reiserfs key structures (V1). The array of pointers contains elements of the following structure:

struct disk_child {
  __u32       dc_block_number;  /* Disk child's block number. */
  __u16       dc_size;          /* Disk child's used space.   */
  __u16       dc_reserved;
};

As you can see, this is the pointer to the child node. Note that there is one more disk_child than keys, as is familiar from the short introduction on btrees above.

The block header contains a blk_level with a higher value than 1. The blk_nr_item element of the block header specifies how big the array of keys is.

Here is a code excerpt that enumerates all keys and their associated "disk-child" objects:

    // assume bMemory is the block data
    
    // get a pointer to the array of keys
    LPBYTE lpbHeaderData = bMemory+sizeof(REISERFS_BLOCK_HEAD);
    
    // get a pointer to the array of disk childs
    LPBYTE lpbPointerData = bMemory+sizeof(REISERFS_BLOCK_HEAD)+(pH->blk_nr_item*sizeof(REISERFS_KEY));

    // enumerate array
    for( int i = 0; i < pH->blk_nr_item; i++ )
    {
        REISERFS_KEY* key = (REISERFS_KEY*)(lpbHeaderData+i*sizeof(REISERFS_KEY));
        REISERFS_DISK_KEY* pointer = (REISERFS_DISK_KEY*) (lpbPointerData+i*sizeof(REISERFS_DISK_KEY));
        
        // TODO: add evaluation

    }
    
    // this is the last pointer (note: no key!)
    REISERFS_DISK_KEY* pointer = (REISERFS_DISK_KEY*) (lpbPointerData+i*sizeof(REISERFS_DISK_KEY));
    
    TODO: add evaluation for last pointer

Leaf nodes

A leaf node has the following layout:

Block Header

Header 1

Header 2

...

Header n

Free Space

Data 1

Data 2

...

Data n

The block header has a blk_level of 1. The number of items is given in the blk_nr_item field of the header.

Each item header uses the following structure:

struct item_head
{
  struct key ih_key; 	/* Everything in the tree is found by searching for it based on its key.*/
  union {
    __u16 ih_free_space_reserved; /* The free space in the last unformatted node of an indirect item if this
				     is an indirect item.  This equals 0xFFFF iff this is a direct item or
				     stat data item. Note that the key, not this field, is used to determine
				     the item type, and thus which field this union contains. */
    __u16 ih_entry_count; /* Iff this is a directory item, this field equals the number of directory
				      entries in the directory item. */
  } u;
  __u16 ih_item_len;           /* total size of the item body */
  __u16 ih_item_location;      /* an offset to the item body within the block */

  __u16 ih_version;	       /* 0 for all old items, 2 for new
                                  ones. Highest bit is set by fsck
                                  temporary, cleaned after all done */
} ;

The ih_item_location and ih_item_len members specify where the data for this item is (in the current block). When, in the sections below, the item types are described in detail, it is expected that you analyze the data located with these two members.

Here is a code excerpt that enumerates all items in a leaf node.

// assume: bMemory is the data of the current block
// assume: pH is a pointer to the block head.

// find start of item header array
LPBYTE lpbHeaderData = bMemory+sizeof(REISERFS_BLOCK_HEAD);

for( int i = 0; i < pH->blk_nr_item; i++ )
{
    // this is the item header
    LPREISERFS_ITEM_HEAD iH = (LPREISERFS_ITEM_HEAD)lpbHeaderData;

    // this is the item data
    LPBYTE lpbItemData = bMemory + iH->ih_item_location;
    DWORD dwItemSize = iH->ih_item_len
    
    // TODO: add implementation
        
    // skip to next item
    lpbHeaderData += sizeof(REISERFS_ITEM_HEAD);
}

There are four types of items in leaf nodes:

Directory items: for directory file names and keys
Stat items: for file information
Direct items: content of small files that fits in a block
Indirect items: array of block numbers for the content of larger files

They will be described in more detail below. The type of each item can be determined by looking at the ih_key.k_type value. Note that, as of my knowledge, this is the only part in ReiserFS where V2 type keys can come up. Use the following code excerpt to distinguish between v1 and v2 keys:

// assume: iH is the pointer to the item header

#define ITEM_VERSION_1 0
#define ITEM_VERSION_2 1

REISERFS_KEY* key = &(iH->ih_key);
REISERFS_CPU_KEY cpukey;
cpukey.k_dir_id = key->k_dir_id;
cpukey.k_objectid = key->k_objectid;
if( iH->ih_version == ITEM_VERSION_1 )
{
    cpukey.k_type = iH->ih_key.u.k_offset_v1.k_uniqueness;
    cpukey.k_offset = iH->ih_key.u.k_offset_v1.k_offset;
}
else if ( iH->ih_version == ITEM_VERSION_2 )
{
    cpukey.k_type = (int) iH->ih_key.u.k_offset_v2.k_type;
    cpukey.k_offset = iH->ih_key.u.k_offset_v2.k_offset;
}
else assert(false);

This code generates a proper cpu-key from the on-disk structure.

Directory items

A directory is an item in a leaf node, where k_type is 500. It represents entries in a directory, so to speak, a directory listing. (But it doesn't include file stats, you must read them separatly). Its data looks like this:

Dir Entry 1

Dir Entry 2

...

Dir Entry N

Filename N

...

Filename 2

Filename 1

Each directory entry uses the following structure:

struct reiserfs_de_head
{
  __u32 deh_offset;     /* third component of the directory entry key */
  __u32 deh_dir_id;     /* objectid of the parent directory of the object, that is referenced by directory entry */
  __u32 deh_objectid;   /* objectid of the object, that is referenced by directory entry */
  __u16 deh_location;   /* offset of name in the whole item */
  __u16 deh_state;      /* whether 1) entry contains stat data (for future), and 2) whether
                           entry is hidden (unlinked) */
} __attribute__ ((__packed__));

The filenames are not zero-terminated, you must calculate the filename size yourself. The first filename begins at the offset deh_location of the first reiserfs_de_head, and ends at the end of the item data. The second filename begins at the offset deh_location of its de_head, and ends at the deh_location of the first de_head, and so on. To find out how many reiserfs_de_head entries there are you have to enumerate all headers, until the start of the header is equal to the last known deh_location.

If that sounds confusing at first, here is a sample code that takes a directory item on the disk and dumps its filenames:

// given: 
// SIZE_OF_BLOCK size of the item on disk.
// DATA_OF_BLOCK data of the directory item on disk. You should allocate one 
// byte more and make sure the buffer is zero-terminated for the code below to work.

int dh_offset = 0; // offset in the file
int dh_strpos = SIZE_OF_BLOCK; // size 

while( dh_offset < dh_strpos )
{
    // get a pointer to the current item
    REISERFS_DIRECTORY_HEAD* pDH = (REISERFS_DIRECTORY_HEAD*) (DATA_OF_BLOCK+dh_offset);

    // the filename starts at the deh_location and is zeroterminated automatically
    printf("name='%s'\n",bMemory+pDH->deh_location);
    
    // make the next string zero-terminated, too.
    (bMemory+pDH->deh_location)[0] = 0;
    
    // this is the max sized, used for the loop ending criteria
    dh_strpos = pDH->deh_location;
    
    // increase array offset.
    dh_offset += sizeof(REISERFS_DIRECTORY_HEAD);
}

Here is an annotated hexdump of such a directory block

0000000: 01000000 | 05000000 | efa40100 | 66000400  ............f... <-- direntry #1
0000010: 02000000 | 04000000 | 05000000 | 64000400  ............d... <-- direntry #2
0000020: 805e2900 | efa40100 | f2a40100 | 61000400  .^).........a... <-- direntry #3
0000030: 8032e44f | efa40100 | f0a40100 | 58000400  .2.O........X... <-- direntry #4
0000040: 80eeb365 | efa40100 | f1a40100 | 50000400  ...e........P... <-- direntry #5
0000050: 70726f66 | 696c6573 | 64656663 | 6f6e6669  profilesdefconfi 
0000060: 67746d70 | 2e2e2e                          gtmp...

You can see that the filenames are, indeed, not zero-terminated. When properly analyzed, this structure reads:

deh_offset	deh_dir_id	deh_objectid	deh_location	deh_state	filename
1	5	107759	102	4	'.'
2	4	5	100	4	'..'
2711168	107759	107762	97	4	'tmp'
1340355200	107759	107760	88	4	'defconfig'
1706290816	107759	107761	80	4	'profiles'

You can see that the current directory (".") has an object id 107759, which is being used as the directory id for the dependant files.

Large Directories

For large directories, the right neighbor of the leaf node continues the directory description with a leaf node, that only varies in k_offset.

For example: Say, Block 12041 is an internal node with the following pointers:

[ 0 ] --> 18291
[ 1 ] --> 19011
[ 2 ] --> 20010

Furthermore, lets imagine that Block 18291 contains the first part of the directory "/usr/include", specified by the key

( 131, 137, 1L, 500 )

Then, the block 19011 will contain a directory description with the key

( 131, 137, < some large number here >, 500 )

that is the continuation of the directory listing.

Stat items

A stat item is an item in a leaf node, where k_type is 0. It represents the stat details for both files and directories. Each stat item corresponds to exactly one file or one directory. There are two types of stat items, mainly distinguished by their size in memory.

Stat Version 1

This seems to be the old stat data, which is 32 bytes long. It is declared as the following structure:

struct stat_data_v1
{
    __u16 sd_mode;	/* file type, permissions */
    __u16 sd_nlink;	/* number of hard links */
    __u16 sd_uid;	/* owner */
    __u16 sd_gid;	/* group */
    __u32 sd_size;	/* file size */
    __u32 sd_atime;	/* time of last access */
    __u32 sd_mtime;	/* time file was last modified  */
    __u32 sd_ctime;	/* time inode (stat data) was last changed (except changes to sd_atime and sd_mtime) */
    union {
	__u32 sd_rdev;
	__u32 sd_blocks;/* number of blocks file uses */
    } u;
    __u32 sd_first_direct_byte; /* first byte of file which is stored
				   in a direct item: except that if it
				   equals 1 it is a symlink and if it
				   equals ~(__u32)0 there is no
				   direct item.  The existence of this
				   field really grates on me. Let's
				   replace it with a macro based on
				   sd_size and our tail suppression
				   policy.  Someday.  -Hans */
};

Stat Version 2

This is the one used by newer versions of the ReiserFS filesystem, the most important difference being large file support (filesize is now 64 bit). The size is 44 bytes.

struct stat_data {
    __u16 sd_mode;	/* file type, permissions */
    __u16 sd_reserved;
    __u32 sd_nlink;	/* number of hard links */
    __u64 sd_size;	/* file size */
    __u32 sd_uid;		/* owner */
    __u32 sd_gid;		/* group */
    __u32 sd_atime;	/* time of last access */
    __u32 sd_mtime;	/* time file was last modified  */
    __u32 sd_ctime;	/* time inode (stat data) was last changed (except changes to sd_atime and sd_mtime) */
    __u32 sd_blocks;
    union {
	__u32 sd_rdev;
	__u32 sd_generation;
      //__u32 sd_first_direct_byte; 
      /* first byte of file which is stored in a
				       direct item: except that if it equals 1
				       it is a symlink and if it equals
				       ~(__u32)0 there is no direct item.  The
				       existence of this field really grates
				       on me. Let's replace it with a macro
				       based on sd_size and our tail
				       suppression policy? */
  } u;
} ;

Two comments on structure members:

The sd_atime/sd_mtime/sd_ctime fields can be analyzed using the standard c-library localtime() function
The sd_mode permissions (Unix terminology) field contains the file attributes (Windows terminology).

Direct Items

A direct item is an item in a leaf node, where k_type is -1. For small files, it contains the whole binary content of the file. For large files, it can exist (but doesn't need to), and if it does, it contains the tail of the file.

Indirect Items

An indirect item is an item in a leaf node, where k_type is -2. For large files, it contains an array of ULONG values with block numbers that contain parts of the file. You can think of them as indices for file data. For example, an indirect item might look like this:

16482 16483 16484

This means that the unformatted blocks 16482,16483,16484, in this order, contain data for the file.

Note: The last block might not be fully used. You must first determine the file size to be sure you read only as much data as you need. For example, if the block size is 4096, and a file is 12232 (=4096+4096+4040) bytes long, you must only read 4040 bytes from the last of the three indices.

Note: It is possible that there exists an additional direct item, which has to be appended to the file data retrieved from indirect blocks.