Chapter 12. The Virtual Filesystem
The Unix file model recognizes five standard file types:
1. regular files, 2. directories, 3. symbolic links, 4. device files, 5. pipes
12.1. The Role of the Virtual Filesystem (VFS)
Filesystems supported by the VFS may be grouped into three main classes:
1:Disk-based filesystems
2:Network filesystems
3:Special filesystems
12.1.1. The Common File Model
Figure 12-2. Interaction between processes and VFS objects
The superblock object
The inode object
The file object
12.1.2. System Calls Handled by the VFS
Table 12-1. Some system calls handled by the VFS
System call name Description
mount( ) umount( ) umount2( ) Mount/unmount filesystems
sysfs( ) Get filesystem information
statfs( ) fstatfs( ) statfs64( ) fstatfs64( ) ustat( ) Get filesystem statistics
chroot( ) pivot_root( ) Change root directory
chdir( ) fchdir( ) getcwd( ) Manipulate current directory
mkdir( ) rmdir( ) Create and destroy directories
getdents( ) getdents64( ) readdir( ) link( ) unlink( ) rename( ) lookup_dcookie( ) Manipulate directory entries
readlink( ) symlink( ) Manipulate soft links
chown( ) fchown( ) lchown( ) chown16( ) fchown16( ) lchown16( ) Modify file owner
chmod( ) fchmod( ) utime( ) Modify file attributes
stat( ) fstat( ) lstat( ) access( ) oldstat( ) oldfstat( ) oldlstat( ) stat64( ) lstat64( ) fstat64( ) Read file status
open( ) close( ) creat( ) umask( ) Open, close, and create files
dup( ) dup2( ) fcntl( ) fcntl64( ) Manipulate file descriptors
select( ) poll( ) Wait for events on a set of file descriptors
truncate( ) ftruncate( ) truncate64( ) ftruncate64( ) Change file size
lseek( ) _llseek( ) Change file pointer
read( ) write( ) readv( ) writev( ) sendfile( ) sendfile64( ) readahead( ) Carry out file I/O operations
io_setup( ) io_submit( ) io_getevents( ) io_cancel( ) io_destroy( ) Asynchronous I/O (allows multiple outstanding read and write requests)
pread64( ) pwrite64( ) Seek file and access it
mmap( ) mmap2( ) munmap( ) madvise( ) mincore( ) Handle file memory mapping
remap_file_pages( )
fdatasync( ) fsync( ) sync( ) msync( ) Synchronize file data
flock( ) Manipulate file lock
setxattr( ) lsetxattr( ) fsetxattr( ) getxattr( ) lgetxattr( ) fgetxattr( ) listxattr( ) llistxattr( ) flistxattr( ) removexattr( ) lremovexattr( ) fremovexattr( ) Manipulate file extended attributes
12.2. VFS Data Structures
12.2.1. Superblock Objects
Table 12-2. The fields of the superblock object
Type Field Description
struct list_head s_list Pointers for superblock list
dev_t s_dev Device identifier
unsigned long s_blocksize Block size in bytes
unsigned long s_old_blocksize Block size in bytes as reported by the underlying block device driver
unsigned char s_blocksize_bits Block size in number of bits
unsigned char s_dirt Modified (dirty) flag
unsigned long long s_maxbytes Maximum size of the files
struct file_system_type * s_type Filesystem type
struct super_operations * s_op Superblock methods
struct dquot_operations * dq_op Disk quota handling methods
struct quotactl_ops * s_qcop Disk quota administration methods
struct export_operations * s_export_op Export operations used by network filesystems
unsigned long s_flags Mount flags
unsigned long s_magic Filesystem magic number
struct dentry * s_root Dentry object of the filesystem's root directory
struct rw_semaphore s_umount Semaphore used for unmounting
struct semaphore s_lock Superblock semaphore
int s_count Reference counter
int s_syncing Flag indicating that inodes of the superblock are being synchronized
int s_need_sync_fs Flag used when synchronizing the superblock's mounted filesystem
atomic_t s_active Secondary reference counter
void * s_security Pointer to superblock security structure
struct xattr_handler ** s_xattr Pointer to superblock extended attribute structure
struct list_head s_inodes List of all inodes
struct list_head s_dirty List of modified inodes
struct list_head s_io List of inodes waiting to be written to disk
struct hlist_head s_anon List of anonymous dentries for handling remote network filesystems
struct list_head s_files List of file objects
struct block_device* s_bdev Pointer to the block device driver descriptor
struct list_head s_instances Pointers for a list of superblock objects of a given filesystem type
(see the later section "Filesystem Type Registration")
struct quota_info s_dquot Descriptor for disk quota
int s_frozen Flag used when freezing the filesystem (forcing it to a consistent state)
wait_queue_head_t s_wait_unfrozen Wait queue where processes sleep until the filesystem is unfrozen
char[] s_id Name of the block device containing the superblock
void * s_fs_info Pointer to superblock information of a specific filesystem
struct semaphore s_vfs_rename_sem Semaphore used by VFS when renaming files across directories
u32 s_time_gran Timestamp's granularity (in nanoseconds)
The superblock methods are described by the super_operations structure, whose address is stored in the s_op field:
alloc_inode(sb)
Allocates space for an inode object, including the space required for filesystem-specific data.
destroy_inode(inode)
Destroys an inode object, including the filesystem-specific data
read_inode(inode)
Fills the fields of the inode object passed as the parameter with the data on disk; the i_ino
field of the inode object identifies the specific filesystem inode on the disk to be read.
dirty_inode(inode)
Invoked when the inode is marked as modified (dirty). Used by filesystems such as ReiserFS
and Ext3 to update the filesystem journal on disk.
write_inode(inode, flag)
Updates a filesystem inode with the contents of the inode object passed as the parameter; the
i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The
flag parameter indicates whether the I/O operation should be synchronous.
put_inode(inode)
Invoked when the inode is released (its reference counter is decreased) to perform filesystem-specific operations.
drop_inode(inode)
Invoked when the inode is about to be destroyed, that is, when the last user releases the inode;
filesystems that implement this method usually make use of generic_drop_inode( ). This
function removes every reference to the inode from the VFS data structures and, if the inode
no longer appears in any directory, invokes the delete_inode superblock method to delete the
inode from the filesystem.
delete_inode(inode)
Invoked when the inode must be destroyed. Deletes the VFS inode in memory and the file data
and metadata on disk.
put_super(super)
Releases the superblock object passed as the parameter (because the corresponding
filesystem is unmounted).
write_super(super)
Updates a filesystem superblock with the contents of the object indicated.
sync_fs(sb, wait)
Invoked when flushing the filesystem to update filesystem-specific data structures on disk
(used by journaling filesystems ).
write_super_lockfs(super)
Blocks changes to the filesystem and updates the superblock with the contents of the object
indicated. This method is invoked when the filesystem is frozen, for instance by the Logical
Volume Manager (LVM) driver.
unlockfs(super)
Undoes the block of filesystem updates achieved by the write_super_lockfs superblock
method.
statfs(super, buf)
Returns statistics on a filesystem by filling the buf buffer.
remount_fs(super, flags, data)
Remounts the filesystem with new options (invoked when a mount option must be changed).
clear_inode(inode)
Invoked when a disk inode is being destroyed to perform filesystem-specific operations.
umount_begin(super)
Aborts a mount operation because the corresponding unmount operation has been started
(used only by network filesystems ).
show_options(seq_file, vfsmount)
Used to display the filesystem-specific options
quota_read(super, type, data, size, offset)
Used by the quota system to read data from the file that specifies the limits for this filesystem.[*]
quota_write(super, type, data, size, offset)
Used by the quota system to write data into the file that specifies the limits for this filesystem.
12.2.2. Inode Objects
Table 12-3. The fields of the inode object
Type Field Description
struct hlist_node i_hash Pointers for the hash list
struct list_head i_list Pointers for the list that describes the inode's current state
struct list_head i_sb_list Pointers for the list of inodes of the superblock
struct list_head i_dentry The head of the list of dentry objects referencing this inode
unsigned long i_ino inode number
atomic_t i_count Usage counter
umode_t i_mode File type and access rights
unsigned int i_nlink Number of hard links
uid_t i_uid Owner identifier
gid_t i_gid Group identifier
dev_t i_rdev Real device identifier
loff_t i_size File length in bytes
struct timespec i_atime Time of last file access
struct timespec i_mtime Time of last file write
struct timespec i_ctime Time of last inode change
unsigned int i_blkbits Block size in number of bits
unsigned long i_blksize Block size in bytes
unsigned long i_version Version number, automatically increased after each use
unsigned long i_blocks Number of blocks of the file
unsigned short i_bytes Number of bytes in the last block of the file
unsigned char i_sock Nonzero if file is a socket
spinlock_t i_lock Spin lock protecting some fields of the inode
struct semaphore i_sem inode semaphore
struct rw_semaphore i_alloc_sem Read/write semaphore protecting against race conditions in direct I/O file operations
struct inode_operations * i_op inode operations
struct file_operations * i_fop Default file operations
struct super_block * i_sb Pointer to superblock object
struct file_lock * i_flock Pointer to file lock list
struct address_space* i_mapping Pointer to an address_space object (see Chapter 15)
struct address_space i_data address_space object of the file
struct dquot * [] i_dquot inode disk quotas
struct list_head i_devices Pointers for a list of inodes relative to a specific character or block device (see Chapter 13)
struct pipe_inode_info * i_pipe Used if the file is a pipe (see Chapter 19)
struct block_device * i_bdev Pointer to the block device driver
struct cdev * i_cdev Pointer to the character device driver
int i_cindex Index of the device file within a group of minor numbers
_ _u32 i_generation inode version number (used by some filesystems)
unsigned long i_dnotify_mask Bit mask of directory notify events
struct dnotify_struct * i_dnotify Used for directory notifications
unsigned long i_state inode state flags
unsigned long dirtied_when Dirtying time (in ticks) of the inode
unsigned int i_flags Filesystem mount flags
atomic_t i_writecount Usage counter for writing processes
void * i_security Pointer to inode's security structure
void * u.generic_ip Pointer to private data
seqcount_t i_size_seqcount Sequence counter used in SMP systems to get consistent values for i_size
The methods associated with an inode object are also called inode operations.
12.2.3. File Objects
A file object describes how a process interacts with a file it has opened.
The object is created when the file is opened and consists of a file structure.
Table 12-4. The fields of the file object
Type Field Description
struct list_head f_list Pointers for generic file object list
struct dentry * f_dentry dentry object associated with the file
struct vfsmount * f_vfsmnt Mounted filesystem containing the file
file_operations * f_op Pointer to file operation table
atomic_t f_count File object's reference counter
unsigned int f_flags Flags specified when opening the file
mode_t f_mode Process access mode
int f_error Error code for network write operation
loff_t f_pos Current file offset (file pointer)
struct fown_struct f_owner Data for I/O event notification via signals
unsigned int f_uid User's UID
unsigned int f_gid User group ID
struct file_ra_state f_ra File read-ahead state (see Chapter 16)
size_t f_maxcount Maximum number of bytes that can be read or written with a single operation (currently set to 2^31 - 1)
unsigned long f_version Version number, automatically increased after each use
void * f_security Pointer to file object's security structure
void * private_data Pointer to data specific for a filesystem or a device driver
struct list_head f_ep_links Head of the list of event poll waiters for this file
spinlock_t f_ep_lock Spin lock protecting the f_ep_links list
struct address_space* f_mapping Pointer to file's address space object (see Chapter 15)
file operations:
llseek(file, offset, origin)
Updates the file pointer.
read(file, buf, count, offset)
Reads count bytes from a file starting at position *offset; the value *offset (which usually
corresponds to the file pointer) is then increased.
aio_read(req, buf, len, pos)
Starts an asynchronous I/O operation to read len bytes into buf from file position pos
(introduced to support the io_submit( ) system call).
write(file, buf, count, offset)
Writes count bytes into a file starting at position *offset; the value *offset (which usually
corresponds to the file pointer) is then increased.
aio_write(req, buf, len, pos)
Starts an asynchronous I/O operation to write len bytes from buf to file position pos.
readdir(dir, dirent, filldir)
Returns the next directory entry of a directory in dirent; the filldir parameter contains the
address of an auxiliary function that extracts the fields in a directory entry.
poll(file, poll_table)
Checks whether there is activity on a file and goes to sleep until something happens on it.
ioctl(inode, file, cmd, arg)
Sends a command to an underlying hardware device. This method applies only to device files.
unlocked_ioctl(file, cmd, arg)
Similar to the ioctl method, but it does not take the big kernel lock (see the section "The Big
Kernel Lock" in Chapter 5). It is expected that all device drivers and all filesystems will
implement this new method instead of the ioctl method.
compat_ioctl(file, cmd, arg)
Method used to implement the ioctl() 32-bit system call by 64-bit kernels.
mmap(file, vma)
Performs a memory mapping of the file into a process address space (see the section "Memory
Mapping" in Chapter 16).
open(inode, file)
Opens a file by creating a new file object and linking it to the corresponding inode object (see
the section "The open( ) System Call" later in this chapter).
flush(file)
Called when a reference to an open file is closed. The actual purpose of this method is
filesystem-dependent.
release(inode, file)
Releases the file object. Called when the last reference to an open file is closed, that is, when
the f_count field of the file object becomes 0.
fsync(file, dentry, flag)
Flushes the file by writing all cached data to disk.
aio_fsync(req, flag)
Starts an asynchronous I/O flush operation.
fasync(fd, file, on)
Enables or disables I/O event notification by means of signals.
lock(file, cmd, file_lock)
Applies a lock to the file (see the section "File Locking" later in this chapter).
readv(file, vector, count, offset)
Reads bytes from a file and puts the results in the buffers described by vector; the number of
buffers is specified by count.
writev(file, vector, count, offset)
Writes bytes into a file from the buffers described by vector; the number of buffers is specified by count.
sendfile(in_file, offset, count, file_send_actor, out_file)
Transfers data from in_file to out_file (introduced to support the sendfile( ) system call).
sendpage(file, page, offset, size, pointer, fill)
Transfers data from file to the page cache's page; this is a low-level method used by
sendfile( ) and by the networking code for sockets.
get_unmapped_area(file, addr, len, offset, flags)
Gets an unused address range to map the file.
check_flags(flags)
Method invoked by the service routine of the fcntl( ) system call to perform additional checks
when setting the status flags of a file (F_SETFL command). Currently used only by the NFS
network filesystem.
dir_notify(file, arg)
Method invoked by the service routine of the fcntl( ) system call when establishing a
directory change notification (F_NOTIFY command). Currently used only by the Common
Internet File System (CIFS ) network filesystem.
flock(file, flag, lock)
Used to customize the behavior of the flock( ) system call. No official Linux filesystem makes
use of this method.
12.2.4. dentry Objects (Directory Entry Objects)
Table 12-5. The fields of the dentry object
Type Field Description
atomic_t d_count Dentry object usage counter
unsigned int d_flags Dentry cache flags
spinlock_t d_lock Spin lock protecting the dentry object
struct inode * d_inode Inode associated with filename
struct dentry * d_parent Dentry object of parent directory
struct qstr d_name Filename
struct list_head d_lru Pointers for the list of unused dentries
struct list_head d_child For directories, pointers for the list of directory dentries in the same parent directory
struct list_head d_subdirs For directories, head of the list of subdirectory dentries
struct list_head d_alias Pointers for the list of dentries associated with the same inode (alias)
unsigned long d_time Used by d_revalidate method
struct dentry_operations* d_op Dentry methods
struct super_block * d_sb Superblock object of the file
void * d_fsdata Filesystem-dependent data
struct rcu_head d_rcu The RCU descriptor used when reclaiming the dentry object
(see the section "Read-Copy Update (RCU)" in Chapter 5)
struct dcookie_struct * d_cookie Pointer to structure used by kernel profilers
struct hlist_node d_hash Pointer for list in hash table entry
int d_mounted For directories, counter for the number of filesystems mounted on this dentry
unsigned char[] d_iname Space for short filename
The dentry methods are described by the dentry_operations structure, whose address is stored in the d_op field:
d_revalidate(dentry, nameidata)
Determines whether the dentry object is still valid before using it for translating a file
pathname. The default VFS function does nothing, although network filesystems may specify
their own functions.
d_hash(dentry, name)
Creates a hash value; this function is a filesystem-specific hash function for the dentry hash
table. The dentry parameter identifies the directory containing the component. The name
parameter points to a structure containing both the pathname component to be looked up and
the value produced by the hash function.
d_compare(dir, name1, name2)
Compares two filenames; name1 should belong to the directory referenced by dir. The default
VFS function is a normal string match. However, each filesystem can implement this method in
its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.
d_delete(dentry)
Called when the last reference to a dentry object is deleted (d_count becomes 0). The default
VFS function does nothing.
d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default
VFS function does nothing.
d_iput(dentry, ino)
Called when a dentry object becomes "negative", that is, it loses its inode. The default VFS
function invokes iput( ) to release the inode object.
12.2.5. The dentry Cache
1. The addresses of the first and last elements of the LRU list are stored in the next and prev fields of the dentry_unused variable of type list_head. The d_lru field of the dentry object contains pointers to the adjacent dentries in the list.
2. Each "in use" dentry object is inserted into a doubly linked list specified by the i_dentry field of the corresponding inode object (because each inode could be associated with several hard links, a list is required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list.
3. The hash table is implemented by means of the dentry_hashtable array.
12.2.6. Files Associated with a Process
fs_struct
Table 12-6. The fields of the fs_struct structure
Type Field Description
atomic_t count Number of processes sharing this table
rwlock_t lock Read/write spin lock for the table fields
int umask Bit mask used when opening the file to set the file permissions
struct dentry * root Dentry of the root directory
struct dentry* pwd Dentry of the current working directory
struct dentry* altroot Dentry of the emulated root directory (always NULL for the 80 x 86 architecture)
struct vfsmount * rootmnt Mounted filesystem object of the root directory
struct vfsmount * pwdmnt Mounted filesystem object of the current working directory
struct vfsmount * altrootmnt Mounted filesystem object of the emulated root directory (always NULL for the 80 x 86 architecture)
files_struct
Table 12-7. The fields of the files_struct structure
Type Field Description
atomic_t count Number of processes sharing this table
rwlock_t file_lock Read/write spin lock for the table fields
int max_fds Current maximum number of file objects
int max_fdset Current maximum number of file descriptors
int next_fd Maximum file descriptors ever allocated plus 1
struct file ** fd Pointer to array of file object pointers
fd_set * close_on_exec Pointer to file descriptors to be closed on exec( )
fd_set * open_fds Pointer to open file descriptors
fd_set close_on_exec_init Initial set of file descriptors to be closed on exec( )
fd_set open_fds_init Initial set of file descriptors
struct file *[] fd_array Initial array of file object pointers
Figure 12-3. The fd array
fget( )/fget_light( )
fput( )/fput_light( )
12.3. Filesystem Types
12.3.1. Special Filesystems
Table 12-8. Most common special filesystems
Name Mount point Description
bdev none Block devices (see Chapter 13)
binfmt_misc any Miscellaneous executable formats (see Chapter 20)
devpts /dev/pts Pseudoterminal support (Open Group's Unix98 standard)
eventpollfs none Used by the efficient event polling mechanism
futexfs none Used by the futex (Fast Userspace Locking) mechanism
pipefs none Pipes (see Chapter 19)
proc /proc General access point to kernel data structures
rootfs none Provides an empty root directory for the bootstrap phase
shm none IPC-shared memory regions (see Chapter 19)
mqueue any Used to implement POSIX message queues (see Chapter 19)
sockfs none Sockets
sysfs /sys General access point to system data (see Chapter 13)
tmpfs any Temporary files (kept in RAM unless swapped)
usbfs /proc/bus/usb USB devices
12.3.2. Filesystem Type Registration
Each registered filesystem is represented as a file_system_type object whose fields are illustrated in Table 12-9.
Table 12-9. The fields of the file_system_type object
Type Field Description
const char * name Filesystem name
int fs_flags Filesystem type flags
struct super_block * (*)( ) get_sb Method for reading a superblock
void (*)( ) kill_sb Method for removing a superblock
struct module * owner Pointer to the module implementing the filesystem (see Appendix B)
struct file_system_type * next Pointer to the next element in the list of filesystem types
struct list_head fs_supers Head of a list of superblock objects having the same filesystem type
All filesystem-type objects are inserted into a singly linked list; the file_systems variable points to the first item.
12.4. Filesystem Handling
12.4.1. Namespaces
The namespace of a process is represented by a namespace structure pointed to by the namespace
field of the process descriptor
Table 12-11. The fields of the namespace structure
Type Field Description
atomic_t count Usage counter (how many processes share the namespace)
struct vfsmount * root Mounted filesystem descriptor for the root directory of the namespace
struct list_head list Head of list of all mounted filesystem descriptors
struct rw_semaphore sem Read/write semaphore protecting this structure
12.4.2. Filesystem Mounting
vfsmount
12.4.3. Mounting a Generic Filesystem
12.4.3.1. The do_kern_mount( ) function
12.4.3.2. Allocating a superblock object
12.4.4. Mounting the Root Filesystem
Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs
filesystem allows the kernel to easily change the real root filesystem.
12.4.4.1. Phase 1: Mounting the rootfs filesystem
init_rootfs( ) / init_mount_tree( )
12.4.4.2. Phase 2: Mounting the real root filesystem
prepare_namespace( )
12.4.5. Unmounting a Filesystem
do_umount( )
12.5. Pathname Lookup
path_lookup(const char *name, unsigned int flags, struct nameidata *nd) / path_lookupat( )
Table 12-15. The fields of the nameidata data structure
Type Field Description
struct dentry * dentry Address of the dentry object
struct vfs_mount * mnt Address of the mounted filesystem object
struct qstr last Last component of the pathname (used when the LOOKUP_PARENT flag is set)
unsigned int flags Lookup flags
int last_type Type of last component of the pathname (used when the LOOKUP_PARENT flag is set)
unsigned int depth Current level of symbolic link nesting (see below); it must be smaller than 6
char[ ] * saved_names Array of pathnames associated with nested symbolic links
union intent One-member union specifying how the file will be accessed
Table 12-16. The flags of the lookup operation
Macro Description
LOOKUP_FOLLOW If the last component is a symbolic link, interpret (follow) it
LOOKUP_DIRECTORY The last component must be a directory
LOOKUP_CONTINUE There are still filenames to be examined in the pathname
LOOKUP_PARENT Look up the directory that includes the last component of the pathname
LOOKUP_NOALT Do not consider the emulated root directory (useless in the 80x86 architecture)
LOOKUP_OPEN Intent is to open a file
LOOKUP_CREATE Intent is to create a file (if it doesn't exist)
LOOKUP_ACCESS Intent is to check user's permission for a file
12.5.1. Standard Pathname Lookup
link_path_walk( )
12.5.2. Parent Pathname Lookup
12.5.3. Lookup of Symbolic Links
12.6. Implementations of VFS System Calls
12.6.1. The open( ) System Call
Table 12-18. The flags of the open( ) system call
Flag name Description
O_RDONLY Open for reading
O_WRONLY Open for writing
O_RDWR Open for both reading and writing
O_CREAT Create the file if it does not exist
O_EXCL With O_CREAT, fail if the file already exists
O_NOCTTY Never consider the file as a controlling terminal
O_TRUNC Truncate the file (remove all existing contents)
O_APPEND Always write at end of the file
O_NONBLOCK No system calls will block on the file
O_NDELAY Same as O_NONBLOCK
O_SYNC Synchronous write (block until physical write terminates)
FASYNC I/O event notification via signals
O_DIRECT Direct I/O transfer (no kernel buffering)
O_LARGEFILE Large file (size greater than 2 GB)
O_DIRECTORY Fail if file is not a directory
O_NOFOLLOW Do not follow a trailing symbolic link in pathname
O_NOATIME Do not update the inode's last access time
12.6.2. The read( ) and write( ) System Calls
12.6.3. The close( ) System Call
12.7. File Locking
The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call
12.7.1. Linux File Locking
A process can acquire a file lock in two ways:
1. By issuing the flock( ) system call. The two parameters of the system call are the fd file descriptor and a command specifying the lock operation. The lock applies to the whole file.
2. By using the fcntl( ) system call. The three parameters of the system call are the fd file descriptor, a command specifying the lock operation, and a pointer to a flock structure (see Table 12-20). A couple of fields in this structure allow the process to specify the portion of the file to be locked. Processes can thus hold several locks on different portions of the same file.
12.7.2. File-Locking Data Structures
Table 12-19. The fields of the file_lock data structure
Type Field Description
struct file_lock * fl_next Next element in list of locks associated with the inode
struct list_head fl_link Pointers for active or blocked list
struct list_head fl_block Pointers for the lock's waiters list
struct files_struct * fl_owner Owner's files_struct
unsigned int fl_pid PID of the process owner
wait_queue_head_t fl_wait Wait queue of blocked processes
struct file * fl_file Pointer to file object
unsigned char fl_flags Lock flags
unsigned char fl_type Lock type
loff_t fl_start Starting offset of locked region
loff_t fl_end Ending offset of locked region
struct fasync_struct * fl_fasync Used for lease break notifications
unsigned long fl_break_time Remaining time before end of lease
struct file_lock_operations * fl_ops Pointer to file lock operations
struct lock_manager_operations* fl_mops Pointer to lock manager operations
union fl_u Filesystem-specific information
12.7.3. FL_FLOCK Locks
flock_lock_file_wait( )
12.7.4. FL_POSIX Locks
Table 12-20. The fields of the flock data structure
Type Field Description
short l_type F_RDLCK (requests a shared lock), F_WRLCK (requests an exclusive lock), F_UNLCK (releases the lock)
short l_whence SEEK_SET (from beginning of file), SEEK_CUR (from current file pointer), SEEK_END (from end of file)
off_t l_start Initial offset of the locked region relative to the value of l_whence
off_t l_len Length of locked region (0 means that the region includes all potential writes past the current end of the file)
pid_t l_pid PID of the owner
F_GETLK
Determines whether the lock described by the flock structure conflicts with some FL_POSIX
lock already obtained by another process. In this case, the flock structure is overwritten with
the information about the existing lock.
F_SETLK
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call
returns an error code.
F_SETLKW
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call
blocks; that is, the calling process is put to sleep until the lock is available.
Chapter 13. I/O Architecture and Device Drivers
13.1. I/O Architecture
Figure 13-1. PC's I/O architecture
13.1.1. I/O Ports
13.1.1.1. Accessing I/O ports
inb( ), inw( ), inl( )
inb_p( ), inw_p( ), inl_p( )
outb( ), outw( ), outl( )
outb_p( ), outw_p( ), outl_p( )
insb( ), insw( ), insl( )
outsb( ), outsw( ), outsl( )
13.1.2. I/O Interfaces
13.1.2.1. Custom I/O interfaces
13.1.2.2. General-purpose I/O interfaces
13.1.3. Device Controllers
13.2. The Device Driver Model
13.2.1. The sysfs Filesystem
Relationships between components of the device driver model are expressed in the sysfs filesystem
as symbolic links between directories and files. For example, the /sys/block/sda/device file can be a
symbolic link to a subdirectory nested in /sys/devices/pci0000:00 representing the SCSI controller
connected to the PCI bus. Moreover, the /sys/block/sda/device/block file is a symbolic link to
/sys/block/sda, stating that this PCI device is the controller of the SCSI disk.
13.2.2. Kobjects
Each kobject corresponds to a directory in the sysfs filesystem.
13.2.2.1. Kobjects, ksets, and subsystems
Table 13-2. The fields of the kobject data structure
Type Field Description
char * k_name Pointer to a string holding the name of the container
char [] name String holding the name of the container, if it fits in 20 bytes
struct kref kref The reference counter for the container
struct list_head entry Pointers for the list in which the kobject is inserted
struct kobject * parent Pointer to the parent kobject, if any
struct kset * kset Pointer to the containing kset
struct kobj_type * ktype Pointer to the kobject type descriptor
struct dentry * dentry Pointer to the dentry of the sysfs file associated with the kobject
The kobj_type data structure includes three fields: a release method, a sysfs_ops pointer to a table of sysfs operations, and a list of default attributes for the sysfs filesystem.
Table 13-3. The fields of the kset data structure
Type Field Description
struct subsystem * subsys Pointer to the subsystem descriptor
struct kobj_type * ktype Pointer to the kobject type descriptor of the kset
struct list_head list Head of the list of kobjects included in the kset
struct kobject kobj Embedded kobject (see text)
struct kset_hotplug_ops * hotplug_ops Pointer to a table of callback functions for kobject filtering and hot-plugging
Figure 13-3. An example of device driver model hierarchy
13.2.2.2. Registering kobjects, ksets, and subsystems
kset_register() and kset_unregister( ) functions
13.2.3. Components of the Device Driver Model
13.2.3.1. Devices
device
Table 13-4. The fields of the device object
Type Field Description
struct list_head node Pointers for the list of sibling devices
struct list_head bus_list Pointers for the list of devices on the same bus type
struct list_head driver_list Pointers for the driver's list of devices
struct list_head children Head of the list of children devices
struct device * parent Pointer to the parent device
struct kobject kobj Embedded kobject
char [] bus_id Device position on the hosting bus
struct bus_type * bus Pointer to the hosting bus
struct device_driver * driver Pointer to the controlling device driver
void * driver_data Pointer to private data for the driver
void * platform_data Pointer to private data for legacy device drivers
struct dev_pm_info power Power management information
unsigned long detach_state Power state to be entered when unloading the device driver
unsigned long long * dma_mask Pointer to the DMA mask of the device (see the later section
"Direct Memory Access (DMA)")
unsigned long long coherent_dma_mask Mask for coherent DMA of the device
struct list_head dma_pools Head of a list of aggregate DMA buffers
struct dma_coherent_mem * dma_mem Pointer to a descriptor of the coherent DMA memory
used by the device (see the later section "Direct Memory Access (DMA)")
void (*)(struct device*) release Callback function for releasing the device descriptor
13.2.3.2. Drivers
device_driver
Table 13-5. The fields of the device_driver object
Type Field Description
char * name Name of the device driver
struct bus_type * bus Pointer to descriptor of the bus that hosts the supported devices
struct semaphore unload_sem Semaphore to forbid device driver unloading; it is
released when the reference counter reaches zero
struct kobject kobj Embedded kobject
struct list_head devices Head of the list including all devices supported by the driver
struct module * owner Identifies the module that implements the device
driver, if any (see Appendix B)
int (*)(struct device *) probe Method for probing a device
(checking that it can be handled by the device driver)
int (*)(struct device *) remove Method invoked on a device when it is removed
void (*)(struct device *) shutdown Method invoked on a device when it is powered off (shut down)
int (*)(struct device *,unsigned long, unsigned long) suspend Method invoked on a device when it is put in a low-power state
int (*)(struct device *,unsigned long) resume Method invoked on a device when it is put back in
the normal state (full power)
13.2.3.3. Buses
bus_type object
Table 13-6. The fields of the bus_type object
Type Field Description
char * name Name of the bus type
struct subsystem subsys Kobject subsystem associated with this bus type
struct kset drivers The set of kobjects of the drivers
struct kset devices The set of kobjects of the devices
struct bus_attribute * bus_attrs Pointer to the object including the bus attributes and
the methods for exporting them to the sysfs filesystem
struct device_attribute * dev_attrs Pointer to the object including the device attributes and the methods for exporting them to the sysfs filesystem
struct driver_attribute * drv_attrs Pointer to the object including the device driver
attributes and the methods for exporting them to the sysfs filesystem
int (*)(struct device *,struct device_driver *) match Method for checking whether a given driver supports a given device
int (*)(struct device *, char**, int, char *, int) hotplug Method invoked when a device is being registered
int (*)(struct device *,unsigned long) suspend Method for saving the hardware context state and
changing the power level of a device
int (*)(struct device *) resume Method for changing the power level and restoring
the hardware context of a device
13.2.3.4. Classes
The classes of the device driver model are essentially aimed at providing a standard method for
exporting to User Mode applications the interfaces of the logical devices. Each class_device
descriptor embeds a kobject having an attribute (special file) named dev. This attribute stores the
major and minor numbers of the device file that is needed to access the corresponding logical
device.
13.3. Device Files
The major number identifies the device type.
Traditionally, all device files that have the same major number and the same type share the same set of file
operations, because they are handled by the same device driver
The minor number identifies a specific device among a group of devices that share the same major number.
13.3.1. User Mode Handling of Device Files
MKDEV
13.3.1.1. Dynamic device number assignment
13.3.1.2. Dynamic device file creation
udev toolset can automatically
13.3.2. VFS Handling of Device Files
The inode object is initialized by reading the corresponding inode on disk through a suitable function
of the filesystem (usually ext2_read_inode( ) or ext3_read_inode( ); see Chapter 18). When this
function determines that the disk inode is relative to a device file, it invokes init_special_inode( ),
which initializes the i_rdev field of the inode object to the major and minor numbers of the device
file, and sets the i_fop field of the inode object to the address of either the def_blk_fops or the
def_chr_fops file operation table, according to the type of device file. The service routine of the
open( ) system call also invokes the dentry_open( ) function, which allocates a new file object and
sets its f_op field to the address stored in i_fop, that is, to the address of def_blk_fops or
def_chr_fops once again. Thanks to these two tables, every system call issued on a device file will
activate a device driver's function rather than a function of the underlying filesystem.
13.4. Device Drivers
13.4.1. Device Driver Registration
13.4.2. Device Driver Initialization
13.4.3. Monitoring I/O Operations
13.4.3.1. Polling mode
13.4.3.2. Interrupt mode
13.4.4. Accessing the I/O Shared Memory
ioremap( ) or ioremap_nocache()
13.4.5. Direct Memory Access (DMA)
13.4.5.1. Synchronous and asynchronous DMA
synchronous DMA the data transfers are triggered by processes
asynchronous DMA the data transfers are triggered by hardware devices
13.4.5.2. Helper functions for DMA transfers
13.4.5.3. Bus addresses
13.4.5.4. Cache coherency
Coherent DMA mapping
Streaming DMA mapping
13.4.5.5. Helper functions for coherent DMA mappings
dma_alloc_coherent( )/dma_free_coherent( )
13.4.5.6. Helper functions for streaming DMA mappings
dma_map_single( )/dma_unmap_single( )
13.4.6. Levels of Kernel Support
The Linux kernel does not fully support all possible existing I/O devices. Generally speaking,
there are three possible kinds of support for a hardware device:
No support at all
Minimal support
Extended support
The ioctl( ) system call was introduced to satisfy such needs:
let an application check whether the device is in a specific internal state
13.5. Character Device Drivers
cdev structure
Table 13-8. The fields of the cdev structure
Type Field Description
struct kobject kobj Embedded kobject
struct module * owner Pointer to the module implementing the driver, if any
struct file_operations * ops Pointer to the file operations table of the device driver
struct list_head list Head of the list of inodes relative to device files for this character device
dev_t dev Initial major and minor numbers assigned to the device driver
unsigned int count Size of the range of device numbers assigned to the device driver
Table 13-9. The fields of the probe object
Type Field Description
struct probe * next Next element in hash collision list
dev_t dev Initial device number (major and minor) of the interval
unsigned long range Size of the interval
struct module * owner Pointer to the module that implements the device driver, if any
struct kobject *(*)(dev_t, int *, void*) get Method for probing the owner of the interval
int (*)(dev_t, void*) lock Method for increasing the reference counter of the owner of the interval
void * data Private data for the owner of the interval
13.5.1. Assigning Device Numbers
char_device_struct structure
Table 13-10. The fields of the char_device_struct descriptor
Type Field Description
struct char_device_struct * next Pointer to the next element in the hash collision list
unsigned int major The major number of the interval
unsigned int baseminor The initial minor number of the interval
int minorct The interval size
const char * name The name of the device driver that handles the interval
struct file_operations * fops Not used
struct cdev * cdev Pointer to the character device driver descriptor
Device numbers can be assigned in two ways:
1: The register_chrdev_region( ) and alloc_chrdev_region( ) functions, followed by cdev_add( )
2: The register_chrdev( ) function
13.5.1.1. The register_chrdev_region( ) and alloc_chrdev_region( ) functions
The _ _register_chrdev_region( ) function executes the following steps
13.5.1.2. The register_chrdev( ) function
(1): Invokes _ _register_chrdev_region( ) to allocate the requested interval of device numbers
(2): Allocates a new cdev descriptor for the device driver, initializes it, and registers it with cdev_add( )
13.5.2. Accessing a Character Device Driver
chrdev_open( )
13.5.3. Buffering Strategies for Character Devices
This can be done by combining two different techniques:
1: Use of DMA to transfer blocks of data.
2: Use of a circular buffer of two or more elements, each element having the size of a block of
data. When an interrupt occurs signaling that a new block of data has been read, the interrupt
handler advances a pointer to the elements of the circular buffer so that further data will be
stored in an empty element. Conversely, whenever the driver succeeds in copying a block of
data into user address space, it releases an element of the circular buffer so that it is available
for saving new data from the hardware device.
Chapter 14. Block Device Drivers
14.1. Block Devices Handling
Figure 14-1. Kernel components affected by a block device operation
Figure 14-2. Typical layout of a page including disk data
14.1.1. Sectors
the sector is the basic unit of data transfer for the hardware devices
In Linux, the size of a sector is conventionally set to 512 bytes; sector numbers are stored in variables of type sector_t
14.1.2. Blocks
the block is the basic unit of data transfer for the VFS
Each buffer has a "buffer head" descriptor of type buffer_head.
We will give a detailed explanation of all fields of the buffer head in Chapter 15
14.1.3. Segments
As we'll see, the generic block layer can merge different segments if the corresponding page frames
happen to be contiguous in RAM and the corresponding chunks of disk data are adjacent on disk.
The larger memory area resulting from this merge operation is called a physical segment.
Yet another merge operation is allowed on architectures that handle the mapping between bus
addresses and physical addresses through a dedicated bus circuitry (the IO-MMU; see the section
"Direct Memory Access (DMA)" in Chapter 13). The memory area resulting from this kind of merge
operation is called a hardware segment.
14.2. The Generic Block Layer
14.2.1. The Bio Structure
Table 14-1. The fields of the bio structure
Type Field Description
sector_t bi_sector First sector on disk of block I/O operation
struct bio * bi_next Link to the next bio in the request queue
struct block_device * bi_bdev Pointer to block device descriptor
unsigned long bi_flags Bio status flags
unsigned long bi_rw I/O operation flags
unsigned short bi_vcnt Number of segments in the bio's bio_vec array
unsigned short bi_idx Current index in the bio's bio_vec array of segments
unsigned short bi_phys_segments Number of physical segments of the bio after merging
unsigned short bi_hw_segments Number of hardware segments after merging
unsigned int bi_size Bytes (yet) to be transferred
unsigned int bi_hw_front_size Used by the hardware segment merge algorithm
unsigned int bi_hw_back_size Used by the hardware segment merge algorithm
unsigned int bi_max_vecs Maximum allowed number of segments in the bio's bio_vec array
struct bio_vec * bi_io_vec Pointer to the bio's bio_vec array of segments
bio_end_io_t * bi_end_io Method invoked at the end of bio's I/O operation
atomic_t bi_cnt Reference counter for the bio
void * bi_private Pointer used by the generic block layer and the I/O
completion method of the block device driver
bio_destructor_t* bi_destructor Destructor method (usually bio_destructor()) invoked when
the bio is being freed
Table 14-2. The fields of the bio_vec structure
Type Field Description
struct page * bv_page Pointer to the page descriptor of the segment's page frame
unsigned int bv_len Length of the segment in bytes
unsigned int bv_offset Offset of the segment's data in the page frame
14.2.2. Representing Disks and Disk Partitions
Table 14-3. The fields of the gendisk object
Type Field Description
int major Major number of the disk
int first_minor First minor number associated with the disk
int minors Range of minor numbers associated with the disk
char [32] disk_name Conventional name of the disk (usually, the canonical
name of the corresponding device file)
struct hd_struct ** part Array of partition descriptors for the disk
struct block_device_operations* fops Pointer to a table of block device methods
struct request_queue * queue Pointer to the request queue of the disk (see "Request
Queue Descriptors" later in this chapter)
void * private_data Private data of the block device driver
sector_t capacity Size of the storage area of the disk (in number of sectors)
int flags Flags describing the kind of disk (see below)
char [64] devfs_name Device filename in the (nowadays deprecated) devfs special filesystem
int number No longer used
struct device * driverfs_dev Pointer to the device object of the disk's hardware device
(see the section "Components of the Device Driver Model" in Chapter 13)
struct kobject kobj Embedded kobject (see the section "Kobjects" in Chapter 13)
struct timer_rand_state * random Pointer to a data structure that records the timing of the
disk's interrupts; used by the kernel built-in random number generator
int policy Set to 1 if the disk is read-only (write operations forbidden), 0 otherwise
atomic_t sync_io Counter of sectors written to disk, used only for RAID
unsigned long stamp Timestamp used to determine disk queue usage statistics
unsigned long stamp_idle Same as above
int in_flight Number of ongoing I/O operations
struct disk_stats * dkstats Statistics about per-CPU disk usage
Table 14-4. The methods of the block devices
Method Triggers
open Opening the block device file
release Closing the last reference to a block device file
ioctl Issuing an ioctl( ) system call on the block device file (uses the big kernel lock )
compat_ioctl Issuing an ioctl( ) system call on the block device file (does not use the big kernel lock)
media_changed Checking whether the removable media has been changed (e.g., floppy disk)
revalidate_disk Checking whether the block device holds valid data
14.2.3. Submitting a Request
generic_make_request()
14.3. The I/O Scheduler
14.3.1. Request Queue Descriptors
struct request_queue * queue
Table 14-6. The fields of the request queue descriptor
Type Field Description
struct list_head queue_head List of pending requests
struct request * last_merge Pointer to descriptor of the request in the queue to be considered first for possible merging
elevator_t * elevator Pointer to the elevator object (see the later section "I/O Scheduling Algorithms")
struct request_list rq Data structure used for allocation of request descriptors
request_fn_proc * request_fn Method that implements the entry point of the strategy routine of the driver
merge_request_fn* back_merge_fn Method to check whether it is possible to merge a bio to the last request in the queue
merge_requests_fn * merge_requests_fn Method to attempt merging two adjacent requests in the queue
make_request_fn * make_request_fn Method invoked when a new request has to be inserted in the queue
prep_rq_fn * prep_rq_fn Method to build the commands to be sent to the hardware device to process this request
unplug_fn * unplug_fn Method to unplug the block device (see the section "Activating the Block Device Driver" later in the chapter)
merge_bvec_fn * merge_bvec_fn Method that returns the number of bytes that can be inserted into an existing bio when adding a new segment (usually undefined)
activity_fn * activity_fn Method invoked when a request is added to a queue(usually undefined)
issue_flush_fn * issue_flush_fn Method invoked when a request queue is flushed (the queue is emptied by processing all requests in a row)
struct timer_list unplug_timer Dynamic timer used to perform device plugging (see the later section "Activating the Block Device Driver")
int unplug_thresh If the number of pending requests in the queue exceeds this value, the device is immediately unplugged (default is 4)
unsigned long unplug_delay Time delay before device unplugging (default is 3 milliseconds)
struct work_struct unplug_work Work queue used to unplug the device (see the later section "Activating the Block Device Driver")
struct backing_dev_info backing_dev_info See the text following this table
void * queuedata Pointer to private data of the block device driver
void * activity_data Private data used by the activity_fn method
unsigned long bounce_pfn Page frame number above which buffer bouncing must be used (see the section "Submitting a Request" later in
this chapter)
int bounce_gfp Memory allocation flags for bounce buffers
unsigned long queue_flags Set of flags describing the queue status
spinlock_t * queue_lock Pointer to request queue lock
struct kobject kobj Embedded kobject for the request queue
unsigned long nr_requests Maximum number of requests in the queue
unsigned int nr_congestion_on Queue is considered congested if the number of pending requests rises above this threshold
unsigned int nr_congestion_off Queue is considered not congested if the number of pending requests falls below this threshold
unsigned int nr_batching Maximum number (usually 32) of pending requests that can be submitted even when the queue is full by a
special "batcher" process
unsigned short max_sectors Maximum number of sectors handled by a single request (tunable)
unsigned short max_hw_sectors Maximum number of sectors handled by a single request(hardware constraint)
unsigned short max_phys_segments Maximum number of physical segments handled by a single request
unsigned short max_hw_segments Maximum number of hardware segments handled by a single request (the maximum number of distinct
memory areas in a scatter-gather DMA operation)
unsigned short hardsect_size Size in bytes of a sector
unsigned int max_segment_size Maximum size of a physical segment (in bytes)
unsigned long seg_boundary_mask Memory boundary mask for segment merging
unsigned int dma_alignment Alignment bitmap for initial address and length of DMA buffers (default 511)
struct blk_queue_tag * queue_tags Bitmap of free/busy tags (used for tagged requests)
atomic_t refcnt Reference counter of the queue
unsigned int in_flight Number of pending requests in the queue
unsigned int sg_timeout User-defined command time-out (used only by SCSI generic devices)
unsigned int sg_reserved_size Essentially unused
struct list_head drain_list Head of a list of requests temporarily delayed until the I/O scheduler is dynamically replaced
14.3.2. Request Descriptors
request data structure
Table 14-7. The fields of the request descriptor
Type Field Description
struct list_head queuelist Pointers for request queue list
unsigned long flags Flags of the request (see below)
sector_t sector Number of the next sector to be transferred
unsigned long nr_sectors Number of sectors yet to be transferred in the whole request
unsigned int current_nr_sectors Number of sectors in the current segment of the current bio yet to be transferred
sector_t hard_sector Number of the next sector to be transferred
unsigned long hard_nr_sectors Number of sectors yet to be transferred in the whole request (updated by the generic block layer)
unsigned int hard_cur_sectors Number of sectors in the current segment of the current bio yet to be transferred
(updated by the generic block layer)
struct bio * bio First bio in the request that has not been completely transferred
struct bio * biotail Last bio in the request list
void * elevator_private Pointer to private data for the I/O scheduler
int rq_status Request status: essentially, either RQ_ACTIVE or RQ_INACTIVE
struct gendisk * rq_disk The descriptor of the disk referenced by the request
int errors Counter for the number of I/O errors that occurred on the current transfer
unsigned long start_time Request's starting time (in jiffies)
unsigned short nr_phys_segments Number of physical segments of the request
unsigned short nr_hw_segments Number of hardware segments of the request
int tag Tag associated with the request (only for hardware devices supporting multiple outstanding data transfers)
char * buffer Pointer to the memory buffer of the current data transfer (NULL if the buffer is in high-memory)
int ref_count Reference counter for the request
request_queue_t * q Pointer to the descriptor of the request queue containing the request
struct request_list* rl Pointer to request_list data structure
struct completion* waiting Completion for waiting for the end of the data transfers(see the section "Completions" in Chapter 5)
void * special Pointer to data used when the request includes a "special"command to the hardware device
unsigned int cmd_len Length of the commands in the cmd field
unsigned char [] cmd Buffer containing the pre-built commands prepared by the request queue's prep_rq_fn method
unsigned int data_len Usually, the length of data in the buffer pointed to by the data field
void * data Pointer used by the device driver to keep track of the data to be transferred
unsigned int sense_len Length of buffer pointed to by the sense field (0 if the sense field is NULL)
void * sense Pointer to buffer used for output of sense commands
unsigned int timeout Request's time-out
struct request_pm_state* pm Pointer to a data structure used for power-management commands
The flags field stores a large number of flags, which are listed in Table 14-8.
14.3.2.1. Managing the allocation of request descriptors
blk_get_request( )
blk_put_request( )
14.3.2.2. Avoiding request queue congestion
blk_congestion_wait( )
14.3.3. Activating the Block Device Driver
The blk_plug_device( ) function plugs a block device, or more precisely, the request queue serviced by some block device driver
The blk_remove_plug( ) function unplugs a request queue q
14.3.4. I/O Scheduling Algorithms
elevators.
Currently, Linux 2.6 offers four different types of I/O schedulers, or elevators, called
"Anticipatory," "Deadline," "CFQ (Complete Fairness Queueing)," and "Noop (No Operation)."
The I/O scheduler algorithm used in a request queue is represented by an elevator object of type
elevator_t; its address is stored in the elevator field of the request queue descriptor
14.3.4.1. The "Noop" elevator
14.3.4.2. The "CFQ" elevator
14.3.4.3. The "Deadline" elevator
14.3.4.4. The "Anticipatory" elevator
14.3.5. Issuing a Request to the I/O Scheduler
_ _make_request( )
14.3.5.1. The blk_queue_bounce( ) function
14.4. Block Device Drivers
14.4.1. Block Devices
block_device descriptor,
Table 14-9. The fields of the block device descriptor
Type Field Description
dev_t bd_dev Major and minor numbers of the block device
struct inode * bd_inode Pointer to the inode of the file associated with the block device in the bdev filesystem
int bd_openers Counter of how many times the block device has been opened
struct semaphore bd_sem Semaphore protecting the opening and closing of the block device
struct semaphore bd_mount_sem Semaphore used to forbid new mounts on the block device
struct list_head bd_inodes Head of a list of inodes of opened block device files for this block device
void * bd_holder Current holder of block device descriptor
int bd_holders Counter for multiple settings of the bd_holder field
struct block_device * bd_contains If block device is a partition, it points to the block device descriptor of the whole disk;
otherwise, it points to this block device descriptor
unsigned bd_block_size Block size
struct hd_struct* bd_part Pointer to partition descriptor (NULL if this block device is not a partition)
unsigned bd_part_count Counter of how many times partitions included in this block device have been opened
int bd_invalidated Flag set when the partition table on this block device needs to be read
struct gendisk * bd_disk Pointer to gendisk structure of the disk underlying this block device
struct list_head * bd_list Pointers for the block device descriptor list
struct backing_dev_info* bd_inode_backing_dev_info Pointer to a specialized backing_dev_info descriptor for this block device (usually NULL)
unsigned long bd_private Pointer to private data of the block device holder
Figure 14-3. Linking the block device descriptors with the other structures of the block subsystem
14.4.1.1. Accessing a block device
14.4.2. Device Driver Registration and Initialization
14.4.2.1. Defining a custom driver descriptor
First of all, the device driver needs a custom descriptor foo of type foo_dev_t holding the data required to drive the hardware device.
struct foo_dev_t {
    [...]
    spinlock_t lock;
    struct gendisk *gd;
    [...]
} foo;
register_blkdev()
14.4.2.2. Initializing the custom descriptor
alloc_disk
14.4.2.3. Initializing the gendisk descriptor
blk_init_queue( )
14.4.2.4. Initializing the table of block device methods
14.4.2.5. Allocating and initializing a request queue
14.4.2.6. Setting up the interrupt handler
14.4.2.7. Registering the disk
add_disk( )
14.4.3. The Strategy Routine
blk_init_queue(foo_strategy)
14.4.4. The Interrupt Handler
14.5. Opening a Block Device File
Table 14-10. The default block device file operations (def_blk_fops table)
Method Function
open blkdev_open( )
release blkdev_close( )
llseek block_llseek( )
read generic_file_read( )
write blkdev_file_write( )
aio_read generic_file_aio_read( )
aio_write blkdev_file_aio_write( )
mmap generic_file_mmap( )
fsync block_fsync( )
ioctl block_ioctl( )
compat_ioctl compat_blkdev_ioctl( )
readv generic_file_readv( )
writev generic_file_write_nolock( )
sendfile generic_file_sendfile( )
Chapter 15. The Page Cache
15.1. The Page Cache
Kernel designers have implemented the page cache to fulfill two main requirements:
(1) Quickly locate a specific page containing data relative to a given owner. To take the maximum
advantage from the page cache, searching it should be a very fast operation.
(2) Keep track of how every page in the cache should be handled when reading or writing its
content. For instance, reading a page from a regular file, a block device file, or a swap area
must be performed in different ways, thus the kernel must select the proper operation
depending on the page's owner.
15.1.1. The address_space Object
The core data structure of the page cache is the address_space object,
Each page descriptor includes two fields called mapping and index,
The first field points to the address_space object of the inode that owns the page.
The second field specifies the offset in page-size units within the owner's "address space," that is, the position of the page's data inside the owner's disk image.
These two fields are used when looking for a page in the page cache.
address_space object
Table 15-1. The fields of the address_space object
Type Field Description
struct inode * host Pointer to the inode hosting this object, if any
struct radix_tree_root page_tree Root of radix tree identifying the owner's pages
spinlock_t tree_lock Spin lock protecting the radix tree
unsigned int i_mmap_writable Number of shared memory mappings in the address space
struct prio_tree_root i_mmap Root of the radix priority search tree (see Chapter 17)
struct list_head i_mmap_nonlinear List of non-linear memory regions in the address space
spinlock_t i_mmap_lock Spin lock protecting the radix priority search tree
unsigned int truncate_count Sequence counter used when truncating the file
unsigned long nrpages Total number of owner's pages
unsigned long writeback_index Page index of the last write-back operation on the owner's pages
struct address_space_operations * a_ops Methods that operate on the owner's pages
unsigned long flags Error bits and memory allocator flags
struct backing_dev_info * backing_dev_info Pointer to the backing_dev_info of the block device holding the data of this owner
spinlock_t private_lock Usually, spin lock used when managing the private_list list
struct list_head private_list Usually, a list of dirty buffers of indirect blocks associated with the inode
struct address_space * assoc_mapping Usually, pointer to the address_space object of the block device including the indirect blocks
Table 15-2. The methods of the address_space object
Method Description
writepage Write operation (from the page to the owner's disk image)
readpage Read operation (from the owner's disk image to the page)
sync_page Start the I/O data transfer of already scheduled operations on owner's pages
writepages Write back to disk a given number of dirty owner's pages
set_page_dirty Set an owner's page as dirty
readpages Read a list of owner's pages from disk
prepare_write Prepare a write operation (used by disk-based filesystems)
commit_write Complete a write operation (used by disk-based filesystems)
bmap Get a logical block number from a file block index
invalidatepage Invalidate owner's pages (used when truncating the file)
releasepage Used by journaling filesystems to prepare the release of a page
direct_IO Direct I/O transfer of the owner's pages (bypassing the page cache)
15.1.2. The Radix Tree
15.1.3. Page Cache Handling Functions
15.1.3.1. Finding a page
find_get_page( )
find_get_pages( )
15.1.3.2. Adding a page
The add_to_page_cache( ) function inserts a new page descriptor in the page cache
radix_tree_insert( )
15.1.3.3. Removing a page
The remove_from_page_cache( ) function removes a page descriptor from the page cache
radix_tree_delete( )
15.1.3.4. Updating a page
The read_cache_page( ) function ensures that the cache includes an up-to-date version of a given page.
15.1.4. The Tags of the Radix Tree
The radix_tree_tag_set( ) function is invoked when setting the PG_dirty or the PG_writeback flag of a cached page;
The radix_tree_tag_clear( ) function is invoked when clearing the PG_dirty or the PG_writeback flag of a cached page;
15.2. Storing Blocks in the Page Cache
Formally, a buffer page is a page of data associated with additional descriptors called "buffer heads,"
whose main purpose is to quickly locate the disk address of each individual block in the page. In
fact, the chunks of data stored in a page belonging to the page cache are not necessarily adjacent on disk.
15.2.1. Block Buffers and Buffer Heads
buffer_head
Table 15-4. The fields of a buffer head
Type Field Description
unsigned long b_state Buffer status flags
struct buffer_head * b_this_page Pointer to the next element in the buffer page's list
struct page * b_page Pointer to the descriptor of the buffer page holding this block
atomic_t b_count Block usage counter
u32 b_size Block size
sector_t b_blocknr Block number relative to the block device (logical block number)
char * b_data Position of the block inside the buffer page
struct block_device * b_bdev Pointer to block device descriptor
bh_end_io_t * b_end_io I/O completion method
void * b_private Pointer to data for the I/O completion method
struct list_head b_assoc_buffers Pointers for the list of indirect blocks associated with an inode(see the section "The address_space Object" earlier in this chapter)
15.2.2. Managing the Buffer Heads
Buffer heads have their own slab allocator cache, whose kmem_cache_s descriptor is stored in the bh_cachep variable.
The alloc_buffer_head( ) and free_buffer_head( ) functions are used to get and release a buffer head, respectively.
_ _getblk( ) / _ _bforget( )
15.2.3. Buffer Pages
Figure 15-2. A buffer page including four buffers and their buffer heads
15.2.4. Allocating Block Device Buffer Pages
grow_dev_page( )
15.2.5. Releasing Block Device Buffer Pages
try_to_release_page( )
15.2.6. Searching Blocks in the Page Cache
15.2.6.1. The _ _find_get_block( ) function
15.2.6.2. The _ _getblk( ) function
15.2.6.3. The _ _bread( ) function
15.2.7. Submitting Buffer Heads to the Generic Block Layer
15.2.7.1. The submit_bh( ) function
15.2.7.2. The ll_rw_block( ) function
15.3. Writing Dirty Pages to Disk
15.3.1. The pdflush Kernel Threads
Each pdflush kernel thread has a pdflush_work descriptor (see Table 15-6). The descriptors of idle
pdflush kernel threads are collected in the pdflush_list list; the pdflush_lock spin lock protects that
list from concurrent accesses in multiprocessor systems. The nr_pdflush_threads variable[*] stores
the total number of pdflush kernel threads (idle and busy). Finally, the last_empty_jifs variable
stores the time (in jiffies) at which the pdflush_list list last became empty.
Table 15-6. The fields of the pdflush_work descriptor
Type Field Description
struct task_struct * who Pointer to kernel thread descriptor
void(*)(unsigned long) fn Callback function to be executed by the kernel thread
unsigned long arg0 Argument to callback function
struct list_head list Links for the pdflush_list list
unsigned long when_i_went_to_sleep Time in jiffies when kernel thread became available
15.3.2. Looking for Dirty Pages To Be Flushed
The wakeup_bdflush( ) function receives as argument the number of dirty pages in the page cache that should be flushed;
The background_writeout( ) function acts on a single parameter: nr_pages, the minimum number of pages that should be flushed to disk
15.3.3. Retrieving Old Dirty Pages
The job of retrieving old dirty pages is delegated to a pdflush kernel thread that is periodically
woken up. During the kernel initialization, the page_writeback_init( ) function sets up the wb_timer
dynamic timer so that it decays after dirty_writeback_centisecs hundredths of a second (usually 500,
but this value can be adjusted by writing in the /proc/sys/vm/dirty_writeback_centisecs file). The
timer function, which is called wb_timer_fn( ), essentially invokes the pdflush_operation( )
function passing to it the address of the wb_kupdate( ) callback function.
15.4. The sync( ), fsync( ), and fdatasync( ) System Calls
15.4.1. The sync ( ) System Call
The service routine sys_sync( ) of the sync( ) system call invokes a series of auxiliary functions:
wakeup_bdflush(0);
sync_inodes(0);
sync_supers( );
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);
15.4.2. The fsync ( ) and fdatasync ( ) System Calls
The fsync( ) system call forces the kernel to write to disk all dirty buffers that belong to the file
specified by the fd file descriptor parameter (including the buffer containing its inode, if necessary).
The corresponding service routine derives the address of the file object and then invokes the fsync
method. Usually, this method ends up invoking the _ _writeback_single_inode( ) function to write
back both the dirty pages associated with the selected inode and the inode itself (see the section
"Looking for Dirty Pages To Be Flushed" earlier in this chapter).
The fdatasync( ) system call is very similar to fsync( ), but writes to disk only the buffers that
contain the file's data, not those that contain inode information. Because Linux 2.6 does not have a
specific file method for fdatasync( ), this system call uses the fsync method and is thus identical to fsync( ).
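From user space, the two system calls are used in the same way; a minimal round trip might look like the following sketch (the `demo_fsync( )` helper is ours, not a kernel or library function):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write some data to a temporary file, then force it to disk with
 * fsync( ) and fdatasync( ). Returns 1 if both calls succeed. */
int demo_fsync(void)
{
    FILE *fp = tmpfile();
    if (!fp)
        return 0;
    const char msg[] = "dirty data";
    fwrite(msg, 1, sizeof msg, fp);
    fflush(fp);                       /* push C-library buffers down to the kernel */
    int fd = fileno(fp);
    int ok = (fsync(fd) == 0)         /* flush data blocks and inode metadata */
          && (fdatasync(fd) == 0);    /* flush data; metadata only if needed */
    fclose(fp);
    return ok;
}
```

As the text notes, on Linux 2.6 both calls end up in the same fsync file method, so the distinction matters mainly for portability.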
Chapter 16. Accessing Files
There are many different ways to access a file. In this chapter we will consider the following cases:
1) Canonical mode
2) Synchronous mode
3) Memory mapping mode
4) Direct I/O mode
5) Asynchronous mode
16.1. Reading and Writing a File
16.1.1. Reading from a File
generic_file_read( )
The first descriptor is stored in the local variable local_iov of type iovec; it contains the address (buf) and the length (count) of the User
Mode buffer that shall receive the data read from the file.
The second descriptor is stored in the local variable kiocb of type kiocb; it is used to keep track of the completion status of an ongoing
synchronous or asynchronous I/O operation.
16.1.1.1. The readpage method for regular files
int ext3_readpage(struct file *file, struct page *page)
{
return mpage_readpage(page, ext3_get_block);
}
The mpage_readpage( ) function chooses between two different strategies when reading a page from disk.
16.1.1.2. The readpage method for block device files
It is implemented by the blkdev_readpage( ) function,
which calls block_read_full_page( ):
int blkdev_readpage(struct file * file, struct page * page)
{
return block_read_full_page(page, blkdev_get_block);
}
16.1.2. Read-Ahead of Files
Read-ahead consists of reading several adjacent pages of data of a regular file or block device file
before they are actually requested
The main data structure used by the read-ahead algorithm is the file_ra_state descriptor whose
fields are listed in Table 16-3. Each file object includes such a descriptor in its f_ra field.
Table 16-3. The fields of the file_ra_state descriptor
Type Field Description
unsigned long start Index of first page in the current window
unsigned long size Number of pages included in the current window
(-1 for read-ahead temporarily disabled, 0 for empty current window)
unsigned long flags Flags used to control the read-ahead
unsigned long cache_hit Number of consecutive cache hits
(pages requested by the process and found in the page cache)
unsigned long prev_page Index of the last page requested by the process
unsigned long ahead_start Index of the first page in the ahead window
unsigned long ahead_size Number of pages in the ahead window (0 for an empty ahead window)
unsigned long ra_pages Maximum size in pages of a read-ahead window (0 for read-ahead permanently disabled)
unsigned long mmap_hit Read-ahead hit counter (used for memory mapped files)
unsigned long mmap_miss Read-ahead miss counter (used for memory mapped files)
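The interplay between the current window and the ahead window can be sketched in C. This is a simplified model of the policy, not the kernel's exact algorithm: when sequential access reaches the ahead window, the ahead window becomes the current one and a new, larger ahead window is set up right after it, capped at ra_pages. The `ra_state` struct and `advance_window( )` function are our own simplifications.

```c
/* Simplified model of the window fields of file_ra_state (Table 16-3). */
struct ra_state {
    unsigned long start;        /* first page of the current window */
    unsigned long size;         /* pages in the current window */
    unsigned long ahead_start;  /* first page of the ahead window */
    unsigned long ahead_size;   /* pages in the ahead window */
    unsigned long ra_pages;     /* maximum read-ahead window size */
};

/* Sketch of window advancement on sustained sequential access:
 * promote the ahead window to current, then place a doubled ahead
 * window right after it (never exceeding ra_pages). */
void advance_window(struct ra_state *ra)
{
    ra->start = ra->ahead_start;
    ra->size = ra->ahead_size;
    ra->ahead_start = ra->start + ra->size;
    ra->ahead_size = 2 * ra->size;
    if (ra->ahead_size > ra->ra_pages)
        ra->ahead_size = ra->ra_pages;
}
```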
16.1.2.1. The page_cache_readahead( ) function
Figure 16-1. The flow diagram of the page_cache_readahead( ) function
16.1.2.2. The handle_ra_miss( ) function
16.1.3. Writing to a File
Many filesystems (including Ext2 and JFS) implement the write method of the file object by means of
the generic_file_write( ) function, which acts on the following parameters:
The _ _generic_file_aio_write_nolock( ) function receives four parameters
16.1.3.1. The prepare_write and commit_write methods for regular files
16.1.3.2. The prepare_write and commit_write methods for block device files
16.1.4. Writing Dirty Pages to Disk
int ext2_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
return mpage_writepages(mapping, wbc, ext2_get_block);
}
The mpage_writepages( ) function essentially performs the following actions:
16.2. Memory Mapping
As already mentioned in the section "Memory Regions" in Chapter 9, a memory region can be
associated with some portion of either a regular file in a disk-based filesystem or a block device file.
This means that an access to a byte within a page of the memory region is translated by the kernel
into an operation on the corresponding byte of the file. This technique is called memory mapping.
Two kinds of memory mapping exist:
Shared
Private
16.2.1. Memory Mapping Data Structures
Figure 16-2. Data structures for file memory mapping
File memory mapping depends on the demand paging mechanism described in the section "Demand
Paging" in Chapter 9. In fact, a newly established memory mapping is a memory region that doesn't
include any page; as the process references an address inside the region, a Page Fault occurs and
the Page Fault handler checks whether the nopage method of the memory region is defined. If nopage
is not defined, the memory region doesn't map a file on disk; otherwise, it does, and the method
takes care of reading the page by accessing the block device. Almost all disk-based filesystems and
block device files implement the nopage method by means of the filemap_nopage( ) function.
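From user space, the demand-paged path above is triggered simply by touching a mapped byte. The sketch below (the `demo_file_mapping( )` helper is ours) maps a temporary file and reads its first byte through the mapping; the first access causes a Page Fault that the kernel resolves by reading the page from the file:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a temporary file into memory and read its first byte through
 * the mapping. Returns the first mapped character, or -1 on error. */
int demo_file_mapping(void)
{
    FILE *fp = tmpfile();
    if (!fp)
        return -1;
    fputs("linux", fp);
    fflush(fp);                       /* make sure the 5 bytes reach the file */
    char *map = mmap(NULL, 5, PROT_READ, MAP_SHARED, fileno(fp), 0);
    if (map == MAP_FAILED) {
        fclose(fp);
        return -1;
    }
    int c = map[0];                   /* first access: demand paging kicks in */
    munmap(map, 5);
    fclose(fp);
    return c;
}
```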
16.2.2. Creating a Memory Mapping
To describe how a memory mapping is created, we refer to the enumeration used to describe do_mmap_pgoff( ) and point out the additional steps performed under the new condition.
16.2.3. Destroying a Memory Mapping
The sys_munmap( ) service routine of the system call essentially invokes the do_munmap( ) function
already described in the section "Releasing a Linear Address Interval" in Chapter 9.
16.2.4. Demand Paging for Memory Mapping
The filemap_nopage( ) function executes the following steps:
16.2.5. Flushing Dirty Memory Mapping Pages to Disk
The msync( ) system call can be used by a process to flush to disk dirty pages belonging to a shared memory mapping
16.2.6. Non-Linear Memory Mappings
To create a non-linear memory mapping, the User Mode application first creates a normal shared
memory mapping with the mmap( ) system call. Then, the application remaps some of the pages in
the memory mapping region by invoking remap_file_pages( ). The sys_remap_file_pages( )
service routine of the system call receives four parameters:
16.3. Direct I/O Transfers
generic_file_direct_IO( )
16.4. Asynchronous I/O
16.4.1. Asynchronous I/O in Linux 2.6
Table 16-5. Linux system calls for asynchronous I/O
System call Description
io_setup( ) Initializes an asynchronous context for the current process
io_submit( ) Submits one or more asynchronous I/O operations
io_getevents( ) Gets the completion status of some outstanding asynchronous I/O operations
io_cancel( ) Cancels an outstanding I/O operation
io_destroy( ) Removes an asynchronous context for the current process
16.4.1.2. Submitting the asynchronous I/O operations
To start some asynchronous I/O operations, the application invokes the io_submit( ) system call.
The system call has three parameters:
Chapter 17. Page Frame Reclaiming
17.1. The Page Frame Reclaiming Algorithm
One of the goals of page frame reclaiming is thus to conserve a minimal pool of free page frames so
that the kernel may safely recover from "low on memory" conditions.
17.1.1. Selecting a Target Page
The objective of the page frame reclaiming algorithm (PFRA ) is to pick up page frames and make them free
Table 17-1. The types of pages considered by the PFRA
Type of pages Description Reclaim action
Unreclaimable Free pages (included in buddy system lists)
Reserved pages (with PG_reserved flag set)
Pages dynamically allocated by the kernel
Pages in the Kernel Mode stacks of the processes (No reclaiming allowed or needed)
Temporarily locked pages (with PG_locked flag set)
Memory locked pages (in memory regions with VM_LOCKED flag set)
Swappable Anonymous pages in User Mode address spaces
Mapped pages of tmpfs filesystem (e.g., pages of IPC shared memory) Save the page contents in a swap area
Mapped pages in User Mode address spaces
Pages included in the page cache and containing data of disk files
Syncable Block device buffer pages
Pages of some disk caches (e.g., the inode cache ) Synchronize the page with its image on disk, if necessary
Discardable Unused pages included in memory caches (e.g., slab allocator caches) Nothing to be done
Unused pages of the dentry cache
In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages
in the User Mode address spaces belonging to file memory mappings are mapped, as well as any
other page included in the page cache. In almost all cases, mapped pages are syncable: in order to
reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the
page contents in the corresponding disk file.
Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a
process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In
order to reclaim the page frame, the kernel must save the page contents in a dedicated disk
partition or disk file called "swap area" (see the later section "Swapping"); therefore, all anonymous
pages are swappable.
Usually, the pages of special filesystems are not reclaimable. The only exceptions are the pages of
the tmpfs special filesystem, which can be reclaimed by saving them in a swap area. As we'll see in
Chapter 19, the tmpfs special filesystem is used by the IPC shared memory mechanism.
17.1.2. Design of the PFRA
Looking too close to the trees' leaves might lead us to miss the whole forest. Therefore, let us
present a few general rules adopted by the PFRA. These rules are embedded in the functions that
will be described later in this chapter.
Free the "harmless" pages first
Make all pages of a User Mode process reclaimable
Reclaim a shared page frame by unmapping at once all page table entries that reference it
Reclaim "unused" pages only
17.2. Reverse Mapping
The technique used in Linux 2.6 is named object-based reverse mapping. Essentially, for any reclaimable User Mode page, the kernel
stores the backward links to all memory regions in the system (the "objects") that include the page
itself. Each memory region descriptor stores a pointer to a memory descriptor, which in turn
includes a pointer to a Page Global Directory.
int try_to_unmap(struct page *page)
{
int ret;
if (PageAnon(page))
ret = try_to_unmap_anon(page);
else
ret = try_to_unmap_file(page);
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
17.2.1. Reverse Mapping for Anonymous Pages
Figure 17-1. Object-based reverse mapping for anonymous pages
17.2.1.1. The try_to_unmap_anon( ) function
The try_to_unmap_anon( ) function invokes try_to_unmap_one( ) on every memory region in the page's anon_vma list.
17.2.1.2. The try_to_unmap_one( ) function
17.2.2. Reverse Mapping for Mapped Pages
17.2.2.1. The priority search tree
17.2.2.2. The try_to_unmap_file( ) function
17.3. Implementing the PFRA
Figure 17-3. The main functions of the PFRA
17.3.1. The Least Recently Used (LRU) Lists
17.3.1.1. Moving pages across the LRU lists
Figure 17-4. Moving pages across the LRU lists
17.3.1.2. The mark_page_accessed( ) function
17.3.1.3. The page_referenced( ) function
17.3.1.4. The refill_inactive_zone( ) function
Table 17-2. The fields of the scan_control descriptor
Type Field Description
unsigned long nr_to_scan Target number of pages to be scanned in the active list.
unsigned long nr_scanned Number of inactive pages scanned in the current iteration.
unsigned long nr_reclaimed Number of pages reclaimed in the current iteration.
unsigned long nr_mapped Number of pages referenced in the User Mode address spaces.
int nr_to_reclaim Target number of pages to be reclaimed.
unsigned int priority Priority of the scanning, ranging between 12 and 0. Lower priority implies scanning more pages.
unsigned int gfp_mask GFP mask passed from calling function.
int may_writepage If set, writing a dirty page to disk is allowed (only for laptop mode).
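The priority field drives the scan loop: scanning starts at priority 12 and, if not enough pages are freed, repeats with lower priority (scanning more pages each pass). The following sketch models that loop; `scan_with_priorities( )`, `reclaim_pass_t`, and `toy_pass( )` are our stand-ins, not kernel symbols (`reclaim_pass` plays the role of shrink_caches( )).

```c
/* Sketch of the priority-driven scan used by try_to_free_pages( ):
 * priority ranges from 12 down to 0; a lower priority implies
 * scanning more pages. Stop as soon as the target is reached. */
typedef int (*reclaim_pass_t)(int priority);

int scan_with_priorities(reclaim_pass_t reclaim_pass, int target)
{
    int reclaimed = 0;
    for (int priority = 12; priority >= 0; priority--) {
        reclaimed += reclaim_pass(priority);
        if (reclaimed >= target)
            break;
    }
    return reclaimed;
}

/* A toy pass: lower priority scans more pages, so it reclaims more. */
int toy_pass(int priority)
{
    return 12 - priority;   /* 0 pages at priority 12, 12 at priority 0 */
}
```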
17.3.2. Low On Memory Reclaiming
17.3.2.1. The free_more_memory( ) function
17.3.2.2. The try_to_free_pages( ) function
17.3.2.3. The shrink_caches( ) function
17.3.2.4. The shrink_zone( ) function
17.3.2.5. The shrink_cache( ) function
17.3.2.6. The shrink_list( ) function
Figure 17-5. The page reclaiming logic of the shrink_list( ) function
17.3.2.7. The pageout( ) function
The pageout( ) function is invoked by shrink_list( ) when a dirty page must be written to disk
17.3.3. Reclaiming Pages of Shrinkable Disk Caches
17.3.3.1. Reclaiming page frames from the dentry cache
The shrink_dcache_memory( ) function is the shrinker function for the dentry cache;
17.3.3.2. Reclaiming page frames from the inode cache
17.3.4. Periodic Reclaiming
The PFRA performs periodic reclaiming by using two different mechanisms: the kswapd kernel
threads, which invoke shrink_zone( ) and shrink_slab( ) to reclaim pages from the LRU lists, and
the cache_reap( ) function, which is invoked periodically to reclaim unused slabs from the slab allocator.
17.3.4.1. The kswapd kernel threads
Once woken up, each kswapd kernel thread performs the following steps:
1: Invokes finish_wait( ) to remove the kernel thread from the node's kswapd_wait wait queue
(see the section "How Processes Are Organized" in Chapter 3).
2: Invokes balance_pgdat( ) to perform the memory reclaiming on the kswapd's memory node
(see below).
3: Invokes prepare_to_wait( ) to set the process in the TASK_INTERRUPTIBLE state and to put it to
sleep in the node's kswapd_wait wait queue.
4: Invokes schedule( ) to yield the CPU to some other runnable process
17.3.4.2. The cache_reap( ) function
The PFRA must also reclaim the pages owned by the slab allocator caches (see the section "Memory
Area Management " in Chapter 8). To do this, it relies on the cache_reap( ) function, which is
periodically scheduled approximately once every two seconds in the predefined events work queue
(see the section "Work Queues" in Chapter 4). The address of the cache_reap( ) function is stored in
the func field of the reap_work per-CPU variable of type work_struct.
17.3.5. The Out of Memory Killer
The out_of_memory( ) function is invoked by _ _alloc_pages( ) when the free memory is very low
and the PFRA has not succeeded in reclaiming any page frames (see the section "The Zone Allocator"
in Chapter 8). The function invokes select_bad_process( ) to select a victim among the existing
processes, then invokes oom_kill_process( ) to perform the sacrifice
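Victim selection can be sketched as follows. This is a deliberate simplification: the real badness( ) heuristic in select_bad_process( ) weighs memory size against run time, priority, and capabilities, while this sketch (`select_victim( )` and `struct proc_sketch` are ours) simply picks the process mapping the most memory and spares init:

```c
#include <stddef.h>

/* Hypothetical per-process record; only the fields this sketch needs. */
struct proc_sketch {
    int pid;
    unsigned long total_vm;   /* pages mapped by the process */
};

/* Pick a victim in the spirit of select_bad_process( ): the process
 * using the most memory, never the init process (PID 1). Returns the
 * victim's PID, or -1 if no candidate exists. */
int select_victim(const struct proc_sketch *procs, size_t n)
{
    int victim = -1;
    unsigned long worst = 0;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].pid == 1)        /* never kill init */
            continue;
        if (procs[i].total_vm > worst) {
            worst = procs[i].total_vm;
            victim = procs[i].pid;
        }
    }
    return victim;
}
```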
17.3.6. The Swap Token
17.4. Swapping
17.4.1. Swap Area
The first page slot of a swap area is used to persistently store some information about the
swap area; its format is described by the swap_header union composed of two structures
magic
info
17.4.1.1. Creating and activating a swap area
Each swap area consists of one or more swap extents , each of which is represented by a
swap_extent descriptor
17.4.1.2. How to distribute pages in the swap areas
17.4.2. Swap Area Descriptor
swap_info_struct descriptor in memory
Table 17-3. Fields of a swap area descriptor
Type Field Description
unsigned int flags Swap area flags
spinlock_t sdev_lock Spin lock protecting the swap area
struct file * swap_file Pointer to the file object of the regular file or device file that stores the swap area
struct block_device * bdev Descriptor of the block device containing the swap area
struct list_head extent_list Head of the list of extents that compose the swap area
int nr_extents Number of extents composing the swap area
struct swap_extent * curr_swap_extent Pointer to the most recently used extent descriptor
unsigned int old_block_size Natural block size of the partition containing the swap area
unsigned short * swap_map Pointer to an array of counters, one for each swap area page slot
unsigned int lowest_bit First page slot to be scanned when searching for a free one
unsigned int highest_bit Last page slot to be scanned when searching for a free one
unsigned int cluster_next Next page slot to be scanned when searching for a free one
unsigned int cluster_nr Number of free page slot allocations before restarting from the beginning
int prio Swap area priority
int pages Number of usable page slots
unsigned long max Size of swap area in pages
unsigned long inuse_pages Number of used page slots in the swap area
int next Pointer to next swap area descriptor
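The swap_map and cluster_next fields drive page slot allocation. The sketch below models the free-slot search of scan_swap_map( ) in a simplified form (no clustering counter or wrap-around refinements; `find_free_slot( )` is our own name): each entry of swap_map is a usage counter, and a zero counter marks a free slot.

```c
/* Sketch of the free-slot search performed by scan_swap_map( ):
 * start at cluster_next to keep allocations clustered, then wrap
 * around (slot 0 holds the swap header and is never allocated).
 * Returns a free slot index, or -1 if the swap area is full. */
int find_free_slot(const unsigned short *swap_map, int max, int cluster_next)
{
    for (int i = cluster_next; i < max; i++)
        if (swap_map[i] == 0)
            return i;
    for (int i = 1; i < cluster_next; i++)
        if (swap_map[i] == 0)
            return i;
    return -1;
}
```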
17.4.3. Swapped-Out Page Identifier
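A swapped-out page identifier packs the swap area index ("type") and the page slot offset into a value that can be stored in a not-present Page Table entry. The exact bit layout is architecture-specific; the sketch below mirrors the 32-bit x86 layout in Linux 2.6 (bit 0, the Present flag, kept clear; a few low bits for the type; the rest for the offset). The `_sketch` names are ours.

```c
/* Sketch of the swapped-out page identifier encoding (i386-style):
 * bit 0 clear, swap area index in bits 1-5, slot offset from bit 8. */
typedef unsigned long swp_entry_sketch;

swp_entry_sketch make_swp_entry(unsigned type, unsigned long offset)
{
    return ((swp_entry_sketch)type << 1) | (offset << 8);
}

unsigned swp_type_of(swp_entry_sketch e)
{
    return (e >> 1) & 0x1f;
}

unsigned long swp_offset_of(swp_entry_sketch e)
{
    return e >> 8;
}
```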
17.4.4. Activating and Deactivating a Swap Area
17.4.4.1. The sys_swapon( ) service routine
17.4.4.2. The sys_swapoff( ) service routine
17.4.4.3. The try_to_unuse( ) function
17.4.5. Allocating and Releasing a Page Slot
17.4.5.1. The scan_swap_map( ) function
17.4.5.2. The get_swap_page( ) function
17.4.5.3. The swap_free( ) function
17.4.6. The Swap Cache
17.4.6.1. Swap cache implementation
17.4.6.2. Swap cache helper functions
17.4.7. Swapping Out Pages
17.4.7.1. Inserting the page frame in the swap cache
17.4.7.2. Updating the Page Table entries
17.4.7.3. Writing the page into the swap area
17.4.7.4. Removing the page frame from the swap cache
17.4.8. Swapping in Pages
17.4.8.1. The do_swap_page( ) function
17.4.8.2. The read_swap_cache_async( ) function
Chapter 18. The Ext2 and Ext3 Filesystems
18.1. General Characteristics of Ext2
18.2. Ext2 Disk Data Structures
The first block in each Ext2 partition is never managed by the Ext2 filesystem, because it is reserved for the partition boot sector
The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 18-1.
How many block groups are there? Well, that depends both on the partition size and the block size.
The main constraint is that the block bitmap, which is used to identify the blocks that are used and
free inside a group, must be stored in a single block. Therefore, in each block group, there can be at
most 8×b blocks, where b is the block size in bytes. Thus, the total number of block groups is
roughly s/(8×b), where s is the partition size in blocks.
For example, let's consider a 32-GB Ext2 partition with a 4-KB block size. In this case, each 4-KB
block bitmap describes 32K data blocks, that is, 128 MB. Therefore, at most 256 block groups are
needed. Clearly, the smaller the block size, the larger the number of block groups.
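The arithmetic of the example can be spelled out in a few lines of C (the `ext2_block_groups( )` helper is ours, for illustration):

```c
/* With block size b (bytes), a single-block bitmap covers 8*b blocks,
 * so a partition of s blocks needs roughly s/(8*b) block groups,
 * rounded up. */
unsigned long ext2_block_groups(unsigned long long partition_bytes,
                                unsigned long block_size)
{
    unsigned long long s = partition_bytes / block_size;  /* partition size in blocks */
    unsigned long long per_group = 8ULL * block_size;     /* blocks covered by one bitmap block */
    return (unsigned long)((s + per_group - 1) / per_group);
}
```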
18.2.1. Superblock
An Ext2 disk superblock is stored in an ext2_super_block structure
Table 18-1. The fields of the Ext2 superblock
Type Field Description
_ _le32 s_inodes_count Total number of inodes
_ _le32 s_blocks_count Filesystem size in blocks
_ _le32 s_r_blocks_count Number of reserved blocks
_ _le32 s_free_blocks_count Free blocks counter
_ _le32 s_free_inodes_count Free inodes counter
_ _le32 s_first_data_block Number of first useful block (always 1)
_ _le32 s_log_block_size Block size
_ _le32 s_log_frag_size Fragment size
_ _le32 s_blocks_per_group Number of blocks per group
_ _le32 s_frags_per_group Number of fragments per group
_ _le32 s_inodes_per_group Number of inodes per group
_ _le32 s_mtime Time of last mount operation
_ _le32 s_wtime Time of last write operation
_ _le16 s_mnt_count Mount operations counter
_ _le16 s_max_mnt_count Number of mount operations before check
_ _le16 s_magic Magic signature
_ _le16 s_state Status flag
_ _le16 s_errors Behavior when detecting errors
_ _le16 s_minor_rev_level Minor revision level
_ _le32 s_lastcheck Time of last check
_ _le32 s_checkinterval Time between checks
_ _le32 s_creator_os OS where filesystem was created
_ _le32 s_rev_level Revision level of the filesystem
_ _le16 s_def_resuid Default UID for reserved blocks
_ _le16 s_def_resgid Default user group ID for reserved blocks
_ _le32 s_first_ino Number of first nonreserved inode
_ _le16 s_inode_size Size of on-disk inode structure
_ _le16 s_block_group_nr Block group number of this superblock
_ _le32 s_feature_compat Compatible features bitmap
_ _le32 s_feature_incompat Incompatible features bitmap
_ _le32 s_feature_ro_compat Read-only compatible features bitmap
_ _u8 [16] s_uuid 128-bit filesystem identifier
char [16] s_volume_name Volume name
char [64] s_last_mounted Pathname of last mount point
_ _le32 s_algorithm_usage_bitmap Used for compression
_ _u8 s_prealloc_blocks Number of blocks to preallocate
_ _u8 s_prealloc_dir_blocks Number of blocks to preallocate for directories
_ _u16 s_padding1 Alignment to word
_ _u32 [204] s_reserved Nulls to pad out 1,024 bytes
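A minimal probe of this structure can be sketched as follows: the superblock starts at byte offset 1024 of the partition, and the s_magic field (0xEF53) lies 56 bytes into it. The `_sketch` struct below declares only the fields needed for the check and assumes a little-endian host (Ext2 stores these fields little-endian on disk); it is not the kernel's declaration.

```c
#include <string.h>
#include <stdint.h>

/* Partial sketch of ext2_super_block: the first two fields of
 * Table 18-1, padding up to s_magic (byte offset 56). */
struct ext2_sb_sketch {
    uint32_t s_inodes_count;
    uint32_t s_blocks_count;
    uint8_t  pad[48];        /* remaining fields up to s_magic */
    uint16_t s_magic;        /* 0xEF53 on a valid Ext2 filesystem */
};

/* Check whether a raw partition image carries an Ext2 superblock.
 * Assumes a little-endian host, so no byte swapping is done. */
int looks_like_ext2(const unsigned char *partition)
{
    struct ext2_sb_sketch sb;
    memcpy(&sb, partition + 1024, sizeof sb);
    return sb.s_magic == 0xEF53;
}
```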
18.2.2. Group Descriptor and Bitmap
ext2_group_desc
Table 18-2. The fields of the Ext2 group descriptor
Type Field Description
_ _le32 bg_block_bitmap Block number of block bitmap
_ _le32 bg_inode_bitmap Block number of inode bitmap
_ _le32 bg_inode_table Block number of first inode table block
_ _le16 bg_free_blocks_count Number of free blocks in the group
_ _le16 bg_free_inodes_count Number of free inodes in the group
_ _le16 bg_used_dirs_count Number of directories in the group
_ _le16 bg_pad Alignment to word
_ _le32 [3] bg_reserved Nulls to pad out 24 bytes
18.2.3. Inode Table
The inode table consists of a series of consecutive blocks, each of which contains a predefined
number of inodes. The block number of the first block of the inode table is stored in the
bg_inode_table field of the group descriptor.
Each Ext2 inode is an ext2_inode structure whose fields are illustrated in Table 18-3.
Table 18-3. The fields of an Ext2 disk inode
Type Field Description
_ _le16 i_mode File type and access rights
_ _le16 i_uid Owner identifier
_ _le32 i_size File length in bytes
_ _le32 i_atime Time of last file access
_ _le32 i_ctime Time that inode last changed
_ _le32 i_mtime Time that file contents last changed
_ _le32 i_dtime Time of file deletion
_ _le16 i_gid User group identifier
_ _le16 i_links_count Hard links counter
_ _le32 i_blocks Number of data blocks of the file
_ _le32 i_flags File flags
union osd1 Specific operating system information
_ _le32 [EXT2_N_BLOCKS] i_block Pointers to data blocks
_ _le32 i_generation File version (used when the file is accessed by a network filesystem)
_ _le32 i_file_acl File access control list
_ _le32 i_dir_acl Directory access control list
_ _le32 i_faddr Fragment address
union osd2 Specific operating system information
18.2.4. Extended Attributes of an Inode
The i_file_acl field of an inode points to the block containing its extended attributes. Each
extended attribute is described by an ext2_xattr_entry descriptor; the descriptor, together with the
name of the attribute, is placed at the beginning of the block, while the value of the attribute is
placed at the end of the block.
18.2.5. Access Control Lists
18.2.6. How Various File Types Use Disk Blocks
Table 18-4. Ext2 file types
File_type Description
0 Unknown
1 Regular file
2 Directory
3 Character device
4 Block device
5 Named pipe
6 Socket
7 Symbolic link
18.2.6.1. Regular file
18.2.6.2. Directory
18.2.6.3. Symbolic link
18.2.6.4. Device file, pipe, and socket
18.3. Ext2 Memory Data Structures
Table 18-6. VFS images of Ext2 data structures
Type Disk data structure Memory data structure Caching mode
Superblock ext2_super_block ext2_sb_info Always cached
Group descriptor ext2_group_desc ext2_group_desc Always cached
Block bitmap Bit array in block Bit array in buffer Dynamic
Inode bitmap Bit array in block Bit array in buffer Dynamic
Inode ext2_inode ext2_inode_info Dynamic
Data block Array of bytes VFS buffer Dynamic
Free inode ext2_inode None Never
Free block Array of bytes None Never
18.3.1. The Ext2 Superblock Object
As stated in the section "Superblock Objects" in Chapter 12, the s_fs_info field of the VFS
superblock points to a structure containing filesystem-specific data. In the case of Ext2, this field
points to a structure of type ext2_sb_info
18.3.2. The Ext2 inode Object
When the VFS accesses an Ext2 disk inode, it creates a corresponding inode descriptor of type ext2_inode_info
18.4. Creating the Ext2 Filesystem
There are generally two stages to creating a filesystem on a disk. The first step is to format it so that
the disk driver can read and write blocks on it.
The second step involves creating a filesystem, which means
setting up the structures described in detail earlier in this chapter
18.5. Ext2 Methods
18.5.1. Ext2 Superblock Operations
The addresses of the superblock methods are stored in the ext2_sops array of pointers
18.5.2. Ext2 inode Operations
The addresses of the Ext2 methods for regular files and directories are stored in the ext2_file_inode_operations
and ext2_dir_inode_operations tables, respectively.
18.5.3. Ext2 File Operations
The addresses of these methods are stored in the ext2_file_operations table.
18.6. Managing Ext2 Disk Space
18.6.1. Creating inodes
The ext2_new_inode( ) function creates an Ext2 disk inode
18.6.2. Deleting inodes
The ext2_free_inode( ) function deletes a disk inode
18.6.3. Data Blocks Addressing
18.6.5. Allocating a Data Block
ext2_get_block( )
ext2_alloc_block( )
18.6.6. Releasing a Data Block
ext2_truncate( ),
18.7. The Ext3 Filesystem
18.7.1. Journaling Filesystems
The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the
whole filesystem by looking instead in a special disk area called the journal, which contains the
most recent disk write operations. Remounting a journaling filesystem after a system failure is a
matter of a few seconds.
18.7.2. The Ext3 Journaling Filesystem
it offers three different journaling modes
Journal
Ordered
Writeback
18.7.3. The Journaling Block Device Layer
18.7.3.1. Log records
18.7.3.2. Atomic operation handles
18.7.3.3. Transactions
18.7.4. How Journaling Works
Chapter 19. Process Communication
As usual, application programmers have a variety of needs that call for different communication
mechanisms. Here are the basic mechanisms that Unix systems offer to allow interprocess
communication:
Pipes and FIFOs (named pipes)
Semaphores
Messages
Shared memory regions
Sockets
19.1. Pipes
19.1.1. Using a Pipe
In Linux, popen( ) and pclose( ) are included in the C library. The popen( ) function receives two
parameters: the filename pathname of an executable file and a type string specifying the direction
of the data transfer
19.1.2. Pipe Data Structures
pipe_inode_info
19.1.2.1. The pipefs special filesystem
19.1.3. Creating and Destroying a Pipe
The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe() function. To create a new pipe
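From user space, the result of pipe( ) is a pair of file descriptors: fd[0] for reading and fd[1] for writing. A minimal round trip (the `demo_pipe( )` helper is ours):

```c
#include <unistd.h>
#include <string.h>

/* Create a pipe, write into its writing end, and read the data back
 * from its reading end. Returns 1 on a successful round trip. */
int demo_pipe(void)
{
    int fd[2];
    char buf[16];
    if (pipe(fd) < 0)
        return 0;
    if (write(fd[1], "ping", 4) != 4)          /* fd[1]: writing end */
        return 0;
    close(fd[1]);
    ssize_t r = read(fd[0], buf, sizeof buf);  /* fd[0]: reading end */
    close(fd[0]);
    return r == 4 && memcmp(buf, "ping", 4) == 0;
}
```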
19.1.4. Reading from a Pipe
The pipe_read( ) function is quite involved
19.1.5. Writing into a Pipe
A process wishing to put data into a pipe issues a write( ) system call, specifying the file descriptor
for the writing end of the pipe. The kernel satisfies this request by invoking the write method of the
proper file object; the corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.
19.2. FIFOs
There are only two significant differences between FIFOs and pipes:
1): FIFO inodes appear on the system directory tree rather than on the pipefs special filesystem
2): FIFOs are a bidirectional communication channel; that is, it is possible to open a FIFO in read/write mode.
19.2.1. Creating and Opening a FIFO
A process creates a FIFO by issuing a mknod( )[*] system call (see the section "Device Files" in
Chapter 13), passing to it as parameters the pathname of the new FIFO and the value S_IFIFO
(0x10000) logically ORed with the permission bit mask of the new file. POSIX introduces a function
named mkfifo( ) specifically to create a FIFO. This call is implemented in Linux, as in System V
Release 4, as a C library function that invokes mknod().
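The mkfifo( ) wrapper can be exercised as follows; the pathname built here is arbitrary, and the `demo_mkfifo( )` helper is ours:

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

/* Create a FIFO with mkfifo( ), verify that the resulting inode is of
 * FIFO type, and remove it. Returns 1 on success. */
int demo_mkfifo(void)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/demo_fifo_%d", (int)getpid());
    if (mkfifo(path, 0600) < 0)
        return 0;
    struct stat st;
    int ok = (stat(path, &st) == 0) && S_ISFIFO(st.st_mode);
    unlink(path);                 /* remove the FIFO inode */
    return ok;
}
```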
The fifo_open( ) function initializes the data structures specific to the FIFO; in particular
19.3. System V IPC
IPC
1): Synchronize itself with other processes by means of semaphores
2): Send messages to other processes or receive messages from them
3): Share a memory area with other processes
19.3.1. Using an IPC Resource
IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending
on whether the new resource is a semaphore, a message queue, or a shared memory region.
Table 19-8. The fields of the ipc_ids data structure
Type Field Description
int in_use Number of allocated IPC resources
int max_id Maximum slot index in use
unsigned short seq Slot usage sequence number for the next allocation
unsigned short seq_max Maximum slot usage sequence number
struct semaphore sem Semaphore protecting the ipc_ids data structure
struct ipc_id_ary nullentry Fake data structure pointed to by the entries field if this IPC resource
cannot be initialized (normally not used)
struct ipc_id_ary * entries Pointer to the ipc_id_ary data structure for this resource
The ipc_id_ary data structure consists of two fields: p and size. The p field is an array of pointers to
kern_ipc_perm data structures, one for every allocatable resource. The size field is the size of this array.
Each kern_ipc_perm data structure is associated with an IPC resource and contains the fields shown
in Table 19-9.
Table 19-9. The fields in the kern_ipc_ perm structure
Type Field Description
spinlock_t lock Spin lock protecting the IPC resource descriptor
int deleted Flag set if the resource has been released
int key IPC key
unsigned int uid Owner user ID
unsigned int gid Owner group ID
unsigned int cuid Creator user ID
unsigned int cgid Creator group ID
unsigned short mode Permission bit mask
unsigned long seq Slot usage sequence number
void * security Pointer to a security structure (used by SELinux)
19.3.2. The ipc( ) System Call
19.3.3. IPC Semaphores
Table 19-10. The fields in the sem_array data structure
Type Field Description
struct kern_ipc_perm sem_perm kern_ipc_perm data structure
long sem_otime Timestamp of last semop( )
long sem_ctime Timestamp of last change
struct sem * sem_base Pointer to first sem structure
struct sem_queue * sem_pending Pending operations
struct sem_queue ** sem_pending_last Last pending operation
struct sem_undo * undo Undo requests
unsigned long sem_nsems Number of semaphores in array
19.3.3.1. Undoable semaphore operations
19.3.3.2. The queue of pending requests
19.3.4. IPC Messages
To send a message, a process invokes the msgsnd( ) function, passing the following as parameters:
To retrieve a message, a process invokes the msgrcv( ) function, passing to it:
Table 19-12. The msg_queue data structure
Type Field Description
struct kern_ipc_perm q_perm kern_ipc_perm data structure
long q_stime Time of last msgsnd( )
long q_rtime Time of last msgrcv( )
long q_ctime Last change time
unsigned long q_cbytes Number of bytes in queue
unsigned long q_qnum Number of messages in queue
unsigned long q_qbytes Maximum number of bytes in queue
int q_lspid PID of last msgsnd( )
int q_lrpid PID of last msgrcv( )
struct list_head q_messages List of messages in queue
struct list_head q_receivers List of processes receiving messages
struct list_head q_senders List of processes sending messages
Table 19-13. The msg_msg data structure
Type Field Description
struct list_head m_list Pointers for message list
long m_type Message type
int m_ts Message text size
struct msg_msgseg * next Next portion of the message
void * security Pointer to a security data structure (used by SELinux)
19.3.5. IPC Shared Memory
As with semaphores and message queues, the shmget( ) function is invoked to get the IPC identifier
of a shared memory region, optionally creating it if it does not already exist.
The shmat( ) function is invoked to "attach" an IPC shared memory region to a process.
The shmdt( ) function is invoked to "detach" an IPC shared memory region specified by its IPC
identifier; that is, to remove the corresponding memory region from the process's address space.
Figure 19-3. IPC shared memory data structures
Table 19-14. The fields in the shmid_kernel data structure
Type Field Description
struct kern_ipc_perm shm_perm kern_ipc_perm data structure
struct file * shm_file Special file of the segment
int id Slot index of the segment
unsigned long shm_nattch Number of current attaches
unsigned long shm_segsz Segment size in bytes
long shm_atim Last access time
long shm_dtim Last detach time
long shm_ctim Last change time
int shm_cprid PID of creator
int shm_lprid PID of last accessing process
struct user_struct * mlock_user Pointer to the user_struct descriptor of the user who locked the
shared memory resource in RAM (see the section "The clone( ), fork( ),
and vfork( ) System Calls" in Chapter 3)
19.3.5.1. Swapping out pages of IPC shared memory regions
19.3.5.2. Demand paging for IPC shared memory regions
19.4. POSIX Message Queues
Table 19-15. Library functions for POSIX message queues
Function names Description
mq_open( ) Open (optionally creating) a POSIX message queue
mq_close( ) Close a POSIX message queue (without destroying it)
mq_unlink( ) Destroy a POSIX message queue
mq_send( ), mq_timedsend( ) Send a message to a POSIX message queue; the latter function defines a time limit for the operation
mq_receive( ), mq_timedreceive( ) Fetch a message from a POSIX message queue; the latter function defines a time limit for the operation
mq_notify( ) Establish an asynchronous notification mechanism for the arrival of messages in an empty POSIX message queue
mq_getattr( ), mq_setattr( ) Respectively get and set attributes of a POSIX message queue (essentially,
whether the send and receive operations should be blocking or nonblocking)