Understanding the Linux Kernel, 3rd Edition - Notes 3.pdf

Chapter 12. The Virtual Filesystem

   The five standard Unix file types:

      1. regular files, 2. directories, 3. symbolic links, 4. device files, 5. pipes

   12.1. The Role of the Virtual Filesystem (VFS)

    Filesystems supported by the VFS may be grouped into three main classes:

      1:Disk-based filesystems

      2:Network filesystems

      3:Special filesystems

           12.1.1. The Common File Model

           Figure 12-2. Interaction between processes and VFS objects

     


           The superblock object

           The inode object

           The file object

           12.1.2. System Calls Handled by the VFS

           Table 12-1. Some system calls handled by the VFS

           System call name                                            Description
           mount( ) umount( ) umount2( )                               Mount/unmount filesystems
           sysfs( )                                                    Get filesystem information
           statfs( ) fstatfs( ) statfs64( ) fstatfs64( ) ustat( )      Get filesystem statistics
           chroot( ) pivot_root( )                                     Change root directory
           chdir( ) fchdir( ) getcwd( )                                Manipulate current directory
           mkdir( ) rmdir( )                                           Create and destroy directories
           getdents( ) getdents64( ) readdir( ) link( )                Manipulate directory entries
             unlink( ) rename( ) lookup_dcookie( )
           readlink( ) symlink( )                                      Manipulate soft links
           chown( ) fchown( ) lchown( ) chown16( )                     Modify file owner
             fchown16( ) lchown16( )
           chmod( ) fchmod( ) utime( )                                 Modify file attributes
           stat( ) fstat( ) lstat( ) access( ) oldstat( ) oldfstat( )  Read file status
             oldlstat( ) stat64( ) lstat64( ) fstat64( )
           open( ) close( ) creat( ) umask( )                          Open, close, and create files
           dup( ) dup2( ) fcntl( ) fcntl64( )                          Manipulate file descriptors
           select( ) poll( )                                           Wait for events on a set of file descriptors
           truncate( ) ftruncate( ) truncate64( ) ftruncate64( )       Change file size
           lseek( ) _llseek( )                                         Change file pointer
           read( ) write( ) readv( ) writev( ) sendfile( )             Carry out file I/O operations
             sendfile64( ) readahead( )
           io_setup( ) io_submit( ) io_getevents( )                    Asynchronous I/O (allows multiple outstanding read and write requests)
             io_cancel( ) io_destroy( )
           pread64( ) pwrite64( )                                      Seek file and access it
           mmap( ) mmap2( ) munmap( ) madvise( ) mincore( )            Handle file memory mapping
             remap_file_pages( )
           fdatasync( ) fsync( ) sync( ) msync( )                      Synchronize file data
           flock( )                                                    Manipulate file lock
           setxattr( ) lsetxattr( ) fsetxattr( ) getxattr( )           Manipulate file extended attributes
             lgetxattr( ) fgetxattr( ) listxattr( ) llistxattr( )
             flistxattr( ) removexattr( ) lremovexattr( )
             fremovexattr( )

 

      12.2. VFS Data Structures

        12.2.1. Superblock Objects

           Table 12-2. The fields of the superblock object               

           Type                                Field                                 Description     

              struct list_head            s_list              Pointers for superblock list

              dev_t                   s_dev                   Device identifier

              unsigned long           s_blocksize             Block size in bytes

              unsigned long               s_old_blocksize         Block size in bytes as reported by the underlying block device driver

              unsigned char           s_blocksize_bits            Block size in number of bits

              unsigned char           s_dirt              Modified (dirty) flag

              unsigned long long          s_maxbytes              Maximum size of the files

             

              struct file_system_type *            s_type                  Filesystem type

 

           struct super_operations *       s_op                    Superblock methods

 

           struct dquot_operations *       dq_op                   Disk quota handling methods

 

           struct quotactl_ops *      s_qcop                        Disk quota administration methods

 

           struct export_operations *          s_export_op                 Export operations used by network filesystems

 

           unsigned long           s_flags             Mount flags

 

           unsigned long           s_magic             Filesystem magic number

 

           struct dentry *         s_root              Dentry object of the filesystem's root directory

 

           struct rw_semaphore         s_umount                Semaphore used for unmounting

 

            struct semaphore            s_lock              Superblock semaphore

 

           int                 s_count             Reference counter

 

            int                         s_syncing                     Flag indicating that inodes of the superblock are being synchronized

 

           int                         s_need_sync_fs            Flag used when synchronizing the superblock's mounted filesystem

 

           atomic_t                s_active                Secondary reference counter

 

           void *                         s_security                    Pointer to superblock security structure

 

           struct xattr_handler **         s_xattr                  Pointer to superblock extended attribute structure

 

           struct list_head             s_inodes                      List of all inodes

 

           struct list_head            s_dirty             List of modified inodes

 

           struct list_head             s_io                            List of inodes waiting to be written to disk

 

           struct hlist_head                 s_anon                  List of anonymous dentries for handling remote network filesystems

 

           struct list_head            s_files             List of file objects

 

           struct block_device*        s_bdev              Pointer to the block device driver descriptor

 

           struct list_head             s_instances                  Pointers for a list of superblock objects of a given filesystem type

                                                                   (see the later section "Filesystem Type Registration")

           struct quota_info           s_dquot             Descriptor for disk quota

 

           int                         s_frozen                      Flag used when freezing the filesystem (forcing it to a consistent state)

 

 

           wait_queue_head_t             s_wait_unfrozen            Wait queue where processes sleep until the filesystem is unfrozen

 

 

           char[]                         s_id                            Name of the block device containing the superblock

 

           void *                         s_fs_info                     Pointer to superblock information of a specific filesystem

 

           struct semaphore            s_vfs_rename_sem            Semaphore used by VFS when renaming files across directories

 

           u32                             s_time_gran                       Timestamp's granularity (in nanoseconds)

 

 

 

 

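           The methods associated with a superblock are called superblock operations; they are described by the super_operations structure, whose address is stored in the s_op field: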

 

           alloc_inode(sb)

 

           Allocates space for an inode object, including the space required for filesystem-specific data.

 

           destroy_inode(inode)

 

           Destroys an inode object, including the filesystem-specific data

          

           read_inode(inode)

           Fills the fields of the inode object passed as the parameter with the data on disk; the i_ino

           field of the inode object identifies the specific filesystem inode on the disk to be read.

 

 

           dirty_inode(inode)

           Invoked when the inode is marked as modified (dirty). Used by filesystems such as ReiserFS

           and Ext3 to update the filesystem journal on disk.

 

 

           write_inode(inode, flag)

           Updates a filesystem inode with the contents of the inode object passed as the parameter; the

           i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The

           flag parameter indicates whether the I/O operation should be synchronous.

 

           put_inode(inode)

           Invoked when the inode is released (its reference counter is decreased) to perform filesystem-specific operations.

 

           drop_inode(inode)

           Invoked when the inode is about to be destroyed, that is, when the last user releases the inode;

           filesystems that implement this method usually make use of generic_drop_inode( ). This

           function removes every reference to the inode from the VFS data structures and, if the inode

           no longer appears in any directory, invokes the delete_inode superblock method to delete the

           inode from the filesystem.

 

           delete_inode(inode)

           Invoked when the inode must be destroyed. Deletes the VFS inode in memory and the file data

           and metadata on disk.

 

 

           put_super(super)

           Releases the superblock object passed as the parameter (because the corresponding

           filesystem is unmounted).

 

       write_super(super)

           Updates a filesystem superblock with the contents of the object indicated.

 

 

           sync_fs(sb, wait)

           Invoked when flushing the filesystem to update filesystem-specific data structures on disk

           (used by journaling filesystems ).

 

 

           write_super_lockfs(super)

           Blocks changes to the filesystem and updates the superblock with the contents of the object

           indicated. This method is invoked when the filesystem is frozen, for instance by the Logical

           Volume Manager (LVM) driver.

 

           unlockfs(super)

           Undoes the block of filesystem updates achieved by the write_super_lockfs superblock

           method.

 

 

           statfs(super, buf)

           Returns statistics on a filesystem by filling the buf buffer.

 

           remount_fs(super, flags, data)

           Remounts the filesystem with new options (invoked when a mount option must be changed).

     

           clear_inode(inode)

           Invoked when a disk inode is being destroyed to perform filesystem-specific operations.

 

 

           umount_begin(super)

           Aborts a mount operation because the corresponding unmount operation has been started

           (used only by network filesystems ).

 

           show_options(seq_file, vfsmount)

       Used to display the filesystem-specific options

 

           quota_read(super, type, data, size, offset)

           Used by the quota system to read data from the file that specifies the limits for this filesystem.

 

           quota_write(super, type, data, size, offset)

           Used by the quota system to write data into the file that specifies the limits for this filesystem.
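           From user space, the statfs superblock method listed above is reached through the statfs( ) system call of Table 12-1. The following is a minimal sketch for Linux; the helper names fs_block_size and print_fs_stats are made up for these notes, and the f_type value printed is the filesystem magic number (compare the s_magic field in Table 12-2).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/vfs.h>   /* statfs( ) and struct statfs (Linux-specific header) */

/* Hypothetical helper: returns the block size reported for the filesystem
 * that contains `path`, or -1 if statfs( ) fails. */
long fs_block_size(const char *path)
{
    struct statfs buf;

    assert(path != NULL);
    if (statfs(path, &buf) != 0)
        return -1;
    return (long)buf.f_bsize;
}

/* Hypothetical helper: prints a few of the statistics filled into buf.
 * Returns 0 on success, -1 on error. */
int print_fs_stats(const char *path)
{
    struct statfs buf;

    assert(path != NULL);
    if (statfs(path, &buf) != 0) {
        perror("statfs");
        return -1;
    }
    printf("%s: magic=0x%lx block size=%ld blocks=%llu free blocks=%llu\n",
           path, (unsigned long)buf.f_type, (long)buf.f_bsize,
           (unsigned long long)buf.f_blocks,
           (unsigned long long)buf.f_bfree);
    return 0;
}
```

           Calling print_fs_stats("/") prints one line for the root filesystem; the values depend on the machine, so no sample output is given here.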

 

           12.2.2. Inode Objects

 

           Table 12-3. The fields of the inode object

           Type                         Field                          Description

              struct hlist_node       i_hash              Pointers for the hash list

           struct list_head        i_list          Pointers for the list that describes the inode's current state

           struct list_head        i_sb_list                 Pointers for the list of inodes of the superblock

           struct list_head        i_dentry            The head of the list of dentry objects referencing this inode

           unsigned long       i_ino               inode number

           atomic_t            i_count         Usage counter

           umode_t         i_mode          File type and access rights

           unsigned int            i_nlink                    Number of hard links

           uid_t               i_uid               Owner identifier

           gid_t               i_gid               Group identifier

 

           dev_t               i_rdev          Real device identifier

 

           loff_t          i_size          File length in bytes

 

           struct timespec       i_atime             Time of last file access

 

           struct timespec     i_mtime         Time of last file write

 

           struct timespec     i_ctime         Time of last inode change

        unsigned int            i_blkbits           Block size in number of bits

        unsigned long       i_blksize           Block size in bytes

        unsigned long       i_version           Version number, automatically increased after each use

        unsigned long       i_blocks            Number of blocks of the file

           unsigned short        i_bytes             Number of bytes in the last block of the file

           unsigned char         i_sock                    Nonzero if file is a socket

           spinlock_t               i_lock                    Spin lock protecting some fields of the inode

        struct semaphore        i_sem               inode semaphore

           struct rw_semaphore     i_alloc_sem             Read/write semaphore protecting against race conditions in direct I/O file operations

        struct inode_operations *   i_op                inode operations

        struct file_operations *    i_fop               Default file operations

        struct super_block *    i_sb                Pointer to superblock object

        struct file_lock *     i_flock         Pointer to file lock list

        struct address_space*   i_mapping           Pointer to an address_space object (see Chapter 15)

        struct address_space    i_data              address_space object of the file

        struct dquot * []       i_dquot         inode disk quotas

        struct list_head        i_devices           Pointers for a list of inodes relative to a specific character or block device (see Chapter 13)

        struct  pipe_inode_info *   i_pipe          Used if the file is a pipe (see Chapter 19)

        struct block_device *   i_bdev          Pointer to the block device driver

        struct cdev *       i_cdev          Pointer to the character device driver

        int                 i_cindex        Index of the device file within a group of minor numbers

        __u32               i_generation    inode version number (used by some filesystems)

        unsigned long       i_dnotify_mask  Bit mask of directory notify events

        struct dnotify_struct *   i_dnotify       Used for directory notifications

        unsigned long       i_state         inode state flags

           unsigned long               dirtied_when                 Dirtying time (in ticks) of the inode

        unsigned int            i_flags         Filesystem mount flags

        atomic_t            i_writecount            Usage counter for writing processes

        void *          i_security          Pointer to inode's security structure

        void *          u.generic_ip            Pointer to private data

           seqcount_t             i_size_seqcount       Sequence counter used in SMP systems to get consistent values for i_size

 

 

           The methods associated with an inode object are also called inode operations; they are described by the inode_operations structure, whose address is stored in the i_op field.
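           Several inode fields from Table 12-3 are visible from user space through fstat( ): st_size, st_nlink and st_mode report what the kernel keeps in i_size, i_nlink and i_mode. The sketch below is only an illustration; the struct inode_view type, the function name and the /tmp file name template are invented for these notes. It also reads the link count again after the file's only name has been removed while the file is still open.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical snapshot of a few inode attributes as seen from user space. */
struct inode_view {
    long long size;                    /* st_size,  compare with i_size  */
    unsigned long nlink;               /* st_nlink, compare with i_nlink */
    unsigned long nlink_after_unlink;  /* st_nlink once the name is gone */
    int is_regular;                    /* from st_mode, compare with i_mode */
};

/* Creates a 16-byte temporary file, records its attributes, removes its
 * only name while keeping the descriptor open, then records the link
 * count again. Returns 0 on success, -1 on any error. */
int inspect_temp_inode(struct inode_view *out)
{
    char path[] = "/tmp/vfs_inode_XXXXXX";   /* assumed scratch location */
    struct stat st;
    int fd, rc = -1;

    assert(out != NULL);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;

    if (write(fd, "0123456789abcdef", 16) == 16 && fstat(fd, &st) == 0) {
        out->size = (long long)st.st_size;
        out->nlink = (unsigned long)st.st_nlink;
        out->is_regular = S_ISREG(st.st_mode) ? 1 : 0;
        rc = 0;
    }

    unlink(path);                      /* drop the only directory entry */
    if (rc == 0 && fstat(fd, &st) == 0)
        out->nlink_after_unlink = (unsigned long)st.st_nlink;
    else
        rc = -1;

    close(fd);                         /* last user releases the inode */
    return rc;
}
```

           On typical local Linux filesystems nlink_after_unlink is expected to be 0 while the descriptor is still usable, which matches the drop_inode description above: the inode is deleted only when the last user releases it and it no longer appears in any directory. Some network filesystems may report a different count.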

 

           12.2.3. File Objects

 

           A file object describes how a process interacts with a file it has opened.

           The object is created when the file is opened and consists of a file structure.

 

           Table 12-4. The fields of the file object

           Type                   Field                        Description

           struct list_head    f_list          Pointers for generic file object list

           struct dentry * f_dentry            dentry object associated with the file

           struct vfsmount *   f_vfsmnt            Mounted filesystem containing the file

 

           struct file_operations *   f_op                Pointer to file operation table

        atomic_t        f_count         File object's reference counter

        unsigned int        f_flags         Flags specified when opening the file

        mode_t      f_mode          Process access mode

           int             f_error             Error code for network write operation

        loff_t      f_pos               Current file offset (file pointer)

        struct fown_struct f_owner         Data for I/O event notification via signals

        unsigned int        f_uid               User's UID

           unsigned int        f_gid               User group ID

           struct file_ra_state   f_ra                 File read-ahead state (see Chapter 16)

           size_t               f_maxcount            Maximum number of bytes that can be read or written with a single operation (currently set to 2^31-1)

        unsigned long   f_version           Version number, automatically increased after each use

           void *              f_security               Pointer to file object's security structure

        void *      private_data            Pointer to data specific for a filesystem or a device driver

           struct list_head f_ep_links              Head of the list of event poll waiters for this file

           spinlock_t         f_ep_lock               Spin lock protecting the f_ep_links list

        struct address_space* f_mapping Pointer to file's address space object (see Chapter 15)
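           Because the file pointer (f_pos) lives in the file object and not in the inode, two descriptors that refer to the same file object share one offset, while a second open( ) of the same file creates a new file object with its own offset. The sketch below checks both cases from user space; the function names and the /tmp name template are assumptions made for these notes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Creates a temporary file holding 16 bytes. path_template must end in
 * "XXXXXX" and is overwritten with the real name. Returns an fd or -1. */
static int make_temp_file(char *path_template)
{
    int fd;

    assert(path_template != NULL);
    fd = mkstemp(path_template);
    if (fd < 0)
        return -1;
    if (write(fd, "0123456789abcdef", 16) != 16) {
        close(fd);
        unlink(path_template);
        return -1;
    }
    return fd;
}

/* Seeks `mover` to offset 5, then returns the current offset as seen
 * through `observer` (or -1 on error). */
static long offset_seen_after_seek(int mover, int observer)
{
    if (lseek(mover, 5, SEEK_SET) != 5)
        return -1;
    return (long)lseek(observer, 0, SEEK_CUR);
}

/* dup( ): both descriptors point to one file object, hence one offset. */
long offset_seen_through_dup(void)
{
    char path[] = "/tmp/vfs_file_XXXXXX";
    int fd = make_temp_file(path);
    int copy;
    long seen = -1;

    if (fd < 0)
        return -1;
    copy = dup(fd);
    if (copy >= 0) {
        seen = offset_seen_after_seek(fd, copy);
        close(copy);
    }
    close(fd);
    unlink(path);
    return seen;
}

/* Second open( ): a separate file object, so the offset starts at 0
 * and is not moved by seeking the first descriptor. */
long offset_seen_through_second_open(void)
{
    char path[] = "/tmp/vfs_file_XXXXXX";
    int fd = make_temp_file(path);
    int other;
    long seen = -1;

    if (fd < 0)
        return -1;
    other = open(path, O_RDONLY);
    if (other >= 0) {
        seen = offset_seen_after_seek(fd, other);
        close(other);
    }
    close(fd);
    unlink(path);
    return seen;
}
```

           The first function returns 5 and the second returns 0, which is the user-visible difference between sharing a file object and sharing only the inode.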

 

 

           The file operations are described by the file_operations structure, whose address is stored in the f_op field:

           llseek(file, offset, origin)

       Updates the file pointer.

 

           read(file, buf, count, offset)

       Reads count bytes from a file starting at position *offset; the value *offset (which usually

           corresponds to the file pointer) is then increased.

 

           aio_read(req, buf, len, pos)

           Starts an asynchronous I/O operation to read len bytes into buf from file position pos

           (introduced to support the io_submit( ) system call).

 

           write(file, buf, count, offset)

           Writes count bytes into a file starting at position *offset; the value *offset (which usually

           corresponds to the file pointer) is then increased.

 

           aio_write(req, buf, len, pos)

           Starts an asynchronous I/O operation to write len bytes from buf to file position pos.

 

           readdir(dir, dirent, filldir)

           Returns the next directory entry of a directory in dirent; the filldir parameter contains the

           address of an auxiliary function that extracts the fields in a directory entry.

 

           poll(file, poll_table)

           Checks whether there is activity on a file and goes to sleep until something happens on it.

 

           ioctl(inode, file, cmd, arg)

           Sends a command to an underlying hardware device. This method applies only to device files.

 

           unlocked_ioctl(file, cmd, arg)

           Similar to the ioctl method, but it does not take the big kernel lock (see the section "The Big

           Kernel Lock" in Chapter 5). It is expected that all device drivers and all filesystems will

           implement this new method instead of the ioctl method.

 

           compat_ioctl(file, cmd, arg)

           Method used to implement the ioctl() 32-bit system call by 64-bit kernels.

 

           mmap(file, vma)

           Performs a memory mapping of the file into a process address space (see the section "Memory

           Mapping" in Chapter 16).

 

           open(inode, file)

           Opens a file by creating a new file object and linking it to the corresponding inode object (see

           the section "The open( ) System Call" later in this chapter).

 

           flush(file)

           Called when a reference to an open file is closed. The actual purpose of this method is

           filesystem-dependent.

 

           release(inode, file)

           Releases the file object. Called when the last reference to an open file is closed, that is, when

           the f_count field of the file object becomes 0.

 

 

           fsync(file, dentry, flag)

           Flushes the file by writing all cached data to disk.

 

 

           aio_fsync(req, flag)

           Starts an asynchronous I/O flush operation.

 

 

           fasync(fd, file, on)

           Enables or disables I/O event notification by means of signals.

 

 

           lock(file, cmd, file_lock)

           Applies a lock to the file (see the section "File Locking" later in this chapter).

 

       readv(file, vector, count, offset)

           Reads bytes from a file and puts the results in the buffers described by vector; the number of

           buffers is specified by count.

 

       writev(file, vector, count, offset)

           Writes bytes into a file from the buffers described by vector; the number of buffers is specified by count.

 

       sendfile(in_file, offset, count, file_send_actor, out_file)

           Transfers data from in_file to out_file (introduced to support the sendfile( ) system call).

 

           sendpage(file, page, offset, size, pointer, fill)

           Transfers data from file to the page cache's page; this is a low-level method used by

        sendfile( ) and by the networking code for sockets.

 

           get_unmapped_area(file, addr, len, offset, flags)

           Gets an unused address range to map the file.

 

           check_flags(flags)

           Method invoked by the service routine of the fcntl( ) system call to perform additional checks

           when setting the status flags of a file (F_SETFL command). Currently used only by the NFS

           network filesystem.

 

       dir_notify(file, arg)

           Method invoked by the service routine of the fcntl( ) system call when establishing a

           directory change notification (F_NOTIFY command). Currently used only by the Common

           Internet File System (CIFS ) network filesystem.

 

       flock(file, flag, lock)

           Used to customize the behavior of the flock() system call. No official Linux filesystem makes

           use of this method.

 

 

           12.2.4. dentry Objects (directory entry objects)

 

           Table 12-5. The fields of the dentry object

           Type                         Field                          Description

 

              atomic_t            d_count         Dentry object usage counter

        unsigned int            d_flags         Dentry cache flags

           spinlock_t               d_lock                   Spin lock protecting the dentry object

        struct inode *      d_inode         Inode associated with filename

        struct dentry *     d_parent            Dentry object of parent directory

        struct qstr         d_name          Filename

        struct list_head        d_lru               Pointers for the list of unused dentries

        struct list_head        d_child         For directories, pointers for the list of directory dentries in the same parent directory

        struct list_head        d_subdirs           For directories, head of the list of subdirectory dentries

        struct list_head        d_alias         Pointers for the list of dentries associated with the same inode (alias)

        unsigned long       d_time          Used by d_revalidate method

        struct dentry_operations*   d_op                Dentry methods

        struct super_block *    d_sb                Superblock object of the file

        void *          d_fsdata            Filesystem-dependent data

           struct rcu_head       d_rcu                     The RCU descriptor used when reclaiming the dentry object

                                                        (see the section "Read-Copy Update (RCU)" in Chapter 5)

           struct dcookie_struct *   d_cookie                Pointer to structure used by kernel profilers

        struct hlist_node       d_hash          Pointer for list in hash table entry

        int             d_mounted           For directories, counter for the number of filesystems mounted on this dentry

        unsigned char[]     d_iname         Space for short filename

 

 

           The methods associated with a dentry object are called dentry operations; they are described by the dentry_operations structure, whose address is stored in the d_op field.

 

           d_revalidate(dentry, nameidata)

           Determines whether the dentry object is still valid before using it for translating a file

           pathname. The default VFS function does nothing, although network filesystems may specify

           their own functions.

 

           d_hash(dentry, name)

           Creates a hash value; this function is a filesystem-specific hash function for the dentry hash

           table. The dentry parameter identifies the directory containing the component. The name

           parameter points to a structure containing both the pathname component to be looked up and

           the value produced by the hash function.

 

           d_compare(dir, name1, name2)

           Compares two filenames ; name1 should belong to the directory referenced by dir. The default

           VFS function is a normal string match. However, each filesystem can implement this method in

           its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.

 

           d_delete(dentry)

           Called when the last reference to a dentry object is deleted (d_count becomes 0). The default

           VFS function does nothing.

 

           d_release(dentry)

           Called when a dentry object is going to be freed (released to the slab allocator). The default

           VFS function does nothing.

 

 

           d_iput(dentry, ino)

           Called when a dentry object becomes "negative", that is, when it loses its inode. The default VFS

           function invokes iput( ) to release the inode object.
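           The d_compare description above contrasts a plain string match with a case-insensitive one. The following user-space sketch shows the two policies side by side. It is not the kernel's code: struct toy_qstr and both function names are invented here, and the convention of returning 0 for "names match" is an assumption chosen to mirror a comparison routine.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a filename with an explicit length. */
struct toy_qstr {
    const char *name;
    size_t len;
};

/* Exact comparison: 0 if the two names are byte-for-byte equal, 1 otherwise. */
int toy_compare_exact(const struct toy_qstr *a, const struct toy_qstr *b)
{
    assert(a != NULL && b != NULL);
    if (a->len != b->len)
        return 1;
    return memcmp(a->name, b->name, a->len) != 0;
}

/* Case-insensitive comparison (ASCII only): 0 if the names match when
 * letter case is ignored, 1 otherwise. */
int toy_compare_nocase(const struct toy_qstr *a, const struct toy_qstr *b)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (a->len != b->len)
        return 1;
    for (i = 0; i < a->len; i++) {
        if (tolower((unsigned char)a->name[i]) !=
            tolower((unsigned char)b->name[i]))
            return 1;
    }
    return 0;
}
```

           With "README" and "readme" the exact version reports a mismatch and the case-insensitive version reports a match, which is the behavior the notes attribute to MS-DOS.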

 

           12.2.5. The dentry Cache

           1:The addresses of the first and last elements of the LRU list are stored in the next and

        prev fields of the dentry_unused variable of type list_head. The d_lru field of the dentry object

           contains pointers to the adjacent dentries in the list.

 

           2:Each "in use" dentry object is inserted into a doubly linked list specified by the i_dentry field of the

           corresponding inode object (because each inode could be associated with several hard links, a list is

           required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list.

 

 

           3:The hash table is implemented by means of a dentry_hashtable array
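           Point 1 above relies on the circular doubly linked list idiom: the list head's next field addresses the first element and its prev field addresses the last. The sketch below is a user-space toy, not kernel code; every toy_ name is invented. The policy of inserting released entries at the front and reclaiming from the tail is an assumption added for illustration, since these notes do not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy version of a list_head: a node in a circular doubly linked list. */
struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* Toy dentry that embeds its LRU links, like the d_lru field. */
struct toy_dentry {
    const char *name;
    struct toy_list_head d_lru;
};

/* Recover the enclosing toy_dentry from a pointer to its d_lru member. */
#define toy_entry(ptr) \
    ((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, d_lru)))

/* An empty list is a head that points to itself in both directions. */
static void toy_list_init(struct toy_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `item` immediately after `head`, i.e. at the front of the list. */
static void toy_list_add(struct toy_list_head *item, struct toy_list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/* Unlink `item` from whatever list it is on. */
static void toy_list_del(struct toy_list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = item;
    item->prev = item;
}

/* Assumed policy: a newly unused dentry goes to the front of the list. */
void toy_release(struct toy_list_head *unused, struct toy_dentry *d)
{
    assert(unused != NULL && d != NULL);
    toy_list_add(&d->d_lru, unused);
}

/* Assumed policy: reclaim the least recently released dentry (the tail).
 * Returns NULL when the list is empty. */
struct toy_dentry *toy_reclaim_oldest(struct toy_list_head *unused)
{
    struct toy_list_head *last;

    assert(unused != NULL);
    if (unused->prev == unused)
        return NULL;
    last = unused->prev;
    toy_list_del(last);
    return toy_entry(last);
}
```

           After releasing dentries a and then b, the head's next field addresses b and its prev field addresses a, so reclaiming returns a first.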

 

          

           12.2.6. Files Associated with a Process

           fs_struct

 

           Table 12-6. The fields of the fs_struct structure

 

           Type                  Field                   Description

        atomic_t        count           Number of processes sharing this table

        rwlock_t        lock            Read/write spin lock for the table fields

        int         umask           Bit mask used when opening the file to set the file permissions

        struct dentry   *   root            Dentry of the root directory

        struct dentry* pwd         Dentry of the current working directory

        struct dentry* altroot     Dentry of the emulated root directory (always NULL for the 80x86 architecture)

        struct  vfsmount * rootmnt     Mounted filesystem object of the root directory

        struct vfsmount *   pwdmnt      Mounted filesystem object of the current working directory

        struct vfsmount *   altrootmnt           Mounted filesystem object of the emulated root directory (always NULL for the 80x86 architecture)

 

           files_struct

 

        Table 12-7. The fields of the files_struct structure

        

         Type                 Field                   Description

        atomic_t        count           Number of processes sharing this table

        rwlock_t        file_lock       Read/write spin lock for the table fields

        int         max_fds     Current maximum number of file objects

        int         max_fdset       Current maximum number of file descriptors

        int         next_fd     Maximum file descriptors ever allocated plus 1

        struct file ** fd          Pointer to array of file object pointers

        fd_set *        close_on_exec       Pointer to file descriptors to be closed on exec( )

        fd_set *        open_fds        Pointer to open file descriptors

        fd_set      close_on_exec_init Initial set of file descriptors to be closed on exec( )

        fd_set      open_fds_init   Initial set of file descriptors

        struct file *[] fd_array        Initial array of file object pointers

           Figure 12-3. The fd array
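           A file descriptor is simply an index into the fd array. POSIX requires open( ) to return the lowest-numbered descriptor that is not currently open, so a slot freed by close( ) is the next one to be handed out. The sketch below observes this from user space; the function name is made up, and the result only holds while no other thread is opening files at the same time.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a descriptor freed by close( ) is reused by the next
 * open( ), 0 if it is not, and -1 if any open( ) fails.
 * Assumes a single-threaded caller. */
int freed_descriptor_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    int second = open("/dev/null", O_RDONLY);
    int third, reused;

    if (first < 0 || second < 0) {
        if (first >= 0)
            close(first);
        if (second >= 0)
            close(second);
        return -1;
    }
    assert(second > first);        /* lowest free slot was taken first */

    close(first);                  /* slot `first` in the fd array is free again */
    third = open("/dev/null", O_RDONLY);
    reused = (third == first);

    if (third >= 0)
        close(third);
    close(second);
    return third < 0 ? -1 : reused;
}
```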

 

          

 

           fget( )/fget_light( )

        fput( )/fput_light( )

 

 

      12.3. Filesystem Types

 

           12.3.1. Special Filesystems

 

           Table 12-8. Most common special filesystems

              Name                        Mount point                    Description

           bdev                      none                    Block devices (see Chapter 13)

           binfmt_misc                 any                       Miscellaneous executable formats (see Chapter 20)

           devpts             /dev/pts                 Pseudoterminal support (Open Group's Unix98 standard)

           eventpollfs             none                    Used by the efficient event polling mechanism

            futexfs             none                     Used by the futex (Fast Userspace Locking) mechanism

           pipefs                    none                     Pipes (see Chapter 19)

           proc                      /proc                     General access point to kernel data structures

           rootfs                    none                    Provides an empty root directory for the bootstrap phase

           shm                      none                     IPC-shared memory regions (see Chapter 19)

           mqueue                 any                       Used to implement POSIX message queues (see Chapter 19)

           sockfs                    none                    Sockets

           sysfs                     /sys                      General access point to system data (see Chapter 13)

           tmpfs                    any                       Temporary files (kept in RAM unless swapped)

           usbfs                     /proc/bus/usb        USB devices

 

 

           12.3.2. Filesystem Type Registration

 

           Each registered filesystem is represented as a file_system_type object whose fields are illustrated in Table 12-9.

 

           Table 12-9. The fields of the file_system_type object

 

         Type                                Field                          Description

              const char *                name                Filesystem name

           int                 fs_flags            Filesystem type flags

        struct super_block * (*)( )     get_sb          Method for reading a superblock

        void (*)( )                     kill_sb         Method for removing a superblock

        struct module *         owner               Pointer to the module implementing the filesystem (see Appendix B)

        struct file_system_type *       next                Pointer to the next element in the list of filesystem types

        struct list_head            fs_supers           Head of a list of superblock objects having the same filesystem type

 

           All filesystem-type objects are inserted into a singly linked list. The file_systems variable points to the first item.
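           The singly linked list described above can be modeled in user space. The following is only a sketch: struct fs_type, fs_list_head, fs_find( ), fs_register( ) and fs_unregister( ) are hypothetical names invented for this note, not the kernel's own code, and the locking that a real kernel list needs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified user-space stand-in for file_system_type (hypothetical). */
struct fs_type {
    const char *name;        /* filesystem name */
    struct fs_type *next;    /* next element in the singly linked list */
};

/* Head of the list; plays the role of the file_systems variable. */
static struct fs_type *fs_list_head;

/* Return the address of the link pointing to the entry called name,
 * or the address of the terminating NULL link if there is no match. */
static struct fs_type **fs_find(const char *name)
{
    struct fs_type **p;

    for (p = &fs_list_head; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list; 0 on success, -1 if the name is already taken. */
int fs_register(struct fs_type *fs)
{
    struct fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    p = fs_find(fs->name);
    if (*p != NULL)
        return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink fs from the list; 0 on success, -1 if it was not registered. */
int fs_unregister(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list_head; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}
```

           Walking the list through a pointer to the link (struct fs_type **) avoids a special case for the head of the list, both when appending and when unlinking.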

 

 

      12.4. Filesystem Handling

 

        12.4.1. Namespaces

           The namespace of a process is represented by a namespace structure pointed to by the namespace

           field of the process descriptor

 

           Table 12-11. The fields of the namespace structure

 

         Type                  Field                          Description

 

           atomic_t        count               Usage counter (how many processes share the namespace)

        struct vfsmount *   root                Mounted filesystem descriptor for the root directory of the namespace

        struct list_head    list                Head of list of all mounted filesystem descriptors

        struct rw_semaphore sem             Read/write semaphore protecting this structure

 

 

           12.4.2. Filesystem Mounting

           vfsmount

 

        12.4.3. Mounting a Generic Filesystem

 

                 12.4.3.1. The do_kern_mount( ) function

 

                 12.4.3.2. Allocating a superblock object

 

 

           12.4.4. Mounting the Root Filesystem

 

                 Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs

                 filesystem allows the kernel to easily change the real root filesystem.

 

                 12.4.4.1. Phase 1: Mounting the rootfs filesystem

 

                 init_rootfs( )/

            init_mount_tree( )

 

                     12.4.4.2. Phase 2: Mounting the real root filesystem

                     prepare_namespace( )

 

         12.4.5. Unmounting a Filesystem

                 do_umount( )

                

      12.5. Pathname Lookup 

         path_lookup(const char *name, unsigned int flags, struct nameidata *nd) / path_lookupat( )

 

        Table 12-15. The fields of the nameidata data structure

         Type                         Field                   Description

        struct dentry *     dentry      Address of the dentry object

        struct vfsmount *       mnt         Address of the mounted filesystem object

        struct qstr         last            Last component of the pathname (used when the LOOKUP_PARENT flag is set)

        unsigned int            flags           Lookup flags

        int             last_type       Type of last component of the pathname (used when the LOOKUP_PARENT flag is set)

           unsigned int                  depth               Current level of symbolic link nesting (see below); it must be smaller than 6

           char *[ ]                saved_names          Array of pathnames associated with nested symbolic links

           union                     intent               One-member union specifying how the file will be accessed

 

 

         Table 12-16. The flags of the lookup operation

              Macro         Description

        LOOKUP_FOLLOW   If the last component is a symbolic link, interpret (follow) it

        LOOKUP_DIRECTORY    The last component must be a directory

        LOOKUP_CONTINUE There are still filenames to be examined in the pathname

        LOOKUP_PARENT   Look up the directory that includes the last component of the pathname

        LOOKUP_NOALT        Do not consider the emulated root directory (useless in the 80x86 architecture)

         LOOKUP_OPEN     Intent is to open a file

        LOOKUP_CREATE   Intent is to create a file (if it doesn't exist)

        LOOKUP_ACCESS   Intent is to check user's permission for a file

 

         12.5.1. Standard Pathname Lookup

                 link_path_walk( )

        12.5.2. Parent Pathname Lookup

 

         12.5.3. Lookup of Symbolic Links

 

     12.6. Implementations of VFS System Calls

 

         12.6.1. The open( ) System Call

 

              Table 12-18. The flags of the open( ) system call

 

              Flag name                Description

            O_RDONLY            Open for reading

            O_WRONLY            Open for writing

            O_RDWR          Open for both reading and writing

                 O_CREAT         Create the file if it does not exist

            O_EXCL          With O_CREAT, fail if the file already exists

            O_NOCTTY            Never consider the file as a controlling terminal

            O_TRUNC         Truncate the file (remove all existing contents)

            O_APPEND            Always write at end of the file

            O_NONBLOCK          No system calls will block on the file

            O_NDELAY            Same as O_NONBLOCK

            O_SYNC          Synchronous write (block until physical write terminates)

            FASYNC          I/O event notification via signals

            O_DIRECT            Direct I/O transfer (no kernel buffering)

            O_LARGEFILE         Large file (size greater than 2 GB)

            O_DIRECTORY         Fail if file is not a directory

            O_NOFOLLOW          Do not follow a trailing symbolic link in pathname

                 O_NOATIME                  Do not update the inode's last access time
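         A few of the flags in Table 12-18 can be tried from user space. The sketch below assumes a POSIX system; create_exclusive( ) and open_for_append( ) are hypothetical helper names chosen for this note.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create path exclusively: returns a write-only descriptor, or -1 with
 * errno set to EEXIST when the file already exists (O_CREAT + O_EXCL). */
int create_exclusive(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/* Open an existing file so that every write goes to the end of the file. */
int open_for_append(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_APPEND);
}
```

         Combining O_CREAT with O_EXCL makes the existence check and the creation a single operation, which is why the pair is commonly used for lock files.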

 

         12.6.2. The read( ) and write( ) System Calls

 

         12.6.3. The close( ) System Call

 

      12.7. File Locking

        The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call

 

           12.7.1. Linux File Locking

           By issuing the flock( ) system call. The two parameters of the system call are the fd file

           descriptor, and a command to specify the lock operation. The lock applies to the whole file

 

           By using the fcntl( ) system call. The three parameters of the system call are the fd file

           descriptor, a command to specify the lock operation, and a pointer to a flock structure (see

           Table 12-20). A couple of fields in this structure allow the process to specify the portion of the

           file to be locked. Processes can thus hold several locks on different portions of the same file.

     

           12.7.2. File-Locking Data Structures

           Table 12-19. The fields of the file_lock data structure

         Type                        Field                          Description

        struct file_lock *      fl_next         Next element in list of locks associated with the inode

        struct list_head        fl_link         Pointers for active or blocked list

        struct list_head        fl_block            Pointers for the lock's waiters list

        struct files_struct *   fl_owner            Owner's files_struct

        unsigned int            fl_pid          PID of the process owner

        wait_queue_head_t       fl_wait         Wait queue of blocked processes

        struct file *       fl_file         Pointer to file object

        unsigned char       fl_flags            Lock flags

        unsigned char       fl_type         Lock type

        loff_t          fl_start            Starting offset of locked region

        loff_t          fl_end          Ending offset of locked region

        struct fasync_struct * fl_fasync           Used for lease break notifications

           unsigned long               fl_break_time               Remaining time before end of lease

      struct file_lock_operations * fl_ops                    Pointer to file lock operations

      struct lock_manager_operations* fl_mops                 Pointer to lock manager operations

        union               fl_u                Filesystem-specific information

 

         12.7.3. FL_FLOCK Locks

 

           flock_lock_file_wait( )

 

           12.7.4. FL_POSIX Locks

 

           Table 12-20. The fields of the flock data structure

              Type                  Field                   Description

        short           l_type      F_RDLCK (requests a shared lock), F_WRLCK (requests an exclusive lock), F_UNLCK (releases the lock)

           short           l_whence        SEEK_SET (from beginning of file), SEEK_CUR (from current file pointer), SEEK_END (from end of file)

           off_t           l_start     Initial offset of the locked region relative to the value of l_whence

        off_t           l_len           Length of locked region (0 means that the region includes all potential writes past the current end of the file)

        pid_t           l_pid           PID of the owner

 

           F_GETLK

           Determines whether the lock described by the flock structure conflicts with some FL_POSIX

           lock already obtained by another process. In this case, the flock structure is overwritten with

           the information about the existing lock.

       F_SETLK

           Sets the lock described by the flock structure. If the lock cannot be acquired, the system call

           returns an error code.

       F_SETLKW

           Sets the lock described by the flock structure. If the lock cannot be acquired, the system call

           blocks; that is, the calling process is put to sleep until the lock is available.
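           A user-space sketch of the F_SETLK command on a region of a file follows. It assumes a POSIX system and a descriptor opened for writing; lock_region( ) and unlock_region( ) are hypothetical helper names chosen for this note.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Try to take an exclusive FL_POSIX lock on len bytes starting at start,
 * without blocking. Returns 0 on success, -1 (errno set) on failure. */
int lock_region(int fd, off_t start, off_t len)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;       /* exclusive lock */
    fl.l_whence = SEEK_SET;    /* l_start is relative to the file start */
    fl.l_start = start;
    fl.l_len = len;            /* 0 would extend the lock past end of file */
    return fcntl(fd, F_SETLK, &fl);
}

/* Release a lock previously taken on the same region. */
int unlock_region(int fd, off_t start, off_t len)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}
```

           Replacing F_SETLK with F_SETLKW in lock_region( ) would make the caller sleep until the region becomes available instead of returning an error. Locks requested by the same process never conflict with each other, so seeing a refusal requires a second process.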

 

 

Chapter 13. I/O Architecture and Device Drivers

      13.1. I/O Architecture

           Figure 13-1. PC's I/O architecture

          

 

           13.1.1. I/O Ports

 

                 13.1.1.1. Accessing I/O ports

                 inb( ), inw( ), inl( )

           inb_p( ), inw_p( ), inl_p( )

      

           outb( ), outw( ), outl( )

           outb_p( ), outw_p( ), outl_p( )

 

           insb( ), insw( ), insl( )

           outsb( ), outsw( ), outsl( )

 

           13.1.2. I/O Interfaces

 

                 13.1.2.1. Custom I/O interfaces

 

                 13.1.2.2. General-purpose I/O interfaces

 

         13.1.3. Device Controllers

 

 

    13.2. The Device Driver Model

        13.2.1. The sysfs Filesystem

           Relationships between components of the device driver models are expressed in the sysfs filesystem

           as symbolic links between directories and files. For example, the /sys/block/sda/device file can be a

           symbolic link to a subdirectory nested in /sys/devices/pci0000:00 representing the SCSI controller

           connected to the PCI bus. Moreover, the /sys/block/sda/device/block file is a symbolic link to

           /sys/block/sda, stating that this PCI device is the controller of the SCSI disk

 

           13.2.2. Kobjects

        each kobject corresponds to a directory in that filesystem

                 13.2.2.1. Kobjects, ksets, and subsystems

             

                     Table 13-2. The fields of the kobject data structure

              Type                  Field                   Description

            char *      k_name      Pointer to a string holding the name of the container

            char []     name            String holding the name of the container, if it fits in 20 bytes

            struct kref         kref            The reference counter for the container

            struct list_head    entry           Pointers for the list in which the kobject is inserted

            struct kobject *    parent      Pointer to the parent kobject, if any

            struct kset *   kset            Pointer to the containing kset

            struct kobj_type * ktype           Pointer to the kobject type descriptor

            struct dentry * dentry      Pointer to the dentry of the sysfs file associated with the kobject

 

                 The kobj_type data structure includes three fields:

                 a release method

                 a sysfs_ops pointer to a table of sysfs operations

            and a list of default attributes for the sysfs filesystem
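                 The interplay between the reference counter and the release method can be sketched in user space. Everything below is a simplified model with hypothetical names (struct kobj, struct kobj_type_model, kobj_get( ), kobj_put( ), count_release( )); the counter is a plain int here, whereas a real kernel counter must be updated atomically.

```c
#include <assert.h>
#include <stddef.h>

struct kobj;    /* forward declaration */

/* Stand-in for kobj_type: only the release method is modeled. */
struct kobj_type_model {
    void (*release)(struct kobj *obj);
};

/* Stand-in for kobject: a name, a reference counter and a type. */
struct kobj {
    const char *name;
    int refcount;
    const struct kobj_type_model *ktype;
};

/* Initialize obj with one reference held by the caller. */
void kobj_init(struct kobj *obj, const char *name,
               const struct kobj_type_model *ktype)
{
    assert(obj != NULL && ktype != NULL);
    obj->name = name;
    obj->refcount = 1;
    obj->ktype = ktype;
}

/* Take one more reference. */
struct kobj *kobj_get(struct kobj *obj)
{
    assert(obj != NULL && obj->refcount > 0);
    obj->refcount++;
    return obj;
}

/* Drop a reference; returns 1 if this was the last one and the
 * release method (if any) was invoked, 0 otherwise. */
int kobj_put(struct kobj *obj)
{
    assert(obj != NULL && obj->refcount > 0);
    if (--obj->refcount > 0)
        return 0;
    if (obj->ktype->release != NULL)
        obj->ktype->release(obj);
    return 1;
}

/* Example release method: it only counts how often it has been called. */
static int released_count;

void count_release(struct kobj *obj)
{
    (void)obj;
    released_count++;
}
```

                 Placing release in the type descriptor rather than in each object lets every container of a given type share one cleanup routine.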

 

     

                 Table 13-3. The fields of the kset data structure

 

              Type                  Field                   Description

            struct subsystem    * subsys        Pointer to the subsystem descriptor

            struct kobj_type    * ktype     Pointer to the kobject type descriptor of the kset

            struct list_head    list            Head of the list of kobjects included in the kset

            struct kobject kobj            Embedded kobject (see text)

            struct kset_hotplug_ops * hotplug_ops   Pointer to a table of callback functions for kobject filtering and hot-plugging

 

            Figure 13-3. An example of device driver model hierarchy

           

 

            13.2.2.2. Registering kobjects, ksets, and subsystems

 

                     kset_register() and kset_unregister( ) functions

 

 

 

                 13.2.3. Components of the Device Driver Model

 

                      13.2.3.1. Devices

 

                            device

                            Table 13-4. The fields of the device object

                   Type                         Field                          Description

                struct list_head        node                Pointers for the list of sibling devices

                struct list_head        bus_list            Pointers for the list of devices on the same bus type

                struct list_head        driver_list         Pointers for the driver's list of devices

                struct list_head        children            Head of the list of children devices

                struct device *     parent          Pointer to the parent device

                struct kobject      kobj                Embedded kobject

                char []         bus_id          Device position on the hosting bus

                struct bus_type *       bus             Pointer to the hosting bus

                struct device_driver    *   driver          Pointer to the controlling device driver

                void *          driver_data         Pointer to private data for the driver

                void *          platform_data       Pointer to private data for legacy device drivers

                struct dev_pm_info      power               Power management information

                      unsigned long               detach_state                 Power state to be entered when unloading the device driver

                      unsigned long long *       dma_mask             Pointer to the DMA mask of the device(see the later section

                                                                   "Direct Memory Access (DMA)")

 

                      unsigned long long         coherent_dma_mask           Mask for coherent DMA of the device

 

                      struct list_head        dma_pools              Head of a list of aggregate DMA buffers

                      struct dma_coherent_mem * dma_mem             Pointer to a descriptor of the coherent DMA memory

                                                                   used by the device (see the later section "Direct Memory Access (DMA)")

                      void (*)(struct device*)  release             Callback function for releasing the device descriptor

 

 

                      13.2.3.2. Drivers

 

                            device_driver

                Table 13-5. The fields of the device_driver object

                Type                                 Field                        Description

                char *              name                Name of the device driver

                struct bus_type *           bus             Pointer to descriptor of the bus that hosts the supported devices

                struct semaphore            unload_sem          Semaphore to forbid device driver unloading; it is

                                                                         released when the reference counter reaches zero

                struct kobject          kobj                Embedded kobject

                struct list_head            devices         Head of the list including all devices supported by the driver

                      struct module *             owner                    Identifies the module that implements the device

                                                                         driver, if any (see Appendix B)

                int (*)(struct device *)        probe               Method for probing a device

                                                                         (checking that it can be handled by the device driver)

                int (*)(struct device *)        remove          Method invoked on a device when it is removed

                void (*)(struct device *)       shutdown            Method invoked on a device when it is powered off (shut down)

            int (*)(struct device *,unsigned long, unsigned long) suspend   Method invoked on a device when it is put in low-power state

                int (*)(struct device *,unsigned long) resume           Method invoked on a device when it is put back in

                                                                         the normal state (full power)

                13.2.3.3. Buses

                bus_type object

 

                      Table 13-6. The fields of the bus_type object

                Type                                       Field                          Description

                char *                  name                Name of the bus type

                struct subsystem                subsys          Kobject subsystem associated with this bus type

                struct kset                 drivers         The set of kobjects of the drivers

                struct kset                 devices         The set of kobjects of the devices

                      struct bus_attribute *                bus_attrs                Pointer to the object including the bus attributes and

                                                                              the methods for exporting them to the sysfs filesystem

                      struct device_attribute *                  dev_attrs               Pointer to the object including the device attributes and

                                                                              the methods for exporting them to the sysfs filesystem

                      struct driver_attribute *             drv_attrs                Pointer to the object including the device driver

                                                                              attributes and the methods for exporting them to the

                                                                              sysfs filesystem

                int (*)(struct device *,struct device_driver *) match           Method for checking whether a given driver supports a

                                                                              given device

                int (*)(struct device *, char**, int, char *, int) hotplug      Method invoked when a device is being registered

                int (*)(struct device *,unsigned long) suspend              Method for saving the hardware context state and

                                                                              changing the power level of a device

                int (*)(struct device *) resume                     Method for changing the power level and restoring

                                                                              the hardware context of a device

 

                13.2.3.4. Classes

 

                            The classes of the device driver model are essentially aimed to provide a standard method for

                      exporting to User Mode applications the interfaces of the logical devices. Each class_device

                      descriptor embeds a kobject having an attribute (special file) named dev. This attribute stores the

                      major and minor numbers of the device file that is needed to access the corresponding logical

                      device

 

    13.3. Device Files

                 major number, identifies the device type       

                Traditionally, all device files that have the same major number and the same type share the same set of file

                      operations, because they are handled by the same device driver

                minor number identifies a specific device among a group of devices that share the same major number

                13.3.1. User Mode Handling of Device Files

                    MKDEV
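                    How a major and a minor number are packed into one device number can be sketched as follows. The sketch assumes the 32-bit layout of the 2.6 kernels (12 bits for the major number, 20 bits for the minor number); the demo_ names are hypothetical user-space stand-ins for the kernel's MKDEV, MAJOR and MINOR macros.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MINORBITS 20
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)

/* Pack a major and a minor number into one 32-bit device number. */
static inline uint32_t demo_mkdev(uint32_t major, uint32_t minor)
{
    assert(major < (1U << 12) && minor <= DEMO_MINORMASK);
    return (major << DEMO_MINORBITS) | minor;
}

/* Extract the major number (upper 12 bits). */
static inline uint32_t demo_major(uint32_t dev)
{
    return dev >> DEMO_MINORBITS;
}

/* Extract the minor number (lower 20 bits). */
static inline uint32_t demo_minor(uint32_t dev)
{
    return dev & DEMO_MINORMASK;
}
```

                    For example, demo_mkdev(8, 1) yields 0x800001, from which demo_major( ) and demo_minor( ) recover 8 and 1.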

                    13.3.1.1. Dynamic device number assignment

                    13.3.1.2. Dynamic device file creation

                                   udev toolset can automatically

                     

 

                13.3.2. VFS Handling of Device Files

                The inode object is initialized by reading the corresponding inode on disk through a suitable function

                      of the filesystem (usually ext2_read_inode( ) or ext3_read_inode( ); see Chapter 18). When this

                      function determines that the disk inode is relative to a device file, it invokes init_special_inode( ),

                      which initializes the i_rdev field of the inode object to the major and minor numbers of the device

                      file, and sets the i_fop field of the inode object to the address of either the def_blk_fops or the

                def_chr_fops file operation table, according to the type of device file. The service routine of the

                open( ) system call also invokes the dentry_open( ) function, which allocates a new file object and

                      sets its f_op field to the address stored in i_fop, that is, to the address of def_blk_fops or

                def_chr_fops once again. Thanks to these two tables, every system call issued on a device file will

                      activate a device driver's function rather than a function of the underlying filesystem.

 

    13.4. Device Drivers

                13.4.1. Device Driver Registration

                13.4.2. Device Driver Initialization

                13.4.3. Monitoring I/O Operations

                    13.4.3.1. Polling mode

                                  

                                   13.4.3.2. Interrupt mode

                13.4.4. Accessing the I/O Shared Memory

                    ioremap( ) or ioremap_nocache()

                13.4.5. Direct Memory Access (DMA)

                    13.4.5.1. Synchronous and asynchronous DMA

                        synchronous DMA: the data transfers are triggered by processes

                        asynchronous DMA: the data transfers are triggered by hardware devices

 

                            13.4.5.2. Helper functions for DMA transfers

   

                    13.4.5.3. Bus addresses

 

                    13.4.5.4. Cache coherency

                        Coherent DMA mapping

                                  Streaming DMA mapping

                    13.4.5.5. Helper functions for coherent DMA mappings

                                          dma_alloc_coherent( )/dma_free_coherent( )

 

                    13.4.5.6. Helper functions for streaming DMA mappings

 

                                          dma_map_single( )/dma_unmap_single( )

                13.4.6. Levels of Kernel Support

                    The Linux kernel does not fully support all possible existing I/O devices. Generally speaking, in fact,

                            there are three possible kinds of support for a hardware device:

                     

                            No support at all

                    Minimal support

                            Extended support  

 

                    The ioctl( ) system call was introduced to satisfy such needs:

                            let an application check whether the device is in a specific internal state

 

      13.5. Character Device Drivers

       

        cdev structure

           Table 13-8. The fields of the cdev structure

              Type                                Field                   Description

        struct kobject          kobj            Embedded kobject

        struct module *         owner           Pointer to the module implementing the driver, if any

        struct file_operations *        ops         Pointer to the file operations table of the device driver

        struct list_head            list            Head of the list of inodes relative to device files for this character device

        dev_t                   dev         Initial major and minor numbers assigned to the device driver

        unsigned int                count           Size of the range of device numbers assigned to the device driver

 

        Table 13-9. The fields of the probe object

              Type                                                     Field                   Description

        struct probe *                      next            Next element in hash collision list

        dev_t                               dev         Initial device number (major and minor) of the interval

        unsigned long                       range           Size of the interval

        struct module *                     owner           Pointer to the module that implements the device driver, if any

        struct kobject *(*)(dev_t, int *, void*)            get         Method for probing the owner of the interval

        int (*)(dev_t, void*)                   lock            Method for increasing the reference counter of the owner of the

                                                                    interval

        void *                          data            Private data for the owner of the interval

 

        13.5.1. Assigning Device Numbers

            char_device_struct structure

            Table 13-10. The fields of the char_device_struct descriptor

                     Type                                                     Field                   Description

            struct char_device_struct *                 next            The pointer to next element in hash collision list

            unsigned int                            major           The major number of the interval

            unsigned int                            baseminor       The initial minor number of the interval

            int                             minorct     The interval size

            const char *                            name            The name of the device driver that handles the interval

            struct file_operations *                    fops            Not used

            struct cdev *                       cdev            Pointer to the character device driver descriptor

           

           two ways:

            (1) register_chrdev_region( ) and alloc_chrdev_region( ) functions + cdev_add( )

            (2) register_chrdev( )

 

            13.5.1.1. The register_chrdev_region( ) and alloc_chrdev_region( ) functions

                The __register_chrdev_region( ) function executes the following steps

            13.5.1.2. The register_chrdev( ) function

                (1): __register_chrdev_region( )

                (2):

        13.5.2. Accessing a Character Device Driver

                chrdev_open( )

        13.5.3. Buffering Strategies for Character Devices

                 This can be done by combining two different techniques:

                

                 1:Use of DMA to transfer blocks of data.

 

                 2:Use of a circular buffer of two or more elements, each element having the size of a block of

                 data. When an interrupt occurs signaling that a new block of data has been read, the interrupt

                 handler advances a pointer to the elements of the circular buffer so that further data will be

                 stored in an empty element. Conversely, whenever the driver succeeds in copying a block of

                 data into user address space, it releases an element of the circular buffer so that it is available

                 for saving new data from the hardware device.
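                 The circular buffer of the second technique can be sketched in user space. This is a minimal model under stated assumptions: struct ring, ring_put( ), ring_get( ) and the two size constants are hypothetical names and values, and the synchronization a real driver needs between the interrupt handler and the reader is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS      4   /* number of elements (hypothetical value) */
#define RING_BLOCK_SIZE 8   /* bytes per block (hypothetical value) */

struct ring {
    unsigned char slot[RING_SLOTS][RING_BLOCK_SIZE];
    size_t head;    /* next element the "interrupt handler" fills */
    size_t tail;    /* next element the reader drains */
    size_t used;    /* number of filled elements */
};

/* Producer side (the interrupt handler's role): store one block.
 * Returns 0 on success, -1 if no empty element is available. */
int ring_put(struct ring *r, const unsigned char *block)
{
    assert(r != NULL && block != NULL);
    if (r->used == RING_SLOTS)
        return -1;
    memcpy(r->slot[r->head], block, RING_BLOCK_SIZE);
    r->head = (r->head + 1) % RING_SLOTS;
    r->used++;
    return 0;
}

/* Consumer side (the copy to user space): copy out the oldest block
 * and release its element. Returns 0 on success, -1 if empty. */
int ring_get(struct ring *r, unsigned char *out)
{
    assert(r != NULL && out != NULL);
    if (r->used == 0)
        return -1;
    memcpy(out, r->slot[r->tail], RING_BLOCK_SIZE);
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->used--;
    return 0;
}
```

                 Keeping an explicit used counter distinguishes a full buffer from an empty one, which head and tail alone cannot do when they are equal.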

 

Chapter 14. Block Device Drivers

      14.1. Block Devices Handling

 

           Figure 14-1. Kernel components affected by a block device operation

 

          

 

           Figure 14-2. Typical layout of a page including disk data

        

 

           14.1.1. Sectors

                 the sector is the basic unit of data transfer for the hardware devices

            In Linux, the size of a sector is conventionally set to 512 bytes; sector_t

 

        14.1.2. Blocks

                 the block is the basic unit of data transfer for the VFS

                 Each buffer has a "buffer head" descriptor of type buffer_head.

                 We will give a detailed explanation of all fields of the buffer head in Chapter 15

 

        14.1.3. Segments

            As we'll see, the generic block layer can merge different segments if the corresponding page frames

                 happen to be contiguous in RAM and the corresponding chunks of disk data are adjacent on disk.

                 The larger memory area resulting from this merge operation is called physical segment.

 

                 Yet another merge operation is allowed on architectures that handle the mapping between bus

                 addresses and physical addresses through a dedicated bus circuitry (the IO-MMU; see the section

                 "Direct Memory Access (DMA)" in Chapter 13). The memory area resulting from this kind of merge

                 operation is called hardware segment.

 

 

    14.2. The Generic Block Layer

 

        14.2.1. The Bio Structure

 

            Table 14-1. The fields of the bio structure

              Type                          Field                                      Description

            sector_t            bi_sector                   First sector on disk of block I/O operation

                 struct bio *             bi_next                        Link to the next bio in the request queue

            struct block_device *   bi_bdev                 Pointer to block device descriptor

            unsigned long       bi_flags                    Bio status flags

            unsigned long       bi_rw                       I/O operation flags

            unsigned short      bi_vcnt                 Number of segments in the bio's bio_vec array

            unsigned short      bi_idx                  Current index in the bio's bio_vec array of segments

            unsigned short      bi_phys_segments                Number of physical segments of the bio after merging

            unsigned short      bi_hw_segments              Number of hardware segments after merging

              unsigned int            bi_size                 Bytes (yet) to be transferred

                 unsigned int                  bi_hw_front_size                      Used by the hardware segment merge algorithm

                 unsigned int                  bi_hw_back_size                       Used by the hardware segment merge algorithm

            unsigned int            bi_max_vecs                 Maximum allowed number of segments in the bio's bio_vec array

                 struct bio_vec *            bi_io_vec                           Pointer to the bio's bio_vec array of segments

                 bio_end_io_t *         bi_end_io                           Method invoked at the end of bio's I/O operation

                 atomic_t                bi_cnt                               Reference counter for the bio

                 void *                    bi_private                          Pointer used by the generic block layer and the I/O

                                                                         completion method of the block device driver

                 bio_destructor_t*          bi_destructor                           Destructor method (usually bio_destructor()) invoked when

                                                                         the bio is being freed

              Table 14-2. The fields of the bio_vec structure

                     Type                         Field                                        Description

            struct page *       bv_page                 Pointer to the page descriptor of the segment's page frame

            unsigned int            bv_len                  Length of the segment in bytes

            unsigned int            bv_offset                   Offset of the segment's data in the page frame
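
The relationship between the bio fields of Table 14-1 and the segments of Table 14-2 can be illustrated with a small user-space model. This is only a sketch: the `bio_sketch` and `bio_vec_sketch` types and the `bio_remaining_bytes` helper are hypothetical stand-ins that keep just a few of the fields listed above, and the model assumes that no segment has been partially transferred.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel types; only a few
 * of the fields from Tables 14-1 and 14-2 are kept. */
struct bio_vec_sketch {
    void        *bv_page;    /* stands in for struct page * */
    unsigned int bv_len;     /* length of the segment in bytes */
    unsigned int bv_offset;  /* offset of the data in the page frame */
};

struct bio_sketch {
    unsigned long long     bi_sector;  /* first sector of the operation */
    unsigned short         bi_vcnt;    /* number of segments in bi_io_vec */
    unsigned short         bi_idx;     /* current index in bi_io_vec */
    unsigned int           bi_size;    /* bytes yet to be transferred */
    struct bio_vec_sketch *bi_io_vec;  /* array of segments */
};

/* Sum the lengths of the segments from the current index onward.
 * Assumption of this model: segments are consumed whole. */
static unsigned int bio_remaining_bytes(const struct bio_sketch *bio)
{
    unsigned int total = 0;

    for (unsigned short i = bio->bi_idx; i < bio->bi_vcnt; i++)
        total += bio->bi_io_vec[i].bv_len;
    return total;
}
```

In this model, advancing `bi_idx` past a segment reduces the remaining byte count by that segment's `bv_len`, which is the bookkeeping role that `bi_size` plays in the table.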

 

        14.2.2. Representing Disks and Disk Partitions

 

                 Table 14-3. The fields of the gendisk object

 

              Type                                Field                          Description

            int                 major               Major number of the disk

            int                 first_minor         First minor number associated with the disk

            int                 minors          Range of minor numbers associated with the disk

            char [32]               disk_name           Conventional name of the disk (usually, the canonical

                                                                   name of the corresponding device file)

            struct hd_struct **         part                Array of partition descriptors for the disk

            struct block_device_operations* fops                Pointer to a table of block device methods

            struct request_queue *      queue               Pointer to the request queue of the disk (see "Request

                                                                   Queue Descriptors" later in this chapter)

            void *              private_data            Private data of the block device driver

            sector_t                capacity            Size of the storage area of the disk (in number of sectors)

            int                 flags               Flags describing the kind of disk (see below)

            char [64]               devfs_name          Device filename in the (nowadays deprecated) devfs special filesystem

            int                 number          No longer used

            struct device *         driverfs_dev            Pointer to the device object of the disk's hardware device

                                                                   (see the section "Components of the Device Driver Model" in Chapter 13)

            struct kobject          kobj                Embedded kobject (see the section "Kobjects" in Chapter 13)

            struct timer_rand_state     * random            Pointer to a data structure that records the timing of the

                                                                   disk's interrupts; used by the kernel built-in random number generator

            int                 policy          Set to 1 if the disk is read-only (write operations forbidden), 0 otherwise

            atomic_t                sync_io         Counter of sectors written to disk, used only for RAID

            unsigned long           stamp               Timestamp used to determine disk queue usage statistics

            unsigned long           stamp_idle          Same as above

            int                 in_flight           Number of ongoing I/O operations

            struct disk_stats *         dkstats         Statistics about per-CPU disk usage

 

 

            Table 14-4. The methods of the block devices

                     Method                            Triggers

            open                    Opening the block device file

            release             Closing the last reference to a block device file

            ioctl                   Issuing an ioctl( ) system call on the block device file (uses the big kernel lock)

            compat_ioctl                Issuing an ioctl( ) system call on the block device file (does not use the big kernel lock)

            media_changed           Checking whether the removable media has been changed (e.g., floppy disk)

            revalidate_disk         Checking whether the block device holds valid data

 

 

        14.2.3. Submitting a Request

                 generic_make_request()

 

    14.3. The I/O Scheduler

 

        14.3.1. Request Queue Descriptors

           struct request_queue *  queue

        Table 14-6. The fields of the request queue descriptor

         Type                                Field                   Description

        struct list_head            queue_head      List of pending requests

        struct request *            last_merge        Pointer to descriptor of the request in the queue to be considered first for possible merging

           elevator_t *                elevator        Pointer to the elevator object (see the later section "I/O Scheduling Algorithms")

           struct request_list         rq              Data structure used for allocation of request descriptors

        request_fn_proc *           request_fn      Method that implements the entry point of the strategy routine of the driver

           merge_request_fn*           back_merge_fn   Method to check whether it is possible to merge a bio to the last request in the queue

           merge_requests_fn *         merge_requests_fn   Method to attempt merging two adjacent requests in the queue

        make_request_fn *           make_request_fn Method invoked when a new request has to be inserted in the queue

        prep_rq_fn *                prep_rq_fn        Method to build the commands to be sent to the hardware device to process this request

        unplug_fn *             unplug_fn       Method to unplug the block device (see the section "Activating the Block Device Driver" later in the chapter)

        merge_bvec_fn *         merge_bvec_fn       Method that returns the number of bytes that can be inserted into an existing bio when adding a new segment (usually undefined)

        activity_fn *           activity_fn     Method invoked when a request is added to a queue (usually undefined)

        issue_flush_fn *            issue_flush_fn Method invoked when a request queue is flushed (the queue is emptied by processing all requests in a row)

        struct timer_list           unplug_timer        Dynamic timer used to perform device plugging (see the later section "Activating the Block Device Driver")

        int                 unplug_thresh       If the number of pending requests in the queue exceeds this value, the device is immediately unplugged (default is 4)

        unsigned long           unplug_delay        Time delay before device unplugging (default is 3 milliseconds)

        struct work_struct          unplug_work     Work queue used to unplug the device (see the later section "Activating the Block Device Driver")

           struct backing_dev_info       backing_dev_info     See the text following this table

           void *                         queuedata        Pointer to private data of the block device driver

           void *                         activity_data           Private data used by the activity_fn method

           unsigned long                     bounce_pfn       Page frame number above which buffer bouncing must be used (see the section "Submitting a Request" later in

                                                        this chapter)

           int                         bounce_gfp       Memory allocation flags for bounce buffers

           unsigned long                     queue_flags            Set of flags describing the queue status

           spinlock_t *                       queue_lock       Pointer to request queue lock

 

           struct kobject               kobj                Embedded kobject for the request queue

           unsigned long                     nr_requests            Maximum number of requests in the queue

           unsigned int                       nr_congestion_on     Queue is considered congested if the number of pending requests rises above this threshold

           unsigned int                       nr_congestion_off    Queue is considered not congested if the number of pending requests falls below this threshold

           unsigned int                       nr_batching       Maximum number (usually 32) of pending requests that can be submitted even when the queue is full by a

                                                        special "batcher" process

           unsigned short              max_sectors           Maximum number of sectors handled by a single request (tunable)

           unsigned short              max_hw_sectors      Maximum number of sectors handled by a single request (hardware constraint)

           unsigned short              max_phys_segments      Maximum number of physical segments handled by a single request

           unsigned short              max_hw_segments  Maximum number of hardware segments handled by a single request (the maximum number of distinct

                                                        memory areas in a scatter-gather DMA operation)

           unsigned short              hardsect_size    Size in bytes of a sector

           unsigned int                  max_segment_size Maximum size of a physical segment (in bytes)

           unsigned long                     seg_boundary_mask      Memory boundary mask for segment merging

           unsigned int                       dma_alignment Alignment bitmap for initial address and length of DMA buffers (default 511)

           struct blk_queue_tag *         queue_tags       Bitmap of free/busy tags (used for tagged requests)

           atomic_t                      refcnt              Reference counter of the queue

           unsigned int                       in_flight         Number of pending requests in the queue

           unsigned int                       sg_timeout        User-defined command time-out (used only by SCSI generic devices)

           unsigned int                       sg_reserved_size     Essentially unused

           struct list_head             drain_list           Head of a list of requests temporarily delayed until the I/O scheduler is dynamically replaced
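
The two thresholds nr_congestion_on and nr_congestion_off in Table 14-6 describe a hysteresis: the queue becomes congested above one value and stops being congested only below a lower one. A minimal user-space sketch of that rule follows; the `queue_sketch` type, the `queue_update` helper, and the threshold values used with it are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the congestion-related fields of a request queue. */
struct queue_sketch {
    unsigned int nr_congestion_on;   /* congested above this value */
    unsigned int nr_congestion_off;  /* not congested below this value */
    unsigned int pending;            /* number of pending requests */
    bool         congested;          /* current congestion state */
};

/* Record a new number of pending requests and update the state.
 * Between the two thresholds the previous state is kept. */
static void queue_update(struct queue_sketch *q, unsigned int pending)
{
    q->pending = pending;
    if (pending > q->nr_congestion_on)
        q->congested = true;
    else if (pending < q->nr_congestion_off)
        q->congested = false;
}
```

Keeping the two thresholds apart avoids flipping the state back and forth when the number of pending requests hovers around a single limit.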

 

 

           14.3.2. Request Descriptors

           request data structure

           Table 14-7. The fields of the request descriptor

         Type                         Field                          Description

        struct list_head        queuelist           Pointers for request queue list

        unsigned long       flags               Flags of the request (see below)

        sector_t            sector          Number of the next sector to be transferred

        unsigned long       nr_sectors          Number of sectors yet to be transferred in the whole request

           unsigned int                  current_nr_sectors        Number of sectors in the current segment of the current bio yet to be transferred

        sector_t            hard_sector         Number of the next sector to be transferred

        unsigned long       hard_nr_sectors     Number of sectors yet to be transferred in the whole request (updated by the generic block layer)

        unsigned int            hard_cur_sectors        Number of sectors in the current segment of the current bio yet to be transferred

                                                        (updated by the generic block layer)

        struct bio *            bio             First bio in the request that has not been completely transferred

        struct bio *            biotail         Last bio in the request list

        void *          elevator_private        Pointer to private data for the I/O scheduler

        int             rq_status               Request status: essentially, either RQ_ACTIVE or RQ_INACTIVE

           struct gendisk *            rq_disk             The descriptor of the disk referenced by the request

           int                         errors              Counter for the number of I/O errors that occurred on the current transfer

           unsigned long               start_time              Request's starting time (in jiffies)

           unsigned short        nr_phys_segments        Number of physical segments of the request

           unsigned short        nr_hw_segments           Number of hardware segments of the request

           int                   tag                  Tag associated with the request (only for hardware devices supporting multiple outstanding data transfers)

           char *              buffer                    Pointer to the memory buffer of the current data transfer (NULL if the buffer is in high-memory)

           int                   ref_count               Reference counter for the request

           request_queue_t *         q                     Pointer to the descriptor of the request queue containing the request

           struct request_list*        rl                     Pointer to request_list data structure

           struct completion*         waiting                   Completion for waiting for the end of the data transfers (see the section "Completions" in Chapter 5)

           void *                    special                   Pointer to data used when the request includes a "special" command to the hardware device

 

           unsigned int                  cmd_len                 Length of the commands in the cmd field

           unsigned char []      cmd                      Buffer containing the pre-built commands prepared by the request queue's prep_rq_fn method

           unsigned int                  data_len                 Usually, the length of data in the buffer pointed to by the data field

           void *                    data                      Pointer used by the device driver to keep track of the data to be transferred

           unsigned int                  sense_len               Length of buffer pointed to by the sense field (0 if the sense field is NULL)

           void *                    sense                    Pointer to buffer used for output of sense commands

           unsigned int                  timeout                  Request's time-out

           struct    request_pm_state*  pm                  Pointer to a data structure used for power-management commands

 

 

 

 

           The flags field stores a large number of flags, which are listed in Table 14-8.

 

 

                 14.3.2.1. Managing the allocation of request descriptors

                 blk_get_request( )

 

            blk_put_request( )

 

 

            14.3.2.2. Avoiding request queue congestion

                     blk_congestion_wait( )

 

        14.3.3. Activating the Block Device Driver

 

                 The blk_plug_device( ) function plugs a block device or, more precisely, a request queue serviced by some block device driver.

 

                 The blk_remove_plug( ) function unplugs a request queue q
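
The plugging rule suggested by the unplug_thresh and unplug_delay fields of Table 14-6 can be sketched in user space as follows. This is an illustrative model only: `plug_sketch`, `plug_queue`, and `should_unplug` are hypothetical names, time is passed in as plain milliseconds, and the comparison against the threshold follows the wording of the table ("exceeds") rather than the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pluggable request queue. */
struct plug_sketch {
    bool          plugged;          /* true while the queue is plugged */
    unsigned int  pending;          /* number of pending requests */
    unsigned int  unplug_thresh;    /* default 4 according to Table 14-6 */
    unsigned long unplug_delay_ms;  /* default 3 according to Table 14-6 */
    unsigned long plugged_at_ms;    /* time at which the queue was plugged */
};

/* Plug the queue if it is not plugged already and remember when. */
static void plug_queue(struct plug_sketch *q, unsigned long now_ms)
{
    if (!q->plugged) {
        q->plugged = true;
        q->plugged_at_ms = now_ms;
    }
}

/* A plugged queue is unplugged when too many requests are pending
 * or when the unplug delay has elapsed. */
static bool should_unplug(const struct plug_sketch *q, unsigned long now_ms)
{
    if (!q->plugged)
        return false;
    if (q->pending > q->unplug_thresh)
        return true;
    return now_ms - q->plugged_at_ms >= q->unplug_delay_ms;
}
```

The short delay gives adjacent requests a chance to accumulate and be merged before the driver's strategy routine is activated.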

 

 

           14.3.4. I/O Scheduling Algorithms

 

                 Linux 2.6 offers four different types of I/O schedulers, also called elevators:

                 "Anticipatory," "Deadline," "CFQ (Complete Fairness Queueing)," and "Noop (No Operation)."

          

 

                 The I/O scheduler algorithm used in a request queue is represented by an elevator object of type

                 elevator_t; its address is stored in the elevator field of the request queue descriptor
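
Representing the scheduling algorithm as an object lets the queue code stay the same while the policy changes. The user-space sketch below shows the idea with a table holding one function pointer. Everything in it is hypothetical: `elevator_sketch` is not the layout of elevator_t, `noop_add` simply appends requests in arrival order, and `sorted_add` is a generic sector-ordered insertion rather than any of the four real elevators.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_MAX 16

/* Hypothetical dispatch queue holding only the starting sectors. */
struct dispatch_queue {
    unsigned long sectors[QUEUE_MAX];
    size_t        count;
};

/* Hypothetical elevator object: a name plus one policy method. */
struct elevator_sketch {
    const char *name;
    int (*add_request)(struct dispatch_queue *q, unsigned long sector);
};

/* Arrival-order policy: append at the tail, no sorting. */
static int noop_add(struct dispatch_queue *q, unsigned long sector)
{
    if (q->count == QUEUE_MAX)
        return -1;
    q->sectors[q->count++] = sector;
    return 0;
}

/* Sector-ordered policy: insertion sort by starting sector. */
static int sorted_add(struct dispatch_queue *q, unsigned long sector)
{
    if (q->count == QUEUE_MAX)
        return -1;
    size_t i = q->count;
    while (i > 0 && q->sectors[i - 1] > sector) {
        q->sectors[i] = q->sectors[i - 1];
        i--;
    }
    q->sectors[i] = sector;
    q->count++;
    return 0;
}

static const struct elevator_sketch noop_elevator   = { "noop",   noop_add };
static const struct elevator_sketch sorted_elevator = { "sorted", sorted_add };
```

Swapping the policy then amounts to pointing the queue at a different `elevator_sketch`, which mirrors the role of the elevator field in the request queue descriptor.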

 

                 14.3.4.1. The "Noop" elevator

 

                 14.3.4.2. The "CFQ" elevator

 

                 14.3.4.3. The "Deadline" elevator

                

                 14.3.4.4. The "Anticipatory" elevator

 

        14.3.5. Issuing a Request to the I/O Scheduler

 

                 __make_request( )

            14.3.5.1. The blk_queue_bounce( ) function

 

    14.4. Block Device Drivers

 

            14.4.1. Block Devices

            block_device descriptor,

            Table 14-9. The fields of the block device descriptor

            Type                         Field                                 Description

            dev_t               bd_dev              Major and minor numbers of the block device

            struct inode *      bd_inode                Pointer to the inode of the file associated with the block device in the bdev filesystem

            int             bd_openers              Counter of how many times the block device has been opened

            struct semaphore        bd_sem                  Semaphore protecting the opening and closing of the block device

            struct semaphore        bd_mount_sem                Semaphore used to forbid new mounts on the block device

            struct list_head        bd_inodes               Head of a list of inodes of opened block device files for this block device

            void *          bd_holder               Current holder of block device descriptor

            int             bd_holders              Counter for multiple settings of the bd_holder field

            struct block_device *   bd_contains             If block device is a partition, it points to the block device descriptor of the whole disk;

                                                                   otherwise, it points to this block device descriptor

            unsigned            bd_block_size           Block size

            struct hd_struct*       bd_part             Pointer to partition descriptor (NULL if this block device is not a partition)

            unsigned                bd_part_count               Counter of how many times partitions included in this block device have been opened

            int                   bd_invalidated          Flag set when the partition table on this block device needs to be read

                 struct gendisk *            bd_disk             Pointer to gendisk structure of the disk underlying this block device

                 struct list_head *           bd_list             Pointers for the block device descriptor list

                

            struct backing_dev_info* bd_inode_backing_dev_info       Pointer to a specialized backing_dev_info descriptor for this block device (usually NULL)

 

                 unsigned long                bd_private              Pointer to private data of the block device holder

 

            Figure 14-3. Linking the block device descriptors with the other structures of the block subsystem

 

        

 

                14.4.1.1. Accessing a block device

 

              14.4.2. Device Driver Registration and Initialization

 

                 14.4.2.1. Defining a custom driver descriptor

                     First of all, the device driver needs a custom descriptor foo of type foo_dev_t holding the data required to drive the hardware device.   

                     struct foo_dev_t {
                         [...]
                         spinlock_t lock;
                         struct gendisk *gd;
                         [...]
                     } foo;

 

              register_blkdev()

            14.4.2.2. Initializing the custom descriptor

                            alloc_disk

                     14.4.2.3. Initializing the gendisk descriptor

                            blk_init_queue( )

                     14.4.2.4. Initializing the table of block device methods

                    

                     14.4.2.5. Allocating and initializing a request queue

 

                     14.4.2.6. Setting up the interrupt handler

 

                     14.4.2.7. Registering the disk

                            add_disk( )

 

        14.4.3. The Strategy Routine

                      blk_init_queue(foo_strategy)

           14.4.4. The Interrupt Handler   

 

 

       14.5. Opening a Block Device File

 

 

              Table 14-10. The default block device file operations (def_blk_fops table)

              Method                                          Function

        open                            blkdev_open( )

        release                     blkdev_close( )

        llseek                      block_llseek( )

        read                            generic_file_read( )

        write                           blkdev_file_write( )

        aio_read                        generic_file_aio_read( )

        aio_write                       blkdev_file_aio_write( )

        mmap                            generic_file_mmap( )

        fsync                           block_fsync( )

        ioctl                           block_ioctl( )

        compat_ioctl                        compat_blkdev_ioctl( )

        readv                           generic_file_readv( )

        writev                      generic_file_write_nolock( )

        sendfile                        generic_file_sendfile( )

 

Chapter 15. The Page Cache

 

      15.1. The Page Cache

         Kernel designers have implemented the page cache to fulfill two main requirements:

                 (1) Quickly locate a specific page containing data relative to a given owner. To take the maximum advantage from the page cache, searching it should be a very fast operation.

                 (2) Keep track of how every page in the cache should be handled when reading or writing its content. For instance, reading a page from a regular file, a block device file, or a swap area must be performed in different ways, thus the kernel must select the proper operation depending on the page's owner.
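
Requirement (2) amounts to per-owner dispatch through a table of methods, which is the role of the readpage and writepage entries listed in Table 15-2. A minimal user-space sketch follows. The `aops_sketch`, `owner_sketch`, and `page_sketch` types and the three `*_readpage_sketch` functions are hypothetical; they only record which kind of owner handled the read.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical page: just a small buffer to show who filled it. */
struct page_sketch {
    char data[16];
};

/* Hypothetical method table, modeled on the idea of a_ops. */
struct aops_sketch {
    int (*readpage)(struct page_sketch *page);
};

/* Hypothetical owner: holds a pointer to its method table. */
struct owner_sketch {
    const struct aops_sketch *a_ops;
};

static int file_readpage_sketch(struct page_sketch *page)
{
    strcpy(page->data, "regular file");
    return 0;
}

static int blkdev_readpage_sketch(struct page_sketch *page)
{
    strcpy(page->data, "block device");
    return 0;
}

static int swap_readpage_sketch(struct page_sketch *page)
{
    strcpy(page->data, "swap area");
    return 0;
}

static const struct aops_sketch file_aops   = { file_readpage_sketch };
static const struct aops_sketch blkdev_aops = { blkdev_readpage_sketch };
static const struct aops_sketch swap_aops   = { swap_readpage_sketch };

/* The caller never needs to know what kind of owner it is reading from. */
static int read_owner_page(const struct owner_sketch *owner,
                           struct page_sketch *page)
{
    return owner->a_ops->readpage(page);
}
```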

         15.1.1. The address_space Object

        The core data structure of the page cache is the address_space object.

         Each page descriptor includes two fields called mapping and index:

                 The first field points to the address_space object of the inode that owns the page.

                 The second field specifies the offset in page-size units within the owner's "address space," that is, the position of the page's data inside the owner's disk image.

                 These two fields are used when looking for a page in the page cache.
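
Because index is expressed in page-size units, it can be derived from a byte offset with a shift. The sketch below assumes 4 KB pages; `PAGE_SHIFT_SKETCH`, `offset_to_index`, and `offset_in_page` are hypothetical names chosen for illustration.

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12                        /* assumes 4 KB pages */
#define PAGE_SIZE_SKETCH  (1UL << PAGE_SHIFT_SKETCH)

/* Page index of the page that contains the given byte offset. */
static unsigned long offset_to_index(unsigned long long file_offset)
{
    return (unsigned long)(file_offset >> PAGE_SHIFT_SKETCH);
}

/* Position of the byte inside that page. */
static unsigned long offset_in_page(unsigned long long file_offset)
{
    return (unsigned long)(file_offset & (PAGE_SIZE_SKETCH - 1));
}
```

For example, byte 10000 of an owner's data falls in the page with index 2, at offset 1808 inside that page.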

         address_space object

                 Table 15-1. The fields of the address_space object

              Type                         Field                          Description

            struct inode *      host                Pointer to the inode hosting this object, if any

            struct radix_tree_root page_tree           Root of radix tree identifying the owner's pages

            spinlock_t          tree_lock           Spin lock protecting the radix tree

            unsigned int            i_mmap_writable     Number of shared memory mappings in the address space

            struct prio_tree_root   i_mmap          Root of the radix priority search tree (see Chapter 17)

            struct list_head        i_mmap_nonlinear        List of non-linear memory regions in the address space

            spinlock_t          i_mmap_lock         Spin lock protecting the radix priority search tree

            unsigned int            truncate_count      Sequence counter used when truncating the file

            unsigned long       nrpages         Total number of owner's pages

            unsigned long       writeback_index     Page index of the last write-back operation on the owner's pages

            struct address_space_operations *   a_ops           Methods that operate on the owner's pages

                 unsigned long               flags                     Error bits and memory allocator flags

                 struct backing_dev_info *     backing_dev_info           Pointer to the backing_dev_info of the block device holding the data of this owner

                 spinlock_t               private_lock            Usually, spin lock used when managing the private_list list

                 struct list_head        private_list              Usually, a list of dirty buffers of indirect blocks associated with the inode

                 struct address_space *   assoc_mapping        Usually, pointer to the address_space object of the block device including the indirect blocks 

         Table 15-2. The methods of the address_space object

                     Method                     Description

            writepage           Write operation (from the page to the owner's disk image)

            readpage            Read operation (from the owner's disk image to the page)

            sync_page           Start the I/O data transfer of already scheduled operations on owner's pages

                 writepages             Write back to disk a given number of dirty owner's pages

                 set_page_dirty        Set an owner's page as dirty

                 readpages              Read a list of owner's pages from disk

            prepare_write       Prepare a write operation (used by disk-based filesystems)

            commit_write            Complete a write operation (used by disk-based filesystems)

            bmap                Get a logical block number from a file block index

            invalidatepage      Invalidate owner's pages (used when truncating the file)

            releasepage         Used by journaling filesystems to prepare the release of a page

            direct_IO           Direct I/O transfer of the owner's pages (bypassing the page cache)

 

 

                 15.1.2. The Radix Tree

 

         15.1.3. Page Cache Handling Functions

                            15.1.3.1. Finding a page

                            find_get_page( )

                find_get_pages( )

 

                15.1.3.2. Adding a page

                            The add_to_page_cache( ) function inserts a new page descriptor in the page cache

                      radix_tree_insert( )

                      15.1.3.3. Removing a page

                            The remove_from_page_cache( ) function removes a page descriptor from the page cache

                      radix_tree_delete( ) 

                            15.1.3.4. Updating a page

                            The read_cache_page( ) function ensures that the cache includes an up-to-date version of a given page.

                    

            15.1.4. The Tags of the Radix Tree

                      The radix_tree_tag_set( ) function is invoked when setting the PG_dirty or the PG_writeback flag of a cached page;

                      The radix_tree_tag_clear( ) function is invoked when clearing the PG_dirty or the PG_writeback flag of a cached page;                       
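
The point of the tags is that a node's tag bit summarizes its whole subtree, so a search for dirty or writeback pages can skip subtrees whose bit is clear. The two-level user-space model below sketches that propagation. It is a simplification under stated assumptions: nodes have 4 slots instead of the kernel's 64, the tree has a fixed height, and `root_sketch`, `tag_set`, `tag_clear`, and `any_tagged` are hypothetical names.

```c
#include <assert.h>

#define SLOTS_PER_NODE 4u                            /* the kernel uses 64 */
#define MAX_INDEX      (SLOTS_PER_NODE * SLOTS_PER_NODE)

enum tag_sketch { TAG_DIRTY, TAG_WRITEBACK, NR_TAGS };

/* Leaf node: bit i of tags[t] is set if slot i carries tag t. */
struct leaf_sketch {
    unsigned int tags[NR_TAGS];
};

/* Root node: bit j of tags[t] is set if leaf j has any slot with tag t. */
struct root_sketch {
    struct leaf_sketch leaves[SLOTS_PER_NODE];
    unsigned int       tags[NR_TAGS];
};

/* Tag a page index and propagate the tag up to the root. */
static void tag_set(struct root_sketch *r, unsigned int index,
                    enum tag_sketch tag)
{
    if (index >= MAX_INDEX)
        return;
    unsigned int leaf = index / SLOTS_PER_NODE;
    unsigned int slot = index % SLOTS_PER_NODE;

    r->leaves[leaf].tags[tag] |= 1u << slot;
    r->tags[tag] |= 1u << leaf;
}

/* Clear a tag; the root bit is cleared only when the leaf has no
 * other slot carrying the same tag. */
static void tag_clear(struct root_sketch *r, unsigned int index,
                      enum tag_sketch tag)
{
    if (index >= MAX_INDEX)
        return;
    unsigned int leaf = index / SLOTS_PER_NODE;
    unsigned int slot = index % SLOTS_PER_NODE;

    r->leaves[leaf].tags[tag] &= ~(1u << slot);
    if (r->leaves[leaf].tags[tag] == 0)
        r->tags[tag] &= ~(1u << leaf);
}

/* One look at the root tells whether any page carries the tag. */
static int any_tagged(const struct root_sketch *r, enum tag_sketch tag)
{
    return r->tags[tag] != 0;
}
```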

 

      15.2. Storing Blocks in the Page Cache

            Formally, a buffer page is a page of data associated with additional descriptors called "buffer heads,"

                 whose main purpose is to quickly locate the disk address of each individual block in the page. In

                 fact, the chunks of data stored in a page belonging to the page cache are not necessarily adjacent on disk.

            15.2.1. Block Buffers and Buffer Heads

                      buffer_head

 

                Table 15-4. The fields of a buffer head

             

              Type                         Field                                 Description

            unsigned long           b_state             Buffer status flags

            struct buffer_head *    b_this_page             Pointer to the next element in the buffer page's list

            struct page *       b_page              Pointer to the descriptor of the buffer page holding this block

            atomic_t            b_count             Block usage counter

            u32             b_size              Block size

            sector_t            b_blocknr               Block number relative to the block device (logical block number)

            char *          b_data              Position of the block inside the buffer page

            struct block_device *   b_bdev              Pointer to block device descriptor

            bh_end_io_t *           b_end_io            I/O completion method

            void *              b_private               Pointer to data for the I/O completion method

            struct list_head        b_assoc_buffers         Pointers for the list of indirect blocks associated with an inode (see the section "The address_space Object" earlier in this chapter)
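
The b_this_page, b_data, and b_blocknr fields are enough to describe how one page is carved into blocks. The user-space sketch below links the buffer heads of a single page into a ring. It assumes 4 KB pages; `bh_sketch` and `init_page_buffers` are hypothetical, and the block numbers are supplied by the caller precisely because the blocks of a page need not be adjacent on disk.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumes 4 KB pages */

/* Hypothetical buffer head keeping four of the fields of Table 15-4. */
struct bh_sketch {
    struct bh_sketch *b_this_page;  /* next buffer head of the same page */
    char             *b_data;       /* position of the block in the page */
    unsigned int      b_size;       /* block size */
    unsigned long     b_blocknr;    /* logical block number */
};

/* Set up the buffer heads for one page and link them in a ring.
 * Returns the number of buffers in the page, or 0 on bad arguments. */
static unsigned int init_page_buffers(struct bh_sketch *bh, unsigned int max,
                                      char *page, unsigned int block_size,
                                      const unsigned long *blocknr)
{
    if (block_size == 0 || block_size > PAGE_SIZE_SKETCH)
        return 0;

    unsigned int n = PAGE_SIZE_SKETCH / block_size;
    if (n > max)
        return 0;

    for (unsigned int i = 0; i < n; i++) {
        bh[i].b_data = page + (size_t)i * block_size;
        bh[i].b_size = block_size;
        bh[i].b_blocknr = blocknr[i];
        bh[i].b_this_page = &bh[(i + 1) % n];  /* last one points to first */
    }
    return n;
}
```

With 1,024-byte blocks this yields four buffer heads per page, which matches the layout shown in Figure 15-2.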

 

 

                 15.2.2. Managing the Buffer Heads

                 The buffer heads have their own slab allocator cache, whose kmem_cache_s descriptor is stored in the bh_cachep variable.

                 The alloc_buffer_head( ) and free_buffer_head( ) functions are used to get and release a buffer head, respectively.

 

                 __getblk( ) / __bforget( )

 

                 15.2.3. Buffer Pages

                 Figure 15-2. A buffer page including four buffers and their buffer heads

        

 

 

                 15.2.4. Allocating Block Device Buffer Pages

                      grow_dev_page( )

                 15.2.5. Releasing Block Device Buffer Pages

                      try_to_release_page( )

                 15.2.6. Searching Blocks in the Page Cache

                      15.2.6.1. The __find_get_block( ) function

                      15.2.6.2. The __getblk( ) function

                      15.2.6.3. The __bread( ) function

          

                 15.2.7. Submitting Buffer Heads to the Generic Block Layer

                      15.2.7.1. The submit_bh( ) function

                            15.2.7.2. The ll_rw_block( ) function

 

      15.3. Writing Dirty Pages to Disk

            15.3.1. The pdflush Kernel Threads

                 Each pdflush kernel thread has a pdflush_work descriptor (see Table 15-6). The descriptors of idle

                 pdflush kernel threads are collected in the pdflush_list list; the pdflush_lock spin lock protects that

                 list from concurrent accesses in multiprocessor systems. The nr_pdflush_threads variable stores

                 the total number of pdflush kernel threads (idle and busy). Finally, the last_empty_jifs variable

                 stores the last time (in jiffies) at which the pdflush_list list of pdflush threads became empty.

 

                 Table 15-6. The fields of the pdflush_work descriptor

              Type                        Field                          Description

            struct task_struct *    who             Pointer to kernel thread descriptor

            void(*)(unsigned long) fn              Callback function to be executed by the kernel thread

            unsigned long       arg0                Argument to callback function

            struct list_head        list                Links for the pdflush_list list

                 unsigned long               when_i_went_to_sleep        Time in jiffies when kernel thread became available

 

            15.3.2. Looking for Dirty Pages To Be Flushed

                 The wakeup_bdflush( ) function receives as argument the number of dirty pages in the page cache that should be flushed;

                 The background_writeout( ) function acts on a single parameter: nr_pages, the minimum number of pages that should be flushed to disk

                 15.3.3. Retrieving Old Dirty Pages

                 The job of retrieving old dirty pages is delegated to a pdflush kernel thread that is periodically

                 woken up. During the kernel initialization, the page_writeback_init( ) function sets up the wb_timer

                 dynamic timer so that it decays after dirty_writeback_centisecs hundredths of a second (usually 500,

                 but this value can be adjusted by writing in the /proc/sys/vm/dirty_writeback_centisecs file). The

                 timer function, which is called wb_timer_fn( ), essentially invokes the pdflush_operation( )

                 function passing to it the address of the wb_kupdate( ) callback function.
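To make the period arithmetic concrete, here is a minimal sketch that converts a dirty_writeback_centisecs value into timer ticks. The helper name centisecs_to_ticks and the tick rate EXAMPLE_HZ are assumptions made for this example; they are not kernel identifiers, and the real tick rate depends on how the kernel was built.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical tick rate for this example; the real value is build-dependent. */
#define EXAMPLE_HZ 1000UL

/* Convert hundredths of a second into timer ticks at EXAMPLE_HZ. */
static unsigned long centisecs_to_ticks(unsigned long centisecs)
{
    /* Guard against overflow in the multiplication below. */
    assert(centisecs <= ULONG_MAX / EXAMPLE_HZ);
    return centisecs * EXAMPLE_HZ / 100;
}
```

With the usual value of 500, the timer would be armed 5000 ticks ahead at the assumed tick rate, that is, five seconds.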

 

    15.4. The sync( ), fsync( ), and fdatasync( ) System Calls

            15.4.1. The sync ( ) System Call

                 The service routine sys_sync( ) of the sync( ) system call invokes a series of auxiliary functions:

           wakeup_bdflush(0);

           sync_inodes(0);

           sync_supers( );

           sync_filesystems(0);

           sync_filesystems(1);

           sync_inodes(1);   

     

                 15.4.2. The fsync ( ) and fdatasync ( ) System Calls

            The fsync( ) system call forces the kernel to write to disk all dirty buffers that belong to the file

                 specified by the fd file descriptor parameter (including the buffer containing its inode, if necessary).

                 The corresponding service routine derives the address of the file object and then invokes the fsync

                 method. Usually, this method ends up invoking the __writeback_single_inode( ) function to write

                 back both the dirty pages associated with the selected inode and the inode itself (see the section

                 "Looking for Dirty Pages To Be Flushed" earlier in this chapter).

                 The fdatasync( ) system call is very similar to fsync( ), but writes to disk only the buffers that

                 contain the file's data, not those that contain inode information. Because Linux 2.6 does not have a

                 specific file method for fdatasync( ), this system call uses the fsync method and is thus identical to fsync( ).
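From User Mode, the two calls are used in the same way. The following is a minimal sketch, not taken from the book: the helper name write_durably and its data_only flag are hypothetical, and error handling is reduced to a single return code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the string buf to path and force it to disk.
 * If data_only is nonzero, fdatasync() is used instead of fsync().
 * Returns 0 on success, -1 on any failure.
 */
static int write_durably(const char *path, const char *buf, int data_only)
{
    assert(path != NULL && buf != NULL);

    size_t len = strlen(buf);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* Both calls block until the kernel reports the flush as complete. */
    int rc = data_only ? fdatasync(fd) : fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller that only cares about file contents would pass a nonzero data_only; on the kernel version described here the two paths behave the same, as noted above.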

 

 

Chapter 16. Accessing Files

         There are many different ways to access a file. In this chapter we will consider the following cases:

         1)   Canonical mode

         2)   Synchronous mode

         3)   Memory mapping mode

         4)   Direct I/O mode

         5)   Asynchronous mode

     

        16.1. Reading and Writing a File

               

            16.1.1. Reading from a File

                generic_file_read( )

                The first descriptor is stored in the local variable local_iov of type iovec; it contains the address (buf) and the length (count) of the User

                      Mode buffer that shall receive the data read from the file.

                The second descriptor is stored in the local variable kiocb of type kiocb; it is used to keep track of the completion status of an ongoing

                      synchronous or asynchronous I/O operation.

                16.1.1.1. The readpage method for regular files

                int ext3_readpage(struct file *file, struct page *page)

              {

              return mpage_readpage(page, ext3_get_block);

              }

                The mpage_readpage( ) function chooses between two different strategies when reading a page from disk.

                16.1.1.2. The readpage method for block device files

                It is implemented by the blkdev_readpage( ) function,

                      which calls block_read_full_page( ):

              int blkdev_readpage(struct file *file, struct page *page)

              {

              return block_read_full_page(page, blkdev_get_block);

              }

        16.1.2. Read-Ahead of Files

                Read-ahead consists of reading several adjacent pages of data of a regular file or block device file

                      before they are actually requested.

 

                The main data structure used by the read-ahead algorithm is the file_ra_state descriptor whose

                      fields are listed in Table 16-3. Each file object includes such a descriptor in its f_ra field.

                Table 16-3. The fields of the file_ra_state descriptor

                Type             Field          Description
                unsigned long    start          Index of first page in the current window
                unsigned long    size           Number of pages included in the current window (-1 for read-ahead temporarily disabled, 0 for empty current window)
                unsigned long    flags          Flags used to control the read-ahead
                unsigned long    cache_hit      Number of consecutive cache hits (pages requested by the process and found in the page cache)
                unsigned long    prev_page      Index of the last page requested by the process
                unsigned long    ahead_start    Index of the first page in the ahead window
                unsigned long    ahead_size     Number of pages in the ahead window (0 for an empty ahead window)
                unsigned long    ra_pages       Maximum size in pages of a read-ahead window (0 for read-ahead permanently disabled)
                unsigned long    mmap_hit       Read-ahead hit counter (used for memory mapped files)
                unsigned long    mmap_miss      Read-ahead miss counter (used for memory mapped files)

                16.1.2.1. The page_cache_readahead( ) function

 

                            Figure 16-1. The flow diagram of the page_cache_readahead( ) function

 

           

 

                16.1.2.2. The handle_ra_miss( ) function

             

              16.1.3. Writing to a File

                 Many filesystems (including Ext2 and JFS) implement the write method of the file object by means of

                 the generic_file_write( ) function, which acts on the following parameters:

 

            The __generic_file_aio_write_nolock( ) function receives four parameters

 

                 16.1.3.1. The prepare_write and commit_write methods for regular files

 

                     16.1.3.2. The prepare_write and commit_write methods for block device files

        16.1.4. Writing Dirty Pages to Disk

            int ext2_writepages(struct address_space *mapping,

           struct writeback_control *wbc)

           {

           return mpage_writepages(mapping, wbc, ext2_get_block);

           }

           The mpage_writepages( ) function essentially performs the following actions:

 

    16.2. Memory Mapping      

            As already mentioned in the section "Memory Regions" in Chapter 9, a memory region can be

                 associated with some portion of either a regular file in a disk-based filesystem or a block device file.

                 This means that an access to a byte within a page of the memory region is translated by the kernel

                 into an operation on the corresponding byte of the file. This technique is called memory mapping.

            Two kinds of memory mapping exist:

                 Shared

 

                 Private
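The practical difference between the two kinds can be seen from User Mode: a write through a private mapping stays in the process's own copy of the page, while a write through a shared mapping reaches the file. The sketch below is only an illustration; poke_first_byte and first_byte are hypothetical helper names, and the shared case calls msync( ) before unmapping so that the change is flushed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first len bytes of path with the given kind (MAP_SHARED or
 * MAP_PRIVATE), overwrite byte 0 with value, then unmap.
 * Returns 0 on success, -1 on failure.
 */
static int poke_first_byte(const char *path, size_t len, int kind, char value)
{
    assert(kind == MAP_SHARED || kind == MAP_PRIVATE);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, kind, fd, 0);
    close(fd);                      /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return -1;

    p[0] = value;                   /* private: triggers a copy of the page */

    int rc = 0;
    if (kind == MAP_SHARED)
        rc = msync(p, len, MS_SYNC);    /* flush the dirty page to the file */
    if (munmap(p, len) < 0)
        rc = -1;
    return rc;
}

/* Read the first byte of the file through the ordinary read() path. */
static int first_byte(const char *path)
{
    char c;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, &c, 1);
    close(fd);
    return n == 1 ? (unsigned char)c : -1;
}
```

Calling poke_first_byte( ) with MAP_PRIVATE leaves the value returned by first_byte( ) unchanged, whereas the same call with MAP_SHARED changes it.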

 

                 16.2.1. Memory Mapping Data Structures

 

                 Figure 16-2. Data structures for file memory mapping

 

        

            File memory mapping depends on the demand paging mechanism described in the section "Demand

                 Paging" in Chapter 9. In fact, a newly established memory mapping is a memory region that doesn't

                 include any page; as the process references an address inside the region, a Page Fault occurs and

                 the Page Fault handler checks whether the nopage method of the memory region is defined. If nopage

                 is not defined, the memory region doesn't map a file on disk; otherwise, it does, and the method

                 takes care of reading the page by accessing the block device. Almost all disk-based filesystems and

                 block device files implement the nopage method by means of the filemap_nopage( ) function.     

 

           16.2.2. Creating a Memory Mapping

 

                 We refer to the enumeration used to describe do_mmap_pgoff( ) and point out the additional steps performed under the new condition.

 

           16.2.3. Destroying a Memory Mapping

 

           The sys_munmap( ) service routine of the system call essentially invokes the do_munmap( ) function

           already described in the section "Releasing a Linear Address Interval" in Chapter 9.

 

           16.2.4. Demand Paging for Memory Mapping

 

           The filemap_nopage( ) function executes the following steps:

 

           16.2.5. Flushing Dirty Memory Mapping Pages to Disk

           The msync( ) system call can be used by a process to flush to disk dirty pages belonging to a shared memory mapping.

 

           16.2.6. Non-Linear Memory Mappings

         To create a non-linear memory mapping, the User Mode application first creates a normal shared

           memory mapping with the mmap( ) system call. Then, the application remaps some of the pages in

           the memory mapping region by invoking remap_file_pages( ). The sys_remap_file_pages( )

           service routine of the system call receives four parameters:

 

      16.3. Direct I/O Transfers

 

           generic_file_direct_IO( )         

     

 

      16.4. Asynchronous I/O

     

           16.4.1. Asynchronous I/O in Linux 2.6

           Table 16-5. Linux system calls for asynchronous I/O

              System call                           Description

        io_setup( )             Initializes an asynchronous context for the current process

        io_submit( )                Submits one or more asynchronous I/O operations

        io_getevents( )         Gets the completion status of some outstanding asynchronous I/O operations

        io_cancel( )                Cancels an outstanding I/O operation

        io_destroy( )           Removes an asynchronous context for the current process

 

           16.4.1.2. Submitting the asynchronous I/O operations

           To start some asynchronous I/O operations, the application invokes the io_submit( ) system call.

           The system call has three parameters:

 

Chapter 17. Page Frame Reclaiming

 

   17.1. The Page Frame Reclaiming Algorithm

      One of the goals of page frame reclaiming is thus to conserve a minimal pool of free page frames so

           that the kernel may safely recover from "low on memory" conditions.

 

           17.1.1. Selecting a Target Page

           The objective of the page frame reclaiming algorithm (PFRA) is to pick up page frames and make them free.

 

           Table 17-1. The types of pages considered by the PFRA

 

           Type of pages     Description                                                            Reclaim action

           Unreclaimable     Free pages (included in buddy system lists)                            (No reclaiming allowed or needed)
                             Reserved pages (with PG_reserved flag set)
                             Pages dynamically allocated by the kernel
                             Pages in the Kernel Mode stacks of the processes
                             Temporarily locked pages (with PG_locked flag set)
                             Memory locked pages (in memory regions with VM_LOCKED flag set)

           Swappable         Anonymous pages in User Mode address spaces                            Save the page contents in a swap area
                             Mapped pages of tmpfs filesystem (e.g., pages of IPC shared memory)

           Syncable          Mapped pages in User Mode address spaces                               Synchronize the page with its image on disk, if necessary
                             Pages included in the page cache and containing data of disk files
                             Block device buffer pages
                             Pages of some disk caches (e.g., the inode cache)

           Discardable       Unused pages included in memory caches (e.g., slab allocator caches)   Nothing to be done
                             Unused pages of the dentry cache

 

           In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages

           in the User Mode address spaces belonging to file memory mappings are mapped, as well as any

           other page included in the page cache. In almost all cases, mapped pages are syncable: in order to

           reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the

           page contents in the corresponding disk file.

 

 

           Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a

           process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In

           order to reclaim the page frame, the kernel must save the page contents in a dedicated disk

           partition or disk file called "swap area" (see the later section "Swapping"); therefore, all anonymous

           pages are swappable.

 

           Usually, the pages of special filesystems are not reclaimable. The only exceptions are the pages of

           the tmpfs special filesystem, which can be reclaimed by saving them in a swap area. As we'll see in

           Chapter 19, the tmpfs special filesystem is used by the IPC shared memory mechanism.

 

           17.1.2. Design of the PFRA

 

           Looking too close to the trees' leaves might lead us to miss the whole forest. Therefore, let us

           present a few general rules adopted by the PFRA. These rules are embedded in the functions that

           will be described later in this chapter.

 

           Free the "harmless" pages first

 

           Make all pages of a User Mode process reclaimable

          

           Reclaim a shared page frame by unmapping at once all page table entries that reference it

          

           Reclaim "unused" pages only

 

 

      17.2. Reverse Mapping

 

           The technique used in Linux 2.6 is named object-based reverse mapping. Essentially, for any reclaimable User Mode page, the kernel

           stores the backward links to all memory regions in the system (the "objects") that include the page

           itself. Each memory region descriptor stores a pointer to a memory descriptor, which in turn

           includes a pointer to a Page Global Directory.

 

 

           int try_to_unmap(struct page *page)
           {
                 int ret;

                 if (PageAnon(page))
                       ret = try_to_unmap_anon(page);
                 else
                       ret = try_to_unmap_file(page);
                 if (!page_mapped(page))
                       ret = SWAP_SUCCESS;
                 return ret;
           }

 

           17.2.1. Reverse Mapping for Anonymous Pages

 

           Figure 17-1. Object-based reverse mapping for anonymous pages

 

    

 

              17.2.1.1. The try_to_unmap_anon( ) function

                 -->try_to_unmap_one( )

           

            17.2.1.2. The try_to_unmap_one( ) function

 

 

           17.2.2. Reverse Mapping for Mapped Pages

 

                 17.2.2.1. The priority search tree

 

                     17.2.2.2. The try_to_unmap_file( ) function    

 

 

 

      17.3. Implementing the PFRA

      Figure 17-3. The main functions of the PFRA

 

    

 

 

         17.3.1. The Least Recently Used (LRU) Lists

 

              17.3.1.1. Moving pages across the LRU lists

 

                 Figure 17-4. Moving pages across the LRU lists

        

 

              17.3.1.2. The mark_page_accessed( ) function

 

              17.3.1.3. The page_referenced( ) function

 

              17.3.1.4. The refill_inactive_zone( ) function

                   Table 17-2. The fields of the scan_control descriptor

 

            Type             Field            Description
            unsigned long    nr_to_scan       Target number of pages to be scanned in the active list.
            unsigned long    nr_scanned       Number of inactive pages scanned in the current iteration.
            unsigned long    nr_reclaimed     Number of pages reclaimed in the current iteration.
            unsigned long    nr_mapped        Number of pages referenced in the User Mode address spaces.
            int              nr_to_reclaim    Target number of pages to be reclaimed.
            unsigned int     priority         Priority of the scanning, ranging between 12 and 0. Lower priority implies scanning more pages.
            unsigned int     gfp_mask         GFP mask passed from calling function.
            int              may_writepage    If set, writing a dirty page to disk is allowed (only for laptop mode).
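The remark that a lower priority value means more pages are scanned can be illustrated with a small sketch. The shift-based formula below is an assumption modeled on how 2.6 kernels scale the scan size; these notes do not give the exact expression, and the names pages_to_scan and EXAMPLE_DEF_PRIORITY are made up for the example.

```c
#include <assert.h>

/* Assumed starting priority, matching the upper bound given in Table 17-2. */
#define EXAMPLE_DEF_PRIORITY 12u

/*
 * Number of pages to scan in one pass over an LRU list of list_len pages.
 * Each step down in priority doubles the fraction of the list scanned.
 */
static unsigned long pages_to_scan(unsigned long list_len, unsigned int priority)
{
    assert(priority <= EXAMPLE_DEF_PRIORITY);
    return (list_len >> priority) + 1;
}
```

Under this assumption, a list of 40960 pages is scanned 11 pages at a time at priority 12, and in its entirety at priority 0.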

                  

      17.3.2. Low On Memory Reclaiming 

         17.3.2.1. The free_more_memory( ) function

 

              17.3.2.2. The try_to_free_pages( ) function

 

                     17.3.2.3. The shrink_caches( ) function

 

              17.3.2.4. The shrink_zone( ) function

 

                     17.3.2.5. The shrink_cache( ) function

 

                     17.3.2.6. The shrink_list( ) function

 

                     Figure 17-5. The page reclaiming logic of the shrink_list( ) function

        

                  

                    

                     17.3.2.7. The pageout( ) function

                     The pageout( ) function is invoked by shrink_list( ) when a dirty page must be written to disk.

 

 

              17.3.3. Reclaiming Pages of Shrinkable Disk Caches

 

                     17.3.3.1. Reclaiming page frames from the dentry cache

                     The shrink_dcache_memory( ) function is the shrinker function for the dentry cache.

 

                     17.3.3.2. Reclaiming page frames from the inode cache

 

 

              17.3.4. Periodic Reclaiming

 

                     The PFRA performs periodic reclaiming by using two different mechanisms: the kswapd kernel

                 threads, which invoke shrink_zone( ) and shrink_slab( ) to reclaim pages from the LRU lists, and

                 the cache_reap function, which is invoked periodically to reclaim unused slabs from the slab allocator.

 

 

                     17.3.4.1. The kswapd kernel threads

 

                     Each time it is woken up, a kswapd kernel thread executes the following steps:

                 1:   Invokes finish_wait( ) to remove the kernel thread from the node's kswapd_wait wait queue

                 (see the section "How Processes Are Organized" in Chapter 3).

 

                 2:   Invokes balance_pgdat( ) to perform the memory reclaiming on the kswapd's memory node

                 (see below).

 

                 3:   Invokes prepare_to_wait( ) to set the process in the TASK_INTERRUPTIBLE state and to put it to

                 sleep in the node's kswapd_wait wait queue.

 

                 4:   Invokes schedule( ) to yield the CPU to some other runnable process.

 

                     17.3.4.2. The cache_reap( ) function

 

                     The PFRA must also reclaim the pages owned by the slab allocator caches (see the section "Memory

                 Area Management" in Chapter 8). To do this, it relies on the cache_reap( ) function, which is

                 periodically scheduled approximately once every two seconds in the predefined events work queue

                 (see the section "Work Queues" in Chapter 4). The address of the cache_reap( ) function is stored in

                 the func field of the reap_work per-CPU variable of type work_struct.

      

              17.3.5. The Out of Memory Killer

 

                 The out_of_memory( ) function is invoked by __alloc_pages( ) when the free memory is very low

                 and the PFRA has not succeeded in reclaiming any page frames (see the section "The Zone Allocator"

                 in Chapter 8). The function invokes select_bad_process( ) to select a victim among the existing

                 processes, then invokes oom_kill_process( ) to perform the sacrifice.

 

           17.3.6. The Swap Token

 

                

    17.4. Swapping

 

           17.4.1. Swap Area

           The first page slot of a swap area is used to persistently store some information about the

           swap area; its format is described by the swap_header union, which is composed of two structures:

           magic

           info
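As an illustration of what the magic structure is for, the sketch below checks whether a buffer holding the first page slot carries the swap signature. Several details are assumptions not stated in these notes: a 4-KB page slot, the 10-character signature "SWAPSPACE2", and its position in the last 10 bytes of the slot. The function name looks_like_swap_area is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define EXAMPLE_PAGE_SIZE 4096          /* assumption: 4-KB page slots */
#define EXAMPLE_SWAP_MAGIC "SWAPSPACE2" /* assumption: version 2 signature */
#define EXAMPLE_SWAP_MAGIC_LEN 10

/* Return 1 if the first page slot ends with the swap signature, 0 otherwise. */
static int looks_like_swap_area(const unsigned char *first_slot)
{
    assert(first_slot != NULL);
    return memcmp(first_slot + EXAMPLE_PAGE_SIZE - EXAMPLE_SWAP_MAGIC_LEN,
                  EXAMPLE_SWAP_MAGIC, EXAMPLE_SWAP_MAGIC_LEN) == 0;
}
```

Placing the signature at the end of the slot leaves the beginning free for the fields of the info structure.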

 

                 17.4.1.1. Creating and activating a swap area

                 Each swap area consists of one or more swap extents, each of which is represented by a

                 swap_extent descriptor.

 

                 17.4.1.2. How to distribute pages in the swap areas

 

           17.4.2. Swap Area Descriptor

 

           swap_info_struct descriptor in memory

 

           Table 17-3. Fields of a swap area descriptor

 

         Type                     Field               Description
         unsigned int             flags               Swap area flags
         spinlock_t               sdev_lock           Spin lock protecting the swap area
         struct file *            swap_file           Pointer to the file object of the regular file or device file that stores the swap area
         struct block_device *    bdev                Descriptor of the block device containing the swap area
         struct list_head         extent_list         Head of the list of extents that compose the swap area
         int                      nr_extents          Number of extents composing the swap area
         struct swap_extent *     curr_swap_extent    Pointer to the most recently used extent descriptor
         unsigned int             old_block_size      Natural block size of the partition containing the swap area
         unsigned short *         swap_map            Pointer to an array of counters, one for each swap area page slot
         unsigned int             lowest_bit          First page slot to be scanned when searching for a free one
         unsigned int             highest_bit         Last page slot to be scanned when searching for a free one
         unsigned int             cluster_next        Next page slot to be scanned when searching for a free one
         unsigned int             cluster_nr          Number of free page slot allocations before restarting from the beginning
         int                      prio                Swap area priority
         int                      pages               Number of usable page slots
         unsigned long            max                 Size of swap area in pages
         unsigned long            inuse_pages         Number of used page slots in the swap area
         int                      next                Pointer to next swap area descriptor

             

              17.4.3. Swapped-Out Page Identifier

 

              17.4.4. Activating and Deactivating a Swap Area

             

              17.4.4.1. The sys_swapon( ) service routine

 

                     17.4.4.2. The sys_swapoff( ) service routine

 

                     17.4.4.3. The try_to_unuse( ) function

 

 

              17.4.5. Allocating and Releasing a Page Slot

 

                     17.4.5.1. The scan_swap_map( ) function

 

                     17.4.5.2. The get_swap_page( ) function

 

                     17.4.5.3. The swap_free( ) function

 

 

 

              17.4.6. The Swap Cache

 

 

                     17.4.6.1. Swap cache implementation

 

 

                     17.4.6.2. Swap cache helper functions

 

 

              17.4.7. Swapping Out Pages

                    

                     17.4.7.1. Inserting the page frame in the swap cache

 

                     17.4.7.2. Updating the Page Table entries

 

                     17.4.7.3. Writing the page into the swap area

 

                     17.4.7.4. Removing the page frame from the swap cache

 

              17.4.8. Swapping in Pages

 

                 17.4.8.1. The do_swap_page( ) function

 

                 17.4.8.2. The read_swap_cache_async( ) function

     

 

Chapter 18. The Ext2 and Ext3 Filesystems

 

      18.1. General Characteristics of Ext2

 

    18.2. Ext2 Disk Data Structures

 

 

 

           The first block in each Ext2 partition is never managed by the Ext2 filesystem, because it is reserved for the partition boot sector.

 

           The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 18-1.

 

           How many block groups are there? Well, that depends both on the partition size and the block size.

           The main constraint is that the block bitmap, which is used to identify the blocks that are used and

           free inside a group, must be stored in a single block. Therefore, in each block group, there can be at

           most 8×b blocks, where b is the block size in bytes. Thus, the total number of block groups is

           roughly s/(8×b), where s is the partition size in blocks.

           For example, let's consider a 32-GB Ext2 partition with a 4-KB block size. In this case, each 4-KB

           block bitmap describes 32K data blocks, that is, 128 MB. Therefore, at most 256 block groups are

           needed. Clearly, the smaller the block size, the larger the number of block groups.
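The arithmetic of this example can be checked with a few lines of C. This is a minimal sketch; the function names are made up for the illustration, and the count is rounded up so that a partial last group is included.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum blocks per group: one bitmap block provides 8 bits per byte. */
static uint64_t ext2_blocks_per_group(uint64_t block_size)
{
    return 8 * block_size;
}

/* Number of block groups for a partition of partition_bytes, rounded up. */
static uint64_t ext2_group_count(uint64_t partition_bytes, uint64_t block_size)
{
    assert(block_size > 0);
    uint64_t blocks = partition_bytes / block_size;          /* s */
    uint64_t per_group = ext2_blocks_per_group(block_size);  /* 8 x b */
    return (blocks + per_group - 1) / per_group;
}
```

For the 32-GB partition with 4-KB blocks this gives 32768 blocks per group and 256 groups; with 1-KB blocks the same partition would need 4096 groups, which matches the remark that smaller blocks mean more block groups.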

 

           18.2.1. Superblock

 

           An Ext2 disk superblock is stored in an ext2_super_block structure.

 

           Table 18-1. The fields of the Ext2 superblock

 

         Type            Field                       Description
         __le32          s_inodes_count              Total number of inodes
         __le32          s_blocks_count              Filesystem size in blocks
         __le32          s_r_blocks_count            Number of reserved blocks
         __le32          s_free_blocks_count         Free blocks counter
         __le32          s_free_inodes_count         Free inodes counter
         __le32          s_first_data_block          Number of first useful block (always 1)
         __le32          s_log_block_size            Block size
         __le32          s_log_frag_size             Fragment size
         __le32          s_blocks_per_group          Number of blocks per group
         __le32          s_frags_per_group           Number of fragments per group
         __le32          s_inodes_per_group          Number of inodes per group
         __le32          s_mtime                     Time of last mount operation
         __le32          s_wtime                     Time of last write operation
         __le16          s_mnt_count                 Mount operations counter
         __le16          s_max_mnt_count             Number of mount operations before check
         __le16          s_magic                     Magic signature
         __le16          s_state                     Status flag
         __le16          s_errors                    Behavior when detecting errors
         __le16          s_minor_rev_level           Minor revision level
         __le32          s_lastcheck                 Time of last check
         __le32          s_checkinterval             Time between checks
         __le32          s_creator_os                OS where filesystem was created
         __le32          s_rev_level                 Revision level of the filesystem
         __le16          s_def_resuid                Default UID for reserved blocks
         __le16          s_def_resgid                Default user group ID for reserved blocks
         __le32          s_first_ino                 Number of first nonreserved inode
         __le16          s_inode_size                Size of on-disk inode structure
         __le16          s_block_group_nr            Block group number of this superblock
         __le32          s_feature_compat            Compatible features bitmap
         __le32          s_feature_incompat          Incompatible features bitmap
         __le32          s_feature_ro_compat         Read-only compatible features bitmap
         __u8 [16]       s_uuid                      128-bit filesystem identifier
         char [16]       s_volume_name               Volume name
         char [64]       s_last_mounted              Pathname of last mount point
         __le32          s_algorithm_usage_bitmap    Used for compression
         __u8            s_prealloc_blocks           Number of blocks to preallocate
         __u8            s_prealloc_dir_blocks       Number of blocks to preallocate for directories
         __u16           s_padding1                  Alignment to word
         __u32 [204]     s_reserved                  Nulls to pad out 1,024 bytes
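Because the fields in Table 18-1 are stored little-endian and in the order listed, a few of them can be decoded from a raw buffer by offset. The following sketch makes several assumptions that go beyond these notes: the byte offsets are derived by summing the field widths in the table, the magic signature is taken to be 0xEF53, and the block size is taken to be 1024 shifted left by s_log_block_size. The names sb_summary and decode_superblock are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_EXT2_MAGIC 0xEF53u  /* assumed value of s_magic */

/* Subset of superblock fields decoded by this sketch, in host byte order. */
struct sb_summary {
    uint32_t inodes_count;
    uint32_t blocks_count;
    uint32_t block_size;        /* derived: 1024 << s_log_block_size */
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
};

/* Read a little-endian 32-bit value regardless of host endianness. */
static uint32_t le32_at(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 16-bit value regardless of host endianness. */
static uint16_t le16_at(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Decode a raw on-disk superblock (at least 58 readable bytes).
 * Returns 0 on success, -1 if the signature or block size looks wrong.
 */
static int decode_superblock(const uint8_t *raw, struct sb_summary *out)
{
    assert(raw != NULL && out != NULL);

    if (le16_at(raw + 56) != EXAMPLE_EXT2_MAGIC)    /* s_magic */
        return -1;

    uint32_t log_size = le32_at(raw + 24);          /* s_log_block_size */
    if (log_size > 6)                               /* reject implausible shifts */
        return -1;

    out->inodes_count     = le32_at(raw + 0);       /* s_inodes_count */
    out->blocks_count     = le32_at(raw + 4);       /* s_blocks_count */
    out->block_size       = 1024u << log_size;
    out->blocks_per_group = le32_at(raw + 32);      /* s_blocks_per_group */
    out->inodes_per_group = le32_at(raw + 40);      /* s_inodes_per_group */
    return 0;
}
```

Decoding byte by byte avoids depending on structure padding or on the byte order of the machine running the code.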

 

      18.2.2. Group Descriptor and Bitmap

           ext2_group_desc

           Table 18-2. The fields of the Ext2 group descriptor

         Type            Field                   Description
         __le32          bg_block_bitmap         Block number of block bitmap
         __le32          bg_inode_bitmap         Block number of inode bitmap
         __le32          bg_inode_table          Block number of first inode table block
         __le16          bg_free_blocks_count    Number of free blocks in the group
         __le16          bg_free_inodes_count    Number of free inodes in the group
         __le16          bg_used_dirs_count      Number of directories in the group
         __le16          bg_pad                  Alignment to word
         __le32 [3]      bg_reserved             Nulls to pad out 24 bytes

 

 

      18.2.3. Inode Table

 

           The inode table consists of a series of consecutive blocks, each of which contains a predefined

           number of inodes. The block number of the first block of the inode table is stored in the

           bg_inode_table field of the group descriptor.
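Since every group holds the same number of inodes and the inode table is a run of consecutive blocks, the position of an inode on disk follows from simple arithmetic. The sketch below assumes that inode numbers start at 1 and that the caller supplies an array with the bg_inode_table value of each group; the names inode_location and locate_inode are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;     /* block group that holds the inode */
    uint32_t block;     /* filesystem block containing the inode */
    uint32_t offset;    /* byte offset of the inode inside that block */
};

/*
 * Locate inode number ino (numbering assumed to start at 1).
 * bg_inode_table[g] is the first inode table block of group g.
 */
static struct inode_location locate_inode(uint32_t ino,
                                          uint32_t inodes_per_group,
                                          uint32_t inode_size,
                                          uint32_t block_size,
                                          const uint32_t *bg_inode_table)
{
    assert(ino >= 1 && inodes_per_group > 0);
    assert(inode_size > 0 && block_size >= inode_size);

    uint32_t index = (ino - 1) % inodes_per_group;   /* slot inside the group */
    uint32_t inodes_per_block = block_size / inode_size;
    struct inode_location loc;

    loc.group  = (ino - 1) / inodes_per_group;
    loc.block  = bg_inode_table[loc.group] + index / inodes_per_block;
    loc.offset = (index % inodes_per_block) * inode_size;
    return loc;
}
```

With 8192 inodes per group, 128-byte inodes, and 4-KB blocks, each table block holds 32 inodes, so inode 2 sits 128 bytes into the first table block of group 0.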

 

           Each Ext2 inode is an ext2_inode structure whose fields are illustrated in Table 18-3.

                                                                                      

              Table 18-3. The fields of an Ext2 disk inode

 

         Type                       Field               Description
        __le16                      i_mode              File type and access rights
        __le16                      i_uid               Owner identifier
        __le32                      i_size              File length in bytes
        __le32                      i_atime             Time of last file access
        __le32                      i_ctime             Time that inode last changed
        __le32                      i_mtime             Time that file contents last changed
        __le32                      i_dtime             Time of file deletion
        __le16                      i_gid               User group identifier
        __le16                      i_links_count       Hard links counter
        __le32                      i_blocks            Number of data blocks of the file
        __le32                      i_flags             File flags
        union                       osd1                Specific operating system information
        __le32 [EXT2_N_BLOCKS]      i_block             Pointers to data blocks
        __le32                      i_generation        File version (used when the file is accessed by a network filesystem)
        __le32                      i_file_acl          File access control list
        __le32                      i_dir_acl           Directory access control list
        __le32                      i_faddr             Fragment address
        union                       osd2                Specific operating system information
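
           To make the layout in Table 18-3 concrete, here is a minimal sketch that decodes the first eleven fields (i_mode through i_flags) from a raw inode buffer. It assumes the fields are packed in table order with no padding, little-endian as the __le16/__le32 types indicate. The names INODE_HEAD, InodeHead, and parse_inode_head are illustrative, and the sample bytes are synthetic rather than read from a real disk.

```python
import struct
from collections import namedtuple

# "<" selects little-endian with no padding; H = 16-bit, I = 32-bit unsigned.
INODE_HEAD = struct.Struct("<HHIIIIIHHII")

InodeHead = namedtuple(
    "InodeHead",
    "i_mode i_uid i_size i_atime i_ctime i_mtime i_dtime "
    "i_gid i_links_count i_blocks i_flags",
)


def parse_inode_head(raw):
    """Decode the leading fixed-size fields of an Ext2 disk inode."""
    return InodeHead._make(INODE_HEAD.unpack_from(raw))


# Synthetic 128-byte inode: a regular file, mode 0644, 4096 bytes long.
sample = INODE_HEAD.pack(0o100644, 1000, 4096, 0, 0, 0, 0, 1000, 1, 8, 0)
sample += bytes(128 - INODE_HEAD.size)

head = parse_inode_head(sample)
print(oct(head.i_mode), head.i_size, head.i_links_count)
```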

        

 

       18.2.4. Extended Attributes of an Inode

           The i_file_acl field of an inode points to the block containing the extended attributes.
           Each extended attribute is described by an ext2_xattr_entry descriptor.

           Each ext2_xattr_entry descriptor, together with the name of the attribute, is placed at the
           beginning of the block, while the value of the attribute is placed at the end of the block.

 

     

 

 

      18.2.5. Access Control Lists

 

 

      18.2.6. How Various File Types Use Disk Blocks

 

           Table 18-4. Ext2 file types

              File_type                  Description

           0                     Unknown

           1                     Regular file

           2                     Directory

           3                     Character device

           4                     Block device

           5                     Named pipe

           6                     Socket

           7                     Symbolic link    
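
           As a rough illustration of Table 18-4, the sketch below maps the file type bits of a POSIX mode value to the Ext2 file_type codes listed above. The dictionary and function names are hypothetical; the S_IF* constants come from Python's standard stat module.

```python
import stat

# Codes and descriptions copied from Table 18-4.
EXT2_FILE_TYPES = {
    0: "Unknown",
    1: "Regular file",
    2: "Directory",
    3: "Character device",
    4: "Block device",
    5: "Named pipe",
    6: "Socket",
    7: "Symbolic link",
}

_MODE_TO_CODE = {
    stat.S_IFREG: 1,
    stat.S_IFDIR: 2,
    stat.S_IFCHR: 3,
    stat.S_IFBLK: 4,
    stat.S_IFIFO: 5,
    stat.S_IFSOCK: 6,
    stat.S_IFLNK: 7,
}


def file_type_code(mode):
    """Return the Ext2 file_type code for a mode value (0 if unrecognized)."""
    return _MODE_TO_CODE.get(stat.S_IFMT(mode), 0)


for mode in (0o100644, 0o040755, 0o120777):
    code = file_type_code(mode)
    print(oct(mode), code, EXT2_FILE_TYPES[code])
```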

 

           18.2.6.1. Regular file

 

           18.2.6.2. Directory

 

              18.2.6.3. Symbolic link

 

           18.2.6.4. Device file, pipe, and socket

 

    18.3. Ext2 Memory Data Structures

        Table 18-6. VFS images of Ext2 data structures

           Type                  Disk data structure      Memory data structure      Caching mode
           Superblock            ext2_super_block         ext2_sb_info               Always cached
           Group descriptor      ext2_group_desc          ext2_group_desc            Always cached
           Block bitmap          Bit array in block       Bit array in buffer        Dynamic
           Inode bitmap          Bit array in block       Bit array in buffer        Dynamic
           Inode                 ext2_inode               ext2_inode_info            Dynamic
           Data block            Array of bytes           VFS buffer                 Dynamic
           Free inode            ext2_inode               None                       Never
           Free block            Array of bytes           None                       Never

 

        18.3.1. The Ext2 Superblock Object

           As stated in the section "Superblock Objects" in Chapter 12, the s_fs_info field of the VFS
           superblock points to a structure containing filesystem-specific data. In the case of Ext2, this field
           points to a structure of type ext2_sb_info.

          

     

 

        18.3.2. The Ext2 inode Object

           When the VFS accesses an Ext2 disk inode, it creates a corresponding inode descriptor of type ext2_inode_info

 

    18.4. Creating the Ext2 Filesystem

           There are generally two stages to creating a filesystem on a disk. The first step is to format it so that
           the disk driver can read and write blocks on it.

           The second step involves creating a filesystem, which means
           setting up the structures described in detail earlier in this chapter.

 

      18.5. Ext2 Methods

        18.5.1. Ext2 Superblock Operations

           The addresses of the superblock methods are stored in the ext2_sops array of pointers

 

           18.5.2. Ext2 inode Operations

 

           The addresses of the Ext2 methods for regular files and directories are stored in the ext2_file_inode_operations

           and ext2_dir_inode_operations tables, respectively.

          

        18.5.3. Ext2 File Operations

           The addresses of these methods are stored in the ext2_file_operations table.

 

    18.6. Managing Ext2 Disk Space

        18.6.1. Creating inodes

           The ext2_new_inode( ) function creates an Ext2 disk inode

           18.6.2. Deleting inodes

           The ext2_free_inode( ) function deletes a disk inode

 

           18.6.3. Data Blocks Addressing

     

 

            18.6.5. Allocating a Data Block

                 ext2_get_block( )

                 ext2_alloc_block( )

                 18.6.6. Releasing a Data Block

                 ext2_truncate( )

 

        18.7. The Ext3 Filesystem

            18.7.1. Journaling Filesystems

                 The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the
                 whole filesystem by looking instead in a special disk area, called the journal, that contains the
                 most recent disk write operations. Remounting a journaling filesystem after a system failure is a
                 matter of a few seconds.

 

                 18.7.2. The Ext3 Journaling Filesystem

                 Ext3 offers three different journaling modes:

                 Journal

                 Ordered

                 Writeback

            18.7.3. The Journaling Block Device Layer

                18.7.3.1. Log records

                18.7.3.2. Atomic operation handles

                18.7.3.3. Transactions

                     18.7.4. How Journaling Works

 

Chapter 19. Process Communication

            As usual, application programmers have a variety of needs that call for different communication

                 mechanisms. Here are the basic mechanisms that Unix systems offer to allow interprocess

                 communication:

 

                 Pipes and FIFOs (named pipes)

            Semaphores

                

                 Messages

          

                 Shared memory regions

 

                 Sockets

   

        19.1. Pipes

                19.1.1. Using a Pipe

                      In Linux, popen( ) and pclose( ) are included in the C library. The popen( ) function receives two
                      parameters: the pathname of an executable file (filename) and a type string specifying the direction
                      of the data transfer.
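
                      A minimal sketch of the same idea from Python, whose os.popen( ) takes a command string and a mode string ("r" or "w") in the same spirit as the C library call. Note that os.popen( ) is Python's own wrapper, not the C function itself; the helper name run_and_read and the echo command are illustrative only.

```python
import os


def run_and_read(command):
    """Run `command` through a pipe and return (output, exit_status).

    Mode "r" means the caller reads what the child process writes.
    close() returns None when the child exits with status 0.
    """
    stream = os.popen(command, "r")
    try:
        output = stream.read()
    finally:
        status = stream.close()
    return output, status


out, status = run_and_read("echo hello")
print(repr(out), status)
```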

 

                      19.1.2. Pipe Data Structures

                      pipe_inode_info

               

                    19.1.2.1. The pipefs special filesystem

                    

                19.1.3. Creating and Destroying a Pipe

                      The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function to create a new pipe.

 

                      19.1.4. Reading from a Pipe

                      The pipe_read( ) function is quite involved

 

                      19.1.5. Writing into a Pipe   

                            A process wishing to put data into a pipe issues a write( ) system call, specifying the file descriptor

                      for the writing end of the pipe. The kernel satisfies this request by invoking the write method of the

                      proper file object; the corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.
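
                      From user space, the write and read paths look like the sketch below, which uses Python's os.pipe( ), os.write( ), and os.read( ) wrappers around the corresponding system calls. The function name is hypothetical, and the sketch assumes the payload is small enough to fit in the pipe buffer, since a single process writing more than the buffer holds would block.

```python
import os


def pipe_round_trip(payload):
    """Write `payload` into a new pipe and read it back from the other end."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, payload)   # goes to the writing end of the pipe
        os.close(write_fd)            # closing it lets the reader see EOF
        write_fd = -1
        chunks = []
        while True:
            chunk = os.read(read_fd, 4096)
            if not chunk:             # empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        if write_fd != -1:
            os.close(write_fd)
        os.close(read_fd)


print(pipe_round_trip(b"hello pipe"))
```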

 

           19.2. FIFOs  

                     FIFOs differ from pipes in only two significant ways:

                     1):  FIFO inodes appear on the system directory tree rather than on the pipefs special filesystem

                     2):  FIFOs are a bidirectional communication channel; that is, it is possible to open a FIFO in read/write mode.

 

 

                     19.2.1. Creating and Opening a FIFO

                     A process creates a FIFO by issuing a mknod( ) system call (see the section "Device Files" in
                     Chapter 13), passing to it as parameters the pathname of the new FIFO and the value S_IFIFO
                     (0x1000, that is, octal 0010000) logically ORed with the permission bit mask of the new file.
                     POSIX introduces a function named mkfifo( ) specifically to create a FIFO. This call is
                     implemented in Linux, as in System V Release 4, as a C library function that invokes mknod( ).
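
                     The creation step can be tried from user space with the sketch below. It is only an illustration under stated assumptions: it runs on Linux, uses Python's os.mknod( ) wrapper with stat.S_IFIFO ORed with a permission mask, and then opens the FIFO in read/write mode, which on Linux does not block. The function name and the file name demo_fifo are hypothetical.

```python
import os
import stat
import tempfile


def fifo_demo():
    """Create a FIFO with mknod, open it read/write, and echo four bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_fifo")
        # File type S_IFIFO ORed with the permission bit mask 0600.
        os.mknod(path, stat.S_IFIFO | 0o600)
        is_fifo = stat.S_ISFIFO(os.stat(path).st_mode)
        fd = os.open(path, os.O_RDWR)   # read/write open of a FIFO
        try:
            os.write(fd, b"ping")
            data = os.read(fd, 4)
        finally:
            os.close(fd)
        return is_fifo, data


print(hex(stat.S_IFIFO), fifo_demo())
```

                     Calling os.mkfifo(path, 0o600) instead of os.mknod( ) would create the same kind of file.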

 

                 The fifo_open( ) function initializes the data structures specific to the FIFO.

 

                

              19.3. System V IPC

                    

                     System V IPC allows a User Mode process to:

                     1):  Synchronize itself with other processes by means of semaphores

                     2):  Send messages to other processes or receive messages from them

                     3):  Share a memory area with other processes

                     19.3.1. Using an IPC Resource

                     IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending

                 on whether the new resource is a semaphore, a message queue, or a shared memory region.

                

                 Table 19-8. The fields of the ipc_ids data structure

             

            Type                    Field           Description
            int                     in_use          Number of allocated IPC resources
            int                     max_id          Maximum slot index in use
            unsigned short          seq             Slot usage sequence number for the next allocation
            unsigned short          seq_max         Maximum slot usage sequence number
            struct semaphore        sem             Semaphore protecting the ipc_ids data structure
            struct ipc_id_ary       nullentry       Fake data structure pointed to by the entries field if this IPC resource
                                                    cannot be initialized (normally not used)
            struct ipc_id_ary *     entries         Pointer to the ipc_id_ary data structure for this resource

 

                 The ipc_id_ary data structure consists of two fields: p and size. The p field is an array of pointers to

            kern_ipc_perm data structures, one for every allocatable resource. The size field is the size of this array.

 

                 Each kern_ipc_perm data structure is associated with an IPC resource and contains the fields shown

                 in Table 19-9.

 

 

                 Table 19-9. The fields in the kern_ipc_perm structure

            Type                Field           Description
            spinlock_t          lock            Spin lock protecting the IPC resource descriptor
            int                 deleted         Flag set if the resource has been released
            int                 key             IPC key
            unsigned int        uid             Owner user ID
            unsigned int        gid             Owner group ID
            unsigned int        cuid            Creator user ID
            unsigned int        cgid            Creator group ID
            unsigned short      mode            Permission bit mask
            unsigned long       seq             Slot usage sequence number
            void *              security        Pointer to a security structure (used by SELinux)
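
                 The slot index and the slot usage sequence number together determine the IPC identifier handed back to user space. The sketch below assumes the formula used in 2.6-era kernels, identifier = SEQ_MULTIPLIER * seq + slot index, with SEQ_MULTIPLIER equal to IPCMNI (32768). Those constants are not stated in these notes, so treat them as assumptions to check against the kernel version at hand; the function names are illustrative.

```python
# Assumed values from 2.6-era kernel headers, not from the notes above.
IPCMNI = 32768
SEQ_MULTIPLIER = IPCMNI


def build_ipc_id(slot_index, seq):
    """Combine a slot index and a slot usage sequence number into an IPC id."""
    return SEQ_MULTIPLIER * seq + slot_index


def split_ipc_id(ipc_id):
    """Recover (slot_index, seq) from an IPC identifier."""
    seq, slot_index = divmod(ipc_id, SEQ_MULTIPLIER)
    return slot_index, seq


first = build_ipc_id(2, 0)    # slot 2, first time the slot is used
reused = build_ipc_id(2, 1)   # same slot after being freed and reallocated
print(first, reused, split_ipc_id(reused))
```

                 Because seq changes each time a slot is reallocated, a stale identifier held by a process does not match the new resource occupying the same slot.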

 

 

 

                     19.3.2. The ipc( ) System Call

 

 

                     19.3.3. IPC Semaphores

 

          

 

                      Table 19-10. The fields in the sem_array data structure

                Type                        Field                   Description
                struct kern_ipc_perm        sem_perm                kern_ipc_perm data structure
                long                        sem_otime               Timestamp of last semop( )
                long                        sem_ctime               Timestamp of last change
                struct sem *                sem_base                Pointer to first sem structure
                struct sem_queue *          sem_pending             Pending operations
                struct sem_queue **         sem_pending_last        Last pending operation
                struct sem_undo *           undo                    Undo requests
                unsigned long               sem_nsems               Number of semaphores in array

 

                            19.3.3.1. Undoable semaphore operations

 

                            19.3.3.2. The queue of pending requests

 

                     19.3.4. IPC Messages

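                      To send a message, a process invokes the msgsnd( ) function, passing to it the IPC identifier of
                      the destination message queue, the size of the message text, and the address of a User Mode buffer
                      that contains the message type immediately followed by the message text.

                      To retrieve a message, a process invokes the msgrcv( ) function, passing to it the IPC identifier
                      of the message queue, a pointer to a User Mode buffer that receives the message type and text, the
                      size of this buffer, and a value that specifies which message should be retrieved.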
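
                      How msgrcv( ) chooses a message by type can be modeled in a few lines. The sketch below is an in-memory toy, not real System V IPC: the class name is hypothetical, and it returns None where the real call would block or fail. It follows the documented msgrcv( ) rules: type 0 takes the first message, a positive type takes the first message of exactly that type, and a negative type takes the first message with the lowest type not greater than its absolute value.

```python
from collections import deque


class ToyMessageQueue:
    """In-memory model of msgrcv( ) message selection by type."""

    def __init__(self):
        self._messages = deque()   # kept in arrival order

    def send(self, mtype, text):
        if mtype <= 0:
            raise ValueError("message type must be positive")
        self._messages.append((mtype, text))

    def receive(self, msgtyp=0):
        matching = [
            (index, message)
            for index, message in enumerate(self._messages)
            if msgtyp == 0
            or (msgtyp > 0 and message[0] == msgtyp)
            or (msgtyp < 0 and message[0] <= -msgtyp)
        ]
        if not matching:
            return None
        if msgtyp < 0:
            # Lowest type wins; min() keeps the earliest message on ties.
            index, message = min(matching, key=lambda pair: pair[1][0])
        else:
            index, message = matching[0]
        del self._messages[index]
        return message


queue = ToyMessageQueue()
for mtype, text in ((3, "c"), (1, "a"), (2, "b")):
    queue.send(mtype, text)

received = [queue.receive(0), queue.receive(-2), queue.receive(2), queue.receive(5)]
print(received)
```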

     

                

 

                   Table 19-12. The msg_queue data structure

                            Type                         Field                          Description

                struct kern_ipc_perm    q_perm          kern_ipc_perm data structure

                long                q_stime         Time of last msgsnd( )

                long                q_rtime         Time of last msgrcv( )

                long                q_ctime         Last change time

                unsigned long       q_qcbytes           Number of bytes in queue

                unsigned long       q_qnum          Number of messages in queue

                unsigned long       q_qbytes            Maximum number of bytes in queue

                int             q_lspid         PID of last msgsnd( )

                int             q_lrpid         PID of last msgrcv( )

                struct list_head        q_messages          List of messages in queue

                struct list_head        q_receivers         List of processes receiving messages

                struct list_head        q_senders           List of processes sending messages

 

                   Table 19-13. The msg_msg data structure

                Type                        Field           Description
                struct list_head            m_list          Pointers for message list
                long                        m_type          Message type
                int                         m_ts            Message text size
                struct msg_msgseg *         next            Next portion of the message
                void *                      security        Pointer to a security data structure (used by SELinux)

 

 

                 19.3.5. IPC Shared Memory

 

                      As with semaphores and message queues, the shmget( ) function is invoked to get the IPC identifier

                      of a shared memory region, optionally creating it if it does not already exist.

 

                      The shmat( ) function is invoked to "attach" an IPC shared memory region to a process.

                      The shmdt( ) function is invoked to "detach" an IPC shared memory region specified by its IPC
                      identifier; that is, to remove the corresponding memory region from the process's address space.

 

                      Figure 19-3. IPC shared memory data structures

 

                

 

                      Table 19-14. The fields in the shmid_kernel data structure

 

                Type                        Field               Description
                struct kern_ipc_perm        shm_perm            kern_ipc_perm data structure
                struct file *               shm_file            Special file of the segment
                int                         id                  Slot index of the segment
                unsigned long               shm_nattch          Number of current attaches
                unsigned long               shm_segsz           Segment size in bytes
                long                        shm_atim            Last access time
                long                        shm_dtim            Last detach time
                long                        shm_ctim            Last change time
                int                         shm_cprid           PID of creator
                int                         shm_lprid           PID of last accessing process
                struct user_struct *        mlock_user          Pointer to the user_struct descriptor of the user that locked in RAM
                                                                the shared memory resource (see the section "The clone( ), fork( ),
                                                                and vfork( ) System Calls" in Chapter 3)

 

                      19.3.5.1. Swapping out pages of IPC shared memory regions

 

                      19.3.5.2. Demand paging for IPC shared memory regions

 

           19.4. POSIX Message Queues

 

            Table 19-15. Library functions for POSIX message queues

            Function names                          Description
            mq_open( )                              Open (optionally creating) a POSIX message queue
            mq_close( )                             Close a POSIX message queue (without destroying it)
            mq_unlink( )                            Destroy a POSIX message queue
            mq_send( ), mq_timedsend( )             Send a message to a POSIX message queue; the latter function defines a time limit for the operation
            mq_receive( ), mq_timedreceive( )       Fetch a message from a POSIX message queue; the latter function defines a time limit for the operation
            mq_notify( )                            Establish an asynchronous notification mechanism for the arrival of messages in an empty POSIX message queue
            mq_getattr( ), mq_setattr( )            Respectively get and set attributes of a POSIX message queue (essentially,
                                                    whether the send and receive operations should be blocking or nonblocking)

 

 

 

 

 


Reposted from blog.csdn.net/u011961033/article/details/83088865