Understanding the Linux Kernel, 3rd Edition
Preface
The Audience for This Book
we try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used
Organization of the Material
We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent.
Level of Description
Overview of the Book
Conventions in This Book
How to Contact Us
Chapter 1. Introduction
1.1. Linux Versus Other Unix-Like Kernels
Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call
1.2. Hardware Dependency
1.3. Linux Versions
1.4. Basic Operating System Concepts
1.4.1. Multiuser Systems
1.4.2. Users and Groups
1.4.3. Processes
A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program.
1.4.4. Kernel Architecture
monolithic kernel vs. microkernel (modules)
1.5. An Overview of the Unix Filesystem
1.5.1. Files
1.5.2. Hard and Soft Links
1.5.3. File Types
1.5.4. File Descriptor and Inode
1.5.5. Access Rights and File Mode
When a file is created by a process, its owner ID is the UID of the process.
Its owner user group ID can be either the process group ID of the creator process or the user group ID of the parent directory,
depending on the value of the sgid flag of the parent directory.
1.5.6. File-Handling System Calls
1.5.6.1. Opening a file
1.5.6.2. Accessing an opened file
1.5.6.3. Closing a file
1.5.6.4. Renaming and deleting a file
1.6. An Overview of Unix Kernels (needs re-reading)
1.6.1. The Process/Kernel Model
kernel routines can be activated in several ways:
1:A process invokes a system call.
2:The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction.
The kernel handles the exception on behalf of the process that caused it.
3:A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention,
a status change, or the completion of an I/O operation.
Each interrupt signal is dealt with by a kernel program called an interrupt handler.
Because peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
4:A kernel thread is executed. Because it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
1.6.2. Process Implementation
When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor.
These include:
1:The program counter (PC) and stack pointer (SP) registers
2:The general purpose registers
3:The floating point registers
4:The processor control registers (Processor Status Word) containing information about the CPU state
5:The memory management registers used to keep track of the RAM accessed by the process
1.6.3. Reentrant Kernels
1.6.4. Process Address Space
1.6.5. Synchronization and Critical Regions
1.6.5.1. Kernel preemption disabling
1.6.5.2. Interrupt disabling
1.6.5.3. Semaphores
1.6.5.4. Spin locks
1.6.5.5. Avoiding deadlocks
1.6.6. Signals and Interprocess Communication
1.6.7. Process Management
1.6.7.1. Zombie processes
1.6.7.2. Process groups and login sessions
1.6.8. Memory Management
1.6.8.1. Virtual memory
1.6.8.2. Random access memory usage
1.6.8.3. Kernel Memory Allocator
1.6.8.4. Process virtual address space handling
1.6.8.5. Caching
1.6.9. Device Drivers
Chapter 2. Memory Addressing
2.1. Memory Addresses
(1)Logical address
(2)Linear address
(3)Physical address
The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit.
A second hardware circuit called a paging unit transforms the linear address into a physical address.
Figure 2-1. Logical address translation
2.2. Segmentation in Hardware
2.2.1. Segment Selectors and Segmentation Registers
(1)Segment Selectors
(2) Segmentation Registers
To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors:
cs, ss, ds, es, fs, gs.
2.2.2. Segment Descriptors
Global Descriptor Table (GDT )
Local Descriptor Table(LDT).
Code Segment Descriptor
Data Segment Descriptor
Task State Segment Descriptor (TSSD)
Local Descriptor Table Descriptor (LDTD)
2.2.3. Fast Access to Segment Descriptors
2.2.4. Segmentation Unit
2.3. Segmentation in Linux
The 2.6 version of Linux uses segmentation only when required by the 80 x 86 architecture
2.3.1. The Linux GDT
1. A Task State Segment (TSS)
2. Kernel code and data segments
3. A segment including the default Local Descriptor Table (LDT)
4. Three Thread-Local Storage (TLS) segments
5. Three segments related to Advanced Power Management (APM)
6. Five segments related to Plug and Play (PnP) BIOS services
7. A special TSS segment used by the kernel to handle "Double fault" exceptions
a few entries in the GDT may depend on the process that the CPU is executing (LDT and TLS Segment Descriptors).
2.3.2. The Linux LDTs
2.4. Paging in Hardware
page frames/page
2.4.1. Regular Paging
2.4.2. Extended Paging
2.4.3. Hardware Protection Scheme
2.4.4. An Example of Regular Paging
A simple example will help in clarifying how regular paging works. Let's assume that the kernel
assigns the linear address space between 0x20000000 and 0x2003ffff to a running process. This
space consists of exactly 64 pages. We don't care about the physical addresses of the page frames
containing the pages; in fact, some of them might not even be in main memory. We are interested
only in the remaining fields of the Page Table entries.
(As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to
reference only a subset of it.)
Let's start with the 10 most significant bits of the linear addresses assigned to the process, which
are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by
zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field
in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry
must contain the physical address of the Page Table assigned to the process (see Figure 2-9). If no
other linear addresses are assigned to the process, all the remaining 1,023 entries of the Page
Directory are filled with zeros.
The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0
to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are valid. The
remaining 960 entries are filled with zeros.
Suppose that the process needs to read the byte at linear address 0x20021406. This address is
handled by the paging unit as follows:
1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the Page Table associated with the process's pages.
2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page frame containing the desired page.
3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page frame.
If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main
memory; in this case, the paging unit issues a Page Fault exception while translating the linear
address. The same exception is issued whenever the process attempts to access linear addresses
outside of the interval delimited by 0x20000000 and 0x2003ffff, because the Page Table entries not
assigned to the process are filled with zeros; in particular, their Present flags are all cleared.
Figure 2-9. An example of paging
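The arithmetic of this example can be reproduced with a tiny user-space sketch; the 10-10-12 bit split follows the regular-paging layout described above, and the address is the one from the example:

#include <stdio.h>
int main(void)
{
    unsigned int linear = 0x20021406;             /* linear address from the example */
    unsigned int dir    = linear >> 22;           /* bits 31-22: Directory field     */
    unsigned int table  = (linear >> 12) & 0x3ff; /* bits 21-12: Table field         */
    unsigned int offset = linear & 0xfff;         /* bits 11-0:  Offset field        */
    printf("Directory=0x%03x Table=0x%03x Offset=0x%03x\n", dir, table, offset);
    /* prints Directory=0x080 Table=0x021 Offset=0x406 */
    return 0;
}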
2.4.5. The Physical Address Extension (PAE) Paging Mechanism
2.4.6. Paging for 64-bit Architectures
2.4.7. Hardware Cache (L1 cache)
The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the
cache memory. Each entry includes a tag and a few flags that describe the status of the cache line.
The tag consists of some bits that allow the cache controller to recognize the memory location
currently mapped by the line. The bits of the memory's physical address are usually split into three
groups: the most significant ones correspond to the tag, the middle ones to the cache controller
subset index, and the least significant ones to the offset within the line.
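The same three-way split can be illustrated with a small sketch; the line size and the number of indexable lines below are assumptions chosen only for the example, not values taken from any particular CPU:

#include <stdio.h>
#define LINE_SIZE 64u      /* assumed bytes per cache line -> 6 offset bits    */
#define NUM_SETS  1024u    /* assumed lines the controller indexes -> 10 bits  */
int main(void)
{
    unsigned int paddr  = 0x0012d4c8;                     /* arbitrary physical address */
    unsigned int offset = paddr % LINE_SIZE;              /* least significant bits     */
    unsigned int index  = (paddr / LINE_SIZE) % NUM_SETS; /* middle bits: subset index  */
    unsigned int tag    = paddr / (LINE_SIZE * NUM_SETS); /* most significant bits      */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}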
write-through:
the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations
write-back:
the cache line is updated and the contents of the RAM are left
unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller
writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of
cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
2.4.8. Translation Lookaside Buffers (TLB)
Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is
used for the first time, the corresponding physical address is computed through slow accesses to the
Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to
the same linear address can be quickly translated.
2.5. Paging in Linux
Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear
addresses into physical ones makes the following design objectives feasible:
1.Assign a different physical address space to each process, ensuring an efficient protection
against addressing errors.
2.Distinguish pages (groups of data) from page frames (physical addresses in main memory).
This allows the same page to be stored in a page frame, then saved to disk and later reloaded
in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).
pgd
2.5.1. The Linear Address Fields
PAGE_SHIFT/PMD_SHIFT/PUD_SHIFT/PGDIR_SHIFT
PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD
2.5.2. Page Table Handling
(1) Type-conversion macros:
__pte, __pmd, __pud, __pgd, __pgprot (prot stands for protection)
pte_val, pmd_val, pud_val, pgd_val, pgprot_val
(2) Macros and functions to read or modify page table entries:
pte_none, pmd_none, pud_none, pgd_none
pte_clear, pmd_clear, pud_clear, pgd_clear
set_pte, set_pmd, set_pud, set_pgd
pte_same(a,b)
pmd_large(e)
pmd_bad pud_bad pgd_bad
pte_present
The pmd_bad macro is used by functions to check Page Middle Directory entries passed as input
parameters. It yields the value 1 if the entry points to a bad Page Table, that is, if at least one of the
following conditions applies:
(1)The page is not in main memory (Present flag cleared).
(2)The page allows only Read access (Read/Write flag cleared).
(3)Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing Page Table).
The pte_present macro yields the value 1 if either the Present flag or the Page Size flag of a Page
Table entry is equal to 1, the value 0 otherwise. Recall that the Page Size flag in Page Table entries
has no meaning for the paging unit of the microprocessor; the kernel, however, marks Present
equal to 0 and Page Size equal to 1 for the pages present in main memory but without read, write,
or execute privileges. In this way, any access to such pages triggers a Page Fault exception because
Present is cleared, and the kernel can detect that the fault is not due to a missing page by checking
the value of Page Size.
(3): Page flag reading/setting functions
pte_user( )
pte_read( )
pte_write( )
pte_exec( )
pte_dirty( )
pte_young( )
pte_file( )
mk_pte_huge( )
pte_wrprotect( )
pte_rdprotect( )
pte_exprotect( )
pte_mkwrite( )
pte_mkread( )
pte_mkexec( )
pte_mkclean( )
pte_mkdirty( )
pte_mkold( )
pte_mkyoung( )
pte_modify(p,v)
ptep_set_wrprotect()
ptep_set_access_flags()
ptep_mkdirty()
ptep_test_and_clear_dirty()
ptep_test_and_clear_young()
(4): Macros acting on Page Table entries
pgd_index(addr)
pgd_offset(mm, addr)
pgd_offset_k(addr)
pgd_page(pgd)
pud_offset(pgd, addr)
pud_page(pud)
pmd_index(addr)
pmd_offset(pud, addr)
pmd_page(pmd)
mk_pte(p,prot)
pte_index(addr)
pte_offset_kernel(dir, addr)
pte_offset_map(dir, addr)
pte_to_pgoff(pte)
pgoff_to_pte(offset)
(5): Page allocation functions
pgd_alloc(mm)
pgd_free( pgd)
pud_alloc(mm, pgd,addr)
pud_free(x)
pmd_alloc(mm, pud,addr)
pmd_free(x)
pte_alloc_map(mm, pmd,addr)
pte_alloc_kernel(mm,pmd, addr)
pte_free(pte)
pte_free_kernel(pte)
clear_page_range(mmu,start,end)
2.5.3. Physical Memory Layout
As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000, i.e., from the second megabyte.
Figure 2-13. The first 768 page frames (3 MB) in Linux 2.6
1:Page frame 0 is used by BIOS to store the system hardware configuration detected during the
Power-On Self-Test(POST); the BIOS of many laptops, moreover, writes data on this page
frame even after the system is initialized.
2:Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS
routines and to map the internal memory of ISA graphics cards. This area is the well-known
hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are
reserved, and the corresponding page frames cannot be used by the operating system.
3:Additional page frames within the first megabyte may be reserved by specific computer
models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.
Table 2-10. Variables describing the kernel's physical memory layout
Variable name Description
num_physpages Page frame number of the highest usable page frame
totalram_pages Total number of usable page frames
min_low_pfn Page frame number of the first usable page frame after the kernel image in RAM
max_pfn Page frame number of the last usable page frame
max_low_pfn Page frame number of the last page frame directly mapped by the kernel (low memory)
totalhigh_pages Total number of page frames not directly mapped by the kernel (high memory)
highstart_pfn Page frame number of the first page frame not directly mapped by the kernel
highend_pfn Page frame number of the last page frame not directly mapped by the kernel
2.5.4. Process Page Tables
The linear address space of a process is divided into two parts:
Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.
Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.
The content of the first entries of the Page Global Directory that map linear addresses lower than
0xc0000000 (the first 768 entries with PAE disabled, or the first 3 entries with PAE enabled) depends
on the specific process. Conversely, the remaining entries should be the same for all processes and
equal to the corresponding entries of the master kernel Page Global Directory (see the following
section).?????
2.5.5. Kernel Page Tables???
In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial Page Tables,
and 128 KB for some dynamic data structures. This minimal address space is just large enough to install the kernel in RAM and to initialize its core
data structures.
In the second phase, the kernel takes advantage of all of the existing RAM and sets up the page tables properly.
Let us examine how this plan is executed.
2.5.5.1. Provisional kernel Page Tables
2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB
2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB
2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB
2.5.6. Fix-Mapped Linear Addresses
fix_to_virt( )
2.5.7. Handling the Hardware Cache and the TLB
2.5.7.1. Handling the hardware cache
The L1_CACHE_BYTES macro yields the size of a cache line in bytes.
2.5.7.2. Handling the TLB
Chapter 3. Processes
3.1. Processes, Lightweight Processes, and Threads
3.2. Process Descriptor
struct task_struct
3.2.1. Process State
e.g., p->state = TASK_RUNNING;
3.2.2. Identifying a Process
process descriptor pointers
tgid (thread group ID) / pid
3.2.2.1. Process descriptors handling
Figure 3-2. Storing the thread_info structure and the process (task_struct) kernel stack in two page frames
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[2048]; /* 1024 for 4KB stacks */
};
1) esp is the CPU stack pointer.
3.2.2.2. Identifying the current process
current_thread_info( ) denotes the thread_info structure pointer of the process running on the CPU that executes the instruction.
current_thread_info( )->task or current denotes the process descriptor pointer of the process running on the CPU.
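As the book explains, on the 80x86 with 8 KB kernel stacks current_thread_info( ) is obtained by masking out the 13 least significant bits of esp. A hedged sketch of that mechanism, close in spirit to the 2.6 i386 code:

static inline struct thread_info *current_thread_info(void)
{
    struct thread_info *ti;
    /* round esp down to the 8 KB boundary where thread_info lives */
    __asm__("andl %%esp,%0" : "=r" (ti) : "0" (~(8192UL - 1)));
    return ti;
}
#define current (current_thread_info()->task)  /* simplified; the real kernel wraps this in get_current( ) */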
#define task_stack_page(task) ((task)->stack)  /* derives the stack base, i.e. the thread_info address, from a task_struct (cf. current) */
#define task_thread_info(task) ((struct thread_info *)(task)->stack)  /* derives the thread_info pointer from a task_struct (cf. current_thread_info( )) */
3.2.2.3. Doubly linked lists
Note that the pointers in a list_head field store the addresses of other list_head fields rather than the addresses of the whole data structures
in which the list_head structure is included.
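A minimal user-space re-implementation of the embedded-list idiom makes this concrete; list_entry below mirrors the kernel macro of the same name, while struct item is a made-up container:

#include <stddef.h>
#include <stdio.h>
struct list_head { struct list_head *next, *prev; };
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
struct item {
    int value;
    struct list_head list;          /* embedded anchor, as list_head is embedded in task_struct */
};
int main(void)
{
    struct item a = { .value = 42 };
    struct list_head head = { &a.list, &a.list };   /* circular list: head <-> a */
    a.list.next = a.list.prev = &head;
    /* walking the list yields list_head pointers; list_entry recovers the container */
    struct item *p = list_entry(head.next, struct item, list);
    printf("%d\n", p->value);                       /* prints 42 */
    return 0;
}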
3.2.2.4. The process list
Another useful macro, called for_each_process, scans the whole process list
3.2.2.5. The lists of TASK_RUNNING processes
enqueue_task(p,array) /dequeue_task(p,array)
3.2.3. Relationships Among Processes
3.2.3.1. The pidhash table and chained lists--???
each hash table is stored in four page frames
3.2.4. How Processes Are Organized
3.2.4.1. Wait queues
wait_queue_head_t:
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
Elements of a wait queue list are of type wait_queue_t:
struct __wait_queue {
    unsigned int flags;
    struct task_struct * task;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
3.2.4.2. Handling wait queues
The prepare_to_wait( ), prepare_to_wait_exclusive( ), and finish_wait( ) functions,
introduced in Linux 2.6, offer yet another way to put the current process to sleep in a wait
queue. Typically, they are used as follows:
DEFINE_WAIT(wait);
prepare_to_wait_exclusive(&wq, &wait, TASK_INTERRUPTIBLE);
/* wq is the head of the wait queue */
...
if (!condition)
schedule();
finish_wait(&wq, &wait);
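In practice the sleeper usually loops until the condition holds, to survive spurious wake-ups; a hedged sketch of that pattern (wq and condition are placeholders, and the waker side is shown only for context):

/* sleeper side */
DEFINE_WAIT(wait);
for (;;) {
    prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
    if (condition)
        break;
    schedule();
}
finish_wait(&wq, &wait);

/* waker side, e.g. in an interrupt handler or another process */
condition = 1;
wake_up(&wq);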
3.2.5. Process Resource Limits
The resource limits for the current process are stored in the current->signal->rlim field, that is, in
a field of the process's signal descriptor
3.3. Process Switch
3.3.1. Hardware Context
a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved
in the Kernel Mode stack.
3.3.2. Task State Segment
The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is
stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of
the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base
and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having
to retrieve the TSS address from the GDT.
3.3.2.1. The thread field
Thus, each process descriptor includes a field called thread of type thread_struct, in which the
kernel saves the hardware context whenever the process is being switched out.
3.3.3. Performing the Process Switch
Essentially, every process switch consists of two steps:
1.Switching the Page Global Directory to install a new address space; we'll describe this step in Chapter 9.
2.Switching the Kernel Mode stack and the hardware context, which provides all the information
needed by the kernel to execute the new process, including the CPU registers.
3.3.3.1. The switch_to macro---???
3.3.3.2. The _ _switch_to ( ) function--???
3.3.4. Saving and Loading the FPU, MMX, and XMM Registers
3.3.4.1. Saving the FPU registers
3.3.4.2. Loading the FPU registers
3.3.4.3. Using the FPU, MMX, and SSE/SSE2 units in Kernel Mode
3.4. Creating Processes
3.4.1. The clone( ), fork( ), and vfork( ) System Calls
Lightweight processes are created in Linux by using a function named clone( ), which uses the following parameters:
fn, arg, flags, child_stack, tls, ptid, ctid
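From user space, lightweight processes are usually created through the glibc clone( ) wrapper (whose argument order differs from the raw system call). A hedged sketch; the flag combination, stack size, and child_fn are illustrative only:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("child: got \"%s\"\n", (char *)arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;
    /* The stack grows downward on the 80x86, so pass the top of the area.
     * CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND makes the child share
     * the parent's address space, i.e. a lightweight process. */
    int pid = clone(child_fn, stack + STACK_SIZE,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                    "hello");
    if (pid == -1)
        return 1;
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}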
3.4.1.1. The do_fork( ) function
do_fork(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr)
clone_flags
Same as the flags parameter of clone( )
stack_start:
Specifies the User Mode stack pointer to be assigned to the esp register of the child process.
The invoking process (the parent) should always allocate a new stack for the child.
regs:
Pointer to the values of the general purpose registers saved into the Kernel Mode stack when
switching from User Mode to Kernel Mode (see the section "The do_IRQ( ) function" in Chapter
4)
stack_size
Unused (always set to 0)
parent_tidptr:
Specifies the address of a User Mode variable of the parent process that will hold the PID of
the new lightweight process. Meaningful only if the CLONE_PARENT_SETTID flag is set.
child_tidptr:
Specifies the address of a User Mode variable of the new lightweight process that will hold the
PID of such process. Meaningful only if the CLONE_CHILD_SETTID flag is set.
3.4.1.2. The copy_process( ) function
3.4.2. Kernel Threads
3.4.2.1. Creating a kernel thread
Kernel threads run only in Kernel Mode, while regular processes run alternatively in Kernel Mode and in User Mode.
Because kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET.
Regular processes, on the other hand, use all four gigabytes of linear addresses, in either User Mode or Kernel Mode.
3.4.2.2. Process 0 (the swapper, or idle, process)
3.4.2.3. Process 1 (init)
3.4.2.4. Other kernel threads
keventd (also called events)
Executes the functions in the keventd_wq workqueue (see Chapter 4).
kapmd
Handles the events related to the Advanced Power Management (APM).
kswapd
Reclaims memory, as described in the section "Periodic Reclaiming" in Chapter 17.
pdflush
Flushes "dirty" buffers to disk to reclaim memory, as described in the section "The pdflush Kernel Threads" in Chapter 15.
kblockd
Executes the functions in the kblockd_workqueue workqueue. Essentially, it periodically activates the block device driver.
ksoftirqd
Runs the tasklets (see section "Softirqs and Tasklets" in Chapter 4); there is one of these kernel threads for each CPU in the system.
3.5. Destroying Processes
The exit( ) library function may be inserted by the programmer explicitly. Additionally,
the C compiler always inserts an exit( ) function call right after the last statement of the main( ) function.
3.5.1. Process Termination
3.5.1.1. The do_group_exit( ) function
3.5.1.2. The do_exit( ) function
3.5.2. Process Removal
The release_task( ) function detaches the last data structures from the descriptor of a zombie
process; it is applied on a zombie process in two possible ways: by the do_exit( ) function if the
parent is not interested in receiving signals from the child, or by the wait4( ) or waitpid( ) system
calls after a signal has been sent to the parent. In the latter case, the function also will reclaim the
memory used by the process descriptor, while in the former case the memory reclaiming will be
done by the scheduler (see Chapter 7).
Chapter 4. Interrupts and Exceptions
Synchronous interrupts are produced by the CPU control unit while executing instructions and
are called synchronous because the control unit issues them only after terminating the
execution of an instruction
Asynchronous interrupts are generated by other hardware devices at arbitrary times with
respect to the CPU clock signals.
4.1. The Role of Interrupt Signals
4.2. Interrupts and Exceptions
Interrupts:
Maskable interrupts
Nonmaskable interrupts
Exceptions:
Processor-detected exceptions
Faults
Traps
Aborts
Programmed exceptions
4.2.1. IRQs and Interrupts
4.2.1.1. The Advanced Programmable Interrupt Controller (APIC)
4.2.2. Exceptions
4.2.3. Interrupt Descriptor Table
Interrupt Descriptor Table (IDT )
256 x 8 = 2048 bytes
4.2.4. Hardware Handling of Interrupts and Exceptions
4.3. Nested Execution of Exception and Interrupt Handlers
4.4. Initializing the Interrupt Descriptor Table
4.4.1. Interrupt, Trap, and System Gates
4.4.2. Preliminary Initialization of the IDT
4.5. Exception Handling
Exception handlers have a standard structure consisting of three steps:
1. Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
2. Handle the exception by means of a high-level C function.
3. Exit from the handler by means of the ret_from_exception( ) function.
4.5.1. Saving the Registers for the Exception Handler
4.5.2. Entering and Leaving the Exception Handler
The exception handler always checks whether the exception occurred in User Mode or in Kernel Mode
and, in the latter case, whether it was due to an invalid argument passed to a system call. We'll
describe in the section "Dynamic Address Checking: The Fix-up Code" in Chapter 10 how the kernel
defends itself against invalid arguments passed to system calls. Any other exception raised in Kernel
Mode is due to a kernel bug. In this case, the exception handler knows the kernel is misbehaving. In
order to avoid data corruption on the hard disks, the handler invokes the die( ) function, which
prints the contents of all CPU registers on the console (this dump is called kernel oops ) and
terminates the current process by calling do_exit( ) (see "Process Termination" in Chapter 3).
4.6. Interrupt Handling
I/O interrupts
Timer interrupts
Interprocessor interrupts
4.6.1. I/O Interrupt Handling
1. Save the IRQ value and the register's contents on the Kernel Mode stack.
2.Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue
further interrupts.
3. Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
4. Terminate by jumping to the ret_from_intr( ) address.
Vector range Use
0~19 (0x0-0x13) Nonmaskable interrupts and exceptions
20~31 (0x14-0x1f) Intel-reserved
32~127 (0x20-0x7f) External interrupts (IRQs)
128 (0x80) Programmed exception for system calls (see Chapter 10)
129~238 (0x81-0xee) External interrupts (IRQs)
239 (0xef) Local APIC timer interrupt (see Chapter 6)
240 (0xf0) Local APIC thermal interrupt (introduced in the Pentium 4 models)
241~250 (0xf1-0xfa) Reserved by Linux for future use
251~253 (0xfb-0xfd) Interprocessor interrupts (see the section "Interprocessor Interrupt Handling" later in this chapter)
254 (0xfe) Local APIC error interrupt (generated when the local APIC detects an erroneous condition)
255 (0xff) Local APIC spurious interrupt (generated if the CPU masks an interrupt while the hardware device raises it)
4.6.1.2. IRQ data structures
Figure 4-5. IRQ descriptors
1: irq_desc
Field Description
handler Points to the PIC object (hw_irq_controller descriptor) that services the IRQ line.
handler_data Pointer to data used by the PIC methods.
action Identifies the interrupt service routines to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ. The irqaction descriptor is described later in the chapter.
status A set of flags describing the IRQ line status (see Table 4-5).
depth Shows 0 if the IRQ line is enabled and a positive value if it has been disabled at least once.
irq_count Counter of interrupt occurrences on the IRQ line (for diagnostic use only).
irqs_unhandled Counter of unhandled interrupt occurrences on the IRQ line (for diagnostic use only).
lock A spin lock used to serialize the accesses to the IRQ descriptor and to the PIC (see Chapter 5).
Table 4-5. Flags describing the IRQ line status
Flag name Description
IRQ_INPROGRESS A handler for the IRQ is being executed.
IRQ_DISABLED The IRQ line has been deliberately disabled by a device driver.
IRQ_PENDING An IRQ has occurred on the line; its occurrence has been acknowledged to the PIC, but it has not yet been serviced by the kernel.
IRQ_REPLAY The IRQ line has been disabled but the previous IRQ occurrence has not yet been acknowledged to the PIC.
IRQ_AUTODETECT The kernel is using the IRQ line while performing a hardware device probe.
IRQ_WAITING The kernel is using the IRQ line while performing a hardware device probe; moreover, the corresponding interrupt has not been raised.
IRQ_LEVEL Not used on the 80 x 86 architecture.
IRQ_MASKED Not used.
IRQ_PER_CPU Not used on the 80 x 86 architecture.
2: irqaction descriptors
Field name Description
handler Points to the interrupt service routine for an I/O device. This is the key field that allows many devices to share the same IRQ.
flags This field includes a few fields that describe the relationships between the IRQ line and the I/O device (see Table 4-7).
mask Not used.
name The name of the I/O device (shown when listing the serviced IRQs by reading the /proc/interrupts file).
dev_id A private field for the I/O device. Typically, it identifies the I/O device itself (for instance, it could be equal to its major and minor numbers; see the section "Device Files" in Chapter 13), or it points to the device driver's data.
next Points to the next element of a list of irqaction descriptors. The elements in the list refer to hardware devices that share the same IRQ.
irq IRQ line.
dir Points to the descriptor of the /proc/irq/n directory associated with the IRQ n.
4.6.1.3. IRQ distribution in multiprocessor systems
4.6.1.4. Multiple Kernel Mode stacks
4.6.1.5. Saving the registers for the interrupt handler
4.6.1.6. The do_IRQ( ) function
4.6.1.7. The _ _do_IRQ( ) function
4.6.1.8. Reviving a lost interrupt
4.6.1.9. Interrupt service routines
4.6.1.10. Dynamic allocation of IRQ lines
4.6.2. Interprocessor Interrupt Handling (IPI)
CALL_FUNCTION_VECTOR (vector 0xfb)
RESCHEDULE_VECTOR (vector 0xfc)
INVALIDATE_TLB_VECTOR (vector 0xfd)
4.7. Softirqs and Tasklets
4.7.1. Softirqs
Table 4-9. Softirqs used in Linux 2.6
Softirq Index (priority) Description
HI_SOFTIRQ 0 Handles high priority tasklets
TIMER_SOFTIRQ 1 Tasklets related to timer interrupts
NET_TX_SOFTIRQ 2 Transmits packets to network cards
NET_RX_SOFTIRQ 3 Receives packets from network cards
SCSI_SOFTIRQ 4 Post-interrupt processing of SCSI commands
TASKLET_SOFTIRQ 5 Handles regular tasklets
4.7.1.1. Data structures used for softirqs
1.softirq_action
2. Another critical field used to keep track both of kernel preemption and of nesting of kernel control paths is the 32-bit preempt_count field stored in the thread_info descriptor of each process.
4.7.1.2. Handling softirqs
4.7.1.3. The do_softirq( ) function
4.7.1.4. The _ _do_softirq( ) function
4.7.1.5. The ksoftirqd kernel threads
4.7.2. Tasklets
tasklet descriptors
tasklet_struct
Field name Description
next Pointer to next descriptor in the list
state Status of the tasklet
count Lock counter
func Pointer to the tasklet function
data An unsigned long integer that may be used by the tasklet function
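A hedged sketch of declaring and scheduling a tasklet with the 2.6 API (my_tasklet_handler and the trigger site are hypothetical):

#include <linux/interrupt.h>

static void my_tasklet_handler(unsigned long data)
{
    /* deferred work, executed later in softirq context */
}

DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0);

static void trigger(void)              /* typically called from an interrupt handler */
{
    tasklet_schedule(&my_tasklet);     /* marks the tasklet as scheduled and raises TASKLET_SOFTIRQ */
}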
4.8. Work Queues
Chapter 5. Kernel Synchronization
5.1. How the Kernel Services Requests
5.1.1. Kernel Preemption
kernel preemption is disabled when the preempt_count field
in the thread_info descriptor referenced by the current_thread_info( ) macro is greater than zero
Table 5-1. Macros dealing with the preemption counter subfield
Macro Description
preempt_count( ) Selects the preempt_count field in the thread_info descriptor
preempt_disable( ) Increases by one the value of the preemption counter
preempt_enable_no_resched( ) Decreases by one the value of the preemption counter
preempt_enable( ) Decreases by one the value of the preemption counter, and invokes preempt_schedule( ) if the TIF_NEED_RESCHED flag in the thread_info descriptor is set
get_cpu( ) Similar to preempt_disable( ), but also returns the number of the local CPU
put_cpu( ) Same as preempt_enable( )
put_cpu_no_resched( ) Same as preempt_enable_no_resched( )
5.1.2. When Synchronization Is Necessary
5.1.3. When Synchronization Is Not Necessary
5.2. Synchronization Primitives
5.2.1. Per-CPU Variables
a kernel control path should access a per-CPU variable with kernel preemption disabled.
5.2.2. Atomic Operations
Every such operation must be executed in a single instruction, without being interrupted in the middle, and avoiding accesses to the same memory location by other CPUs.
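A hedged sketch of the atomic API applied to a reference counter (release_object( ) is a hypothetical helper):

#include <asm/atomic.h>

static atomic_t refcount = ATOMIC_INIT(1);

static void release_object(void)
{
    /* free the object; hypothetical */
}

static void get_object(void)
{
    atomic_inc(&refcount);              /* single, uninterruptible read-modify-write */
}

static void put_object(void)
{
    if (atomic_dec_and_test(&refcount)) /* atomically decrement and test for zero */
        release_object();
}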
5.2.3. Optimization and Memory Barriers
1:Optimization Barriers
An optimization barrier primitive ensures that the assembly language instructions corresponding to
C statements placed before the primitive are not mixed by the compiler with assembly language
instructions corresponding to C statements placed after the primitive. In Linux the barrier( )
macro, which expands into asm volatile("":::"memory"), acts as an optimization barrier.
2:Memory Barriers
A memory barrier primitive ensures that the operations placed before the primitive are finished
before starting the operations placed after the primitive.
Table 5-6. Memory barriers in Linux
Macro Description
mb( ) Memory barrier for MP and UP
rmb( ) Read memory barrier for MP and UP
wmb( ) Write memory barrier for MP and UP
smp_mb( ) Memory barrier for MP only
smp_rmb( ) Read memory barrier for MP only
smp_wmb( ) Write memory barrier for MP only
5.2.4. Spin Locks
Spin locks are a special kind of lock designed to work in a multiprocessor environment. If the kernel
control path finds the spin lock "open," it acquires the lock and continues its execution. Conversely,
if the kernel control path finds the lock "closed" by a kernel control path running on another CPU, it
"spins" around, repeatedly executing a tight instruction loop, until the lock is released.
The instruction loop of spin locks represents a "busy wait." The waiting kernel control path keeps
running on the CPU, even if it has nothing to do besides waste time. Nevertheless, spin locks are
usually convenient, because many kernel resources are locked for a fraction of a millisecond only;
therefore, it would be far more time-consuming to release the CPU and reacquire it later
In Linux, each spin lock is represented by a spinlock_t structure consisting of two fields:
slock
Encodes the spin lock state: the value 1 corresponds to the unlocked state, while every
negative value and 0 denote the locked state
break_lock
Flag signaling that a process is busy waiting for the lock (present only if the kernel supports
both SMP and kernel preemption)
Table 5-7. Spin lock macros
Macro Description
spin_lock_init( ) Set the spin lock to 1 (unlocked)
spin_lock( ) Cycle until spin lock becomes 1 (unlocked), then set it to 0 (locked)
spin_unlock( ) Set the spin lock to 1 (unlocked)
spin_unlock_wait() Wait until the spin lock becomes 1 (unlocked)
spin_is_locked( ) Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise
spin_trylock( ) Set the spin lock to 0 (locked), and return 1 if the previous value of the lock was 1; 0 otherwise
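A hedged sketch of basic spin lock usage (my_lock and shared_counter are placeholders):

#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;   /* or spin_lock_init(&my_lock) at run time */
static int shared_counter;

static void update_counter(void)
{
    spin_lock(&my_lock);        /* busy-waits if another CPU holds the lock */
    shared_counter++;
    spin_unlock(&my_lock);
}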
5.2.4.1. The spin_lock( ) macro with kernel preemption
5.2.4.2. The spin_lock( ) macro without kernel preemption
5.2.4.3. The spin_unlock( ) macro
5.2.5. Read/Write Spin Locks
Each read/write spin lock is a rwlock_t structure; its lock field is a 32-bit field that encodes two
distinct pieces of information:
1: A 24-bit counter denoting the number of kernel control paths currently reading the protected data structure. The two's complement value of this counter is stored in bits 0-23 of the field.
2:An unlock flag that is set when no kernel control path is reading or writing, and clear
otherwise. This unlock flag is stored in bit 24 of the field.
Notice that the lock field stores the number 0x01000000 if the spin lock is idle (unlock flag set and no
readers), the number 0x00000000 if it has been acquired for writing (unlock flag clear and no
readers), and any number in the sequence 0x00ffffff, 0x00fffffe, and so on, if it has been
acquired for reading by one, two, or more processes (unlock flag clear and the two's complement on
24 bits of the number of readers). As in the spinlock_t structure, the rwlock_t structure also includes a break_lock field.
5.2.5.1. Getting and releasing a lock for reading
5.2.5.2. Getting and releasing a lock for writing
5.2.6. Seqlocks
Each seqlock is a seqlock_t structure consisting of two fields: a lock field of type spinlock_t and an
integer sequence field. This second field plays the role of a sequence counter. Each reader must read
this sequence counter twice, before and after reading the data, and check whether the two values
coincide. In the opposite case, a new writer has become active and has increased the sequence
counter, thus implicitly telling the reader that the data just read is not valid.
seqlock_init
write_seqlock( )
write_sequnlock( )
read_seqbegin()
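A hedged sketch of the whole pattern; read_seqretry( ) is the companion of read_seqbegin( ) even though it is not listed above, and the shared variables are placeholders:

#include <linux/seqlock.h>

static seqlock_t my_seqlock = SEQLOCK_UNLOCKED;
static unsigned long shared_a, shared_b;

static void writer(unsigned long a, unsigned long b)
{
    write_seqlock(&my_seqlock);     /* takes the spin lock and bumps the sequence counter */
    shared_a = a;
    shared_b = b;
    write_sequnlock(&my_seqlock);   /* bumps the counter again and releases the lock */
}

static unsigned long reader(void)
{
    unsigned long a, b;
    unsigned int seq;
    do {
        seq = read_seqbegin(&my_seqlock);
        a = shared_a;
        b = shared_b;
    } while (read_seqretry(&my_seqlock, seq));   /* retry if a writer slipped in */
    return a + b;
}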
5.2.7. Read-Copy Update (RCU)
How does RCU obtain the surprising result of synchronizing several CPUs without shared data
structures? The key idea consists of limiting the scope of RCU as follows:
1.Only data structures that are dynamically allocated and referenced by means of pointers can be
protected by RCU.
2. No kernel control path can sleep inside a critical region protected by RCU.
rcu_read_lock( )
rcu_read_unlock( )
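A hedged sketch of the read side, assuming rcu_dereference( ), which belongs to the same API; struct foo and gbl_ptr are placeholders for a dynamically allocated, pointer-referenced structure:

#include <linux/rcupdate.h>

struct foo { int value; };
static struct foo *gbl_ptr;

static int read_value(void)
{
    struct foo *p;
    int v;
    rcu_read_lock();                 /* no sleeping allowed until the matching unlock */
    p = rcu_dereference(gbl_ptr);    /* fetch the RCU-protected pointer */
    v = p ? p->value : -1;
    rcu_read_unlock();
    return v;
}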
5.2.8. Semaphores
they implement a locking primitive that allows waiters to sleep until the desired resource becomes free.
Actually, Linux offers two kinds of semaphores
1.Kernel semaphores, which are used by kernel control paths
A kernel semaphore is similar to a spin lock, in that it doesn't allow a kernel control path to proceed
unless the lock is open. However, whenever a kernel control path tries to acquire a busy resource
protected by a kernel semaphore, the corresponding process is suspended. It becomes runnable
again when the resource is released. Therefore, kernel semaphores can be acquired only by
functions that are allowed to sleep; interrupt handlers and deferrable functions cannot use them.
2.System V IPC semaphores, which are used by User Mode processes
struct semaphore
count:
Stores an atomic_t value. If it is greater than 0, the resource is free, that is, it is currently
available. If count is equal to 0, the semaphore is busy but no other process is waiting for the
protected resource. Finally, if count is negative, the resource is unavailable and at least one
process is waiting for it.
wait
Stores the address of a wait queue list that includes all sleeping processes that are currently
waiting for the resource. Of course, if count is greater than or equal to 0, the wait queue is
empty.
sleepers
Stores a flag that indicates whether some processes are sleeping on the semaphore. We'll see
this field in operation soon.
Getting and releasing semaphores
up( )
down( )
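A hedged sketch of using a kernel semaphore as a mutex (only code that is allowed to sleep may call down( )):

#include <asm/semaphore.h>

static DECLARE_MUTEX(my_sem);       /* semaphore initialized with count = 1 */

static void critical_section(void)
{
    down(&my_sem);                  /* suspends the process if the resource is busy */
    /* ... access the protected resource ... */
    up(&my_sem);                    /* wakes up one sleeping waiter, if any */
}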
5.2.9. Read/Write Semaphores
Each read/write semaphore is described by a rw_semaphore structure that includes the following
fields:
count
Stores two 16-bit counters. The counter in the most significant word encodes in two's
complement form the sum of the number of nonwaiting writers (either 0 or 1) and the number
of waiting kernel control paths. The counter in the less significant word encodes the total
number of nonwaiting readers and writers.
wait_list
Points to a list of waiting processes. Each element in this list is a rwsem_waiter structure,
including a pointer to the descriptor of the sleeping process and a flag indicating whether the
process wants the semaphore for reading or for writing.
wait_lock
A spin lock used to protect the wait queue list and the rw_semaphore structure itself.
5.2.10. Completions
The real difference between completions and semaphores is how the spin lock included in the wait
queue is used. In completions, the spin lock is used to ensure that complete( ) and
wait_for_completion( ) cannot execute concurrently. In semaphores, the spin lock is used to avoid
letting concurrent invocations of down( ) mess up the semaphore data structure.
5.2.11. Local Interrupt Disabling
local_irq_disable( )
local_irq_enable( )
local_irq_save()
local_irq_restore()
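A hedged sketch of the save/restore pair, which preserves whatever the previous IF state was:

unsigned long flags;

local_irq_save(flags);      /* save eflags and execute cli */
/* ... short region that must not be interrupted on this CPU ... */
local_irq_restore(flags);   /* restore the saved interrupt state */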
5.2.12. Disabling and Enabling Deferrable Functions
local_bh_disable
local_bh_enable
5.3. Synchronizing Accesses to Kernel Data Structures
5.3.1. Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
5.4. Examples of Race Condition Prevention
5.4.1. Reference Counters
A reference counter is just an atomic_t counter associated with a specific resource such as a memory page, a module, or a file
5.4.2. The Big Kernel Lock
lock_kernel( )/unlock_kernel( )
5.4.3. Memory Descriptor Read/Write Semaphore
Each memory descriptor of type mm_struct includes its own semaphore in the mmap_sem field (see the section "The Memory Descriptor" in Chapter 9)
5.4.4. Slab Cache List Semaphore
The list of slab cache descriptors (see the section "Cache Descriptor" in Chapter 8) is protected by the cache_chain_sem semaphore,
which grants an exclusive right to access and modify the list.
5.4.5. Inode Semaphore
Chapter 6. Timing Measurements
6.1. Clock and Timer Circuits
6.1.1. Real Time Clock (RTC)
6.1.2. Time Stamp Counter (TSC)
6.1.3. Programmable Interval Timer (PIT)
6.1.4. CPU Local Timer
6.1.5. High Precision Event Timer (HPET)
6.1.6. ACPI Power Management Timer
6.2. The Linux Timekeeping Architecture
6.2.1. Data Structures of the Timekeeping Architecture
6.2.1.1. The timer object
a descriptor of type timer_opts consisting of the timer name and of four standard methods shown in Table 6-1.
Table 6-1. The fields of the timer_opts data structure
Field name Description
name A string identifying the timer source
mark_offset Records the exact time of the last tick; it is invoked by the timer interrupt handler
get_offset Returns the time elapsed since the last tick
monotonic_clock Returns the number of nanoseconds since the kernel initialization
delay Waits for a given number of "loops" (see the later section "Delay Functions")
6.2.1.2. The jiffies variable
The jiffies variable is a counter that stores the number of elapsed ticks since the system was started.
6.2.1.3. The xtime variable
The xtime variable stores the current time and date; it is a structure of type timespec having two
fields:
tv_sec
Stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC)
tv_nsec
Stores the number of nanoseconds that have elapsed within the last second (its value ranges
between 0 and 999,999,999)
6.2.2. Timekeeping Architecture in Uniprocessor Systems
6.2.2.1. Initialization phase
During kernel initialization, the time_init( ) function is invoked to set up the timekeeping architecture.
6.2.2.2. The timer interrupt handler
The timer_interrupt( ) function is the interrupt service routine (ISR) of the PIT or of the HPET;
6.2.3. Timekeeping Architecture in Multiprocessor Systems
6.2.3.1. Initialization phase
6.2.3.2. The global timer interrupt handler
6.2.3.3. The local timer interrupt handler
6.3. Updating the Time and Date
update_times( )-->update_wall_time( )
6.4. Updating System Statistics
6.4.1. Updating Local CPU Statistics
update_process_times( ) function
6.4.2. Keeping Track of System Load
calc_load( ) function
6.4.3. Profiling the Kernel Code
profile_tick( ) function
1|shell@android:/ # oprofiled --usage
Usage: oprofiled [-v?] [--session-dir=/var/lib/oprofile]
[-r|--kernel-range start-end] [-k|--vmlinux file] [--no-vmlinux]
[--xen-range=start-end] [--xen-image=file]
[--image=profile these comma separated image] [--separate-lib=[0|1]]
[--separate-kernel=[0|1]] [--separate-thread=[0|1]]
[--separate-cpu=[0|1]] [-e|--events [events]] [-v|--version]
[-V|--verbose all,sfile,arcs,samples,module,misc]
[-x|--ext-feature <extended-feature-name>:[args]] [-?|--help] [--usage]
6.4.4. Checking the NMI Watchdogs
do_nmi( )
6.5. Software Timers and Delay Functions
6.5.1. Dynamic Timers
struct timer_list {
    struct list_head entry;
    unsigned long expires;
    spinlock_t lock;
    unsigned long magic;
    void (*function)(unsigned long);
    unsigned long data;
    tvec_base_t *base;
};
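A hedged sketch of arming a dynamic timer with this structure (my_timer_fn is hypothetical and runs in softirq context when the timer expires):

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_timer_fn(unsigned long data)
{
    /* executed once jiffies >= expires */
}

static void arm_timer(void)
{
    init_timer(&my_timer);
    my_timer.function = my_timer_fn;
    my_timer.data     = 0;
    my_timer.expires  = jiffies + HZ;   /* about one second from now */
    add_timer(&my_timer);               /* mod_timer( )/del_timer( ) re-arm or cancel it */
}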
6.5.1.1. Dynamic timers and race conditions
6.5.1.2. Data structures for dynamic timers
typedef struct tvec_t_base_s {
    spinlock_t lock;
    unsigned long timer_jiffies;
    struct timer_list *running_timer;
    tvec_root_t tv1;
    tvec_t tv2;
    tvec_t tv3;
    tvec_t tv4;
    tvec_t tv5;
} tvec_base_t;
6.5.1.3. Dynamic timer handling
run_timer_softirq( )
6.5.2. An Application of Dynamic Timers: the nanosleep( ) System Call
6.5.3. Delay Functions
6.6. System Calls Related to Timing Measurements
6.6.1. The time( ) and gettimeofday( ) System Calls
6.6.2. The adjtimex( ) System Call
6.6.3. The setitimer( ) and alarm( ) System Calls
ITIMER_REAL
The actual elapsed time; the process receives SIGALRM signals.
ITIMER_VIRTUAL
The time spent by the process in User Mode; the process receives SIGVTALRM signals.
ITIMER_PROF
The time spent by the process both in User and in Kernel Mode; the process receives SIGPROF
signals.
The ITIMER_REAL interval timer is implemented by using dynamic timers because the kernel must
send signals to the process even when it is not running on the CPU. Therefore, each process
descriptor includes a dynamic timer object called real_timer. The setitimer( ) system call
initializes the real_timer fields and then invokes add_timer( ) to insert the dynamic timer in the
proper list. When the timer expires, the kernel executes the it_real_fn( ) timer function. In turn,
the it_real_fn( ) function sends a SIGALRM signal to the process; then, if it_real_incr is not null, it
sets the expires field again, reactivating the timer.
The ITIMER_VIRTUAL and ITIMER_PROF interval timers do not require dynamic timers, because they
can be updated while the process is running. The account_it_virt( ) and account_it_prof( )
functions are invoked by update_process_times( ), which is called either by the PIT's timer
interrupt handler (UP) or by the local timer interrupt handlers (SMP). Therefore, the two interval
timers are updated once every tick, and if they are expired, the proper signal is sent to the current
process.
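A user-space sketch of ITIMER_REAL; the one-second periods are arbitrary, and the handler only uses an async-signal-safe call:

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;
    write(1, "tick\n", 5);
}

int main(void)
{
    struct itimerval tv = {
        .it_interval = { .tv_sec = 1, .tv_usec = 0 },  /* it_real_incr: re-arm every second */
        .it_value    = { .tv_sec = 1, .tv_usec = 0 },  /* first expiration after one second */
    };
    signal(SIGALRM, on_alarm);
    setitimer(ITIMER_REAL, &tv, NULL);
    for (;;)
        pause();                     /* each expiration delivers SIGALRM */
}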
6.6.4. System Calls for POSIX Timers
Chapter 7. Process Scheduling
7.1. Scheduling Policy
When speaking about scheduling, processes are traditionally classified as I/O-bound or CPU-bound.
The former make heavy use of I/O devices and spend much time waiting for I/O operations to
complete; the latter carry on number-crunching applications that require a lot of CPU time.
Interactive processes
Batch processes
Real-time processes
Table 7-1. System calls related to scheduling
System call Description
nice( ) Change the static priority of a conventional process
getpriority( ) Get the maximum static priority of a group of conventional processes
setpriority( ) Set the static priority of a group of conventional processes
sched_getscheduler( ) Get the scheduling policy of a process
sched_setscheduler( ) Set the scheduling policy and the real-time priority of a process
sched_getparam( ) Get the real-time priority of a process
sched_setparam( ) Set the real-time priority of a process
sched_yield( ) Relinquish the processor voluntarily without blocking
sched_get_priority_min( ) Get the minimum real-time priority value for a policy
sched_get_priority_max( ) Get the maximum real-time priority value for a policy
sched_rr_get_interval( ) Get the time quantum value for the Round Robin policy
sched_setaffinity( ) Set the CPU affinity mask of a process
sched_getaffinity( ) Get the CPU affinity mask of a process
7.1.1. Process Preemption
7.1.2. How Long Must a Quantum Last?
The choice of the average quantum duration is always a compromise. The rule of thumb adopted by
Linux is choose a duration as long as possible, while keeping good system response time.
7.2. The Scheduling Algorithm
SCHED_FIFO
SCHED_RR
SCHED_NORMAL
7.2.1. Scheduling of Conventional Processes
The kernel represents the static priority of a conventional process with a number ranging from 100 (highest priority) to
139 (lowest priority); notice that static priority decreases as the values increase.
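Static priority is 120 plus the nice value (-20 to +19), and the book derives the base time quantum from it; a small sketch of that formula as given in the text:

#include <stdio.h>

/* base time quantum in milliseconds:
 * (140 - static priority) * 20 if static priority < 120, else (140 - static priority) * 5 */
static unsigned int base_time_quantum(int static_prio)
{
    return (static_prio < 120) ? (140 - static_prio) * 20
                               : (140 - static_prio) * 5;
}

int main(void)
{
    printf("%u %u %u\n",
           base_time_quantum(100),   /* nice -20 -> 800 ms */
           base_time_quantum(120),   /* nice   0 -> 100 ms */
           base_time_quantum(139));  /* nice +19 ->   5 ms */
    return 0;
}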
7.2.1.2. Dynamic priority and average sleep time
7.2.1.3. Active and expired processes
Active processes
These runnable processes have not yet exhausted their time quantum and are thus allowed to run.
Expired processes
These runnable processes have exhausted their time quantum and are thus forbidden to run
until all active processes expire
7.2.2. Scheduling of Real-Time Processes
Every real-time process is associated with a real-time priority, which is a value ranging from 1 (highest priority) to 99 (lowest priority).
7.3. Data Structures Used by the Scheduler
7.3.1. The runqueue Data Structure
Table 7-4. The fields of the runqueue structure
Type Name Description
spinlock_t lock Spin lock protecting the lists of processes
unsigned long nr_running Number of runnable processes in the runqueue lists
unsigned long cpu_load CPU load factor based on the average number of processes in the runqueue
unsigned long nr_switches Number of process switches performed by the CPU
unsigned long nr_uninterruptible Number of processes that were previously in the runqueue lists and are now sleeping in TASK_UNINTERRUPTIBLE state (only the sum of these fields across all runqueues is meaningful)
unsigned long expired_timestamp Insertion time of the eldest process in the expired lists
unsigned long long timestamp_last_tick Timestamp value of the last timer interrupt
task_t * curr Process descriptor pointer of the currently running process (same as current for the local CPU)
task_t * idle Process descriptor pointer of the swapper process for this CPU
struct mm_struct * prev_mm Used during a process switch to store the address of the memory descriptor of the process being replaced
prio_array_t * active Pointer to the lists of active processes
prio_array_t * expired Pointer to the lists of expired processes
prio_array_t [2] arrays The two sets of active and expired processes
int best_expired_prio The best static priority (lowest value) among the expired processes
atomic_t nr_iowait Number of processes that were previously in the runqueue lists and are now waiting for a disk I/O operation to complete
struct sched_domain *sd Points to the base scheduling domain of this CPU (see the section "Scheduling Domains" later in this chapter)
int active_balance Flag set if some process shall be migrated from this runqueue to another (runqueue balancing)
int push_cpu Not used
task_t * migration_thread Process descriptor pointer of the migration kernel thread
struct list_head migration_queue List of processes to be removed from the runqueue
7.3.2. Process Descriptor
7.4. Functions Used by the Scheduler
7.4.1. The scheduler_tick( ) Function
Keeps the time_slice counter of current up-to-date
We have already explained in the section "Updating Local CPU Statistics" in Chapter 6 how
scheduler_tick( ) is invoked once every tick to perform some operations related to scheduling.
7.4.1.1. Updating the time slice of a real-time process
This is the meaning of round-robin scheduling
7.4.1.2. Updating the time slice of a conventional process
7.4.2. The try_to_wake_up( ) Function
Awakens a sleeping process
The try_to_wake_up( ) function awakens a sleeping or stopped process by setting its state to
TASK_RUNNING and inserting it into the runqueue of the local CPU.
7.4.3. The recalc_task_prio( ) Function
Updates the dynamic priority of a process
7.4.4. The schedule( ) Function
Selects a new process to be executed
7.4.4.1. Direct invocation
7.4.4.2. Lazy invocation
7.4.4.3. Actions performed by schedule( ) before a process switch
7.4.4.4. Actions performed by schedule( ) to make the process switch
7.4.4.5. Actions performed by schedule( ) after a process switch
7.5. Runqueue Balancing in Multiprocessor Systems
7.5.1. Scheduling Domains
Essentially, a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel.
7.5.2. The rebalance_tick( ) Function
7.5.3. The load_balance( ) Function
7.5.4. The move_tasks( ) Function
7.6. System Calls Related to Scheduling
7.6.1. The nice( ) System Call
The nice( )[*] system call allows processes to change their base priority. The integer value
contained in the increment parameter is used to modify the nice field of the process descriptor. The
nice Unix command, which allows users to run programs with modified scheduling priority, is based
on this system call.
7.6.2. The getpriority( ) and setpriority( ) System Calls
7.6.3. The sched_getaffinity( ) and sched_setaffinity( ) System Calls
The sched_getaffinity( ) and sched_setaffinity( ) system calls respectively return and set up the
CPU affinity mask of a process, that is, the bit mask of the CPUs that are allowed to execute the process.
This mask is stored in the cpus_allowed field of the process descriptor.
7.6.4. System Calls Related to Real-Time Processes
7.6.4.1. The sched_getscheduler( ) and sched_setscheduler( ) system calls
The sched_getscheduler( ) system call queries the scheduling policy currently applied to the
process identified by the pid parameter.
SCHED_FIFO, SCHED_RR, or SCHED_NORMAL
7.6.4.2. The sched_ getparam( ) and sched_setparam( ) system calls
7.6.4.3. The sched_ yield( ) system call
7.6.4.4. The sched_ get_priority_min( ) and sched_ get_priority_max( ) system calls
7.6.4.5. The sched_rr_ get_interval( ) system call