Understanding the Linux Kernel, 3rd Edition
Preface
The Audience for This Book
we try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used
Organization of the Material
We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent.
Level of Description
Overview of the Book
Conventions in This Book
How to Contact Us
Chapter 1. Introduction
1.1. Linux Versus Other Unix-Like Kernels
Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call
1.2. Hardware Dependency
1.3. Linux Versions
1.4. Basic Operating System Concepts
1.4.1. Multiuser Systems
1.4.2. Users and Groups
1.4.3. Processes
A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program.
1.4.4. Kernel Architecture
monolithic kernel vs. microkernel (modules)
1.5. An Overview of the Unix Filesystem
1.5.1. Files
1.5.2. Hard and Soft Links
1.5.3. File Types
1.5.4. File Descriptor and Inode
1.5.5. Access Rights and File Mode
When a file is created by a process, its owner ID is the UID of the process.
Its owner user group ID can be either the process group ID of the creator process or the user group ID of the parent directory,
depending on the value of the sgid flag of the parent directory.
1.5.6. File-Handling System Calls
1.5.6.1. Opening a file
1.5.6.2. Accessing an opened file
1.5.6.3. Closing a file
1.5.6.4. Renaming and deleting a file
1.6. An Overview of Unix Kernels (needs re-reading)
1.6.1. The Process/Kernel Model
kernel routines can be activated in several ways:
1:A process invokes a system call.
2:The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction.
The kernel handles the exception on behalf of the process that caused it.
3:A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention,
a status change, or the completion of an I/O operation.
Each interrupt signal is dealt with by a kernel program called an interrupt handler.
Because peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
4:A kernel thread is executed. Because it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
1.6.2. Process Implementation
When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor.
These include:
1:The program counter (PC) and stack pointer (SP) registers
2:The general purpose registers
3:The floating point registers
4:The processor control registers (Processor Status Word) containing information about the CPU state
5:The memory management registers used to keep track of the RAM accessed by the process
1.6.3. Reentrant Kernels
1.6.4. Process Address Space
1.6.5. Synchronization and Critical Regions
1.6.5.1. Kernel preemption disabling
1.6.5.2. Interrupt disabling
1.6.5.3. Semaphores
1.6.5.4. Spin locks
1.6.5.5. Avoiding deadlocks
1.6.6. Signals and Interprocess Communication
1.6.7. Process Management
1.6.7.1. Zombie processes
1.6.7.2. Process groups and login sessions
1.6.8. Memory Management
1.6.8.1. Virtual memory
1.6.8.2. Random access memory usage
1.6.8.3. Kernel Memory Allocator
1.6.8.4. Process virtual address space handling
1.6.8.5. Caching
1.6.9. Device Drivers
Chapter 2. Memory Addressing
2.1. Memory Addresses
(1)Logical address
(2)Linear address
(3)Physical address
The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit.
A second hardware circuit called a paging unit transforms the linear address into a physical address.
Figure 2-1. Logical address translation
2.2. Segmentation in Hardware
2.2.1. Segment Selectors and Segmentation Registers
(1)Segment Selectors
(2) Segmentation Registers
To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors:
cs, ss, ds, es, fs, gs.
2.2.2. Segment Descriptors
Global Descriptor Table (GDT )
Local Descriptor Table(LDT).
Code Segment Descriptor
Data Segment Descriptor
Task State Segment Descriptor (TSSD)
Local Descriptor Table Descriptor (LDTD)
2.2.3. Fast Access to Segment Descriptors
2.2.4. Segmentation Unit
2.3. Segmentation in Linux
The 2.6 version of Linux uses segmentation only when required by the 80 x 86 architecture
2.3.1. The Linux GDT
1. A Task State Segment (TSS)
2. Kernel code and data segments
3. A segment including the default Local Descriptor Table (LDT)
4. Three Thread-Local Storage (TLS) segments
5. Three segments related to Advanced Power Management (APM)
6. Five segments related to Plug and Play (PnP) BIOS services
7. A special TSS segment used by the kernel to handle "Double fault" exceptions
a few entries in the GDT may depend on the process that the CPU is executing (LDT and TLS Segment Descriptors).
2.3.2. The Linux LDTs
2.4. Paging in Hardware
page frames/page
2.4.1. Regular Paging
2.4.2. Extended Paging
2.4.3. Hardware Protection Scheme
2.4.4. An Example of Regular Paging
A simple example will help in clarifying how regular paging works. Let's assume that the kernel
assigns the linear address space between 0x20000000 and 0x2003ffff to a running process. This
space consists of exactly 64 pages. We don't care about the physical addresses of the page frames
containing the pages; in fact, some of them might not even be in main memory. We are interested
only in the remaining fields of the Page Table entries.
(As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to
reference only a subset of it.)
Let's start with the 10 most significant bits of the linear addresses assigned to the process, which
are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by
zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field
in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry
must contain the physical address of the Page Table assigned to the process (see Figure 2-9). If no
other linear addresses are assigned to the process, all the remaining 1,023 entries of the Page
Directory are filled with zeros.
The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0
to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are valid. The
remaining 960 entries are filled with zeros.
Suppose that the process needs to read the byte at linear address 0x20021406. This address is
handled by the paging unit as follows:
1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the Page Table associated with the process's pages.
2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page frame containing the desired page.
3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page frame.
If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main
memory; in this case, the paging unit issues a Page Fault exception while translating the linear
address. The same exception is issued whenever the process attempts to access linear addresses
outside of the interval delimited by 0x20000000 and 0x2003ffff, because the Page Table entries not
assigned to the process are filled with zeros; in particular, their Present flags are all cleared.
Figure 2-9. An example of paging
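The arithmetic of this example can be reproduced with a tiny user-space sketch; the 10-10-12 bit split follows the regular-paging layout described above, and the address is the one from the example:

#include <stdio.h>
int main(void)
{
    unsigned int linear = 0x20021406;             /* linear address from the example */
    unsigned int dir    = linear >> 22;           /* bits 31-22: Directory field     */
    unsigned int table  = (linear >> 12) & 0x3ff; /* bits 21-12: Table field         */
    unsigned int offset = linear & 0xfff;         /* bits 11-0:  Offset field        */
    printf("Directory=0x%03x Table=0x%03x Offset=0x%03x\n", dir, table, offset);
    /* prints Directory=0x080 Table=0x021 Offset=0x406 */
    return 0;
}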
2.4.5. The Physical Address Extension (PAE) Paging Mechanism
2.4.6. Paging for 64-bit Architectures
2.4.7. Hardware Cache (L1 cache)
The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the
cache memory. Each entry includes a tag and a few flags that describe the status of the cache line.
The tag consists of some bits that allow the cache controller to recognize the memory location
currently mapped by the line. The bits of the memory's physical address are usually split into three
groups: the most significant ones correspond to the tag, the middle ones to the cache controller
subset index, and the least significant ones to the offset within the line.
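The same three-way split can be illustrated with a small sketch; the line size and the number of indexable lines below are assumptions chosen only for the example, not values taken from any particular CPU:

#include <stdio.h>
#define LINE_SIZE 64u      /* assumed bytes per cache line -> 6 offset bits    */
#define NUM_SETS  1024u    /* assumed lines the controller indexes -> 10 bits  */
int main(void)
{
    unsigned int paddr  = 0x0012d4c8;                     /* arbitrary physical address */
    unsigned int offset = paddr % LINE_SIZE;              /* least significant bits     */
    unsigned int index  = (paddr / LINE_SIZE) % NUM_SETS; /* middle bits: subset index  */
    unsigned int tag    = paddr / (LINE_SIZE * NUM_SETS); /* most significant bits      */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}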
write-through:
the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations
write-back:
the cache line is updated and the contents of the RAM are left
unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller
writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of
cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
2.4.8. Translation Lookaside Buffers (TLB)
Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is
used for the first time, the corresponding physical address is computed through slow accesses to the
Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to
the same linear address can be quickly translated.
2.5. Paging in Linux
Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear
addresses into physical ones makes the following design objectives feasible:
1.Assign a different physical address space to each process, ensuring an efficient protection
against addressing errors.
2.Distinguish pages (groups of data) from page frames (physical addresses in main memory).
This allows the same page to be stored in a page frame, then saved to disk and later reloaded
in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).
pgd
2.5.1. The Linear Address Fields
PAGE_SHIFT/PMD_SHIFT/PUD_SHIFT/PGDIR_SHIFT
PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD
2.5.2. Page Table Handling
(1) Type-conversion macros:
__pte, __pmd, __pud, __pgd, __pgprot (prot stands for protection)
pte_val, pmd_val, pud_val, pgd_val, pgprot_val
(2) Macros and functions to read or modify page table entries:
pte_none, pmd_none, pud_none, pgd_none
pte_clear, pmd_clear, pud_clear, pgd_clear
set_pte, set_pmd, set_pud, set_pgd
pte_same(a,b)
pmd_large(e)
pmd_bad pud_bad pgd_bad
pte_present
The pmd_bad macro is used by functions to check Page Middle Directory entries passed as input
parameters. It yields the value 1 if the entry points to a bad Page Table, that is, if at least one of the
following conditions applies:
(1)The page is not in main memory (Present flag cleared).
(2)The page allows only Read access (Read/Write flag cleared).
(3)Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing Page Table).
The pte_present macro yields the value 1 if either the Present flag or the Page Size flag of a Page
Table entry is equal to 1, the value 0 otherwise. Recall that the Page Size flag in Page Table entries
has no meaning for the paging unit of the microprocessor; the kernel, however, marks Present
equal to 0 and Page Size equal to 1 for the pages present in main memory but without read, write,
or execute privileges. In this way, any access to such pages triggers a Page Fault exception because
Present is cleared, and the kernel can detect that the fault is not due to a missing page by checking
the value of Page Size.
(3): Page flag reading/setting functions
pte_user( )
pte_read( )
pte_write( )
pte_exec( )
pte_dirty( )
pte_young( )
pte_file( )
mk_pte_huge( )
pte_wrprotect( )
pte_rdprotect( )
pte_exprotect( )
pte_mkwrite( )
pte_mkread( )
pte_mkexec( )
pte_mkclean( )
pte_mkdirty( )
pte_mkold( )
pte_mkyoung( )
pte_modify(p,v)
ptep_set_wrprotect()
ptep_set_access_flags()
ptep_mkdirty()
ptep_test_and_clear_dirty()
ptep_test_and_clear_young()
(4): Macros acting on Page Table entries
pgd_index(addr)
pgd_offset(mm, addr)
pgd_offset_k(addr)
pgd_page(pgd)
pud_offset(pgd, addr)
pud_page(pud)
pmd_index(addr)
pmd_offset(pud, addr)
pmd_page(pmd)
mk_pte(p,prot)
pte_index(addr)
pte_offset_kernel(dir, addr)
pte_offset_map(dir, addr)
pte_to_pgoff(pte)
pgoff_to_pte(offset)
(5): Page allocation functions
pgd_alloc(mm)
pgd_free( pgd)
pud_alloc(mm, pgd,addr)
pud_free(x)
pmd_alloc(mm, pud,addr)
pmd_free(x)
pte_alloc_map(mm, pmd,addr)
pte_alloc_kernel(mm,pmd, addr)
pte_free(pte)
pte_free_kernel(pte)
clear_page_range(mmu,start,end)
2.5.3. Physical Memory Layout
As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000, i.e., from the second megabyte.
Figure 2-13. The first 768 page frames (3 MB) in Linux 2.6
1:Page frame 0 is used by BIOS to store the system hardware configuration detected during the
Power-On Self-Test(POST); the BIOS of many laptops, moreover, writes data on this page
frame even after the system is initialized.
2:Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS
routines and to map the internal memory of ISA graphics cards. This area is the well-known
hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are
reserved, and the corresponding page frames cannot be used by the operating system.
3:Additional page frames within the first megabyte may be reserved by specific computer
models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.
Table 2-10. Variables describing the kernel's physical memory layout
Variable name Description
num_physpages Page frame number of the highest usable page frame
totalram_pages Total number of usable page frames
min_low_pfn Page frame number of the first usable page frame after the kernel image in RAM
max_pfn Page frame number of the last usable page frame
max_low_pfn Page frame number of the last page frame directly mapped by the kernel (low memory)
totalhigh_pages Total number of page frames not directly mapped by the kernel (high memory)
highstart_pfn Page frame number of the first page frame not directly mapped by the kernel
highend_pfn Page frame number of the last page frame not directly mapped by the kernel
2.5.4. Process Page Tables
The linear address space of a process is divided into two parts:
Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.
Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.
The content of the first entries of the Page Global Directory that map linear addresses lower than
0xc0000000 (the first 768 entries with PAE disabled, or the first 3 entries with PAE enabled) depends
on the specific process. Conversely, the remaining entries should be the same for all processes and
equal to the corresponding entries of the master kernel Page Global Directory (see the following
section).?????
2.5.5. Kernel Page Tables???
In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial Page Tables,
and 128 KB for some dynamic data structures. This minimal address space is just large enough to install the kernel in RAM and to initialize its core
data structures.
In the second phase, the kernel takes advantage of all of the existing RAM and sets up the page tables properly.
Let us examine how this plan is executed.
2.5.5.1. Provisional kernel Page Tables
2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB
2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB
2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB
2.5.6. Fix-Mapped Linear Addresses
fix_to_virt( )
2.5.7. Handling the Hardware Cache and the TLB
2.5.7.1. Handling the hardware cache
The L1_CACHE_BYTES macro yields the size of a cache line in bytes.
2.5.7.2. Handling the TLB
Chapter 3. Processes
3.1. Processes, Lightweight Processes, and Threads
3.2. Process Descriptor
struct task_struct
3.2.1. Process State
e.g., p->state = TASK_RUNNING;
3.2.2. Identifying a Process
process descriptor pointers
tgid (thread group ID) / pid
3.2.2.1. Process descriptors handling
Figure 3-2. Storing the thread_info structure and the process (task_struct) kernel stack in two page frames
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[2048]; /* 1024 for 4KB stacks */
};
1) esp is the CPU stack pointer.
3.2.2.2. Identifying the current process
current_thread_info( ) denotes the thread_info structure pointer of the process running on the CPU that executes the instruction.
current_thread_info( )->task or current denotes the process descriptor pointer of the process running on the CPU.
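As the book explains, on the 80x86 with 8 KB kernel stacks current_thread_info( ) is obtained by masking out the 13 least significant bits of esp. A hedged sketch of that mechanism, close in spirit to the 2.6 i386 code:

static inline struct thread_info *current_thread_info(void)
{
    struct thread_info *ti;
    /* round esp down to the 8 KB boundary where thread_info lives */
    __asm__("andl %%esp,%0" : "=r" (ti) : "0" (~(8192UL - 1)));
    return ti;
}
#define current (current_thread_info()->task)  /* simplified; the real kernel wraps this in get_current( ) */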
#define task_stack_page(task) ((task)->stack)  /* derives the stack base, i.e. the thread_info address, from a task_struct (cf. current) */
#define task_thread_info(task) ((struct thread_info *)(task)->stack)  /* derives the thread_info pointer from a task_struct (cf. current_thread_info( )) */
3.2.2.3. Doubly linked lists
Note that the pointers in a list_head field store the addresses of other list_head fields rather than the addresses of the whole data structures
in which the list_head structure is included.
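A minimal user-space re-implementation of the embedded-list idiom makes this concrete; list_entry below mirrors the kernel macro of the same name, while struct item is a made-up container:

#include <stddef.h>
#include <stdio.h>
struct list_head { struct list_head *next, *prev; };
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
struct item {
    int value;
    struct list_head list;          /* embedded anchor, as list_head is embedded in task_struct */
};
int main(void)
{
    struct item a = { .value = 42 };
    struct list_head head = { &a.list, &a.list };   /* circular list: head <-> a */
    a.list.next = a.list.prev = &head;
    /* walking the list yields list_head pointers; list_entry recovers the container */
    struct item *p = list_entry(head.next, struct item, list);
    printf("%d\n", p->value);                       /* prints 42 */
    return 0;
}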
3.2.2.4. The process list
Another useful macro, called for_each_process, scans the whole process list
3.2.2.5. The lists of TASK_RUNNING processes
enqueue_task(p,array) /dequeue_task(p,array)
3.2.3. Relationships Among Processes
3.2.3.1. The pidhash table and chained lists--???
each hash table is stored in four page frames
3.2.4. How Processes Are Organized
3.2.4.1. Wait queues
wait_queue_head_t:
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
Elements of a wait queue list are of type wait_queue_t:
struct __wait_queue {
    unsigned int flags;
    struct task_struct * task;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
3.2.4.2. Handling wait queues
The prepare_to_wait( ), prepare_to_wait_exclusive( ), and finish_wait( ) functions,
introduced in Linux 2.6, offer yet another way to put the current process to sleep in a wait
queue. Typically, they are used as follows:
DEFINE_WAIT(wait);
prepare_to_wait_exclusive(&wq, &wait, TASK_INTERRUPTIBLE);
/* wq is the head of the wait queue */
...
if (!condition)
schedule();
finish_wait(&wq, &wait);
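In practice the sleeper usually loops until the condition holds, to survive spurious wake-ups; a hedged sketch of that pattern (wq and condition are placeholders, and the waker side is shown only for context):

/* sleeper side */
DEFINE_WAIT(wait);
for (;;) {
    prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
    if (condition)
        break;
    schedule();
}
finish_wait(&wq, &wait);

/* waker side, e.g. in an interrupt handler or another process */
condition = 1;
wake_up(&wq);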
3.2.5. Process Resource Limits
The resource limits for the current process are stored in the current->signal->rlim field, that is, in
a field of the process's signal descriptor
3.3. Process Switch
3.3.1. Hardware Context
a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved
in the Kernel Mode stack.
3.3.2. Task State Segment
The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is
stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of
the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base
and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having
to retrieve the TSS address from the GDT.
3.3.2.1. The thread field
Thus, each process descriptor includes a field called thread of type thread_struct, in which the
kernel saves the hardware context whenever the process is being switched out.
3.3.3. Performing the Process Switch
Essentially, every process switch consists of two steps:
1.Switching the Page Global Directory to install a new address space; we'll describe this step in Chapter 9.
2.Switching the Kernel Mode stack and the hardware context, which provides all the information
needed by the kernel to execute the new process, including the CPU registers.
3.3.3.1. The switch_to macro---???
3.3.3.2. The _ _switch_to ( ) function--???
3.3.4. Saving and Loading the FPU, MMX, and XMM Registers
3.3.4.1. Saving the FPU registers
3.3.4.2. Loading the FPU registers
3.3.4.3. Using the FPU, MMX, and SSE/SSE2 units in Kernel Mode
3.4. Creating Processes
3.4.1. The clone( ), fork( ), and vfork( ) System Calls
Lightweight processes are created in Linux by using a function named clone( ), which uses the following parameters:
fn, arg, flags, child_stack, tls, ptid, ctid
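From user space, lightweight processes are usually created through the glibc clone( ) wrapper (whose argument order differs from the raw system call). A hedged sketch; the flag combination, stack size, and child_fn are illustrative only:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("child: got \"%s\"\n", (char *)arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;
    /* The stack grows downward on the 80x86, so pass the top of the area.
     * CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND makes the child share
     * the parent's address space, i.e. a lightweight process. */
    int pid = clone(child_fn, stack + STACK_SIZE,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                    "hello");
    if (pid == -1)
        return 1;
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}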
3.4.1.1. The do_fork( ) function
do_fork(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr)
clone_flags
Same as the flags parameter of clone( )
stack_start:
Specifies the User Mode stack pointer to be assigned to the esp register of the child process.
The invoking process (the parent) should always allocate a new stack for the child.
regs:
Pointer to the values of the general purpose registers saved into the Kernel Mode stack when
switching from User Mode to Kernel Mode (see the section "The do_IRQ( ) function" in Chapter
4)
stack_size
Unused (always set to 0)
parent_tidptr:
Specifies the address of a User Mode variable of the parent process that will hold the PID of
the new lightweight process. Meaningful only if the CLONE_PARENT_SETTID flag is set.
child_tidptr:
Specifies the address of a User Mode variable of the new lightweight process that will hold the
PID of such process. Meaningful only if the CLONE_CHILD_SETTID flag is set.
3.4.1.2. The copy_process( ) function
3.4.2. Kernel Threads
3.4.2.1. Creating a kernel thread
Kernel threads run only in Kernel Mode, while regular processes run alternatively in Kernel Mode and in User Mode.
Because kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET.
Regular processes, on the other hand, use all four gigabytes of linear addresses, in either User Mode or Kernel Mode.
3.4.2.2. Process 0 (the swapper, or idle, process)
3.4.2.3. Process 1 (init)
3.4.2.4. Other kernel threads
keventd (also called events)
Executes the functions in the keventd_wq workqueue (see Chapter 4).
kapmd
Handles the events related to the Advanced Power Management (APM).
kswapd
Reclaims memory, as described in the section "Periodic Reclaiming" in Chapter 17.
pdflush
Flushes "dirty" buffers to disk to reclaim memory, as described in the section "The pdflush Kernel Threads" in Chapter 15.
kblockd
Executes the functions in the kblockd_workqueue workqueue. Essentially, it periodically activates the block device driver.
ksoftirqd
Runs the tasklets (see section "Softirqs and Tasklets" in Chapter 4); there is one of these kernel threads for each CPU in the system.
3.5. Destroying Processes
The exit( ) library function may be inserted by the programmer explicitly. Additionally,
the C compiler always inserts an exit( ) function call right after the last statement of the main( ) function.
3.5.1. Process Termination
3.5.1.1. The do_group_exit( ) function
3.5.1.2. The do_exit( ) function
3.5.2. Process Removal
The release_task( ) function detaches the last data structures from the descriptor of a zombie
process; it is applied on a zombie process in two possible ways: by the do_exit( ) function if the
parent is not interested in receiving signals from the child, or by the wait4( ) or waitpid( ) system
calls after a signal has been sent to the parent. In the latter case, the function also will reclaim the
memory used by the process descriptor, while in the former case the memory reclaiming will be
done by the scheduler (see Chapter 7).
Chapter 4. Interrupts and Exceptions
Synchronous interrupts are produced by the CPU control unit while executing instructions and
are called synchronous because the control unit issues them only after terminating the
execution of an instruction
Asynchronous interrupts are generated by other hardware devices at arbitrary times with
respect to the CPU clock signals.
4.1. The Role of Interrupt Signals
4.2. Interrupts and Exceptions
Interrupts:
Maskable interrupts
Nonmaskable interrupts
Exceptions:
Processor-detected exceptions
Faults
Traps
Aborts
Programmed exceptions
4.2.1. IRQs and Interrupts
4.2.1.1. The Advanced Programmable Interrupt Controller (APIC)
4.2.2. Exceptions
4.2.3. Interrupt Descriptor Table
Interrupt Descriptor Table (IDT )
256 x 8 = 2048 bytes
4.2.4. Hardware Handling of Interrupts and Exceptions
4.3. Nested Execution of Exception and Interrupt Handlers
4.4. Initializing the Interrupt Descriptor Table
4.4.1. Interrupt, Trap, and System Gates
4.4.2. Preliminary Initialization of the IDT
4.5. Exception Handling
Exception handlers have a standard structure consisting of three steps:
1. Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
2. Handle the exception by means of a high-level C function.
3. Exit from the handler by means of the ret_from_exception( ) function.
4.5.1. Saving the Registers for the Exception Handler
4.5.2. Entering and Leaving the Exception Handler
The exception handler always checks whether the exception occurred in User Mode or in Kernel Mode
and, in the latter case, whether it was due to an invalid argument passed to a system call. We'll
describe in the section "Dynamic Address Checking: The Fix-up Code" in Chapter 10 how the kernel
defends itself against invalid arguments passed to system calls. Any other exception raised in Kernel
Mode is due to a kernel bug. In this case, the exception handler knows the kernel is misbehaving. In
order to avoid data corruption on the hard disks, the handler invokes the die( ) function, which
prints the contents of all CPU registers on the console (this dump is called kernel oops ) and
terminates the current process by calling do_exit( ) (see "Process Termination" in Chapter 3).
4.6. Interrupt Handling
I/O interrupts
Timer interrupts
Interprocessor interrupts
4.6.1. I/O Interrupt Handling
1. Save the IRQ value and the register's contents on the Kernel Mode stack.
2.Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue
further interrupts.
3. Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
4. Terminate by jumping to the ret_from_intr( ) address.
Vector range Use
0~19 (0x0-0x13) Nonmaskable interrupts and exceptions
20~31 (0x14-0x1f) Intel-reserved
32~127 (0x20-0x7f) External interrupts (IRQs)
128 (0x80) Programmed exception for system calls (see Chapter 10)
129~238 (0x81-0xee) External interrupts (IRQs)
239 (0xef) Local APIC timer interrupt (see Chapter 6)
240 (0xf0) Local APIC thermal interrupt (introduced in the Pentium 4 models)
241~250 (0xf1-0xfa) Reserved by Linux for future use
251~253 (0xfb-0xfd) Interprocessor interrupts (see the section "Interprocessor Interrupt Handling" later in this chapter)
254 (0xfe) Local APIC error interrupt (generated when the local APIC detects an erroneous condition)
255 (0xff) Local APIC spurious interrupt (generated if the CPU masks an interrupt while the hardware device raises it)
4.6.1.2. IRQ data structures
Figure 4-5. IRQ descriptors
1: irq_desc
Field Description
handler Points to the PIC object (hw_irq_controller descriptor) that services the IRQ line.
handler_data Pointer to data used by the PIC methods.
action Identifies the interrupt service routines to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ. The irqaction descriptor is described later in the chapter.
status A set of flags describing the IRQ line status (see Table 4-5).
depth Shows 0 if the IRQ line is enabled and a positive value if it has been disabled at least once.
irq_count Counter of interrupt occurrences on the IRQ line (for diagnostic use only).
irqs_unhandled Counter of unhandled interrupt occurrences on the IRQ line (for diagnostic use only).
lock A spin lock used to serialize the accesses to the IRQ descriptor and to the PIC (see Chapter 5).
Table 4-5. Flags describing the IRQ line status
Flag name Description
IRQ_INPROGRESS A handler for the IRQ is being executed.
IRQ_DISABLED The IRQ line has been deliberately disabled by a device driver.
IRQ_PENDING An IRQ has occurred on the line; its occurrence has been acknowledged to the PIC, but it has not yet been serviced by the kernel.
IRQ_REPLAY The IRQ line has been disabled but the previous IRQ occurrence has not yet been acknowledged to the PIC.
IRQ_AUTODETECT The kernel is using the IRQ line while performing a hardware device probe.
IRQ_WAITING The kernel is using the IRQ line while performing a hardware device probe; moreover, the corresponding interrupt has not been raised.
IRQ_LEVEL Not used on the 80 x 86 architecture.
IRQ_MASKED Not used.
IRQ_PER_CPU Not used on the 80 x 86 architecture.
2: irqaction descriptors
Field name Description
handler Points to the interrupt service routine for an I/O device. This is the key field that allows many devices to share the same IRQ.
flags This field includes a few fields that describe the relationships between the IRQ line and the I/O device (see Table 4-7).
mask Not used.
name The name of the I/O device (shown when listing the serviced IRQs by reading the /proc/interrupts file).
dev_id A private field for the I/O device. Typically, it identifies the I/O device itself (for instance, it could be equal to its major and minor numbers; see the section "Device Files" in Chapter 13), or it points to the device driver's data.
next Points to the next element of a list of irqaction descriptors. The elements in the list refer to hardware devices that share the same IRQ.
irq IRQ line.
dir Points to the descriptor of the /proc/irq/n directory associated with the IRQ n.
4.6.1.3. IRQ distribution in multiprocessor systems
4.6.1.4. Multiple Kernel Mode stacks
4.6.1.5. Saving the registers for the interrupt handler
4.6.1.6. The do_IRQ( ) function
4.6.1.7. The _ _do_IRQ( ) function
4.6.1.8. Reviving a lost interrupt
4.6.1.9. Interrupt service routines
4.6.1.10. Dynamic allocation of IRQ lines
4.6.2. Interprocessor Interrupt Handling (IPI)
CALL_FUNCTION_VECTOR (vector 0xfb)
RESCHEDULE_VECTOR (vector 0xfc)
INVALIDATE_TLB_VECTOR (vector 0xfd)
4.7. Softirqs and Tasklets
4.7.1. Softirqs
Table 4-9. Softirqs used in Linux 2.6
Softirq Index (priority) Description
HI_SOFTIRQ 0 Handles high priority tasklets
TIMER_SOFTIRQ 1 Tasklets related to timer interrupts
NET_TX_SOFTIRQ 2 Transmits packets to network cards
NET_RX_SOFTIRQ 3 Receives packets from network cards
SCSI_SOFTIRQ 4 Post-interrupt processing of SCSI commands
TASKLET_SOFTIRQ 5 Handles regular tasklets
4.7.1.1. Data structures used for softirqs
1.softirq_action
2. Another critical field used to keep track both of kernel preemption and of nesting of kernel control paths is the 32-bit preempt_count field stored in the thread_info descriptor of each process.
4.7.1.2. Handling softirqs
4.7.1.3. The do_softirq( ) function
4.7.1.4. The _ _do_softirq( ) function
4.7.1.5. The ksoftirqd kernel threads
4.7.2. Tasklets
tasklet descriptors
tasklet_struct
Field name Description
next Pointer to next descriptor in the list
state Status of the tasklet
count Lock counter
func Pointer to the tasklet function
data An unsigned long integer that may be used by the tasklet function
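A hedged sketch of declaring and scheduling a tasklet with the 2.6 API (my_tasklet_handler and the trigger site are hypothetical):

#include <linux/interrupt.h>

static void my_tasklet_handler(unsigned long data)
{
    /* deferred work, executed later in softirq context */
}

DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0);

static void trigger(void)              /* typically called from an interrupt handler */
{
    tasklet_schedule(&my_tasklet);     /* marks the tasklet as scheduled and raises TASKLET_SOFTIRQ */
}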
4.8. Work Queues
Chapter 5. Kernel Synchronization
5.1. How the Kernel Services Requests
5.1.1. Kernel Preemption
kernel preemption is disabled when the preempt_count field
in the thread_info descriptor referenced by the current_thread_info( ) macro is greater than zero
Table 5-1. Macros dealing with the preemption counter subfield
Macro Description
preempt_count( ) Selects the preempt_count field in the thread_info descriptor
preempt_disable( ) Increases by one the value of the preemption counter
preempt_enable_no_resched( ) Decreases by one the value of the preemption counter
preempt_enable( ) Decreases by one the value of the preemption counter, and invokes preempt_schedule( ) if the TIF_NEED_RESCHED flag in the thread_info descriptor is set
get_cpu( ) Similar to preempt_disable( ), but also returns the number of the local CPU
put_cpu( ) Same as preempt_enable( )
put_cpu_no_resched( ) Same as preempt_enable_no_resched( )
5.1.2. When Synchronization Is Necessary
5.1.3. When Synchronization Is Not Necessary
5.2. Synchronization Primitives
5.2.1. Per-CPU Variables
a kernel control path should access a per-CPU variable with kernel preemption disabled.
5.2.2. Atomic Operations
Every such operation must be executed in a single instruction, without being interrupted in the middle, and avoiding accesses to the same memory location by other CPUs.
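A hedged sketch of the atomic API applied to a reference counter (release_object( ) is a hypothetical helper):

#include <asm/atomic.h>

static atomic_t refcount = ATOMIC_INIT(1);

static void release_object(void)
{
    /* free the object; hypothetical */
}

static void get_object(void)
{
    atomic_inc(&refcount);              /* single, uninterruptible read-modify-write */
}

static void put_object(void)
{
    if (atomic_dec_and_test(&refcount)) /* atomically decrement and test for zero */
        release_object();
}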
5.2.3. Optimization and Memory Barriers
1:Optimization Barriers
An optimization barrier primitive ensures that the assembly language instructions corresponding to
C statements placed before the primitive are not mixed by the compiler with assembly language
instructions corresponding to C statements placed after the primitive. In Linux the barrier( )
macro, which expands into asm volatile("":::"memory"), acts as an optimization barrier.
2:Memory Barriers
A memory barrier primitive ensures that the operations placed before the primitive are finished
before starting the operations placed after the primitive.
Table 5-6. Memory barriers in Linux
Macro Description
mb( ) Memory barrier for MP and UP
rmb( ) Read memory barrier for MP and UP
wmb( ) Write memory barrier for MP and UP
smp_mb( ) Memory barrier for MP only
smp_rmb( ) Read memory barrier for MP only
smp_wmb( ) Write memory barrier for MP only
5.2.4. Spin Locks
Spin locks are a special kind of lock designed to work in a multiprocessor environment. If the kernel
control path finds the spin lock "open," it acquires the lock and continues its execution. Conversely,
if the kernel control path finds the lock "closed" by a kernel control path running on another CPU, it
"spins" around, repeatedly executing a tight instruction loop, until the lock is released.
The instruction loop of spin locks represents a "busy wait." The waiting kernel control path keeps
running on the CPU, even if it has nothing to do besides waste time. Nevertheless, spin locks are
usually convenient, because many kernel resources are locked for a fraction of a millisecond only;
therefore, it would be far more time-consuming to release the CPU and reacquire it later
In Linux, each spin lock is represented by a spinlock_t structure consisting of two fields:
slock
Encodes the spin lock state: the value 1 corresponds to the unlocked state, while every
negative value and 0 denote the locked state
break_lock
Flag signaling that a process is busy waiting for the lock (present only if the kernel supports
both SMP and kernel preemption)
Table 5-7. Spin lock macros
Macro Description
spin_lock_init( ) Set the spin lock to 1 (unlocked)
spin_lock( ) Cycle until spin lock becomes 1 (unlocked), then set it to 0 (locked)
spin_unlock( ) Set the spin lock to 1 (unlocked)
spin_unlock_wait() Wait until the spin lock becomes 1 (unlocked)
spin_is_locked( ) Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise
spin_trylock( ) Set the spin lock to 0 (locked), and return 1 if the previous value of the lock was 1; 0 otherwise
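A hedged sketch of basic spin lock usage (my_lock and shared_counter are placeholders):

#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;   /* or spin_lock_init(&my_lock) at run time */
static int shared_counter;

static void update_counter(void)
{
    spin_lock(&my_lock);        /* busy-waits if another CPU holds the lock */
    shared_counter++;
    spin_unlock(&my_lock);
}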
5.2.4.1. The spin_lock( ) macro with kernel preemption
5.2.4.2. The spin_lock( ) macro without kernel preemption
5.2.4.3. The spin_unlock( ) macro
5.2.5. Read/Write Spin Locks
Each read/write spin lock is a rwlock_t structure; its lock field is a 32-bit field that encodes two
distinct pieces of information:
1: A 24-bit counter denoting the number of kernel control paths currently reading the protected data structure. The two's complement value of this counter is stored in bits 0-23 of the field.
2:An unlock flag that is set when no kernel control path is reading or writing, and clear
otherwise. This unlock flag is stored in bit 24 of the field.
Notice that the lock field stores the number 0x01000000 if the spin lock is idle (unlock flag set and no
readers), the number 0x00000000 if it has been acquired for writing (unlock flag clear and no
readers), and any number in the sequence 0x00ffffff, 0x00fffffe, and so on, if it has been
acquired for reading by one, two, or more processes (unlock flag clear and the two's complement on
24 bits of the number of readers). As in the spinlock_t structure, the rwlock_t structure also includes a break_lock field.
5.2.5.1. Getting and releasing a lock for reading
5.2.5.2. Getting and releasing a lock for writing
5.2.6. Seqlocks
Each seqlock is a seqlock_t structure consisting of two fields: a lock field of type spinlock_t and an
integer sequence field. This second field plays the role of a sequence counter. Each reader must read
this sequence counter twice, before and after reading the data, and check whether the two values
coincide. In the opposite case, a new writer has become active and has increased the sequence
counter, thus implicitly telling the reader that the data just read is not valid.
seqlock_init
write_seqlock( )
write_sequnlock( )
read_seqbegin()
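A hedged sketch of the whole pattern; read_seqretry( ) is the companion of read_seqbegin( ) even though it is not listed above, and the shared variables are placeholders:

#include <linux/seqlock.h>

static seqlock_t my_seqlock = SEQLOCK_UNLOCKED;
static unsigned long shared_a, shared_b;

static void writer(unsigned long a, unsigned long b)
{
    write_seqlock(&my_seqlock);     /* takes the spin lock and bumps the sequence counter */
    shared_a = a;
    shared_b = b;
    write_sequnlock(&my_seqlock);   /* bumps the counter again and releases the lock */
}

static unsigned long reader(void)
{
    unsigned long a, b;
    unsigned int seq;
    do {
        seq = read_seqbegin(&my_seqlock);
        a = shared_a;
        b = shared_b;
    } while (read_seqretry(&my_seqlock, seq));   /* retry if a writer slipped in */
    return a + b;
}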
5.2.7. Read-Copy Update (RCU)
How does RCU obtain the surprising result of synchronizing several CPUs without shared data
structures? The key idea consists of limiting the scope of RCU as follows:
1.Only data structures that are dynamically allocated and referenced by means of pointers can be
protected by RCU.
2. No kernel control path can sleep inside a critical region protected by RCU.
rcu_read_lock( )
rcu_read_unlock( )
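A hedged sketch of the read side, assuming rcu_dereference( ), which belongs to the same API; struct foo and gbl_ptr are placeholders for a dynamically allocated, pointer-referenced structure:

#include <linux/rcupdate.h>

struct foo { int value; };
static struct foo *gbl_ptr;

static int read_value(void)
{
    struct foo *p;
    int v;
    rcu_read_lock();                 /* no sleeping allowed until the matching unlock */
    p = rcu_dereference(gbl_ptr);    /* fetch the RCU-protected pointer */
    v = p ? p->value : -1;
    rcu_read_unlock();
    return v;
}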
5.2.8. Semaphores
they implement a locking primitive that allows waiters to sleep until the desired resource becomes free.
Actually, Linux offers two kinds of semaphores
1.Kernel semaphores, which are used by kernel control paths
A kernel semaphore is similar to a spin lock, in that it doesn't allow a kernel control path to proceed
unless the lock is open. However, whenever a kernel control path tries to acquire a busy resource
protected by a kernel semaphore, the corresponding process is suspended. It becomes runnable
again when the resource is released. Therefore, kernel semaphores can be acquired only by
functions that are allowed to sleep; interrupt handlers and deferrable functions cannot use them.
2.System V IPC semaphores, which are used by User Mode processes
struct semaphore
count:
Stores an atomic_t value. If it is greater than 0, the resource is free, that is, it is currently
available. If count is equal to 0, the semaphore is busy but no other process is waiting for the
protected resource. Finally, if count is negative, the resource is unavailable and at least one
process is waiting for it.
wait
Stores the address of a wait queue list that includes all sleeping processes that are currently
waiting for the resource. Of course, if count is greater than or equal to 0, the wait queue is
empty.
sleepers
Stores a flag that indicates whether some processes are sleeping on the semaphore. We'll see
this field in operation soon.
Getting and releasing semaphores
up( )
down( )
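A hedged sketch of using a kernel semaphore as a mutex (only code that is allowed to sleep may call down( )):

#include <asm/semaphore.h>

static DECLARE_MUTEX(my_sem);       /* semaphore initialized with count = 1 */

static void critical_section(void)
{
    down(&my_sem);                  /* suspends the process if the resource is busy */
    /* ... access the protected resource ... */
    up(&my_sem);                    /* wakes up one sleeping waiter, if any */
}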
5.2.9. Read/Write Semaphores
Each read/write semaphore is described by a rw_semaphore structure that includes the following
fields:
count
Stores two 16-bit counters. The counter in the most significant word encodes in two's
complement form the sum of the number of nonwaiting writers (either 0 or 1) and the number
of waiting kernel control paths. The counter in the less significant word encodes the total
number of nonwaiting readers and writers.
wait_list
Points to a list of waiting processes. Each element in this list is a rwsem_waiter structure,
including a pointer to the descriptor of the sleeping process and a flag indicating whether the
process wants the semaphore for reading or for writing.
wait_lock
A spin lock used to protect the wait queue list and the rw_semaphore structure itself.
5.2.10. Completions
The real difference between completions and semaphores is how the spin lock included in the wait
queue is used. In completions, the spin lock is used to ensure that complete( ) and
wait_for_completion( ) cannot execute concurrently. In semaphores, the spin lock is used to avoid
letting concurrent invocations of down( ) mess up the semaphore data structure.
5.2.11. Local Interrupt Disabling
local_irq_disable( )
local_irq_enable( )
local_irq_save()
local_irq_restore()
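A hedged sketch of the save/restore pair, which preserves whatever the previous IF state was:

unsigned long flags;

local_irq_save(flags);      /* save eflags and execute cli */
/* ... short region that must not be interrupted on this CPU ... */
local_irq_restore(flags);   /* restore the saved interrupt state */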
5.2.12. Disabling and Enabling Deferrable Functions
local_bh_disable
local_bh_enable
5.3. Synchronizing Accesses to Kernel Data Structures
5.3.1. Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
5.4. Examples of Race Condition Prevention
5.4.1. Reference Counters
A reference counter is just an atomic_t counter associated with a specific resource such as a memory page, a module, or a file
5.4.2. The Big Kernel Lock
lock_kernel( )/unlock_kernel( )
5.4.3. Memory Descriptor Read/Write Semaphore
Each memory descriptor of type mm_struct includes its own semaphore in the mmap_sem field (see the section "The Memory Descriptor" in Chapter 9)
5.4.4. Slab Cache List Semaphore
The list of slab cache descriptors (see the section "Cache Descriptor" in Chapter 8) is protected by the cache_chain_sem semaphore,
which grants an exclusive right to access and modify the list.
5.4.5. Inode Semaphore
Chapter 6. Timing Measurements
6.1. Clock and Timer Circuits
6.1.1. Real Time Clock (RTC)
6.1.2. Time Stamp Counter (TSC)
6.1.3. Programmable Interval Timer (PIT)
6.1.4. CPU Local Timer
6.1.5. High Precision Event Timer (HPET)
6.1.6. ACPI Power Management Timer
6.2. The Linux Timekeeping Architecture
6.2.1. Data Structures of the Timekeeping Architecture
6.2.1.1. The timer object
a descriptor of type timer_opts consisting of the timer name and of four standard methods shown in Table 6-1.
Table 6-1. The fields of the timer_opts data structure
Field name Description
name A string identifying the timer source
mark_offset Records the exact time of the last tick; it is invoked by the timer interrupt handler
get_offset Returns the time elapsed since the last tick
monotonic_clock Returns the number of nanoseconds since the kernel initialization
delay Waits for a given number of "loops" (see the later section "Delay Functions")
6.2.1.2. The jiffies variable
The jiffies variable is a counter that stores the number of elapsed ticks since the system was started.
6.2.1.3. The xtime variable
The xtime variable stores the current time and date; it is a structure of type timespec having two
fields:
tv_sec
Stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC)
tv_nsec
Stores the number of nanoseconds that have elapsed within the last second (its value ranges
between 0 and 999,999,999)
6.2.2. Timekeeping Architecture in Uniprocessor Systems
6.2.2.1. Initialization phase
During kernel initialization, the time_init( ) function is invoked to set up the timekeeping architecture.
6.2.2.2. The timer interrupt handler
The timer_interrupt( ) function is the interrupt service routine (ISR) of the PIT or of the HPET;
6.2.3. Timekeeping Architecture in Multiprocessor Systems
6.2.3.1. Initialization phase
6.2.3.2. The global timer interrupt handler
6.2.3.3. The local timer interrupt handler
6.3. Updating the Time and Date
update_times( )-->update_wall_time( )
6.4. Updating System Statistics
6.4.1. Updating Local CPU Statistics
update_process_times( ) function
6.4.2. Keeping Track of System Load
calc_load( ) function
6.4.3. Profiling the Kernel Code
profile_tick( ) function
1|shell@android:/ # oprofiled --usage
Usage: oprofiled [-v?] [--session-dir=/var/lib/oprofile]
[-r|--kernel-range start-end] [-k|--vmlinux file] [--no-vmlinux]
[--xen-range=start-end] [--xen-image=file]
[--image=profile these comma separated image] [--separate-lib=[0|1]]
[--separate-kernel=[0|1]] [--separate-thread=[0|1]]
[--separate-cpu=[0|1]] [-e|--events [events]] [-v|--version]
[-V|--verbose all,sfile,arcs,samples,module,misc]
[-x|--ext-feature <extended-feature-name>:[args]] [-?|--help] [--usage]
6.4.4. Checking the NMI Watchdogs
do_nmi( )
6.5. Software Timers and Delay Functions
6.5.1. Dynamic Timers
struct timer_list {
    struct list_head entry;
    unsigned long expires;
    spinlock_t lock;
    unsigned long magic;
    void (*function)(unsigned long);
    unsigned long data;
    tvec_base_t *base;
};
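A hedged sketch of arming a dynamic timer with this structure (my_timer_fn is hypothetical and runs in softirq context when the timer expires):

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_timer_fn(unsigned long data)
{
    /* executed once jiffies >= expires */
}

static void arm_timer(void)
{
    init_timer(&my_timer);
    my_timer.function = my_timer_fn;
    my_timer.data     = 0;
    my_timer.expires  = jiffies + HZ;   /* about one second from now */
    add_timer(&my_timer);               /* mod_timer( )/del_timer( ) re-arm or cancel it */
}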
6.5.1.1. Dynamic timers and race conditions
6.5.1.2. Data structures for dynamic timers
typedef struct tvec_t_base_s {
    spinlock_t lock;
    unsigned long timer_jiffies;
    struct timer_list *running_timer;
    tvec_root_t tv1;
    tvec_t tv2;
    tvec_t tv3;
    tvec_t tv4;
    tvec_t tv5;
} tvec_base_t;
6.5.1.3. Dynamic timer handling
run_timer_softirq( )
6.5.2. An Application of Dynamic Timers: the nanosleep( ) System Call
6.5.3. Delay Functions
6.6. System Calls Related to Timing Measurements
6.6.1. The time( ) and gettimeofday( ) System Calls
6.6.2. The adjtimex( ) System Call
6.6.3. The setitimer( ) and alarm( ) System Calls
ITIMER_REAL
The actual elapsed time; the process receives SIGALRM signals.
ITIMER_VIRTUAL
The time spent by the process in User Mode; the process receives SIGVTALRM signals.
ITIMER_PROF
The time spent by the process both in User and in Kernel Mode; the process receives SIGPROF
signals.
The ITIMER_REAL interval timer is implemented by using dynamic timers because the kernel must
send signals to the process even when it is not running on the CPU. Therefore, each process
descriptor includes a dynamic timer object called real_timer. The setitimer( ) system call
initializes the real_timer fields and then invokes add_timer( ) to insert the dynamic timer in the
proper list. When the timer expires, the kernel executes the it_real_fn( ) timer function. In turn,
the it_real_fn( ) function sends a SIGALRM signal to the process; then, if it_real_incr is not null, it
sets the expires field again, reactivating the timer.
The ITIMER_VIRTUAL and ITIMER_PROF interval timers do not require dynamic timers, because they
can be updated while the process is running. The account_it_virt( ) and account_it_prof( )
functions are invoked by update_process_times( ), which is called either by the PIT's timer
interrupt handler (UP) or by the local timer interrupt handlers (SMP). Therefore, the two interval
timers are updated once every tick, and if they are expired, the proper signal is sent to the current
process.
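A user-space sketch of ITIMER_REAL; the one-second periods are arbitrary, and the handler only uses an async-signal-safe call:

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;
    write(1, "tick\n", 5);
}

int main(void)
{
    struct itimerval tv = {
        .it_interval = { .tv_sec = 1, .tv_usec = 0 },  /* it_real_incr: re-arm every second */
        .it_value    = { .tv_sec = 1, .tv_usec = 0 },  /* first expiration after one second */
    };
    signal(SIGALRM, on_alarm);
    setitimer(ITIMER_REAL, &tv, NULL);
    for (;;)
        pause();                     /* each expiration delivers SIGALRM */
}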
6.6.4. System Calls for POSIX Timers
Chapter 7. Process Scheduling
7.1. Scheduling Policy
When speaking about scheduling, processes are traditionally classified as I/O-bound or CPU-bound.
The former make heavy use of I/O devices and spend much time waiting for I/O operations to
complete; the latter carry on number-crunching applications that require a lot of CPU time.
Interactive processes
Batch processes
Real-time processes
Table 7-1. System calls related to scheduling
System call Description
nice( ) Change the static priority of a conventional process
getpriority( ) Get the maximum static priority of a group of conventional processes
setpriority( ) Set the static priority of a group of conventional processes
sched_getscheduler( ) Get the scheduling policy of a process
sched_setscheduler( ) Set the scheduling policy and the real-time priority of a process
sched_getparam( ) Get the real-time priority of a process
sched_setparam( ) Set the real-time priority of a process
sched_yield( ) Relinquish the processor voluntarily without blocking
sched_get_priority_min( ) Get the minimum real-time priority value for a policy
sched_get_priority_max( ) Get the maximum real-time priority value for a policy
sched_rr_get_interval( ) Get the time quantum value for the Round Robin policy
sched_setaffinity( ) Set the CPU affinity mask of a process
sched_getaffinity( ) Get the CPU affinity mask of a process
7.1.1. Process Preemption
7.1.2. How Long Must a Quantum Last?
The choice of the average quantum duration is always a compromise. The rule of thumb adopted by
Linux is choose a duration as long as possible, while keeping good system response time.
7.2. The Scheduling Algorithm
SCHED_FIFO
SCHED_RR
SCHED_NORMAL
7.2.1. Scheduling of Conventional Processes
The kernel represents the static priority of a conventional process with a number ranging from 100 (highest priority) to
139 (lowest priority); notice that static priority decreases as the values increase.
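Static priority is 120 plus the nice value (-20 to +19), and the book derives the base time quantum from it; a small sketch of that formula as given in the text:

#include <stdio.h>

/* base time quantum in milliseconds:
 * (140 - static priority) * 20 if static priority < 120, else (140 - static priority) * 5 */
static unsigned int base_time_quantum(int static_prio)
{
    return (static_prio < 120) ? (140 - static_prio) * 20
                               : (140 - static_prio) * 5;
}

int main(void)
{
    printf("%u %u %u\n",
           base_time_quantum(100),   /* nice -20 -> 800 ms */
           base_time_quantum(120),   /* nice   0 -> 100 ms */
           base_time_quantum(139));  /* nice +19 ->   5 ms */
    return 0;
}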
7.2.1.2. Dynamic priority and average sleep time
7.2.1.3. Active and expired processes
Active processes
These runnable processes have not yet exhausted their time quantum and are thus allowed to run.
Expired processes
These runnable processes have exhausted their time quantum and are thus forbidden to run
until all active processes expire
7.2.2. Scheduling of Real-Time Processes
Every real-time process is associated with a real-time priority, which is a value ranging from 1 (highest priority) to 99 (lowest priority).
7.3. Data Structures Used by the Scheduler
7.3.1. The runqueue Data Structure
Table 7-4. The fields of the runqueue structure
Type Name Description
spinlock_t lock Spin lock protecting the lists of processes
unsigned long nr_running Number of runnable processes in the runqueue lists
unsigned long cpu_load CPU load factor based on the average number of processes in the runqueue
unsigned long nr_switches Number of process switches performed by the CPU
unsigned long nr_uninterruptible Number of processes that were previously in the runqueue lists and are now sleeping in TASK_UNINTERRUPTIBLE state (only the sum of these fields across all runqueues is meaningful)
unsigned long expired_timestamp Insertion time of the eldest process in the expired lists
unsigned long long timestamp_last_tick Timestamp value of the last timer interrupt
task_t * curr Process descriptor pointer of the currently running process (same as current for the local CPU)
task_t * idle Process descriptor pointer of the swapper process for this CPU
struct mm_struct * prev_mm Used during a process switch to store the address of the memory descriptor of the process being replaced
prio_array_t * active Pointer to the lists of active processes
prio_array_t * expired Pointer to the lists of expired processes
prio_array_t [2] arrays The two sets of active and expired processes
int best_expired_prio The best static priority (lowest value) among the expired processes
atomic_t nr_iowait Number of processes that were previously in the runqueue lists and are now waiting for a disk I/O operation to complete
struct sched_domain *sd Points to the base scheduling domain of this CPU (see the section "Scheduling Domains" later in this chapter)
int active_balance Flag set if some process shall be migrated from this runqueue to another (runqueue balancing)
int push_cpu Not used
task_t * migration_thread Process descriptor pointer of the migration kernel thread
struct list_head migration_queue List of processes to be removed from the runqueue
7.3.2. Process Descriptor
7.4. Functions Used by the Scheduler
7.4.1. The scheduler_tick( ) Function
Keeps the time_slice counter of current up-to-date
We have already explained in the section "Updating Local CPU Statistics" in Chapter 6 how
scheduler_tick( ) is invoked once every tick to perform some operations related to scheduling.
7.4.1.1. Updating the time slice of a real-time process
This is the meaning of round-robin scheduling
7.4.1.2. Updating the time slice of a conventional process
7.4.2. The try_to_wake_up( ) Function
Awakens a sleeping process
The try_to_wake_up( ) function awakens a sleeping or stopped process by setting its state to
TASK_RUNNING and inserting it into the runqueue of the local CPU.
7.4.3. The recalc_task_prio( ) Function
Updates the dynamic priority of a process
7.4.4. The schedule( ) Function
Selects a new process to be executed
7.4.4.1. Direct invocation
7.4.4.2. Lazy invocation
7.4.4.3. Actions performed by schedule( ) before a process switch
7.4.4.4. Actions performed by schedule( ) to make the process switch
7.4.4.5. Actions performed by schedule( ) after a process switch
7.5. Runqueue Balancing in Multiprocessor Systems
7.5.1. Scheduling Domains
Essentially, a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel.
7.5.2. The rebalance_tick( ) Function
7.5.3. The load_balance( ) Function
7.5.4. The move_tasks( ) Function
7.6. System Calls Related to Scheduling
7.6.1. The nice( ) System Call
The nice( )[*] system call allows processes to change their base priority. The integer value
contained in the increment parameter is used to modify the nice field of the process descriptor. The
nice Unix command, which allows users to run programs with modified scheduling priority, is based
on this system call.
7.6.2. The getpriority( ) and setpriority( ) System Calls
7.6.3. The sched_getaffinity( ) and sched_setaffinity( ) System Calls
The sched_getaffinity( ) and sched_setaffinity( ) system calls respectively return and set up the
CPU affinity mask of a process, that is, the bit mask of the CPUs that are allowed to execute the process.
This mask is stored in the cpus_allowed field of the process descriptor.
7.6.4. System Calls Related to Real-Time Processes
7.6.4.1. The sched_getscheduler( ) and sched_setscheduler( ) system calls
The sched_getscheduler( ) system call queries the scheduling policy currently applied to the
process identified by the pid parameter.
SCHED_FIFO, SCHED_RR, or SCHED_NORMAL
7.6.4.2. The sched_ getparam( ) and sched_setparam( ) system calls
7.6.4.3. The sched_ yield( ) system call
7.6.4.4. The sched_ get_priority_min( ) and sched_ get_priority_max( ) system calls
7.6.4.5. The sched_rr_ get_interval( ) system call