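Pages

The kernel treats the physical page as the basic unit of memory management.

The memory management unit (MMU) is the hardware that manages memory and translates virtual addresses into physical addresses.

The kernel represents each physical page with a struct page (include/linux/mm_types.h, slightly abridged):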

struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    atomic_t _count;        /* Usage count, see below. */
    union {
        atomic_t _mapcount; /* Count of ptes mapped in mms,
                     * to show when page is mapped
                     * & limit reverse map searches.
                     */
        struct {        /* SLUB */
            u16 inuse;
            u16 objects;
        };
    };
    union {
        struct {
        unsigned long private;      /* Mapping-private opaque data:
                         * usually used for buffer_heads
                         * if PagePrivate set; used for
                         * swp_entry_t if PageSwapCache;
                         * indicates order in the buddy
                         * system if PG_buddy is set.
                         */
        struct address_space *mapping;  /* If low bit clear, points to
                         * inode address_space, or NULL.
                         * If page mapped as anonymous
                         * memory, low bit is set, and
                         * it points to anon_vma object:
                         * see PAGE_MAPPING_ANON below.
                         */
        };
        struct kmem_cache *slab;    /* SLUB: Pointer to slab */
        struct page *first_page;    /* Compound tail pages */
    };
    union {
        pgoff_t index;      /* Our offset within mapping. */
        void *freelist;     /* SLUB: freelist req. slab lock */
    };
    struct list_head lru;       /* Pageout list, eg. active_list
                     * protected by zone->lru_lock !
                     */
    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
};

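Every physical page has a corresponding struct page.

flags: stores the status of the page; each bit represents one state. The individual bits are defined in include/linux/page-flags.h.

_count: the page's usage count. The book describes the value -1 as meaning that nothing in the kernel references the page, so it can be used by a new allocation. Kernel code should not read the field directly but call page_count(), which returns 0 for a free page and a positive value for a page in use.

virtual: the page's virtual address, that is, where the page is mapped in kernel virtual memory. As the code above shows, the field exists only when WANT_PAGE_VIRTUAL is defined.

A page can be used by the page cache, in which case mapping points to the address_space object associated with the page; it can also carry private data, referenced through private.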
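
As a rough user-space illustration of the "one bit per state" layout of flags, the sketch below packs a few states into an unsigned long. The toy_ names and the bit positions are assumptions made for this example; the real bit numbers are in include/linux/page-flags.h, and the kernel manipulates them with atomic bit operations rather than plain assignments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum toy_pageflags {
    TOY_PG_locked,
    TOY_PG_dirty,
    TOY_PG_private,
    TOY_NR_PAGEFLAGS
};

struct toy_page {
    unsigned long flags;    /* one bit per state */
};

/* Mark the page as being in the given state. */
static void toy_set_flag(struct toy_page *page, enum toy_pageflags bit)
{
    assert(bit < TOY_NR_PAGEFLAGS);
    page->flags |= 1UL << bit;
}

/* Clear the given state. */
static void toy_clear_flag(struct toy_page *page, enum toy_pageflags bit)
{
    assert(bit < TOY_NR_PAGEFLAGS);
    page->flags &= ~(1UL << bit);
}

/* Report whether the page is currently in the given state. */
static bool toy_test_flag(const struct toy_page *page, enum toy_pageflags bit)
{
    assert(bit < TOY_NR_PAGEFLAGS);
    return ((page->flags >> bit) & 1UL) != 0;
}
```

Because every state occupies its own bit, several states can be set at once and tested independently of each other.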

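Zones

The kernel divides pages into zones.

It uses zones to group pages that have similar properties.

Linux does this because of two hardware limitations:

Some hardware devices can perform DMA only to certain memory addresses.
Some architectures can physically address far more memory than they can virtually address, so some memory cannot be permanently mapped into the kernel address space (on x86-32, for example, the entire virtual address space is only 4 GB, of which the kernel normally gets just part, yet the machine may have more than 4 GB of physical memory).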

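Linux uses four primary zones (include/linux/mmzone.h):

ZONE_DMA: pages that can be used for DMA.

ZONE_DMA32: like ZONE_DMA, pages that can be used for DMA, but accessible only to 32-bit devices.

ZONE_NORMAL: normal, regularly mapped pages.

ZONE_HIGHMEM: "high memory", pages that cannot be permanently mapped into the kernel address space.

The actual use and layout of the zones depend on the architecture.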

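On x86-32, for example, ZONE_DMA covers physical memory below 16 MB, ZONE_NORMAL covers 16 MB to 896 MB, and ZONE_HIGHMEM covers everything above 896 MB.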
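
To make the x86-32 layout concrete, here is a small user-space sketch that classifies a physical address by zone. It assumes the conventional x86-32 boundaries of 16 MB and 896 MB; the toy_ names are invented for the example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

#define TOY_MB            (1024ULL * 1024ULL)
#define TOY_DMA_LIMIT     (16ULL * TOY_MB)   /* assumed end of ZONE_DMA */
#define TOY_NORMAL_LIMIT  (896ULL * TOY_MB)  /* assumed end of ZONE_NORMAL */

/* Return the zone that a physical address would fall into. */
static enum toy_zone toy_zone_of(uint64_t phys_addr)
{
    if (phys_addr < TOY_DMA_LIMIT)
        return TOY_ZONE_DMA;
    if (phys_addr < TOY_NORMAL_LIMIT)
        return TOY_ZONE_NORMAL;
    return TOY_ZONE_HIGHMEM;
}

/* Human-readable name of a zone, for logging. */
static const char *toy_zone_name(enum toy_zone zone)
{
    switch (zone) {
    case TOY_ZONE_DMA:     return "ZONE_DMA";
    case TOY_ZONE_NORMAL:  return "ZONE_NORMAL";
    case TOY_ZONE_HIGHMEM: return "ZONE_HIGHMEM";
    }
    assert(0 && "unknown zone");
    return "?";
}
```

For instance, toy_zone_of(1ULL << 30) reports TOY_ZONE_HIGHMEM, because 1 GB lies above the 896 MB boundary.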

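The division into zones has no physical meaning; it is only a logical grouping that the kernel uses to keep track of pages.

An allocation can request pages from a particular zone; some allocations require a specific zone and others do not. A single allocation, however, cannot span zones.

Not every architecture defines every zone.

x86-64, for example, has no ZONE_HIGHMEM, because its 64-bit address space is far larger than the amount of memory currently supported.

Each zone is represented by a struct zone, defined in include/linux/mmzone.h; it is a large structure.

On x86-32, pages in high memory are mapped into the range between 3 GB and 4 GB.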

The mapping functions (include/linux/highmem.h):

Permanent mapping:

static inline void *kmap(struct page *page)

Unmapping:

static inline void kunmap(struct page *page)

Temporary mapping:

static inline void *kmap_atomic(struct page *page, enum km_type idx)

Undoing a temporary mapping:

#define kunmap_atomic(addr, idx)    do { pagefault_enable(); } while (0)

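Kernel allocation functions

The low-level page functions, such as alloc_pages() and __get_free_pages(), are declared in include/linux/gfp.h.

The functions that free pages are: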

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
#define free_page(addr) free_pages((addr),0)
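
The order argument of these functions is a power-of-two exponent: a call covers 2^order physically contiguous pages. The user-space sketch below computes the smallest order that covers a given byte size, in the spirit of the kernel's get_order(). The toy_ names and the 4 KB page size are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4 KB pages */
#define TOY_PAGE_SIZE  ((size_t)1 << TOY_PAGE_SHIFT)

/* Smallest order such that (1 << order) pages hold at least size bytes. */
static unsigned int toy_get_order(size_t size)
{
    size_t pages;
    unsigned int order = 0;

    assert(size > 0);
    /* Round the byte count up to whole pages. */
    pages = (size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
    /* Grow the order until 2^order pages are enough. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

A request for three pages therefore rounds up to order 2 (four pages), which is why large page-level allocations can waste memory.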

To allocate memory in byte-sized chunks, use kmalloc() (include/linux/slab_def.h):

static __always_inline void *kmalloc(size_t size, gfp_t flags)

The corresponding free function:

void kfree(const void *);

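Memory allocation takes flags (the gfp_mask and flags parameters), which fall into three categories:

Action modifiers: specify how the kernel should allocate the memory, for example __GFP_WAIT (the allocator may sleep), __GFP_IO (the allocator may start disk I/O) and __GFP_FS (the allocator may start filesystem I/O).

Zone modifiers: specify which zone to allocate from, namely __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.

Type flags: ready-made values such as GFP_KERNEL and GFP_ATOMIC.

A type flag is simply a combination of the two previous kinds.

As for which flag to use when: process context that may sleep uses GFP_KERNEL, while code that cannot sleep, such as an interrupt handler, softirq, or tasklet, uses GFP_ATOMIC. If DMA-capable memory is required, the flag is combined with GFP_DMA.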
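
The idea that a type flag is only a bitwise OR of modifiers can be sketched in user space as follows. The composition is modeled on the 2.6-era gfp.h, where GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS and GFP_ATOMIC is __GFP_HIGH; the TOY_ names, the bit values, and the helper toy_flags_for_context() are assumptions made for this example, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

/* Zone modifier (hypothetical bit value). */
#define TOY_GFP_DMA_ZONE 0x01u

/* Action modifiers (hypothetical bit values). */
#define TOY_GFP_WAIT 0x10u  /* allocator may sleep */
#define TOY_GFP_HIGH 0x20u  /* may dip into emergency reserves */
#define TOY_GFP_IO   0x40u  /* may start disk I/O */
#define TOY_GFP_FS   0x80u  /* may start filesystem I/O */

/* Type flags: nothing more than combinations of the modifiers above. */
#define TOY_GFP_ATOMIC (TOY_GFP_HIGH)
#define TOY_GFP_NOFS   (TOY_GFP_WAIT | TOY_GFP_IO)
#define TOY_GFP_KERNEL (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)

/* An allocation may sleep only if the WAIT modifier is present. */
static bool toy_may_sleep(toy_gfp_t flags)
{
    return (flags & TOY_GFP_WAIT) != 0;
}

/* Pick flags from the calling context, following the rule of thumb above. */
static toy_gfp_t toy_flags_for_context(bool in_interrupt, bool need_dma)
{
    toy_gfp_t flags = in_interrupt ? TOY_GFP_ATOMIC : TOY_GFP_KERNEL;

    if (need_dma)
        flags |= TOY_GFP_DMA_ZONE;
    /* Interrupt context must never end up with sleeping flags. */
    assert(!in_interrupt || !toy_may_sleep(flags));
    return flags;
}
```

The zone modifier is independent of the action modifiers, so adding it never changes whether the allocation may sleep.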

vmalloc()

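vmalloc() is similar to kmalloc(), except that the memory it allocates is only virtually contiguous; the underlying physical pages need not be contiguous. (Memory from kmalloc() is physically contiguous.)

This is also how user-space allocation works: the pages returned by malloc() are contiguous in the process's virtual address space, with no guarantee that they are contiguous in physical memory.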
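
The difference between virtual and physical contiguity can be modeled with a toy page table: entry i names the physical frame that backs virtual page i of a buffer. This is a user-space sketch only; the toy_ names, the flat single-level table, and the 4 KB page size are assumptions made for the example and say nothing about how the kernel's page tables are actually laid out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                        /* assumes 4 KB pages */
#define TOY_PAGE_MASK  (((size_t)1 << TOY_PAGE_SHIFT) - 1)

/*
 * Translate a byte offset inside a virtually contiguous buffer into a
 * physical address. frames[i] is the frame number backing virtual page i.
 */
static uint64_t toy_virt_to_phys(const uint32_t *frames, size_t nr_pages,
                                 size_t offset)
{
    size_t vpage = offset >> TOY_PAGE_SHIFT;

    assert(vpage < nr_pages);
    return ((uint64_t)frames[vpage] << TOY_PAGE_SHIFT) |
           (uint64_t)(offset & TOY_PAGE_MASK);
}

/* True if the frames behind the buffer are also adjacent in physical memory. */
static bool toy_physically_contiguous(const uint32_t *frames, size_t nr_pages)
{
    for (size_t i = 1; i < nr_pages; i++)
        if (frames[i] != frames[i - 1] + 1)
            return false;
    return true;
}
```

A kmalloc()-style buffer would correspond to a table such as {100, 101, 102}, a vmalloc()-style buffer to something like {7, 42, 19}: both look like one unbroken range of offsets to the code that uses them.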

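For the most part, only hardware devices require physically contiguous memory.

To the kernel, all memory appears logically contiguous.

Even so, most kernel code uses kmalloc() rather than vmalloc(), mainly for performance reasons: vmalloc() has to set up page table entries to make physically scattered pages virtually contiguous.

vmalloc() is used only when there is no alternative, typically to obtain a large region of memory.

The declaration of vmalloc() (include/linux/vmalloc.h):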

extern void *vmalloc(unsigned long size);

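This function may sleep, so it cannot be called from interrupt context. (For kmalloc(), the flags determine whether it may sleep.)

The counterpart of vmalloc() is: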

extern void vfree(const void *addr);

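Slab allocator

The slab allocator acts as a generic data structure caching layer that makes frequent allocation and freeing of data structures cheap.

The slab layer divides objects into groups called caches, each of which stores one type of object.

There is one cache per object type: for example, one cache holds process descriptors and another holds inode objects.

The kmalloc() interface is built on top of the slab layer, using a family of general-purpose caches.

Caches are divided into slabs.

A slab consists of one or more physically contiguous pages.

Typically, a slab consists of a single page.

Each cache may consist of multiple slabs. Each slab is in one of three states: full, partial, or empty.

When some part of the kernel requests a new object, the request is satisfied from a partial slab if one exists; otherwise from an empty slab; and if there is no empty slab either, a new slab is created.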
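
The allocation order described above (partial slab first, then an empty slab, then a newly created slab) can be sketched in user space as follows. This is a toy model, not the kernel's implementation: the toy_ names, the fixed four objects per slab, the single linked list, and the use of malloc() as backing storage are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_OBJS_PER_SLAB 4

enum toy_state { TOY_EMPTY, TOY_PARTIAL, TOY_FULL };

struct toy_slab {
    struct toy_slab *next;
    unsigned int inuse;                     /* objects handed out */
    unsigned char used[TOY_OBJS_PER_SLAB];  /* 1 if the slot is allocated */
    char *mem;                              /* backing storage for objects */
};

struct toy_cache {
    size_t obj_size;
    struct toy_slab *slabs;
    unsigned int nr_slabs;
};

static enum toy_state toy_state_of(const struct toy_slab *s)
{
    if (s->inuse == 0)
        return TOY_EMPTY;
    return s->inuse == TOY_OBJS_PER_SLAB ? TOY_FULL : TOY_PARTIAL;
}

static struct toy_slab *toy_find(struct toy_cache *c, enum toy_state want)
{
    for (struct toy_slab *s = c->slabs; s; s = s->next)
        if (toy_state_of(s) == want)
            return s;
    return NULL;
}

/* Create a new, empty slab and link it into the cache. */
static struct toy_slab *toy_grow(struct toy_cache *c)
{
    struct toy_slab *s = calloc(1, sizeof(*s));

    if (!s)
        return NULL;
    s->mem = malloc(c->obj_size * TOY_OBJS_PER_SLAB);
    if (!s->mem) {
        free(s);
        return NULL;
    }
    s->next = c->slabs;
    c->slabs = s;
    c->nr_slabs++;
    return s;
}

/* Allocate one object: partial slab first, then empty, then a new slab. */
static void *toy_cache_alloc(struct toy_cache *c)
{
    struct toy_slab *s = toy_find(c, TOY_PARTIAL);

    if (!s)
        s = toy_find(c, TOY_EMPTY);
    if (!s)
        s = toy_grow(c);
    if (!s)
        return NULL;
    for (unsigned int i = 0; i < TOY_OBJS_PER_SLAB; i++) {
        if (!s->used[i]) {
            s->used[i] = 1;
            s->inuse++;
            return s->mem + i * c->obj_size;
        }
    }
    assert(0 && "chosen slab had no free slot");
    return NULL;
}

/* Return an object to the slab it came from; the slab itself is kept. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
    uintptr_t p = (uintptr_t)obj;

    for (struct toy_slab *s = c->slabs; s; s = s->next) {
        uintptr_t base = (uintptr_t)s->mem;

        if (p >= base && p < base + c->obj_size * TOY_OBJS_PER_SLAB) {
            unsigned int i = (unsigned int)((p - base) / c->obj_size);

            assert(s->used[i]);
            s->used[i] = 0;
            s->inuse--;
            return;
        }
    }
    assert(0 && "object does not belong to this cache");
}
```

Freed objects stay inside their slab, so the next allocation reuses the slot without going back to the underlying allocator, which is the point of the caching layer.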

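The relationship between them: a cache contains slabs, and each slab contains objects.

Each cache is represented by a struct kmem_cache: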

struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
    struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;
    unsigned int buffer_size;
    u32 reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
    unsigned int flags;     /* constant flags */
    unsigned int num;       /* # of objs per slab */
/* 4) cache_grow/shrink */
    /* order of pgs per slab (2^n) */
    unsigned int gfporder;
    /* force GFP flags, e.g. GFP_DMA */
    gfp_t gfpflags;
    size_t colour;          /* cache colouring range */
    unsigned int colour_off;    /* colour offset */
    struct kmem_cache *slabp_cache;
    unsigned int slab_size;
    unsigned int dflags;        /* dynamic flags */
    /* constructor func */
    void (*ctor)(void *obj);
/* 5) cache creation/removal */
    const char *name;
    struct list_head next;
/* 6) statistics */
#ifdef CONFIG_DEBUG_SLAB
    unsigned long num_active;
    unsigned long num_allocations;
    unsigned long high_mark;
    unsigned long grown;
    unsigned long reaped;
    unsigned long errors;
    unsigned long max_freeable;
    unsigned long node_allocs;
    unsigned long node_frees;
    unsigned long node_overflow;
    atomic_t allochit;
    atomic_t allocmiss;
    atomic_t freehit;
    atomic_t freemiss;
    /*
     * If debugging is enabled, then the allocator can add additional
     * fields and/or padding to every object. buffer_size contains the total
     * object size including these internal fields, the following two
     * variables contain the offset to the user object and its size.
     */
    int obj_offset;
    int obj_size;
#endif /* CONFIG_DEBUG_SLAB */
    /*
     * We put nodelists[] at the end of kmem_cache, because we want to size
     * this array to nr_node_ids slots instead of MAX_NUMNODES
     * (see kmem_cache_init())
     * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache
     * is statically defined, so we reserve the max number of nodes.
     */
    struct kmem_list3 *nodelists[MAX_NUMNODES];
    /*
     * Do not add fields after nodelists[]
     */
};

At the end of the structure is an array of kmem_list3 pointers; each kmem_list3 holds three lists (mm/slab.c):

/*
 * The slab lists for all objects.
 */
struct kmem_list3 {
    struct list_head slabs_partial; /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;   /* Per-node cache coloring */
    spinlock_t list_lock;
    struct array_cache *shared; /* shared per node */
    struct array_cache **alien; /* on other nodes */
    unsigned long next_reap;    /* updated without locking */
    int free_touched;       /* updated without locking */
};

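slabs_full, slabs_partial, and slabs_free hold the full, partial, and empty slabs, respectively.

The slab descriptor is defined as follows (mm/slab.c):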

struct slab {
    struct list_head list;
    unsigned long colouroff;
    void *s_mem;        /* including colour offset */
    unsigned int inuse; /* num of objs active in slab */
    kmem_bufctl_t free;
    unsigned short nodeid;
};

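The slab descriptor is allocated either outside the slab or at the very beginning of the slab itself.

The slab allocator obtains the pages for a new slab with the following function: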

static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)

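Internally it relies on the low-level kernel page allocator (the book presents this as a call to __get_free_pages()).

The pages of a slab are released with: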

static void kmem_freepages(struct kmem_cache *cachep, void *addr)

Creating a new cache:

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
    unsigned long flags, void (*ctor)(void *))

Destroying a cache:

void kmem_cache_destroy(struct kmem_cache *cachep)

Allocating an object from a cache:

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)

Freeing an object:

void kmem_cache_free(struct kmem_cache *cachep, void *objp)

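Per-CPU data

An operating system that supports SMP needs per-CPU data, that is, data that is unique to a given processor.

Per-CPU data can be stored in an array with one element per CPU, as in the following example: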

#ifdef __ARCH_SYNC_CORE_ICACHE
unsigned long icache_invld_count[NR_CPUS];
void resync_core_icache(void)
{
    unsigned int cpu = get_cpu();
    blackfin_invalidate_entire_icache();
    icache_invld_count[cpu]++;
    put_cpu();
}
#endif /* __ARCH_SYNC_CORE_ICACHE */

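get_cpu() disables kernel preemption, so until put_cpu() is called the task cannot be preempted and rescheduled on another processor, which would otherwise corrupt the data.

The 2.6 kernel introduced a new per-CPU data interface.

Per-CPU data at compile time (include/linux/percpu-defs.h, include/linux/percpu.h):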

DEFINE_PER_CPU(type, name)
DECLARE_PER_CPU(type, name)
/*
 * Must be an lvalue. Since @var must be a simple identifier,
 * we force a syntax error here if it isn't.
 */
#define get_cpu_var(var) (*({               \
    preempt_disable();              \
    &__get_cpu_var(var); }))
/*
 * The weird & is necessary because sparse considers (void)(var) to be
 * a direct dereference of percpu variable (var).
 */
#define put_cpu_var(var) do {               \
    (void)&(var);                   \
    preempt_enable();               \
} while (0)

Per-CPU data at runtime:

alloc_percpu(type)  /* a macro; evaluates to a per-CPU pointer to type */
extern void __percpu *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void __percpu *__pdata);

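Benefits of per-CPU data:

Less locking is required.
Cache invalidation is reduced.

The only safety requirement is that kernel preemption be disabled.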
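
Both benefits can be seen in a user-space sketch of a per-CPU counter: each CPU only ever writes its own slot, so writers need no lock, and each slot is padded to its own cache line, so one CPU's update does not invalidate a line that another CPU is using. The toy_ names, the four CPUs, and the 64-byte cache line are assumptions made for the example; the CPU index is passed in explicitly as a stand-in for get_cpu(), since user space cannot disable kernel preemption.

```c
#include <assert.h>

#define TOY_NR_CPUS    4    /* assumed number of processors */
#define TOY_CACHE_LINE 64   /* assumed cache line size in bytes */

/* One counter per CPU, each aligned (and thus padded) to a cache line. */
struct toy_percpu_counter {
    _Alignas(TOY_CACHE_LINE) unsigned long value;
};

static struct toy_percpu_counter toy_counters[TOY_NR_CPUS];

/* Called "on" the given CPU: touches only that CPU's slot. */
static void toy_counter_inc(unsigned int cpu)
{
    assert(cpu < TOY_NR_CPUS);
    toy_counters[cpu].value++;
}

/* A reader that wants the global total sums over all CPUs. */
static unsigned long toy_counter_sum(void)
{
    unsigned long sum = 0;

    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += toy_counters[cpu].value;
    return sum;
}
```

The price of lock-free updates is that a global reading requires a pass over every CPU's slot and is only approximate while the counters are still changing.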

Reading notes on Linux Kernel Development: memory management
