A Deep Dive into the page in Linux Memory Management

Linux Memory Management

Before discussing the page, we need to say a few words about how Linux manages memory (this part is only a brief overview).

First, understand that the operating system uses virtual memory management to create an illusion for your program: "I have a huge amount of memory!"

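Programs use virtual addresses, which are translated through the page table into the corresponding physical addresses (the data may be in main memory or on some other storage device). The details of page tables and multi-level page tables are beyond the scope of this article.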

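Next, mapping virtual addresses to physical addresses requires a uniform unit of granularity; without one the mapping would be unmanageable. This unit is called a page, and the scheme is known as paging. A page is a range of contiguous addresses, either in virtual memory or in physical memory; a page in physical memory is called a page frame.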

A page is just that simple a concept: a fixed-size (PAGE_SIZE) region of storage whose addresses are contiguous.
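Because every page has the same power-of-two size, an address splits cleanly into a page number and an offset within that page. The following is a minimal userspace sketch of that arithmetic, not kernel code. The DEMO_ names and the 4 KiB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages (shift of 12). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

static_assert((DEMO_PAGE_SIZE & (DEMO_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* Index of the page that contains addr. */
static inline uint64_t demo_page_number(uint64_t addr)
{
	return addr >> DEMO_PAGE_SHIFT;
}

/* Byte offset of addr inside its page. */
static inline uint64_t demo_page_offset(uint64_t addr)
{
	return addr & (DEMO_PAGE_SIZE - 1);
}

/* Address of the first byte of the page that contains addr. */
static inline uint64_t demo_page_base(uint64_t addr)
{
	return addr & ~(DEMO_PAGE_SIZE - 1);
}
```

For example, under these assumptions the address 0x12345 lies in page 0x12 at offset 0x345, and that page starts at 0x12000.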

Page Definition and Related Operations

With the concept of a page briefly covered, let's focus on how it is used in kernel code.

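The page is the smallest unit in which Linux manages physical memory. With 4 KiB pages, a machine with 4 GB of memory has about a million struct page instances, so even one extra byte in the structure has a significant impact on the system. The Linux community has therefore designed struct page very carefully. It is defined in include/linux/mm_types.h; the kernel version used here is 5.18.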
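The cost of growing struct page can be made concrete with a little arithmetic. The sketch below is a userspace illustration. The helper names are hypothetical, and the 64-byte descriptor size used in the note that follows is an assumption rather than a value taken from this article.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page frames needed to cover mem_bytes of physical memory. */
static inline uint64_t demo_nr_pages(uint64_t mem_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return mem_bytes / page_size;
}

/* Total bytes spent on per-page descriptors of desc_size bytes each. */
static inline uint64_t demo_descriptor_bytes(uint64_t mem_bytes,
					     uint64_t page_size,
					     uint64_t desc_size)
{
	return demo_nr_pages(mem_bytes, page_size) * desc_size;
}
```

With 4 GiB of memory and 4 KiB pages this gives 1,048,576 descriptors. At an assumed 64 bytes each, that is 64 MiB of metadata, and every additional byte per descriptor costs another 1 MiB.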

struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	/*
	 * Five words (20/40 bytes) are available in this union.
	 * WARNING: bit 0 of the first word is used for PageTail(). That
	 * means the other users of this union MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
	union {
		struct {	/* Page cache and anonymous pages */
			/**
			 * @lru: Pageout list, eg. active_list protected by
			 * lruvec->lru_lock.  Sometimes used as a generic list
			 * by the page owner.
			 */
			union {
				struct list_head lru;
				/* Or, for the Unevictable "LRU list" slot */
				struct {
					/* Always even, to negate PageTail */
					void *__filler;
					/* Count page's or folio's mlocks */
					unsigned int mlock_count;
				};
			};
			/* See page-flags.h for PAGE_MAPPING_FLAGS */
			struct address_space *mapping;
			pgoff_t index;		/* Our offset within mapping. */
			/**
			 * @private: Mapping-private opaque data.
			 * Usually used for buffer_heads if PagePrivate.
			 * Used for swp_entry_t if PageSwapCache.
			 * Indicates order in the buddy system if PageBuddy.
			 */
			unsigned long private;
		};
		struct {	/* page_pool used by netstack */
			/**
			 * @pp_magic: magic value to avoid recycling non
			 * page_pool allocated pages.
			 */
			unsigned long pp_magic;
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr;
			union {
				/**
				 * dma_addr_upper: might require a 64-bit
				 * value on 32-bit architectures.
				 */
				unsigned long dma_addr_upper;
				/**
				 * For frag page support, not supported in
				 * 32-bit architectures with 64-bit DMA.
				 */
				atomic_long_t pp_frag_count;
			};
		};
		struct {	/* Tail pages of compound page */
			unsigned long compound_head;	/* Bit zero is set */

			/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			atomic_t compound_pincount;
#ifdef CONFIG_64BIT
			unsigned int compound_nr; /* 1 << compound_order */
#endif
		};
		struct {	/* Second tail page of compound page */
			unsigned long _compound_pad_1;	/* compound_head */
			unsigned long _compound_pad_2;
			/* For both global and memcg */
			struct list_head deferred_list;
		};
		struct {	/* Page table pages */
			unsigned long _pt_pad_1;	/* compound_head */
			pgtable_t pmd_huge_pte; /* protected by page->ptl */
			unsigned long _pt_pad_2;	/* mapping */
			union {
				struct mm_struct *pt_mm; /* x86 pgds only */
				atomic_t pt_frag_refcount; /* powerpc */
			};
#if ALLOC_SPLIT_PTLOCKS
			spinlock_t *ptl;
#else
			spinlock_t ptl;
#endif
		};
		struct {	/* ZONE_DEVICE pages */
			/** @pgmap: Points to the hosting device page map. */
			struct dev_pagemap *pgmap;
			void *zone_device_data;
			/*
			 * ZONE_DEVICE private pages are counted as being
			 * mapped so the next 3 words hold the mapping, index,
			 * and private fields from the source anonymous or
			 * page cache page while the page is migrated to device
			 * private memory.
			 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
			 * use the mapping, index, and private fields when
			 * pmem backed DAX files are mapped.
			 */
		};

		/** @rcu_head: You can use this to free a page by RCU. */
		struct rcu_head rcu_head;
	};

	union {		/* This union is 4 bytes in size. */
		/*
		 * If the page can be mapped to userspace, encodes the number
		 * of times this page is referenced by a page table.
		 */
		atomic_t _mapcount;

		/*
		 * If the page is neither PageSlab nor mappable to userspace,
		 * the value stored here may help determine what this page
		 * is used for.  See page-flags.h for a list of page types
		 * which are currently stored here.
		 */
		unsigned int page_type;
	};

	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
	atomic_t _refcount;

#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif

	/*
	 * On machines where all RAM is mapped into kernel address space,
	 * we can simply calculate the virtual address. On machines with
	 * highmem some memory is mapped into kernel virtual memory
	 * dynamically, so we need a place to store that address.
	 * Note that this field could be 16 bits on x86 ... ;)
	 *
	 * Architectures with slow multiplication can define
	 * WANT_PAGE_VIRTUAL in asm/page.h
	 */
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
} _struct_page_alignment;

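Many tricks are used to keep struct page small, among them two unions that save memory. struct page can be divided into the following parts: the flags word, a five-word union whose layout depends on how the page is being used, a 4-byte union holding either _mapcount or page_type, the _refcount counter, and a few fields that exist only under certain configuration options (memcg_data, virtual, _last_cpupid).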

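Besides unions, fields are also reused. Examples include the flags word (on NUMA systems, part of it is set aside for the node id and zone) and the list_head lru, which is linked into different lists depending on the page's current state and use, again to save space.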
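The space saved by overlaying mutually exclusive uses in a union can be shown with a small userspace sketch. All demo_ types below are hypothetical, heavily simplified stand-ins for the real layout. They are meant only to show that the union costs as much as its largest member, while a flat struct pays for every member.

```c
#include <assert.h>
#include <stddef.h>

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Fields needed while the page is in the page cache or anonymous. */
struct demo_cache_use {
	struct demo_list_head lru;
	void *mapping;
	unsigned long index;
	unsigned long private_data;
};

/* Fields needed while the page is a tail page of a compound page. */
struct demo_tail_use {
	unsigned long compound_head;
	unsigned char compound_order;
};

/* Overlaid: a page is never both at the same time. */
struct demo_page_union {
	unsigned long flags;
	union {
		struct demo_cache_use cache;
		struct demo_tail_use tail;
	} u;
};

/* Flat: every use gets its own storage. */
struct demo_page_flat {
	unsigned long flags;
	struct demo_cache_use cache;
	struct demo_tail_use tail;
};

static_assert(sizeof(struct demo_page_union) < sizeof(struct demo_page_flat),
	      "overlaying the two uses must save space");
```

The price of this design is that code must know which use is currently active before touching a member, which is why the real structure's comments warn about bit 0 of the first word.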

Now that the layout of struct page is clearer, let's look at some of its fields:

flags

A single word that records the state of the page, for example whether the page is locked or dirty. The flag bits are:

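PG_locked		Set to 1 if the page is locked; while it is set, other parts of the kernel must not access the page
PG_referenced	Set if the page has been accessed recently; used by the LRU algorithm
PG_uptodate		The page contents have been read successfully from disk
PG_dirty		The page is dirty and its data must be written back to disk. Modified data is not flushed immediately; it is kept in memory first and written back later. A dirty page must not be freed before it has been written back
PG_lru			The page is on one of the LRU lists (active, inactive, or unevictable)
PG_active		The page is active
PG_workingset	The page is part of some process's working set
PG_waiters		Some process is waiting on this page
PG_error		An error occurred during I/O on this page
PG_slab			The page is used by the slab allocator
PG_owner_priv_1	For use by the page's owner; if the page is in the page cache, the filesystem may use it
PG_arch_1		An architecture-specific state bit
PG_reserved		The page is reserved
PG_private		Must be set if the private member of the page is in use; pages used for I/O can use that field to subdivide the page into multiple buffers
PG_private_2	An extension of PG_private, often used for aux data
PG_writeback	The page's contents are being written to disk
PG_head			The page is a head page. The kernel sometimes groups several pages into a compound page, and this flag marks the first page of the group
PG_mappedtodisk	The page is mapped to blocks on disk
PG_reclaim		The page is to be reclaimed as soon as possible
PG_swapbacked	The page's backing store is swap/RAM; generally only anonymous pages can be written back to the swap partition
PG_unevictable	The page is pinned and cannot be reclaimed; it appears on the LRU_UNEVICTABLE list
PG_mlocked		The VMA that the page belongs to is locked, typically because a range of memory was locked with the mlock() system call
PG_uncached		The page is marked uncacheable; requires CONFIG_ARCH_USES_PG_UNCACHED
PG_hwpoison		Hardware-poisoned page, do not touch; requires CONFIG_MEMORY_FAILURE
PG_young		Supported only with CONFIG_IDLE_PAGE_TRACKING and CONFIG_64BIT
PG_idle			Supported only with CONFIG_IDLE_PAGE_TRACKING and CONFIG_64BIT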

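A note on compound pages: if several pages are allocated with something like alloc_pages() and __GFP_COMP is specified, the first page is marked PageHead() and the remaining pages are marked PageTail(); calling compound_head() on any of them returns the head page. A compound page is much larger than the base page size supported by the system and is mainly used by HugeTLB-related code.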
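The struct page listing notes that a tail page stores compound_head with bit zero set. A simplified userspace sketch of that encoding might look like the following. The demo_ names are hypothetical. Marking the head page with a zero word is an assumption of this sketch only, since the kernel identifies head pages through the PG_head flag.

```c
#include <assert.h>
#include <stdint.h>

struct demo_cpage {
	/* Tail pages: address of the head page with bit 0 set. */
	uintptr_t compound_head;
};

/* Record head in a tail page; bit 0 marks the page as a tail. */
static inline void demo_set_compound_head(struct demo_cpage *tail,
					  struct demo_cpage *head)
{
	/* Alignment guarantees bit 0 of a valid pointer is clear. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

static inline int demo_page_tail(const struct demo_cpage *page)
{
	return (int)(page->compound_head & 1);
}

/* Return the head page for any page of the group. */
static inline struct demo_cpage *demo_compound_head(struct demo_cpage *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct demo_cpage *)(head - 1);
	return page;
}

/* Turn pages[0 .. 2^order - 1] into one compound group. */
static inline void demo_prep_compound(struct demo_cpage *pages,
				      unsigned int order)
{
	pages[0].compound_head = 0; /* sketch-only marker for "not a tail" */
	for (unsigned long i = 1; i < (1UL << order); i++)
		demo_set_compound_head(&pages[i], &pages[0]);
}
```

Storing the tag in bit 0 of a pointer-sized word is what makes the warning in the listing necessary: any other user of the same union slot has to keep that bit clear.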

The flag bits are manipulated through a family of macros: (1) SetPageXXX sets a flag; (2) ClearPageXXX clears a flag; (3) PageXXX tests whether a flag is set.
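The naming pattern of these macros can be imitated in userspace with a generator macro. This is only a sketch under simplifying assumptions: the demo_ and DEMO_ names are hypothetical, and it uses plain, non-atomic bit operations, whereas the kernel builds its accessors on atomic bit operations.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_pageflags {
	DEMO_PG_locked,
	DEMO_PG_referenced,
	DEMO_PG_uptodate,
	DEMO_PG_dirty,
	DEMO_NR_PAGEFLAGS
};

static_assert(DEMO_NR_PAGEFLAGS <= 8 * sizeof(unsigned long),
	      "all flags must fit in one word");

struct demo_flag_page {
	unsigned long flags;
};

/* Generate PageXXX / SetPageXXX / ClearPageXXX for one flag bit. */
#define DEMO_PAGEFLAG(uname, lname)					\
static inline bool Page##uname(const struct demo_flag_page *p)		\
{									\
	return (p->flags >> DEMO_PG_##lname) & 1UL;			\
}									\
static inline void SetPage##uname(struct demo_flag_page *p)		\
{									\
	p->flags |= 1UL << DEMO_PG_##lname;				\
}									\
static inline void ClearPage##uname(struct demo_flag_page *p)		\
{									\
	p->flags &= ~(1UL << DEMO_PG_##lname);				\
}

DEMO_PAGEFLAG(Locked, locked)
DEMO_PAGEFLAG(Dirty, dirty)
```

A caller would write SetPageDirty(&page) after modifying the data and test PageDirty(&page) before deciding whether a writeback is needed.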

_refcount

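Every page has a reference count; a newly allocated page has a count of 1. When the count drops to zero, usually as the result of a put_page() call, the page is released back to the page allocator. Do not manipulate the field directly; use the accessor functions provided for it.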
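The get/put discipline described here can be sketched in userspace with C11 atomics. The demo_ names are hypothetical, and the freed field is a stand-in for handing the page back to the allocator. This illustrates the counting rule only and is not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_ref_page {
	atomic_int refcount;
	bool freed; /* stand-in for "returned to the page allocator" */
};

/* A newly allocated page starts with one reference. */
static inline void demo_page_init(struct demo_ref_page *p)
{
	atomic_init(&p->refcount, 1);
	p->freed = false;
}

static inline int demo_page_count(struct demo_ref_page *p)
{
	return atomic_load(&p->refcount);
}

/* Take an additional reference. */
static inline void demo_get_page(struct demo_ref_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

/* Drop a reference; whoever drops the last one frees the page. */
static inline void demo_put_page(struct demo_ref_page *p)
{
	int old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	if (old == 1)
		p->freed = true;
}
```

Hiding the counter behind helpers like these is what allows the rule "do not touch the field directly" to be enforced by convention.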

_mapcount

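Some pages can be mapped into user space, and this member records how many times the page is mapped there. Likewise, it should be accessed through accessor functions such as page_mapcount(). Pages that are never mapped into user space (slab pages, for example) may reuse this field for their own purposes.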

virtual

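This is the page's kernel virtual address; the field exists only when WANT_PAGE_VIRTUAL is defined. Some memory (such as High Memory) is not permanently mapped into the kernel address space; for such pages virtual is NULL, and they must be mapped dynamically before the kernel can use them.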

Operations on pages include:

Getting pages: use the alloc_pages interface
/* alloc_pages() as it appeared in older kernels built with CONFIG_NUMA */
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
	return alloc_pages_current(gfp_mask, order);
}
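Here, gfp_mask specifies how the pages are to be obtained, and order specifies how many to allocate (2^order physically contiguous pages). The returned pointer refers to the first page.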

The functions __get_free_pages, alloc_page, and __get_free_page can also allocate pages, but they are all ultimately built on alloc_pages.
Freeing pages: use __free_pages, __free_page, free_pages, or free_page.
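Since allocations are sized in powers of two, a caller has to turn a byte count into an order. The following userspace sketch shows that calculation. The demo_ names and the 4 KiB page size are assumptions for illustration, and the helper only mirrors the idea of finding the smallest sufficient order.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages. */
#define DEMO_ORDER_PAGE_SIZE 4096UL

/* Number of contiguous pages in an allocation of the given order. */
static inline unsigned long demo_pages_for_order(unsigned int order)
{
	return 1UL << order;
}

/* Smallest order whose allocation holds at least size bytes. */
static inline unsigned int demo_order_for_size(unsigned long size)
{
	unsigned int order = 0;
	unsigned long span = DEMO_ORDER_PAGE_SIZE;

	/* Keep the loop well away from shifting span out of range. */
	assert(size > 0 && size <= (1UL << 30));
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Note the rounding: a request for 4097 bytes needs order 1 (two pages), and 8193 bytes needs order 2 (four pages), so odd sizes can waste close to half of the allocation.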

Some Questions

What are High Memory and Low Memory?

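In kernel programming, operating on a page sometimes leads to a NULL pointer problem; kmap() can be used to make sure this does not happen. The reason lies in how memory is laid out. On a 32-bit machine, the 4 GB virtual address space is split into 1 GB of kernel space (0xc000 0000 ~ 0xffff ffff) and 3 GB of user space (0x0000 0000 ~ 0xbfff ffff). The kernel can permanently map only as much physical memory as fits into its own share of the address space; that part is called Low Memory, and the physical memory beyond it is called High Memory.

[Figure: the 3 GB / 1 GB split between user space and kernel space]

High Memory is memory that is not permanently mapped into kernel space.

Low Memory is memory that is always mapped into kernel space; its contents can be accessed directly by dereferencing a pointer.

The High / Low Memory distinction is therefore a consequence of the split between kernel space and user space: the kernel's part of the address space is limited, and not all physical memory fits into it.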

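To access High Memory from kernel code, you must first use kmap() to map it into the kernel address space. Note that a kmap() mapping is itself a resource: it consumes kernel address space, so release it with kunmap() as soon as you are done.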

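kmap() ensures that the memory you want to access is reachable from kernel code. If the data is already in Low Memory, the function can still be called as a safeguard. Its source is as follows: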

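void *kmap(struct page *page)
{
    might_sleep();
    /* Low-memory pages are permanently mapped: return the address directly. */
    if (!PageHighMem(page))
        return page_address(page);
    /* High-memory pages need a kernel mapping to be set up first. */
    return kmap_high(page);
}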

What is an anonymous page?

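Pages backed by a file are called file pages in Linux, for example the pages of a process's code and data segments. During memory reclaim these pages only need their dirty data synced (clean pages can simply be discarded).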

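Anonymous pages are memory that an application allocates dynamically, such as the pages used by a process's heap and stack. They are likely to be accessed again, so they cannot simply be discarded during reclaim; they must be swapped out to a swap partition or swap file.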

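Linux's swap mechanism writes rarely accessed memory to disk and then frees it for processes that need it more. When the memory is accessed again, it is read back in from disk.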
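From user space, one direct way to obtain anonymous pages is an mmap() call with MAP_ANONYMOUS, which creates a mapping that has no file behind it. The sketch below assumes a Linux or other POSIX-like system; the demo_ helper names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of zero-filled memory with no backing file. */
static inline void *demo_map_anonymous(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from demo_map_anonymous(). */
static inline int demo_unmap(void *p, size_t len)
{
	return munmap(p, len);
}
```

Because such a mapping has no file to fall back on, its modified contents can only be preserved across reclaim by writing them to swap, which is the behavior described above for anonymous pages.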