Demystifying the OpenCloudOS Kernel Scheduler Features


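The scheduler has to cope with all kinds of extreme scenarios and workload models, and no single policy can cover them all. The kernel therefore builds a set of scheduler features into the scheduler, so that the best-suited policy can be selected for each workload. This makes the scheduler highly adaptable.

Scheduler feature analysis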

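Running cat /sys/kernel/debug/sched_features shows which scheduler features the current kernel supports and whether each one is enabled.

[Figure: output of cat /sys/kernel/debug/sched_features]

A feature shown with the NO_ prefix is disabled; a feature shown without the prefix is enabled.

Kernel source: kernel/sched/features.h. This header defines every feature the kernel supports.

Usage:

Enable a scheduler feature:

    echo WAKEUP_PREEMPTION > /sys/kernel/debug/sched_features

Disable a scheduler feature:

    echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched_features

GENTLE_FAIR_SLEEPERS

This feature caps the vruntime credit granted to a task that wakes up from sleep at 50% of sysctl_sched_latency, which reduces the scheduling latency seen by the other tasks. It is enabled by default. With the feature disabled, a woken task can obtain more run time, but the other tasks on the run queue will see larger scheduling latency.

    /*
     * Only give sleepers 50% of their service deficit. This allows
     * them to run sooner, but does not allow tons of sleepers to
     * rip the spread apart.
     */
    SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)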

    static void
    place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
    {
        u64 vruntime = cfs_rq->min_vruntime;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         *
         * Add one extra virtual time slice to the vruntime of a newly
         * created task, so that the new task is expected to wait until
         * the next scheduling period before it runs.
         */
        if (initial && sched_feat(START_DEBIT))
            vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
            /*
             * The kernel's default scheduling latency (also called the
             * scheduling period) when there are few tasks on the rq.
             */
            unsigned long thresh = sysctl_sched_latency;

            /*
             * Halve their sleep time's effect, to allow
             * for a gentler effect of sleepers:
             *
             * Cap the credit of a sleeping task at half the scheduling
             * latency. This keeps a task that slept for a long time from
             * getting an overly long time slice after wakeup, which would
             * cause large latency spikes for the other tasks on the queue.
             */
            if (sched_feat(GENTLE_FAIR_SLEEPERS))
                thresh >>= 1;

            vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        se->vruntime = max_vruntime(se->vruntime, vruntime);
    }
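The placement rule above can be tried out in user space. The following is only a minimal sketch, not kernel code: `place_vruntime`, `struct placement_features` and all parameter names are hypothetical, the feature switches are plain struct fields, and the subtraction is clamped at zero instead of relying on the kernel's wrapping arithmetic.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two feature bits used by place_entity(). */
struct placement_features {
    int start_debit;          /* models START_DEBIT */
    int gentle_fair_sleepers; /* models GENTLE_FAIR_SLEEPERS */
};

/*
 * Simplified model of the placement rule: returns the vruntime an entity
 * would be given. All times are in nanoseconds. Assumption: clamping at 0
 * replaces the kernel's wrapping u64 arithmetic.
 */
uint64_t place_vruntime(uint64_t min_vruntime, uint64_t se_vruntime,
                        uint64_t vslice_ns, uint64_t latency_ns,
                        int initial, struct placement_features feat)
{
    uint64_t vruntime = min_vruntime;

    /* New task: push it back by one virtual slice. */
    if (initial && feat.start_debit)
        vruntime += vslice_ns;

    /* Woken task: grant a credit of one latency, or half of it. */
    if (!initial) {
        uint64_t thresh = latency_ns;

        if (feat.gentle_fair_sleepers)
            thresh >>= 1;
        vruntime = vruntime > thresh ? vruntime - thresh : 0;
    }

    /* Never move an entity backwards in virtual time. */
    return se_vruntime > vruntime ? se_vruntime : vruntime;
}
```

With an assumed scheduling latency of 24 ms and min_vruntime at 100 ms, a long sleeper is placed at 88 ms when GENTLE_FAIR_SLEEPERS is on and at 76 ms when it is off, so the halved credit leaves the woken task a smaller head start over the tasks already on the queue.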

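START_DEBIT

START_DEBIT increases the vruntime of a newly created task so that it only gets to run in the next scheduling period, sharing time slices fairly with the other tasks. The goal is to stop a process from grabbing extra time slices by repeatedly calling fork + exec (which is close to an attack) and starving the other processes. It is enabled by default.

    /*
     * Place new tasks ahead so that they do not starve already running
     * tasks
     */
    SCHED_FEAT(START_DEBIT, true)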

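    static void
    place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
    {
        u64 vruntime = cfs_rq->min_vruntime;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         *
         * Add one extra virtual time slice to the vruntime of a newly
         * created task, so that the new task is expected to wait until
         * the next scheduling period before it runs.
         */
        if (initial && sched_feat(START_DEBIT))
            vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
            unsigned long thresh = sysctl_sched_latency;

            /*
             * Halve their sleep time's effect, to allow
             * for a gentler effect of sleepers:
             *
             * Cap the credit of a sleeping task at half the scheduling
             * latency. This keeps a task that slept for a long time from
             * getting an overly long time slice after wakeup, which would
             * cause large latency spikes for the other tasks on the queue.
             */
            if (sched_feat(GENTLE_FAIR_SLEEPERS))
                thresh >>= 1;

            vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        se->vruntime = max_vruntime(se->vruntime, vruntime);
    }

NEXT_BUDDY

next and last are two "back doors" that the scheduler leaves open so that certain tasks get a chance to be scheduled preferentially. NEXT_BUDDY controls whether the wakeup-preemption check unconditionally marks the woken task as the next buddy. It is disabled by default, in which case the task is marked only after the preemption-granularity check finds that it qualifies for preemption. Enabling it gives the woken task a chance to be considered first (only a chance; whether it is actually picked still depends on its virtual runtime), at the cost of extra time spent in pick next task.

    /*
     * Prefer to schedule the task we woke last (assuming it failed
     * wakeup-preemption), since its likely going to consume data we
     * touched, increases cache locality.
     */
    SCHED_FEAT(NEXT_BUDDY, false)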

    /*
     * Preempt the current task with a newly woken task if needed:
     *
     * next/last are picked whenever doing so is not too unfair. next and
     * last have the same priority in principle: the kernel compares next,
     * last and the first se taken from the red-black tree, so both have a
     * right to run first (which one actually runs depends on each
     * se->vruntime). All else being equal, next ranks slightly higher
     * than last.
     *
     * last: set mainly in check_preempt_wakeup(). If curr is preempted by
     *       pse, the kernel sets cfs_rq->last = curr, so that the next
     *       pick prefers the preempted task. Because the task was
     *       preempted, picking it again soon preserves locality.
     *
     * next: set mainly on task wakeup, on se dequeue and when an se yields
     *       voluntarily. The kernel sets cfs_rq->next = se, so that
     *       pick_next_task_fair() considers next together with last.
     *       next therefore means that, for scheduling-policy reasons,
     *       some task wants to run first.
     */
    static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
    {
        struct task_struct *curr = rq->curr;
        struct sched_entity *se = &curr->se, *pse = &p->se;
        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
        int scale = cfs_rq->nr_running >= sched_nr_latency;
        int next_buddy_marked = 0;

        if (unlikely(se == pse))
            return;

        /*
         * This is possible from callers such as attach_tasks(), in which we
         * unconditionally check_preempt_curr() after an enqueue (which may have
         * lead to a throttle). This both saves work and prevents false
         * next-buddy nomination below.
         */
        if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
            return;

        /*
         * With NEXT_BUDDY enabled, if the run queue holds many tasks
         * (8 or more) and this task is not newly created, it becomes the
         * next buddy (pse is marked as next buddy regardless of whether
         * the later checks pass).
         */
        if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
            set_next_buddy(pse);
            next_buddy_marked = 1;
        }
        ............
    }

LAST_BUDDY

LAST_BUDDY controls whether a preempted task is marked as the last buddy, so that the next pick_next_task considers it preferentially (with lower priority than next). It is enabled by default, which lets a preempted task get back on the CPU as soon as it is appropriate.

    /*
     * Prefer to schedule the task that ran last (when we did
     * wake-preempt) as that likely will touch the same data, increases
     * cache locality.
     */
    SCHED_FEAT(LAST_BUDDY, true)

    static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
    {
        ..................
        /*
         * Reaching this point means the task in rq->curr is about to be
         * preempted. With LAST_BUDDY enabled, se is stored in cfs_rq->last
         * so that later picks prefer it (with lower priority than
         * cfs_rq->next).
         */
        if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
            set_last_buddy(se);
        ..................
    }
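How next and last compete with the leftmost entity of the red-black tree can be sketched in user space. This is a simplified model under stated assumptions, not the kernel's pick_next_entity(): `struct entity`, `preempt_entity` and `pick_with_buddies` are hypothetical names, the skip buddy is ignored, and the wakeup granularity is passed in as a plain number rather than being scaled by task weight.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for a scheduling entity. */
struct entity {
    int64_t vruntime;
};

/*
 * Returns 1 when se is far enough behind curr in virtual time (more than
 * gran) to justify preempting it, 0 when it is behind but within gran,
 * and -1 when se is not behind curr at all.
 */
static int preempt_entity(const struct entity *curr,
                          const struct entity *se, int64_t gran)
{
    int64_t vdiff = curr->vruntime - se->vruntime;

    if (vdiff <= 0)
        return -1;
    if (vdiff > gran)
        return 1;
    return 0;
}

/*
 * Pick among the leftmost entity and the two buddies. A buddy wins unless
 * choosing it would be too unfair to the leftmost entity. next is checked
 * after last, so it overrides last when both qualify.
 */
const struct entity *pick_with_buddies(const struct entity *leftmost,
                                       const struct entity *last,
                                       const struct entity *next,
                                       int64_t gran)
{
    const struct entity *se = leftmost;

    if (last && preempt_entity(last, leftmost, gran) < 1)
        se = last;
    if (next && preempt_entity(next, leftmost, gran) < 1)
        se = next;
    return se;
}
```

In this model a buddy is only a preference: once its vruntime runs ahead of the leftmost entity by more than the granularity, the leftmost entity is picked anyway.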

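CACHE_HOT_BUDDY

CACHE_HOT_BUDDY makes load balancing take the cache affinity of the task being migrated into account. If the task is a preferred (next/last) task before migration, its local cache is probably still warm, so the kernel tries not to migrate it to another CPU. It is enabled by default.

    /*
     * Consider buddies to be cache hot, decreases the likelyness of a
     * cache buddy being migrated away, increases cache locality.
     */
    SCHED_FEAT(CACHE_HOT_BUDDY, true)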

    static int task_hot(struct task_struct *p, struct lb_env *env)
    {
        s64 delta;

        lockdep_assert_held(&env->src_rq->lock);

        if (p->sched_class != &fair_sched_class)
            return 0;

        if (unlikely(task_has_idle_policy(p)))
            return 0;

        /*
         * Buddy candidates are cache hot:
         *
         * If dst_rq (the rq the task would be migrated to) already has
         * tasks on it (it is not empty), cache hotness has to be
         * considered here: if p is the next/last task, it is not migrated.
         */
        if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
            (&p->se == cfs_rq_of(&p->se)->next ||
             &p->se == cfs_rq_of(&p->se)->last))
            return 1;

        if (sysctl_sched_migration_cost == -1)
            return 1;
        if (sysctl_sched_migration_cost == 0)
            return 0;

        delta = rq_clock_task(env->src_rq) - p->se.exec_start;

        return delta < (s64)sysctl_sched_migration_cost;
    }
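The decision made by task_hot() can be reproduced with plain values. The sketch below is a user-space model with hypothetical names (`struct hot_query`, `task_is_cache_hot`); it leaves out the scheduling-class and idle-policy checks and takes the clock readings as parameters.

```c
#include <stdint.h>

/* Hypothetical snapshot of the inputs that the hotness check looks at. */
struct hot_query {
    int cache_hot_buddy;        /* models the CACHE_HOT_BUDDY feature bit */
    unsigned int dst_nr_running;/* tasks already on the destination rq */
    int is_buddy;               /* task is cfs_rq->next or cfs_rq->last */
    int64_t migration_cost_ns;  /* -1: always hot, 0: never hot */
    uint64_t now_ns;            /* task clock of the source rq */
    uint64_t exec_start_ns;     /* when the task last started running */
};

/* Returns 1 when the task should be treated as cache hot, else 0. */
int task_is_cache_hot(const struct hot_query *q)
{
    /* Buddies are considered hot if the destination is not empty. */
    if (q->cache_hot_buddy && q->dst_nr_running && q->is_buddy)
        return 1;

    if (q->migration_cost_ns == -1)
        return 1;
    if (q->migration_cost_ns == 0)
        return 0;

    /* Otherwise the task is hot only if it ran very recently. */
    return (int64_t)(q->now_ns - q->exec_start_ns) < q->migration_cost_ns;
}
```

With a migration cost of 0.5 ms (an assumed value for this example), a buddy that last ran 1 ms ago is still reported as hot while the feature is on, and becomes a migration candidate once the feature is turned off.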

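WAKEUP_PREEMPTION

WAKEUP_PREEMPTION means that when a task is woken and placed on the run queue, a preemption check is made against cfs_rq->curr. If the conditions are met, the woken task preempts the task currently running on that queue. This gives woken tasks scheduling priority and reduces their scheduling latency. It is enabled by default.

    /*
     * Allow wakeup-time preemption of the current task:
     */
    SCHED_FEAT(WAKEUP_PREEMPTION, true)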

    static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
    {
        ............
        /*
         * Batch and idle tasks do not preempt non-idle tasks (their preemption
         * is driven by the tick):
         *
         * 1. Only SCHED_NORMAL tasks go through the wakeup check
         *    (SCHED_IDLE and SCHED_BATCH tasks cannot preempt on wakeup).
         * 2. Wakeup preemption is allowed only when WAKEUP_PREEMPTION is
         *    enabled.
         */
        if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
            return;
        ...........
    }

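HRTICK

The O(1) scheduler (the scheduler used by Linux kernels before 2.6.23) had many problems, one of which was scheduling precision. The O(1) scheduler used the system tick as its preemption checkpoint (leaving wakeup preemption aside): in every tick interrupt handler, the kernel checked whether the running task had used up its time slice and, if so, switched to the next task. From this point of view, tick precision determines scheduling latency. Early kernels used HZ=100, that is 100 ticks per second, so the theoretical scheduling latency was 10 ms per tick. That was perfectly acceptable when machines were slow. As hardware got faster and workloads demanded more responsive scheduling, HZ=100 was no longer enough, and the kernel changed the default to HZ=250 (some architectures even use 1000). This change has a cost: a larger HZ means more frequent ticks and more system overhead, because besides the scheduling check the tick also updates wall time, checks the timer list, updates process cputime, and so on. CFS therefore introduced the HRTICK mechanism: the kernel prepares one hrtimer per CPU and, in pick_next_task, programs the hrtimer according to the time at which the selected task's slice ends. Task-switch precision then no longer depends on the tick and becomes close to nanosecond precision (the exact figure depends on the precision of the hardware timer).

HRTICK does bring extra interrupt overhead (and frequent timer operations on enqueue/dequeue), which can be significant when there are many tasks. For throughput reasons, the kernel therefore disables it by default.

If better scheduling responsiveness is needed, enabling this switch is worth considering, but throughput may drop (responsiveness and throughput always pull in opposite directions).

    SCHED_FEAT(HRTICK, false)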

    /*
     * Use hrtick when:
     *  - enabled by features
     *  - hrtimer is actually high res
     */
    static inline int hrtick_enabled(struct rq *rq)
    {
        if (!sched_feat(HRTICK))
            return 0;
        if (!cpu_active(cpu_of(rq)))
            return 0;
        return hrtimer_is_hres_active(&rq->hrtick_timer);
    }

    #ifdef CONFIG_SCHED_HRTICK
    static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        SCHED_WARN_ON(task_rq(p) != rq);

        /*
         * If there is more than one task on the cfs rq, compute when the
         * selected task's slice ends from its slice (derived from the
         * task weight) and the time it has already consumed. This gives
         * close to nanosecond task-switch precision (depending on the
         * hardware precision of the timer).
         */
        if (rq->cfs.h_nr_running > 1) {
            u64 slice = sched_slice(cfs_rq, se);
            u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
            s64 delta = slice - ran;

            if (delta < 0) {
                if (rq->curr == p)
                    resched_curr(rq);
                return;
            }
            hrtick_start(rq, delta);
        }
    }
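The arithmetic behind the tick period and the hrtimer expiry is small enough to check by hand. The following user-space sketch uses hypothetical helper names (`tick_period_ns`, `hrtick_delay_ns`) and only mirrors the delta = slice - ran computation shown above; it does not program any timer.

```c
#include <stdint.h>

/* Length of one periodic tick in nanoseconds for a given HZ value. */
uint64_t tick_period_ns(unsigned int hz)
{
    return 1000000000ULL / hz;
}

/*
 * Time left in the current slice, in nanoseconds. A negative result means
 * the slice is already used up and the task should be rescheduled at once
 * instead of arming a timer.
 */
int64_t hrtick_delay_ns(uint64_t slice_ns, uint64_t sum_exec_ns,
                        uint64_t prev_sum_exec_ns)
{
    uint64_t ran = sum_exec_ns - prev_sum_exec_ns;

    return (int64_t)slice_ns - (int64_t)ran;
}
```

For HZ=100 the tick period is 10 ms, for HZ=250 it is 4 ms and for HZ=1000 it is 1 ms. A task with a 3 ms slice that has already run for 1 ms would have its timer set 2 ms ahead, independent of where the next periodic tick falls.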

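DOUBLE_TICK

DOUBLE_TICK is used together with HRTICK above. When the kernel uses HRTICK, there is no need to run check_preempt_tick in entity_tick. The kernel nevertheless provides an extra DOUBLE_TICK switch: if it is true, the scheduling check runs both in the HRTICK and in the periodic tick (hence the name DOUBLE); if it is false, the check runs only in the HRTICK (provided HRTICK is enabled). By default the kernel sets DOUBLE_TICK to false.

NONTASK_CAPACITY

NONTASK_CAPACITY means that the CPU time consumed by IRQs is subtracted when CPU capacity is computed. CPU capacity is what remains of a CPU after DL/RT tasks and IRQs are taken out. When CFS balances load, it has to account for the CPU consumed by higher-priority scheduling classes and by interrupts (capacity here is what is left after removing those, that is, the resources available to CFS); only then can it balance well.

NONTASK_CAPACITY makes CFS take the PELT utilization of IRQs into account (it only takes effect when CONFIG_HAVE_SCHED_AVG_IRQ is enabled). It is enabled by default.

    /*
     * Decrement CPU capacity based on time not spent running tasks
     */
    SCHED_FEAT(NONTASK_CAPACITY, true)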
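The effect of IRQ utilization on the capacity left for CFS can be sketched numerically. This is a user-space model under assumptions: `cfs_capacity` is a hypothetical name, utilization values use the same 0..max scale as the capacity, and a disabled NONTASK_CAPACITY is modeled simply by passing 0 as the IRQ utilization.

```c
/*
 * Capacity left for CFS after removing RT, DL and IRQ utilization.
 * max is the full capacity of the CPU (for example 1024).
 */
unsigned long cfs_capacity(unsigned long max, unsigned long rt_util,
                           unsigned long dl_util, unsigned long irq_util)
{
    unsigned long used, free_cap;

    /* IRQs alone saturate the CPU: report the minimum capacity. */
    if (irq_util >= max)
        return 1;

    used = rt_util + dl_util;
    if (used >= max)
        return 1;

    free_cap = max - used;

    /* Scale what is left by the share of time not spent in IRQs. */
    return free_cap * (max - irq_util) / max;
}
```

On a CPU with capacity 1024, RT utilization 100 and DL utilization 24, the result is 900 without IRQ load and 450 when IRQs take half of the CPU (irq_util = 512), so a CPU that is busy with interrupts looks proportionally smaller to the load balancer.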

    static void update_rq_clock_task(struct rq *rq, s64 delta)
    {
        ...........
    #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
        if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
            update_irq_load_avg(rq, irq_delta + steal);
    #endif
        ..........
    }

    static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
    {
        struct rq *rq = cpu_rq(cpu);
        unsigned long max = arch_scale_cpu_capacity(cpu);
        unsigned long used, free;
        unsigned long irq;

        irq = cpu_util_irq(rq);

        if (unlikely(irq >= max))
            return 1;

        used = READ_ONCE(rq->avg_rt.util_avg);
        used += READ_ONCE(rq->avg_dl.util_avg);

        if (unlikely(used >= max))
            return 1;

        free = max - used;

        /* When computing CPU capacity, remove the PELT utilization of IRQs */
        return scale_irq_capacity(free, irq, max);
    }

TTWU_QUEUE

TTWU_QUEUE means the kernel queues the task being woken on the remote CPU: the task is put on the remote CPU's wake_list, and an IPI tells that CPU to perform the wakeup. The purpose is to reduce the cache-line ping-pong caused by lock contention between cores, which benefits performance. The kernel developers also found that too many IPIs degrade system performance, so a later patch restricted TTWU_QUEUE based on whether the CPUs share an LLC. It is enabled by default.

    /*
     * Queue remote wakeups on the target CPU and process them
     * using the scheduler IPI. Reduces rq->lock contention/bounces.
     */
    SCHED_FEAT(TTWU_QUEUE, true)

    static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
    {
        struct rq *rq = cpu_rq(cpu);
        struct rq_flags rf;

    #if defined(CONFIG_SMP)
        /*
         * Too many interrupts degrade system performance, so the kernel uses
         * !cpus_share_cache(smp_processor_id(), cpu) to limit how often
         * TTWU_QUEUE triggers an interrupt. Seen from another angle, tasks
         * on the same LLC lose less to cache bounces caused by lock
         * contention, so restricting TTWU_QUEUE to CPUs on different LLCs
         * is reasonable.
         */
        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
            sched_clock_cpu(cpu); /* Sync clocks across CPUs */
            ttwu_queue_remote(p, cpu, wake_flags);
            return;
        }
    #endif

        rq_lock(rq, &rf);
        update_rq_clock(rq);
        ttwu_do_activate(rq, p, wake_flags, &rf);
        rq_unlock(rq, &rf);
    }

SIS_AVG_CPU

SIS_AVG_CPU was intended to cut search overhead based on avg_idle, but the mechanism has a problem: not searching for an idle CPU at all leaves the load relatively concentrated. The 5.12 kernel therefore removed it, and SIS_PROP now balances the search cost. In the 5.4 kernel the feature is disabled by default and should not be enabled.

    /*
     * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
     */
    SCHED_FEAT(SIS_AVG_CPU, false)

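    static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
    {
        struct sched_domain *this_sd;
        u64 avg_cost, avg_idle;
        u64 time, cost;
        s64 delta;
        int this = smp_processor_id();
        int cpu, nr = INT_MAX, si_cpu = -1;

        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
            return -1;

        /*
         * Due to large variance we need a large fuzz factor; hackbench in
         * particularly is sensitive here.
         */
        avg_idle = this_rq()->avg_idle / 512;
        avg_cost = this_sd->avg_scan_cost + 1;

        /*
         * SIS_AVG_CPU was meant to compare the rq's average idle time,
         * avg_idle, with the time spent in the for_each_cpu_wrap loop
         * below. If the cost is too high relative to the idle time, the
         * rest of the function is skipped to reduce search overhead.
         * In practice this is rather blunt: select_idle_cpu ends up not
         * searching at all, so the 5.12 kernel removed SIS_AVG_CPU.
         */
        if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
            return -1;
        ......................
    }

SIS_PROP

SIS_PROP limits the search overhead of select_idle_cpu by capping the maximum number of CPUs scanned. It is enabled by default. If system CPU utilization is low (no more than 50%) and the workload is sensitive to scheduling latency, disabling SIS_PROP is worth considering: the longer search lets a woken task find an idle CPU whenever possible and thus reduces scheduling latency. This may cost some cache locality, so whether to change the setting depends on the workload itself.

    SCHED_FEAT(SIS_PROP, true)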
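The scan budget that SIS_PROP derives can be sketched as a pure function. This is a user-space model with a hypothetical name (`sis_scan_limit`); it assumes the same inputs as the 5.4 code path (the rq's avg_idle divided by the fuzz factor 512 and the domain's average scan cost plus one) and returns INT_MAX to represent an unlimited scan when the feature is off.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Maximum number of CPUs to scan for an idle CPU.
 * span_weight is the number of CPUs in the LLC domain.
 */
int sis_scan_limit(int sis_prop, unsigned int span_weight,
                   uint64_t rq_avg_idle_ns, uint64_t avg_scan_cost_ns)
{
    uint64_t avg_idle = rq_avg_idle_ns / 512; /* large fuzz factor */
    uint64_t avg_cost = avg_scan_cost_ns + 1; /* avoid division by zero */
    uint64_t span_avg;

    if (!sis_prop)
        return INT_MAX;

    span_avg = (uint64_t)span_weight * avg_idle;
    if (span_avg > 4 * avg_cost)
        return (int)(span_avg / avg_cost);

    /* Never scan fewer than 4 CPUs. */
    return 4;
}
```

With 16 CPUs in the domain, an average scan cost of about 1000 ns and an rq avg_idle of 512000 ns, the budget is 16 CPUs; if avg_idle drops to 51200 ns, the budget falls to the floor of 4. A busier CPU therefore spends less time searching.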

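    static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
    {
        struct sched_domain *this_sd;
        u64 avg_cost, avg_idle;
        u64 time, cost;
        s64 delta;
        int this = smp_processor_id();
        int cpu, nr = INT_MAX, si_cpu = -1;

        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
            return -1;

        /*
         * Due to large variance we need a large fuzz factor; hackbench in
         * particularly is sensitive here.
         */
        avg_idle = this_rq()->avg_idle / 512;
        avg_cost = this_sd->avg_scan_cost + 1;

        /*
         * SIS_AVG_CPU was meant to compare the rq's average idle time,
         * avg_idle, with the time spent in the for_each_cpu_wrap loop
         * below. If the cost is too high relative to the idle time, the
         * rest of the function is skipped to reduce search overhead.
         * In practice this is rather blunt: select_idle_cpu ends up not
         * searching at all, so the 5.12 kernel removed SIS_AVG_CPU.
         */
        if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
            return -1;

        /*
         * SIS_PROP derives the maximum number of CPUs to scan from the
         * current CPU's avg_idle relative to the average scan cost.
         * The minimum is 4.
         */
        if (sched_feat(SIS_PROP)) {
            u64 span_avg = sd->span_weight * avg_idle;
            if (span_avg > 4*avg_cost)
                nr = div_u64(span_avg, avg_cost);
            else
                nr = 4;
        }
        ................
    }

WARN_DOUBLE_CLOCK

Emits a warning if update_rq_clock is called more than once in the same place (a useless update). It is disabled by default.

RT_PUSH_IPI

A lock-contention optimization for RT tasks. It is enabled by default.

    /*
     * In order to avoid a thundering herd attack of CPUs that are
     * lowering their priorities at the same time, and there being
     * a single CPU that has an RT task that can migrate and is waiting
     * to run, where the other CPUs will try to take that CPUs
     * rq lock and possibly create a large contention, sending an
     * IPI to that CPU and let that CPU push the RT task to where
     * it should go may be a better scenario.
     */
    SCHED_FEAT(RT_PUSH_IPI, true)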

    static void pull_rt_task(struct rq *this_rq)
    {
        int this_cpu = this_rq->cpu, cpu;
        bool resched = false;
        struct task_struct *p;
        struct rq *src_rq;
        int rt_overload_count = rt_overloaded(this_rq);

        if (likely(!rt_overload_count))
            return;

        /*
         * Match the barrier from rt_set_overloaded; this guarantees that if we
         * see overloaded we must also see the rto_mask bit.
         */
        smp_rmb();

        /* If we are the only overloaded CPU do nothing */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
            return;

    #ifdef HAVE_RT_PUSH_IPI
        /*
         * When the RT tasks on many CPUs are switched to CFS in a batch
         * (at the same instant) and only one CPU in the system has an RT
         * task that can be migrated, pull_rt_task() runs concurrently on
         * those CPUs (similar to a thundering herd) and causes heavy lock
         * contention. The kernel therefore added the RT_PUSH_IPI
         * mechanism: send an IPI to the target CPU and let it push the RT
         * task to a suitable CPU, which reduces concurrent contention
         * between cores.
         */
        if (sched_feat(RT_PUSH_IPI)) {
            tell_cpu_to_push(this_rq);
            return;
        }
    #endif
        .............
    }

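RT_RUNTIME_SHARE

An rt sched_group or the global RT bandwidth limits RT utilization, so that real-time tasks on a CPU cannot consume too much CPU time. RUNTIME SHARE lets a CPU that has used up its quota borrow some time from other CPUs, so that the RT tasks on that CPU can run longer. As a result, RT utilization on a single CPU may reach 100%. It is enabled by default.

LB_MIN

During load balancing, tasks with load < 16 are skipped, that is, they are not migrated. The feature is disabled by default, which means the kernel considers every task for load balancing. If the tasks in the system are all very light, enabling this feature is worth considering in order to avoid excessive migration.

    SCHED_FEAT(LB_MIN, false)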

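    static int detach_tasks(struct lb_env *env)
    {
        .............
        if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
            goto next;
        .............
    }

ATTACH_AGE_LOAD

When a task migrates between CPUs or between cgroups, the kernel's PELT computation becomes inaccurate: the PELT update timestamp on the new CPU differs slightly from the one on the old CPU, although the two clock_pelt values usually differ by no more than one tick. The ATTACH_AGE_LOAD feature was developed for this reason: on migration, the task's PELT signal is decayed using the prev cfs_rq, which makes it more accurate. It is enabled by default and should not be disabled.

WA_IDLE

WA_IDLE means that during the wake affine check (choosing a CPU with affinity to the waker), if the CPU that issued the wakeup is idle, the kernel considers moving the task to that CPU. It is enabled by default. If woken tasks should not be migrated frequently because of wake affinity, disabling it is an option, but it should generally stay on because it lets tasks make better use of CPU resources.

    SCHED_FEAT(WA_IDLE, true)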

    static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                           int this_cpu, int prev_cpu, int sync)
    {
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
            target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
            target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
        if (target == nr_cpumask_bits)
            return prev_cpu;

        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return target;
    }

WA_WEIGHT

WA_WEIGHT controls whether wake affine uses the CPU load of the waker CPU and the prev CPU as the criterion for the affine decision. It is enabled by default. To avoid load-based wake-affine decisions, disable it; wake affine then only considers idle CPUs.

    SCHED_FEAT(WA_WEIGHT, true)

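    static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                           int this_cpu, int prev_cpu, int sync)
    {
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
            target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
            target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
        if (target == nr_cpumask_bits)
            return prev_cpu;

        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return target;
    }

WA_BIAS

WA_BIAS builds on WA_WEIGHT above. During the weight calculation, the load of the prev CPU is given some extra weight, which makes the kernel lean towards the waker CPU. It is enabled by default. To stop the kernel from favoring the waker CPU, disable it.

UTIL_EST

The kernel's original PELT signal changes a great deal as it decays. While a task runs, its PELT load is large; after it has slept for a while, its PELT load becomes small. This swing causes problems for load balancing. For example, a task that has run on a CPU for a long time has a large PELT load; after it sleeps for a while, the load decays to a small value. When the task runs again, it needs some run time before the PELT load recovers, and during that time the task is treated as a small task. To solve this, the kernel introduced util_est in sched_avg. util_est tracks an exponentially smoothed utilization of the task that is not decayed during sleep, so periodic load balancing can use it to compute the remaining capacity of a CPU. This keeps a big task from being misjudged because of sleep decay, which would make load balancing inaccurate. UTIL_EST means that CPU capacity estimation uses the estimated utilization instead of the PELT utilization.

    /*
     * UtilEstimation. Use estimated CPU utilization.
     */
    SCHED_FEAT(UTIL_EST, true)

The kernel enables this feature by default.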
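The benefit of util_est for a task that has just slept can be illustrated with a few numbers. The sketch below is a user-space model, not the kernel implementation: `struct util_signals` and `task_util_estimate` are hypothetical names, and the model assumes the estimate is the larger of the decayed PELT value and the stored (enqueued or smoothed) estimate.

```c
/* Hypothetical snapshot of the utilization signals of one task. */
struct util_signals {
    unsigned long util_avg;     /* PELT utilization, decays during sleep */
    unsigned long est_enqueued; /* utilization sampled at the last dequeue */
    unsigned long est_ewma;     /* exponentially smoothed estimate */
};

/* Utilization used for placement decisions in this simplified model. */
unsigned long task_util_estimate(const struct util_signals *u,
                                 int util_est_enabled)
{
    unsigned long est;

    /* Feature off: only the decayed PELT value is visible. */
    if (!util_est_enabled)
        return u->util_avg;

    est = u->est_ewma > u->est_enqueued ? u->est_ewma : u->est_enqueued;
    return est > u->util_avg ? est : u->util_avg;
}
```

A task whose PELT utilization has decayed to 50 after a long sleep, but whose stored estimate is 400, is treated as a 400-unit task with the feature on and as a 50-unit task with it off, which is exactly the misjudgment the feature is meant to avoid.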

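The analysis in this article should give readers a picture of what each kernel scheduler feature does and when to use it, so that these switches can be used for targeted performance tuning of user workloads.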
