How Goroutines That Spend Too Long in a System Call Are Preempted: A Detailed Walkthrough

Let's continue the analysis of retake, at line 4380 of runtime/proc.go:

func retake(now int64) uint32 {
    ......
    for i := 0; i < len(allp); i++ { // walk over every p and preempt according to its status
        _p_ := allp[i]
        if _p_ == nil {
            // This can happen if procresize has grown
            // allp but not yet created new Ps.
            continue
        }

        // _p_.sysmontick is written by the sysmon monitor thread; it records the
        // syscall tick and run time that sysmon observed for this p.
        pd := &_p_.sysmontick
        s := _p_.status
        if s == _Psyscall { // preemption of a p that is in a system call
            // Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
            // _p_.syscalltick counts system calls; the worker thread increments it
            // after it finishes a system call.
            t := int64(_p_.syscalltick)
            if int64(pd.syscalltick) != t {
                // pd.syscalltick != _p_.syscalltick: this is no longer the system call
                // observed last time but a different one, so record tick and when afresh.
                pd.syscalltick = uint32(t)
                pd.syscallwhen = now
                continue
            }

            // pd.syscalltick == _p_.syscalltick: this is still the system call observed
            // earlier, so work out how long it has been running at least.

            // On the one hand we don't want to retake Ps if there is no other work to do,
            // but on the other hand we want to retake them eventually
            // because they can prevent the sysmon thread from deep sleep.
            // The p is preempted if any one of the three conditions below holds, otherwise not:
            // 1. the p's run queue has goroutines waiting to run
            // 2. there is no idle p (and no spinning m)
            // 3. at least 10 milliseconds have passed since the monitor thread recorded
            //    that the p's m was in this system call
            if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
                continue
            }
            // Drop allpLock so we can take sched.lock.
            unlock(&allpLock)
            // Need to decrement number of idle locked M's
            // (pretending that one more is running) before the CAS.
            // Otherwise the M from which we retake can exit the syscall,
            // increment nmidle and report deadlock.
            incidlelocked(-1)
            if atomic.Cas(&_p_.status, s, _Pidle) {
                ......
                _p_.syscalltick++
                handoffp(_p_) // find a new m to take over the p
            }
            incidlelocked(1)
            lock(&allpLock)
        } else if s == _Prunning { // ran for too long; this preemption path was analyzed earlier
            ......
        }
    }
    ......
}

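retake walks over every p and decides, from each p's status and how long it has been in that status, whether to preempt it. Preemption is only initiated for a p in the _Prunning or _Psyscall state.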
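
The tick bookkeeping that retake does for a p in a system call can be sketched on its own. This is only a toy model, not the runtime's code: the sysmonTick type and its observe method are hypothetical names, and times are plain nanosecond integers. It shows why a changed tick resets the clock while an unchanged tick lets the monitor measure how long the same system call has been observed.

```go
package main

import "fmt"

// sysmonTick is a toy stand-in for the per-p record kept by the monitor thread.
type sysmonTick struct {
	syscallTick uint32 // tick value seen at the previous observation
	syscallWhen int64  // time (ns) at which that tick value was first seen
}

// observe is a hypothetical helper. It returns how long (ns) the same system
// call has been observed, and false when the tick changed, meaning a new
// system call is being seen for the first time and the clock was reset.
func (pd *sysmonTick) observe(pTick uint32, now int64) (int64, bool) {
	if pd.syscallTick != pTick {
		pd.syscallTick = pTick
		pd.syscallWhen = now
		return 0, false
	}
	return now - pd.syscallWhen, true
}

func main() {
	pd := &sysmonTick{}
	_, same := pd.observe(1, 100)
	fmt.Println(same) // first sighting of tick 1
	d, same := pd.observe(1, 350)
	fmt.Println(d, same) // same system call, 250ns later
	_, same = pd.observe(2, 400)
	fmt.Println(same) // tick changed, clock reset
}
```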

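As the code above shows, a p in the _Psyscall state is preempted as soon as any one of the following conditions holds:

The p's run queue has goroutines waiting to run. This ensures that goroutines in the p's local run queue get scheduled promptly: the p is in a system call and cannot schedule the goroutines in its queue, so another worker thread has to take over the p to get them scheduled.
There is no idle p (and no spinning m). Every p is bound to a worker thread and busy running Go code, so the system is busy. The p that is sitting in a system call, and that the system call itself does not need, is taken away and handed to another worker thread so it can schedule other goroutines.
At least 10 milliseconds have passed since the monitor thread recorded that the p's m was in this system call. Once the system call has run this long, the p is preempted whether or not any goroutine actually needs scheduling. This keeps sysmon busy (sysmon checks retake's return value, and 0 means retake preempted nothing) so that it does not sleep for too long and lose timeliness.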
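
The three conditions are simply the negation of the skip test in retake. A minimal sketch, assuming the inputs are passed in as plain values (shouldRetake and its parameters are hypothetical names, not runtime functions):

```go
package main

import "fmt"

// syscallGrace is the 10ms threshold from the listing, in nanoseconds.
const syscallGrace = 10 * 1000 * 1000

// shouldRetake is a hypothetical helper. It negates the skip condition
//   runqempty && nmspinning+npidle > 0 && syscallwhen+10ms > now
// so it is true when any one of the three conditions holds.
func shouldRetake(runqEmpty bool, spinningMs, idlePs uint32, syscallWhen, now int64) bool {
	hasLocalWork := !runqEmpty                  // condition 1
	nobodyFree := spinningMs+idlePs == 0        // condition 2
	tooLong := syscallWhen+syscallGrace <= now  // condition 3
	return hasLocalWork || nobodyFree || tooLong
}

func main() {
	const ms = 1000 * 1000
	fmt.Println(shouldRetake(true, 0, 1, 0, 5*ms))  // nothing to do yet
	fmt.Println(shouldRetake(false, 0, 1, 0, 5*ms)) // local work is waiting
	fmt.Println(shouldRetake(true, 0, 0, 0, 5*ms))  // no idle p, no spinning m
	fmt.Println(shouldRetake(true, 0, 1, 0, 10*ms)) // 10ms have elapsed
}
```

Writing the test in the positive form makes it easier to see that the time limit alone is enough to trigger a retake.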

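When retake decides to preempt, it uses a cas to change the p's status and so claim the p (a cas is needed because the worker thread may have just returned from the system call and be claiming the p at the same moment). If the cas succeeds, it calls handoffp to find a new worker thread to take over the p. Let's look at handoffp, at line 1995 of runtime/proc.go: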

// Hands off P from syscall or locked M.
// Always runs without a P, so write barriers are not allowed.
//go:nowritebarrierrec
func handoffp(_p_ *p) {
    // handoffp must start an M in any situation where
    // findrunnable would return a G to run on _p_.

    // if it has local work, start it straight away
    // The run queue is not empty, so an m must be started to take over.
    if !runqempty(_p_) || sched.runqsize != 0 {
        startm(_p_, false)
        return
    }
    // if it has GC work, start it straight away
    // There is garbage-collection work to do, so an m must be started as well.
    if gcBlackenEnabled != 0 && gcMarkWorkAvailable(_p_) {
        startm(_p_, false)
        return
    }
    // no local work, check that there are no spinning/idle M's,
    // otherwise our help is not required
    // Every other p is running goroutines, so the system is busy and an m must be started.
    if atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) == 0 && atomic.Cas(&sched.nmspinning, 0, 1) { // TODO: fast atomic
        startm(_p_, true)
        return
    }
    lock(&sched.lock)
    if sched.gcwaiting != 0 { // GC is waiting to stop the world
        _p_.status = _Pgcstop
        sched.stopwait--
        if sched.stopwait == 0 {
            notewakeup(&sched.stopnote)
        }
        unlock(&sched.lock)
        return
    }
    ......
    if sched.runqsize != 0 { // the global run queue has work
        unlock(&sched.lock)
        startm(_p_, false)
        return
    }
    // If this is the last running P and nobody is polling network,
    // need to wakeup another M to poll network.
    // Not every p may go idle, because network read/write events still have to be watched.
    if sched.npidle == uint32(gomaxprocs-1) && atomic.Load64(&sched.lastpoll) != 0 {
        unlock(&sched.lock)
        startm(_p_, false)
        return
    }
    pidleput(_p_) // nothing to do: put the p on the global idle list
    unlock(&sched.lock)
}

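handoffp checks a series of conditions to decide whether a new worker thread must be started to take over _p_. If none is needed, it puts _p_ on the global idle p list (not a run queue). It calls startm to start a new worker thread for _p_ in the following cases:

The local run queue of _p_ or the global run queue has goroutines waiting to run.
The gc needs help finishing its mark work.
The system is busy: every other p is running its own goroutines (there is no idle p and no spinning m), so help is needed.
All other p's are idle; if network read/write events still need to be watched, a new m must be started to poll the network.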
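
The order of these checks matters, so here is a rough decision-table sketch of it. It is a toy under loud assumptions: the state struct, its field names, and decide are all hypothetical, locking and the atomic cas on nmspinning are left out, and nobodyPolls merely stands in for the lastpoll check in the listing.

```go
package main

import "fmt"

// state is a hypothetical snapshot of the scheduler fields handoffp inspects.
type state struct {
	localRunq   int  // goroutines in the p's local run queue
	globalRunq  int  // goroutines in the global run queue
	gcWork      bool // gc mark work is available
	spinningMs  int  // number of spinning m's
	idlePs      int  // number of idle p's
	totalPs     int  // gomaxprocs
	gcWaiting   bool // gc is waiting to stop the world
	nobodyPolls bool // stand-in for the "nobody is polling network" check
}

// decide returns what the toy handoff would do, checking in the same order
// as the listing above.
func decide(s state) string {
	switch {
	case s.localRunq > 0 || s.globalRunq > 0:
		return "start m"
	case s.gcWork:
		return "start m"
	case s.spinningMs+s.idlePs == 0:
		return "start spinning m"
	case s.gcWaiting:
		return "stop p for gc"
	case s.idlePs == s.totalPs-1 && s.nobodyPolls:
		return "start m"
	default:
		return "put p on idle list"
	}
}

func main() {
	fmt.Println(decide(state{localRunq: 1, idlePs: 1, totalPs: 4}))
	fmt.Println(decide(state{totalPs: 4}))
	fmt.Println(decide(state{idlePs: 3, totalPs: 4, nobodyPolls: true}))
	fmt.Println(decide(state{idlePs: 1, totalPs: 4}))
}
```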

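That completes the sysmon monitor thread's preemption of a p that is in a system call.

Preempting a goroutine that is in a system call really means taking away the p bound to its worker thread. A worker thread inside a system call does not need a p, but once it returns from the kernel to user space it must be bound to a p before it can run Go code.

Let's use the following example to see what a worker thread has to do after returning from a system call: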

package main

import (
    "fmt"
    "os"
)

func main() {
    fd, err := os.Open("./syscall.go") // guaranteed to make a system call
    if err != nil {
        fmt.Println(err)
        return // fd is nil on error, so there is nothing to close
    }
    fd.Close()
}

Tracing with gdb shows that main's call to os.Open eventually reaches Syscall6.

Let's look at Syscall6, at line 46 of syscall/asm_linux_amd64.s:

// func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2, err uintptr)
TEXT ·Syscall6(SB), NOSPLIT, $0-80
    CALL  runtime·entersyscall(SB)
    // Copy the arguments into registers per the Linux convention, then execute SYSCALL to enter the kernel.
    MOVQ  a1+8(FP), DI
    MOVQ  a2+16(FP), SI
    MOVQ  a3+24(FP), DX
    MOVQ  a4+32(FP), R10
    MOVQ  a5+40(FP), R8
    MOVQ  a6+48(FP), R9
    MOVQ  trap+0(FP), AX  // syscall entry: the system call number goes in AX
    SYSCALL  // enter the kernel
    // Back from the kernel: check the return value. Linux uses -1 ~ -4095 as error codes.
    CMPQ  AX, $0xfffffffffffff001
    JLS  ok6
    // The system call returned an error: prepare the return values of Syscall6.
    MOVQ  $-1, r1+56(FP)
    MOVQ  $0, r2+64(FP)
    NEGQ  AX
    MOVQ  AX, err+72(FP)
    CALL  runtime·exitsyscall(SB)
    RET
ok6:  // the system call succeeded
    MOVQ  AX, r1+56(FP)
    MOVQ  DX, r2+64(FP)
    MOVQ  $0, err+72(FP)
    CALL  runtime·exitsyscall(SB)
    RET

Syscall6 does three main things:

Calls runtime.entersyscall.
Executes the SYSCALL instruction to enter the system call.
Calls runtime.exitsyscall.
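
The error check between SYSCALL and exitsyscall in the listing can be mirrored in Go. This is a sketch only: decodeRaw is a hypothetical name, and uint64 is used to stand for the 64-bit AX and DX registers. It follows the CMPQ/JLS pair literally, so any raw value that is unsigned-greater than 0xfffffffffffff001 is treated as a negated error number.

```go
package main

import "fmt"

// decodeRaw is a hypothetical helper that mimics the assembly's return-value
// handling. ax and dx stand for the registers after SYSCALL returns.
func decodeRaw(ax, dx uint64) (r1, r2, errno uint64) {
	const maxOK = 0xfffffffffffff001 // JLS: unsigned lower-or-same means success
	if ax > maxOK {
		// Error path: r1 = -1, r2 = 0, err = -ax (the NEGQ in the listing).
		return ^uint64(0), 0, -ax
	}
	return ax, dx, 0
}

func main() {
	// A successful call returning file descriptor 3.
	fmt.Println(decodeRaw(3, 0))
	// A failed call: the kernel returned -2, which is ENOENT on Linux.
	r1, r2, errno := decodeRaw(^uint64(1), 0)
	fmt.Println(r1 == ^uint64(0), r2, errno)
}
```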

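exitsyscall handles the case where the p that the worker thread held before entering the system call has been taken away by the monitor thread.

Let's look at entersyscall, at line 2847 of runtime/proc.go: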

// Standard syscall entry used by the go syscall library and normal cgo calls.
//go:nosplit
func entersyscall() {
    reentersyscall(getcallerpc(), getcallersp())
}

func reentersyscall(pc, sp uintptr) {
    _g_ := getg() // the goroutine making the system call

    // Disable preemption because during this function g is in Gsyscall status,
    // but can have inconsistent g->sched, do not let GC observe it.
    _g_.m.locks++

    // Entersyscall must not call any function that might split/grow the stack.
    // (See details in comment above.)
    // Catch calls that might, by replacing the stack guard with something that
    // will trip any stack check and leaving a flag to tell newstack to die.
    _g_.stackguard0 = stackPreempt
    _g_.throwsplit = true

    // Leave SP around for GC and traceback.
    save(pc, sp) // save was analyzed earlier; it stores g's execution context (rsp, rbp, rip and so on)
    _g_.syscallsp = sp
    _g_.syscallpc = pc
    casgstatus(_g_, _Grunning, _Gsyscall)
    ......
    _g_.m.syscalltick = _g_.m.p.ptr().syscalltick
    _g_.sysblocktraced = true
    _g_.m.mcache = nil
    pp := _g_.m.p.ptr()
    pp.m = 0 // the p drops its binding to the m
    _g_.m.oldp.set(pp) // remember the p in oldp; on return from the system call this p is tried first
    _g_.m.p = 0 // the m drops its binding to the p
    atomic.Store(&pp.status, _Psyscall) // change the p's status; sysmon relies on it to preempt
    .....
    _g_.m.locks--
}

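entersyscall simply calls reentersyscall. reentersyscall first saves the current execution context into the sched field of the current g, then breaks the association between m and p, and finally sets the p's status to _Psyscall.

You may ask: since sysmon takes care of taking the p away, why does entersyscall break the association between m and p itself? Because once entersyscall has done so, sysmon does not need a lock or a cas on m.p to break that association.

Then why record the p that the worker thread held before entering the system call? So that after returning from the system call the worker thread can quickly find a usable p, without taking a lock to fetch an idle p from the global pidle list in sched.

Fine, the p gets recorded, but why move it into the m's oldp field?

Mainly to keep the meaning of m's p field clear: an m in a system call is in fact not bound to a p, and recording the p in m.p would blur what that field means.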
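
The unbinding and the role of oldp can be illustrated with toy structs. None of this is the runtime's real data layout: P, M, the status constants, and enterSyscall are hypothetical, and the goroutine state, save, and atomics are all left out. The point is only that after the call m.p is empty while m.oldp still remembers which p to try first on return.

```go
package main

import "fmt"

// Toy p statuses, named after the runtime's but otherwise unrelated to it.
const (
	pIdle = iota
	pRunning
	pSyscall
)

// P and M are toy stand-ins for the runtime's p and m.
type P struct {
	id     int
	status uint32
	m      *M
}

type M struct {
	p    *P // the p currently bound, nil while in a system call
	oldp *P // the p held before entering the system call
}

// enterSyscall mirrors the last few assignments of reentersyscall.
func enterSyscall(m *M) {
	pp := m.p
	pp.m = nil           // the p drops its binding to the m
	m.oldp = pp          // remember the p for the fast path on return
	m.p = nil            // the m drops its binding to the p
	pp.status = pSyscall // the monitor thread keys off this status
}

func main() {
	p := &P{id: 0, status: pRunning}
	m := &M{p: p}
	p.m = m

	enterSyscall(m)
	fmt.Println(m.p == nil, m.oldp == p, p.m == nil, p.status == pSyscall)
}
```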

Next, let's look at exitsyscall, at line 2931 of runtime/proc.go:

// The goroutine g exited its system call.
// Arrange for it to run on a cpu again.
// This is called only from the go syscall library, not
// from the low-level system calls used by the runtime.
//
// Write barriers are not allowed because our P may have been stolen.
//
//go:nosplit
//go:nowritebarrierrec
func exitsyscall() {
    _g_ := getg()
    ......
    oldp := _g_.m.oldp.ptr() // the p bound before entering the system call
    _g_.m.oldp = 0
    if exitsyscallfast(oldp) { // m and p were unbound before the system call, so a p must be bound now
        // Bound successfully; set up some state.
        ......
        // There's a cpu for us, so we can run.
        _g_.m.p.ptr().syscalltick++ // system call finished: bump syscalltick, which sysmon uses to tell whether it is still the same system call
        // We need to cas the status and scan before resuming...
        // casgstatus handles some GC-related work; all we need to know is that it sets g back to _Grunning.
        casgstatus(_g_, _Gsyscall, _Grunning)
        ......
        return
    }
    ......
    _g_.m.locks--

    // Call the scheduler.
    // No p could be bound: mcall switches to the g0 stack and runs exitsyscall0.
    mcall(exitsyscall0)
    ......
}

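Because the worker thread called entersyscall to unbind m and p before entering the system call, it must bind a p again after returning before it can continue running Go code. So exitsyscall first calls exitsyscallfast to try to bind a p. On success exitsyscall finishes and returns up the call chain; otherwise it calls mcall to switch to g0 and run exitsyscall0.

Let's look at exitsyscallfast, at line 3020 of runtime/proc.go: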

//go:nosplit
func exitsyscallfast(oldp *p) bool {
    _g_ := getg()
    ......
    // Try to re-acquire the last P.
    // Try the fast path: bind the p used before entering the system call.
    if oldp != nil && oldp.status == _Psyscall && atomic.Cas(&oldp.status, _Psyscall, _Pidle) {
        // The cas claimed the p, so the code below can operate on it without a lock.
        // There's a cpu for us, so we can run.
        wirep(oldp) // bind the p
        exitsyscallfast_reacquired()
        return true
    }

    // Try to get any other idle P.
    if sched.pidle != 0 {
        var ok bool
        systemstack(func() {
            ok = exitsyscallfast_pidle() // look for an idle p on the global idle list; needs a lock, so it is slower
            ......
        })
        if ok {
            return true
        }
    }
    return false
}

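exitsyscallfast first tries to bind the p used before entering the system call. That p's status is still _Psyscall at this point, and sysmon may be about to change it, so the status has to be changed with an atomic cas, which guarantees that only one thread's cas succeeds. A successful cas means the current thread has claimed the p, and the code that follows may operate on it freely. In exitsyscallfast that means calling wirep to associate the worker thread m with the p as soon as the p is claimed. If binding the old p fails, exitsyscallfast calls exitsyscallfast_pidle to get an idle p and bind that instead.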
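
The claim that only one cas can succeed is easy to check with sync/atomic. The sketch below is not runtime code: the status constants and the race function are hypothetical, and two ordinary goroutines play the roles of the returning worker thread and sysmon, both trying to move the same status word from pSyscall to pIdle.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Toy p statuses, named after the runtime's but otherwise unrelated to it.
const (
	pIdle uint32 = iota
	pRunning
	pSyscall
)

// race lets two goroutines compete to claim one p and returns how many won.
func race() int32 {
	status := pSyscall
	var winners int32
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // one plays the worker thread, the other sysmon
		wg.Add(1)
		go func() {
			defer wg.Done()
			if atomic.CompareAndSwapUint32(&status, pSyscall, pIdle) {
				atomic.AddInt32(&winners, 1) // this goroutine now owns the p
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	for i := 0; i < 1000; i++ {
		if n := race(); n != 1 {
			fmt.Println("unexpected number of winners:", n)
			return
		}
	}
	fmt.Println("exactly one winner in every round")
}
```

Whichever side loses simply sees the cas fail and takes its fallback path, which is why neither side needs a lock for this step.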

Let's look at wirep, at line 4099 of runtime/proc.go:

// wirep is the first step of acquirep, which actually associates the
// current M to _p_. This is broken out so we can disallow write
// barriers for this part, since we don't yet have a P.
//
//go:nowritebarrierrec
//go:nosplit
func wirep(_p_ *p) {
    _g_ := getg()
    ......
    // Assign in both directions to bind m and p.
    _g_.m.mcache = _p_.mcache
    _g_.m.p.set(_p_)
    _p_.m.set(_g_.m)
    _p_.status = _Prunning
}

Let's look at exitsyscallfast_pidle, at line 3083 of runtime/proc.go:

func exitsyscallfast_pidle() bool {
    lock(&sched.lock)
    _p_ := pidleget() // take a p from the global idle list
    if _p_ != nil && atomic.Load(&sched.sysmonwait) != 0 {
        atomic.Store(&sched.sysmonwait, 0)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    if _p_ != nil {
        acquirep(_p_)
        return true
    }
    return false
}

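If exitsyscallfast fails to bind the p used before the system call, it calls exitsyscallfast_pidle to take an idle p from the global idle p list and bind it. The call goes through systemstack(func()). systemstack takes one argument of type func(): it first switches to the g0 stack, then calls the function it was given (here a closure that contains the call to exitsyscallfast_pidle), and finally switches back to the original stack and returns.

Why switch to the g0 stack, that is, the system stack, to run this?

As a rule of thumb, when a function on the call chain carries the nosplit compiler directive, work that might need more stack is moved to the g0 stack. nosplit means the compiler inserts no stack-overflow check, so running on a non-g0 stack risks overflowing it. The g0 stack is the worker thread's system stack and is comparatively large, so runtime functions running on it do not each need an overflow check, which would otherwise hurt efficiency badly.

Why can binding the p used before the system call fail?

Because sysmon may have taken that p away and bound it to another worker thread.

As the code above shows, taking an idle p from the global idle p list requires a lock, and under heavy lock contention that step becomes slow. This is why exitsyscallfast first tries to bind the p it used before.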

Let's look at exitsyscall0, at line 3098 of runtime/proc.go:

// exitsyscall slow path on g0.
// Failed to acquire P, enqueue gp as runnable.
//
//go:nowritebarrierrec
func exitsyscall0(gp *g) {
    _g_ := getg()

    casgstatus(gp, _Gsyscall, _Grunnable)

    // The worker thread has no p, so the association between m and g must be dropped.
    dropg()
    lock(&sched.lock)
    var _p_ *p
    if schedEnabled(_g_) {
        _p_ = pidleget() // try once more to get an idle p
    }
    if _p_ == nil { // still no idle p
        globrunqput(gp) // put g on the global run queue
    } else if atomic.Load(&sched.sysmonwait) != 0 {
        atomic.Store(&sched.sysmonwait, 0)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    if _p_ != nil { // got a p
        acquirep(_p_) // bind the p
        // resume running g
        execute(gp, false) // Never returns.
    }
    if _g_.m.lockedg != 0 {
        // Wait until another thread schedules gp and so m again.
        stoplockedm()
        execute(gp, false) // Never returns.
    }
    stopm() // the worker thread goes to sleep until another thread wakes it

    // Woken up by another thread: run the schedule loop and start working again.
    schedule() // Never returns.
}

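A worker thread without a p cannot run goroutine code, so this step tries once more to get a p from the global idle list. If it finds one, it binds the p with acquirep and resumes the current goroutine with execute. Otherwise it puts the current goroutine on the global run queue for some other worker thread to schedule, and the current worker thread calls stopm and goes to sleep.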
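
The slow path boils down to "take an idle p if there is one, otherwise queue the goroutine". Below is a single-threaded toy version under stated assumptions: toySched and slowExit are hypothetical names, p's are plain integer ids, goroutines are strings, and the lock, the sysmon wakeup, and the locked-m case are all omitted.

```go
package main

import "fmt"

// toySched holds only the two lists that the slow path touches.
type toySched struct {
	idlePs     []int    // ids of idle p's (stand-in for the global idle list)
	globalRunq []string // runnable goroutines (stand-in for the global run queue)
}

// slowExit returns the id of the p the thread acquired. If no p is idle it
// queues g on the global run queue and returns -1, meaning the thread would
// now go to sleep.
func (s *toySched) slowExit(g string) int {
	if n := len(s.idlePs); n > 0 {
		p := s.idlePs[n-1]
		s.idlePs = s.idlePs[:n-1]
		return p // the thread binds this p and keeps running g
	}
	s.globalRunq = append(s.globalRunq, g)
	return -1
}

func main() {
	s := &toySched{idlePs: []int{2}}
	fmt.Println(s.slowExit("g1")) // an idle p is available
	fmt.Println(s.slowExit("g2")) // none left, so g2 is queued
	fmt.Println(s.globalRunq)
}
```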

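Finally, let's summarize how the two kinds of preemption differ, one caused by running too long and the other by spending too long in a system call:

For a goroutine that has run too long, the sysmon monitor thread first issues a preemption request. The worker thread responds to the request at a suitable point and suspends the preempted goroutine, and then calls schedule to go on scheduling other goroutines.
For a goroutine that has spent too long in a system call, the scheduler does not suspend it; it only takes away the p bound to it. The goroutine is suspended only when the worker thread returns from the system call and fails to bind a p.

That covers everything about the goroutine scheduler.

The above is only my own understanding and may not be entirely accurate; I hope it is of some help.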

