Android System Service Source Code Analysis (Watchdog Mechanism)

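Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/conconbenben/article/details/102711704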

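Preface

A watchdog is a mechanism for monitoring whether a system is running properly, implemented through a combination of hardware and software. Software that is running normally "feeds the dog" after executing specific instructions. If the watchdog receives no feed signal from the software within a set period, it considers the system faulty and either enters an interrupt handler or forces a system reset. Depending on the platform it runs on, a watchdog is either a hardware watchdog or a software watchdog.

Android's SystemServer is a very complex and critical process that hosts many system services, so the service threads running inside it need to be monitored. Android provides the Watchdog class as a software watchdog for these system service threads. Once it detects that a system service is stuck, Watchdog kills the SystemServer process.

The source code analyzed in this article is from Android Pie (http://androidxref.com/9.0.0_r3).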

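I. How Watchdog Works

Watchdog mainly monitors system services and key threads. It periodically posts a check task to the message queue of each monitored thread and verifies that the task completes within a specified time. If a check does not complete before the timeout, the thread is treated as blocked (for example, deadlocked): Watchdog records the event, dumps diagnostic information, and then kills the current SystemServer process. When Zygote, the parent process of SystemServer, receives the signal that SystemServer has died, it kills itself. Once Zygote's death is signaled to the init process, init kills all of Zygote's child processes, restarts Zygote, and the Framework exception is reported, so the whole phone is effectively restarted.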
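The principle can be illustrated outside Android with a minimal sketch. It assumes a plain java.util.concurrent single-thread executor as a stand-in for a Handler thread; HeartbeatProbe and isResponsive are hypothetical names used only for illustration and are not part of the Watchdog source.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the "post a task and wait for it" idea;
// this is not the AOSP implementation.
class HeartbeatProbe {
    /**
     * Posts a no-op task to the monitored single-threaded executor and reports
     * whether it ran within timeoutMillis. A worker that is stuck in an earlier
     * task never reaches the probe, so the wait times out.
     */
    static boolean isResponsive(ExecutorService monitored, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        monitored.execute(done::countDown);
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        System.out.println("idle worker responsive: " + isResponsive(worker, 500));

        // Simulate a stuck thread: the worker blocks until it is released.
        CountDownLatch release = new CountDownLatch(1);
        worker.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("blocked worker responsive: " + isResponsive(worker, 200));

        release.countDown();
        worker.shutdown();
    }
}
```

On an idle worker the probe returns quickly; when the worker is stuck in an earlier task, the probe only returns after the timeout, which is the condition Watchdog treats as a hang.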

II. Watchdog Source Code Analysis

In SystemServer, Watchdog is set up in two steps: init and start.

/frameworks/base/services/java/com/android/server/SystemServer.java:


private void startOtherServices() {
  ...
  traceBeginAndSlog("InitWatchdog");
  final Watchdog watchdog = Watchdog.getInstance();
  watchdog.init(context, mActivityManagerService);
  traceEnd();
  ...
  traceBeginAndSlog("StartWatchdog");
  Watchdog.getInstance().start();
  traceEnd();
  ...
}

The Watchdog class is a singleton:

frameworks/base/services/core/java/com/android/server/Watchdog.java:

public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }

    return sWatchdog;
}

The Watchdog constructor:


private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

    mOpenFdMonitor = OpenFdMonitor.create();

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

When the Watchdog instance is created, the main system threads (foreground, main, ui, i/o, display) are added to the mHandlerCheckers list as HandlerChecker objects.

An important class inside Watchdog is HandlerChecker:


/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) {
        mMonitors.add(monitor);
    }

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked.  This avoid having
            // to do a context switch to check the thread.  Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

    public boolean isOverdueLocked() {
        return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    }

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

    public Thread getThread() {
        return mHandler.getLooper().getThread();
    }

    public String getName() {
        return mName;
    }

    public String describeBlockedStateLocked() {
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                    + " on " + mName + " (" + getThread().getName() + ")";
        }
    }

    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}

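HandlerChecker is what Watchdog uses to detect whether the main system threads (foreground, main, ui, i/o, display) are blocked. It works through the MessageQueue of each Handler's Looper: the checker posts itself to the front of the queue and, if it is not run in time, the thread is considered stuck. A checker that has monitors registered also calls each monitor's monitor() method when it runs.

Watchdog is itself a thread class. Its run method is as follows: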
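The monitor loop in HandlerChecker.run() can be sketched in plain Java as follows. This is a hedged illustration, not AOSP code: it assumes a service whose monitor() simply acquires and releases its own lock, so the call blocks whenever another thread holds that lock. LockProbeService, MonitorRunner, and completesWithin are hypothetical names; only Monitor and monitor() mirror the listing above.

```java
import java.util.List;

// Hypothetical stand-in for the Monitor callback used by HandlerChecker.
interface Monitor {
    void monitor();
}

// Hypothetical service: monitor() probes the service lock.
class LockProbeService implements Monitor {
    private final Object mLock = new Object();

    Object lock() {
        return mLock;
    }

    @Override
    public void monitor() {
        // Returns immediately if the lock is free; blocks for as long as
        // another thread holds it (for example, in a deadlock).
        synchronized (mLock) {
            // Nothing to do: acquiring the lock is the whole check.
        }
    }
}

class MonitorRunner {
    /** Calls every monitor in order, like the loop in HandlerChecker.run(). */
    static void runAll(List<Monitor> monitors) {
        for (Monitor m : monitors) {
            m.monitor();
        }
    }

    /** Returns true if all monitors finished within timeoutMillis. */
    static boolean completesWithin(List<Monitor> monitors, long timeoutMillis) {
        Thread t = new Thread(() -> runAll(monitors), "monitor-runner");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

If a caller holds the lock returned by lock() while completesWithin runs, the runner thread stays blocked inside monitor() and the method returns false, which corresponds to the "Blocked in monitor" case reported by describeBlockedStateLocked().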

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }

            if (!fdLimitTriggered) {
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

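Watchdog's run method is an infinite loop. Each iteration goes through the following steps:

1. Iterate over all HandlerCheckers and call scheduleCheckLocked on each. Unless a check is already in flight, or the checker has no monitors and the target queue is idle (polling), this records the start time and posts the checker to the front of the thread's message queue.

2. Wait for CHECK_INTERVAL (30 s).

3. Evaluate the checkers with evaluateCheckerCompletionLocked, which iterates over all HandlerCheckers and returns the largest (worst) state among them.

4. Act on the evaluated state and update the waitedHalf flag:

    COMPLETED: all checks have returned, so the threads are healthy; waitedHalf is reset to false.

    WAITING: a check is still in flight but has waited less than half of the timeout (under 30 s); keep waiting.

    WAITED_HALF with waitedHalf set to false: a check has been blocked for at least half of the timeout (30 s or more); dump the system_server stack traces and set waitedHalf to true.

    OVERDUE: a check has been blocked for the full timeout (60 s or more); collect diagnostics and kill SystemServer so that the system restarts.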
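The state evaluation in steps 3 and 4 can be sketched as a small standalone class. This is an illustration under stated assumptions: the numeric constant values are assumed (chosen so that a larger number means a worse state, which is what makes "take the maximum" work), and CheckerStates, stateOf, and worstOf are hypothetical names rather than Watchdog methods.

```java
// Hypothetical standalone sketch of the checker state logic.
class CheckerStates {
    // Assumed values: ordered from healthiest to worst.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors getCompletionStateLocked(): classify one checker by elapsed time. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst (largest) state among all checkers. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000L; // the 60 s timeout described above
        int fg = stateOf(true, 0L, waitMax);        // finished its check
        int ui = stateOf(false, 35_000L, waitMax);  // blocked for 35 s
        System.out.println("overall state: " + worstOf(fg, ui));
    }
}
```

Because the overall result is the maximum, a single blocked thread is enough to move the whole system to WAITED_HALF or OVERDUE even when every other checker has completed.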

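Note: when the evaluated state is OVERDUE, the following happens before the system restarts:

a. Write an entry to the EventLog.

b. Dump the stacks of system_server, the phone process (if its pid is known), and the native processes returned by getInterestingNativePids(). The output is appended to the trace file if the halfway dump has already been taken.

c. Trigger sysrq 'w' so that the kernel dumps all blocked threads to the kernel log.

d. Trigger sysrq 'l' so that the kernel dumps backtraces on all CPUs.

e. Write the error to the dropbox on a separate thread, waiting at most 2 seconds for it to finish.

f. If an activity controller is registered and asks to keep waiting, skip the kill. Otherwise, if no debugger is attached and restart is allowed, log the kill and kill the SystemServer process.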
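The final decision in step f can be condensed into a pure function, which makes the order of the checks easier to see. This is a simplified sketch, not the AOSP code: KillDecision, Action, and decide are hypothetical names, a null controllerResult is an assumed way to model "no controller registered or the call failed", and the two debugger log branches are merged into one outcome.

```java
// Hypothetical condensation of the tail of Watchdog.run().
class KillDecision {
    enum Action { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_DISALLOWED, KILL }

    /**
     * @param controllerResult value returned by systemNotResponding(), or null
     *        when no controller is registered or the call failed (assumption)
     * @param debuggerWasConnected greater than 0 if a debugger is or was attached
     * @param allowRestart whether Watchdog is allowed to restart the system
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        // The controller is consulted first: >= 0 means "keep waiting".
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING;
        }
        // Never kill the process while it is being debugged.
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER;
        }
        if (!allowRestart) {
            return Action.SKIP_RESTART_DISALLOWED;
        }
        return Action.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));
        System.out.println(decide(1, 0, true));
        System.out.println(decide(-1, 2, true));
    }
}
```

In every non-kill outcome the real loop simply resets waitedHalf and starts the next iteration, so a hung system with a debugger attached keeps being re-evaluated rather than killed.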

Finally, to make the Watchdog workflow easier to understand, it is drawn as the following flowchart:

                                                                           

