Source:
https://blog.csdn.net/shift_wwx/article/details/81021257
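Preface:
The term "watchdog" comes from embedded systems. Anyone who has worked on low-level embedded code will know that, to keep a program on an MCU from running away because of interference, the MCU includes a dedicated timer circuit called a watchdog. While the MCU is working normally, it sends a signal to the watchdog at regular intervals, which is known as feeding the dog. If the program runs away and the MCU fails to feed the dog within the allotted time, the watchdog directly triggers a reset signal and the CPU restarts.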
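The feed-or-reset idea can be modeled in a few lines of plain Java. This is only a minimal sketch of the concept, not Android or MCU code: the class name SimpleWatchdogTimer and its methods are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```java
/** Hypothetical software model of a hardware watchdog timer (illustration only). */
class SimpleWatchdogTimer {
    private final long timeoutMillis;
    private long lastFedAt;

    SimpleWatchdogTimer(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = nowMillis;
    }

    /** "Feeding the dog": a healthy program calls this periodically. */
    void feed(long nowMillis) {
        lastFedAt = nowMillis;
    }

    /** True once no feed has arrived within the timeout; real hardware would reset here. */
    boolean shouldReset(long nowMillis) {
        return nowMillis - lastFedAt > timeoutMillis;
    }

    public static void main(String[] args) {
        SimpleWatchdogTimer wdt = new SimpleWatchdogTimer(1000, 0);
        wdt.feed(800);
        System.out.println(wdt.shouldReset(1500)); // last fed 700 ms ago, within the timeout
        System.out.println(wdt.shouldReset(2000)); // last fed 1200 ms ago, past the timeout
    }
}
```

The Android Watchdog discussed in this article inverts the roles: instead of waiting to be fed, it actively posts checks to the threads it watches and measures how long they take to respond.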
The Android framework contains a system service called Watchdog. It acts as a software watchdog that protects important system services. Its source code is located at:
frameworks/base/services/core/java/com/android/server/Watchdog.java
Flowchart:
Source code analysis (based on Android O):
1. Obtaining the Watchdog object
public class Watchdog extends Thread { }
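To analyze a piece of functionality, it helps to start from its origin. For a Java class, that means looking at the class definition and the constructor first.
The definition shows that Watchdog is a Thread, albeit a somewhat special one; what makes it special is explained in the SystemServer analysis later. For a Thread, the core logic lives in run(), and that most important part is analyzed last.
Next, the Watchdog constructor: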
private Watchdog() { }
The constructor is private; external code obtains the object through:
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}
External code obtains the Watchdog through getInstance(). Who these external callers are is covered later.
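The private constructor plus static accessor is the classic lazy singleton. The sketch below reproduces the pattern in plain Java so it can be run outside Android; the class name WatchdogLikeSingleton is hypothetical. One deliberate difference: the accessor here is synchronized, which the getInstance() quoted above is not. An unsynchronized check-then-create is only safe when the first call cannot race with another thread, so the sketch adds the lock rather than rely on that assumption.

```java
/** Hypothetical stand-in for the Watchdog singleton (illustration only). */
class WatchdogLikeSingleton extends Thread {
    private static WatchdogLikeSingleton sInstance;

    // Private constructor: the only way to get an instance is getInstance().
    private WatchdogLikeSingleton() {
        super("watchdog-sketch");
    }

    // "synchronized" is an addition of this sketch, not part of the quoted source.
    static synchronized WatchdogLikeSingleton getInstance() {
        if (sInstance == null) {
            sInstance = new WatchdogLikeSingleton();
        }
        return sInstance;
    }

    public static void main(String[] args) {
        // Both calls return the same object, and the thread is created but not yet started.
        System.out.println(getInstance() == getInstance());
        System.out.println(getInstance().isAlive());
    }
}
```

Creating the object does not start the thread; as in the article, start() is a separate, later step.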
2. The constructor
private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

    mOpenFdMonitor = OpenFdMonitor.create();

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
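(1) The first few objects are HandlerChecker instances, one each for FgThread, the main thread, UiThread, IoThread and DisplayThread.
- The main thread has its own main Looper.
- The other threads all derive from HandlerThread (see the separate article on Android HandlerThread).
(2) HandlerChecker takes three parameters: the Handler object, a name, and the maximum wait time before the watchdog triggers. HandlerChecker is covered in detail below.
(3) A BinderThreadMonitor and an OpenFdMonitor are created.
3. The init() function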
public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}
init() registers a receiver for the reboot broadcast; soft reboots are handled here.
4. The HandlerChecker class
public final class HandlerChecker implements Runnable { // a Runnable; its core is run()
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>(); // the core member
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) { // all registered monitors end up here
        mMonitors.add(monitor);
    }

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked. This avoid having
            // to do a context switch to check the thread. Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) { // the previous check has not finished yet, so do not post another one
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this); // post to the front of the MessageQueue; the target thread eventually calls run()
    }

    public boolean isOverdueLocked() {
        return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    }

    // Returns the state of the check started by scheduleCheckLocked(). If run() has not
    // finished yet, the result is WAITING, WAITED_HALF or OVERDUE depending on the elapsed time.
    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

    public Thread getThread() { // the thread that owns mHandler's Looper, i.e. the thread being checked
        return mHandler.getLooper().getThread();
    }

    public String getName() {
        return mName;
    }

    public String describeBlockedStateLocked() {
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                    + " on " + mName + " (" + getThread().getName() + ")";
        }
    }

    @Override
    public void run() { // posted by scheduleCheckLocked(); executes on the checked thread
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
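Points to note:
- HandlerChecker is a Runnable, so its core is the run() function.
- mHandler is what ties a checker to a thread: each Handler is bound to the Looper of the thread being checked (the main thread, or a HandlerThread-based thread), so run() executes on that thread.
- Registered monitors are stored through the member function addMonitor().
- scheduleCheckLocked() is the trigger point of a HandlerChecker.
Since HandlerChecker is a Runnable, its core is run(), whose job is to call monitor() on each registered monitor in turn. At every check interval (30 seconds by default), Watchdog uses scheduleCheckLocked() to poll the registered monitors.
scheduleCheckLocked() mainly matters for mMonitorChecker. A HandlerChecker that has no registered monitors and whose Looper is currently polling is not checked at all; UiThread, for example, spends most of its time polling.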
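The mechanism can be tried outside Android with a plain-Java sketch. Everything here is an assumption made for illustration: CheckerSketch is a hypothetical class, a single-thread ExecutorService stands in for the Handler and its Looper thread, and execute() appends the task to the end of the queue instead of the front as postAtFrontOfQueue() does. The point it demonstrates is the same, though: the completed flag only flips back to true if the target thread actually gets through every monitor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical plain-Java analogue of HandlerChecker (illustration only). */
class CheckerSketch implements Runnable {
    interface Monitor { void monitor(); }

    private final ExecutorService target;            // stands in for the Handler
    private final List<Monitor> monitors = new ArrayList<>();
    private boolean completed = true;

    CheckerSketch(ExecutorService target) { this.target = target; }

    void addMonitor(Monitor m) { monitors.add(m); }

    synchronized void scheduleCheck() {
        if (!completed) return;                       // a check is still in flight
        completed = false;
        target.execute(this);                         // run() will execute on the target thread
    }

    synchronized boolean isCompleted() { return completed; }

    @Override
    public void run() {
        for (Monitor m : monitors) m.monitor();       // stalls here if a monitor blocks
        synchronized (this) { completed = true; }
    }

    /** Returns {completed while a monitor is blocked, completed after it is released}. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        CheckerSketch checker = new CheckerSketch(worker);
        checker.addMonitor(() -> {
            try { release.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        try {
            checker.scheduleCheck();
            Thread.sleep(100);                        // give the worker time to enter the monitor
            boolean whileBlocked = checker.isCompleted();
            release.countDown();                      // "unblock" the monitored service
            worker.shutdown();
            worker.awaitTermination(5, TimeUnit.SECONDS);
            return new boolean[] { whileBlocked, checker.isCompleted() };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling CheckerSketch.demo() yields false for the first element (the monitor is stuck, so the check is still pending) and true for the second (the monitor returned, so run() finished).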
5. Other important functions
public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor); // monitors added through Watchdog all go into mMonitorChecker's mMonitors
    }
}

public void addThread(Handler thread) { // besides the threads added in the constructor, other threads can be registered for checking
    addThread(thread, DEFAULT_TIMEOUT);
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis)); // the new HandlerChecker is stored in mHandlerCheckers
    }
}

/**
 * Perform a full reboot of the system.
 */
void rebootSystem(String reason) { // called after the reboot broadcast is received
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}
See the comments in the code.
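Both addMonitor() and addThread() refuse to register anything once the Watchdog thread is alive. The following plain-Java sketch shows that guard in isolation; RegisterBeforeStartSketch and its methods are hypothetical names, and the thread body simply sleeps so that the thread stays alive for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical thread that only accepts registrations before start() (illustration only). */
class RegisterBeforeStartSketch extends Thread {
    private final List<Runnable> monitors = new ArrayList<>();

    RegisterBeforeStartSketch() {
        super("register-sketch");
        setDaemon(true);                  // do not keep the JVM alive for the demo
    }

    void addMonitor(Runnable monitor) {
        synchronized (this) {
            if (isAlive()) {              // same guard as in the quoted addMonitor()
                throw new RuntimeException("Monitors can't be added once the thread is running");
            }
            monitors.add(monitor);
        }
    }

    synchronized int monitorCount() { return monitors.size(); }

    @Override
    public void run() {
        try {
            Thread.sleep(Long.MAX_VALUE); // stay alive until interrupted
        } catch (InterruptedException e) {
            // interrupted: let the thread exit
        }
    }

    /** True if a registration attempted after start() is rejected. */
    static boolean addAfterStartIsRejected() {
        RegisterBeforeStartSketch t = new RegisterBeforeStartSketch();
        t.addMonitor(() -> { });          // accepted: the thread is not started yet
        t.start();
        try {
            t.addMonitor(() -> { });      // rejected: the thread is alive now
            return false;
        } catch (RuntimeException expected) {
            return true;
        } finally {
            t.interrupt();
        }
    }
}
```

Freezing the set of monitors before the loop starts means the checking thread never has to cope with the list changing underneath it.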
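6. Watchdog's run() function
With the basic functions covered, the most important part of any Thread remains run(). This is where Watchdog's detection mechanism lives, so it is analyzed in detail below.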
public void run() {
    boolean waitedHalf = false;
    File initialStack = null;
    while (true) {
        ...
        ...
        synchronized (this) {
            long timeout = CHECK_INTERVAL; // one check cycle every 30 seconds
            // Step 1: schedule a check on every HandlerChecker
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked(); // see HandlerChecker for details
            }
            ...
            ...
            // Step 2: wait 30 seconds for the results
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            ...
            ...
            if (!fdLimitTriggered) {
                // Step 3: evaluate the completion state of the HandlerCheckers
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) { // every monitor() returned normally
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) { // waited less than half the timeout (under 30 s by default); try again next round
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) { // waited at least half the timeout (30 s by default); give it one more chance
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval. Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        initialStack = ActivityManagerService.dumpStackTraces(true, pids,
                                null, null, getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // Step 4: the state is OVERDUE, i.e. the timeout (60 s by default) has been exceeded
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // Step 5: save logs and restart the system
        ...
        ...
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }
}
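The inner loop around wait(timeout) is worth isolating: wait() may return early (a notify, an interrupt, or a spurious wakeup), so the remaining time is recomputed from a fixed start point until the whole interval has elapsed. A minimal plain-Java sketch of that idiom follows; FullIntervalWaitSketch is a hypothetical name, and System.nanoTime() is used here in place of SystemClock.uptimeMillis(), which exists only on Android.

```java
/** Hypothetical helper showing the "wait out the full interval" idiom (illustration only). */
class FullIntervalWaitSketch {
    private final Object lock = new Object();

    /** Blocks for at least intervalMillis and returns the elapsed time in milliseconds. */
    long waitFullInterval(long intervalMillis) {
        long startNanos = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);     // may return before the timeout expires
                } catch (InterruptedException e) {
                    // The quoted code logs the interrupt and keeps waiting; so does this sketch.
                }
                // Recompute from the fixed start so early wakeups do not shorten the interval.
                remaining = intervalMillis - (System.nanoTime() - startNanos) / 1_000_000;
            }
        }
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = new FullIntervalWaitSketch().waitFullInterval(50);
        System.out.println("waited at least 50 ms: " + (elapsed >= 50));
    }
}
```

Waiting on the object's monitor rather than calling Thread.sleep() releases the lock for the duration of the wait, so other threads can still enter code synchronized on the same object in the meantime.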
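Flow summary:
(1) Once Watchdog is running, it loops forever, calling scheduleCheckLocked() on every HandlerChecker in turn.
(2) After scheduling the HandlerCheckers, it periodically checks for timeouts. The interval between checks is set by the CHECK_INTERVAL constant, which is 30 seconds.
(3) Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
- COMPLETED means the check has finished.
- WAITING and WAITED_HALF mean the check is still pending but has not timed out.
- OVERDUE means the check has timed out. The default timeout is 1 minute, but a monitored component can set its own through a parameter; for example, the HandlerChecker for PKMS has a 10-minute timeout.
(4) Save logs, including runtime stack traces. These logs are the key evidence for diagnosing Watchdog problems. If the system_server process has to be killed, signal 9 is sent to the current process (system_server). For details, see the companion article on Watchdog log analysis in Android.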
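The state evaluation in step (3) can be sketched as a pure function. stateOf() follows the getCompletionStateLocked() code quoted earlier. The numeric values of the constants and the worstOf() helper are assumptions: evaluateCheckerCompletionLocked() is not quoted in this article, and the sketch assumes it reports the most severe state found among all checkers.

```java
/** Hypothetical, Android-free model of the completion states (illustration only). */
class CompletionStateSketch {
    // Assumed ordering: a larger value means a more severe state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors getCompletionStateLocked(): classify one checker by how long it has been pending. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) return COMPLETED;
        if (latencyMillis < waitMaxMillis / 2) return WAITING;
        if (latencyMillis < waitMaxMillis) return WAITED_HALF;
        return OVERDUE;
    }

    /** Assumed aggregation: the overall state is the worst state of any checker. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) worst = Math.max(worst, s);
        return worst;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000;
        long tenMinutes = 600_000;
        // The same 60-second stall is OVERDUE for a 1-minute checker but
        // still WAITING for a checker with a 10-minute timeout.
        System.out.println(stateOf(false, 60_000, oneMinute) == OVERDUE);
        System.out.println(stateOf(false, 60_000, tenMinutes) == WAITING);
    }
}
```

With the default 1-minute timeout and a 30-second check interval, a stuck checker is typically seen as WAITED_HALF on one pass and OVERDUE on the next, which is where the "one more chance" behavior comes from.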
That completes the analysis of the Watchdog class itself. Next comes the external code that triggers or calls it.
7. Initializing Watchdog in SystemServer
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);
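Watchdog is a singleton, so there is only one instance in the system. getInstance() has in fact already been called during AMS initialization, as shown below.
The main purpose here is to call init().
8. AMS adds a monitor in its constructor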
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
Here is what the monitor mentioned throughout actually is:
public void monitor() {
    synchronized (this) { }
}
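This simply checks whether AMS is deadlocked: monitor() can only return once the AMS lock can be acquired. If the lock has still not been released after 30 seconds, Watchdog gives it one more chance; if it is still not released after that, Watchdog prints stack traces and kills the process that hosts AMS.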
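Why an empty synchronized block works as a health check can be shown in plain Java. In this sketch, LockProbeSketch, serviceLock and probe() are hypothetical names; a probe thread plays the role of the checked thread, and a bounded await plays the role of the Watchdog timeout. While another thread holds the lock, monitor() cannot return; once the lock is released, it returns immediately.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical service whose monitor() probes its own lock (illustration only). */
class LockProbeSketch {
    private final Object serviceLock = new Object();

    /** Same shape as the monitor() above: returns only once the lock can be acquired. */
    void monitor() {
        synchronized (serviceLock) { }
    }

    /** Runs monitor() on a probe thread; true if it returned within timeoutMillis. */
    boolean probe(long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { monitor(); done.countDown(); }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns {probe result while the lock is held elsewhere, probe result after release}. */
    static boolean[] demo() {
        LockProbeSketch service = new LockProbeSketch();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service.serviceLock) {      // simulates a thread stuck inside the service
                held.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await();                             // make sure the lock is really taken
            boolean whileHeld = service.probe(200);
            release.countDown();
            holder.join();
            boolean afterRelease = service.probe(2000);
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The check cannot tell a true deadlock from a lock that is merely held for a very long time; from the watchdog's point of view both look like a monitor that never returns.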
Also note addThread(): through it, AMS gets its own separate HandlerChecker in Watchdog, which is why logs like the following appear:
11-06 00:05:28.603 2197 2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)
11-06 00:05:28.603 2197 2441 W Watchdog: foreground thread stack trace:
11-06 00:05:28.603 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.handleCallback(Handler.java:790)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604 2197 2441 W Watchdog: ActivityManager stack trace:
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:106)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604 2197 2441 W Watchdog: *** GOODBYE
The log includes an "ActivityManager stack trace" section; for details, see the companion article on Watchdog log analysis in Android.
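The first line of that log can be rebuilt from the describeBlockedStateLocked() code quoted earlier. In the sketch below, describe() follows that method; the class name BlockedSubjectSketch is hypothetical, and joining the per-checker descriptions with ", " is an assumption inferred from the log line, since describeCheckersLocked() itself is not quoted in this article.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reconstruction of the "KILLING SYSTEM PROCESS" subject line (illustration only). */
class BlockedSubjectSketch {
    /** Describes one blocked checker; monitorClass is null when the handler itself is stuck. */
    static String describe(String checkerName, String threadName, String monitorClass) {
        if (monitorClass == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClass + " on " + checkerName + " (" + threadName + ")";
    }

    /** Assumed aggregation: descriptions of all blocked checkers joined with ", ". */
    static String subject(List<String> descriptions) {
        return String.join(", ", descriptions);
    }

    public static void main(String[] args) {
        List<String> blocked = new ArrayList<>();
        // The AMS monitor registered via addMonitor() runs on the foreground thread checker.
        blocked.add(describe("foreground thread", "android.fg",
                "com.android.server.am.ActivityManagerService"));
        // The AMS handler registered via addThread() has its own checker with no monitors.
        blocked.add(describe("ActivityManager", "ActivityManager", null));
        System.out.println(subject(blocked));
    }
}
```

Read this way, the log says two checkers were overdue at once: the foreground thread was stuck inside the AMS monitor, and the ActivityManager handler thread was stuck in one of its own messages.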
The monitor in WMS:
public void monitor() {
    synchronized (mWindowMap) { }
}
9. Starting Watchdog
traceBeginAndSlog("StartWatchdog");
Watchdog.getInstance().start();
traceEnd();
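Watchdog is started when AMS is ready, which in turn runs Watchdog's run() function.
Summary:
Watchdog in Android mainly monitors important system services such as AMS and WMS. When a registered monitor fails to return, or a checked thread times out, Watchdog is triggered, which may ultimately restart the system.
The companion article on Watchdog log analysis in Android walks through step five of run() (saving logs) with a real example.
Appendix: Watchdog class diagram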