I. Overview
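An Android device can end up in a state it cannot recover from on its own, for example when it fails to boot or when a persistent system process crashes in an endless loop. In these cases the phone is usually unusable, and non-technical users rarely know how to fix it other than sending it in for service. Android O adds a rescue mechanism to address exactly these problems, called RescueParty.
Roughly, RescueParty works like this: when apps running under the same uid crash repeatedly, RescueParty counts the crashes per uid, and once the count reaches the default threshold it escalates the rescue level. The rescue levels are: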
1.NONE
2.RESET_SETTINGS_UNTRUSTED_DEFAULTS
3.RESET_SETTINGS_UNTRUSTED_CHANGES
4.RESET_SETTINGS_TRUSTED_DEFAULTS
5.FACTORY_RESET
The final rescue level reboots the device into recovery mode.
Which scenarios trigger this mechanism?
1.a persistent app is stuck in a crash loop
2.we're stuck in a runtime restart loop.
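II. How RescueParty Works
Let's start with the first scenario, "a persistent app is stuck in a crash loop". I won't go through the whole app crash flow here; a sequence diagram summarizes it:
[Figure: sequence diagram of the app crash flow]
In O, RescueParty monitoring was added to the crashApplicationInner method in AppErrors.java: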
void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
        int callingPid, int callingUid) {
    ...
    // If a persistent app is stuck in a crash loop, the device isn't very
    // usable, so we want to consider sending out a rescue party.
    if (r != null && r.persistent) {
        RescueParty.notePersistentAppCrash(mContext, r.uid);
    }
    AppErrorResult result = new AppErrorResult();
    TaskRecord task;
    ...
}
/**
 * Take note of a persistent app crash. If we notice too many of these
 * events happening in rapid succession, we'll send out a rescue party.
 */
public static void notePersistentAppCrash(Context context, int uid) {
    if (isDisabled()) return;
    Threshold t = sApps.get(uid);
    if (t == null) {
        t = new AppThreshold(uid);
        sApps.put(uid, t);
    }
    if (t.incrementAndTest()) {
        t.reset();
        incrementRescueLevel(t.uid);
        executeRescueLevel(context);
    }
}
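The method first checks whether RescueParty is disabled. It is disabled in the following cases:
1.eng builds.
2.userdebug builds while USB is connected.
3.The system property persist.sys.disable_rescue (as read by getprop) is true.
In all other cases it stays enabled.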
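The three conditions above can be sketched as a single predicate. This is only an illustration: the class name RescueDisableSketch and its boolean parameters are made up for the example, whereas the real isDisabled() reads the build type, USB state, and system property itself.

```java
// Hypothetical stand-in for the disable check described above. The real
// method takes no parameters; the booleans here exist only to make the
// decision order easy to test in isolation.
final class RescueDisableSketch {
    static boolean isDisabled(boolean isEngBuild, boolean isUserdebugBuild,
            boolean usbActive, boolean disableProp) {
        if (isEngBuild) return true;                    // 1. eng builds
        if (isUserdebugBuild && usbActive) return true; // 2. userdebug + USB connected
        return disableProp;                             // 3. persist.sys.disable_rescue
    }

    public static void main(String[] args) {
        // A userdebug device plugged in over USB vs. a user build over USB.
        System.out.println(isDisabled(false, true, true, false));
        System.out.println(isDisabled(false, false, true, false));
    }
}
```

Passing the inputs in as parameters keeps the sketch runnable off-device; on a real build the answer depends on the device state at the moment of the crash.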
Threshold t = sApps.get(uid);
if (t == null) {
    t = new AppThreshold(uid);
    sApps.put(uid, t);
}
if (t.incrementAndTest()) {
    t.reset();
    incrementRescueLevel(t.uid);
    executeRescueLevel(context);
}
Let's first look at how sApps is defined:
/** Threshold for app crash loops */
private static SparseArray<Threshold> sApps = new SparseArray<>();
Each uid maps to one Threshold object. The code looks up the Threshold for the crashing uid; if none exists yet, it creates a new AppThreshold and stores it in sApps. It then calls incrementAndTest, which does the following:
/**
 * @return if this threshold has been triggered
 */
public boolean incrementAndTest() {
    final long now = SystemClock.elapsedRealtime();
    final long window = now - getStart();
    if (window > triggerWindow) {
        setCount(1);
        setStart(now);
        return false;
    } else {
        int count = getCount() + 1;
        setCount(count);
        EventLogTags.writeRescueNote(uid, count, window);
        Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
                + (window / 1000) + " sec");
        return (count >= triggerCount);
    }
}
private static class BootThreshold extends Threshold {
    public BootThreshold() {
        // We're interested in 5 events in any 300 second period; this
        // window is super relaxed because booting can take a long time if
        // forced to dexopt things.
        super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
    }

    @Override
    public int getCount() {
        return SystemProperties.getInt(PROP_RESCUE_BOOT_COUNT, 0);
    }

    @Override
    public void setCount(int count) {
        SystemProperties.set(PROP_RESCUE_BOOT_COUNT, Integer.toString(count));
    }

    @Override
    public long getStart() {
        return SystemProperties.getLong(PROP_RESCUE_BOOT_START, 0);
    }

    @Override
    public void setStart(long start) {
        SystemProperties.set(PROP_RESCUE_BOOT_START, Long.toString(start));
    }
}
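BootThreshold does not keep the start time and the count in fields; it stores them in system properties (not in a file). Its constructor passes the trigger values up to Threshold: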
super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
private abstract static class Threshold {
    ...
    public Threshold(int uid, int triggerCount, long triggerWindow) {
        this.uid = uid;
        this.triggerCount = triggerCount;
        this.triggerWindow = triggerWindow;
    }
    ...
}
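From this we can see that, for BootThreshold, triggerWindow is 300000 ms (300 seconds) and triggerCount is 5. AppThreshold, which the app-crash path uses, passes its own values to the same constructor; in the Android 8.0 source these are 5 events within 30 seconds.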
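If the time elapsed since the window started is greater than triggerWindow, the count is set to 1 and the start time is set to now, which opens a fresh window. Otherwise the count is incremented and saved. The method then checks whether the count has reached triggerCount (5): it returns true when count >= triggerCount, not only when the count exceeds it. After it returns true, the caller runs: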
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
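To make the counting rule concrete, here is a minimal in-memory sketch of the same window logic. The class name WindowThresholdSketch and the explicit nowMillis parameter are assumptions made for the example; the real code reads SystemClock.elapsedRealtime() and goes through getCount/setCount.

```java
// Simplified model of the incrementAndTest() rule discussed above.
// Time is passed in explicitly so the behavior can be checked without a device.
final class WindowThresholdSketch {
    private final int triggerCount;
    private final long triggerWindowMillis;
    private int count;
    private long start;

    WindowThresholdSketch(int triggerCount, long triggerWindowMillis) {
        this.triggerCount = triggerCount;
        this.triggerWindowMillis = triggerWindowMillis;
    }

    /** Returns true once triggerCount events fall inside one window. */
    boolean incrementAndTest(long nowMillis) {
        long window = nowMillis - start;
        if (window > triggerWindowMillis) {
            // Too long since the window opened: start a fresh window.
            count = 1;
            start = nowMillis;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    void reset() {
        count = 0;
        start = 0;
    }

    int getCount() {
        return count;
    }

    public static void main(String[] args) {
        WindowThresholdSketch t = new WindowThresholdSketch(5, 300_000L);
        // Five events, 10 seconds apart; only the fifth reaches the threshold.
        for (int i = 0; i < 5; i++) {
            long now = 1_000_000L + i * 10_000L;
            System.out.println("event " + (i + 1) + " -> " + t.incrementAndTest(now));
        }
    }
}
```

Note that the window is measured from the first event of the current window, not from the previous event, so slow but steady crashes can still add up as long as they stay inside one window.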
Let's look at reset, incrementRescueLevel, and executeRescueLevel in turn:
public void reset() {
    setCount(0);
    setStart(0);
}
This sets both the count and the start time to 0.
/**
 * Escalate to the next rescue level. After incrementing the level you'll
 * probably want to call {@link #executeRescueLevel(Context)}.
 */
private static void incrementRescueLevel(int triggerUid) {
    final int level = MathUtils.constrain(
            SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
            LEVEL_NONE, LEVEL_FACTORY_RESET);
    SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));
    EventLogTags.writeRescueLevel(level, triggerUid);
    PackageManagerService.logCriticalInfo(Log.WARN, "Incremented rescue level to "
            + levelToString(level) + " triggered by UID " + triggerUid);
}
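This code reads the current level, adds 1, clamps the result to the range LEVEL_NONE to LEVEL_FACTORY_RESET, and stores it back in a system property.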
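The effect of the clamp is easiest to see by escalating more times than there are levels. The sketch below is an assumption-laden illustration: a HashMap stands in for SystemProperties, the key demo.rescue_level is invented, and the numeric values 0 to 4 for the level constants are assumed rather than taken from the snippets above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of "read level, add 1, clamp, store".
final class RescueLevelSketch {
    static final int LEVEL_NONE = 0;          // assumed numeric values
    static final int LEVEL_FACTORY_RESET = 4;
    static final String PROP_LEVEL = "demo.rescue_level"; // hypothetical key

    // Stand-in for the system property store.
    private final Map<String, String> props = new HashMap<>();

    static int constrain(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    int getLevel() {
        return Integer.parseInt(props.getOrDefault(PROP_LEVEL, "0"));
    }

    /** Escalates one step and returns the new level; never passes the top level. */
    int incrementRescueLevel() {
        int level = constrain(getLevel() + 1, LEVEL_NONE, LEVEL_FACTORY_RESET);
        props.put(PROP_LEVEL, Integer.toString(level));
        return level;
    }

    public static void main(String[] args) {
        RescueLevelSketch sketch = new RescueLevelSketch();
        for (int i = 0; i < 6; i++) {
            System.out.println("escalation " + (i + 1) + " -> level "
                    + sketch.incrementRescueLevel());
        }
    }
}
```

Because of the clamp, further triggers after the last level keep requesting the last level instead of overflowing into an undefined one.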
private static void executeRescueLevel(Context context) {
    final int level = SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE);
    if (level == LEVEL_NONE) return;
    Slog.w(TAG, "Attempting rescue level " + levelToString(level));
    try {
        executeRescueLevelInternal(context, level);
        EventLogTags.writeRescueSuccess(level);
        PackageManagerService.logCriticalInfo(Log.DEBUG,
                "Finished rescue level " + levelToString(level));
    } catch (Throwable t) {
        final String msg = ExceptionUtils.getCompleteMessage(t);
        EventLogTags.writeRescueFailure(level, msg);
        PackageManagerService.logCriticalInfo(Log.ERROR,
                "Failed rescue level " + levelToString(level) + ": " + msg);
    }
}
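This first reads the current level and returns early if it is NONE. Otherwise it calls executeRescueLevelInternal and logs whether the attempt succeeded or failed. Here is what executeRescueLevelInternal does: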
private static void executeRescueLevelInternal(Context context, int level) throws Exception {
    switch (level) {
        case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
            resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS);
            break;
        case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
            resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_CHANGES);
            break;
        case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
            resetAllSettings(context, Settings.RESET_MODE_TRUSTED_DEFAULTS);
            break;
        case LEVEL_FACTORY_RESET:
            RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
            break;
    }
}
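Each level applies a different remedy. There are four active levels: three settings-reset modes and, last, a factory reset. The three settings resets all go through resetAllSettings: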
private static void resetAllSettings(Context context, int mode) throws Exception {
    // Try our best to reset all settings possible, and once finished
    // rethrow any exception that we encountered
    Exception res = null;
    final ContentResolver resolver = context.getContentResolver();
    try {
        Settings.Global.resetToDefaultsAsUser(resolver, null, mode, UserHandle.USER_SYSTEM);
    } catch (Throwable t) {
        res = new RuntimeException("Failed to reset global settings", t);
    }
    for (int userId : getAllUserIds()) {
        try {
            Settings.Secure.resetToDefaultsAsUser(resolver, null, mode, userId);
        } catch (Throwable t) {
            res = new RuntimeException("Failed to reset secure settings for " + userId, t);
        }
    }
    if (res != null) {
        throw res;
    }
}
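The structure of resetAllSettings is a "best effort, then report" pattern: every step runs even if an earlier one failed, and only the most recent failure is rethrown at the end. A small standalone sketch of that pattern follows; BestEffortSketch and runAll are illustrative names, not part of the framework.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of the pattern used by resetAllSettings above:
// keep going after a failure, remember it, and rethrow once all steps ran.
final class BestEffortSketch {
    /** Runs every step; if any failed, throws an exception for the last failure. */
    static void runAll(List<Runnable> steps) throws Exception {
        Exception res = null;
        for (int i = 0; i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (Throwable t) {
                // Later failures overwrite earlier ones, as in the original code.
                res = new RuntimeException("Step " + i + " failed", t);
            }
        }
        if (res != null) {
            throw res;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Runnable> steps = new ArrayList<>();
        steps.add(() -> log.add("global"));
        steps.add(() -> { throw new IllegalStateException("boom"); });
        steps.add(() -> log.add("secure"));
        try {
            runAll(steps);
        } catch (Exception e) {
            System.out.println("reported: " + e.getMessage());
        }
        System.out.println("completed steps: " + log);
    }
}
```

The trade-off is that only one failure is surfaced to the caller; the others are lost unless they are logged where they occur.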
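resetAllSettings makes a best effort to reset every setting it can for the mode that corresponds to the level; readers who are interested can dig into the details. The last level is different: it calls the rebootPromptAndWipeUserData method of the RecoverySystem class, which reboots the device into recovery mode. I won't walk through the whole flow; here is the call stack: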
"Binder:1313_18@9485" prio=5 tid=0xbe nid=NA waiting
java.lang.Thread.State: WAITING
blocks Binder:1313_18@9485
waiting for android.ui@9431 to release lock on <0x2562> (a com.android.server.power.PowerManagerService$4)
at java.lang.Object.wait(Object.java:-1)
at com.android.server.power.PowerManagerService.shutdownOrRebootInternal(PowerManagerService.java:2802)
locked <0x2562> (a com.android.server.power.PowerManagerService$4)
at com.android.server.power.PowerManagerService.-wrap35(PowerManagerService.java:-1)
at com.android.server.power.PowerManagerService$BinderService.reboot(PowerManagerService.java:4483)
at android.os.PowerManager.reboot(PowerManager.java:969)
at com.android.server.RecoverySystemService$BinderService.rebootRecoveryWithCommand(RecoverySystemService.java:193)
locked <0x25e1> (a java.lang.Object)
at android.os.RecoverySystem.rebootRecoveryWithCommand(RecoverySystem.java:1146)
at android.os.RecoverySystem.bootCommand(RecoverySystem.java:925)
at android.os.RecoverySystem.rebootPromptAndWipeUserData(RecoverySystem.java:855)
at com.android.server.RescueParty.executeRescueLevelInternal(RescueParty.java:190)
at com.android.server.RescueParty.executeRescueLevel(RescueParty.java:166)
at com.android.server.RescueParty.notePersistentAppCrash(RescueParty.java:126)
at com.android.server.am.AppErrors.crashApplicationInner(AppErrors.java:343)
at com.android.server.am.AppErrors.crashApplication(AppErrors.java:322)
at com.android.server.am.ActivityManagerService.handleApplicationCrashInner(ActivityManagerService.java:14621)
at com.android.server.am.ActivityManagerService.handleApplicationCrash(ActivityManagerService.java:14603)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:79)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3011)
at android.os.Binder.execTransact(Binder.java:677)
The call eventually reaches the lowLevelReboot method of PowerManagerService.
III. What RescueParty Monitors
As stated at the beginning of this article, two scenarios trigger the mechanism:
- a persistent app is stuck in a crash loop
- we're stuck in a runtime restart loop.
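The first scenario was covered in the previous section: it fires when a persistent app crashes repeatedly. Now for the second one, "we're stuck in a runtime restart loop".
This monitors whether the system keeps restarting over and over. Here is how boot events are tracked: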
private void startBootstrapServices() {
    ...
    // Now that we have the bare essentials of the OS up and running, take
    // note that we just booted, which might send out a rescue party if
    // we're stuck in a runtime restart loop.
    RescueParty.noteBoot(mSystemContext);
    // Manages LEDs and display backlight so we need it to bring up the display.
    traceBeginAndSlog("StartLightsService");
    ...
}
When system_server starts, startBootstrapServices calls noteBoot. Let's look at noteBoot next:
/**
 * Take note of a boot event. If we notice too many of these events
 * happening in rapid succession, we'll send out a rescue party.
 */
public static void noteBoot(Context context) {
    if (isDisabled()) return;
    if (sBoot.incrementAndTest()) {
        sBoot.reset();
        incrementRescueLevel(sBoot.uid);
        executeRescueLevel(context);
    }
}
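This should look familiar by now: it also counts events within a time window and escalates the rescue level once the count reaches the default threshold. The last level, again, is rebooting into recovery.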
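One detail worth illustrating is why the boot counter cannot live in an ordinary field: every restart begins with a fresh process, so the count has to be kept in a store that outlives it. The sketch below simulates this with a Map passed in from outside. The keys demo.boot_count and demo.boot_start and the class BootLoopSketch are hypothetical; the trigger values 5 and 300000 ms are the BootThreshold values shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Simulated boot-loop detection. The Map plays the role of a store that
// survives restarts; each call to noteBoot models one system_server start.
final class BootLoopSketch {
    static final String KEY_COUNT = "demo.boot_count"; // hypothetical key
    static final String KEY_START = "demo.boot_start"; // hypothetical key
    static final int TRIGGER_COUNT = 5;
    static final long TRIGGER_WINDOW_MS = 300_000L;

    /** Returns true when this boot completes a loop of TRIGGER_COUNT boots. */
    static boolean noteBoot(Map<String, String> store, long nowMillis) {
        int count = Integer.parseInt(store.getOrDefault(KEY_COUNT, "0"));
        long start = Long.parseLong(store.getOrDefault(KEY_START, "0"));
        if (nowMillis - start > TRIGGER_WINDOW_MS) {
            // Window expired: this boot opens a new one.
            store.put(KEY_COUNT, "1");
            store.put(KEY_START, Long.toString(nowMillis));
            return false;
        }
        count++;
        if (count >= TRIGGER_COUNT) {
            // Threshold reached: reset the counters, as noteBoot does.
            store.put(KEY_COUNT, "0");
            store.put(KEY_START, "0");
            return true;
        }
        store.put(KEY_COUNT, Integer.toString(count));
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        // Five restarts, 40 seconds apart, all within one 300-second window.
        for (int i = 0; i < 5; i++) {
            long now = 400_000L + i * 40_000L;
            System.out.println("boot " + (i + 1) + " -> " + noteBoot(store, now));
        }
    }
}
```

Restarts that are spaced further apart than the window never accumulate, which is why a device that merely reboots a few times a day does not trigger a rescue.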
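IV. Summary
RescueParty counts, within a time window, how often a persistent process crashes or the system restarts. If the events keep recurring, it escalates through the rescue levels one at a time. The last level reboots into recovery mode, where the user can choose to wipe the data and rescue a phone that cannot otherwise recover.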