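Suppose there are 20 service instances, and within one minute only 8 of them keep sending heartbeats. Should the eureka server evict the other 12 instances that sent no heartbeat?
In this situation it is quite likely that the eureka server itself has a network failure and the services are fine. The network of the machine hosting the eureka server is broken, so the heartbeats from those services cannot get through, and the eureka server's local registry never sees the heartbeats updated.
In fact the eureka server enters a self-preservation mode, and while it stays in that mode it does not evict any service instance.
The registry's evict() method is driven by EvictionTask, a scheduled task that runs every 60s. It checks whether a service instance has failed; if an instance has failed and has stopped sending heartbeats, it is evicted.
1. Inside evict(), the first check is whether the number of heartbeats in the last minute is below the number of heartbeats I expect per minute. If it is, no service instance is cleaned up at all.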
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    public void evict(long additionalLeaseMs) {
        logger.debug("Running the evict task");
        // Self-preservation mode: return immediately without evicting any service instance
        if (!isLeaseExpirationEnabled()) {
            logger.debug("DS: lease expiration is currently disabled.");
            return;
        }
        // We collect first all expired items, to evict them in random order. For large eviction sets,
        // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
        // the impact should be evenly distributed across all applications.
        List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
        for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
            Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
            if (leaseMap != null) {
                for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                    Lease<InstanceInfo> lease = leaseEntry.getValue();
                    // Call Lease.isExpired() to check whether this instance's lease has expired, i.e. whether the instance is treated as failed
                    if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                        expiredLeases.add(lease);
                    }
                }
            }
        }
        // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
        // triggering self-preservation. Without that we would wipe out full registry.
        int registrySize = (int) getLocalRegistrySize();
        int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
        int evictionLimit = registrySize - registrySizeThreshold;
        int toEvict = Math.min(expiredLeases.size(), evictionLimit);
        if (toEvict > 0) {
            logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
            Random random = new Random(System.currentTimeMillis());
            for (int i = 0; i < toEvict; i++) {
                // Pick a random item (Knuth shuffle algorithm)
                int next = i + random.nextInt(expiredLeases.size() - i);
                Collections.swap(expiredLeases, i, next);
                Lease<InstanceInfo> lease = expiredLeases.get(i);
                String appName = lease.getHolder().getAppName();
                String id = lease.getHolder().getId();
                EXPIRED.increment();
                logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
                internalCancel(appName, id, false);
            }
        }
    }
}
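The second half of evict() caps how many expired leases may be removed in one pass and picks them at random. The sketch below isolates that arithmetic, assuming a threshold factor of 0.85; the class and method names (EvictionLimitSketch, evictionLimit, pickToEvict) are made up for illustration and are not part of eureka.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not eureka code: mirrors the limit and random-pick logic of evict().
final class EvictionLimitSketch {

    // Upper bound on evictions in one pass: registrySize - (int) (registrySize * threshold).
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    // Picks min(expired.size(), limit) items at random using a partial Knuth shuffle.
    // The input list is copied so the caller's list is left untouched.
    static <T> List<T> pickToEvict(List<T> expired, int limit, Random random) {
        List<T> copy = new ArrayList<>(expired);
        int toEvict = Math.max(0, Math.min(copy.size(), limit));
        for (int i = 0; i < toEvict; i++) {
            // Swap a random element from the not-yet-picked tail into position i.
            int next = i + random.nextInt(copy.size() - i);
            Collections.swap(copy, i, next);
        }
        return new ArrayList<>(copy.subList(0, toEvict));
    }

    public static void main(String[] args) {
        // 20 registered instances with a 0.85 factor allow at most 3 evictions per pass.
        System.out.println(evictionLimit(20, 0.85));
    }
}
```

With 20 registered instances and 12 expired leases, this gives a limit of 3, so even when lease expiration is enabled a single pass would remove only 3 of the 12.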
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public boolean isLeaseExpirationEnabled() {
        if (!isSelfPreservationModeEnabled()) {
            // The self preservation mode is disabled, hence allowing the instances to expire.
            return true;
        }
        // numberOfRenewsPerMinThreshold is the minimum number of heartbeats expected per minute
        // getNumOfRenewsInLastMin() returns the total number of heartbeats in the last minute
        return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
    }
}
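To make the check concrete, here is a minimal standalone sketch of the same boolean logic applied to the opening scenario. It assumes each instance sends 2 heartbeats per minute and that the threshold works out to 34 for 20 instances (20 * 2 * 0.85); SelfPreservationCheckSketch and its parameters are illustrative names, not eureka API.

```java
// Hypothetical standalone version of the check, with the inputs passed in explicitly.
final class SelfPreservationCheckSketch {

    // Returns true when expired leases may be evicted, false when self-preservation blocks eviction.
    static boolean isLeaseExpirationEnabled(boolean selfPreservationEnabled,
                                            int renewsPerMinThreshold,
                                            long renewsInLastMin) {
        if (!selfPreservationEnabled) {
            // Self-preservation switched off: always allow leases to expire.
            return true;
        }
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        // 8 of 20 instances still heartbeating: 8 * 2 = 16 renewals against a threshold of 34.
        System.out.println(isLeaseExpirationEnabled(true, 34, 16));  // false: eviction is blocked
        // All 20 instances heartbeating: 40 renewals.
        System.out.println(isLeaseExpirationEnabled(true, 34, 40));  // true: eviction is allowed
    }
}
```

Note that the comparison is strict: a minute with exactly 34 renewals against a threshold of 34 also blocks eviction.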
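2. How is the expected number of heartbeats per minute calculated?
(1) When the eureka server starts, it initializes this value once.
EurekaBootStrap is the bootstrap class that performs the initialization at startup.
registry.openForTraffic(applicationInfoManager, registryCount);
This call initializes numberOfRenewsPerMinThreshold, the number of heartbeats I expect per minute. Before it, syncUp() is called to copy the registry from a neighboring eureka server node; any service instance that is not yet registered locally gets registered locally.
The number of service instances pulled from the other eureka server, registryCount, is recorded and used as the initial instance count of this eureka server.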
protected void initEurekaServerContext() throws Exception {
    // earlier code omitted...
    // Copy the registry from a neighboring eureka server node
    int registryCount = registry.syncUp();
    registry.openForTraffic(applicationInfoManager, registryCount);
}
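The expected heartbeat count is computed as instance count * 2 * 0.85, and the factor 2 is actually hard-coded (0.85 is the default value of serverConfig.getRenewalPercentThreshold()).
Suppose you have 20 service instances, each sending a heartbeat every 30 seconds. One instance should then send 2 heartbeats per minute, so the number of heartbeats I expect to receive in 1 minute is 20 * 2 = 40.
Instance count * 2 * 0.85 = 20 * 2 * 0.85 = 34, so with 20 service instances at least 34 heartbeats per minute are expected. This is the minimum number of heartbeats per minute, computed from the current number of service instances.
A problem the hard-coding can cause: the default heartbeat interval is one every 30 seconds, but what if I change it to one heartbeat every 10 seconds? Then count * 2 here is wrong.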
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
        // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
        this.expectedNumberOfRenewsPerMin = count * 2;
        // Initialize numberOfRenewsPerMinThreshold: instance count * 2 * 0.85
        this.numberOfRenewsPerMinThreshold =
                (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
        logger.info("Got " + count + " instances from neighboring DS node");
        logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
        this.startupTime = System.currentTimeMillis();
        if (count > 0) {
            this.peerInstancesTransferEmptyOnStartup = false;
        }
        DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
        boolean isAws = Name.Amazon == selfName;
        if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
            logger.info("Priming AWS connections for all replicas..");
            primeAwsReplicas(applicationInfoManager);
        }
        logger.info("Changing status to UP");
        applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
        super.postInit();
    }
}
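The arithmetic and the hard-coding concern can be checked with a small sketch. It compares the count * 2 formula with a variant that derives the per-minute factor from the heartbeat interval. The interval-aware variant and every name here (RenewThresholdSketch, renewalIntervalSeconds) are assumptions for illustration, not the code shown in these notes.

```java
// Hypothetical comparison of the hard-coded formula with an interval-aware one.
final class RenewThresholdSketch {

    // Formula as in openForTraffic(): assumes every instance renews twice per minute.
    static int hardCodedThreshold(int instanceCount, double renewalPercentThreshold) {
        int expectedRenewsPerMin = instanceCount * 2;
        return (int) (expectedRenewsPerMin * renewalPercentThreshold);
    }

    // Assumed alternative: derive renewals per minute from the configured heartbeat interval.
    static int intervalAwareThreshold(int instanceCount, int renewalIntervalSeconds,
                                      double renewalPercentThreshold) {
        double expectedRenewsPerMin = instanceCount * (60.0 / renewalIntervalSeconds);
        return (int) (expectedRenewsPerMin * renewalPercentThreshold);
    }

    public static void main(String[] args) {
        System.out.println(hardCodedThreshold(20, 0.85));          // 34
        System.out.println(intervalAwareThreshold(20, 30, 0.85));  // 34, same as above at 30s
        System.out.println(intervalAwareThreshold(20, 10, 0.85));  // 102 at a 10s interval
    }
}
```

With a 10-second interval, 20 healthy instances send 120 heartbeats per minute. If 12 of them lose connectivity, the remaining 8 still send 48, which is above the hard-coded threshold of 34, so self-preservation would not trigger even though 60% of the heartbeats are gone.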
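(2) Registration, cancellation, failure
The expected number of heartbeats per minute depends on the number of service instances, which keeps changing as instances come online, go offline, or fail. On registration, the expected heartbeats per minute is increased by 2. When a service goes offline (cancel), it is decreased by 2.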
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    /**
     * Registers a new instance with a given duration.
     *
     * @see com.netflix.eureka.lease.LeaseManager#register(java.lang.Object, int, boolean)
     */
    public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
        // The lease does not exist and hence it is a new registration
        synchronized (lock) {
            if (this.expectedNumberOfRenewsPerMin > 0) {
                // Since the client wants to cancel it, reduce the threshold (1 for 30 seconds, 2 for a minute)
                // (this comment mirrors the one in cancel(); for a new registration the count below is increased by 2)
                this.expectedNumberOfRenewsPerMin = this.expectedNumberOfRenewsPerMin + 2;
                this.numberOfRenewsPerMinThreshold =
                        (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
            }
        }
        logger.debug("No previous lease information found; it is new registration");
    }
}
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public boolean cancel(final String appName, final String id,
                          final boolean isReplication) {
        if (super.cancel(appName, id, isReplication)) {
            replicateToPeers(Action.Cancel, appName, id, null, null, isReplication);
            synchronized (lock) {
                if (this.expectedNumberOfRenewsPerMin > 0) {
                    // Since the client wants to cancel it, reduce the threshold (1 for 30 seconds, 2 for a minute)
                    this.expectedNumberOfRenewsPerMin = this.expectedNumberOfRenewsPerMin - 2;
                    this.numberOfRenewsPerMinThreshold =
                            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
                }
            }
            return true;
        }
        return false;
    }
}
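Note: when an instance fails and is evicted, I could not find any code that updates the expected heartbeat count. This looks like a bug. If many service instances go down through failure and are evicted, the expected heartbeats per minute does not decrease while the actual number of instances does, so the actual heartbeat count drops. If a fairly large number of failed instances are evicted automatically, the eureka server may quickly enter self-preservation mode.
Once the actual heartbeat count is lower than the expected heartbeat count, no more service instances are evicted.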
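The drift described above can be reproduced with a toy model. This is only a sketch under stated assumptions: each live instance sends exactly 2 heartbeats per minute, the threshold factor is 0.85, and ThresholdDriftSketch with all of its methods is a hypothetical stand-in for the registry bookkeeping, not eureka code.

```java
// Hypothetical model of the expected-heartbeat bookkeeping in register/cancel/evict.
final class ThresholdDriftSketch {
    private final double renewalPercentThreshold;
    private int liveInstances;
    private int expectedRenewsPerMin;
    private int renewsPerMinThreshold;

    ThresholdDriftSketch(int initialInstances, double renewalPercentThreshold) {
        this.renewalPercentThreshold = renewalPercentThreshold;
        this.liveInstances = initialInstances;
        this.expectedRenewsPerMin = initialInstances * 2;
        recomputeThreshold();
    }

    private void recomputeThreshold() {
        renewsPerMinThreshold = (int) (expectedRenewsPerMin * renewalPercentThreshold);
    }

    // Registration: expected heartbeats per minute + 2.
    void register() {
        liveInstances++;
        if (expectedRenewsPerMin > 0) {
            expectedRenewsPerMin += 2;
            recomputeThreshold();
        }
    }

    // Normal shutdown: expected heartbeats per minute - 2.
    void cancel() {
        liveInstances--;
        if (expectedRenewsPerMin > 0) {
            expectedRenewsPerMin -= 2;
            recomputeThreshold();
        }
    }

    // Failure eviction as described in the note: the instance goes away, the expectation stays.
    void evictWithoutUpdate() {
        liveInstances--;
    }

    int threshold() {
        return renewsPerMinThreshold;
    }

    // Assumption: every live instance delivers exactly 2 heartbeats per minute.
    int actualRenewsPerMin() {
        return liveInstances * 2;
    }

    boolean leaseExpirationEnabled() {
        return renewsPerMinThreshold > 0 && actualRenewsPerMin() > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        ThresholdDriftSketch registry = new ThresholdDriftSketch(20, 0.85);
        for (int i = 0; i < 3; i++) {
            registry.evictWithoutUpdate();
        }
        // 17 live instances send 34 heartbeats, threshold is still 34: eviction is now blocked.
        System.out.println(registry.leaseExpirationEnabled());  // false
    }
}
```

In this model, evicting just 3 of 20 instances through failure is enough to stop further eviction, because 34 actual heartbeats no longer exceed the stale threshold of 34. Had the same 3 instances gone through cancel(), the threshold would have dropped to 28 and eviction would stay enabled.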
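(3) Periodic update
The registry runs a scheduled task, every 15 minutes by default, that counts the service instances. If the number of service instances pulled from other eureka servers is greater than the current number of instances, the value is recalculated. This mainly keeps it in sync with the other eureka servers.
The probability of this being triggered is very small.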
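3. How is the actual number of heartbeats in the last minute calculated?
Focus on the big picture and set the details aside. We came across MeasuredRate when reading the source earlier, and it could not be understood at that point, because much of the code only makes sense as part of one mechanism. Every heartbeat that arrives updates this MeasuredRate, which is used to compute the actual number of heartbeats per minute.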
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    private final MeasuredRate renewsLastMin;
    protected AbstractInstanceRegistry(EurekaServerConfig serverConfig, EurekaClientConfig clientConfig, ServerCodecs serverCodecs) {
        // Build the MeasuredRate that counts the actual heartbeats in the last minute; the sample interval passed in is 60s
        this.renewsLastMin = new MeasuredRate(1000 * 60 * 1);
    }
    public boolean renew(String appName, String id, boolean isReplication) {
        // Each heartbeat received increments currentBucket by one
        renewsLastMin.increment();
        // remaining renew logic omitted...
    }
    protected void postInit() {
        // During startup initialization the eureka server starts the MeasuredRate timer (delayed, repeating execution)
        renewsLastMin.start();
        // remaining code omitted...
    }
}
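Take a good look at the MeasuredRate class. The technical highlight: how do you keep an in-memory count per minute, that is, count the heartbeats received in each minute?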
public class MeasuredRate {
    private static final Logger logger = LoggerFactory.getLogger(MeasuredRate.class);
    // Total number of heartbeats in the last minute
    private final AtomicLong lastBucket = new AtomicLong(0);
    // Incremented on every heartbeat received, reset to zero every 60 seconds
    private final AtomicLong currentBucket = new AtomicLong(0);
    private final long sampleInterval;
    private final Timer timer;
    private volatile boolean isActive;
    /**
     * @param sampleInterval in milliseconds
     */
    public MeasuredRate(long sampleInterval) {
        this.sampleInterval = sampleInterval;
        this.timer = new Timer("Eureka-MeasureRateTimer", true);
        this.isActive = false;
    }
    public synchronized void start() {
        if (!isActive) {
            /* public void schedule(TimerTask task, long delay, long period)
             * Runs for the first time after delay milliseconds, then repeats every period milliseconds.
             * delay: milliseconds to wait before the first execution
             * period: interval between successive executions
             */
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    try {
                        // Zero out the current bucket.
                        // 1. Copy the value of currentBucket into lastBucket
                        // 2. Reset currentBucket to zero
                        lastBucket.set(currentBucket.getAndSet(0));
                    } catch (Throwable e) {
                        logger.error("Cannot reset the Measured Rate", e);
                    }
                }
            }, sampleInterval, sampleInterval);
            isActive = true;
        }
    }
    public synchronized void stop() {
        if (isActive) {
            timer.cancel();
            isActive = false;
        }
    }
    /**
     * Returns the count in the last sample interval.
     * Here that is the total number of heartbeats in the last minute.
     */
    public long getCount() {
        return lastBucket.get();
    }
    /**
     * Increments the count in the current sample interval.
     * Called once for every heartbeat received.
     */
    public void increment() {
        currentBucket.incrementAndGet();
    }
}
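The two-bucket idea is easier to follow without the timer. The sketch below is a simplified stand-in: it replaces the Timer with a manual rollOver() call so the behavior is deterministic. TwoBucketCounterSketch and rollOver() are hypothetical names, and only the bucket swap is taken from MeasuredRate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, timer-free variant of the two-bucket counter used by MeasuredRate.
final class TwoBucketCounterSketch {
    // Count for the last completed interval.
    private final AtomicLong lastBucket = new AtomicLong(0);
    // Count for the interval currently in progress.
    private final AtomicLong currentBucket = new AtomicLong(0);

    // Called once per event (in eureka: once per heartbeat).
    void increment() {
        currentBucket.incrementAndGet();
    }

    // Stand-in for the TimerTask body: publish the current count and start a fresh interval.
    void rollOver() {
        lastBucket.set(currentBucket.getAndSet(0));
    }

    // Count of the last completed interval; increments since the last rollOver are not visible yet.
    long getCount() {
        return lastBucket.get();
    }

    public static void main(String[] args) {
        TwoBucketCounterSketch counter = new TwoBucketCounterSketch();
        counter.increment();
        counter.increment();
        counter.increment();
        System.out.println(counter.getCount());  // 0: the first interval has not completed
        counter.rollOver();
        System.out.println(counter.getCount());  // 3: heartbeats of the completed interval
    }
}
```

A consequence of this design is that getCount() describes the previous fixed window rather than a sliding one, so the value consulted by the self-preservation check can lag real traffic by up to one sample interval.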
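4. How self-preservation is triggered
If the actual number of heartbeats in the last minute does not exceed the expected number of heartbeats per minute, self-preservation is triggered and no service instance may be evicted. The eureka server then assumes that its own network has failed and that a large number of service instances cannot get their heartbeats through.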
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public boolean isLeaseExpirationEnabled() {
        if (!isSelfPreservationModeEnabled()) {
            // The self preservation mode is disabled, hence allowing the instances to expire.
            return true;
        }
        // numberOfRenewsPerMinThreshold is the minimum number of heartbeats expected per minute
        // getNumOfRenewsInLastMin() returns the total number of heartbeats in the last minute
        return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
    }
}
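5. The self-preservation mechanism in eureka must be understood at the source level
In production this is the nastiest pitfall: you find that some service instances have gone offline, yet the eureka console never removes them, because self-preservation has kicked in. In a production environment you can choose to turn self-preservation off if you are able to. If the eureka server cannot receive heartbeats, the service instances cannot pull the registry from it either, and each service instance simply works from its locally cached registry. Leaving self-preservation on is also fine: now that the source for failure eviction and self-preservation is understood, you can go to the source level to see what is going on when a problem shows up in production.
Summary: flowchart of the eureka self-preservation mechanism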