《redis设计与实现》-17集群故障转移

一 RAFT算法

书接上篇17 集群的故障检测，本篇主要介绍集群检测到某主节点下线后，是如何选举新的主节点的。注意到Redis集群是无中心的，那么使用分布式一致性的算法来使集群中各节点能对在新主节点的选举上达成共识就是一个比较可行的方案。Redis使用了Raft算法来做主节点选举的。所以这里先简单介绍下Raft的原理：（坦白的说，我是看不懂论文的数学公式，真让人头疼，之前在看书paxos也只有个印象，不是真正的懂了，因为过一阵很快忘了）

在一个由 Raft 协议组织的集群中有三类角色：

Leader
Follower
Candidate

选举过程即是在分布式式系统中选出Leader状态节点的过程。三种状态的转换关系如下所示：

当系统中没有Leader时，所有节点初始状态均为Follewer，每个节点在经历time_out时间后，会自动转变为Candidate，并尝试发起投票以便成为新的Leader；为了保证系统中只存在一个Leader，当选新Leader的条件是Candidate收到了超过半数节点以上的投票（每个节点在每轮投票中只能投给唯一的节点，通常是投个第一个发来邀票请求的节点），达到该条件后，Candidate即变为Leader。注意到投票是有轮次的，只有收到当前轮次的投票才是有效票。在状态机中，用term来表示投票的轮次。

根据上面介绍的流程，容易注意到为实现Leader的选举，有几个前提： 1. 每个节点都需要知道系统中到底有哪些节点存在（为了能向每个节点邀票，同时要知道节点总数才能判断最终是否收到了多数的投票）； 2. 所有节点应该有统一的初始化term，且后续需要不断的同步term

　　第一点前提，只需要初始化阶段给所有节点置统一的term即可；而第二点前提则要求每个节点收到旧term的消息，可以不处理消息请求，并把自身的较高的term返回；而每个节点收到新的term之后，则要积极响应，并在响应结束之后更新自己的term。

在这里强烈推荐一篇文章：https://www.cnblogs.com/mindwind/p/5231986.html ，把流程中各种异常情况都列举证明了，不是数学公式推导的那种，容易理解。当然大神也在极客时间开设专栏。

之前的redis哨兵介绍过类似epoch(类似于term)的概念，Redis集群中的纪元与之类似，主要是两种：currentEpoch和configEpoch。

currentEpoch

这是一个集群状态相关的概念，可以当做记录集群状态变更的递增版本号。每个集群节点，都会通过server.cluster->currentEpoch记录当前的currentEpoch。

集群节点创建时，不管是主节点还是从节点，都置currentEpoch为0。当前节点接收到来自其他节点的包时，如果发送者的currentEpoch（消息头部会包含发送者的currentEpoch）大于当前节点的currentEpoch，那么当前节点会更新currentEpoch为发送者的currentEpoch。因此，集群中所有节点的currentEpoch最终会达成一致，相当于对集群状态的认知达成了一致。

configepoch

这是一个集群节点配置相关的概念，每个集群节点都有自己独一无二的configepoch。所谓的节点配置，实际上是指节点所负责的槽位信息。

每一个主节点在向其他节点发送包时，都会附带其configEpoch信息，以及一份表示它所负责槽位的位数组信息。而从节点向其他节点发送包时，包中的configEpoch和负责槽位信息，是其主节点的configEpoch和负责槽位信息。节点收到包之后，就会根据包中的configEpoch和负责槽位信息，记录到相应节点属性中。configEpoch主要用于解决不同的节点的配置发生冲突的情况。

在从节点发起选举，获得足够多的选票之后，成功当选时，也就是从节点试图替代其下线主节点，成为新的主节点时，会增加它自己的configEpoch，使其成为当前所有集群节点的configEpoch中的最大值。这样，该从节点成为主节点后，就会向所有节点发送广播包，强制其他节点更新相关槽位的负责节点为自己。

二故障转移概述

如果slave 发现master 已下线, 则开始进行故障转移.，下面是过程步骤：

1.从所有的从节点里面选举出一个新的主
2.选举出的新主会执行slaveof no one把自己的状态从slave变成master
3.撤销已下线的主节点的槽指派,并把这些槽位重新指派给自己
4.新的主节点向集群广播一条PONG消息,通过这个消息告诉所有集群节点:自己已经变成了主节点,接管了原来的主节点
5.新的主节点开始接收和处理与自己槽位相关的命令请求

下面看下实现过程。

三实现过程

下线节点的从节点等待下一个周期执行clusterCron()函数，来开始故障转移操作。具体的代码如下：

    // 如果myself是从节点
    if (nodeIsSlave(myself)) {
        // 设置手动故障转移的状态
        clusterHandleManualFailover();
        // 执行从节点的自动或手动故障转移，从节点获取其主节点的哈希槽，并传播新配置
        clusterHandleSlaveFailover();
        /* If there are orphaned slaves, and we are a slave among the masters
         * with the max number of non-failing slaves, consider migrating to
         * the orphaned masters. Note that it does not make sense to try
         * a migration if there is no master with at least *two* working
         * slaves. */
        // 如果存在孤立的主节点，并且集群中的某一主节点有超过2个正常的从节点，并且该主节点正好是myself节点的主节点
        if (orphaned_masters && max_slaves >= 2 && this_slaves == max_slaves)
            // 给孤立的主节点迁移一个从节点
            clusterHandleSlaveMigration(max_slaves);
    }

调用clusterHandleManualFailover来设置手动故障转移的状态。用于执行了CLUSTER FAILOVER [FORCE|TAKEOVER]命令的情况。
调用clusterHandleSlaveFailover()函数来执行故障转移操作。

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 * 如果当前节点是一个从节点，并且它正在复制的一个负责非零个槽的主节点处于 FAIL 状态，
 * 那么执行这个函数。
 *
 * The gaol of this function is:
 * 这个函数有三个目标：
 * 1) To check if we are able to perform a failover, is our data updated?
 * 检查是否可以对主节点执行一次故障转移，节点的关于主节点的信息是否准确和最新（updated）？
 * 2) Try to get elected by masters.
 * 选举一个新的主节点
 * 3) Perform the failover informing all the other nodes.
 *  执行故障转移，并通知其他节点
 */
void clusterHandleSlaveFailover(void) {
    mstime_t data_age;
    // 计算上次选举所过去的时间
    mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
    // 计算胜选需要的票数
    int needed_quorum = (server.cluster->size / 2) + 1;
    // 手动故障转移的标志
    int manual_failover = server.cluster->mf_end != 0 &&
                          server.cluster->mf_can_start;            
    //该变量表示故障转移流程(发起投票，等待回应)的超时时间，超过该时间后还没有获得足够的选票，则表示本次故障转移失败；                                    
    mstime_t auth_timeout,
     //该变量表示判断是否可以开始下一次故障转移流程的时间，只有距离上一次发起故障转移时，已经超过auth_retry_time之后，
     //才表示可以开始下一次故障转移了（auth_age > auth_retry_time）；
     auth_retry_time;

    server.cluster->todo_before_sleep &= ~CLUSTER_TODO_HANDLE_FAILOVER;

    /* Compute the failover timeout (the max time we have to send votes
     * and wait for replies), and the failover retry time (the time to wait
     * before trying to get voted again).
     *
     * Timeout is MIN(NODE_TIMEOUT*2,2000) milliseconds.
     * Retry is two times the Timeout.
     */
    // 计算故障转移超时时间
    auth_timeout = server.cluster_node_timeout*2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    // 重试的超时时间	
    auth_retry_time = auth_timeout*2;

    /* Pre conditions to run the function, that must be met both in case
     * of an automatic or manual failover:
     * 运行函数的前提条件，在自动或手动故障转移的情况下都必须满足：
     * 1) We are a slave.
     *  1. 当前节点是从节点
     * 2) Our master is flagged as FAIL, or this is a manual failover.
     *   2. 该从节点的主节点被标记为FAIL状态，或者是一个手动故障转移状态
     * 3) It is serving slots. 
     * 3.当前从节点有负责的槽位
     */
    // 如果不能满足以上条件，则直接返回 
    if (nodeIsMaster(myself) ||
        myself->slaveof == NULL ||
        (!nodeFailed(myself->slaveof) && !manual_failover) ||
        myself->slaveof->numslots == 0)
    {
        /* There are no reasons to failover, so we set the reason why we
         * are returning without failing over to NONE. */
        // 设置故障转移失败的原因：CLUSTER_CANT_FAILOVER_NONE     
        server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
        return;
    }

    /* Set data_age to the number of seconds we are disconnected from
     * the master. */
     // 如果当前节点正在和主节点保持连接状态，计算从节点和主节点断开的时间 
    if (server.repl_state == REPL_STATE_CONNECTED) { //如果主从之间是因为网络不通引起的，read判断不出epoll err事件，则状态为这个
        data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
                   * 1000;//也就是当前从节点与主节点最后一次通信过了多久了
    } else { //这里一般都是直接kill主master进程，从epoll err感知到了，会在replicationHandleMasterDisconnection把状态置为REDIS_REPL_CONNECT
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000; //本从节点和主节点断开了多久，
    }

    /* Remove the node timeout from the data age as it is fine that we are
     * disconnected from our master at least for the time it was down to be
     * flagged as FAIL, that's the baseline. */
    // node timeout 的时间不计入断线时间之内 如果data_age大于server.cluster_node_timeout，则从data_age中
    //减去server.cluster_node_timeout，因为经过server.cluster_node_timeout时间没有收到主节点的PING回复，才会将其标记为PFAIL  
    if (data_age > server.cluster_node_timeout)
        data_age -= server.cluster_node_timeout;

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    // 检查这个从节点的数据是否比较新  
    // 目前的检测办法是断线时间不能超过 node timeout 的十倍
     /* data_age主要用于判断当前从节点的数据新鲜度；如果data_age超过了一定时间，表示当前从节点的数据已经太老了，
      不能替换掉下线主节点，因此在不是手动强制故障转移的情况下，直接返回；*/
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    // 如果先前的尝试故障转移超时并且重试时间已过，我们可以设置一个新的。 
    if (auth_age > auth_retry_time) { //每次超时从新发送auth req要求其他主master投票，都会先走这个if，然后下次调用该函数才会走if后面的流程
    	  // 设置新的故障转移属性
        server.cluster->failover_auth_time = mstime() +
            500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. *///等到这个时间到才进行故障转移
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank(); //本节点按照在master中的repl_offset来获取排名
        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time +=
            server.cluster->failover_auth_rank * 1000;
        /* However if this is a manual failover, no delay is needed. */
         /*注意如果是管理员发起的手动强制执行故障转移，则设置server.cluster->failover_auth_time为当前时间，表示会
        *立即开始故障转移流程；最后，调用clusterBroadcastPong，向该下线主节点的所有从节点发送PONG包，包头部分带
        *有当前从节点的复制数据量，因此其他从节点收到之后，可以更新自己的排名；最后直接返回；*/
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }
        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
          // 发送一个PONG消息包给所有的从节点，携带有当前的复制偏移量  
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

    /* It is possible that we received more updated offsets from other
     * slaves for the same master since we computed our election delay.
     * Update the delay if our rank changed.
     *
     * Not performed if this is a manual failover. */
     // 如果没有开始故障转移，则调用clusterGetSlaveRank()获取当前从节点的最新排名。
     //因为在故障转移之前可能会收到其他节点发送来的心跳包，因而可以根据心跳包的复制偏移量更新本节点的排名，获得新排名newrank，
     //如果newrank比之前的排名靠后，则需要增加故障转移开始时间的延迟，然后将newrank记录到server.cluster->failover_auth_rank中
 
    if (server.cluster->failover_auth_sent == 0 &&
        server.cluster->mf_end == 0)
    {
    	  // 获取新排名
        int newrank = clusterGetSlaveRank();
        // 新排名比之前的靠后  
        if (newrank > server.cluster->failover_auth_rank) {
        	  // 计算延迟故障转移时间
            long long added_delay =
                (newrank - server.cluster->failover_auth_rank) * 1000;
            // 更新下一次故障转移的时间和排名    
            server.cluster->failover_auth_time += added_delay;
            server.cluster->failover_auth_rank = newrank;
            serverLog(LL_WARNING,
                "Slave rank updated to #%d, added %lld milliseconds of delay.",
                newrank, added_delay);
        }
    }

    /* Return ASAP if we can't still start the election. */
     // 如果执行故障转移的时间未到，先返回
    if (mstime() < server.cluster->failover_auth_time) {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_DELAY);
        return;
    }

    /* Return ASAP if the election is too old to be valid. */
    // 如果距离应该执行故障转移的时间已经过了很久
    // 那么不应该再执行故障转移了（因为可能已经没有需要了）
    // 直接返回
    if (auth_age > auth_timeout) {
    	  // 故障转移过期
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_EXPIRED);
        return;
    }

    /* Ask for votes if needed. */
     // 向其他节点发送故障转移请求（投票请求）
    if (server.cluster->failover_auth_sent == 0) {
    	  // 增加配置纪元
        server.cluster->currentEpoch++;
        // 记录发起故障转移的配置纪元
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
            (unsigned long long) server.cluster->currentEpoch);
        //向其他所有节点发送信息，看它们是否支持由本节点来对下线主节点进行故障转移    
        clusterRequestFailoverAuth();
         // 设置为真，表示本节点已经向其他节点发送了投票请求
        server.cluster->failover_auth_sent = 1;
        // 进入下一个事件循环执行的操作，保存配置文件，更新节点状态，同步配置
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }

    /* Check if we reached the quorum. */
     // 如果当前节点获得了足够多的投票，那么对下线主节点进行故障转移
    if (server.cluster->failover_auth_count >= needed_quorum) {
        /* We have the quorum, we can finally failover the master. */

        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* Update my configEpoch to the epoch of the election. */
        if (myself->configEpoch < server.cluster->failover_auth_epoch) {
            myself->configEpoch = server.cluster->failover_auth_epoch;
            serverLog(LL_WARNING,
                "configEpoch set to %llu after successful failover",
                (unsigned long long) myself->configEpoch);
        }

        /* Take responsability for the cluster slots. */
         // 执行自动或手动故障转移，从节点获取其主节点的哈希槽，并传播新配置
        clusterFailoverReplaceYourMaster();
    } else {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
    }
}

该函数可以分为这几个部分：

选举资格检测和准备工作：

首先函数需要判断当前执行clusterHandleSlaveFailover()函数的节点是否具有选举的资格。

准备选举的时间：

如果当前从节点符合故障转移的资格，更新选举开始的时间，只有达到改时间才能执行后续的流程。

发起选举：

如果还没有向集群其他节点发起投票请求，那么将当前纪元currentEpoch加一，然后该当前纪元设置发起故障转移的纪元failover_auth_epoch。调用clusterRequestFailoverAuth()函数，发送一个FAILOVE_AUTH_REQUEST消息给其他所有集群节点，等待其他节点回复是否同意该从节点为它的主节点执行故障转移操作。
选举投票：

当集群中所有的节点接收到REQUEST消息后，会执行clusterProcessPacket()函数的这部分代码：

      // 这是一条请求获得故障迁移授权的消息： sender 请求当前节点为它进行故障转移投票    
    } else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST) {
        if (!sender) return 1;  /* We don't know that node. */
        // 如果条件允许的话，向 sender 投票，支持它进行故障转移	
        clusterSendFailoverAuthIfNeeded(sender,hdr);
    }

调用clusterSendFailoverAuthIfNeeded()函数向sender节点发起投票。该函数的代码如下：

/* Vote for the node asking for our vote if there are the conditions. */
// 在满足条件的情况下，为请求执行故障转移的节点node进行投票
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    // 获取该请求从节点的主节点
    clusterNode *master = node->slaveof;
    // 获取请求的当前纪元和配置纪元
    uint64_t requestCurrentEpoch = ntohu64(request->currentEpoch);
    uint64_t requestConfigEpoch = ntohu64(request->configEpoch);
    // 获取该请求从节点的槽位图信息
    unsigned char *claimed_slots = request->myslots;
    // 是否指定强制认证故障转移的标识
    int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;
    int j;

    /* IF we are not a master serving at least 1 slot, we don't have the
     * right to vote, as the cluster size in Redis Cluster is the number
     * of masters serving at least one slot, and quorum is the cluster
     * size + 1 */
    // 如果myself是从节点，或者myself没有负责的槽信息，那么myself节点没有投票权，直接返回
    if (nodeIsSlave(myself) || myself->numslots == 0) return;

    /* Request epoch must be >= our currentEpoch.
     * Note that it is impossible for it to actually be greater since
     * our currentEpoch was updated as a side effect of receiving this
     * request, if the request epoch was greater. */
    // 如果请求的当前纪元小于集群的当前纪元，直接返回。该节点有可能是长时间下线后重新上线，导致版本落后于就集群的版本
    // 因为该请求节点的版本小于集群的版本，每次有选举或投票都会更新每个节点的版本，使节点状态和集群的状态是一致的。
    if (requestCurrentEpoch < server.cluster->currentEpoch) {
        serverLog(LL_WARNING,
            "Failover auth denied to %.40s: reqEpoch (%llu) < curEpoch(%llu)",
            node->name,
            (unsigned long long) requestCurrentEpoch,
            (unsigned long long) server.cluster->currentEpoch);
        return;
    }

    /* I already voted for this epoch? Return ASAP. */
    // 如果最近一次投票的纪元和当前纪元相同，表示集群已经投过票了
    if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
        serverLog(LL_WARNING,
                "Failover auth denied to %.40s: already voted for epoch %llu",
                node->name,
                (unsigned long long) server.cluster->currentEpoch);
        return;
    }

    /* Node must be a slave and its master down.
     * The master can be non failing if the request is flagged
     * with CLUSTERMSG_FLAG0_FORCEACK (manual failover). */
    // 指定的node节点必须为从节点且它的主节点处于下线状态，否则打印日志后返回
    if (nodeIsMaster(node) || master == NULL ||
        (!nodeFailed(master) && !force_ack))
    {
        // 故障转移的请求必须由从节点发起
        if (nodeIsMaster(node)) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: it is a master node",
                    node->name);
        // 从节点找不到他的主节点
        } else if (master == NULL) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: I don't know its master",
                    node->name);
        // 从节点的主节点没有处于下线状态
        } else if (!nodeFailed(master)) {
            serverLog(LL_WARNING,
                    "Failover auth denied to %.40s: its master is up",
                    node->name);
        }
        return;
    }

    /* We did not voted for a slave about this master for two
     * times the node timeout. This is not strictly needed for correctness
     * of the algorithm but makes the base case more linear. */
    // 在cluster_node_timeout * 2时间内只能投1次票
    if (mstime() - node->slaveof->voted_time < server.cluster_node_timeout * 2)
    {
        serverLog(LL_WARNING,
                "Failover auth denied to %.40s: "
                "can't vote about this master before %lld milliseconds",
                node->name,
                (long long) ((server.cluster_node_timeout*2)-
                             (mstime() - node->slaveof->voted_time)));
        return;
    }

    /* The slave requesting the vote must have a configEpoch for the claimed
     * slots that is >= the one of the masters currently serving the same
     * slots in the current configuration. */
    // 请求投票的从节点必须有一个声明负责槽位的配置纪元，这些配置纪元必须比负责相同槽位的主节点的配置纪元要大
    for (j = 0; j < CLUSTER_SLOTS; j++) {
        // 跳过没有指定的槽位
        if (bitmapTestBit(claimed_slots, j) == 0) continue;
        // 如果请求从节点的配置纪元大于槽的配置纪元，则跳过
        if (server.cluster->slots[j] == NULL ||
            server.cluster->slots[j]->configEpoch <= requestConfigEpoch)
        {
            continue;
        }
        /* If we reached this point we found a slot that in our current slots
         * is served by a master with a greater configEpoch than the one claimed
         * by the slave requesting our vote. Refuse to vote for this slave. */
        // 如果请求从节点的配置纪元小于槽的配置纪元，那么表示该从节点的配置纪元已经过期，不能给该从节点投票，直接返回
        serverLog(LL_WARNING,
                "Failover auth denied to %.40s: "
                "slot %d epoch (%llu) > reqEpoch (%llu)",
                node->name, j,
                (unsigned long long) server.cluster->slots[j]->configEpoch,
                (unsigned long long) requestConfigEpoch);
        return;
    }

    /* We can vote for this slave. */
    // 发送一个FAILOVER_AUTH_ACK消息给指定的节点，表示支持该从节点进行故障转移
    clusterSendFailoverAuth(node);
    // 设置最近一次投票的纪元，防止给多个节点投多次票
    server.cluster->lastVoteEpoch = server.cluster->currentEpoch;
    // 设置最近一次投票的时间
    node->slaveof->voted_time = mstime();
    serverLog(LL_WARNING, "Failover auth granted to %.40s for epoch %llu",
        node->name, (unsigned long long) server.cluster->currentEpoch);
}

替换主节点：

当请求投票的从节点收到其他主节点的ACK消息后，会执行clusterProcessPacket()函数的这部分代码：

  // 这是一条故障迁移投票信息： sender 支持当前节点执行故障转移操作   
    } else if (type == CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK) {
        if (!sender) return 1;  /* We don't know that node. */
        /* We consider this vote only if the sender is a master serving
         * a non zero number of slots, and its currentEpoch is greater or
         * equal to epoch where this node started the election. */
        // 只有正在处理至少一个槽的主节点的投票会被视为是有效投票
        // 只有符合以下条件， sender 的投票才算有效：
        // 1） sender 是主节点
        // 2） sender 正在处理至少一个槽
        // 3） sender 的配置纪元大于等于当前节点的配置纪元  
        if (nodeIsMaster(sender) && sender->numslots > 0 &&
            senderCurrentEpoch >= server.cluster->failover_auth_epoch)
        {
        	  // 增加支持票数
            server.cluster->failover_auth_count++;
            /* Maybe we reached a quorum here, set a flag to make sure
             * we check ASAP. */
            clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
        }
    }

从节点刚在执行到clusterHandleSlaveFailover()函数发送REQUEST消息后，就直接返回，是因为等待集群其他主节点回复是否同意该从节点进行故障转移操作。当再次执行clusterHandleSlaveFailover()函数时，如果集群其他节点投票数failover_auth_count大于所需的票数，则从节点可以对主节点进行故障转移。具体代码在clusterHandleSlaveFailover最后：

  /* Check if we reached the quorum. */
    // 如果获得的票数到达quorum，那么对下线的主节点执行故障转移
    if (server.cluster->failover_auth_count >= needed_quorum) {
        /* We have the quorum, we can finally failover the master. */

        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* Update my configEpoch to the epoch of the election. */
        if (myself->configEpoch < server.cluster->failover_auth_epoch) {
            myself->configEpoch = server.cluster->failover_auth_epoch;
            serverLog(LL_WARNING,
                "configEpoch set to %llu after successful failover",
                (unsigned long long) myself->configEpoch);
        }

        /* Take responsability for the cluster slots. */
        // 执行自动或手动故障转移，从节点获取其主节点的哈希槽，并传播新配置
        clusterFailoverReplaceYourMaster();
    } else {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
    }
}

票数足够，调用clusterFailoverReplaceYourMaster()函数进行故障转移，替换主节点。函数代码如下：

/* This function implements the final part of automatic and manual failovers,
 * where the slave grabs its master's hash slots, and propagates the new
 * configuration.
 *
 * Note that it's up to the caller to be sure that the node got a new
 * configuration epoch already. */
 // 该函数实现自动和手动故障转移的最后一部分，从节点获取其主节点的哈希槽，并传播新配置
void clusterFailoverReplaceYourMaster(void) {
    int j;
    // 获取myself的主节点 
    clusterNode *oldmaster = myself->slaveof;
    
    // 如果myself节点是主节点，直接返回
    if (nodeIsMaster(myself) || oldmaster == NULL) return;

    /* 1) Turn this node into a master. */
     // 将指定的myself节点重新配置为主节点
    clusterSetNodeAsMaster(myself);
     // 取消复制操作，设置myself为主节点
    replicationUnsetMaster();

    /* 2) Claim all the slots assigned to our master. */
    // 将所有之前主节点声明负责的槽位指定给现在的主节点myself节点。
    for (j = 0; j < CLUSTER_SLOTS; j++) {
    	  // 如果当前槽已经指定
        if (clusterNodeGetSlotBit(oldmaster,j)) {
        	  // 将该槽设置为未分配的 
            clusterDelSlot(j);
            // 将该槽指定给myself节点
            clusterAddSlot(myself,j);
        }
    }

    /* 3) Update state and save config. */
     // 更新节点状态
    clusterUpdateState();
    // 写配置文件
    clusterSaveConfigOrDie(1);

    /* 4) Pong all the other nodes so that they can update the state
     *    accordingly and detect that we switched to master role. */
    // 发送一个PONG消息包给所有已连接不处于握手状态的的节点
    // 以便能够其他节点更新状态 
    clusterBroadcastPong(CLUSTER_BROADCAST_ALL);

    /* 5) If there was a manual failover in progress, clear the state. */
     // 重置与手动故障转移的状态
    resetManualFailover();
}

主从切换广播给集群

当从节点替换主节点之后，调用clusterBroadcastPong()函数，将主从切换的信息发送给集群中的所有节点。该函数就是遍历集群中的所有节点，发送PONG消息，消息包中会包含最新的晋升为主节点的节点信息。接收到消息的函数会调用clusterProcessPacket()函数来处理，具体处理代码如下：

 /* Check for role switch: slave -> master or master -> slave. */
        // 主从切换的检测
        if (sender) {
        	    // 如果消息头的slaveof为空名字，那么说明sender节点是主节点
            if (!memcmp(hdr->slaveof,CLUSTER_NODE_NULL_NAME,
                sizeof(hdr->slaveof)))
            {
                /* Node is a master. */
                  // 设置 sender 为主节点
                clusterSetNodeAsMaster(sender);
            } else {// sender 的 slaveof 不为空，那么这是一个从节点
                /* Node is a slave. */
                // 根据名字从集群中查找并返回sender从节点的主节点
                clusterNode *master = clusterLookupNode(hdr->slaveof);
                 // sender标识自己为主节点，但是消息中显示它为从节点
                if (nodeIsMaster(sender)) {
                    /* Master turned into a slave! Reconfigure the node. */
                     // 删除主节点所负责的槽
                    clusterDelNodeSlots(sender);
                      // 更新标识
                    sender->flags &= ~(CLUSTER_NODE_MASTER|
                                       CLUSTER_NODE_MIGRATE_TO);
                     // 设置为从节点标识                   
                    sender->flags |= CLUSTER_NODE_SLAVE;

                    /* Update config and state. */
                     // 更新配置和状态
                    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                                         CLUSTER_TODO_UPDATE_STATE);
                }

                /* Master node changed for this slave? */
                  // sender的主节点发生改变
                if (master && sender->slaveof != master) {
                	   // 如果 sender 之前的主节点不是现在的主节点
                    // 那么在旧主节点的从节点列表中移除 sender
                    if (sender->slaveof)
                        clusterNodeRemoveSlave(sender->slaveof,sender);
                    // 将sender添加到新的主节点的从节点字典中    
                    clusterNodeAddSlave(master,sender);
                    // 设置sender的主节点
                    sender->slaveof = master;

                    /* Update config. */
                    // 更新配置
                    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);
                }
            }
        }

每一个接收到PONG消息的节点都会将进行主从切换，更新各自视角中的节点信息。故障转移操作完毕！

参考：

https://blog.csdn.net/men_wen/article/details/73137338

https://www.cnblogs.com/mindwind/p/5231986.html