在日常工作中,会碰到如下的场景,如MySQL数据库升级,主服务器硬件升级等,这个时候就需要将写操作切换到另外一台服务器上,那么如何进行在线切换呢?同时,要求切换过程短,对业务的影响比较小。
MHA就提供了这样一种优雅的方式,只会堵塞业务0.5~2s的时间,在这段时间内,业务无法读取和写入。
集群信息
角色 IP地址 ServerID 类型
Master 192.168.244.10 1 写入
Candicate master 192.168.244.20 2 读
Slave 192.168.244.30 3 读
Monitor host 192.168.244.40 监控集群组
MHA具体的搭建步骤和原理,可参考另外一篇文章:
在线切换的步骤
1. 关闭MHA监控
# masterha_stop --conf=/etc/masterha/app1.cnf
2. 在线切换
# /usr/local/bin/masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.244.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
其中,
--orig_master_is_new_slave是将原master切换为新主的slave,默认情况下,是不添加的。
--running_updates_limit默认为1s,即如果主从延迟时间(Seconds_Behind_Master),或master show processlist中dml操作大于1s,则不会执行切换。
在线切换的输出
Tue Apr
11
15:
28:
32
2017 - [
info] MHA::MasterRotate version
0.56
.
Tue Apr
11
15:
28:
32
2017 - [
info
] Starting online master switch..
Tue Apr
11
15:
28:
32
2017 - [
info
]
Tue Apr
11
15:
28:
32
2017 - [
info] * Phase
1
: Configuration Check Phase..
Tue Apr
11
15:
28:
32
2017 - [
info
]
Tue Apr
11
15:
28:
32
2017 - [warning] Global configuration
file /etc/
masterha_default.cnf not found. Skipping.
Tue Apr
11
15:
28:
32
2017 - [
info] Reading application default configuration from /etc/masterha/
app1.cnf..
Tue Apr
11
15:
28:
32
2017 - [
info] Reading server configuration from /etc/masterha/
app1.cnf..
Tue Apr
11
15:
28:
34
2017 - [
info] GTID failover mode =
0
Tue Apr
11
15:
28:
34
2017 - [
info] Current Alive Master:
192.168.
244.10(
192.168.
244.10:
3306
)
Tue Apr
11
15:
28:
34
2017 - [
info
] Alive Slaves:
Tue Apr
11
15:
28:
34
2017 - [
info]
192.168.
244.20(
192.168.
244.20:
3306) Version=
5.6.
31-
log (oldest major version between slaves) log
-bin:enabledTue Apr
11
15:
28:
34
2017 - [
info] Replicating from
192.168.
244.10(
192.168.
244.10:
3306
)
Tue Apr
11
15:
28:
34
2017 - [
info] Primary candidate
for
the new Master (candidate_master is set)
Tue Apr
11
15:
28:
34
2017 - [
info]
192.168.
244.30(
192.168.
244.30:
3306) Version=
5.6.
31-
log (oldest major version between slaves) log
-bin:enabledTue Apr
11
15:
28:
34
2017 - [
info] Replicating from
192.168.
244.10(
192.168.
244.10:
3306
)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on
192.168.
244.10(
192.168
.
244.10:
3306)? (YES/no): yes
Tue Apr
11
15:
28:
47
2017 - [
info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take
long
time
..
Tue Apr
11
15:
28:
47
2017 - [
info
] ok.
Tue Apr
11
15:
28:
47
2017 - [
info
] Checking MHA is not monitoring or doing failover..
Tue Apr
11
15:
28:
47
2017 - [
info] Checking replication health on
192.168.
244.20
..
Tue Apr
11
15:
28:
47
2017 - [
info
] ok.
Tue Apr
11
15:
28:
47
2017 - [
info] Checking replication health on
192.168.
244.30
..
Tue Apr
11
15:
28:
47
2017 - [
info
] ok.
Tue Apr
11
15:
28:
47
2017 - [
info]
192.168.
244.20
can be new master.
Tue Apr
11
15:
28:
47
2017 - [
info
]
From:
192.168.
244.10(
192.168.
244.10:
3306
) (current master)
+--
192.168.
244.20(
192.168.
244.20:
3306
)
+--
192.168.
244.30(
192.168.
244.30:
3306
)
To:
192.168.
244.20(
192.168.
244.20:
3306
) (new master)
+--
192.168.
244.30(
192.168.
244.30:
3306
)
+--
192.168.
244.10(
192.168.
244.10:
3306
)
Starting master switch from
192.168.
244.10(
192.168.
244.10:
3306) to
192.168.
244.20(
192.168.
244.20:
3306)? (yes/
NO): yes
Tue Apr
11
15:
29:
00
2017 - [
info] Checking whether
192.168.
244.20(
192.168.
244.20:
3306) is ok
for
the new master..
Tue Apr
11
15:
29:
00
2017 - [
info
] ok.
Tue Apr
11
15:
29:
00
2017 - [
info]
192.168.
244.10(
192.168.
244.10:
3306
): SHOW SLAVE STATUS returned empty result. To check replication
filtering rules, temporarily executing CHANGE MASTER to a dummy host.Tue Apr
11
15:
29:
00
2017 - [
info]
192.168.
244.10(
192.168.
244.10:
3306
): Resetting slave pointing to the dummy host.
Tue Apr
11
15:
29:
00
2017 - [
info] ** Phase
1
: Configuration Check Phase completed.
Tue Apr
11
15:
29:
00
2017 - [
info
]
Tue Apr
11
15:
29:
00
2017 - [
info] * Phase
2
: Rejecting updates Phase..
Tue Apr
11
15:
29:
00
2017 - [
info
]
Tue Apr
11
15:
29:
00
2017 - [
info] Executing master ip online change script to disable
write
on the current master:
Tue Apr
11
15:
29:
00
2017 - [
info] /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=
192.168.
244.10 --
orig_ma
ster_ip=
192.168.
244.10 --orig_master_port=
3306 --orig_master_user=
'
monitor
' --orig_master_password=
'
monitor123
' --new_master_host=
192.168.
244.20 --new_master_ip=
192.168.
244.20 --new_master_port=
3306 --new_master_user=
'
monitor
' --new_master_password=
'
monitor123
' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slaveTue Apr
11
15:
29:
00
2017
476501
Set read_only on the new master.. ok.
Tue Apr
11
15:
29:
00
2017
911951 Set read_only=
1
on the orig master.. ok.
Tue Apr
11
15:
29:
00
2017
919517
Killing all application threads..
Tue Apr
11
15:
29:
00
2017
919552
done
.
Disabling the VIP an old master:
192.168.
244.10
SIOCSIFFLAGS: Cannot assign requested address
Tue Apr
11
15:
29:
00
2017 - [
info
] ok.
Tue Apr
11
15:
29:
00
2017 - [
info
] Locking all tables on the orig master to reject updates from everybody (including root):
Tue Apr
11
15:
29:
00
2017 - [
info
] Executing FLUSH TABLES WITH READ LOCK..
Tue Apr
11
15:
29:
00
2017 - [
info
] ok.
Tue Apr
11
15:
29:
00
2017 - [
info] Orig master binlog:pos is mysql-bin.
000016:
211
.
Tue Apr
11
15:
29:
00
2017 - [
info] Waiting to execute all relay logs on
192.168.
244.20(
192.168.
244.20:
3306
)..
Tue Apr
11
15:
29:
01
2017 - [
info] master_pos_wait(mysql-bin.
000016:
211) completed on
192.168.
244.20(
192.168.
244.20:
3306). Executed
0
events.Tue Apr
11
15:
29:
01
2017 - [
info]
done
.
Tue Apr
11
15:
29:
01
2017 - [
info] Getting new master
'
s binlog name and position..
Tue Apr
11
15:
29:
01
2017 - [
info] mysql-bin.
000009:
211
Tue Apr
11
15:
29:
01
2017 - [
info
] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_
HOST=
'
192.168.244.20
', MASTER_PORT=
3306, MASTER_LOG_FILE=
'
mysql-bin.000009
', MASTER_LOG_POS=
211, MASTER_USER=
'
repl
', MASTER_PASSWORD=
'
xxx
';Tue Apr
11
15:
29:
01
2017 - [
info] Executing master ip online change script to allow
write
on the new master:
Tue Apr
11
15:
29:
01
2017 - [
info] /usr/local/bin/master_ip_online_change --command=start --orig_master_host=
192.168.
244.10 --
orig_m
aster_ip=
192.168.
244.10 --orig_master_port=
3306 --orig_master_user=
'
monitor
' --orig_master_password=
'
monitor123
' --new_master_host=
192.168.
244.20 --new_master_ip=
192.168.
244.20 --new_master_port=
3306 --new_master_user=
'
monitor
' --new_master_password=
'
monitor123
' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slaveTue Apr
11
15:
29:
01
2017
109040 Set read_only=
0
on the new master.
Enabling the VIP
192.168.
244.188 on the new master:
192.168.
244.20
Tue Apr
11
15:
29:
01
2017 - [
info
] ok.
Tue Apr
11
15:
29:
01
2017 - [
info
]
Tue Apr
11
15:
29:
01
2017 - [
info] * Switching slaves
in
parallel..
Tue Apr
11
15:
29:
01
2017 - [
info
]
Tue Apr
11
15:
29:
01
2017 - [
info] -- Slave switch on host
192.168.
244.30(
192.168.
244.30:
3306) started, pid:
17651
Tue Apr
11
15:
29:
01
2017 - [
info
]
Tue Apr
11
15:
29:
02
2017 - [
info] Log messages from
192.168.
244.30
...
Tue Apr
11
15:
29:
02
2017 - [
info
]
Tue Apr
11
15:
29:
01
2017 - [
info] Waiting to execute all relay logs on
192.168.
244.30(
192.168.
244.30:
3306
)..
Tue Apr
11
15:
29:
01
2017 - [
info] master_pos_wait(mysql-bin.
000016:
211) completed on
192.168.
244.30(
192.168.
244.30:
3306). Executed
0
events.Tue Apr
11
15:
29:
01
2017 - [
info]
done
.
Tue Apr
11
15:
29:
01
2017 - [
info] Resetting slave
192.168.
244.30(
192.168.
244.30:
3306) and starting replication from the new master
1
92.168.
244.20(
192.168.
244.20:
3306)..Tue Apr
11
15:
29:
01
2017 - [
info
] Executed CHANGE MASTER.
Tue Apr
11
15:
29:
01
2017 - [
info
] Slave started.
Tue Apr
11
15:
29:
02
2017 - [
info] End of log messages from
192.168.
244.30
...
Tue Apr
11
15:
29:
02
2017 - [
info
]
Tue Apr
11
15:
29:
02
2017 - [
info] -- Slave switch on host
192.168.
244.30(
192.168.
244.30:
3306
) succeeded.
Tue Apr
11
15:
29:
02
2017 - [
info
] Unlocking all tables on the orig master:
Tue Apr
11
15:
29:
02
2017 - [
info
] Executing UNLOCK TABLES..
Tue Apr
11
15:
29:
02
2017 - [
info
] ok.
Tue Apr
11
15:
29:
02
2017 - [
info
] Starting orig master as a new slave..
Tue Apr
11
15:
29:
02
2017 - [
info] Resetting slave
192.168.
244.10(
192.168.
244.10:
3306) and starting replication from the new master
1
92.168.
244.20(
192.168.
244.20:
3306)..Tue Apr
11
15:
29:
02
2017 - [
info
] Executed CHANGE MASTER.
Tue Apr
11
15:
29:
02
2017 - [
info
] Slave started.
Tue Apr
11
15:
29:
02
2017 - [
info
] All new slave servers switched successfully.
Tue Apr
11
15:
29:
02
2017 - [
info
]
Tue Apr
11
15:
29:
02
2017 - [
info] * Phase
5
: New master cleanup phase..
Tue Apr
11
15:
29:
02
2017 - [
info
]
Tue Apr
11
15:
29:
02
2017 - [
info]
192.168.
244.20: Resetting slave
info
succeeded.
Tue Apr
11
15:
29:
02
2017 - [
info] Switching master to
192.168.
244.20(
192.168.
244.20:
3306) completed successfully.
MHA在线切换的原理
1. 检查当前的配置信息及主从服务器的信息
包括读取MHA的配置文件/etc/masterha/app1.cnf及检查当前slave的健康状态
2. 阻止对当前master的更新
主要通过如下步骤:
1> 等待1.5s($time_until_kill_threads*100ms),等待当前连接断开。
2> 执行 read_only=1,阻止新的DML操作
3> 等待0.5s,等待当前DML操作完成。
4> kill掉所有连接。
5> FLUSH NO_WRITE_TO_BINLOG TABLES
6> FLUSH TABLES WITH READ LOCK
3. 等待新master执行完所有的relay log
Waiting to execute all relay logs on 192.168.244.20(192.168.244.20:3306)..
4. 将新master的read_only设置为off,并添加VIP
5. slave切换到新master上。
1> 等待slave(192.168.244.30)应用完原主从复制产生的relay log,然后执行change master操作切换到新master上。
2> 释放原master上加的锁。
3> 因masterha_master_switch命令行中带有--orig_master_is_new_slave参数,故原master也切换为新master的从。
6. 清理新master的相关信息。
主要是执行了reset slave all操作,清除之前的复制信息。
MHA在线切换需满足的条件
MHA在执行在线切换之前,会判断当前的主从复制信息,只有满足了以下条件,才能执行切换动作:
1. 所有SLAVE的IO线程和SQL线程都在运行。
2. 所有slave的Seconds_Behind_Master小于或等于running_updates_limit的值,该参数如果没有显示指定的话,则默认为1s
3. 在master上,通过show processlist输出,没有一个DML操作的时间大于running_updates_limit的值。
在线切换时,打开general log,各个服务器的操作信息
注:在执行masterha_master_switch命令时,会有两次确认操作
1. It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.244.10(192.168
.244.10:3306)? (YES/no):
2. Starting master switch from 192.168.244.10(192.168.244.10:3306) to 192.168.244.20(192.168.244.20:3306)? (yes/NO):
以下输出中间都有两次空白,其中第一次空白之前的输出对应第一次确认之前,第二次之前的输出对应第二次确认之前。
原master 192.168.244.10
170412
16:
52:
38
23
Connect monitor@node4 on
23 Query set autocommit=
1
23
Query SELECT CONNECTION_ID() AS Value
170412
16:
52:
39
24
Connect monitor@node4 on
24 Query set autocommit=
1
24
Query SELECT CONNECTION_ID() AS Value
24 Query SET wait_timeout=
86400
24
Query SELECT @@global.server_id As Value
24
Query SELECT VERSION() AS Value
24
Query SELECT @@global.gtid_mode As Value
24 Query SHOW GLOBAL VARIABLES LIKE
'
log_bin
'
24
Query SHOW MASTER STATUS
24
Query SELECT @@global.datadir AS Value
24
Query SELECT @@global.slave_parallel_workers AS Value
24
Query SHOW SLAVE STATUS
24
Query SELECT @@global.read_only As Value
24
Query SELECT @@global.relay_log_purge As Value
170412
16:
54:
06
24
Query FLUSH NO_WRITE_TO_BINLOG TABLES
24 Query SELECT GET_LOCK(
'
MHA_Master_High_Availability_Monitor
',
'
0
'
) AS Value
24
Query SHOW PROCESSLIST
170412
16:
55:
51
24
Query SHOW SLAVE STATUS
24 Query CHANGE MASTER TO MASTER_HOST=
'
dummy_host
'
170412
16:
55:
52
24
Query SHOW SLAVE STATUS
24 Query RESET SLAVE
/*
!50516 ALL
*/
24 Query SELECT RELEASE_LOCK(
'
MHA_Master_High_Availability_Monitor
'
) As Value
24
Quit
25
Connect monitor@node4 on
25 Query set autocommit=
1
25
Query SELECT CONNECTION_ID() AS Value
25 Query SET sql_log_bin=
0
25
Query SHOW PROCESSLIST
25
Query SELECT @@global.read_only As Value
25 Query SET GLOBAL read_only=
1
25
Query SELECT @@global.read_only As Value
25
Query SHOW PROCESSLIST
25 Query SET sql_log_bin=
1
25
Quit
26
Connect monitor@node4 on
26 Query set autocommit=
1
26
Query SELECT CONNECTION_ID() AS Value
26 Query SET wait_timeout=
86400
26
Query FLUSH TABLES WITH READ LOCK
26
Query SHOW MASTER STATUS
170412
16:
55:
53
26
Query UNLOCK TABLES
26 Query CHANGE MASTER TO MASTER_HOST =
'
192.168.244.20
' MASTER_USER =
'
repl
' MASTER_PASSWORD = <secret>
MASTE
R_PORT=
3306 MASTER_LOG_FILE =
'
mysql-bin.000010
' MASTER_LOG_POS =
120
26 Query SET GLOBAL relay_log_purge=
0
26
Query START SLAVE
27 Connect Out repl@
192.168.
244.20:
3306
26
Query SHOW SLAVE STATUS
26 Query SELECT RELEASE_LOCK(
'
MHA_Master_High_Availability_Failover
'
) As Value
26 Quit
新master 192.168.244.20
170412 16:52:38 23 Connect monitor@node4 on
23 Query set autocommit=1
23 Query SELECT CONNECTION_ID() AS Value
170412 16:52:39 24 Connect monitor@node4 on
24 Query set autocommit=1
24 Query SELECT CONNECTION_ID() AS Value
24 Query SET wait_timeout=86400
24 Query SELECT @@global.server_id As Value
24 Query SELECT VERSION() AS Value
24 Query SELECT @@global.gtid_mode As Value
24 Query SHOW GLOBAL VARIABLES LIKE 'log_bin'
24 Query SHOW MASTER STATUS
24 Query SELECT @@global.datadir AS Value
24 Query SELECT @@global.slave_parallel_workers AS Value
24 Query SHOW SLAVE STATUS
24 Query SELECT @@global.read_only As Value
24 Query SELECT @@global.relay_log_purge As Value
24 Query SELECT @@global.relay_log_info_repository AS Value
24 Query SELECT @@global.datadir AS Value
24 Query SELECT @@global.relay_log_info_file AS Value
24 Query SHOW SLAVE STATUS
24 Query SELECT Repl_slave_priv AS Value FROM mysql.user WHERE user = 'repl'
170412 16:54:06 24 Query SELECT GET_LOCK('MHA_Master_High_Availability_Failover', '0') AS Value
24 Query SHOW SLAVE STATUS
24 Query SHOW SLAVE STATUS
170412 16:55:52 24 Query SHOW PROCESSLIST
25 Connect monitor@node4 on
25 Query set autocommit=1
25 Query SELECT CONNECTION_ID() AS Value
25 Query SELECT @@global.read_only As Value
25 Query SELECT @@global.read_only As Value
25 Quit
24 Query SHOW SLAVE STATUS
24 Query SELECT MASTER_POS_WAIT('mysql-bin.000017','120',0) AS Result
24 Query STOP SLAVE SQL_THREAD
24 Query SHOW SLAVE STATUS
24 Query SHOW MASTER STATUS
26 Connect monitor@node4 on
26 Query set autocommit=1
26 Query SELECT CONNECTION_ID() AS Value
26 Query SET sql_log_bin=0
26 Query SELECT @@global.read_only As Value
26 Query SET GLOBAL read_only=0
26 Query SET sql_log_bin=1
26 Quit
24 Query SELECT @@global.read_only As Value
27 Connect repl@node3 on
27 Query SELECT UNIX_TIMESTAMP()
27 Query SHOW VARIABLES LIKE 'SERVER_ID'
27 Query SET @master_heartbeat_period= 1799999979520
27 Query SET @master_binlog_checksum= @@global.binlog_checksum
27 Query SELECT @master_binlog_checksum
27 Query SELECT @@GLOBAL.GTID_MODE
27 Query SHOW VARIABLES LIKE 'SERVER_UUID'
27 Query SET @slave_uuid= '8a1093c8-1d00-11e7-954f-000c299a5715'
27 Binlog Dump Log: 'mysql-bin.000010' Pos: 120
170412 16:55:53 28 Connect repl@node1 on
28 Query SELECT UNIX_TIMESTAMP()
28 Query SHOW VARIABLES LIKE 'SERVER_ID'
28 Query SET @master_heartbeat_period= 1799999979520
28 Query SET @master_binlog_checksum= @@global.binlog_checksum
28 Query SELECT @master_binlog_checksum
28 Query SELECT @@GLOBAL.GTID_MODE
28 Query SHOW VARIABLES LIKE 'SERVER_UUID'
24 Query STOP SLAVE
28 Query SET @slave_uuid= '2a6365e0-1d05-11e7-956d-000c29c64704'
28 Binlog Dump Log: 'mysql-bin.000010' Pos: 120
24 Query SHOW SLAVE STATUS
24 Query RESET SLAVE /*!50516 ALL */
24 Query SHOW SLAVE STATUS
24 Query SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
24 Quit
slave 192.168.244.30
170412 16:52:37 16 Connect monitor@node4 on
16 Query set autocommit=1
16 Query SELECT CONNECTION_ID() AS Value
170412 16:52:38 17 Connect monitor@node4 on
17 Query set autocommit=1
17 Query SELECT CONNECTION_ID() AS Value
17 Query SET wait_timeout=86400
17 Query SELECT @@global.server_id As Value
17 Query SELECT VERSION() AS Value
17 Query SELECT @@global.gtid_mode As Value
17 Query SHOW GLOBAL VARIABLES LIKE 'log_bin'
17 Query SHOW MASTER STATUS
17 Query SELECT @@global.datadir AS Value
17 Query SELECT @@global.slave_parallel_workers AS Value
17 Query SHOW SLAVE STATUS
17 Query SELECT @@global.read_only As Value
17 Query SELECT @@global.relay_log_purge As Value
17 Query SELECT @@global.relay_log_info_repository AS Value
17 Query SELECT @@global.datadir AS Value
17 Query SELECT @@global.relay_log_info_file AS Value
17 Query SHOW SLAVE STATUS
17 Query SELECT Repl_slave_priv AS Value FROM mysql.user WHERE user = 'repl'
170412 16:54:05 17 Query SELECT GET_LOCK('MHA_Master_High_Availability_Failover', '0') AS Value
17 Query SHOW SLAVE STATUS
17 Query SHOW SLAVE STATUS
170412 16:55:50 17 Query SHOW SLAVE STATUS
170412 16:55:51 17 Query SHOW SLAVE STATUS
17 Query SELECT MASTER_POS_WAIT('mysql-bin.000017','120',0) AS Result
17 Query STOP SLAVE SQL_THREAD
17 Query SHOW SLAVE STATUS
17 Query STOP SLAVE
17 Query STOP SLAVE
17 Query SHOW SLAVE STATUS
17 Query RESET SLAVE
17 Query CHANGE MASTER TO MASTER_HOST = '192.168.244.20' MASTER_USER = 'repl' MASTER_PASSWORD = <secret> MASTE
R_PORT = 3306 MASTER_LOG_FILE = 'mysql-bin.000010' MASTER_LOG_POS = 120 17 Query SET GLOBAL relay_log_purge=0
17 Query START SLAVE
18 Connect Out repl@192.168.244.20:3306
17 Query SHOW SLAVE STATUS
170412 16:55:52 17 Query SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
17 Quit
参考