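1 Reason for Change
After a Tstack network node reboots, the following problems cause virtual machine network failures:
- Problem 1: after the server reboots, neutron reports that the ha port cannot be found when it starts the keepalived process, so keepalived cannot be started; if keepalived on another node then dies, failover is impossible
- Problem 2: after the server reboots, some ports have their tag set to 4095; when critical ports such as ha, qg, and qr get tag 4095, virtual machine networking fails
- Problem 3: after the server reboots, neutron reports that the process already exists when it starts the keepalived process, so keepalived cannot be started; if keepalived on another node then dies, failover is impossible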
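2 Root Cause Analysis
Problems 1 and 2 are both startup-order issues; problem 3 is caused by stale pid files:
- Problem 1: the neutron-l3-agent service starts before the openvswitch service, so when l3 starts and checks the ovs bridges, br-int does not exist yet.
- Problem 2: the neutron-l3-agent service starts before the neutron-openvswitch-agent service and asks neutron-server to bind its ports. Because neutron-openvswitch-agent is not running yet, the binding fails and neutron-server updates the port status in the database to binding failed. When neutron-openvswitch-agent starts later, it finds the ports in the binding failed state and sets their vlan tag to 4095.
- Problem 3: when the keepalived process in a vrouter ns dies and is started again, the keepalived pid file and the pid-vrrp file are not deleted, so after the reboot keepalived is reported as already running.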
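Problem 3 comes down to a pid file outliving its process. As a rough illustration only (this is not the check that neutron or keepalived actually performs), the hypothetical helper `read_live_pid` below treats a pid file as valid only when the recorded process still exists. A pid that has been reused by an unrelated process after the reboot would still pass this simple probe, which is one reason the fix removes the files outright.

```python
import errno
import os


def read_live_pid(pid_file):
    """Return the PID stored in pid_file if that process exists, else None."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        # Missing, unreadable, or non-numeric pid file: nothing is running.
        return None
    if pid <= 0:
        return None
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence, it sends nothing
    except OSError as e:
        if e.errno == errno.ESRCH:
            return None  # no such process: the pid file is stale
        # Any other error (e.g. EPERM) means the process exists.
    return pid
```

A `None` result for a file that is still on disk means the file is stale and can be removed before keepalived is spawned again.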
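3 Change Procedure
- Change scope: 3 network nodes
- Change impact: the neutron services must be restarted after the change; the restart causes a brief network interruption for vxlan virtual machines
3.1 Service Order Change
3.1.1 Backup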
mkdir -p /root/backup/neutron_fix_20190429
cp /usr/lib/systemd/system/neutron-l3-agent.service /root/backup/neutron_fix_20190429
3.1.2 Apply the Change
Edit the Unit section of /usr/lib/systemd/system/neutron-l3-agent.service so that it reads:
Description=OpenStack Neutron Layer 3 Agent
After=syslog.target network.target neutron-openvswitch-agent.service
Requires=neutron-openvswitch-agent.service
Then run:
systemctl daemon-reload
systemctl restart neutron-l3-agent
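Before reloading systemd on all three nodes, it can help to confirm that the edited unit file really carries both directives. The following is a minimal Python 3 sketch under stated assumptions: `missing_dependencies` is a hypothetical helper, and it relies on `configparser` being able to read this simple unit file (with repeated `After=` lines it only sees the last one, so it is not a general systemd parser).

```python
import configparser


def missing_dependencies(unit_text, dependency):
    """Return which of After/Requires in [Unit] do not list `dependency`."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names such as "After" case-sensitive
    parser.read_string(unit_text)
    missing = []
    for key in ("After", "Requires"):
        # A missing section or option is treated as an empty value.
        value = parser.get("Unit", key, fallback="")
        if dependency not in value.split():
            missing.append(key)
    return missing
```

Calling it with the contents of neutron-l3-agent.service and `"neutron-openvswitch-agent.service"` should return an empty list once the change has been applied.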
3.2 L3 keepalived pid Fix
3.2.1 Backup
cp /usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py /root/backup/neutron_fix_20190429/
3.2.2 Apply the Change
Copy keepalived.py from the upgrade package to the server:
cp keepalived.py /usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py
Restart the neutron services:
openstack-service restart neutron
4 Verification
4.1 Basic Function Verification
- Check that the services are running normally
openstack-service status neutron
- Check that virtual machine networking is normal
- Check the l3 agent log for errors
/var/log/neutron/l3-agent.log
- Check the system log for errors
/var/log/messages
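Problem 2 can also be checked directly by looking for router ports that still carry tag 4095. The sketch below is an illustration, not part of the upgrade package: `ports_with_dead_tag` is a hypothetical helper, and it assumes its input is the JSON table (a `headings` list plus `data` rows) printed by a command such as `ovs-vsctl --format=json --columns=name,tag list Port`; verify that layout against the OVS version in use.

```python
import json

DEAD_VLAN_TAG = 4095
ROUTER_PORT_PREFIXES = ("ha-", "qg-", "qr-")


def ports_with_dead_tag(ovs_json):
    """Return the sorted names of ha-/qg-/qr- ports whose tag is 4095."""
    table = json.loads(ovs_json)
    name_idx = table["headings"].index("name")
    tag_idx = table["headings"].index("tag")
    dead = []
    for row in table["data"]:
        name, tag = row[name_idx], row[tag_idx]
        # An unset tag is not an integer in this format, so it never matches.
        if tag == DEAD_VLAN_TAG and name.startswith(ROUTER_PORT_PREFIXES):
            dead.append(name)
    return sorted(dead)
```

An empty list after the reboot test suggests that no key router port was left with the dead VLAN tag.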
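4.2 Fix Verification
Once all neutron services are running normally, pick one vrouter and test HA failover.
- Find the master node (server A) of that vrouter. Run the following on each network node; the node whose output is master is the one.
cat /var/lib/neutron/ha_confs/${VroterID}/state
- Kill the keepalived processes of that vrouter on server A
ps -ef | grep keepalived | grep ${VroterID} | grep -v grep | awk '{print $2}' | xargs kill -9
The master state moves to server B or server C.
- Kill the keepalived processes on whichever of server B/C is now master; the master moves to the remaining node
During each failover, virtual machines behind this vrouter see about 3 s of network interruption.
If virtual machine networking is normal after every failover stage, the change is successful.
Note: manually killing the process is a very disruptive operation and can leave keepalived in split-brain the next time it starts. On each server where the process was killed, do the following: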
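Finding the current master is repeated at every failover stage, so it may be convenient to script the state-file lookup. This is a small sketch under the assumption that the state file lives at the path used above and contains a single word such as master; `read_ha_state` and `is_master` are hypothetical helper names.

```python
import os


def read_ha_state(router_id, ha_confs_dir="/var/lib/neutron/ha_confs"):
    """Return the contents of the router's HA state file, or None if absent."""
    state_file = os.path.join(ha_confs_dir, router_id, "state")
    try:
        with open(state_file) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None


def is_master(router_id, ha_confs_dir="/var/lib/neutron/ha_confs"):
    """True when this node currently holds the master role for the router."""
    return read_ha_state(router_id, ha_confs_dir) == "master"
```

Running `is_master` with the vrouter ID on each of the three network nodes should report exactly one master once failover has settled.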
openstack-service stop neutron
ovs-vsctl del-port ha-xxx
ovs-vsctl del-port qg-xxx
ip netns del ${VroterID}
openstack-service restart neutron
# The strings after ha- and qg- can be looked up with the following command (run it before deleting the namespace)
ip netns exec ${VroterID} ip a
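The ha-xxx and qg-xxx names have to be read out of the `ip a` listing by hand. As an illustration, the sketch below extracts them from captured `ip a` output and prints the matching del-port commands. The helper names are hypothetical, and the regular expression assumes the usual `N: name: <FLAGS>` or `N: name@ifM: <FLAGS>` header lines.

```python
import re

# Matches the interface name at the start of each "N: name..." header line.
_IFACE_RE = re.compile(r"^\d+:\s+([^:@\s]+)", re.MULTILINE)


def ha_and_qg_ports(ip_a_output):
    """Return the ha- and qg- interface names found in `ip a` output."""
    names = _IFACE_RE.findall(ip_a_output)
    return [n for n in names if n.startswith(("ha-", "qg-"))]


def del_port_commands(ip_a_output):
    """Build one `ovs-vsctl del-port` command line per ha-/qg- interface."""
    return ["ovs-vsctl del-port %s" % n for n in ha_and_qg_ports(ip_a_output)]
```

The commands are only printed, not executed, so they can be reviewed before being run on the node.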
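If all of the verification above passes, the update is successful.
5 Rollback
Roll back if a problem that cannot be resolved quickly arises during the change, or if the change affects business after it is completed.
Roll back the service file:
cp /root/backup/neutron_fix_20190429/neutron-l3-agent.service /usr/lib/systemd/system/neutron-l3-agent.service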
systemctl daemon-reload
systemctl restart neutron-l3-agent
Roll back the keepalived code:
cp /root/backup/neutron_fix_20190429/keepalived.py /usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py
openstack-service restart neutron
keepalived.py:
# Copyright (C) 2014 eNovance SAS <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

import errno
import itertools
import os

import netaddr
from oslo_config import cfg
from oslo_log import log as logging

from neutron.i18n import _, _LE
from neutron.agent.linux import external_process
from neutron.agent.linux import utils
from neutron.common import exceptions

VALID_STATES = ['MASTER', 'BACKUP']
VALID_AUTH_TYPES = ['AH', 'PASS']
HA_DEFAULT_PRIORITY = 50
PRIMARY_VIP_RANGE_SIZE = 24
# TODO(amuller): Use L3 agent constant when new constants module is introduced.
FIP_LL_SUBNET = '169.254.30.0/23'
KEEPALIVED_SERVICE_NAME = 'keepalived'
GARP_MASTER_DELAY = 60

LOG = logging.getLogger(__name__)


def get_free_range(parent_range, excluded_ranges, size=PRIMARY_VIP_RANGE_SIZE):
    """Get a free IP range, from parent_range, of the specified size.

    :param parent_range: String representing an IP range. E.g: '169.254.0.0/16'
    :param excluded_ranges: A list of strings to be excluded from parent_range
    :param size: What should be the size of the range returned?
    :return: A string representing an IP range
    """
    free_cidrs = netaddr.IPSet([parent_range]) - netaddr.IPSet(excluded_ranges)
    for cidr in free_cidrs.iter_cidrs():
        if cidr.prefixlen <= size:
            return '%s/%s' % (cidr.network, size)

    raise ValueError(_('Network of size %(size)s, from IP range '
                       '%(parent_range)s excluding IP ranges '
                       '%(excluded_ranges)s was not found.') %
                     {'size': size,
                      'parent_range': parent_range,
                      'excluded_ranges': excluded_ranges})


class InvalidInstanceStateException(exceptions.NeutronException):
    message = _('Invalid instance state: %(state)s, valid states are: '
                '%(valid_states)s')

    def __init__(self, **kwargs):
        if 'valid_states' not in kwargs:
            kwargs['valid_states'] = ', '.join(VALID_STATES)
        super(InvalidInstanceStateException, self).__init__(**kwargs)


class InvalidAuthenticationTypeException(exceptions.NeutronException):
    message = _('Invalid authentication type: %(auth_type)s, '
                'valid types are: %(valid_auth_types)s')

    def __init__(self, **kwargs):
        if 'valid_auth_types' not in kwargs:
            kwargs['valid_auth_types'] = ', '.join(VALID_AUTH_TYPES)
        super(InvalidAuthenticationTypeException, self).__init__(**kwargs)


class KeepalivedVipAddress(object):
    """A virtual address entry of a keepalived configuration."""

    def __init__(self, ip_address, interface_name, scope=None):
        self.ip_address = ip_address
        self.interface_name = interface_name
        self.scope = scope

    def build_config(self):
        result = '%s dev %s' % (self.ip_address, self.interface_name)
        if self.scope:
            result += ' scope %s' % self.scope
        return result


class KeepalivedVirtualRoute(object):
    """A virtual route entry of a keepalived configuration."""

    def __init__(self, destination, nexthop, interface_name=None):
        self.destination = destination
        self.nexthop = nexthop
        self.interface_name = interface_name

    def build_config(self):
        output = '%s via %s' % (self.destination, self.nexthop)
        if self.interface_name:
            output += ' dev %s' % self.interface_name
        return output


class KeepalivedInstance(object):
    """Instance section of a keepalived configuration."""

    def __init__(self, state, interface, vrouter_id, ha_cidrs,
                 priority=HA_DEFAULT_PRIORITY, advert_int=None,
                 mcast_src_ip=None, nopreempt=False,
                 garp_master_delay=GARP_MASTER_DELAY):
        self.name = 'VR_%s' % vrouter_id

        if state not in VALID_STATES:
            raise InvalidInstanceStateException(state=state)

        self.state = state
        self.interface = interface
        self.vrouter_id = vrouter_id
        self.priority = priority
        self.nopreempt = nopreempt
        self.advert_int = advert_int
        self.mcast_src_ip = mcast_src_ip
        self.garp_master_delay = garp_master_delay
        self.track_interfaces = []
        self.vips = []
        self.virtual_routes = []
        self.authentication = None
        metadata_cidr = '169.254.169.254/32'
        self.primary_vip_range = get_free_range(
            parent_range='169.254.0.0/16',
            excluded_ranges=[metadata_cidr, FIP_LL_SUBNET] + ha_cidrs,
            size=PRIMARY_VIP_RANGE_SIZE)

    def set_authentication(self, auth_type, password):
        if auth_type not in VALID_AUTH_TYPES:
            raise InvalidAuthenticationTypeException(auth_type=auth_type)

        self.authentication = (auth_type, password)

    def add_vip(self, ip_cidr, interface_name, scope):
        self.vips.append(KeepalivedVipAddress(ip_cidr, interface_name, scope))

    def remove_vips_vroutes_by_interface(self, interface_name):
        self.vips = [vip for vip in self.vips
                     if vip.interface_name != interface_name]

        self.virtual_routes = [vroute for vroute in self.virtual_routes
                               if vroute.interface_name != interface_name]

    def remove_vip_by_ip_address(self, ip_address):
        self.vips = [vip for vip in self.vips
                     if vip.ip_address != ip_address]

    def get_existing_vip_ip_addresses(self, interface_name):
        return [vip.ip_address for vip in self.vips
                if vip.interface_name == interface_name]

    def _build_track_interface_config(self):
        return itertools.chain(
            ['    track_interface {'],
            ('        %s' % i for i in self.track_interfaces),
            ['    }'])

    def get_primary_vip(self):
        """Return an address in the primary_vip_range CIDR, with the router's
        VRID in the host section.

        For example, if primary_vip_range is 169.254.0.0/24, and this router's
        VRID is 5, the result is 169.254.0.5. Using the VRID assures that
        the primary VIP is consistent amongst HA router instances on different
        nodes.
        """

        ip = (netaddr.IPNetwork(self.primary_vip_range).network +
              self.vrouter_id)
        return str(netaddr.IPNetwork('%s/%s' % (ip, PRIMARY_VIP_RANGE_SIZE)))

    def _build_vips_config(self):
        # NOTE(amuller): The primary VIP must be consistent in order to avoid
        # keepalived bugs. Changing the VIP in the 'virtual_ipaddress' and
        # SIGHUP'ing keepalived can remove virtual routers, including the
        # router's default gateway.
        # We solve this by never changing the VIP in the virtual_ipaddress
        # section, herein known as the primary VIP.
        # The only interface known to exist for HA routers is the HA interface
        # (self.interface). We generate an IP on that device and use it as the
        # primary VIP. The other VIPs (Internal interfaces IPs, the external
        # interface IP and floating IPs) are placed in the
        # virtual_ipaddress_excluded section.

        primary = KeepalivedVipAddress(self.get_primary_vip(), self.interface)
        vips_result = ['    virtual_ipaddress {',
                       '        %s' % primary.build_config(),
                       '    }']

        if self.vips:
            vips_result.extend(
                itertools.chain(['    virtual_ipaddress_excluded {'],
                                ('        %s' % vip.build_config()
                                 for vip in
                                 sorted(self.vips,
                                        key=lambda vip: vip.ip_address)),
                                ['    }']))

        return vips_result

    def _build_virtual_routes_config(self):
        return itertools.chain(['    virtual_routes {'],
                               ('        %s' % route.build_config()
                                for route in self.virtual_routes),
                               ['    }'])

    def build_config(self):
        config = ['vrrp_instance %s {' % self.name,
                  '    state %s' % self.state,
                  '    interface %s' % self.interface,
                  '    virtual_router_id %s' % self.vrouter_id,
                  '    priority %s' % self.priority,
                  '    garp_master_delay %s' % self.garp_master_delay]

        if self.nopreempt:
            config.append('    nopreempt')

        if self.advert_int:
            config.append('    advert_int %s' % self.advert_int)

        if self.authentication:
            auth_type, password = self.authentication
            authentication = ['    authentication {',
                              '        auth_type %s' % auth_type,
                              '        auth_pass %s' % password,
                              '    }']
            config.extend(authentication)

        if self.mcast_src_ip:
            config.append('    mcast_src_ip %s' % self.mcast_src_ip)

        if self.track_interfaces:
            config.extend(self._build_track_interface_config())

        config.extend(self._build_vips_config())

        if self.virtual_routes:
            config.extend(self._build_virtual_routes_config())

        config.append('}')

        return config


class KeepalivedConf(object):
    """A keepalived configuration."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.instances = {}

    def add_instance(self, instance):
        self.instances[instance.vrouter_id] = instance

    def get_instance(self, vrouter_id):
        return self.instances.get(vrouter_id)

    def build_config(self):
        config = []

        for instance in self.instances.values():
            config.extend(instance.build_config())

        return config

    def get_config_str(self):
        """Generates and returns the keepalived configuration.

        :return: Keepalived configuration string.
        """
        return '\n'.join(self.build_config())


class KeepalivedManager(object):
    """Wrapper for keepalived.

    This wrapper permits to write keepalived config files, to start/restart
    keepalived process.

    """

    def __init__(self, resource_id, config, conf_path='/tmp',
                 namespace=None, process_monitor=None):
        self.resource_id = resource_id
        self.config = config
        self.namespace = namespace
        self.process_monitor = process_monitor
        self.conf_path = conf_path
        self.process = None

    def get_conf_dir(self):
        confs_dir = os.path.abspath(os.path.normpath(self.conf_path))
        conf_dir = os.path.join(confs_dir, self.resource_id)
        return conf_dir

    def get_full_config_file_path(self, filename, ensure_conf_dir=True):
        conf_dir = self.get_conf_dir()
        if ensure_conf_dir:
            utils.ensure_dir(conf_dir)
        return os.path.join(conf_dir, filename)

    def _output_config_file(self):
        config_str = self.config.get_config_str()
        config_path = self.get_full_config_file_path('keepalived.conf')
        utils.replace_file(config_path, config_str)

        return config_path

    @staticmethod
    def _safe_remove_pid_file(pid_file):
        try:
            os.remove(pid_file)
        except OSError as e:
            if e.errno != errno.ENOENT:
                LOG.error(_LE("Could not delete file %s, keepalived can "
                              "refuse to start."), pid_file)

    def get_vrrp_pid_file_name(self, base_pid_file):
        return '%s-vrrp' % base_pid_file

    def get_conf_on_disk(self):
        config_path = self.get_full_config_file_path('keepalived.conf')
        try:
            with open(config_path) as conf:
                return conf.read()
        except (OSError, IOError) as e:
            if e.errno != errno.ENOENT:
                raise

    def spawn(self):
        config_path = self._output_config_file()

        keepalived_pm = self.get_process()
        vrrp_pm = self._get_vrrp_process(
            self.get_vrrp_pid_file_name(keepalived_pm.get_pid_file_name()))

        keepalived_pm.default_cmd_callback = (
            self._get_keepalived_process_callback(vrrp_pm, config_path))

        keepalived_pm.enable(reload_cfg=True)

        self.process_monitor.register(uuid=self.resource_id,
                                      service_name=KEEPALIVED_SERVICE_NAME,
                                      monitored_process=keepalived_pm)

        LOG.debug('Keepalived spawned with config %s', config_path)

    def disable(self):
        self.process_monitor.unregister(uuid=self.resource_id,
                                        service_name=KEEPALIVED_SERVICE_NAME)

        pm = self.get_process()
        pm.disable(sig='15')

    def get_process(self, callback=None):
        return external_process.ProcessManager(
            cfg.CONF,
            self.resource_id,
            self.namespace,
            pids_path=self.conf_path,
            default_cmd_callback=callback)

    def _get_vrrp_process(self, pid_file):
        return external_process.ProcessManager(
            cfg.CONF,
            self.resource_id,
            self.namespace,
            pid_file=pid_file)

    def _get_keepalived_process_callback(self, vrrp_pm, config_path):

        def callback(pid_file):
            # If keepalived process crashed unexpectedly, the vrrp process
            # will be orphan and prevent keepalived process to be spawned.
            # A check here will let the l3-agent to kill the orphan process
            # and spawn keepalived successfully.
            if vrrp_pm.active:
                vrrp_pm.disable()

            self._safe_remove_pid_file(pid_file)
            self._safe_remove_pid_file(self.get_vrrp_pid_file_name(pid_file))

            cmd = ['keepalived', '-P',
                   '-f', config_path,
                   '-p', pid_file,
                   '-r', self.get_vrrp_pid_file_name(pid_file)]
            return cmd

        return callback