Copyright notice: this is an original article by the author and may not be reproduced without the author's permission. https://blog.csdn.net/zhonglinzhang/article/details/84245188
What?
Why?
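Drawbacks of letting the kubelet rely on the OOM Killer to reclaim resources:
- System OOM events are compute intensive and can stall the node until the OOM kill has completed
- After the OOM Killer kills containers, the Scheduler may place new Pods on the Node, or the killed containers may simply be restarted on it, which triggers the Node's OOM Killer again; this cycle can repeat indefinitely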
How?
Default eviction flags at kubelet startup
--eviction-hard="imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"
--eviction-max-pod-grace-period="0"
--eviction-minimum-reclaim=""
--eviction-pressure-transition-period="5m0s"
--eviction-soft=""
--eviction-soft-grace-period=""
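Note: thresholds come in two kinds, eviction-soft and eviction-hard. When a soft threshold is reached, pods are given a grace period in which to terminate gracefully; when a hard threshold is reached, pods are killed immediately with no chance to terminate gracefully.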
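To make the flag format concrete, here is a minimal sketch of how a value such as the --eviction-hard default above could be split into individual thresholds. It is not the kubelet's ParseThresholdConfig; the threshold type and the parseThresholds function are hypothetical names invented for this illustration, and quantities such as 100Mi are kept as raw strings rather than parsed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is a hypothetical, simplified stand-in for evictionapi.Threshold.
type threshold struct {
	Signal     string  // e.g. "memory.available"
	Quantity   string  // absolute value such as "100Mi"; empty when Percentage is set
	Percentage float64 // fraction of capacity such as 0.15; zero when Quantity is set
}

// parseThresholds splits a flag value of the form
// "signal<value,signal<value" into thresholds.
func parseThresholds(expr string) ([]threshold, error) {
	if strings.TrimSpace(expr) == "" {
		return nil, nil
	}
	var out []threshold
	for _, part := range strings.Split(expr, ",") {
		fields := strings.SplitN(strings.TrimSpace(part), "<", 2)
		if len(fields) != 2 || fields[0] == "" || fields[1] == "" {
			return nil, fmt.Errorf("invalid threshold %q, expected signal<value", part)
		}
		t := threshold{Signal: fields[0]}
		if strings.HasSuffix(fields[1], "%") {
			// Percentages are stored as a fraction of capacity.
			pct, err := strconv.ParseFloat(strings.TrimSuffix(fields[1], "%"), 64)
			if err != nil || pct <= 0 || pct > 100 {
				return nil, fmt.Errorf("invalid percentage in %q", part)
			}
			t.Percentage = pct / 100
		} else {
			t.Quantity = fields[1]
		}
		out = append(out, t)
	}
	return out, nil
}

func main() {
	ts, err := parseThresholds("imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%")
	if err != nil {
		panic(err)
	}
	for _, t := range ts {
		fmt.Printf("%+v\n", t)
	}
}
```

The same shape applies to --eviction-soft; a soft threshold additionally needs a matching entry in --eviction-soft-grace-period.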
eviction signal
- memory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
- imagefs.inodesFree
- allocatableMemory.available
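Note:
- nodefs: the node's own filesystem, which holds runtime logs and similar data
- imagefs: the filesystem dockerd uses to store images and container writable layers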
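The managerImpl struct
- killPodFunc: set to the killPodNow method
- imageGC: under diskPressure, imageGC deletes unused images
- thresholdsFirstObservedAt: records when each threshold was first observed
- resourceToRankFunc: defines how pods are ranked when choosing what to evict for each resource (the struct listing below names this field signalToRankFunc)
- nodeConditionsLastObservedAt: records when each node condition was last observed because a threshold was met
- notifierInitialized: a bool indicating whether the threshold notifier has been initialized, which determines whether kernel memcg notification can be used to speed up eviction response. It is false when the manager is created; whether kernel memcg notification is used depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.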
// managerImpl implements Manager
type managerImpl struct {
	// used to track time
	clock clock.Clock
	// config is how the manager is configured
	config Config
	// the function to invoke to kill a pod
	killPodFunc KillPodFunc
	// the interface that knows how to do image gc
	imageGC ImageGC
	// the interface that knows how to do container gc
	containerGC ContainerGC
	// protects access to internal state
	sync.RWMutex
	// node conditions are the set of conditions present
	nodeConditions []v1.NodeConditionType
	// captures when a node condition was last observed based on a threshold being met
	nodeConditionsLastObservedAt nodeConditionsObservedAt
	// nodeRef is a reference to the node
	nodeRef *v1.ObjectReference
	// used to record events about the node
	recorder record.EventRecorder
	// used to measure usage stats on system
	summaryProvider stats.SummaryProvider
	// records when a threshold was first observed
	thresholdsFirstObservedAt thresholdsObservedAt
	// records the set of thresholds that have been met (including graceperiod) but not yet resolved
	thresholdsMet []evictionapi.Threshold
	// signalToRankFunc maps a resource to ranking function for that resource.
	signalToRankFunc map[evictionapi.Signal]rankFunc
	// signalToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
	signalToNodeReclaimFuncs map[evictionapi.Signal]nodeReclaimFuncs
	// last observations from synchronize
	lastObservations signalObservations
	// dedicatedImageFs indicates if imagefs is on a separate device from the rootfs
	dedicatedImageFs *bool
	// thresholdNotifiers is a list of memory threshold notifiers which each notify for a memory eviction threshold
	thresholdNotifiers []ThresholdNotifier
	// thresholdsLastUpdated is the last time the thresholdNotifiers were updated.
	thresholdsLastUpdated time.Time
}
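The thresholdsFirstObservedAt field is what makes soft-threshold grace periods work: a crossed threshold only counts as met once it has stayed crossed for its grace period. The following is a minimal sketch of that bookkeeping, assuming a grace period of zero stands for a hard threshold. The types softThreshold and firstObserved and the functions track, metGracePeriod and demo are hypothetical names for this illustration, not the kubelet's own helpers.

```go
package main

import (
	"fmt"
	"time"
)

// softThreshold is a hypothetical, simplified threshold: a signal plus the
// grace period it must stay crossed for (zero means a hard threshold).
type softThreshold struct {
	Signal      string
	GracePeriod time.Duration
}

// firstObserved records when each threshold was first seen as crossed,
// playing the role of thresholdsFirstObservedAt.
type firstObserved map[string]time.Time

// track keeps the earliest observation time for thresholds that are still
// crossed and forgets thresholds that have recovered.
func track(crossed []softThreshold, last firstObserved, now time.Time) firstObserved {
	next := firstObserved{}
	for _, t := range crossed {
		if seen, ok := last[t.Signal]; ok {
			next[t.Signal] = seen
		} else {
			next[t.Signal] = now
		}
	}
	return next
}

// metGracePeriod returns the crossed thresholds whose grace period has elapsed.
func metGracePeriod(crossed []softThreshold, observed firstObserved, now time.Time) []string {
	var met []string
	for _, t := range crossed {
		seen, ok := observed[t.Signal]
		if ok && now.Sub(seen) >= t.GracePeriod {
			met = append(met, t.Signal)
		}
	}
	return met
}

// demo simulates two sync passes 30 seconds apart with both thresholds crossed.
func demo() (first, second []string) {
	crossed := []softThreshold{
		{Signal: "memory.available", GracePeriod: 0},                // hard
		{Signal: "nodefs.available", GracePeriod: 30 * time.Second}, // soft
	}
	start := time.Date(2018, 11, 19, 0, 0, 0, 0, time.UTC)
	observed := track(crossed, firstObserved{}, start)
	first = metGracePeriod(crossed, observed, start)

	later := start.Add(30 * time.Second)
	observed = track(crossed, observed, later)
	second = metGracePeriod(crossed, observed, later)
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first)
	fmt.Println(second)
}
```

In the first pass only the hard threshold is acted on; the soft one joins it once it has been crossed continuously for 30 seconds.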
1. eviction manager initialization
Path: pkg/kubelet/kubelet.go
1.1 eviction configuration parameters
See the default kubelet eviction flags listed above.
thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
if err != nil {
	return nil, err
}
evictionConfig := eviction.Config{
	PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
	MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
	Thresholds:               thresholds,
	KernelMemcgNotification:  experimentalKernelMemcgNotification,
	PodCgroupRoot:            kubeDeps.ContainerManager.GetPodCgroupRoot(),
}
1.2 Initializing the eviction manager
// setup eviction manager
evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)
klet.evictionManager = evictionManager
klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
1.3 Running the eviction manager
The call is buried fairly deep:
- Run(updates <-chan kubetypes.PodUpdate) ->
- fastStatusUpdateOnce() ->
- updateRuntimeUp() ->
- initializeRuntimeDependentModules() ->
- kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
2. The Start function
Path: pkg/kubelet/eviction/eviction_manager.go
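A rough sketch of the monitoring loop, assuming Start does little more than launch a goroutine that calls synchronize repeatedly: when a pass evicts pods it waits for them to be cleaned up, otherwise it sleeps for the monitoring interval. The names syncFunc, runMonitor and demo are hypothetical, and the stop channel is an addition so the example terminates; the real loop is not shown in this article.

```go
package main

import (
	"fmt"
	"time"
)

// syncFunc is a hypothetical stand-in for managerImpl.synchronize: it returns
// the names of the pods evicted in this pass, or nil when nothing was evicted.
type syncFunc func() []string

// runMonitor calls synchronize in a loop until stop is closed. After an
// eviction it hands the pods to cleanup and retries immediately; otherwise it
// sleeps for interval before the next pass.
func runMonitor(synchronize syncFunc, cleanup func([]string), interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if evicted := synchronize(); evicted != nil {
			cleanup(evicted)
		} else {
			time.Sleep(interval)
		}
	}
}

// demo runs three passes: nothing to do, one eviction, then shutdown.
func demo() (passes int, cleaned []string) {
	stop := make(chan struct{})
	pass := func() []string {
		passes++
		switch passes {
		case 2:
			return []string{"pod-a"}
		case 3:
			close(stop)
		}
		return nil
	}
	cleanup := func(pods []string) { cleaned = append(cleaned, pods...) }
	// The loop runs inline here; a long-lived manager would run it in a goroutine.
	runMonitor(pass, cleanup, time.Millisecond, stop)
	return passes, cleaned
}

func main() {
	passes, cleaned := demo()
	fmt.Println(passes, cleaned)
}
```

Retrying immediately after an eviction, rather than sleeping, lets the manager re-check pressure as soon as the evicted pod's resources have been released.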
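3. The synchronize function
3.1 The buildSignalToRankFunc and buildSignalToNodeReclaimFuncs functions
- buildSignalToRankFunc registers the ranking function for each signal
- buildSignalToNodeReclaimFuncs registers the node-level reclaim functions for each signal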
// build the ranking functions (if not yet known)
// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
if m.dedicatedImageFs == nil {
	// note: despite its name, ok holds an error here
	hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
	if ok != nil {
		return nil
	}
	// debug log added by the author of this article
	glog.Infof("zzlin managerImpl synchronize m.dedicatedImageFs == nil hasImageFs: %v", hasImageFs)
	m.dedicatedImageFs = &hasImageFs
	m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
	m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
}
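To show what a signal-to-rank-function map can look like, here is a minimal sketch covering only the memory signals. It assumes pods are ordered by whether their usage exceeds their request, then by priority, then by how far usage exceeds the request; that ordering is an assumption of this example, not something quoted from the source above. The pod type and the functions rankMemory, buildRankFuncs and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, flattened view of what a rank function looks at.
type pod struct {
	Name     string
	Priority int32
	Request  int64 // memory request in bytes
	Usage    int64 // memory working set in bytes
}

// rankFunc sorts pods in place so that the first pod is evicted first.
type rankFunc func(pods []pod)

// rankMemory orders pods by: usage above request first, then lower priority,
// then larger usage relative to request.
func rankMemory(pods []pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		aOver, bOver := a.Usage > a.Request, b.Usage > b.Request
		if aOver != bOver {
			return aOver
		}
		if a.Priority != b.Priority {
			return a.Priority < b.Priority
		}
		return a.Usage-a.Request > b.Usage-b.Request
	})
}

// buildRankFuncs is a hypothetical analogue of buildSignalToRankFunc that
// only registers the memory signals.
func buildRankFuncs() map[string]rankFunc {
	return map[string]rankFunc{
		"memory.available":            rankMemory,
		"allocatableMemory.available": rankMemory,
	}
}

// demo ranks four pods for the memory.available signal and returns their names.
func demo() []string {
	pods := []pod{
		{Name: "guaranteed", Priority: 0, Request: 200, Usage: 100},
		{Name: "burst-high", Priority: 10, Request: 100, Usage: 500},
		{Name: "burst-low-2", Priority: 0, Request: 100, Usage: 150},
		{Name: "burst-low", Priority: 0, Request: 100, Usage: 300},
	}
	rank := buildRankFuncs()["memory.available"]
	rank(pods)
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.Name)
	}
	return names
}

func main() {
	fmt.Println(demo())
}
```

Keeping the ranking behind a map keyed by signal means synchronize can look up the right ordering for whichever threshold is being acted on without a switch statement.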
3.2 The Get function retrieves node and pod information
Path: pkg/kubelet/server/stats/summary.go
activePods := podFunc()
updateStats := true
summary, err := m.summaryProvider.Get(updateStats)
if err != nil {
	glog.Errorf("eviction manager: failed to get get summary stats: %v", err)
	return nil
}
3.3 The makeSignalObservations function
It reports the current resource observation for each of the following signals:
- imagefs.inodesFree
- pid.available
- memory.available
- allocatableMemory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
// make observations and get a function to derive pod usage stats relative to those observations.
observations, statsFunc := makeSignalObservations(summary)
debugLogObservations("observations", observations)
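As a simplified illustration of what an observation carries and how it is compared against a threshold, the sketch below derives memory.available from node memory stats and checks it against an absolute or a percentage threshold. It assumes capacity is taken as available plus working set; the types observation and memoryStats and the functions makeObservations and thresholdMet are hypothetical names for this example only.

```go
package main

import "fmt"

// observation is a hypothetical, simplified signal observation.
type observation struct {
	Available int64 // bytes currently available
	Capacity  int64 // total bytes the signal is measured against
}

// memoryStats holds the two node memory values this sketch needs.
type memoryStats struct {
	AvailableBytes  int64
	WorkingSetBytes int64
}

// makeObservations builds the memory.available observation, assuming
// capacity = available + working set.
func makeObservations(mem memoryStats) map[string]observation {
	return map[string]observation{
		"memory.available": {
			Available: mem.AvailableBytes,
			Capacity:  mem.AvailableBytes + mem.WorkingSetBytes,
		},
	}
}

// thresholdMet reports whether the available amount has dropped below the
// threshold, given either as an absolute quantity in bytes or, when
// percentage is non-zero, as a fraction of capacity.
func thresholdMet(obs observation, quantity int64, percentage float64) bool {
	limit := quantity
	if percentage > 0 {
		limit = int64(percentage * float64(obs.Capacity))
	}
	return obs.Available < limit
}

func main() {
	const mi = 1 << 20
	obs := makeObservations(memoryStats{AvailableBytes: 80 * mi, WorkingSetBytes: 3920 * mi})
	mem := obs["memory.available"]
	// Default hard threshold memory.available<100Mi against 80Mi available.
	fmt.Println(thresholdMet(mem, 100*mi, 0))
	// A 1% threshold of a 4000Mi capacity is 40Mi.
	fmt.Println(thresholdMet(mem, 0, 0.01))
}
```

Carrying capacity alongside the available amount is what allows percentage thresholds such as nodefs.available<10% to be resolved into an absolute value at comparison time.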