What problems did we run into when handling disk alerts?

Author: 郭晓东
Date: 2018-08-01


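Disk alerts are one of the problems operations engineers run into most often. At first, when alerts were infrequent, we cleaned up by hand or wrote a few simple scripts to get by. As business logs grew, disk alerts became more and more frequent and several different versions of the cleanup script appeared, yet the alerts still interrupted the operations team about as much as before. So we set out to write a reasonably general-purpose disk cleanup script.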

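While writing the script, we considered several questions:
1. Which logs to clean up should be configurable.
2. Cleanup permissions must be tightly controlled so that files are not deleted by mistake.
3. How long logs are retained should also be configurable.
4. Every deleted file should be recorded.
5. Before deleting a file, check whether its fd has been closed; otherwise deleting it does not free any disk space.
6. Cleanup should start only when disk usage is above 60%.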
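
Points 1 to 3 can be sketched as a small config parser. This is a minimal sketch, not the author's code: `parse_rules` is a hypothetical helper, and the one-rule-per-line format `<directory> <retention-hours>` is an assumption inferred from how the script below reads `hd_delete.config`. Unlike the script, it does not check that the directory exists.

```shell
#!/bin/bash
# Assumed config format: "<directory> <retention-hours>", one rule per line.
# Lines starting with '#' and blank lines are ignored.

# Print only the valid rules from the config file given as $1.
parse_rules() {
    local config=${1:?config file required}
    local dir hours
    while read -r dir hours _; do
        # Skip comments and blank lines.
        [[ -z $dir || $dir == \#* ]] && continue
        # Retention must be a non-negative integer.
        [[ $hours =~ ^[0-9]+$ ]] || continue
        # Refuse protected top-level directories ("/" becomes "" once the
        # trailing slash is stripped).
        case "${dir%/}" in
            ""|/export|/export/servers) continue ;;
        esac
        printf '%s %s\n' "$dir" "$hours"
    done < "$config"
}
```

Rejecting `/`, `/export`, and `/export/servers` before any deletion runs is what keeps a typo in the config from wiping a whole tree.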

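We wrote the first version of the cleanup script and quickly put it into production. A few days later there were still disk alerts. Logging in to the machine showed that the Package directory held 10 historical package versions, taking up more than 20 GB in total. So we added cleanup logic for that directory that keeps only the 3 most recent packages, and disk usage came back down.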
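
The "keep only the newest 3" rule can be sketched as follows. `list_old_packages` is a hypothetical name, and the sketch assumes directory names contain no whitespace or newlines, since it parses `ls` output. It only prints the candidates; deleting them is left to the caller.

```shell
#!/bin/bash
# Print every subdirectory of $1 except the $2 most recently modified ones.
list_old_packages() {
    local dir=${1:?directory required}
    local keep=${2:-3}
    # ls -1dt sorts newest first; tail skips the first $keep entries,
    # so what remains are the older directories.
    ls -1dt "$dir"/*/ 2>/dev/null | tail -n "+$((keep + 1))"
}
```

A call such as `list_old_packages /export/Packages/myapp 3` would list the directories that are safe to remove under that assumption; `myapp` is a made-up name.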
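After a quiet period, a few more disk alerts fired. Investigation showed that our log retention period was 7 days, but business logs had suddenly surged 2 days earlier and filled the disk outright. The problem to solve was how to keep the disk from being filled within a short time. The fix: check disk usage on every run. If usage has passed 60% and keeps growing beyond 80%, stop honoring the retention period and start deleting data older than 5 minutes, until usage drops below 80%.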
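
The two thresholds can be sketched as a small decision function. `disk_usage` and `cleanup_mode` are hypothetical names used for illustration, not functions from the script; the sketch relies on the POSIX output format of `df -P`, where the fifth column is the capacity percentage.

```shell
#!/bin/bash
# Print the usage percentage (without '%') of the filesystem holding path $1.
disk_usage() {
    df -P "${1:?path required}" | awk 'NR==2 {sub(/%/, "", $5); print $5}'
}

# Map a usage percentage to a cleanup mode:
#   none       - usage is at or below 60%, nothing to do
#   retention  - above 60%, delete files older than the configured retention
#   aggressive - above 80%, ignore retention and delete files older than 5 minutes
cleanup_mode() {
    local usage=${1:?usage required}
    if (( usage > 80 )); then
        echo aggressive
    elif (( usage > 60 )); then
        echo retention
    else
        echo none
    fi
}
```

For example, `cleanup_mode "$(disk_usage /export)"` would tell a cron job which stage to run on that invocation.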

The script:

#!/bin/bash
#set -x

lockfile=$(dirname "$0")/.hd_delete.lock
config=$(dirname "$0")/hd_delete.config
logfile=$(dirname "$0")/hd_delete.log
timestamp=$(date "+%Y-%m-%d %H:%M:%S")
tempconfig=$(dirname "$0")/.hd_delete.temp
# lsof is needed to detect files that are still open
type lsof >/dev/null 2>&1 || yum -y install lsof

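# Delete a file; if a process still has it open, truncate it instead,
# because removing an open file does not free its disk space.
safeDelete() {
    fileName=${1?}
    test -f "$fileName" || return
    if lsof "$fileName" >/dev/null 2>&1; then
        cat /dev/null > "$fileName"
    else
        /bin/rm -f "$fileName"
    fi
    echo "$timestamp - $fileName deleted by hd_delete cron" >> "$logfile"
}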

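# Delete files under $1 that were last modified more than $2 minutes ago.
cleanup() {
    path=${1?}
    min=${2?}
    find "$path" -type f -mmin +"$min" | while read -r logFile
    do
        safeDelete "$logFile"
    done
}

# Delete files under $1 that are larger than $2 GB.
cleanBigfile() {
    path=${1?}
    max=${2?}
    find "$path" -type f -size +"${max}"G | while read -r logFile
    do
        safeDelete "$logFile"
    done
}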

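# Print the usage percentage of /export, without the % sign.
ExportUsage() {
    res=$(df /export | awk '$NF=="/export"{print $(NF-1)}' | sed 's/%//g')
    echo "$res"
}

# Print the usage percentage of /, without the % sign.
RootUsage() {
    res=$(df / | awk '$NF=="/"{print $(NF-1)}' | sed 's/%//g')
    echo "$res"
}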

# Keep only the 3 most recently modified package directories.
PackageClean() {
    res=$(find /export/Packages/*/*/ -maxdepth 0 -type d | wc -l)
    [[ $res -gt 3 ]] || return
    rc=$((res - 3))
    # ls -dtr sorts oldest first, so head picks the directories to remove
    rd=$(find /export/Packages/*/*/ -maxdepth 0 -type d | xargs ls -dtr | head -n "$rc")
    for x in $rd
    do
        [[ x"$x" != x"/" ]] || return 1
        [[ x"$x" != x"/export" && x"$x" != x"/export/" ]] || return 1
        [[ x"$x" != x"/export/servers" && x"$x" != x"/export/servers/" ]] || return 1

        /usr/bin/rm -rf "$x"
        echo "$timestamp - $x deleted by hd_delete cron" >> "$logfile"
    done
}

# Truncate every catalina.out that has grown beyond 1 GB.
CataClean() {
    find /export/Logs/* -type f -name catalina.out -size +1G | while read -r f
    do
        : > "$f"
        echo "$timestamp - $f exceeded 1G, truncated by hd_delete cron" >> "$logfile"
    done
}

# Accept a config line only if it is "<existing directory> <integer>"
# and the directory is not a protected top-level path.
validateConfig() {
    x=$1; y=$2
    [[ x"$x" != x"" && x"$y" != x"" ]] || return 1
    [[ -d $x ]] || return 1
    [[ x"$x" != x"/" ]] || return 1
    [[ x"$x" != x"/export" && x"$x" != x"/export/" ]] || return 1
    [[ x"$x" != x"/export/servers" && x"$x" != x"/export/servers/" ]] || return 1
    echo "$y" | grep -q '^[0-9]\+$' || return 1
}

main() {
    # Strip comments and blank lines from the config
    sed -e '/^#.*/d' -e '/^$/d' "$config" > "$tempconfig"
    # m and n are the largest and smallest values in column 2 of the config
    m=$(awk 'BEGIN{x=0}{if(x<$2){x=$2}}END{print x}' "$tempconfig")
    n=$(awk 'BEGIN{x=1000}{if(x>$2 && $2 != ""){x=$2}}END{print x}' "$tempconfig")
    [[ x"$m" != x"" && x"$n" != x"" ]] || return 1

    # Stage 1: while usage is above 60%, clean by the configured retention
    for ((i=m;i>=n;i--))
    do
        diskUsage=$(ExportUsage)
        diskUsage=${diskUsage:-0}

        rootUsage=$(RootUsage)
        rootUsage=${rootUsage:-0}
        (( diskUsage > 60 )) || (( rootUsage > 60 )) || break
        while read -r line
        do
            validateConfig $line || continue
            p=$(echo "$line" | awk '{print $1}')
            k=$(echo "$line" | awk '{print $2}')
            # Column 2 is multiplied by 60 and passed to find -mmin
            k=$(expr "${k?}" '*' 60)
            # Note: i is still in the config's units while k is now in minutes
            (( i>=k )) && cleanup "$p" "$k"
            cleanBigfile "$p" 1
        done < "$tempconfig"
    done

    # Stage 2: while usage is above 80%, delete files older than 5 minutes
    for (( i=n;i>=1;i-- ))
    do
        diskUsage=$(ExportUsage)
        diskUsage=${diskUsage:-0}

        rootUsage=$(RootUsage)
        rootUsage=${rootUsage:-0}

        (( diskUsage > 80 )) || (( rootUsage > 80 )) || break
        while read -r line
        do
            validateConfig $line || continue
            p=$(echo "$line" | awk '{print $1}')
            cleanup "$p" 5
        done < "$tempconfig"
    done
}

# Exit if another run still holds the lock
test ! -f "$lockfile" || { echo "hd_delete is running or lock file not released"; exit 1; }
# Refuse to run as root
[ "$(whoami)" != 'root' ] || { echo "[warn] don't use root privilege for auto_delete"; exit 1; }
touch "$lockfile"
main
PackageClean
CataClean
/bin/rm -f "$lockfile"

Reposted from blog.csdn.net/zhinengyunwei/article/details/104033607