Linux Clusters and Automated Operations: 2.6.4 Development Scripts

    xiaoxiao, 2024-01-23

    2.6.4 Development Scripts

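    Business requirements change constantly, and the open-source solutions available on the Internet sometimes cannot cover all of them. In those cases you need to write your own development scripts to meet the needs of the job. Although such scripts can usually run on their own, my practice is to make their output and return codes follow the format Nagios recognizes wherever possible, so that Nagios can send alert emails and messages.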
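    As a rough illustration of what "a format Nagios recognizes" means, the sketch below wraps a check so that it always yields one of the four conventional plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a single status line. The names `run_check`, `LABELS`, and `nagios_exit` are hypothetical helpers introduced for this example and do not come from the original scripts.

```python
import sys

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(check):
    """Run a check callable that returns (state, message).

    Any exception is turned into UNKNOWN, so a broken check is not
    mistaken for a healthy service.
    """
    try:
        state, message = check()
    except Exception as exc:
        state, message = UNKNOWN, "check failed: %s" % exc
    return state, "%s - %s" % (LABELS[state], message)


def nagios_exit(check):
    """Print the status line and exit with the matching code."""
    state, line = run_check(check)
    print(line)
    sys.exit(state)
```

    A real plugin would end with a call such as `nagios_exit(my_check)`, where `my_check` is whatever function performs the measurement.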

    1. Monitoring whether redis is running normally

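    The production NoSQL workloads I deal with mainly use the redis database, mostly to handle high access loads over large volumes of data. To make the most of available resources, each redis instance is allocated only a modest amount of memory, so an instance sometimes crashes when colleagues on the development team import a large IP list. I therefore wrote a redis monitoring script that works together with Nagios. The script is shown below (tested on Amazon Linux AMI x86_64):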

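    #!/usr/bin/python
    # Check redis Nagios plugin. Please install the redis-py module.
    # Arguments: <host> <port> <warning threshold in GB> <critical threshold in GB>
    import redis
    import sys

    STATUS_OK = 0
    STATUS_WARNING = 1
    STATUS_CRITICAL = 2
    STATUS_UNKNOWN = 3

    # Report UNKNOWN instead of a traceback when arguments are missing.
    if len(sys.argv) != 5:
        print('Usage: %s <host> <port> <warning_gb> <critical_gb>' % sys.argv[0])
        sys.exit(STATUS_UNKNOWN)

    HOST = sys.argv[1]
    PORT = int(sys.argv[2])
    WARNING = float(sys.argv[3])
    CRITICAL = float(sys.argv[4])

    def connect_redis(host, port):
        # Time out after 5 seconds so a hung instance cannot stall the check.
        r = redis.Redis(host, port, socket_timeout=5, socket_connect_timeout=5)
        return r

    def main():
        r = connect_redis(HOST, PORT)
        try:
            r.ping()
        except Exception:
            print('%s %s down' % (HOST, PORT))
            sys.exit(STATUS_CRITICAL)

        redis_info = r.info()
        # used_memory is reported in bytes; convert it to GB as a float.
        used_mem = redis_info['used_memory'] / (1024.0 ** 3)
        used_mem_human = redis_info['used_memory_human']

        if WARNING <= used_mem < CRITICAL:
            print('%s %s use memory warning %s' % (HOST, PORT, used_mem_human))
            sys.exit(STATUS_WARNING)
        elif used_mem >= CRITICAL:
            print('%s %s use memory critical %s' % (HOST, PORT, used_mem_human))
            sys.exit(STATUS_CRITICAL)
        else:
            print('%s %s use memory ok %s' % (HOST, PORT, used_mem_human))
            sys.exit(STATUS_OK)

    if __name__ == '__main__':
        main()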
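    One way to make the threshold logic of such a check testable without a running redis server is to move it into a pure function. The sketch below assumes a dictionary shaped like the result of `redis.Redis(...).info()`. The function name `classify_memory` and the sample values are hypothetical.

```python
STATUS_OK, STATUS_WARNING, STATUS_CRITICAL = 0, 1, 2
GIB = 1024.0 ** 3  # bytes per GB, as used by the script


def classify_memory(info, warning_gb, critical_gb):
    """Return (status, used_gb) for a dict shaped like redis INFO output."""
    used_gb = info["used_memory"] / GIB
    if used_gb >= critical_gb:
        return STATUS_CRITICAL, used_gb
    if used_gb >= warning_gb:
        return STATUS_WARNING, used_gb
    return STATUS_OK, used_gb


# Hypothetical INFO snapshots; real values would come from r.info().
samples = [
    {"used_memory": 512 * 1024 ** 2},  # 0.5 GB
    {"used_memory": 3 * 1024 ** 3},    # 3 GB
    {"used_memory": 5 * 1024 ** 3},    # 5 GB
]
for sample in samples:
    status, used = classify_memory(sample, warning_gb=2.0, critical_gb=4.0)
    print("status=%d used=%.2fG" % (status, used))
```

    Checking the critical threshold first keeps the branches mutually exclusive, so a value exactly on a boundary always lands in one state.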

    2. Monitoring the number of IP connections on a machine

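    The requirement is fairly simple: count the IP connections first. If ip_conns is below 15,000 the state is OK, between 15,000 and 20,000 it is a warning, and above 20,000 it is critical. The script is shown below (tested on Amazon Linux AMI x86_64):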

    #!/bin/bash
    # Nagios plugin for IP connections
    # $1 = warning threshold (e.g. 15000), $2 = critical threshold (e.g. 20000)
    ip_conns=$(netstat -an | grep tcp | grep EST | wc -l)
    # Per-state connection counts joined into one comma-separated line
    messages=$(netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' | tr -s '\n' ',' | sed -r 's/(.*),/\1\n/g')

    # A single if/elif/else chain so that values equal to a threshold are not skipped
    if [ "$ip_conns" -lt "$1" ]
    then
        echo "$messages,OK -connect counts is $ip_conns"
        exit 0
    elif [ "$ip_conns" -le "$2" ]
    then
        echo "$messages,Warning -connect counts is $ip_conns"
        exit 1
    else
        echo "$messages,Critical -connect counts is $ip_conns"
        exit 2
    fi
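    The awk one-liner in the script above counts TCP connections by state. The same idea can be sketched in Python as follows. The sample text is a hypothetical, shortened `netstat -ant` output, and `count_tcp_states` and `summarize` are names introduced only for this illustration.

```python
from collections import Counter

# Hypothetical `netstat -ant` output, shortened for illustration.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.7:40022          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.8:40100          TIME_WAIT
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""


def count_tcp_states(netstat_output):
    """Count connections per TCP state, keyed by the last column."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Like awk's /^tcp/, this matches both tcp and tcp6 rows.
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


def summarize(states):
    """Join the counts into one comma-separated line for the status text."""
    return ",".join("%s %d" % (name, states[name]) for name in sorted(states))


states = count_tcp_states(SAMPLE)
print(summarize(states))
print("established: %d" % states["ESTABLISHED"])
```

    Sorting the state names keeps the status line stable between runs, which awk's `for (a in S)` loop does not guarantee.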

    3. Script for monitoring CPU utilization on a machine

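    On the production bidder machines, CPU utilization (sys% + user%) can reach 100% during peak business hours, so that further traffic directed at them cannot get in at all, even though the machine, the system load, and the Nginx+Lua processes all look completely normal. For this situation we need a CPU utilization script that raises an alert when a custom threshold is exceeded, so that operations staff can add bidder AMI machines in bulk to absorb the peak (AWS EC2 instances can be billed by the hour). Take care here to distinguish system load from CPU utilization. The script is shown below (tested on Amazon Linux AMI x86_64):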

    #!/bin/bash
    # ==============================================================================
    # CPU Utilization Statistics plugin for Nagios
    #
    # USAGE     :   ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] ( [ -i <intervals in second> ] [ -n <report number> ])
    #
    # Example: ./check_cpu_utili.sh
    #          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40
    #          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40 -i 3 -n 5
    #-------------------------------------------------------------------------------
    # Paths to commands used in this script.  These may have to be modified to match your system setup.
    IOSTAT="/usr/bin/iostat"

    # Nagios return codes
    STATE_OK=0
    STATE_WARNING=1
    STATE_CRITICAL=2
    STATE_UNKNOWN=3

    # Default plugin parameter values
    LIST_WARNING_THRESHOLD="70,40,30"
    LIST_CRITICAL_THRESHOLD="90,60,40"
    INTERVAL_SEC=1
    # The first iostat report covers the time since boot, so request several
    # reports and use the last one (this also matches the help text default).
    NUM_REPORT=3
    # Plugin variable description
    PROGNAME=$(basename "$0")

    if [ ! -x "$IOSTAT" ]; then
        echo "UNKNOWN: iostat not found or is not executable by the nagios user."
        exit $STATE_UNKNOWN
    fi

     

    # Print usage only; callers decide the exit code.
    print_usage() {
        echo ""
        echo "$PROGNAME - CPU Utilization check script for Nagios"
        echo ""
        echo "Usage: check_cpu_utili.sh -w -c (-i -n)"
        echo ""
        echo "  -w  Warning threshold in % for warn_user,warn_system,warn_iowait CPU (default : 70,40,30)"
        echo "  Exit with WARNING status if cpu exceeds warn_n"
        echo "  -c  Critical threshold in % for crit_user,crit_system,crit_iowait CPU (default : 90,60,40)"
        echo "  Exit with CRITICAL status if cpu exceeds crit_n"
        echo "  -i  Interval in seconds for iostat (default : 1)"
        echo "  -n  Number report for iostat (default : 3)"
        echo "  -h  Show this page"
        echo ""
        echo "Usage: $PROGNAME"
        echo "Usage: $PROGNAME --help"
        echo ""
    }

    print_help() {
        print_usage
        echo ""
        echo "This plugin will check cpu utilization (user,system,CPU_Iowait in %)"
        echo ""
        exit 0
    }

     

    # Parse parameters
    while [ $# -gt 0 ]; do
        case "$1" in
            -h | --help)
                print_help
                exit $STATE_OK
                ;;
            -w | --warning)
                shift
                LIST_WARNING_THRESHOLD=$1
                ;;
            -c | --critical)
                shift
                LIST_CRITICAL_THRESHOLD=$1
                ;;
            -i | --interval)
                shift
                INTERVAL_SEC=$1
                ;;
            -n | --number)
                shift
                NUM_REPORT=$1
                ;;
            *)  echo "Unknown argument: $1"
                print_usage
                exit $STATE_UNKNOWN
                ;;
        esac
        shift
    done

     

    # List to Table for warning threshold (compatibility with
    TAB_WARNING_THRESHOLD=($(echo "$LIST_WARNING_THRESHOLD" | sed 's/,/ /g'))
    if [ "${#TAB_WARNING_THRESHOLD[@]}" -ne 3 ]; then
        echo "ERROR : Bad count parameter in Warning Threshold"
        # A bad parameter is a plugin usage error, so report UNKNOWN
        exit $STATE_UNKNOWN
    else
        USER_WARNING_THRESHOLD=${TAB_WARNING_THRESHOLD[0]}
        SYSTEM_WARNING_THRESHOLD=${TAB_WARNING_THRESHOLD[1]}
        IOWAIT_WARNING_THRESHOLD=${TAB_WARNING_THRESHOLD[2]}
    fi

    # List to Table for critical threshold
    TAB_CRITICAL_THRESHOLD=($(echo "$LIST_CRITICAL_THRESHOLD" | sed 's/,/ /g'))
    if [ "${#TAB_CRITICAL_THRESHOLD[@]}" -ne 3 ]; then
        echo "ERROR : Bad count parameter in CRITICAL Threshold"
        exit $STATE_UNKNOWN
    else
        USER_CRITICAL_THRESHOLD=${TAB_CRITICAL_THRESHOLD[0]}
        SYSTEM_CRITICAL_THRESHOLD=${TAB_CRITICAL_THRESHOLD[1]}
        IOWAIT_CRITICAL_THRESHOLD=${TAB_CRITICAL_THRESHOLD[2]}
    fi

    # Each warning threshold must be lower than its critical threshold
    if [ "${TAB_WARNING_THRESHOLD[0]}" -ge "${TAB_CRITICAL_THRESHOLD[0]}" ] || [ "${TAB_WARNING_THRESHOLD[1]}" -ge "${TAB_CRITICAL_THRESHOLD[1]}" ] || [ "${TAB_WARNING_THRESHOLD[2]}" -ge "${TAB_CRITICAL_THRESHOLD[2]}" ]; then
        echo "ERROR : Critical CPU Threshold lower than Warning CPU Threshold"
        exit $STATE_UNKNOWN
    fi

     

    # Take the last iostat report; commas become dots for locales that print a decimal comma
    CPU_REPORT=$("$IOSTAT" -c "$INTERVAL_SEC" "$NUM_REPORT" | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1)
    CPU_REPORT_SECTIONS=$(echo "${CPU_REPORT}" | grep ';' -o | wc -l)
    # Fields after the leading ';': 2=user 3=nice 4=system 5=iowait 6=steal 7=idle
    CPU_USER=$(echo "$CPU_REPORT" | cut -d ";" -f 2)
    CPU_SYSTEM=$(echo "$CPU_REPORT" | cut -d ";" -f 4)
    CPU_IOWAIT=$(echo "$CPU_REPORT" | cut -d ";" -f 5)
    CPU_STEAL=$(echo "$CPU_REPORT" | cut -d ";" -f 6)
    CPU_IDLE=$(echo "$CPU_REPORT" | cut -d ";" -f 7)
    NAGIOS_STATUS="user=${CPU_USER}%,system=${CPU_SYSTEM}%,iowait=${CPU_IOWAIT}%,idle=${CPU_IDLE}%"
    NAGIOS_DATA="CpuUser=${CPU_USER};${TAB_WARNING_THRESHOLD[0]};${TAB_CRITICAL_THRESHOLD[0]};0"

    # Integer parts, because the shell test command only compares integers
    CPU_USER_MAJOR=$(echo "$CPU_USER" | cut -d "." -f 1)
    CPU_SYSTEM_MAJOR=$(echo "$CPU_SYSTEM" | cut -d "." -f 1)
    CPU_IOWAIT_MAJOR=$(echo "$CPU_IOWAIT" | cut -d "." -f 1)
    CPU_IDLE_MAJOR=$(echo "$CPU_IDLE" | cut -d "." -f 1)

     

    # Return
    # Performance data uses the configured thresholds instead of hard-coded values
    PERFDATA="CPU_USER=${CPU_USER}%;${USER_WARNING_THRESHOLD};${USER_CRITICAL_THRESHOLD};0;100"
    if [ "${CPU_USER_MAJOR}" -ge "$USER_CRITICAL_THRESHOLD" ] || [ "${CPU_SYSTEM_MAJOR}" -ge "$SYSTEM_CRITICAL_THRESHOLD" ] || [ "${CPU_IOWAIT_MAJOR}" -ge "$IOWAIT_CRITICAL_THRESHOLD" ]; then
        echo "CPU STATISTICS CRITICAL:${NAGIOS_STATUS} | ${PERFDATA}"
        exit $STATE_CRITICAL
    elif [ "${CPU_USER_MAJOR}" -ge "$USER_WARNING_THRESHOLD" ] || [ "${CPU_SYSTEM_MAJOR}" -ge "$SYSTEM_WARNING_THRESHOLD" ] || [ "${CPU_IOWAIT_MAJOR}" -ge "$IOWAIT_WARNING_THRESHOLD" ]; then
        echo "CPU STATISTICS WARNING:${NAGIOS_STATUS} | ${PERFDATA}"
        exit $STATE_WARNING
    else
        echo "CPU STATISTICS OK:${NAGIOS_STATUS} | ${PERFDATA}"
        exit $STATE_OK
    fi
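    The field extraction that the script performs with `sed`, `tr`, and `cut` can also be sketched in Python, which may make the column positions easier to follow. The sample below is a hypothetical `iostat -c` report printed in a locale that uses a decimal comma. `parse_iostat_cpu` and `FIELDS` are names introduced only for this example.

```python
# Column order of the avg-cpu line printed by `iostat -c`.
FIELDS = ("user", "nice", "system", "iowait", "steal", "idle")

# Hypothetical report; the decimal commas mimic a non-English locale.
SAMPLE = """\
Linux 4.14.0 (host)   01/23/2024   _x86_64_   (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,50    0,00    4,25    1,00    0,25   82,00
"""


def parse_iostat_cpu(output):
    """Return the last CPU report as a dict of floats."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Same normalization as sed 's/,/./g' in the shell script.
    values = lines[-1].replace(",", ".").split()
    return dict(zip(FIELDS, (float(v) for v in values)))


report = parse_iostat_cpu(SAMPLE)
print("user=%(user).2f system=%(system).2f iowait=%(iowait).2f" % report)
```

    Because the values are parsed as floats, they can be compared with thresholds directly, without the integer truncation the shell version needs.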

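    The check_cpu_utili.sh script above draws on material from the official Nagios site https://exchange.nagios.org/, with the code trimmed and ported. The original code ran under ksh and has been ported to bash here; arrays are defined differently in ksh than in bash. Note also that the shell itself does not support floating-point arithmetic, but bc or awk can be used to handle it.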
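    To make the distinction between system load and CPU utilization more concrete, the sketch below computes utilization (user plus system time, as defined in this section) from two samples of the aggregate `cpu` line in `/proc/stat`. The two sample lines are hypothetical, and `parse_cpu_line`, `cpu_utilization`, and `load_per_core` are names introduced for this illustration.

```python
import os


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user, so it is left out.
    return [int(v) for v in fields[1:9]]


def cpu_utilization(before, after):
    """Percent of elapsed CPU time spent in user + system mode."""
    delta = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(delta)
    if total == 0:
        return 0.0
    return 100.0 * (delta[0] + delta[2]) / total


def load_per_core():
    """1-minute load average per core: a queue length, not a percentage."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


# Two hypothetical samples taken a short time apart.
BEFORE = "cpu  1000 0 500 8000 100 0 0 0 0 0"
AFTER = "cpu  1600 0 800 8050 150 0 0 0 0 0"
print("utilization: %.1f%%" % cpu_utilization(BEFORE, AFTER))
```

    Load average counts tasks that are running or waiting, so it can stay moderate while the CPUs are fully busy, which is the situation described for the bidder machines.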

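    In addition, if the check is to be graphed with PNP4nagios (which lets you observe CPU utilization peaks over a period of time), the script can be trimmed further. The script is shown below (tested on Amazon Linux AMI x86_64):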

    #!/bin/bash
    # Nagios return codes
    STATE_OK=0
    STATE_WARNING=1
    STATE_CRITICAL=2
    STATE_UNKNOWN=3

    # Default plugin parameter values
    LIST_WARNING_THRESHOLD="90"
    LIST_CRITICAL_THRESHOLD="95"
    INTERVAL_SEC=1
    NUM_REPORT=5

    # Use the last of NUM_REPORT reports taken INTERVAL_SEC seconds apart
    CPU_REPORT=$(iostat -c $INTERVAL_SEC $NUM_REPORT | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1)
    CPU_USER=$(echo "$CPU_REPORT" | cut -d ";" -f 2)
    CPU_SYSTEM=$(echo "$CPU_REPORT" | cut -d ";" -f 4)
    # The shell only compares integers, so add with bc and keep the integer part
    CPU_UTILI_COU=$(echo "${CPU_USER} + ${CPU_SYSTEM}" | bc)
    CPU_UTILI_COUNTER=$(echo "$CPU_UTILI_COU" | cut -d "." -f 1)
    # bc prints values below 1 without a leading zero (e.g. .75), which leaves the integer part empty
    CPU_UTILI_COUNTER=${CPU_UTILI_COUNTER:-0}
    PERFDATA="CPUCOU=${CPU_UTILI_COU}%;${LIST_WARNING_THRESHOLD};${LIST_CRITICAL_THRESHOLD}"

    # Return
    if [ "${CPU_UTILI_COUNTER}" -lt "${LIST_WARNING_THRESHOLD}" ]
    then
        echo "OK - CPUCOU=${CPU_UTILI_COU}% | ${PERFDATA}"
        exit ${STATE_OK}
    elif [ "${CPU_UTILI_COUNTER}" -lt "${LIST_CRITICAL_THRESHOLD}" ]
    then
        echo "Warning - CPUCOU=${CPU_UTILI_COU}% | ${PERFDATA}"
        exit ${STATE_WARNING}
    else
        echo "Critical - CPUCOU=${CPU_UTILI_COU}% | ${PERFDATA}"
        exit ${STATE_CRITICAL}
    fi
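    Graphing tools such as PNP4nagios read the part of the plugin output after the `|` character, known as performance data, in the form `label=value[UOM];warn;crit;min;max`. The sketch below shows one way to build such a line. `perfdata` and `status_line` are hypothetical helpers, and the 0 to 100 range is an assumption that suits percentages.

```python
def perfdata(label, value, warn, crit, uom="%", minimum=0, maximum=100):
    """Format one metric as Nagios performance data."""
    return "%s=%s%s;%s;%s;%s;%s" % (label, value, uom, warn, crit, minimum, maximum)


def status_line(state, label, value, warn, crit):
    """Human-readable text, then '|', then the machine-readable metric."""
    return "%s - %s=%s%% | %s" % (state, label, value, perfdata(label, value, warn, crit))


# Thresholds 90/95 match the defaults of the trimmed script above.
print(status_line("OK", "CPUCOU", 42.5, 90, 95))
```

    Keeping the thresholds in the performance data identical to the ones used for the state decision lets the graph show the same warning and critical lines that trigger alerts.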
