The stop request took longer

Peter Mutisya

Last week, some of our services went down. The cause was our restart script, which runs a stop followed by a start. The stop request took longer than expected, so the service was still running when the start was called. When the service finally went down, it stayed down.

The stop script

Below is our stop script for the service in question. Let's call it servicex.

#!/bin/sh

# kill docker process 
if [ `ps -ef | grep -v grep | grep "servicex" |wc -l` -eq 1 ]
then
        echo "Killing the servicex service - docker instance"

        ps -ef | grep "servicex" | grep -v grep  | awk '{ print $2 }' | xargs kill -9

        echo "Killed the servicex process"
fi

exit 0;

There are a couple of issues with it:

  1. Non-deterministic: The script can exit and still leave the service running.
  2. Forceful: The script executes a force kill (kill -9), which prevents a graceful shutdown.
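
To make the second point concrete, here is a small sketch. It is not part of servicex: the demo_signal helper and its messages are made up for illustration. It relies on the fact that a process can trap SIGTERM (the default signal sent by kill) and run cleanup code, whereas SIGKILL (kill -9) cannot be trapped at all.

```shell
#!/bin/sh
# Illustrative sketch: demo_signal is a hypothetical helper, not part of servicex.
# It starts a child shell that traps SIGTERM, then has the child send itself
# the signal named in $1 (TERM or KILL).
demo_signal() {
    sh -c '
        # Stand-in for a graceful shutdown hook.
        trap "echo cleanup ran; exit 0" TERM
        # The child signals itself; $$ is the PID of this child shell.
        kill -s "$1" $$
        sleep 1
        echo "still alive"
    ' demo "$1"
}
```

Calling demo_signal TERM prints "cleanup ran", because the trap gets a chance to run before the child exits. Calling demo_signal KILL prints nothing from the child: it is terminated immediately and the cleanup never happens.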

I decided to rewrite the script: I added a while loop that checks whether the service has shut down, and I removed the force kill. Here is the new stop script:

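# Only act if exactly one process matches "servicex"
if [ `ps -ef |  grep -v grep | grep "servicex" |wc -l` -eq 1 ]
then
        echo "Killing the old servicex service"

        # Send SIGTERM (the default signal) so the service can shut down gracefully
        ps -ef | grep "servicex" | grep -v grep  | awk '{ print $2 }' | xargs kill

        # Block until the process has disappeared from the process list
        while [ `ps -ef | grep -v grep | grep "servicex" |wc -l` -eq 1 ]
        do
            echo "waiting for service to stop"
            sleep 1
        done

        echo "Killed the old servicex process"
fi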

exit 0;

One final addition reports approximately how long the service took to stop:

if [ `ps -ef | grep -v grep | grep "servicex" |wc -l` -eq 1 ]
then
        echo "Killing the old servicex service"

        ps -ef | grep "servicex" | grep -v grep  | awk '{ print $2 }' | xargs kill

        # Count the seconds spent waiting (approximate: one tick per sleep)
        i=0

        while [ `ps -ef | grep -v grep | grep "servicex" |wc -l` -eq 1 ]
        do
            echo "waiting for service to stop"
            sleep 1
            i=$((i + 1))
        done

        echo "Killed the old servicex process in ${i} seconds"
fi

exit 0;

When you run the stop script, this is logged:

> Killing the old servicex service
> waiting for service to stop
> Killed the old servicex process in 1 seconds
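
One case the loop does not cover is a service that ignores the stop request: the script would then wait forever. A possible extension, sketched here under assumptions, is to give up after a timeout and only then fall back to a force kill. The names stop_with_timeout and is_running and the 30-second default are hypothetical, and for simplicity the sketch identifies the process by PID rather than by grepping the process list.

```shell
#!/bin/sh
# Sketch only: stop_with_timeout and is_running are hypothetical helpers.
# Assumption: the caller already knows the PID of the service.

# Succeeds (returns 0) while a process with the given PID still exists.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Usage: stop_with_timeout PID [TIMEOUT_SECONDS]
# Returns 0 if the process stopped gracefully, 1 if SIGKILL was needed.
stop_with_timeout() {
    pid=$1
    timeout=${2:-30}   # assumed default; tune for the service
    waited=0

    # Ask politely first: plain kill sends SIGTERM.
    kill "$pid" 2>/dev/null

    while is_running "$pid"
    do
        if [ "$waited" -ge "$timeout" ]
        then
            echo "still running after ${waited} seconds, sending SIGKILL"
            kill -9 "$pid" 2>/dev/null
            return 1
        fi
        echo "waiting for service to stop"
        sleep 1
        waited=$((waited + 1))
    done

    echo "stopped in ${waited} seconds"
    return 0
}
```

It would be called as stop_with_timeout "$pid" 30, where $pid is obtained however the service exposes it. The non-zero return value lets a restart script tell a graceful stop from a forced one.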