Avoid Blind Wait In DevOps Code

Occasionally DevOps code needs to check a status and wait before running further steps. For example: wait for service A to be up, then start service B; confirm a TCP port is listening, then send requests; and so on.

For simplicity, or under time pressure, people often fall back on a blind wait like “sleep 10”. This is not good enough. How can we improve on it at an affordable cost?

Blind Wait

Let’s examine the automation requirement below, which is quite common in the daily life of DevOps. You’re asked to start service1, then service2. However, you can only start service2 after service1 is up and running properly. Otherwise service2 may fail to start or behave unexpectedly.

1.1 Solution v1.0: blind wait

service service1 start
sleep 10
service service2 start

Here we wait for a fixed period (10 seconds) in between. The good news is that it may work in most cases. However, this implementation has two drawbacks:

  • No Guarantee Of Assumption. Even after waiting 10 seconds, we can’t be sure service1 is up, so starting service2 may still fail. Worse, running the following steps on this false assumption may lead to unexpected situations.
  • Waste Of Time. Let’s say service1 usually takes less than 4 seconds to start. Then every run wastes over 6 seconds on the blind wait.

To improve this, we can keep polling the status of service1 until it is up. Though it’s usually hard to say whether a service is 100% healthy, we can make a safe trade-off: if “service XXX status” reports running, or the service’s TCP port is listening, we can say the service is probably OK.
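
Both checks can be wrapped as small probe functions that return 0 when the service looks up. This is a minimal sketch, assuming an LSB-style init script whose `status` action exits 0 when the service is running, and using bash’s built-in `/dev/tcp` redirection as a dependency-free alternative to `lsof`. The function names and the port 8080 are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical probe: succeeds if "service <name> status" exits 0.
# Assumes an LSB-style init script, where status returns 0 when running.
service_is_running() {
    local name="$1"
    service "$name" status >/dev/null 2>&1
}

# Hypothetical probe: succeeds if something accepts TCP connections on host:port.
# Uses bash's /dev/tcp redirection; "timeout 1" keeps a filtered port from hanging.
port_is_listening() {
    local host="$1" port="$2"
    timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Treat service1 as probably OK if either probe passes.
# Port 8080 is an assumed example value.
service1_is_up() {
    service_is_running service1 || port_is_listening 127.0.0.1 8080
}
```

Either probe can then be dropped into a polling loop as the loop’s condition.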

1.2 Solution v2.0: wait with bash loop

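service service1 start

# Poll service1's TCP port once per second until it is listening or we time out
timeout_seconds=30   # example value
tcp_port=8080        # example value: the port service1 listens on
check_pass=false
for ((i=0; i<timeout_seconds; i++)); do
    if lsof -i tcp:"$tcp_port" | grep -qi listen; then
        echo "$tcp_port is listening"
        check_pass=true
        break
    fi
    sleep 1
done

if $check_pass; then
    echo "check pass"
else
    echo "check fail"
    exit 1   # do not start service2 if service1 never came up
fi
service service2 start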

With around 20 extra lines of code, we solve the problem properly. So is it good enough now? Not really! This requirement is very common, so repeating this loop in every script leads to a lot of duplicated code.

What we need is a common wait mechanism: if the condition is met, it reports OK; if the check fails or times out, it reports ERROR.
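
A minimal sketch of such a mechanism as a bash function, assuming the same calling convention as the wait_for.sh example (a check command as a string, then a timeout in seconds). The name `wait_for`, its default timeout, and its OK/ERROR messages are illustrative, not the actual interface of the tool.

```shell
#!/usr/bin/env bash

# Hypothetical generic waiter: wait_for "<check command>" <timeout_seconds>
# Re-runs the check once per second until it succeeds or the attempts run out.
# Prints OK and returns 0 on success; prints ERROR to stderr and returns 1 on timeout.
wait_for() {
    local check_command="$1"
    local timeout_seconds="${2:-30}"   # assumed default when no timeout is given
    local i
    for ((i = 0; i < timeout_seconds; i++)); do
        # Run the check in a child bash so pipes and quoting in the string work
        if bash -c "$check_command" >/dev/null 2>&1; then
            echo "OK: '$check_command' passed"
            return 0
        fi
        sleep 1
    done
    echo "ERROR: '$check_command' did not pass within about ${timeout_seconds}s" >&2
    return 1
}
```

For example, `wait_for "lsof -i tcp:8080 | grep -qi listen" 10 || exit 1` between the two `service ... start` lines replaces the whole v2.0 loop; the port is an assumed value.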

Here is a general-purpose tool for this: waitfor.sh (source on GitHub).


1.3 Solution v3.0: wait in a simple and clever way

service service1 start
wait_for.sh "nc -z -v -w 5 127.0.0.1 8080" 10
service service2 start

Simple and easy, isn’t it?
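
One caveat: with a check like `nc -w 5`, a single attempt can itself block for several seconds, so counting loop iterations does not bound the total wait. Below is a minimal sketch of a deadline-based variant that uses bash’s built-in `SECONDS` counter to measure elapsed wall-clock time. The name `wait_until` is hypothetical, and it takes the command as separate arguments rather than one string, which avoids an extra layer of quoting.

```shell
#!/usr/bin/env bash

# Hypothetical deadline-based waiter: wait_until <timeout_seconds> <command> [args...]
# Gives up once the elapsed wall-clock time reaches the timeout, even if each
# individual check is slow (for example "nc -w 5" waiting on an unreachable host).
wait_until() {
    local timeout_seconds="$1"
    shift
    local start=$SECONDS
    while true; do
        if "$@" >/dev/null 2>&1; then
            echo "OK"
            return 0
        fi
        if (( SECONDS - start >= timeout_seconds )); then
            echo "ERROR: timed out after ${timeout_seconds}s" >&2
            return 1
        fi
        sleep 1
    done
}
```

Usage: `wait_until 10 nc -z -w 5 127.0.0.1 8080 || exit 1`. The total wait can still overshoot the deadline by roughly one check’s duration plus the one-second pause, which is why the check’s own timeout (`-w 5` here) still matters.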

Blog URL: https://www.dennyzhang.com/blind_wait