9 Key Feedbacks For Prod Envs Maintenance

To break silos and improve availability, DevOps/Ops should actively collect useful feedback from prod env maintenance on a regular basis. Make it easy for developers to access, and improve the feedback loop together as a team.

The first and most important question: what should we examine to give developers meaningful feedback?

Here is the list I have checked regularly over the past few years.

1.1 1. Fundamental Monitoring Matters At Both OS And Process Level.

Obviously, we need to measure the usage of four key resources: memory, CPU, disk, and network. Sometimes the OS looks fine while one crucial process suffers. To stay on top of this, monitor at the process level as well.
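As a rough illustration of checking both levels, here is a minimal sketch that assumes a Linux host with /proc mounted. The function names proc_rss_kb and os_mem_available_kb are made up for this example; a real setup would normally feed such numbers into a monitoring system rather than call them by hand.

```shell
# Sketch only: assumes Linux with /proc available.
# Function names are hypothetical, chosen for this example.

# Resident memory (in kB) of a single process, read from /proc/<pid>/status.
proc_rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Memory (in kB) the OS reports as available, read from /proc/meminfo.
os_mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Usage (process level, then OS level):
#   proc_rss_kb "$pid"
#   os_mem_available_kb
```

Comparing the two views over time shows whether trouble is system-wide or limited to one process.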

1.2 2. Detect Resource Leak In Your Applications.

  • Memory leak: This defect is a close friend of service outages. If memory usage keeps rising steadily, raise the alarm with your dev team.
  • Stale file handles: Files may have been deleted, yet your application still holds the file handles or even reads/writes those files. Detect it with “pmap -x $pid | grep deleted”.
  • Overwhelming network sockets: Either your application can’t serve requests fast enough, or it has trouble reclaiming socket fds. Check this with “lsof -p $pid | grep -iE 'TCP|pipe|socket|anon_inode'”. If lots of TCP sockets are stuck in the CLOSE_WAIT state, that is a bad sign too.
# Detect unhandled file deletion
pmap -x $pid | grep deleted
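To put numbers on the socket check above, one option is to count TCP connections per state. This is a sketch, not the method used in the post: it assumes the ss tool from iproute2 is installed, and count_tcp_states is a hypothetical helper name. Note that ss prints states with hyphens (for example CLOSE-WAIT), while lsof and netstat use underscores.

```shell
# Sketch only: count TCP connections per state.
# Reads `ss -tan` style output on stdin: one header line, then one
# connection per line with the state in the first column.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (assumes ss from iproute2 is available):
#   ss -tan | count_tcp_states
```

A count of CLOSE-WAIT connections that keeps growing usually means the application is not closing sockets after the peer has hung up.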

1.3 3. Always On Top Of Logfiles.

Believe it or not, I have seen applications diligently writing hundreds of messages to logfiles every second. This eats up disk so fast that it can fill up before logrotate takes effect.

For application logs, alert developers to any major errors/exceptions found. For syslogs, DevOps/Ops are usually the only gatekeepers.
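A very small sketch of the "alert on errors" idea follows. The pattern ERROR|Exception and the helper name count_errors are assumptions for illustration; real applications need their own patterns, and a log shipper or monitoring agent is the more usual place for this logic.

```shell
# Sketch only: count lines that look like errors in one logfile.
# The pattern is an assumption; adjust it to your application's log format.
count_errors() {
    # grep -c prints the match count; "|| true" keeps a zero count
    # from being treated as a failure by callers using "set -e".
    grep -cE 'ERROR|Exception' "$1" || true
}

# Usage (the path is hypothetical):
#   count_errors /var/log/myapp/app.log
```

Running this periodically and comparing against a threshold is enough to trigger a basic notification.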

1.4 4. Monitor DB Slow Query.

Slow queries usually incur random or constant performance penalties on your applications. If we can pass this information on to developers, it becomes very valuable input for their troubleshooting.
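The post does not name a database, so the following is only a sketch under the assumption of MySQL with the slow query log enabled, where each entry carries a "# Query_time: <seconds>" header line. The helper name count_slow_queries and the example path are hypothetical.

```shell
# Sketch only: assumes a MySQL-style slow query log.
# Counts entries whose Query_time is at least the given threshold (seconds).
count_slow_queries() {
    awk -v t="$2" '/^# Query_time:/ { if ($3 + 0 >= t + 0) n++ } END { print n + 0 }' "$1"
}

# Usage (the path is hypothetical):
#   count_slow_queries /var/log/mysql/slow.log 1
```

Sharing this count, along with the slowest statements themselves, gives developers something concrete to investigate.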

1.5 5. Change History Of Prod Envs.

A clear and complete change list for the prod env can empower developers to identify the root cause quickly. See how to Automatically Track All Change History.

1.6 6. Observe Machine Reboot And Service Restart.

Will all services come back to normal after a machine reboot? This is especially important for cluster envs with complex service dependencies.

Restarting a service can be scary. The stop might hang, and the start might be slow.

When we restart DB services, do we have to restart application services as well?

Not all developers know or remember that files in the /tmp directory won’t survive a machine reboot.
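One way to observe whether a service really comes back after a restart is to poll a health check with a deadline. The sketch below is plain POSIX shell; wait_until is a hypothetical helper, and the health check command in the usage note is an assumption about your service.

```shell
# Sketch only: retry a command once per second until it succeeds
# or the timeout (in seconds) is reached.
# Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_secs=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_secs" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage (URL is hypothetical): wait up to 60 seconds for the app to answer.
#   wait_until 60 curl -fsS http://localhost:8080/health || echo "service did not recover"
```

Recording how long the wait took on each restart also shows whether start time is drifting upward.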

1.7 7. Enable Coredump When Applications Crash.

A coredump helps developers understand which thread and which function caused the crash.
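Whether a core file gets written depends on the core file size limit of the shell that launches the process. A minimal sketch of checking it is below; core_dumps_enabled is a hypothetical helper, and the commented commands assume a Linux host.

```shell
# Sketch only: succeed when the current shell allows core files
# (the core file size limit is not 0).
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Typical steps before launching the application (Linux assumed):
#   ulimit -c unlimited                  # lift the limit for this shell and its children
#   cat /proc/sys/kernel/core_pattern    # shows where the kernel writes core files
```

The limit is inherited by child processes, so it has to be set in the environment that actually starts the service.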

1.8 8. Examine JVM For Key Metrics.

When operating Java applications, the JDK toolkit can help detect suspicious issues. Be familiar with tools like jps, jstack, jmap, etc.
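As an example of turning one of these tools into a quick metric, the sketch below summarizes thread states from a jstack thread dump. It assumes the dump contains the usual "java.lang.Thread.State: <STATE>" lines; count_thread_states is a hypothetical helper name.

```shell
# Sketch only: count threads per state from jstack output on stdin.
# Relies on lines of the form "java.lang.Thread.State: RUNNABLE".
count_thread_states() {
    awk '/java\.lang\.Thread\.State:/ { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (requires a JDK on the host):
#   jps -l                          # list running JVMs and their pids
#   jstack "$pid" | count_thread_states
```

A large or growing number of BLOCKED threads is often worth passing on to developers together with the full dump.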

1.9 9. Simulate Prod Env At Reasonable Cost.

Last but not least. If DevOps can simulate the prod env quickly, developers get a safe playground to run tests or dry-run patches. Some common obstacles to achieving this:

  • Budget concerns. We may need to start enough VMs to get a minimal “prod env”.
  • Automate the automation. Automate not only cluster deployment but also data export and import.
  • Simulate the prod env as closely as possible. This would be the most difficult part, and it varies across projects.

More Reading: Generate Common DB Data Report By ELK


Blog URL: https://www.dennyzhang.com/continuous_feedback
