Reduce Support Effort Of Low Free Disk Issues

When you hand new VM(s) over to developers, or when you have just created a critical env for your team, remember to be prepared for low disk issues.

If you don’t, it will bite you sooner or later. More annoyingly, it usually comes back as a recurring issue. If we add up the total support effort, we might get a much bigger number than we expected. Are you feeling the same, my friend?

So how can we make low disk issues less likely to happen? And when they do happen, how can we resolve them faster and with less impact?

Original Article: http://dennyzhang.com/low_disk

1. Reasonable capacity planning is a must.

You can’t put 20GB of data onto a 10GB disk. The good news is that cloud providers usually let us attach new volumes, then move data and application(s) onto them.

But it may not work the way you’d like.

For example, in Azure you can keep adding volumes, but you can’t have volumes bigger than 1TB. With Linode4096, the total disk capacity will be smaller than 48GB, regardless of how many volumes you have created. Surprised, aren’t you? Either way, it’s better to do the capacity planning in advance.
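
A rough way to put numbers on capacity planning is to estimate how long the current free space will last. The sketch below assumes usage grows linearly, which real workloads often don't; the function name and the sample numbers are hypothetical, picked only for illustration.

```python
def days_until_full(capacity_gb, used_gb, daily_growth_gb):
    """Estimate the days left before a disk fills up.

    Assumes linear growth. Returns None when usage is not growing.
    """
    if daily_growth_gb <= 0:
        return None
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / daily_growth_gb


# Hypothetical numbers: a 48GB disk with 30GB used, growing 0.5GB per day.
print(days_until_full(48, 30, 0.5))  # 36.0
```

If the answer is shorter than the time it takes you to provision and migrate to a new volume, the planning is already late.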

2. Monitor all disk volumes instead of rootfs only.

Obviously, we want to be notified before anyone else notices.

One common mistake I’ve noticed is that people attach more volumes, but forget to monitor the new disks as well. And that usually leads to surprises.

If you use Nagios, check_linux_stats.pl can easily monitor all volumes[1]. You can probably find many other open source alternatives. (Leave me a comment if you have good recommendations.)

# Check all disk volumes.
# Get warning alerts, if disk utilization is over 90%
# Get critical alerts, if disk utilization is over 95%
./check_linux_stats.pl -D -w 10 -c 5

You might say: "For small teams or less critical envs, Nagios/Zabbix are just too heavyweight for me, or I simply don’t have the time to learn and set them up." Yes, I understand your pain and concern.

Here are my suggestions, though I don’t have code snippets ready.

  1. If the VM runs a Docker daemon, we can start a diagnostic container with the overlay/aufs driver. The container then runs the periodic check.
  2. Automate the configuration, so that it downloads the check script, creates the crontab job and enables proper alerting with no manual steps.

What’s your cure for this scenario, my friend? Share it with me.
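
The periodic check itself can be quite small. Below is a minimal sketch of what a cron-friendly check script might look like, not a tested replacement for a Nagios plugin. It assumes a Linux host with /proc/mounts (falling back to "/" if that file is missing), and the filesystem skip list, function names and thresholds are my own assumptions. It reuses the 90%/95% levels from the example above.

```python
import shutil

# Pseudo filesystems that are not worth alerting on (an assumed, incomplete list).
SKIP_FS_TYPES = {"proc", "sysfs", "tmpfs", "devtmpfs", "devpts", "cgroup", "cgroup2"}


def list_mount_points(mounts_file="/proc/mounts"):
    """Return mount points of non-pseudo filesystems from a /proc/mounts-style file."""
    try:
        with open(mounts_file) as f:
            lines = f.readlines()
    except OSError:
        return ["/"]  # fallback when the file is unavailable
    mount_points = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] not in SKIP_FS_TYPES:
            mount_points.append(fields[1])
    return mount_points


def classify(used_percent, warn=90.0, crit=95.0):
    """Map a utilization percentage to OK, WARNING or CRITICAL."""
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


def check_volumes(mount_points, warn=90.0, crit=95.0):
    """Return (mount_point, used_percent, status) for every readable volume."""
    results = []
    for mount_point in mount_points:
        try:
            usage = shutil.disk_usage(mount_point)
        except OSError:
            continue  # unreadable or vanished mount point
        if usage.total == 0:
            continue
        used_percent = round(100.0 * usage.used / usage.total, 1)
        results.append((mount_point, used_percent, classify(used_percent, warn, crit)))
    return results


if __name__ == "__main__":
    for mount_point, used_percent, status in check_volumes(list_mount_points()):
        print(f"{status:8s} {used_percent:5.1f}% {mount_point}")
```

Run from crontab, the output can be piped into whatever alerting channel you already have. Note that /proc/mounts escapes spaces in paths (as \040), which this sketch does not decode.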

3. Monitor not only disk volumes but also selected folders.

With enough knowledge of your applications, you will know which folders are most likely to grow too large.

Common Sources Of Low Disk Issues.

Data Sources           Category  Comment
Old artifacts          System    From old deployments
Temporary backups      System    People run manual cp commands
General logfiles       System    Logs without rotation
Coredumps              System    Coredumps from frequent app crashes
DB data                App       Just too much business data
Trash from bugs        App       Application bugs leave garbage behind
Fast-growing logfiles  App       Ever noticed 10GB of logs per day?

So identify those folders, and keep an effective watch on them. The du command is quite handy for this job.

root@dennyzhang:~# du -h -d 2 /opt
36K	/opt/chef/bin
168M	/opt/chef/embedded
168M	/opt/chef
48K	/opt/devops/bin
52K	/opt/devops
12K	/opt/java_jce/8
16K	/opt/java_jce
..
7.5M	/opt/digitalocean/bin
7.5M	/opt/digitalocean
114G	/opt
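
To turn a one-off du run into a repeatable watch, the same idea can be scripted. This is a minimal sketch under my own assumptions: the function names and the per-folder limits are hypothetical, and it sums apparent file sizes, so the numbers can differ slightly from du, which counts disk blocks.

```python
import os


def folder_size_bytes(path):
    """Sum the sizes of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                if not os.path.islink(full_path):
                    total += os.lstat(full_path).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def oversized_folders(limits):
    """limits maps folder path -> max size in bytes.

    Returns a list of (path, size, limit) for folders over their limit.
    """
    report = []
    for path, limit in limits.items():
        size = folder_size_bytes(path)
        if size > limit:
            report.append((path, size, limit))
    return report
```

For example, `oversized_folders({"/var/log": 5 * 1024**3, "/opt": 100 * 1024**3})` would list whichever of the two folders exceed 5GB and 100GB respectively.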

4. Enforce daily cleanup in a managed way.

You have received low disk alerts and identified the big folders. Now it’s time to do the cleanup!

So what do you usually do? Something like this?

# find and rm
find /var/log/ -name "*.gz" -type f -exec rm -f {} \;

# manually truncate a log file and remove an old backup
> application.log
rm -rf /data/backup/couchbase-20170522

It will work, but …

How do you track the changes for future troubleshooting?

For old, useless files in specific folders, we can usually find a pattern for what to remove. Instead of cleaning up manually, why not automate it?

Here comes cleanup_old_files.py on GitHub.

  1. It removes files/folders by pattern.
  2. By default, it keeps the latest several backups.
  3. It tracks what was removed in log files.
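
To make the three behaviors above concrete, here is a minimal sketch of the same idea. It is not the actual cleanup_old_files.py; the function signature, the default of keeping 3 entries and the dry_run flag are my own assumptions.

```python
import glob
import logging
import os
import shutil


def cleanup_old_files(pattern, keep=3, dry_run=False):
    """Remove files/folders matching a glob pattern, keeping the newest ones.

    "Newest" is decided by modification time. Returns the paths that were
    removed (or would be removed, when dry_run is True).
    """
    keep = max(keep, 0)
    matches = sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
    removed = []
    for path in matches[keep:]:
        if not dry_run:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        # Every removal is logged, so the change history can be audited later.
        logging.info("%s %s", "Would remove" if dry_run else "Removed", path)
        removed.append(path)
    return removed
```

A caller would typically configure logging first, for example `logging.basicConfig(filename="cleanup.log", level=logging.INFO)`, then do a run with `dry_run=True` before trusting a new pattern.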

Even better, we can wrap this cleanup script in a scheduled Jenkins job, then enable Slack notifications for the job.

This way we can easily enforce daily cleanup. And when something goes wrong, we will get Slack notifications. Want to check the change history? The Jenkins GUI makes that easy and straightforward.

My friend, I hope this saves you precious time when dealing with the notorious low disk issues, with as much automated as possible. Cheers!

Footnotes:

[1] exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_linux_stats/details
