
Relieving Linux Server Disk Pressure

Logs and backups are the No. 1 cause of critical disk pressure

If you think Sysadmin and Site Reliability Engineer (SRE) work is all about automation, you are right to an extent. The advance of infrastructure as code (IaC) and programming languages into the backend has pushed a programmatic abstraction over command-line work. However, it has (unfortunately) devalued good command-line skills and a reasonable working knowledge of lower-level system behaviours. One of those gotcha moments is forgetting the bread-and-butter runbook responses to disk pressure. The first rule is to tread carefully (e.g. the 'rm -f' command cannot be undone), so be sure of your strategy, knowing that you should only action remediation on files that the application is no longer using. Some general guidance on how to approach disk pressure relief is as follows:

- Delete: use the 'rm' command to remove files that are no longer needed for any use case, including audits of digital assets.

- Archive: files that are no longer in use but consume large amounts of disk space can be archived (e.g. 'tar -czf archive.tgz folder' — note the archive name comes before the folder being archived). Depending on the file type (logs versus binaries), space gains may vary. Be careful with this one too: do not archive or truncate logs an application is actively writing to, as this can disrupt the running Tomcat or Apache server.

- Move: archives that are just sitting on the server should, after a period of x days, be moved off the server to free up space for active use.
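The archive and move steps above can be sketched end to end. This is a minimal illustration against throwaway directories only: the target folder and the off-server archive volume are stand-ins created with 'mktemp -d', not real paths from any server.

```shell
#!/bin/sh
# Sketch: archive a cold folder, then move the archive off the active
# partition. All paths are throwaway stand-ins, not real server paths.
set -eu

workdir=$(mktemp -d)            # stand-in for the real target folder
mkdir -p "$workdir/old_reports"
echo "sample" > "$workdir/old_reports/report1.txt"

# Archive: the archive name comes first, the folder to pack comes second.
# -C changes directory so the archive stores relative paths.
tar -czf "$workdir/old_reports.tgz" -C "$workdir" old_reports

# Move: relocate the archive to a stand-in for off-server storage.
dest=$(mktemp -d)               # stand-in for a mounted archive volume
mv "$workdir/old_reports.tgz" "$dest/"

ls "$dest"
```

Once the move is confirmed, the original folder can be removed under the same approval process as any other delete.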

With the above in mind, investigate your server by logging in as root and checking the root partition /. Find where the disk pressure is by sorting your biggest folders in ascending order, so the biggest space user sits at the bottom of the console output and the subfolders contributing to it sit above. You can do this with the following steps:

- 'du -sh /' to get overall used space, and 'df -h /' for the partition's used and free space. Note any large difference between the two, which usually signals deleted files still held open by running processes. This can be investigated via helpful commands like 'ps', 'pgrep', 'free', 'netstat' and 'lsof' (e.g. 'lsof +L1' lists open files that have already been deleted).

- 'du -xh / | sort -h | tail -n 40' - limits the output to 40 records, which puts the biggest space user's file path at the end and the smaller space users above it (note 'sort -h' is needed to order human-readable sizes; 'sort -n' mis-orders them). There you can see which subfolders of your heavy space users are consuming disk, and use them as target file paths for investigation and remediation.
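One detail in the pipeline above is worth demonstrating: 'sort -n' only compares the leading digits, so it does not understand the K/M/G suffixes that 'du -h' emits, while 'sort -h' does. A quick self-contained illustration with fabricated du-style lines:

```shell
#!/bin/sh
# Demonstrates why human-readable du output needs `sort -h`:
# a plain numeric sort would rank 9.0K above 1.0M.
set -eu

# Fabricated du-style output: size, tab, path (paths are placeholders).
out=$(printf '9.0K\t/tmp/a\n1.0M\t/tmp/b\n512\t/tmp/c\n' | sort -h | tail -n 1)

# The biggest entry lands at the bottom, which tail -n 1 picks out.
echo "$out"
```

With 'sort -n' instead, 9.0K would have sorted last, sending you to the wrong folder.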

Now investigate your target file path, looking for logs and backups in particular by count, so you get a non-verbose picture of the space being used (i.e. all logs past x days, or a mix):

- 'find /filepath/targetfolder/ -name 'backup*' -mtime +3 | wc -l' (substitute a log pattern if the logs are not in a dedicated folder like /var/log/) - returns, by count only, the files/folders in the target folder starting with 'backup' that are older than 3 days. Quote the pattern so the shell does not expand it before find sees it.

- 'find /filepath/targetfolder/ -type f -mtime +3 | wc -l' - counts all files in the same folder older than 3 days. If both counts match, you know the space used is files beginning with 'backup'.

- 'find /filepath/targetfolder/ -name 'backup*' -mtime +3' will return the list of files for review, so you can confirm it looks OK to action.
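The count-comparison logic above can be rehearsed safely on a throwaway folder before touching a real server. A minimal sketch follows; the folder and file names are hypothetical, and the '-mtime +3' filter is dropped here only because the fixture files are freshly created:

```shell
#!/bin/sh
# Sketch: rehearse the two-count comparison on a throwaway folder.
set -eu
target=$(mktemp -d)
touch "$target/backup1.sql" "$target/backup2.sql" "$target/app.pid"

# Count files matching the backup pattern, then count all files.
backups=$(find "$target" -name 'backup*' -type f | wc -l)
allfiles=$(find "$target" -type f | wc -l)

# If the two counts matched, everything in the folder would be a backup;
# here they differ because app.pid is not one.
echo "backups=$backups total=$allfiles"
```

On a real server you would keep '-mtime +3' in both commands so the comparison covers the same set of aged files.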

Update your ticket with your findings and next steps (you may identify multiple folders to run these commands on), and consult your runbook: the next steps approved by management may be conditional deletes/moves/archives. Once you have authorisation to perform an action to relieve the disk pressure, then do so, e.g. delete all old backup files in a folder with 'find /filepath/targetfolder/ -name 'backup*' -mtime +3 -exec rm -f {} +'.
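A cautious pattern for the delete itself is to dry-run the find expression with '-print' first, then re-issue exactly the same expression with '-exec rm -f' appended, so what you reviewed is what gets removed. A sketch against a throwaway folder (the 'touch -d' fixture used to age a file assumes GNU coreutils):

```shell
#!/bin/sh
# Sketch: dry-run a delete selection, then delete the same selection.
set -eu
target=$(mktemp -d)
touch "$target/backup_old.tgz" "$target/keep.log"
# Age the backup file so -mtime +3 matches it (GNU touch, fixture only).
touch -d '10 days ago' "$target/backup_old.tgz"

# Dry run: print exactly what would be removed. The glob is quoted so
# the shell does not expand backup* before find sees it.
find "$target" -name 'backup*' -mtime +3 -print

# Real run: the same expression with -exec rm -f appended.
find "$target" -name 'backup*' -mtime +3 -exec rm -f {} +
```

Pasting the dry-run listing into the ticket gives an audit trail of exactly what was actioned.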

Now you can recheck your disk pressure with 'du -sh /' and 'df -h /' to see the pressure release. If your type/flavour of Linux requires a restart, make sure it's a graceful one that has management approval and is executed safely, given the risks associated with restarting servers (versus rebooting VMs, which is inherently riskier). This command-line interaction will allow you to respond to disk pressure in a logical and coherent manner, executing safe remediation on the immediate issue. It also provides a data point for future automation of this process to enhance your server's availability.
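For the ticket, it helps to capture the root partition's usage percentage before and after remediation. A small sketch using POSIX 'df -P' output (the remediation step itself is a placeholder comment here):

```shell
#!/bin/sh
# Sketch: record root-partition usage before and after remediation.
set -eu

# df -P guarantees one record per line; field 5 is the Use% column.
before=$(df -P / | awk 'NR==2 {print $5}')

# ... approved remediation commands would run here ...

after=$(df -P / | awk 'NR==2 {print $5}')
echo "root usage: before=$before after=$after"
```

The two percentages drop straight into the ticket as evidence the pressure was relieved.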

Stay tuned for more on DevOps in this blog along with articles on other areas of interest in the Writing and Infrastructure arenas. To not miss out on any updates on my availability, tips on related areas or anything of interest to all, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Best Regards

John 

 
