Checking on your processes is critical. Yes, we already have plenty of Linux tools and tips available. Getting familiar with your weapons is the first step, and the easiest part.
More important is which questions you ask, and why, when you approach a critical process. Fortunately, even plain common sense lets us dig out lots of valuable information.
Original Article: https://dennyzhang.com/check_process
Assumptions Before Deep Dive
Here we assume you are familiar with:
- FD (file descriptor): Everything in Linux is a file.
- /proc pseudo filesystem: How the Linux kernel exposes in-depth information about processes.
- lsof, top, ps, grep: First time heard of them? Excuse me?
Basic Check For Linux Process
Note: If you’re lucky enough to run the service under systemd, “systemctl status XXX” (or “service XXX status”) will give you a lot of useful information.
- When was the process started, and how long has it been running? This helps us detect whether an unexpected or suspicious service restart has happened. As a supplement, a decent service will always do proper logging, which can confirm our observation.
# Get start time by pid
ps -eo pid,comm,etime,user | grep $pid
# Sample output:
root@s1:~# ps -eo pid,comm,etime,user \
    | grep 20513
20513 dockerd 8-00:58:30 root
# etime is [[dd-]hh:]mm:ss, so it means
# 8 days, 0 hours, 58 min and 30 sec
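If you want a script to flag a suspiciously recent restart, the etime string is awkward to compare. Below is a minimal sketch that converts it to seconds. `etime_to_seconds` is a made-up helper name, not part of ps. On procps-ng systems `ps -o etimes=` should already give you seconds, so treat this as a fallback.

```shell
# Convert a ps etime value ([[dd-]hh:]mm:ss) into seconds.
# etime_to_seconds is a hypothetical helper, not a standard tool.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        rest = $0
        # Optional "dd-" prefix
        if (split($0, d, "-") == 2) { days = d[1]; rest = d[2] }
        n = split(rest, t, ":")
        if (n == 3)      secs = t[1] * 3600 + t[2] * 60 + t[3]
        else if (n == 2) secs = t[1] * 60 + t[2]
        else             exit 1
        print days * 86400 + secs
    }'
}

# Usage: warn when a process has been up for less than an hour
# uptime=$(etime_to_seconds "$(ps -o etime= -p "$pid" | tr -d ' ')")
# [ "$uptime" -lt 3600 ] && echo "pid $pid restarted recently"
```

The one-hour threshold in the usage comment is only an illustration; pick whatever matches your restart policy.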
- Where is the log file? A very common question, especially from Dev or QA. A process usually logs continuously, so it holds fds for its log files. lsof can list all fds opened by the process, so you don’t need to ask anyone to find the answer!
# Find out log files by pid
lsof -P -n -p $pid | grep ".*log$"
# Sample output:
# root@s1/# lsof -p 40 | grep ".*log$"
# daemon .. /var/log/jenkins/jenkins.log
# daemon .. /var/log/jenkins/jenkins.log
# Check log files for error/exceptions
grep -C 3 -iE "exception|error" $logfile
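When lsof is not installed (common in slim containers), roughly the same answer can be read straight from /proc. This is a sketch under two assumptions: log files end in `.log`, and you run it as the process owner or root. `log_files_by_pid` is a hypothetical name.

```shell
# List open files ending in .log for a pid, using only /proc.
# log_files_by_pid is a hypothetical helper name.
log_files_by_pid() {
    for fd in "/proc/$1/fd"/*; do
        # Each fd entry is a symlink to the file it refers to
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *.log) echo "$target" ;;
        esac
    done | sort -u
}

# Usage:
# log_files_by_pid "$pid"
```

The `sort -u` collapses duplicates, since a daemon often has stdout and stderr pointing at the same log file.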
- How much CPU and memory does the process take? We certainly need to be on top of any abnormal resource utilization. Fortunately almost all modern monitoring systems let us see the history. A big plus for troubleshooting.
# Check process resource utilization
top -p $pid
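top is interactive, which is awkward inside scripts. For a quick scripted sample of resident memory, /proc already has the number. A minimal sketch, with `rss_kb` as a made-up helper name; `top -b -n 1 -p $pid` in batch mode is another option if you prefer top's own output.

```shell
# Print resident memory (VmRSS, in kB) for a pid, read from /proc.
# rss_kb is a hypothetical helper name.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Usage: log one sample per minute for later comparison
# while sleep 60; do echo "$(date +%s) $(rss_kb "$pid")"; done
```

Kernel threads have no VmRSS line, so the function prints nothing for them.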
- What’s the command line that started the process? People ask this question when they’re required to manage unfamiliar or uncomfortable services. A more urgent case: the stupid service just mysteriously refuses to start. Wrong java opts? File permission issue? The process command line can give us some insight or hints.
# Find out process start command line
# (arguments are NUL-separated, so turn NULs into spaces)
tr '\0' ' ' < /proc/$pid/cmdline
- What TCP ports is the process listening on? Nowadays the majority of services are web-based or microservices. It helps if we understand which TCP ports the process is listening on.
# Check what ports are serving
lsof -P -n -p $pid | grep -i listen
# Check whether given port is listening
lsof -P -n -i tcp:$tcp_port -sTCP:LISTEN
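For health checks you usually want just the port numbers, not whole lsof lines. A small sketch that filters lsof output; `listen_ports` is a hypothetical name, and it assumes you pass `-P -n` so that ports stay numeric.

```shell
# Read `lsof -P -n` output on stdin and print the listening TCP ports.
# listen_ports is a hypothetical helper name.
listen_ports() {
    awk '$NF == "(LISTEN)" {
        # NAME column looks like *:8080 or [::1]:8080;
        # the port is whatever follows the last colon
        n = split($(NF - 1), parts, ":")
        print parts[n]
    }' | sort -un
}

# Usage:
# lsof -P -n -p "$pid" | listen_ports
```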
- How many fds does the process have open? Too many open fds, say over 3000, is usually a bad sign. Possible causes: a bad design that makes the application inefficient at handling requests; an fd resource leak; more requests than we expected.
# Get total fd count opened by pid
ls /proc/$pid/fd | wc -l
# lsof works too, but it also counts its header line
# and non-fd entries such as cwd, txt and mem
lsof -p $pid | wc -l
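A raw count means more when you put it next to the limit the process is actually allowed. Here is a sketch that reads both from /proc. The function names are made up, and it assumes the usual /proc/$pid/limits layout where the soft limit is the fourth field of the "Max open files" line.

```shell
# Count open fds for a pid (hypothetical helper name).
fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Soft limit on open files for a pid (hypothetical helper name).
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# One-line summary of both numbers.
fd_usage() {
    count=$(fd_count "$1") || return 1
    limit=$(fd_soft_limit "$1") || return 1
    echo "pid $1: $count open fds, soft limit $limit"
}

# Usage:
# fd_usage "$pid"
```

A process sitting at 3000 fds with a limit of 4096 deserves more attention than one with a limit of 65536.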
Advanced Check For Linux Process
- How is resident memory used by the process? This is especially important when the process is taking way too much memory. pmap reports the memory map of a process.
# Display detailed memory usage
pmap -x $pid
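pmap output for a big JVM can run to thousands of lines. A sketch that sums RSS per mapping name so the heavy hitters float to the top. `rss_by_mapping` is a hypothetical name, and it assumes the procps column order (Address, Kbytes, RSS, Dirty, Mode, Mapping).

```shell
# Read `pmap -x` output on stdin, print total RSS (KB) per mapping,
# largest first. rss_by_mapping is a hypothetical helper name.
rss_by_mapping() {
    awk '$1 ~ /^[0-9a-f]+$/ && $3 ~ /^[0-9]+$/ {
        # Mapping names such as "[ anon ]" contain spaces,
        # so rebuild the name from field 6 to the end
        name = $6
        for (i = 7; i <= NF; i++) name = name " " $i
        rss[name] += $3
    }
    END { for (name in rss) print rss[name], name }' | sort -rn
}

# Usage:
# pmap -x "$pid" | rss_by_mapping | head
```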
- What does the process tree look like? For a multi-threaded process, displaying all threads and their starting commands might be helpful. It gives us very good insight.
# Get all threads for a given process
pstree -A -a -p $pid
# keep checking process tree
watch "pstree -A -a -p $pid"
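If pstree is missing, the thread list is also visible under /proc/$pid/task. A minimal sketch, with `list_threads` as a made-up name, that prints each thread id with its name.

```shell
# Print "tid name" for every thread of a pid, using /proc.
# list_threads is a hypothetical helper name.
list_threads() {
    for task in "/proc/$1/task"/*; do
        # Threads can exit while we iterate, so skip vanished entries
        [ -r "$task/comm" ] || continue
        printf '%s %s\n' "${task##*/}" "$(cat "$task/comm")"
    done
}

# Usage:
# list_threads "$pid"
```

The main thread shows up with a tid equal to the pid itself.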
- Are there long-lived TCP connections, and how long have they been open? Watch out for long-lived TCP connections. A daemon service might not only take requests but also initiate connections. Developers may keep long-lived TCP connections from applications to DB services. When app nodes and DB nodes are disconnected, or DB instances are restarted, will your process survive the chaos and stay functional?
# List TCP connections it starts
lsof -p $pid | grep ESTABLISHED
# Check create/update time for given fd
stat /proc/$pid/fd/$fd_num
# Sample:
# root@s1:~# date
# Fri Sep 23 23:22:22 EDT 2016
#
# root@s1:# lsof -p 265 |grep ESTABLISHED
# 134u . 47..33 s1:59427->s2:9300 (EST..
# 140u . 47..10 s1:38078->s2:9300 (EST..
# 142u . 47..11 s1:38079->s2:9300 (EST..
# 143u . 47..81 s1:51033->s2:9300 (EST..
#
# root@s1:~# stat /proc/265/fd/134
# File: /proc/265/fd/134->socket:[47..
# Size: 64 Blocks: 0 ..
# Device: 3h/3d Inode: 463..8 Links:..
# Access: (0700/lrwx------) Uid: (0/..
# Access: 2016-09-23 19:50:12... -0400
# Modify: 2016-09-05 19:48:05... -0400
# Change: 2016-09-05 19:48:05... -0400
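Doing the date arithmetic by eye gets old quickly. Below is a sketch that turns the Modify time of an fd entry into an age in seconds. `fd_age_seconds` is a hypothetical name, and it assumes GNU stat. One caveat, as far as I understand it: the kernel may stamp these /proc entries when they are first looked up rather than when the socket was opened, so treat the result as approximate.

```shell
# Print how many seconds ago the /proc fd entry was last modified.
# fd_age_seconds is a hypothetical helper name; assumes GNU stat.
fd_age_seconds() {
    now=$(date +%s)
    # Without -L, stat reports on the symlink itself, which is what we want
    mtime=$(stat -c %Y "/proc/$1/fd/$2") || return 1
    echo $((now - mtime))
}

# Usage: age of fd 134 of pid 265, in days
# echo "$(( $(fd_age_seconds 265 134) / 86400 )) days"
```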
- How to detect an FD leak? If an application keeps opening files or sockets without closing them properly, it has an FD leak. Its fd count will keep rising, and eventually the process will hit its open-file limit and fail or crash. This usually happens in problematic error handling logic.
# Get total fd count by pid
lsof -p $pid | wc -l
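A single count cannot show a leak; the trend does. Here is a sketch that samples the fd count at a fixed interval so you can see whether it only ever goes up. `sample_fd_count` and its default values are made up for illustration.

```shell
# Print "epoch_seconds fd_count" lines for a pid at a fixed interval.
# sample_fd_count is a hypothetical helper; args: pid [samples] [interval_sec]
sample_fd_count() {
    pid=$1
    samples=${2:-5}
    interval=${3:-60}
    i=0
    while [ "$i" -lt "$samples" ]; do
        count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
        echo "$(date +%s) $count"
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Usage: one sample a minute for an hour, kept for later comparison
# sample_fd_count "$pid" 60 60 >> /tmp/fd_trend.log
```

A count that climbs steadily under flat traffic is the pattern to worry about; a count that follows the request rate up and down is usually fine.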
- What files are being downloaded, and how far along are they? The application might be stuck on a heavy network request, e.g. downloading huge files. To dig out the details, find the fd, which should be a regular file opened in write mode. Then keep polling the file size to see where we are.
# Get REG(regular) fd with write mode
lsof -p $pid | grep REG | grep "w "
# Check file size (-L follows the /proc symlink to the real file)
watch "ls -lLth /proc/$pid/fd/$fd_num"
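To tell a slow download from a stalled one, compare two size readings. A sketch assuming GNU stat; both function names are hypothetical.

```shell
# Size in bytes of the file behind an fd (hypothetical helper name).
# -L follows the /proc symlink to the real file.
fd_size_bytes() {
    stat -L -c %s "/proc/$1/fd/$2"
}

# Report how much the file grew over an interval (default 5 seconds).
fd_growth() {
    before=$(fd_size_bytes "$1" "$2") || return 1
    sleep "${3:-5}"
    after=$(fd_size_bytes "$1" "$2") || return 1
    echo "$((after - before)) bytes written in ${3:-5}s (now $after bytes)"
}

# Usage:
# fd_growth "$pid" "$fd_num" 10
```

Zero growth across a few intervals suggests the transfer is stuck rather than slow.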
- Are there files that were deleted but never closed? When files are removed somehow, your process might still hold the stale fd, or even keep reading or writing the file. This should definitely be avoided, and developers should be alerted.
# List unexpected file deletion
lsof -p $pid | grep deleted
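The same check works without lsof, because the kernel appends " (deleted)" to the symlink target of such fds. A minimal sketch, with `deleted_open_files` as a made-up name.

```shell
# Print "fd target" for every fd of a pid whose file has been deleted.
# deleted_open_files is a hypothetical helper name.
deleted_open_files() {
    for fd in "/proc/$1/fd"/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s %s\n' "${fd##*/}" "$target" ;;
        esac
    done
}

# Usage:
# deleted_open_files "$pid"
```

The disk space of a deleted file is not released until the last fd is closed, which is why a full disk sometimes shows nothing suspicious in du.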
More Reading: 9 Key Feedbacks For Prod Envs Maintenance.