As a DevOps or Ops engineer, you maintain database instances or RAM-intensive services. You run into OOM issues occasionally, don’t you? Yes, the scary Out-Of-Memory issues.
Nobody enjoys OOM issues. When one does happen, what should you check? More importantly, how can you monitor for OOM issues and get alerted before they actually happen?
Here are some of my thoughts. Take a look and discuss with me!
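To make "alert before it happens" concrete, here is a minimal sketch of a pre-OOM check. It assumes a Linux host whose /proc/meminfo exposes MemTotal and MemAvailable (kernel 3.14 or later). The 10% threshold and the function names are illustrative choices, not a recommendation from any specific tool.

```python
"""Minimal pre-OOM check: warn when available memory drops below a threshold."""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numeric values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Values look like "16318164 kB" (some have no unit); keep only the number.
        info[key.strip()] = int(value.split()[0])
    return info


def available_ratio(info):
    """Fraction of total memory still available for new allocations."""
    return info["MemAvailable"] / info["MemTotal"]


def check_memory(text, threshold=0.10):
    """Return a warning string if available memory is below `threshold`, else None."""
    ratio = available_ratio(parse_meminfo(text))
    if ratio < threshold:
        return f"WARNING: only {ratio:.1%} of memory available"
    return None
```

On a real host you would call `check_memory(open("/proc/meminfo").read())` from cron or your monitoring agent and forward any non-None result to your alerting channel. MemAvailable is used rather than MemFree because it also counts reclaimable page cache, which gives a more realistic picture of headroom.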
Say you have issued a command on one of your servers. Typically the command either backs something up or performs a critical hotfix.
Surely you know when the process started. But when will it end? And how can you find out how long it has been running once it has already started?
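The quickest answer on Linux is usually `ps -o etime= -p <pid>`, which prints the elapsed time since the process started. If you want the number in a script, a minimal sketch (Linux only, assuming the standard /proc layout) derives it from /proc: field 22 of /proc/<pid>/stat is the start time in clock ticks after boot, and /proc/uptime gives seconds since boot. Predicting when the process will *end* is a different question that this does not answer.

```python
"""Elapsed running time of a Linux process, computed from /proc."""
import os


def start_ticks(stat_text):
    """Return field 22 (starttime, in clock ticks since boot) of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split only the part after the last ')'. That part starts at field 3.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    return int(rest[22 - 3])


def elapsed_seconds(stat_text, uptime_text, clk_tck):
    """Seconds the process has been running: uptime minus its start offset."""
    uptime = float(uptime_text.split()[0])
    return uptime - start_ticks(stat_text) / clk_tck


def process_elapsed(pid):
    """Read the live values for `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat_text = f.read()
    with open("/proc/uptime") as f:
        uptime_text = f.read()
    return elapsed_seconds(stat_text, uptime_text, os.sysconf("SC_CLK_TCK"))
```

For example, `process_elapsed(12345)` returns the running time in seconds of PID 12345 (a hypothetical PID).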
Before a deployment, people may need to provide a lot of information: for example, which services to deploy on which nodes, which TCP ports the application endpoints should listen on, etc.
Even very careful people make stupid mistakes! For example: a malformed IP address, an invalid port, an unsupported OS version, a machine without enough RAM, etc.
These human errors may not only fail your deployments but also cause unexpected damage to your existing envs, sometimes even messing up critical ones. So it’s better to enforce pre-checks before any update.
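Here is a minimal sketch of such a pre-check, covering the mistakes listed above. The node dictionary layout, the SUPPORTED_OS set and the MIN_RAM_GB value are hypothetical placeholders for whatever your own deployment inventory contains.

```python
"""Pre-deployment sanity checks for a hypothetical node inventory."""
import ipaddress

# Hypothetical policy values; adjust them to your own environment.
SUPPORTED_OS = {"ubuntu-20.04", "ubuntu-22.04", "centos-7"}
MIN_RAM_GB = 4


def precheck(node):
    """Return a list of human-readable problems found in one node entry."""
    errors = []
    try:
        ipaddress.ip_address(node.get("ip", ""))
    except ValueError:
        errors.append(f"invalid IP address: {node.get('ip')!r}")
    port = node.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append(f"invalid port: {port!r}")
    if node.get("os") not in SUPPORTED_OS:
        errors.append(f"unsupported OS: {node.get('os')!r}")
    if node.get("ram_gb", 0) < MIN_RAM_GB:
        errors.append(f"not enough RAM: {node.get('ram_gb', 0)} GB < {MIN_RAM_GB} GB")
    return errors


def precheck_all(nodes):
    """Map each failing node name to its problems; an empty dict means go ahead."""
    report = {}
    for name, node in nodes.items():
        problems = precheck(node)
        if problems:
            report[name] = problems
    return report
```

The deployment script would call `precheck_all(...)` first and abort if the report is non-empty, so a typo is caught before any server is touched.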
Occasionally, people manually change critical config files on servers, for example /etc/hosts, /etc/hostname, etc.
As an experienced operator, you remember to back up before making any changes. Right? So what would you do? cp /etc/hosts /etc/hosts.bak.
But is that good enough?
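One weakness of a single .bak copy is that the next backup silently overwrites it, so only one previous version survives. A minimal sketch of one possible improvement (an assumption on my side, not necessarily the only answer) keeps timestamped copies instead; the naming scheme is hypothetical.

```python
"""Timestamped backups so earlier versions are never overwritten."""
import shutil
import time
from pathlib import Path


def backup(path, backup_dir=None):
    """Copy `path` to <name>.<YYYYmmdd-HHMMSS>.bak and return the new path."""
    src = Path(path)
    target_dir = Path(backup_dir) if backup_dir else src.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = target_dir / f"{src.name}.{stamp}.bak"
    # Avoid clobbering a backup taken within the same second.
    counter = 1
    while dst.exists():
        dst = target_dir / f"{src.name}.{stamp}-{counter}.bak"
        counter += 1
    # copy2 copies the content plus permission bits and timestamps.
    shutil.copy2(src, dst)
    return dst
```

Calling `backup("/etc/hosts")` twice leaves two distinct files such as hosts.20240101-120000.bak and hosts.20240101-120000-1.bak.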
With Docker, deployments are more reliable and faster than ever. But what about building the Docker images? Containers are no silver bullet: they shift installation instability from the deployment cycle to the image build cycle.
I would like a general solution for verifying all Docker image builds, one that works across different projects. That means less time and effort and, certainly, saved money!
Following the usual git workflows, there is a branch called activesprint, or develop. It is the release candidate, and most active branches should be based on it.
The team needs to be notified whenever a new activesprint branch is created. To reduce communication effort, we can automate the detection and send Slack notifications.
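The detection part can be sketched as a periodic job that lists the remote branches with `git ls-remote --heads` and compares them with the set seen on the previous run. The assumption that new sprint branches contain the string "activesprint" (e.g. activesprint-43) is hypothetical, and persisting the `seen` set between runs is left out. Any new names found would then be posted to Slack through a webhook.

```python
"""Detect newly created activesprint branches from `git ls-remote --heads` output."""
import subprocess


def list_heads(ls_remote_output):
    """Return branch names from lines like '<sha>\\trefs/heads/<branch>'."""
    prefix = "refs/heads/"
    branches = set()
    for line in ls_remote_output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1].startswith(prefix):
            branches.add(parts[1][len(prefix):])
    return branches


def new_sprint_branches(current, seen, pattern="activesprint"):
    """Branches containing `pattern` that were not present in the previous run."""
    return sorted(b for b in current - seen if pattern in b)


def fetch_heads(remote="origin"):
    """Ask the remote for its branches (needs git and access to the remote)."""
    out = subprocess.run(
        ["git", "ls-remote", "--heads", remote],
        check=True, capture_output=True, text=True,
    ).stdout
    return list_heads(out)
```

Using `git ls-remote` avoids needing a full clone on the machine that runs the check.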
NeuVector is a startup in the Bay Area focusing on run-time container security. In our previous post, we found docker-bench-security useful for avoiding many common Docker pitfalls.
NeuVector helps address some Docker security issues that were not well resolved before, e.g., intelligently detecting malicious traffic within the servers of our critical envs, visualizing the network topology of large-scale Docker envs, etc.
Has the deployment started? Has it finished? And how do things look afterwards? These are questions people ask frequently, especially managers and key stakeholders.
Thanks to Slack, teams can sync up much more easily than before. And as DevOps adoption grows, we are likely to get one-click deployments.
Let’s send out Slack notifications for system upgrades. Better sync-up, better control.
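A minimal sketch of the notification step, assuming a Slack incoming webhook has already been created for the channel. Incoming webhooks accept a JSON POST with a `text` field. The message format and the env/stage naming are hypothetical.

```python
"""Post deployment/upgrade status to a Slack incoming webhook."""
import json
import urllib.request


def build_payload(env, stage, detail=""):
    """Compose the JSON body; Slack incoming webhooks accept a 'text' field."""
    text = f"[{env}] system upgrade {stage}"
    if detail:
        text += f": {detail}"
    return {"text": text}


def notify(webhook_url, payload, timeout=10):
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

An upgrade script would call `notify(url, build_payload("prod", "started"))` before the upgrade and again with "finished" plus a short health summary afterwards, which answers the "started? finished? how does it look?" questions without anyone having to ask. Keep the webhook URL out of source control, for example in an environment variable.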
As DevOps or IT professionals, we often get asked why people can’t ssh to servers. It happens from time to time, doesn’t it? Not much fun. Just routine work.
Want to ease the pain and burden? Let’s examine common ssh failures together. If it helps, forward this link to your colleagues next time. They may be able to identify the root cause by themselves, or at least collect all the necessary information efficiently before turning to us.
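A useful first triage step is separating network problems from authentication problems. Here is a minimal sketch, assuming plain TCP access to the target: it connects to the ssh port and reads the server's identification line (RFC 4253 has the server send a string like `SSH-2.0-...` right after connecting). A timeout, a refusal, or a valid banner each point to a different class of failure. Together with `ssh -vvv user@host` output, this is the kind of information worth collecting before escalating.

```python
"""First-pass ssh triage: is the port reachable, and does an SSH server answer?"""
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the server's identification line, or a short diagnosis string."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(256)
    except socket.timeout:
        return "timeout: host down, wrong IP, or a firewall dropping packets"
    except ConnectionRefusedError:
        return "refused: host is up but nothing listens on that port"
    except OSError as exc:
        return f"network error: {exc}"
    first_line = data.decode("ascii", errors="replace").partition("\n")[0].strip()
    if first_line.startswith("SSH-"):
        return first_line
    return f"unexpected reply: {first_line!r}"
```

If a banner comes back, the network path is fine and the problem is more likely keys, users, or permissions on the server side.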