For DevOps, installation is one of our major tasks. People may think package installation is pretty straightforward and easy now: just run commands like apt-get, yum, or brew. Or simply leave it to containers.
Is it really that easy? Here is a list of headaches and hidden costs. Discuss with us, DevOps gurus!
Original Article: https://dennyzhang.com/installation_failure
Admit it. We all have unexpected installation failures.
Okay, we have wrapped up multiple scripts that install and configure all required services and components. And the test looks good. Services are running correctly, and the GUI opens nicely. It feels just great. We may even feel a bit proud of our achievements. Shouldn’t we?
Then more and more people start using our code to do the deployment. That’s when the real fun comes. Oh, yes. Surprises and embarrassments too. Package installation fails with endless issues. The process mysteriously gets stuck somewhere with few clues. Or the installation itself seems fine, but the system just doesn’t behave the same as in our testing envs.
At first people won’t complain; they understand that it happens. However, with more and more issues, the mood changes. And you feel the pressure! You tell yourself the failure won’t and shouldn’t happen again. But do you really have 100% confidence?
Your boss and colleagues have their concerns too. This part hasn’t changed that much, and the task seems quite straightforward. Why does it take so long? And how much longer will you need to stabilize the installation?
Sounds familiar? It’s exactly how I have felt in past years. So what are the moving parts and obstacles, really, in terms of system installation? We want to deliver the installation feature quickly. And it has to be reliable and stable.
Problem1: Tools are in rapid development, and package dependencies are complicated, with incompatible versions.
Linux is powerful because it believes in the philosophy of simplicity. Each tool is there for one simple purpose. Then we combine different tools into bigger ones, for bigger missions.
That’s the so-called integration. Yeah, the integration!
If we only integrate stable and well-known tools, we’re in luck. Things will probably go smoothly. Otherwise the situation can be very different.
- Tools in rapid development usually come with issues, limitations and workarounds.
Even worse, the error messages can be confusing. Check the error below from Chef development. How could we easily guess, at first sight, that it’s a locale issue and not a bug?
    Installing yum-epel (0.6.0) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
    Installing yum (3.5.3) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
    /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `encode': "\xC2" on US-ASCII (Encoding::InvalidByteSequenceError)
        from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `initialize'
        from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `new'
        from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `parse'
        from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:473:in `from_json'
        from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:29:in `from_json'
        from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook.rb:36:in `from_path'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cached_cookbook.rb:15:in `from_store_path'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:86:in `cookbook'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:67:in `import'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:30:in `import'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:106:in `block in install'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:38:in `block in download'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `each'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `download'
        from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:105:in `install'
        from /var/lib/gems/1.9.1/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
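Since the trace above comes down to a non-UTF-8 locale, one cheap guard is to check the locale before the deployment starts. Here is a minimal Python sketch of such a preflight check. The function names are my own, and it only inspects the standard LC_ALL, LC_CTYPE and LANG variables; it assumes nothing about how Chef or Berkshelf read them.

```python
import os


def effective_ctype_locale(env):
    """Return the locale that governs character encoding.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset; with nothing set the locale is "C".
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def is_utf8_locale(env):
    """True if the effective locale names a UTF-8 codeset, e.g. en_US.UTF-8."""
    locale_name = effective_ctype_locale(env).split("@", 1)[0]  # drop modifier
    _, dot, codeset = locale_name.partition(".")
    return bool(dot) and codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    if not is_utf8_locale(os.environ):
        print("warning: locale is %r, not UTF-8" % effective_ctype_locale(os.environ))
```

Running it at the top of a deployment turns a confusing encoding stack trace into a one-line warning. Whether exporting a UTF-8 locale actually cures a given failure still has to be verified case by case.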
- Incompatible versions happen frequently in system integration. Using the latest released version of every tool usually works, but not always. Sometimes our development team has its own preferences, which makes things a bit more complicated.
We see issues like the one below constantly. Yes, I know. I need to upgrade ruby, python, or whatever. It just takes time. Unplanned work, again.
    sudo gem install rack -v '2.0.1'
    ERROR: Error installing rack:
        rack requires Ruby version >= 2.2.2.
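A failure like this can be caught before the install starts by comparing the runtime version against the stated minimum. The sketch below is plain Python with hypothetical helper names; it compares numeric components only and ignores suffixes such as patch levels, so treat it as an illustration rather than a full version parser.

```python
import re


def parse_version(text):
    """Turn '2.2.2' or '1.9.3p551' into a tuple of ints, e.g. (2, 2, 2)."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    if not parts:
        raise ValueError("no numeric version in %r" % text)
    return tuple(parts)


def pad(version, width):
    """Right-pad a version tuple with zeros so 2.2 compares equal to 2.2.0."""
    return version + (0,) * (width - len(version))


def meets_minimum(installed, required):
    """True if installed >= required, comparing numeric components."""
    have, need = parse_version(installed), parse_version(required)
    width = max(len(have), len(need))
    return pad(have, width) >= pad(need, width)
```

For the rack case, `meets_minimum("1.9.3p551", "2.2.2")` is false, so the script can stop early with a clear message instead of failing halfway through.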
Tips: Record the exact version of every component, including the OS. After a successful deployment, I usually dump the versions automatically, using the trick described in another post: Compare Difference Of Two Envs.
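The linked post has its own trick; as a separate illustration, here is a minimal Python sketch of dumping versions after a deployment. The component list at the bottom is hypothetical, and the sketch assumes each tool prints its version on the first line of output.

```python
import subprocess
import sys


def collect_versions(commands):
    """Run each version command and keep the first line of its output.

    `commands` maps a label to an argv list, e.g. {"ruby": ["ruby", "--version"]}.
    """
    versions = {}
    for label, argv in sorted(commands.items()):
        try:
            result = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # some tools print versions to stderr
                universal_newlines=True,
                timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired) as error:
            # Missing binary or a hung command: record it instead of crashing.
            versions[label] = "unavailable (%s)" % type(error).__name__
            continue
        lines = result.stdout.strip().splitlines()
        versions[label] = lines[0] if lines else "unknown"
    return versions


if __name__ == "__main__":
    # Hypothetical component list; replace with whatever the deployment installs.
    wanted = {
        "python": [sys.executable, "--version"],
        "ruby": ["ruby", "--version"],
        "os": ["uname", "-sr"],
    }
    for label, line in collect_versions(wanted).items():
        print("%s: %s" % (label, line))
```

Saving this output next to each successful deployment gives us something concrete to diff when a later run misbehaves.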
Problem2: Every network request is a vulnerable failure point
It’s quite common for an installation to run commands like “apt-get/yum” or “curl/wget”, which launch outgoing requests.
Well, watch out for any network request, my friends.
- The external server may return 5XX errors, time out, or be slower than before.
- Files are removed from the server, which results in HTTP 404 errors.
- The corporate firewall blocks the requests, out of concern for security or data leaks.
Each outgoing network request is a failure point. Consequently our deployment fails or suffers.
Tips: Replicate as many of these services as possible on servers under our control, e.g. a local HTTP server, an apt repo server, etc.
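Once a mirror exists, the deployment scripts have to actually use it. One way to sketch that in Python is a small URL rewriter, shown below. The mirror host `mirror.internal.example` and the mapping are made up for illustration; a real setup would more likely configure apt or yum itself to point at the mirror.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mirror(url, mirrors):
    """Rewrite `url` to an internal mirror when its host has one.

    `mirrors` maps an upstream host to a mirror base URL, for example
    {"archive.ubuntu.com": "http://mirror.internal.example/ubuntu-archive"}.
    URLs whose host is not listed are returned unchanged.
    """
    parts = urlsplit(url)
    base = mirrors.get(parts.hostname or "")
    if base is None:
        return url
    mirror = urlsplit(base)
    # Keep the upstream path below the mirror's own base path.
    path = mirror.path.rstrip("/") + "/" + parts.path.lstrip("/")
    return urlunsplit((mirror.scheme, mirror.netloc, path, parts.query, ""))
```

Keeping the mapping in one place also documents which external servers the deployment still depends on.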
People might try to pre-cache all internet downloads by building customized OS images or docker images. This is meaningful for installation with no network. But it comes with a cost: things are now more complicated, and it takes a significant amount of effort.
Tips: Record all outgoing network requests during deployment. Yes, the issue is still there. But this gives us valuable input: what to improve and what to check. Tracking requests can be done easily: Monitor Outbound Traffic In Deployment.
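Monitoring real traffic is the thorough way. As a cheaper first pass, and a different technique from the one in the linked post, we can statically scan the deployment scripts for hard-coded URLs. The Python sketch below does only that; it will not see requests made indirectly by apt-get, yum or gem.

```python
import re
from urllib.parse import urlsplit

# Matches http, https and ftp URLs up to whitespace, quotes or brackets.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s'"<>)]+""")


def outbound_hosts(script_text):
    """Return the sorted, de-duplicated hosts of all URLs found in the text."""
    hosts = set()
    for url in URL_PATTERN.findall(script_text):
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)
```

Feeding every script in the repo through `outbound_hosts` gives a quick list of servers to mirror, whitelist in the firewall, or at least keep an eye on.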
Problem3: Always installing the latest version could be troublesome.
People quite often install packages like below.
    apt-get -y update && \
        apt-get -y install ruby
But which version will we get? Today we get ruby 1.9.3. But months later, it could be ruby 2.0.0, or 2.2.2. You do see the potential risks, don’t you?
Tips: Only install packages with a fixed version
|Name|Install latest version|Install fixed version|
|---|---|---|
|Ubuntu|apt-get install docker-engine|apt-get install docker-engine=1.12.1-0~trusty|
|CentOS|yum install kernel-debuginfo|yum install kernel-debuginfo-2.6.18-238.19.1.el5|
|Ruby|gem install rubocop|gem install rubocop -v '0.44.1'|
|Python|pip install flake8|pip install flake8==2.0|
|NodeJs|npm install express|npm install express@<version>|
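To keep unpinned installs from sneaking back in, the install commands themselves can be linted. Below is a rough Python sketch, with a function name of my own, that flags packages without a version pin for apt-get, pip, npm and gem. It is deliberately naive: yum is skipped because its name-version form is ambiguous, and flags that take a value (other than gem's -v) are not understood.

```python
import shlex


def unpinned_packages(command):
    """Return package names in an install command that carry no version pin.

    Pins recognised: apt-get pkg=version, pip pkg==version, npm pkg@version,
    and gem -v/--version (which pins the gems listed in the same command).
    """
    words = shlex.split(command)
    if words and words[0] == "sudo":
        words = words[1:]
    if len(words) < 2 or "install" not in words[1:]:
        return []
    tool = words[0]
    args = words[words.index("install") + 1:]

    packages = []
    gem_pinned = False
    skip_next = False
    for arg in args:
        if skip_next:  # this word is the value of gem's -v flag
            skip_next = False
            continue
        if tool == "gem" and arg in ("-v", "--version"):
            gem_pinned = True
            skip_next = True
            continue
        if arg.startswith("-"):
            continue
        packages.append(arg)

    if tool == "gem":
        return [] if gem_pinned else packages
    marker = {"apt-get": "=", "pip": "==", "npm": "@"}.get(tool)
    if marker is None:
        return []  # unknown tool: nothing to report
    # lstrip("@") keeps npm scoped names like @scope/pkg from counting as pinned.
    return [name for name in packages if marker not in name.lstrip("@")]
```

For example, `unpinned_packages("apt-get -y install ruby")` reports `ruby`, while the pinned docker-engine command from the table reports nothing.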
Problem4: Better to avoid installing from 3rd-party repos
Let’s say we want to install haproxy 1.6. However, the official Ubuntu repo only provides haproxy 1.4 or 1.5. So we finally find a nice way like this.
    sudo apt-get install software-properties-common
    add-apt-repository ppa:vbernat/haproxy-1.6
    apt-get update
    apt-get dist-upgrade
    apt-get install haproxy
It works like a charm. But wait, does this really put an end to the problem? Yes, mostly. However, it fails from time to time.
- The availability of a 3rd-party repo is usually lower than that of the official repo.
    ---- Begin output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
    STDOUT: Executing: gpg --ignore-time-conflict --no-options --no-default-keyring --homedir /tmp/tmp.VTYpQ40FG8 --no-auto-check-trustdb --trust-model always --keyring /etc/apt/trusted.gpg --primary-keyring /etc/apt/trusted.gpg --keyring /etc/apt/trusted.gpg.d/brightbox-ruby-ng.gpg --keyring /etc/apt/trusted.gpg.d/oreste-notelli-ppa.gpg --keyring /etc/apt/trusted.gpg.d/webupd8team-java.gpg --keyserver keyserver.ubuntu.com --recv 1C61B9CD
    gpgkeys: key 1C61B9CD can't be retrieved
    STDERR: gpg: requesting key 1C61B9CD from hkp server keyserver.ubuntu.com
    gpg: no valid OpenPGP data found.
    gpg: Total number processed: 0
    ---- End output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
- A 3rd-party repo is more likely to change. Now you get 1.6.5 and are happy with that. But suddenly, days later, it starts to install 1.6.6 or 1.6.7. Surprise!
Tips: Avoid 3rd-party repos as much as possible. If there is no other way, closely track and examine the version installed.
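Tracking the installed version can be as simple as comparing a listing of installed packages against the versions we expect. The Python sketch below assumes the listing has one "name version" pair per line, which is what `dpkg-query -W -f='${Package} ${Version}\n'` prints on Debian/Ubuntu; the function names and the version strings in the example are illustrative only.

```python
def parse_package_listing(text):
    """Parse lines of 'name version' into a dict; other lines are ignored."""
    installed = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            installed[fields[0]] = fields[1]
    return installed


def version_drift(expected, installed):
    """Compare pinned versions with what is actually installed.

    Both arguments map a package name to a version string. The result maps
    each drifting package to an (expected, installed) pair; a package missing
    from `installed` is reported with None.
    """
    drift = {}
    for name, wanted in sorted(expected.items()):
        actual = installed.get(name)
        if actual != wanted:
            drift[name] = (wanted, actual)
    return drift
```

If we expect haproxy 1.6.5 and the listing says 1.6.6, `version_drift` returns `{"haproxy": ("1.6.5", "1.6.6")}`, and the deployment can fail loudly instead of surprising us days later.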
Problem5: Installing from source code could be painful.
If we can install directly from source code, it’s much more reliable.
But the problem is …
- It’s usually harder. Try to build Linux from scratch, and you will feel the disaster and mess. Too many weird errors, missing packages, conflicting versions, etc. It feels like flying a plane without a manual.
- Compiling from source takes much longer. For example, compiling nodejs would take ~30 min, while apt-get only takes seconds.
- Missing service management facilities. We want to manage the service via “service XXX status/stop/start”, and configure it to start automatically. With a source code installation, these might be missing.
Do containers cure the pain?
Nowadays more and more people are starting to use containers to avoid installation failures.
Yes, they largely reduce the failures for end users.
Well, they don’t solve the problem completely, especially for DevOps. We’re the ones who provide the docker image. Right?
To build images from a Dockerfile, we still face the 5 common failures listed above. In conclusion, containers shift the failure risk from the real deployment to the image build process.
Further reading: 5 Tips For Building Docker Image.
Bring it all together
Improvement suggestions for package installation
- List all versions and hidden dependencies
- Monitor all external outgoing traffic
- Only install packages with fixed versions
- Try your best to avoid 3rd-party repos
Containers largely help to reduce installation failures. For DevOps like us, we still need to deal with all of the above possible failures in the image build process.