Unix Sysadmin Aphorisms

Where I work currently we have an email alias to which admins are expected to send reports on system things they change which have any small chance of breaking things. This alias sends mail to all the admins, and further is logged into a file - one per year - in a directory devoted to these. Now, this logging may be objected to as being redundant if you're already using version control, but this alias gives the admin a chance to go into background which might be omitted in the change log. It's also a lot easier to grep for changes this way, not surprisingly.

Finally, Catherine Fulmer also sent me these comments about keeping your coworkers informed:

In addition to your email alias for sysadmins to share information, we use another technique that I find very useful. First we train all sysads to use su to become root (forcing it to always be "su - root") and make exit (and/or shutdown) an alias that sends the command history to the email group.

This not only tells everyone that someone did something as root (in case they forget to), but comes in handy as reference to see what you did the last time when installing software xyz...

Alias example (we use ksh for root):

    HISTSIZE=1999;export HISTSIZE
    alias exit="{ fc -l 1 | mailx -s su_`hostname` suadmin; kill -1 $$; }"

One thing that we also tend to do is use an echo command to add comments to the history (occasionally even amusing ones).

    1       cdsa
    2       cd bin
    3       l
    4       ./OFFSITE_ALL /tmp/offsave 2>&1
    5       echo offsite complete
    6       exit

Originally submitted by William S. Annis on 21dec99.

What was the return code?

If you write a script that uses a command capable of failing - all of them basically - you should check the return codes. Many commands will in fact return different error codes for different sorts of problems, conveniently listed in the man page near the end.

If you're using perl, use C style error code checking. If you're using shell (Bourne, Bash or Korn shell - never program in the C shell) checking for errors will add a few lines of code, but the check itself is trivial, although surprisingly few people seem aware of the syntax to do so.

Everyone is familiar with the standard file checks in shell:

    if [ ! -x /usr/sbin/mount ]
    then
        echo "Where did /usr/sbin/mount go?!  Exiting."
        exit 1
    fi

Since every command in the Unix world returns an error code (remember the strange convention here: zero is true/success and non-zero is false/failure) you can simply do the following to catch an error:

    # Note: you don't need the square brackets here.
    if ! mkdir -m 0750 "$udir"
    then
        echo "I can't make $udir!  Bailing."
        exit 1
    fi

As an aside, grep(1) has a lovely property: it returns a failure exit status when a search fails, so you can do things like this:

    # Redirect output if you don't want your script to show matches.
    # Anchor the pattern so that "bob" does not also match "bobby".
    if ! grep "^$user:" /etc/passwd > /dev/null
    then
        echo "User $user doesn't exist."
        exit 1
    fi
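
Since many commands return different codes for different failures, here is a minimal sketch of telling them apart. It assumes a POSIX grep, which exits 0 on a match, 1 on no match, and something greater than 1 on a real error such as an unreadable file. The function name check_user and the idea of passing the passwd-style file as an argument are mine, purely for illustration.

```shell
#!/bin/sh
# check_user NAME FILE: hypothetical helper. Looks for NAME as the first
# colon-separated field of a passwd-style FILE and reports what happened.
# NAME is used as part of a regular expression, so keep it to plain names.
check_user() {
    status=0
    grep "^$1:" "$2" > /dev/null 2>&1 || status=$?
    case $status in
        0) echo "found" ;;      # grep matched at least one line
        1) echo "missing" ;;    # grep ran fine but matched nothing
        *) echo "error" ;;      # grep itself failed (bad file, etc.)
    esac
    return $status
}
```

Something like check_user "$user" /etc/passwd then lets a script treat "no such user" and "could not read the file" as the separate problems they are.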

Originally submitted by William S. Annis on 21dec99.

Where does your symbolic link point today?

There are a slew of denial of service and security attacks that take advantage of the many programs that don't verify that they are in fact talking to a real file and not a symbolic link. Tricks like this:

      # ln -s /etc/passwd /etc/syslog.pid

will cause any number of trials and tribulations when syslogd restarts or when the machine reboots. There are of course many more insidious and subtle examples. See any good security site for more details on this.
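
As a rough sketch of the kind of check a script can make before writing to a well-known path, the following refuses to write through a symbolic link. The helper name write_pidfile is hypothetical. Note also that a check followed by a write still leaves a small window in which the path can be swapped, so treat this as an illustration of the test, not a complete defense.

```shell
#!/bin/sh
# write_pidfile PATH PID: hypothetical helper that refuses to write through
# a symbolic link. test -L is true when PATH is a symlink, even a dangling one.
write_pidfile() {
    if [ -L "$1" ]
    then
        echo "$1 is a symbolic link; refusing to write." >&2
        return 1
    fi
    echo "$2" > "$1"
}
```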

Originally submitted by William S. Annis on 22dec99.

Don't HUP RPC.

Most RPC services get very irritable and stop working when either they or the portmapper/rpcbind daemon get killed or receive a HUP signal. So don't do it.

Originally submitted by William S. Annis on 21dec99.

Users want to work, not to hear about Unix admin.

When you fix a problem for a user, tell them it's fixed. Don't tell them how you fixed the problem, or even give them the details of what the problem is unless they want to know or if something they are doing is causing the problem.

Tech talk is not interesting to most people, and worse, may make them feel inadequate or that they're being patronized.

Originally submitted by William S. Annis on 21dec99.

Who's using this computer right now?

That is, don't do wild things on production machines. See who's logged on and what daemons are running before any unplanned reboot.

Originally submitted by William S. Annis on 21dec99.

A web browser is a user.

This could just as easily have been 'DNS is a user' or 'NFS is a user.' The point is that although no one may be logged into a machine, many people may still be depending on it.

Originally submitted by William S. Annis on 21dec99.

Linux puts everything somewhere else.

For people who did sysadmin before Linux came on the scene, this is perhaps the most irritating feature of that environment. Config files, startup scripts, binaries... everything seems to end up someplace a little odd.

So, it is very important for new admins whose only previous experience is with Linux to keep that in mind and not try to turn everything they get their hands on into Linux. In fact, this can be said of any OS. Each has its own differences which have to be learned and for the most part accepted. I've seen what happens when Solaris is bent into looking like AIX. It ain't pretty.

Originally submitted by William S. Annis on 21dec99.

Backticks are fundamental to shell, poison to perl.

When using perl, use open(CMD, "command |") instead of doing everything via backticks. Backticks result in wildly non-portable code, especially if you do complex pipes and shell tricks within them. If you're being tricky in backticks in perl, perhaps you should use shell for the whole project.

Originally submitted by William S. Annis on 21dec99.

RPC treats every network interface like a computer

This includes virtual network interfaces.

If you're using NFS and the netgroup map, you need to remember to include the name and IP of the new NIC in that map, or you'll get permissions errors.

As it turns out, some OSes handle this correctly and direct all outgoing traffic through the primary interface, but you can't guarantee that in many other cases.

Originally submitted by William S. Annis on 21dec99.

Does it really have to run as root?

When setting up cron jobs to do useful things for you, make sure you're not running them as root when that isn't necessary. Creating a system admin account to keep such cron jobs in is a good idea.

Originally submitted by William S. Annis on 21dec99.

Why are you still root?

When you're done doing a root task, log out of the root account. It's much too dangerous staying root when you don't need that power.

One popular restatement of this: "Put the knives away when you're done with them."

Originally submitted by William S. Annis on 21dec99.

It's not finished until you can close the rack door.

That is, keep your cables neat. Apart from looking better, it will make it much easier to hunt down what's attached to whom.

Originally submitted by Tom Limoncelli on 21dec99.

It's not finished until it's documented.

This may originally have been said by Tom Limoncelli.

I myself am a big fan of building documentation into whatever you've done to the extent that's possible. Using Perl's POD to build a man page right into a program is a good example. We love Python's doc strings.

Originally submitted by David Todd on 21dec99.

Documentation isn't done until someone else understands it.

Make sure your documentation can actually be used by someone else. In particular, a procedure to follow to accomplish some task should be written in much the same way you'd write a program. This includes a debugging phase where someone unfamiliar with the task follows the procedure to see if it actually works.

Originally submitted by William S. Annis on 12jan2000.

security = 1/convenience

Choosing a balance between security and convenience can be maddeningly difficult. Any choice is probably going to irritate some users.

Originally submitted by Steve Barnet on 21dec99.

Friday afternoon is a good time not to change production systems.

Unless you're on call and like coming in on Saturday evening.

Originally submitted by Steve Barnet on 21dec99.

Someone had this problem before.

And almost certainly wrote a utility to deal with it already.

Originally submitted by David Todd on 21dec99.

Make sure you can go backward before you go forward.

"Backup files, backup tapes, backup hardware, backup admins."

Originally submitted by David Todd on 21dec99.

Email alone is not communication.

Just because you sent someone email that does not necessarily mean you have communicated with them. There's no reason not to call someone if you're not sure.

Sending email is almost an instinctual form of communication among Unix admins, but do try to avoid the common mistake of sending email to someone who's having email problems.

Originally submitted by David Todd on 21dec99.

There's no such thing as a temporary fix.

Make sure you can live with it longer than you think you will have to.

Originally submitted by David Todd on 21dec99.

Renormalize, renormalize, renormalize.

Two copies of the data means twice as much work, and n^2 as much hassle.

Originally submitted by David Todd on 21dec99.

Is it really your responsibility to fix this machine?

One of your coworkers may already be working on it. Also, there are a slew of change control issues lurking in this sentence. Finally, if you're part of a team with well-defined specialties, perhaps you should focus on doing your own part, not everyone else's, fun though their job may seem at the moment.

Originally submitted by Chris Josephes on 21dec99.

Your way of thinking isn't the only way of thinking.

'Nuff said. There's often a lot to learn from a new viewpoint.

Originally submitted by Chris Josephes on 21dec99.

cron is not a shell script tool.

Now, we've all done this. We have a problem that can be solved by running a few quick commands nightly. So we create a cron entry to do so. Then we realize we forgot something, and change the cron entry a little, until we end up with something much like this (from Chris's original email):

    0 0 * * * (cd /var/log; mv *.log oldlogs/; PID=`ps -ef | grep syslog`;
    kill -HUP $PID; mailx -s "Here's the output of my cron job"
    name_withheld@somewhere.com)

Don't do this. Write a script and call that from cron.
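
A minimal sketch of what such a script might look like, with each step checked. The script name rotate_logs.sh, its install path, the function names, and the pid file location are all assumptions of mine; check where your own syslogd keeps its pid file.

```shell
#!/bin/sh
# rotate_logs.sh: hypothetical replacement for the one-line cron entry.

# Move every *.log file in directory $1 into $1/oldlogs, checking each step.
rotate_logs() {
    logdir=$1
    mkdir -p "$logdir/oldlogs" || return 1
    for f in "$logdir"/*.log
    do
        [ -f "$f" ] || continue     # no matches: the glob stays literal
        mv "$f" "$logdir/oldlogs/" || return 1
    done
}

# Ask the daemon whose pid is stored in file $1 to reopen its logs.
hup_from_pidfile() {
    [ -r "$1" ] || return 1
    kill -HUP "$(cat "$1")"
}

# The crontab entry then shrinks to a single call, for example:
#   0 0 * * * /usr/local/sbin/rotate_logs.sh
# with the script ending in something like:
#   rotate_logs /var/log && hup_from_pidfile /var/run/syslogd.pid
```

Reading the pid from a file also avoids the ps | grep trick, which tends to match the grep itself.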

Originally submitted by Chris Josephes on 21dec99.

Don't rm big files, cat them.

This one is a touch subtle, and is also one everyone has messed up at least once in their careers. Unix uses a reference counting scheme to decide when to wipe a file's inode off the disk. So, a file with several hard links to it will have a count greater than one (see your man page on ls to find out which column of a long list gives the reference count).

The entertaining part comes into play when you realize that an open file has its ref count moved up. This prevents you from removing a file out from beneath some poor, hapless program.

So, you've got a machine with no space in /var and you find to your horror that your cron logs are vast.

So you remove that log file.

The disk space stays the same.

The file is still there on the disk, and will remain there until its reference count goes to zero. Only the name is missing. You'll have to restart cron at this point, to get it to close the log.

When shutting down the service to clean its log file is not an option, the correct way to deal with a huge file is to cat /dev/null into it, which obliterates the contents while leaving the file itself in place: cat /dev/null > huge_file. This may render the log unparsable to log parsing tools if an incomplete log line ends up at the head of the file, so make sure it's genuinely not possible to shut down the service for the 30 seconds or so it would take to clean the log file normally.

(Some admins object to using this trick since the resulting log file can get corrupted. YMMV - your mileage may vary).
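
To make the difference from rm concrete, here is a small sketch. The helper name empty_file is hypothetical. The point is that truncating keeps the same inode, so whatever process has the file open keeps writing to a file you can still see and manage.

```shell
#!/bin/sh
# empty_file FILE: hypothetical helper. Truncates FILE in place, so the
# inode, and any process holding it open, stays the same.
empty_file() {
    cat /dev/null > "$1"
}

# Typical use, comparing the inode number before and after:
#   ls -i huge_file
#   empty_file huge_file
#   ls -i huge_file      # same inode number, zero length
```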

One side benefit of this behavior comes into play for certain security situations. You can open a temporary file and unlink it before shoving data into it. So long as you never close it you can continue to use that file with no one knowing it exists. Well, except lsof.

Originally submitted by Ben Woodard on 21dec99.

The best time to version control a config file is before you modify it.

Don't wait until you need to change a file before you register it with your change control system. It's too easy to forget in the heat of the moment.

Originally submitted by Tom Limoncelli on 21dec99.

Every config file has a history.

Do you know it? You should.

A config file should know where it came from, especially if you're using a central repository for canonical versions of these files.

It should know how old it is, which most version control systems will handle for you. Don't confuse the Unix mtime date with when the file was last edited. Some configuration management systems may screw that up.

It should know what changes it has experienced in its life. Again, a good version control system will let you do this, so check in changes regularly, don't just let them accumulate.

Finally, if the location information in the config file doesn't say how it was copied to all machines, you should include that information as well, so your cohorts will know whether they need to run make or rdist or something else on a configuration repository machine.

Originally submitted by Tom Limoncelli on 21dec99.

Think twice. Hit enter once.

Originally submitted by Michael Jennings on 22dec99.

666: permissions of the devil.

This is an obvious security problem for any system files. In fact, can you think of a single file everyone should be able to write to?
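
If you want to go hunting for such files, a sketch along these lines should do. The function name world_writable is made up for this example; find's -perm -002 test matches any file with the "other write" bit set, whatever the remaining bits are.

```shell
#!/bin/sh
# world_writable DIR: hypothetical helper listing regular files under DIR
# that anyone on the system may write to.
world_writable() {
    find "$1" -type f -perm -002 -print
}
```

Running world_writable /etc on a healthy machine should print nothing.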

Originally submitted by Michael Jennings on 22dec99.

`t' is for sticky; `t' is for /tmp

The various high bits of the stat structure have different meanings for directories than they do on files. The sticky bit (chmod 1555 *file*) was originally used to say that you wanted a binary file to linger for a while in memory (as I recall). On a directory, however, the sticky bit says "even if a person has write permission to this directory, don't let them remove files they don't own." So, the correct permissions for /tmp are attained thus: chmod 1777 /tmp
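
A quick way to check a directory from a script, sketched under the assumption that your ls -ld prints the usual ten-character mode string (drwxrwxrwt and friends). The helper name has_sticky_bit is hypothetical.

```shell
#!/bin/sh
# has_sticky_bit DIR: hypothetical helper. Succeeds when the tenth character
# of the ls -ld mode string is t (sticky plus other-execute) or T (sticky
# without other-execute).
has_sticky_bit() {
    case $(ls -ld "$1" | cut -c10) in
        t|T) return 0 ;;
        *)   return 1 ;;
    esac
}
```

For example: has_sticky_bit /tmp || echo "/tmp is missing its sticky bit!"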

Originally submitted by Michael Jennings on 22dec99.

When everything breaks, don't forget to take breaks.

To quote Derek in full:

This one is related to, though different from `Think twice, hit return once' and `remember, it's only money you're losing, not human life'. Aside from needing time, even in panicky, rushed situations, to get your head straight about how to fix something, your body _always_ needs to stop typing every once in a while and stretch, relax, etc. You can push yourself beyond this with `What I'm doing right now is too important to take a break' but in the end, you might not be working at all, and no one wants that, not even your panicked boss, who's breathing down your neck for a solution.

Originally submitted by Derek Wright on 27dec99.

Keystrokes kill. Automate.

Now, while there are several aphorisms warning of the dangers of reinventing the wheel, you should still be on the lookout for things you find yourself typing a lot. Especially in these days of increasing awareness of the dangers of RSI, saving keystrokes is important.

Originally submitted by Derek Wright on 27dec99.

Fess up.

If you break something visible, don't blame it on the system. You're supposed to be the one keeping it running well anyway, right?

Originally submitted by Ryan Donnelly on 5jan2000.

Preserve precious permissions with cp -p.

From Dave's original submission:

Using cp(1)'s "-p" option will retain the file modification time. Quite often, when doing a post-mortem or investigating which change went awry, the timestamp is as precious as the file content.

The owner and group will also be preserved if you're root or if you have the misfortune to be using a system that lets one give away files (yuck).

    # cp -p file file.orig

Why preserving the modification time, etc. wasn't the default behavior with cp(1) I don't know, but rarely, if ever, would using "-p" be troublesome.

As it turns out, I can think of situations where you do not want to preserve permissions, especially when you're not doing temporary copies, so I'm not sure I can recommend aliasing cp in good conscience. It'll certainly be the right thing in any context where security is relevant.

The -p option to tar also preserves permissions this way, so don't forget that when you do the magical tar-pipe copy.
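
For anyone who hasn't met the tar-pipe copy, here is a minimal sketch. The function name copy_tree is hypothetical. The p flag goes on the extracting tar, where it restores the recorded permissions instead of filtering them through the current umask.

```shell
#!/bin/sh
# copy_tree SRC DST: hypothetical helper for the tar-pipe copy. One tar
# writes an archive of SRC to stdout, the other unpacks it inside DST.
copy_tree() {
    mkdir -p "$2" || return 1
    (cd "$1" && tar cf - .) | (cd "$2" && tar xpf -)
}
```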

Originally submitted by Dave Plonka on 5jan2000.

New eyes have X-ray vision.

When you've been looking at a problem constantly for several hours (or days) it may be time to bring in a fellow sysadmin to look at the problem. Often even the simplest problems can elude our attention when we've been staring at the same file or situation for too long.

I've found new eyes can be particularly helpful when you're dealing with config files with dense or fussy syntax (sendmail comes to mind) or when programming.

Originally submitted by William S. Annis on 12jan2000.

Just because you can, doesn't mean you should.

Several people suggested this one in various forms. I've been saying it for years. Chris Ice has the best story for an example, though:

I'm also reminded of my calculus prof in college. He worked on a research project for Wisconsin Power which looked at using a giant, superconducting coil as a power-generation "flywheel". Store energy during off-peak demand, and draw on it to smooth out during peak.

Problem was, if any one part of the coil got warm by even a degree F, the localized heating would cause a chain reaction and make the coil go almost instantly non-conductive. A couple hundred thousand kW would then be converted directly to heat. The explosion would be as devastating as a medium sized nuclear device.

Moral of the story..."Just because we *can*, doesn't mean we should."

Originally submitted by William Annis on 13jan2000.

Know the problem before choosing the solution.

This one has come in under various forms and has generated a lot of interesting discussion. In fact, this aphorism has three distinct areas of application.

First, make sure you actually do some analysis of a problem before making a fix. The symptom is not always the problem. I know I've provided solutions looking for a problem before. This application of the aphorism used to say "Analysis, analysis, not blind groping!"

The second area of application here requires more delicacy in handling. Often, once a user learns the technical details of some problem which has existed in the past, they tend to tell you what you need to do, not what problem they're having. So once again you run the risk of solving a problem which isn't in fact broken. One admin talked about how for months after one person got strange permissions on /dev/null, some users were sending in requests to fix the permissions of /dev/null for every problem they had.

The final area of application for this requires the most delicacy of all, since it is really one which should apply to your users. I mention it since we all have to learn how to steer users away from requesting solutions they've seen in magazines or heard about on the web and instead teach them to tell us what their problems really are, and what they need to accomplish, so we can work with them to provide a solution which works best for them in the environment we support. Sometimes the hot, new technology they want will work. Often it will not, or will do so badly.

David Parter and Ryan Donnelly both discussed this topic with me at various times.

Originally submitted by William S. Annis on 24jan2000.

Sometimes it's just the sum of the parts.

Don't discount the possibility that a problem may not be a problem at all, but a manifestation of the limitations of the system.

Originally submitted by Ryan Donnelly on 1Feb2000.

kill -9 is the last resort, not the first

Mark's own comments:

Simply because you can send a KILL signal (which cannot be trapped) to any process doesn't mean that you should - you may well end up with a zombie. In fact, kill -9 may make the situation worse as it will not give the process enough time to terminate cleanly and the process may not release shared memory, semaphores, file locks, etc. Always try to terminate a process with kill -1 or kill -15 first. If the process won't die, check to see if it is still running before resorting to kill -9. If the process is "sleeping", don't bother with kill -9 as the process is not going to die.

Originally submitted by Mark MacLennan on 2Sep2000.

An unsolvable problem is really a fact.

That is, some things we consider problems may not have workable solutions. This means, then, that we have to come up with ways to work with the problems most effectively, rather than waste our time trying to prevent the inevitable.

See the Recovery-Oriented Computing project at Berkeley for more information.

Originally submitted by William Annis on 17apr2002.

Avoid TLAs.

Your user base may not speak the lingo.

Originally submitted by Jeff Beneker on May 15, 2007.

Generated on Wed Dec 5 13:24:39 2007.