Friday, July 15, 2005

Unix-Fu

Yesterday I assisted a good friend in extracting some lines from a database dump in fixed-width format, so that he could validate the dump before importing it into another database. He is a full-fledged DB person, grand master in PL/SQL and all. Yet he has only some practical Unix experience, and only knows some key commands for manipulating big text files so that he can check his database dumps.

He had already checked the start of the file with head -10000, and now was checking the end of the file with tail -10100, in order to get the last 10,100 lines (records, in a DB dump) from the file. For some reason his sample file was coming out with only about 79 lines, so he sent me a message to ask for help.

We checked what Unix he was using (uname -a) and, lo and behold, it was HP-UX, which I use almost daily at work, so it would be easy to check manpages and all. I asked him to count the lines in the file (wc -l), and the big dump had 15,700,000 lines in it. Clearly tail should have been able to find the last 10,100 lines of the file without problems. We looked at the file size (ls -l) and it was just under 4GB. Since he was on a 64-bit machine, this should not have been a problem.

I took a look at tail's manpage on a local HP-UX machine, and easily found the source of his problem: HP-UX's tail operates on a fixed 20KiB buffer, and sure enough, his sample file was exactly 20KiB in size. Given his limited experience he thought this would be the end of it, but I promised to message him back with another command line that would extract the last 10,100 lines he wanted.
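The "about 79 lines" figure fits that buffer size. Here is a rough back-of-the-envelope check, assuming each record is 259 characters (the record length that comes up later in this story) plus one newline byte; the variable names are just for illustration:

```shell
# Assumption: 259 characters per record plus a single newline byte.
buffer_bytes=$((20 * 1024))    # the 20KiB tail buffer
record_bytes=$((259 + 1))

# How many complete records fit, and how many bytes of a partial one remain.
whole_records=$((buffer_bytes / record_bytes))
leftover_bytes=$((buffer_bytes % record_bytes))

echo "whole records in buffer: $whole_records"
echo "bytes left for a partial record: $leftover_bytes"
```

Under those assumptions the buffer holds 78 whole records plus part of another one, which matches a sample file of roughly 79 lines.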

I quickly looked at awk's manpage (I should exercise my memory better) and sent him 'awk "NR > 15700000 - 10100" dump_file.txt > sample_file.txt' (a strict >, so that exactly 10,100 lines come out). It crunched through the file and after a few minutes exited, leaving a perfect sample file for his validation. He was happy and wrote the command down on his notepad. I thought that was it.
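For the notepad, the same idea can be sketched without hard-coding the line count. This is only a sketch: last_lines is a made-up helper name, and it assumes an awk that accepts -v (POSIX awk does; older awks may not). It reads the file twice, once for wc -l and once for awk, and uses a strict comparison so that exactly N lines are printed:

```shell
# last_lines N FILE: print the last N lines of FILE.
# Hypothetical helper; a workaround for a tail with a small fixed buffer.
last_lines() {
    n=$1
    file=$2
    # First pass: count the lines (reading from stdin keeps the file name out of the output).
    total=$(wc -l < "$file")
    # Second pass: print every line whose number is past the cut-off.
    awk -v start="$((total - n))" 'NR > start' "$file"
}
```

It would be used as 'last_lines 10100 dump_file.txt > sample_file.txt'.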

Later in the day he sent another message, saying some records were broken and he was trying to find lines that did not have the expected length of 259 characters. What I liked about it is that he was already peeking at awk's docs and was trying 'awk "length =! 259" dump_file.txt > broken_recs.txt'. I told him it was != instead of =!, and he quickly found the broken records the database dumper had generated. I really enjoyed the fact that he took my small awk tidbit and was already improving on it. I explained to him that if he handled textual database dumps frequently, awk would be his lifesaver, and he asked if I could send him docs and examples.
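A small variation on that check, as a sketch: instead of copying the broken records themselves, report where they are and how long they turned out to be. The name bad_records is made up for illustration, and -v again assumes a POSIX awk:

```shell
# bad_records WIDTH FILE: for every line whose length differs from WIDTH,
# print "line_number: actual_length".
# Hypothetical helper, not part of any standard toolset.
bad_records() {
    awk -v width="$1" 'length($0) != width { print NR ": " length($0) }' "$2"
}
```

It would be used as 'bad_records 259 dump_file.txt', which makes it easier to jump to the offending lines in the dump.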

I recognized that in some sense he had reached a new level in his practical Unix experience, and jokingly told him we would later hold a ceremony where he would receive his new Unix-Fu belt. Unix-Fu belts stuck in my mind for the rest of the day (not for the first time, though: I once sent email to ThinkGeek proposing they create cool Unix-Fu, Korn-Fu and Perl-Fu belts, but never got an answer), and today I decided to take a look at the Kung-Fu belt degrees and mimic them with Unix-Fu knowledge and experience.

So here's my initial proposal for a Unix-Fu belt system:

White Belt: Has a few commands written on a notepad, retypes them when needed with little variation but is starting to perceive the power of the command line and wants to learn the path.

Yellow Belt: Knows about variable substitution and shell redirection, able to create simple pipelines to achieve some interesting data manipulations. Still gets some Useless Use of Cat Awards, though. Can create small scripts with their sequences of simple pipelines and a few temporary files. Reuses some sed and awk tricks from the written notepad.

Orange Belt: Has picked a preferred shell and read its documentation several times. Uses all kinds of substitutions on the command line, and knows when to use ', " and \ to achieve the necessary level of escaping and substitution. Creates networked shell scripts. Never sees a ^H again and is fully able to understand and [ab]use X. Has read sed & awk three times.

Green Belt: Perl. C. CVS. Contributes to free software projects. Understands Unix history and picked a Unix variant of choice with reasonable argument.

Blue Belt: Has created some unique commands for personal use and shared with friends. Started some free software projects, based on those commands. Contributed full userlevel features to their Unix of choice. Has ~/bin as the first element of $PATH.

Purple Belt: Knows every config file in /etc and installs perfectly tight Unix systems. Frequently recompiles their Unix of choice, creating custom installation images. Creates LiveCD images of their Unix of choice to carry around.

Brown Belt: Understands the static and dynamic binary formats of their Unix of choice, has enhanced and fixed bugs in basic libraries such as libc, libm and libz and has implemented a few network services, with and without inetd. Is capable of creating auto-generating shell command lines of the third degree.

Black Belt: Has developed considerable features and fixed bugs at kernel level for at least one Unix variant. Has fought and won SMP race conditions on device drivers without resorting to big locks. Recompiles their Unix of choice with a bigger limit for command lines.

Or maybe we should split the art of Unix-Fu into styles: User Style, Admin Style and Developer Style, this last one having two variations: the way of the process and the way of the kernel. But that's for another day.