February 18, 2009

The Curious Case of 1392

There are occasions in technology where events or results inspire feelings of mystery. Those, “it must be magic” moments when our existing body of knowledge is incapable of processing the situation. I encountered one such event today, so I thought I would share a bit.

Some time ago I designed an architecture for a business partner whereby a series of system updates could be delivered into a closed network topology. Meaning, the target computers were behind a gateway system with no direct ingress/egress points. We needed an engine of the gateway system that would take the staged system updates, distribute them securely and with integrity to the inward-facing computers and oversee the installation of said updates in a safe manner (read: catch exceptions, manage installation order, etc.). It took a few days but we finally nailed it. Another few days were spent in rigorous QA since one of the updates was quite large and made serious modifications to the target system architecture. All use cases passed.

Then we tried in in a pilot production environment. Failure.

We tried again the next night after some modifications to the deployment architecture. Failure again.

Another day or two was spent analyzing the logs from the deployment but nothing stood out. So, I recommended that we manually install the system updates to see if the deployment architecture might not be handling some kinds of exceptions (missing files, corruption, etc.). This was tricky due to the closed network topology but we pulled it off. Lo and behold- there was corruption and it was an entire directory.

Enter stage left- error 1392.

1392 is a return error code from the system’s native copy function. Although my initial prediction rang true, I was stunned. First of all, an entire directory was corrupt but the systems had never encountered any issues. Secondly, I had a strange feeling that this corruption might be systemic; part of a base image that had been rolled out some time in the past. What was I to do?

I solved the curious case of 1392 by writing a shell script that did the following:

1) Take an empty text file and copy it into the possibly corrupted directory. This is key since it forces the target system to tell us if it indeed has a corrupted directory.

2) If we get a return code 0 (zero), go ahead and begin the deployment process.

3) If we get a return code of anything but 0 (zero) and 1392, toss an error and exit.

4) If we get a return code of 1392, take a (pre-staged) copy of the now-known-to-be corrupted directory and overwrite the corrupted one. Then start the deployment routine.

Case solved. We were able to begin full deployment on schedule.