Thursday, April 24, 2008

Upgrading Part I: Human Error

In my previous post I talked about a problem I experienced before upgrading my EBS database to 10g. Well, the upgrade has been performed, and I think it has supplied me with ideas for at least three posts. Of course, the bug experienced on the test environment didn't show itself on production (I intentionally didn't apply the fix, to see if it would re-occur), and what I have now is a very "nice" variety of other bugs. But why rush to the aftermath? Let's start with the upgrade process itself.

I had the whole upgrade procedure in soft and hard copies, had done about four upgrades on dev and test environments, and was pretty sure I could handle any problem that might pop up. Actually, many steps of the upgrade process came down to copy-pasting from my upgrade document to the command line. It's so frustrating that what eventually delayed me for about five(!) hours was a totally stupid mistake on my part.

Of course, before such an upgrade a backup should be taken. I have a script that runs every night: it brings down the system, a backup is taken (we use EMC's BCV) and then the system is brought back up. All I had to do was re-schedule this script. Being a smart ass, I decided to save some time and disable the startup script, since I only needed the database up for the upgrade process. I could think of plenty of ways to do this, but decided that emptying the startup script would be the safest - this way I wouldn't touch the main backup script. So I made a copy of the startup script and... emptied the shutdown script. Yes, you read that right, no typo.
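To make the mix-up concrete, here is a stripped-down sketch of what that nightly wrapper conceptually does (the script names and the EMC call here are placeholders I'm making up for illustration, not the real scripts):

#!/bin/sh
# nightly_backup.sh - conceptual sketch only, every name below is a placeholder
/oracle/scripts/shutdown_apps_and_db.sh   # bring the applications and the database down
/oracle/scripts/take_bcv_split.sh         # trigger the EMC BCV split against the closed database
/oracle/scripts/startup_apps_and_db.sh    # bring everything back up

The plan was to empty only the last script, so the system would simply stay down after the split; what I actually emptied was the first one.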
So the time comes, I'm watching the backup logs and WOW! that was fast: usually it takes about five minutes for the shutdown process to complete, and now the system is back up after one minute. Great! Hmmm... maybe not so great after all. So I take a closer look at the logs and discover what I have just done, meaning the BCV split was performed on an open database and is worth nothing (well, maybe not nothing, but it's a bad idea to start the upgrade with that kind of backup). Oh, and I'm going to use some EMC terms here, but that's OK, since I myself only discovered many of them during this upgrade.
Well, actually there was no reason to panic: the backup devices were mostly synchronized, and all I had to do was shut down the database and take the backup again. I called the guy in charge of storage and asked him if I was right; he said I was, and guided me through the process of taking the backup from the command line. But alas, it's not that simple!
Apparently, when EMC tells you that a split is complete, well, it's lying. The split process (again, if you have no idea what I'm talking about, just imagine some backup process) continues in the background, and until it's completed you can't take another backup. The more frustrating part is that the storage guy didn't know about this (he's new, cut him some slack) and I didn't know how to check the progress of this background split. At some point - when only one disk was still splitting and the others were already re-synchronizing - I figured out how to check the split's progress. Great, I'm at 99%, but boy, that last 1% took its share of time to complete.
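For anyone who ends up in the same spot: assuming the BCV pairs are managed with EMC's Solutions Enabler (SYMCLI), the pair states can be checked with something along these lines - the device group name is made up, and the exact syntax may differ between versions, so treat it as a pointer rather than a recipe:

# Show the state of each BCV pair in the device group:
# "SplitInProg" means the background split is still running,
# "Split" means it has actually finished.
symmir -g ORACLE_DG query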
Phew! At least now I knew where I stood. Well, not exactly: EMC has another "great" feature - when it comes to synchronizing the last parts of the disks, it takes a totally disproportionate amount of time (remember that 1%). So when there were only 50MB left to synchronize, the machine worked forever on every 0.1MB, no kidding. So I had no real estimate of when I would finally start the upgrade process, I was already about 3.5 hours late, and this is downtime we're talking about here - the kind we usually like to minimize.
So I waited. I guess this is what they call "paying for your mistakes".

Eventually the synchronization completed and I got to the upgrade itself. The upgrade was perfect: no unexpected errors, and the entire process was completed (long) before the scheduled time - thank God for buffers.

Well, a nice prelude indeed...

Wednesday, April 9, 2008

The Persistent Bug

My system has a bug that has been around since long before I became the sysadmin. It reincarnates each time in a slightly different form, but to me it's the same old bug.
I call it "the -107 bug" because its main symptom is an error message in the concurrent log that contains a long number - -107... (I don't really remember all the digits, there are lots of them). It looks to me like a really big integer wrapping over to the negative side or something like that, but I don't really know.

I was first introduced to this bug after installing some major patch (maybe ATG_PF.H), when QA alerts started failing on every execution. Then other concurrents (with no pattern I could pinpoint) started to fail with this error as well - but not all the time.
I opened two separate SRs to handle those issues, and over time some actions were taken to improve the situation; eventually a patch that was supposed to fix the problem for the concurrents pretty much fixed the problem for the QA alerts. In the last half year or so those errors have become really rare.
But then, yesterday, it hit again. During the testing session before upgrading production to a 10g database, QA alerts started to fail with this error (though they sometimes did their job); eventually the errors settled into hitting only specific transactions. Getting desperate, I got to the point of telling my boss that I believed the best course of action was to perform the upgrade anyway. I had several arguments:
1. This bug might skip production.
2. We have a project that will replace the QA methodology with a new one in which this specific action won't be performed.
3. I've known this bug long enough to know it won't be solved in a reasonable time frame.
4. Schedule considerations.

But something was different this time: the pattern was too consistent. Usually this bug is unpredictable, and it had actually never been reproduced on a dev environment. Indeed, a few minutes after my conversation with the boss, one of the team members found the problem - total voodoo crap if you ask me.
The INSERT action this specific alert performs had a section that read "-1*&some_parameter"; changing it to "-&some_parameter" solved the issue.
OK, let's say the new and shiny 10g doesn't know how to digest a -1*x expression - but the alert doesn't fail when it's the only one triggered as part of a transaction (I don't really know the technicalities of this whole process, so I hope my explanations are clear enough); it only fails when other alerts are triggered as well.
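Just to make the change itself concrete - the table, column and parameter names below are invented, and the real alert's INSERT is of course more involved - this is the kind of before-and-after you can play with from SQL*Plus, assuming a scratch table qa_alert_log with a numeric column:

#!/bin/sh
# Hypothetical reproduction of the expression change; every name here is a placeholder.
sqlplus -s user/password <<'EOF'
DEFINE some_parameter = 42
-- the form that kept failing in our environment after the 10g upgrade:
-- INSERT INTO qa_alert_log (adjusted_qty) VALUES (-1*&some_parameter);
-- the rewritten form that made the error go away:
INSERT INTO qa_alert_log (adjusted_qty) VALUES (-&some_parameter);
ROLLBACK;
EXIT
EOF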

Anyway, I guess that's another score for me in the endless struggle against this stubborn bug. On the other hand, I'm pretty embarrassed: I did tell my boss there was no way to solve the issue...