In my previous post I talked about a problem I experienced before upgrading my EBS database to 10g. Well, the upgrade has been performed, and I think it has supplied me with ideas for at least three posts. Of course, the bug experienced on the test environment didn't show itself in production (I intentionally didn't apply the fix, to see if it would re-occur), and what I have now is a very "nice" variety of other bugs. But why rush to the aftermath? Let's start with the upgrade process itself.
I had the whole upgrade procedure in soft and hard copies, had done about four upgrades on dev and test environments, and was pretty sure I would successfully handle any problem that popped up. Actually, many steps of the upgrade process came down to a copy-paste from my upgrade document into the command line. It's so frustrating that what eventually delayed me by about five(!) hours was a totally stupid mistake on my part.
Of course, before such an upgrade a backup should be taken. I have a script that runs every night: it brings down the system, a backup is taken (we use EMC's BCV) and then the system is brought up again. All I had to do was re-schedule this script. Being a smart-ass, I decided to save some time and disable the startup script, since I only needed the database up for the upgrade process. I could think of plenty of ways to do this, but decided that emptying the startup script would be the safest - this way I wouldn't be touching the main backup script. So I made a copy of the startup script and... emptied the shutdown script. Yes, you read that right, no typo.
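For context, the nightly job is basically a wrapper around three steps. Here is a minimal sketch of that structure (illustrative only - the script names and paths below are made up, not the real ones):

    #!/usr/bin/env python
    # Rough sketch of the nightly backup wrapper. The script names and the
    # BCV step are hypothetical placeholders, not the actual environment.
    import subprocess

    def run(cmd):
        # Fail loudly so a broken step doesn't silently produce a bad backup.
        subprocess.check_call(cmd, shell=True)

    def nightly_backup():
        run("/backup/scripts/shutdown_system.sh")   # bring apps and database down
        run("/backup/scripts/split_bcv.sh")         # split the BCV mirrors - the actual backup
        run("/backup/scripts/startup_system.sh")    # bring everything back up

    if __name__ == "__main__":
        nightly_backup()

Emptying the startup script would simply have left the system down after the split, which was the plan. Emptying the shutdown script instead meant the split ran against an open, active database.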
So the time comes, I'm watching the backup logs and WOW! that was fast: usually it takes about five minutes for the shutdown process to complete, and now the system is up after one minute. Great! Hmmm... maybe not so great after all. I take a closer look at the logs and discover what I have just done: the BCV split was performed on an open database, so it's worth nothing (well, maybe not nothing, but starting the upgrade with that kind of backup is a bad idea). Oh, and I'm going to use some EMC terms here, but that's OK since I myself discovered many of them only during this upgrade.
Well, actually there's no reason to panic: the backup devices are mostly synchronized, and all I have to do is shut down the database and take the backup again. I called the guy in charge of storage and asked him if I was right; he said I was, and guided me through the process of taking the backup from the command line. But alas, it's not that simple!
Apparently, when EMC tells you that a split is complete, well, it's lying. The split process (again, if you have no idea what I'm talking about, just imagine some backup process) continues in the background, and until it has finished you can't take another backup. The more frustrating part is that the storage guy didn't know about this (he's new, cut him some slack), and I didn't know how to check the progress of this background split. At some point - when only one disk was still splitting and the others were already re-synchronizing - I discovered how to check the progress of the split. Great, I'm at 99%, but boy, that last 1% took its share of time to complete.
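In hindsight, the safe way to script a retry is to poll the split status and only take the next backup once the background split really reports 100%. Something along these lines (a sketch only - check_split_progress() is a placeholder for whatever your storage CLI actually reports, not a real EMC command):

    import time

    def check_split_progress():
        # Placeholder: return the percent complete of the background BCV split.
        # In reality you would parse the output of your storage vendor's CLI here.
        raise NotImplementedError

    def wait_for_split(poll_seconds=60):
        # Don't trust the initial "split complete" message; wait until the
        # background split really reaches 100% before taking another backup.
        while True:
            pct = check_split_progress()
            print("background split at %d%%" % pct)
            if pct >= 100:
                return
            time.sleep(poll_seconds)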
Phew! At least now I know where I'm standing. Well, not exactly: EMC has another "great" feature - when it comes to synchronizing the last parts of the disks, it takes a totally disproportionate amount of time (remember that 1%), so when there were only 50MB left to synchronize the machine worked forever on every 0.1MB, no kidding. So I didn't really have any estimate of when I would finally start the upgrade process, and I was already about 3.5 hours late - and this is downtime we're talking about, something we usually like to minimize.
So I waited. I guess this is what they call "paying for your mistakes".
Eventually the synchronization completed and I got to the upgrade process. The upgrade itself was perfect: no unexpected errors, and the entire process was completed (long) before the scheduled time - thank God for buffers.
Well, a nice prelude indeed...