Wednesday, April 9, 2008

The Persistent Bug

My system has a bug that was around long before I became the sysadmin. It reincarnates each time in a slightly different form, but for me it's the same old bug.
I call it "the -107 bug" because its main symptom is an error message in the concurrent log that specifies a long number: -107... (I don't really remember all the digits, there are lots of them). It looks to me like a really big integer wrapping around to the negative side or something like that, but I don't really know.

I was first introduced to this bug after installing some major patch (maybe ATG_PF.H), when QA alerts started failing on every execution. Then other concurrents (with no pattern I could pinpoint) started to fail with this error as well - but not all the time.
I opened two separate SRs to handle those issues, and over time some actions were taken to improve the situation. Eventually a patch that was supposed to fix the problem for the concurrents pretty much fixed the problem for the QA alerts too. In the last half year or so those errors became really rare.
But then, yesterday, it hit again. During the testing session before upgrading production to a 10g database, QA alerts started to fail with this error (although sometimes they did their job). Eventually the errors settled into a pattern and hit only specific transactions. Getting desperate, I came to the point of telling my boss that I believed the best plan was to perform the upgrade anyway. I had several arguments:
1. The bug might not show up in production.
2. We have a project that will move QA to a new methodology in which this specific action won't be performed.
3. I have known this bug long enough to know that it won't be solved in a reasonable time frame.
4. Schedule considerations.

But something was different this time: the pattern was too consistent. Usually this bug is unpredictable, and it was actually never reproduced on a dev environment. Indeed, a few minutes after my conversation with the boss, one of the team members found the problem - total voodoo crap if you ask me.
The INSERT action this specific alert performs had a section that read "-1*&some_parameter". Changing it to "-&some_parameter" solved the issue.
OK, let's say the new and shiny 10g doesn't know how to digest a -1*x expression. But the alert doesn't fail when it's the only one triggered as part of a transaction (I don't really know the technicalities of this whole process, so I hope my explanations are clear enough); it only fails when other alerts are triggered as well.
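To make the change concrete, here is a minimal Python sketch of the two forms of the expression. It assumes the &some_parameter token is replaced by plain text substitution before the statement runs; the substitute helper and the value 42 are made up for illustration and are not the actual alert mechanism. It only shows that the two forms should be arithmetically equivalent, which is why the fix looks like voodoo; it says nothing about why the database choked on one of them.

```python
import re


def substitute(template, params):
    """Replace each &name token with its value as plain text.

    Hypothetical helper: a stand-in for whatever really fills in
    the parameter before the INSERT is executed.
    """
    def repl(match):
        return params[match.group(1)]

    return re.sub(r"&(\w+)", repl, template)


# 42 is an invented sample value for the parameter.
params = {"some_parameter": "42"}

old_expr = substitute("-1*&some_parameter", params)  # the failing form
new_expr = substitute("-&some_parameter", params)    # the form that worked

print(old_expr)
print(new_expr)
```

Both strings evaluate to the same number for a positive value, so on paper nothing should have changed.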

Anyway, I guess that's another point for me in the endless struggle against this stubborn bug. On the other hand, I'm pretty embarrassed: I did tell my boss there was no way to solve the issue...
