Monday, July 26, 2010

History

Only last week I completed an upgrade I'd been working on for a long, long time - working on a low-priority project while leading a team doesn't make for agile upgrading. At least I've learned a few things since back then.

Interestingly enough, the Apache .dll error didn't reproduce on any other development environment - up until the moment of truth, the production upgrade. Naturally I had no recollection of the solution; I did remember running into this error, and had documented it in my notes, but not its solution - shame on me. Luckily, this time the error was documented in Metalink and the solution was much simpler: a missing Windows .dll (msvcr70.dll, if I'm not mistaken).
But the more interesting issue is the one that prevented the OPMN startall CA (configuration assistant) from completing...

After opening an SR (and a lot of debugging) I discovered the issue was with "LDAP binding" on the SSL port taking a long, long time - in other words, heavy load on that port. I was advised to start the oidldapd process with more dispatchers (I later learned that adding more workers worked better for me). That did solve the issue, but it's not what baffles me about this case.
The first OID server I upgraded didn't suffer from this issue. I used to attribute that to the fact that it was the only server that wasn't part of an OID cluster - although, given the nature of the upgrade process, that always seemed like a lame excuse.
The thing is... the first production server I upgraded was OK as well. There's now only one thing I can think of that both servers have in common: neither was installed by me. My memory fails me again here, since there might be an additional server I didn't install that did have the problem, but there's a good chance it was installed in a slightly different way - a way I don't think should affect oidldapd's function, but a different way nonetheless.
The thought that drives me crazy right now is what different installation steps my predecessors could have taken that matter so much; all I did was follow their (evolved-over-time) documentation. Or maybe they did perform some secret configuration steps they left no trace of???

Saturday, July 17, 2010

Temporary

One of the things I've learned as a sysadmin is that there's no such thing as a "temporary solution". If you're presented with one, or have thought one up yourself, first ask yourself how you'd feel about this solution becoming permanent - because it's here to stay.
Reasons for temporary solutions becoming permanent are numerous, but I think most of them come down to the (reasonable) claim that it's better to direct effort at something that needs fixing than at something that already works. Many times the burden of maintaining such a temporary solution falls on the sysadmin, so no one else sees how bad it really is, and sometimes it's just more efficient (in the immediate time frame) to accept this burden than to solve the issue once and for all.

One example I always fall back on when arguing about a temporary solution is the email-sending mechanism we use in our Oracle Applications system:
Our basic solution for sending email is based on an EBS alert that monitors a table we insert data into whenever we want the system to send an email to someone. An important detail: under the covers, an EBS alert is simply a database trigger wrapped with a check that the table was updated by APPS - the EBS administrative user.
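Just to make that distinction concrete, here's a rough sketch of how such an alert-style trigger might look. The table, trigger and package names below are made up for illustration (they're not our actual objects); the only point is the USER = 'APPS' test:

-- Hypothetical sketch of an alert-style trigger on an email queue table.
-- xx_mail_requests and xx_mail_pkg are invented names, not real objects.
CREATE OR REPLACE TRIGGER xx_mail_requests_ai
AFTER INSERT ON xx_mail_requests
FOR EACH ROW
BEGIN
  -- The alert effectively fires only when the DML was done as APPS,
  -- so a session connected as any other user (an ApEx session, say)
  -- can insert rows all day long without any mail going out.
  IF USER = 'APPS' THEN
    xx_mail_pkg.queue_email(:NEW.recipient, :NEW.subject, :NEW.body);
  END IF;
END;
/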
Now, several years ago we started using ApEx as a platform for developing applications on top of our EBS. The catch: ApEx applications don't use the APPS user, so they can't send email the way we used to - at least not without modifying our mechanism. "Luckily", ApEx has its own email-sending mechanism, so we decided to use it as a temporary solution. That was about three years ago.
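For those who haven't used it, the ApEx mechanism I'm referring to is the built-in mail API - APEX_MAIL (HTMLDB_MAIL in older releases). A minimal, hypothetical call looks roughly like this; the addresses and text are placeholders, and older versions may require passing the SMTP host and port to PUSH_QUEUE explicitly:

BEGIN
  APEX_MAIL.SEND(
    p_to   => 'someone@example.com',  -- placeholder address
    p_from => 'ebs@example.com',      -- placeholder address
    p_subj => 'Some notification',
    p_body => 'Body text goes here');
  -- SEND only queues the message; PUSH_QUEUE asks ApEx to deliver it now.
  APEX_MAIL.PUSH_QUEUE;
END;
/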
At this point you could probably ask me "well, who cares?", and I'd probably answer "well, I do". I do, because when we upgraded our mail server to Exchange 2007, it didn't really like the Hebrew encoding ApEx was using and decided not to display it properly. So after a day or two in which emails simply weren't sent from the affected ApEx applications, we had to devise a workaround that I'm not sure anyone will remember when the Exchange server is upgraded to 2010.
It's a shame, because the real solution to this issue was designed a long time ago, and the ApEx mechanism even causes application pages to hang while mail is being sent - but still, why fix something that isn't broken?

Beware of temporary solutions.