Sysadmin Tales: June 2008

Saturday, June 28, 2008

Just Stupid

Well, I was actually going to write the second (and probably final) part of Light at the End of the Tunnel but some things are just too provoking (and I'm not at the end of the tunnel yet anyway).

My folks bought a new phone device today, probably the cheapest device in the whole store, but that's not really a justification for the piece of crap this device turned out to be.
What can be so terrible about a simple phone device? Well, I'll supply an example, and please keep in mind that this is after only ten minutes of interaction with the device. I'm actually a bit curious to discover what other flaws it has...
To the example then. My dad opens the box, connects the phone and tries to dial... nothing. Apparently someone had a "brilliant" idea: the device has a lock (see picture below) that somehow reminds me of the days when computers used to have locks - I didn't find an image of one (and didn't look too hard either). Anyway, the purpose of the lock is to block calls to certain numbers. It has three modes: dial any number, don't dial any number, and dial only numbers that don't start with a '0'.

Those three modes are represented by three colors: red, yellow and green. At least it's obvious which color represents which mode. That's right, red is "dial anywhere" and green is "forget about dialing". Does anyone else think that's extremely stupid? If you take a close look at the picture you'll also notice that there's no real way to know what position the lock is in now (yeah, it's not green), very amusing. What adds to the amusement, considering that the phone didn't come "red" by default, is that the Hebrew version of the user's guide doesn't have a single word about the lock.
And if that wasn't enough, I have another example. The device has a speaker, so you'd probably expect to be able to dial a number, press the speaker button and have a conversation rolling. Not with this one. Once you press the speaker button the number you typed disappears and you're welcome to type in a new one, so you have to press the speaker button before dialing or use some commando technique such as pressing the redial button after the number you typed has dissolved - that's a bit weird since you didn't really "dial" that number before, oh well.

A home phone device is supposed to be as simple as possible to use and trivial for everyone to understand. This one has definitely failed that test. Just Stupid.

Tuesday, June 24, 2008

Alerts

Last weekend I woke in the middle of the night to discover my arm totally numb, and when I say "totally" I really mean it - I could neither feel it nor move it.
I don't know if I'm some kind of freak or maybe it happens to everyone occasionally, but it's definitely not the first time it has happened to me. What I usually do when it happens is grab my "dead" arm with my other arm and throw it off the side of the bed - I do it to hasten the blood flow back into my arm (I have no idea if this actually has any effect though). This is actually the creepiest part of the whole process, because when I grab my numb arm with my not-numb arm it doesn't feel like I'm touching myself, and the arm simply dangles like a dead piece of meat until blood flow is restored.
Up until now the arm has revived each time, but I always wonder what would've happened had I not woken up. I also wonder if what woke me is some alerting (hence the title) mechanism of my body, and since I don't tend to wake up in the middle of the night this is probably the case.

What does this short tale have to do with system administration (except happening to a sysadmin)? Well, nothing really. But it did get me thinking that I should write a post about the different monitoring and administration tools I use to detect and analyze bad things in my system. Of course, I still care more about my own arm - can't type commands too fast with just one arm :).

Friday, June 13, 2008

Information Boom

I'll take another break from my upgrade aftermath tales to write about an issue I think many sysadmins face, and not only sysadmins. Handling information.

It seems that since I started this job I'm in a constant race to beat the accumulation of information, and I never seem to be on top.
When I started my training I had a notebook in which I wrote all my remarks. Since at the beginning you write down every minor note (although later it all seems trivial), I soon started a second notebook. But notebooks aren't so efficient, so after starting the job for real I started using OneNote. WOW! I can create plenty of pages in a single tab, this will definitely do the trick! But soon I needed more tabs and then more OneNote notebooks too.
To store documents my team (a team of sysadmins) uses a SharePoint portal. We used to have a folder hierarchy for our documents, but at some point it just got too messy, so we decided to throw everything in the root folder (chaos indeed) and tag it. Then, of course, we created views and filters for the different tags.
We also have a Wiki to store and share important knowledge.

But even with all those mechanisms set in place it seems there's just too much information to handle:
I have my daily tasks to manage. Every new topic I deal with (regardless of how minor) is a page or a tab in my OneNote. A major upgrade or installation consists of my own notes (what I did, what errors I got), official installation/upgrade notes, known-bugs documents, new-features documents etc. Every problem has a related Metalink note or my own document describing how to fix it. And besides the EBS I'm also in charge of an OID server and an iAS server.
That's just a partial list and nothing in it can be skipped.
Both external (e.g. supplied by the vendor) and internal (written by the sysadmin) documentation is critical to every sysadmin - you can't memorize every parameter and function in your system and you can't expect to remember every error you handled three years ago.
But the problem starts when you need to retrieve this information. If you don't organize your information wisely it doesn't matter that you have perfect documentation of how to fix a bug, because you just won't be able to find it. Too bad. I can think of many times when I was sure I had seen something before and had to start guessing where exactly - my OneNote? The portal? Maybe in my documents folder from before I started using the portal? Maybe I never actually downloaded it from Metalink? Or maybe someone else showed it to me? Trust me, this can get pretty frustrating.
Having a good logic for storing your information is crucial but from my experience it's just not enough as information tends to outgrow any such solution.
As I see it, the only real solution is a powerful search tool that has access to all the storage solutions you use.
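
To make the idea a bit more concrete, here is a minimal sketch of searching several stores through one index. Everything in it is an assumption for illustration: the store names, the document titles and contents, and the helper functions `build_index` and `search` are hypothetical, and a real enterprise search product would do far more (crawling, permissions, ranking).

```python
from collections import defaultdict

def build_index(stores):
    """Map each lowercase word to the set of (store, title) pairs containing it."""
    index = defaultdict(set)
    for store_name, documents in stores.items():
        for title, text in documents.items():
            for word in (title + " " + text).lower().split():
                index[word.strip(".,:;()")].add((store_name, title))
    return index

def search(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)

# Hypothetical contents standing in for the real stores.
stores = {
    "onenote": {"Upgrade notes": "Concurrent manager stuck after upgrade"},
    "portal": {"Known bugs": "List of known bugs for the upgrade"},
    "wiki": {"Backup procedure": "Nightly backup steps"},
}

index = build_index(stores)
print(sorted(search(index, "upgrade")))
# [('onenote', 'Upgrade notes'), ('portal', 'Known bugs')]
```

The point of the sketch is only that one query returns hits together with the store they live in, so the "where did I see this?" guessing game goes away.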

I'm actually going to explore this field of enterprise search solutions so I might have more insight in following posts.

Friday, June 6, 2008

Light at the End of the Tunnel, Part I

This post should actually be titled "Upgrading Part IV", but my Roman math is not so good - I admit checking if I should write IV or VI - so I decided to stick to the simple numbers.

After bitching about it a lot, it's time I tell you how my inconsistent problem got solved.
This one was too much for me to solve on my own so I had an Oracle expert come over. After I described the issue I was facing, he said we should set up a trace on all client connections from the 8.0.6 home - we had to do this for all connections since we had no idea which concurrent would get stuck. It took some time before we realized that our pool of trace files was too small and that the trace files we were looking at no longer belonged to a stuck concurrent. After reconfiguring the trace settings we finally found a trace file that belonged to a stuck concurrent.
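
Why a small pool loses the interesting trace can be shown with a toy model. This is only a sketch under an assumption the post does not spell out, namely that the pool of trace files is reused round-robin, with each new connection overwriting the oldest file; the function name and the numbers are made up for illustration.

```python
from collections import deque

def surviving_traces(connection_ids, pool_size):
    """Return the connection ids whose trace files still exist.

    Models a fixed pool of trace files reused cyclically: each new
    connection overwrites the oldest file once the pool is full.
    """
    pool = deque(maxlen=pool_size)  # oldest entry is dropped automatically
    for conn_id in connection_ids:
        pool.append(conn_id)
    return list(pool)

# Hypothetical run: connection 3 is the one that gets stuck,
# and seven more connections are opened before anyone looks.
connections = list(range(1, 11))
stuck = 3

small_pool = surviving_traces(connections, pool_size=4)
large_pool = surviving_traces(connections, pool_size=16)

print(stuck in small_pool)  # False
print(stuck in large_pool)  # True
```

With the small pool the stuck connection's trace has already been overwritten by the time you go looking, which matches what we saw before enlarging the pool.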
Boy! The guy has patience. As someone on my team said, a good expert can be measured by the patience he demonstrates going over log and trace files.
Going over the entire trace file we discovered that at some point during connection establishment the process just stops. We could see that the client was waiting for some input and... nothing. Although it felt like we were closing in on the problem, it seemed that a long and tedious process through a TAR would be needed, since however good the expert I had on site is, it's just not his specialty, and it definitely looked like some internal bug.
Before leaving, the expert noticed I had NTS defined for SQLNET.AUTHENTICATION_SERVICES in my sqlnet.ora file. I explained that I was using it to log on as sysdba without entering a password. He suggested I try removing it, so I did.
At this point we were towards the end of the day so I didn't expect any concurrents to get stuck anyway. But the next day the problem had disappeared. I was thrilled, it seems the NTS setting was the problem all along! Thinking about it, there's no chance I would've thought of it as the root of all evil. I think the only reason the expert noticed it is that he's not used to dealing with installations on Windows. It seems to me that even he didn't believe it would solve anything and suggested it just as a shot in the dark.
Full disclosure: the expert suggested some other changes as well. Those were changes to DB parameters, as specified in an Oracle document regarding DB parameters for EBS installations.
Those changes also took effect only the next day (the day everything worked), so I might be wrong and what really fixed the problem was altering those parameters. Since I'm curious and somewhat suicidal I'll probably put the NTS setting back and check whether the problem recurs. I promise to let you know if my intuition here is wrong.
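
For anyone who wants to check their own sqlnet.ora for the same setting, here is a small sketch. The sample file contents are hypothetical, the helper `authentication_services` is my own name, and the parsing assumes the parameter sits on a single line, which is a simplification of the real file format.

```python
import re

def authentication_services(sqlnet_text):
    """Return the services listed in SQLNET.AUTHENTICATION_SERVICES.

    Returns an empty list when the parameter is absent or commented out.
    Assumes the parameter and its value are on one line.
    """
    for line in sqlnet_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        match = re.match(
            r"SQLNET\.AUTHENTICATION_SERVICES\s*=\s*\(?([^)]*)\)?",
            line,
            re.IGNORECASE,
        )
        if match:
            return [s.strip().upper() for s in match.group(1).split(",") if s.strip()]
    return []

# Hypothetical sqlnet.ora contents, matching the setting described above.
sample = """\
# hypothetical sqlnet.ora contents
SQLNET.AUTHENTICATION_SERVICES = (NTS)
"""

print(authentication_services(sample))  # ['NTS']
```

If the list contains NTS, that is the setting I removed; whether removing it is right for your installation is a separate question, since it changes how you can log on as sysdba.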

The funny thing is that just after discovering this issue with NTS, I found another one, working on an entirely different thing. Try googling for "NTS sqlnet.ora".

Well, one problem down, one more to go. I'll probably tell you about it in my next post, although it's still not solved.
