Friday, December 31, 2010

Multiple Points of Failure

I wrote in an earlier post about how it's sometimes best to rely on well-known technologies rather than on carefully tailored solutions. Last week we had a perfect example of this, this time from the hardware domain.

So what happened?
One of our systems crashed and wouldn't boot, probably because of a major OS failure, since the mirrored disc seemed to be corrupted as well. Our only guess at the underlying cause is a lack of space on the C: drive, but we have no way of knowing for sure post-mortem.
The system was all on this single disc, so we had to use the backup. Bad news: apparently the last backup was from about a month ago. By recovering data from the corrupted discs we managed to narrow the damage to two weeks of lost data, but that's not the point. Losing data is something that should never happen, and we try really hard to prevent this kind of thing.

So where did we go wrong? Well, most of our systems use EMC for storage, but not this one. Why?
Usually when we configure a new system we ask ourselves two main questions:
1. Is the system gonna need an SRDF/CE configuration?
2. Is the storage the system requires more than what comes with the target server's disk drives?
If the answer to either question is YES, we use EMC; otherwise we usually use it only as a location for storing the system's backup. In our case it was the 'otherwise' option.
The problem is that when we use EMC for storage, we also use EMC's RM to take backups, and we have an existing system of emails that lets us know whether each backup succeeded.
In our case, not only was the data on a local disk rather than on the EMC, but backup monitoring didn't exist either. Of course it's our bad, but that's exactly what I was talking about when I said that it's better to rely on something you're well familiar with.
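To make the missing piece concrete, here is a minimal sketch of the kind of freshness check that could flag a stale backup on a system that doesn't go through RM. It is not our actual monitoring: the function names, the idea of one directory of backup files, and the age threshold are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: Path, now: datetime) -> Optional[timedelta]:
    """Return the age of the most recently modified file in backup_dir.

    Returns None when the directory holds no files at all.
    (Hypothetical helper: assumes backups land as plain files in one directory.)
    """
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - datetime.fromtimestamp(max(mtimes))


def backup_is_stale(backup_dir: Path, max_age: timedelta,
                    now: Optional[datetime] = None) -> bool:
    """True when there is no backup at all or the newest one is older than max_age."""
    now = now or datetime.now()
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age
```

Something like `backup_is_stale(Path(some_backup_dir), timedelta(days=1))` could then run on a schedule and feed whatever alerting is already in place, so that a silent month without backups gets noticed on day two rather than after the crash.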

I guess one of the conclusions is to try and use EMC for every system no matter how low the high availability requirements are and how little disk space is required.

Saturday, September 11, 2010

Exploring

One of the most basic things one does with his OS is managing files. This task becomes critical if you have to access multiple file locations - I know it's critical for me, since I manage files on many different servers and in several local locations during a typical workday.
The basic requirements for a file managing utility are pretty much, well, ehh... basic: just give me something to comfortably view and transfer files. There are of course a lot of nice additional features: packing/unpacking, good search capabilities, convenient selecting, etc., but for a start just give me something I can work with.

Back to my first sentence: managing files should be basic and, well, in my humble opinion, not that complicated. So how come Microsoft still has the crippled Windows Explorer?!
Really, unless I'm missing some patent the whole world has, I don't understand how come Explorer doesn't implement the most basic feature even the old Norton Commander had - two panes with which you can copy files between locations or easily see the disk's information. Without such a basic ability I don't know if there's really a need to mention other drawbacks (however, I will) like:
  • The totally broken search-within-documents feature in XP\2003 (I didn't test it on Windows 7 because I kinda lost hope). I mean, OK, let the search be slow, inefficient, whatever, but how can it miss a simple phrase in a .txt file?
  • How do I select all *.foo files in a folder? I'd be very happy to hear a simple answer to that as I really don't know how this is done, and it seems such a basic task...
  • And, if that's even possible, the Windows 7\2008 interface is even less user(\me) friendly; I didn't really understand what they were trying to do.
  • Well, you got the point by now.
As for me, I worked with FAR Manager for many years, but today my weapon of choice is Total Commander. Obviously it's not perfect, but it does the job and does it well. One of many features it has (and Explorer doesn't) is the tabbed interface. I won't say it's genius, because I think it's simply a trivial requirement of anything that wasn't written 5 years ago.

The thing that amazes me most is that most people still use the good(?) old Explorer, so maybe I'm just hard to please...

Saturday, August 21, 2010

Performance

Every sysadmin has to deal with performance issues once in a while.
As "the sysadmin of sysadmins" I deal a lot with allocating resources to the VMs that the developers of the different systems use. This is not a simple task, since I have limited physical hardware and about 100 virtual development servers to manage, and of course everyone wants his VMs to be as strong as needed. Or, more accurately, as desired.

The thing is that from my experience, the strength you need for a development server is usually far below official specs since you don't experience the same load on development environments and you don't really need everything to be as smooth as with a production environment.
Of course we do experience more serious performance issues on development environments - especially with new systems we are not fully familiar with yet, but it seems in most cases those issues are not solved by pimping up the server.
I've found myself more than once trying to explain that there's a "real" performance issue and that someone (depending on the case) has to take a really good look at the code, the design or whatever and find the problem, because it's not going to be solved by adding cores or RAM, or by moving to a physical server. In those cases I usually have to supply proof that the RAM and CPU are sufficient, and even then I have to deal with claims like "but it's a VM, it might be lying".

It's really like trying to solve an NP-Complete problem by brute force, just adding more computational power - this is not a real solution if your input is not bounded, and things are never really this simple...

Monday, July 26, 2010

History

Only last week I completed an upgrade I'd been working on for a long, long time - working on a low priority project while leading a team doesn't contribute to agile upgrading. At least I've learned some things since back then.

Interestingly enough, the Apache .dll error didn't reproduce on other development environments. Up until the moment of truth - the production upgrade. Obviously I had no recollection of the solution. I did remember running into this error and had it documented in my notes, but not the solution - shame on me. Luckily, this time the error was documented in Metalink and the solution was much simpler: there was a missing Windows .dll (msvcr70.dll if I'm not mistaken).
But the more interesting issue is the one that prevented OPMN startall CA from completing...

After opening an SR (and a lot of debugging) I discovered the issue was with "ldapbinding" on the SSL port taking a long, long time - in other words, some heavy load on this port. I was advised to start the oidldapd process with more dispatchers (I later learned that adding more workers worked better for me). That did solve the issue, but this is not what baffles me about this case.
The first OID server I upgraded didn't suffer from this issue. I used to attribute that to the fact that it was the only server that wasn't part of an OID cluster - although because of the nature of the upgrade process it always seemed like a lame excuse.
The thing is... the first production server I upgraded was OK as well. There's now only one thing I can think of that is common to both servers: neither of them was installed by me. My memory fails me again here, since there might be an additional server I didn't install that did have the problem, but there's a good chance it was installed in a slightly different way - a way I don't think should affect oidldapd's function, but still a different way.
The thought that drives me crazy right now is what kind of different installation steps my predecessors could have taken that matter so much. All I did was follow their (evolved over time) documentation. Or maybe they did do some secret configuration steps they left no trace of???

Saturday, July 17, 2010

Temporary

One of the things I've learned being a sysadmin is that there's no such thing as a "temporary solution". If you're presented with one, or thought of one yourself, first ask yourself how you feel about this solution becoming permanent, because it's here to stay.
Reasons for temporary solutions becoming permanent are numerous, but I think most of them come down to the (reasonable) claim that it's better to turn efforts to something that needs fixing than to something that already works. Many times the burden of maintaining such a temporary solution falls on the sysadmin, so no one else sees how bad it really is, and sometimes it's just more efficient (in the immediate time frame) to accept this burden than to solve the issue once and for all.

One example I always fall back to when I have an argument about a temporary solution is the example of the email sending mechanism we use in our Oracle Applications system:
Our basic solution for sending email is based on an EBS alert that monitors a table we insert data into when we want the system to send an email to someone. An important detail: an EBS alert is simply a database trigger that is wrapped with a test that the table was updated by APPS - the EBS administrative user.
Now, several years ago we started using ApEx as a platform for developing applications over our EBS. The catch - ApEx applications don't use the APPS user, so they can't send email the way we used to, at least not without modifying our mechanism. "Luckily", ApEx has its own email sending mechanism, so we decided to use it as a temporary solution. That was about three years ago.
At this point you could probably ask me "well, who cares?", and I'd probably answer "well, I do". I do, because when we upgraded our mail server to Exchange 2007 it didn't really like the Hebrew encoding ApEx was using and decided not to display it properly. So after a day or two in which emails just weren't sent from the affected ApEx applications, we had to devise a solution I'm not sure anyone will know about when the Exchange server is upgraded to 2010.
It's a shame, because the solution to this issue was designed a long time ago, and the ApEx mechanism also causes application pages to hang while mail is being sent. But still - why fix something that isn't broken?

Beware of temporary solutions.

Tuesday, June 15, 2010

Copy-Paste

I can't think of many features that are both absolutely necessary and the biggest threat to the world as we know it (well, almost) at the same time as the ability to copy and paste text at the blink of a Ctrl+C\V.
Retyping (or rewriting) a text over and over again is something computers excel at but we humans can't stand. I guess that's the origin of punishments where you have to write a sentence one hundred times or copy sections from a book.

But when I think about it, if you work with computers and you're using copy-paste too frequently you're probably doing something wrong. A few examples:
1. The most trivial example: if you're coding and you've just pasted a code snippet you copied from somewhere else in your code, you probably should write a method that implements this snippet. Even if you've copied someone else's code (from the internet for example), I think you'll have a better understanding of what the code does by retyping it, and there's less chance you'll miss any adjustments you should make.
2. In my line of business I'm often tempted to copy contents of configuration files and the like but as with code, there's a good chance I'll miss some necessary environment adjustments.
3. The last example is "written in blood": you might think you see the text you're going to paste but YOU ARE WRONG. I don't know if hidden characters are to blame or some other clipboard wonder, but I keep encountering commands that malfunction when pasted: I have some OID configuration commands that do this, I encountered it when trying to install Oracle SOA Suite and just today I had a Java command copied straight from Oracle's documentation (as part of installing a connectivity web service between Microsoft's SCOM and Oracle's Grid Control) that just laughed at me until I manually retyped it.
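If hidden characters really are to blame in the third example (and that is only a guess), they can at least be made visible before the command is run. The following is a minimal sketch, assuming the usual suspects are non-ASCII look-alikes (typographic dashes and quotes, non-breaking spaces) and control or format characters; `suspicious_characters` is a hypothetical helper name, not part of any tool mentioned above.

```python
import unicodedata
from typing import List, Tuple


def suspicious_characters(text: str) -> List[Tuple[int, str, str]]:
    """List characters in text that a shell or CLI is likely to choke on.

    Flags anything outside printable ASCII plus every control/format
    character except newline and tab. Each finding is a tuple of
    (position, code point, Unicode name).
    """
    findings = []
    for index, char in enumerate(text):
        if char in "\n\t":
            continue  # ordinary whitespace, not worth reporting
        category = unicodedata.category(char)
        if ord(char) > 126 or category.startswith("C"):
            name = unicodedata.name(char, "UNNAMED")
            findings.append((index, f"U+{ord(char):04X}", name))
    return findings
```

Running a pasted command through this before executing it shows, for instance, whether the `-` in front of an option is a plain hyphen or an en dash that a documentation page substituted for it.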

So me, I usually retype EVERYTHING; this way I feel more in control of what I'm doing.

Tuesday, May 18, 2010

Spooling

One of the things I thought was the coolest when I was just starting to learn SQL was the idea of using SQL queries to generate scripts. Sometimes you use this ability in some kind of a GUI and sometimes you spool the output into a file which you later run as a script (SQL, cmd, etc.). When you spool into a script file you often need control over the output formatting - removing "x rows selected" lines and the like - so that your script executes as intended. To achieve this there are many SQL*Plus commands like:
set feedback on\off
set heading on\off
set echo on\off
...
But it seems there's a catch. I wasn't able to determine whether this is the expected behavior or just some weird bug, but a while ago we had a problem with an SQL-generated script, and when we tried to debug it, it seemed like none of the above commands had any effect!
Unfortunately, I was unable to find again the site that helped us solve the case, but apparently, to function, those commands need to be run as part of a script and not from the SQL*Plus prompt.
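Going by that observation, one way to stay on the safe side is to never type the formatting commands at the prompt and always put them in a script file next to the query. Below is a minimal sketch in Python that assembles such a file; `build_spool_script`, the example query and the file names are assumptions for illustration only, and the script uses just the commands listed above plus `spool` and `exit`.

```python
def build_spool_script(query: str, spool_path: str) -> str:
    """Build the text of a SQL*Plus script that spools query output to a file.

    The formatting commands are placed inside the script itself, since they
    reportedly had no effect when typed at the interactive prompt.
    """
    # Make sure the query ends with exactly one terminating semicolon.
    statement = query.rstrip().rstrip(";") + ";"
    lines = [
        "set feedback off",   # no "x rows selected" line
        "set heading off",    # no column headings
        "set echo off",       # do not echo the commands themselves
        f"spool {spool_path}",
        statement,
        "spool off",
        "exit",
    ]
    return "\n".join(lines) + "\n"
```

The returned text would be saved to a file, say `generate.sql`, and run with something like `sqlplus -s user/password @generate.sql`; the spooled file is then the generated script.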

Saturday, April 10, 2010

Good Problems

Most of my posts have something to do with complaining about different products and the ridiculous bugs they have, and anyway, who loves problems? Every sysadmin would like to come to the office in the morning, drink his cup of coffee (or tea, in my case) and go spend the rest of the day at the beach, knowing everything will work perfectly and his presence won't be needed.
But problems, as strange as it sounds, have an upside as well.

First of all, if things were so simple, people like me would be unemployed. We'd still be able to go to the beach, but we wouldn't be getting paid for it. But even without fantasizing about a perfect world, problems are important. It might sound like I'm exaggerating a bit, but I really think problems make you a better sysadmin.
Come to think about it, problems are to a sysadmin what fights are to a relationship. No one really wakes up in the morning and thinks to himself "Oh, I'd like a good argument today", but couples do fight, and resolving those arguments usually strengthens the bond (and, as with systems, too much fighting is probably a bad sign). And like those arguments, each problem you tackle makes you understand your system a bit better, an understanding that is hard to achieve when everything just works.
The ERP system I used to administer is a great example. ERPs tend to be complex systems and ours is no different: we have many utility scripts that accompany the system. Sure, I knew there were some scripts responsible for printing, but until I encountered some problems with them I didn't really get to understand each and every line of code in those scripts. And on the other hand, I wrote some scripts for an interface a few years ago, and those scripts have worked almost perfectly (I'm not necessarily objective here) up until now, so I'm pretty sure my replacement doesn't really know what's going on inside.

So maybe I should start giving more respect and appreciation to the problems I encounter (and complain less).

Monday, March 15, 2010

Pessimist

Generally, I think in most cases we all should be a bit pessimistic. Being objective is hard and personally I'd rather have a nice surprise than a great disappointment. But that's arguable - I guess optimistic people live longer - but then again, the benefit of long life is arguable as well.

But what I think is not really arguable is that in my line of work you have to be pessimistic. A major part of being a sysadmin is keeping the system stable, and you can't really do this by being optimistic. To make the system stable you try to think about every scenario, as improbable (and pessimistic) as it might be. One good example I can think of is that having a daily backup is very important in our organization (and I really do hope we're not the only ones), yet for most systems we've never needed those backups.

I might be taking this a bit too far when I berate team members who behave "too optimistically" and tend to answer "I'm getting paid for it" when asked why I'm so pessimistic, but I'm really just trying to make my point.

Monday, February 22, 2010

Fixation

Being a sysadmin involves many contradictions; take new technologies for instance:
Recently I installed a WebLogic server to start examining it, as our Oracle AS servers will need to be replaced sometime in the next couple of years since this line of products is going bye-bye. Anyways, it was pretty exciting: a new (for me) Fusion 11g UI, and the main administration screen is SO busy. So it sounds like I'm all for new technologies... not really, not always.

Yes, you should upgrade existing technologies. Yes, you should check what new technologies are out there. No, you shouldn't use a new technology where an existing one could be used.

I think the best example for my point is building solutions for specific needs.
Let's take my organization's Oracle ERP system for example. Many times a specific need arises and people suggest all kinds of cool solutions. In my opinion it's very important to try and base such a solution on an existing technology. I'd rather have a solution that is not perfect than a perfect (like such things exist) solution that will handle this specific case alone and that no one will know about if something goes wrong a few years from now.

So, sometimes, I end up being this guy with a fixation on old and boring technologies...

And if talking about sysadmins, here is a good one :).

Saturday, January 9, 2010

Roach-Tech

When a company designs a product, there are many aspects to take into consideration. The product should be appealing to the eye, ergonomic, cheap to produce, etc. One aspect I believe most vendors don't consider is that the product should be roach unfriendly. What am I talking about?
Well, take for example Avaya's phone:
You can see that attached to its back (didn't find a better image) is a panel that gives it its angle. Apparently this panel is a great habitat (at least in the Israeli climate) for roaches and they even consider it a great place to drop eggs:

That's right, this brown thing is a roach egg.
The phone doesn't look so attractive right now, does it?