Sysadmin Tales

Friday, February 18, 2011

Wrong Conclusion

I concluded my previous post by saying that our organization should probably use EMC for every system. I was berated for this conclusion, and with good reason - I was wrong.

Yes - if stability were the only consideration, sticking with EMC would be a very good idea, but it's not.

There are also other considerations, like money - EMC (and its maintenance) is expensive, and many times you need a lot of storage for an unimportant purpose. There's also a limit on how many servers you can physically connect to the EMC.

So the right answer is, like always, somewhere in between. Trying to meet all the needs with one solution is very ambitious; there should be a few solutions available - not a single one and not dozens of them. The important thing is that when you arrive at a point where you decide to use a new solution:

You do it for a good reason.
You "take care of it" - you need to test it, monitor it and have a standard for maintaining it, just like you have with any other solution.

Hopefully this atones for my sin...

Friday, December 31, 2010

Multiple Points of Failure

I wrote in an earlier post about how sometimes it's best to rely on well-known technologies rather than on carefully tailored solutions. Last week we had just the perfect example of this concept, this time from the hardware domain.

So what happened?

One of our systems crashed and wouldn't boot, probably because of a major OS failure, since the mirrored disk seemed to be corrupted as well. The only guess we have for the underlying cause is lack of space on the C: drive, but we don't really have a way of knowing for sure post-mortem.

The system was all on this single disk, so we had to use the backup. Bad news: apparently the last backup was from about a month ago. By recovering data from the corrupted disks we managed to narrow the damage to two weeks of lost data, but that's not the point. Losing data is something that should never happen, and we try really hard to prevent that kind of thing from happening.

So where did we go wrong? Well, most of our systems use EMC for storage, but not this one. Why?

Usually when we configure a new system we ask ourselves two main questions:

1. Is the system gonna need an SRDF/CE configuration?

2. Is the storage the system requires more than what comes with the target server's disk drives?

If the answer to either question is YES, we use EMC; otherwise we usually use it only as a location for storing the system's backup. In our case it was the 'otherwise' option.

The problem is that when we use EMC for storage, we also use EMC's RM to take backups, and we have an already existing system of emails that lets us know whether the backup was successful.

In our case, not only was the data on a local disk rather than on the EMC, but monitoring of the backups didn't exist either. Of course it's our bad, but that's exactly what I was talking about when I said that it's better to rely on something you're well familiar with.
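To make the missing piece concrete, here is a minimal sketch of the kind of check that would have caught a month-old backup. It is not what we run: the function names, the one-day threshold and the idea that backups land as plain files in a single directory are all assumptions for illustration.

```python
import time
from pathlib import Path


def newest_backup_age_days(backup_dir, now=None):
    """Return the age in days of the newest file in backup_dir, or None if it has no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400.0


def backup_is_stale(backup_dir, max_age_days=1.0, now=None):
    """True when there is no backup at all or the newest one is older than max_age_days."""
    age = newest_backup_age_days(backup_dir, now)
    return age is None or age > max_age_days
```

A scheduled job could call backup_is_stale() on the (hypothetical) backup folder and send an email when it returns True, which is roughly the notification we already get for free when the data lives on the EMC.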

I guess one of the conclusions is to try and use EMC for every system no matter how low the high availability requirements are and how little disk space is required.

Saturday, September 11, 2010

Exploring

One of the most basic things one does with an OS is managing files. This task becomes critical if you have to access multiple file locations - I know it's critical for me, since I manage files on many different servers and in several local locations during a typical workday.

The basic requirements for a file managing utility are pretty much, well, ehh... basic: just give me something to comfortably view and transfer files. There are of course a lot of nice additional features: packing/unpacking, good search capabilities, convenient selecting, etc., but for a start just give me something I can work with.

Back to my first sentence: managing files should be basic and, in my humble opinion, not that complicated. So how come Microsoft still has the crippled Windows Explorer?!

Really, unless I'm missing some patent the whole world has, I don't understand how come Explorer doesn't implement the most basic feature even the old Norton Commander had - two panes with which you can copy files between locations or easily see the disk's information. Without such a basic ability I don't know if there's really a need to mention other drawbacks (however, I will) like:

The search-within-documents feature on XP\2003 (didn't test it on Windows 7 because I kinda lost hope) totally doesn't work. I mean, OK, let the search be slow, inefficient, whatever, but how can it miss a simple phrase in a .txt file?
How do I select all *.foo files in a folder? I'd be very happy to hear a simple answer to that as I really don't know how this is done, and it seems such a basic task...
And, if that's even possible, the Windows 7\2008 interface is even less user(\me) friendly - I didn't really understand what they were trying to do.
Well, you got the point by now.
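Just to show how basic the "select all *.foo files" task from the list above really is, here is a minimal sketch of it outside Explorer. The function name is made up for illustration, and it only lists the matching files; what you then do with the selection is up to you.

```python
from pathlib import Path


def select_by_pattern(folder, pattern="*.foo"):
    """Return the sorted names of the files in folder that match a wildcard pattern."""
    return sorted(p.name for p in Path(folder).glob(pattern) if p.is_file())
```

Calling select_by_pattern on a folder with the default pattern returns only the .foo files in it, without descending into subfolders.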

As for me, I've been working with FAR Manager for many years, but today my weapon of choice is Total Commander. Obviously it's not perfect, but it does the job and does it well. One of the many features it has (and Explorer doesn't) is the tabbed interface. I won't say it's genius, because I think it's simply a trivial requirement of something that wasn't written 5 years ago.

The thing that most amazes me is that most people still use the good(?) old Explorer, so maybe I'm just hard to please...

Saturday, August 21, 2010

Performance

Every sysadmin has to deal with performance issues once in a while.

As "the sysadmin of sysadmins" I deal a lot with allocating resources to the VMs that the developers of the different systems use. This is not a simple task, since I have limited physical hardware and about 100 virtual development servers to manage, and of course everyone wants his VMs to be as strong as needed. Or, more accurately, as desired.

The thing is that from my experience, the strength you need for a development server is usually far below official specs since you don't experience the same load on development environments and you don't really need everything to be as smooth as with a production environment.

Of course we do experience more serious performance issues on development environments - especially with new systems we are not fully familiar with yet, but it seems in most cases those issues are not solved by pimping up the server.

I've found myself more than once trying to explain that there's a "real" performance issue and that someone (depending on the case) has to take a really good look at the code, the design or whatever and find the problem, because it's not going to be solved by adding cores, adding RAM or using a physical server. In those cases I usually have to supply proof that the RAM and CPU are sufficient, and even then I have to deal with claims like "but it's a VM, it might be lying".

It's really like trying to solve an NP-Complete problem with a brute-force technique by adding more computational power - this is not a real solution if your input is not bounded, and things are never really this simple...
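A quick back-of-the-envelope sketch of why the brute-force analogy holds. It assumes a brute-force search that tries all 2^n subsets of n items (the usual shape for problems like subset sum) and an operations budget that stands in for "server strength"; both are my illustrative assumptions.

```python
def largest_brute_force_input(ops_budget):
    """Largest n such that trying all 2**n subsets fits within ops_budget operations."""
    # For a positive integer, bit_length() - 1 equals floor(log2(ops_budget)).
    return ops_budget.bit_length() - 1


base = largest_brute_force_input(10**9)            # 29 items
doubled = largest_brute_force_input(2 * 10**9)     # 30 items
thousandfold = largest_brute_force_input(10**12)   # 39 items
```

Under these assumptions, doubling the hardware buys exactly one more input item, and a thousand times the hardware buys ten.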

Monday, July 26, 2010

History

Only last week I completed an upgrade I'd been working on for a long, long time - working on a low-priority project while leading a team doesn't contribute to agile upgrading. At least I've learned some things since back then.

Interestingly enough, the Apache .dll error didn't reproduce on other development environments. Up until the moment of truth - the production upgrade. Obviously I had no recollection of the solution: I did remember running into this error and had it documented in my notes, but not the solution - shame on me. Luckily, this time the error was documented in Metalink and the solution was much simpler - there was a missing Windows .dll (msvcr70.dll if I'm not mistaken).

But the more interesting issue is the one that prevented OPMN startall CA from completing...

After opening an SR (and a lot of debugging) I discovered the issue was with "ldapbinding" on the SSL port taking a long, long time - in other words, some heavy load on this port. I was advised to start the oidldapd process with more dispatchers (I later learned adding more workers worked better for me). That did solve the issue, but this is not what baffles me about this case.

The first OID server I upgraded didn't suffer from this issue. I used to attribute it to the fact that it was the only server that wasn't part of an OID cluster - although, because of the nature of the upgrade process, it always seemed like a lame excuse.

The thing is... the first production server I upgraded was OK as well. There's now only one thing I can think of that's common to both servers: neither was installed by me. My memory fails me again here, since there might be an additional server I didn't install that did have the problem, but there's a good chance it was installed in a slightly different way - a way I don't think should affect oidldapd's function, but still a different way.

The thought that drives me crazy right now is what kind of different installation steps my predecessors could have taken that matter so much - all I did was follow their (evolved over time) documentation. Or maybe they did some secret configuration steps they left no trace of???

Saturday, July 17, 2010

Temporary

One of the things I've learned being a sysadmin is that there's no such thing as a "temporary solution". If you're presented with one, or thought of one yourself, first ask yourself how you feel about this solution becoming permanent, because it's here to stay.

Reasons for temporary solutions becoming permanent are numerous, but I think most of them come down to the (reasonable) claim that it's better to turn efforts to something that needs fixing than to something that already works. Many times the burden of maintaining such a temporary solution falls on the sysadmin, so no one else sees how bad it really is, and sometimes it's just more efficient (in the immediate time frame) to accept this burden than to solve the issue once and for all.

One example I always fall back to when I have an argument about a temporary solution is the example of the email sending mechanism we use in our Oracle Applications system:

Our basic solution for sending email is based on an EBS alert that monitors a table we insert data into when we want the system to send an email to someone. An important detail: an EBS alert is simply a database trigger that is wrapped with a test that the table was updated by APPS - the EBS administrative user.
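To illustrate the mechanism (and the catch that follows from it), here is a minimal sketch using SQLite as a stand-in. It is not the EBS implementation: the table and column names are invented, the "who inserted this row" test is modelled as a plain created_by column, and "sending" is just copying the row into an outbox table.

```python
import sqlite3


def build_mail_queue():
    """Create an in-memory database with a queue table and an APPS-only trigger."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE mail_queue (recipient TEXT, body TEXT, created_by TEXT);
        CREATE TABLE outbox (recipient TEXT, body TEXT);

        -- Stand-in for the alert: act only on rows written by APPS.
        CREATE TRIGGER mail_alert AFTER INSERT ON mail_queue
        WHEN NEW.created_by = 'APPS'
        BEGIN
            INSERT INTO outbox (recipient, body) VALUES (NEW.recipient, NEW.body);
        END;
        """
    )
    return db


def queue_mail(db, recipient, body, user):
    """Insert a mail request on behalf of the given (hypothetical) database user."""
    db.execute("INSERT INTO mail_queue VALUES (?, ?, ?)", (recipient, body, user))


def sent_to(db):
    """Return the recipients the trigger actually picked up, in insertion order."""
    return [row[0] for row in db.execute("SELECT recipient FROM outbox ORDER BY rowid")]
```

In this toy version a row queued with user "APPS" ends up in the outbox, while the same row queued by any other user sits in mail_queue and is silently ignored.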

Now, several years ago we started using ApEx as a platform for developing applications over our EBS. The catch - ApEx applications don't use the APPS user, so they can't send email the way we used to, at least not without modification to our mechanism. "Luckily", ApEx has its own email sending mechanism, so we decided to use it as a temporary solution. That was about three years ago.

At this point you can probably ask me "well, who cares?", and I'll probably answer "well, I do". I do, because when we upgraded our mail server to Exchange 2007 it didn't really like the Hebrew encoding ApEx was using and decided not to display it properly. So after a day or two in which emails just weren't sent from the affected ApEx applications, we had to devise a solution I'm not sure anyone will know about when the Exchange server is upgraded to 2010.

It's a shame, because the solution to this issue was designed a long time ago, and the ApEx mechanism causes application pages to hang while mail is being sent. But still - why fix something that isn't broken?

Beware of temporary solutions.

Tuesday, June 15, 2010

Copy-Paste

I can't think of many features that are, at the same time, both absolutely necessary and the biggest threat to the world as we know it (well, almost), like the ability to copy and paste text at the blink of a Ctrl+C\V.

Retyping (or rewriting) a text over and over again is something computers excel at but we humans can't stand. I guess that's the origin of punishments where you have to write a sentence one hundred times or copy sections from a book.

But when I think about it, if you work with computers and you're using copy-paste too frequently, you're probably doing something wrong. A few examples:

1. The most trivial example: if you're coding and you've just pasted a code snippet you copied from somewhere else in your code, you should probably write a method that implements this code snippet. Even if you've copied someone else's code (from the internet, for example), I think that you'll have a better understanding of what the code does by retyping it, and there's less chance you'll miss any adjustments you should make.

2. In my line of business I'm often tempted to copy the contents of configuration files and the like, but as with code, there's a good chance I'll miss some necessary environment adjustments.

3. The last example is "written in blood": you might think you see the text you're going to paste but YOU ARE WRONG. I don't know if hidden characters are to blame or some other clipboard wonder, but I keep encountering commands that malfunction when pasted: I have some OID configuration commands that do this, I encountered it when trying to install Oracle SOA Suite and just today I had a Java command copied straight from Oracle's documentation (as part of installing a connectivity web service between Microsoft's SCOM and Oracle's Grid Control) that just laughed at me until I manually retyped it.
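If hidden characters really are to blame (I can't say for sure), a small check like the one below would at least show them before the command runs. It's a minimal sketch: the function name is made up, and it simply flags anything outside printable ASCII, which is an assumption that suits single-line shell or Java commands but not text that is legitimately non-English.

```python
import unicodedata


def suspicious_characters(text):
    """List (index, code point, Unicode name) for each character outside printable ASCII."""
    findings = []
    for index, char in enumerate(text):
        if not (32 <= ord(char) <= 126):
            # Control characters have no Unicode name, so fall back to a placeholder.
            name = unicodedata.name(char, "UNNAMED")
            findings.append((index, "U+%04X" % ord(char), name))
    return findings
```

A typical offender in commands copied from documentation is a dash that was typeset as U+2013 (EN DASH) instead of a plain hyphen: it looks right on screen, but the program no longer sees an option flag.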

So me, I usually retype EVERYTHING - this way I feel more in control of what I'm doing.