Friday, December 26, 2008

Simple Testing

A while ago I upgraded my whole development AS infrastructure to 10.1.2.3, and after the upgrade SSO stopped working. Since SSO (at least mine) tends to be a bit fragile, I sighed with the familiar feeling that this was going to take some time and started browsing through the logs. Once again Oracle proved that their logs suck and tend to display the same error message for totally different errors. The logs showed an error that usually accompanies a bad keytab file. Since I'd had some similar issues lately, I decided to re-create the keytab file, but it didn't really help. Apparently, all I had to do was search Metalink.

The real issue is that the JDK this version ships with (1.4.2_14) has some error that prevents SSO from working (and not only this version; if I'm not mistaken, IDM 10.1.4.0.1 ships with it as well). The solution is a simple one - install a higher version (say 1.4.2_19).
Now, these things get me really frustrated. OK, I get it, your error handling is not the thing you take pride in (sure hope not), but it's not the first time (more examples to come in following posts, or you can read this post again [last example]) it seems nobody has really tested the final product. Had somebody taken the final product (with JDK 1.4.2_14) that was about to be published for everyone to download, installed it and tested it for the very basic functionality, this error would've been discovered (and hopefully the product wouldn't have been published).

That's the part when I start to imagine the following conversation (a special bonus for whoever discovers the meaning behind the aliases):
M: "Hey, there's a new jdk out! We should ship 10.1.2.3 with it because it's the newest" (and new is good, right?)
S: "But we tested it with 1.4.2_x<14!"
M: "Yeah man, but we didn't really touch something that heavily depends on jdk specifics, it's a minor version anyway."
S: "You know what, you're totally right, let's do it! It's not like we've ever shipped any totally-unworking piece of code before."

Really, even with the simple applications we have in my company we try to test thoroughly and, though not always successfully, try to enforce different rules before deploying: things like a certain period of time in which the application has to work on a test environment without code modifications before moving to production, a clean testing environment for relatively big installations, etc.
So how come Oracle manages to ship a totally not working version of a product?

Thursday, December 18, 2008

Noise

Well, I haven't blogged for a while; you can probably blame Halo 3 for that, as well as some other Xbox games. Anyway, this post is not going to be too professional as I just want to raise an issue that deeply bothers me.

Lately I have been to several meetings in different companies, and most times I haven't been too satisfied with the hosting. How come?
Well, there's this issue with cookies and other snacks placed on the table, they tend to be noisy - I mean every bite you take makes a sound so loud it reminds me of the reasons I haven't blogged for a while, if you know what I mean. And so instead of enjoying a snack I end up trying to silence the breaking sounds - can't people provide snacks that behave?
In one company they gave me a drink in a glass bottle. The problem is that the table in the conference room was made of glass as well, so putting the bottle back on the table after drinking without generating a loud response required great skill - I kinda started imagining myself as a Terminator measuring the distance to the table and the amount of force needed to put the bottle back silently.

Please people, think of me next time...

Saturday, November 8, 2008

Digging

One of the abilities a sysadmin must have is the ability to dig. What do I mean when I say "dig"? Well, I mean that many times you face a problem that seems to be rooted deeply inside the system, and a sysadmin has to know how to solve these problems. The first thing I personally do most of the time is to say "OK, that's as far as I'm going, time for official support", but as I wrote before, this usually ends up with me de-compiling Oracle's code.

Time for an example.
I've recently been trying to utilize Oracle's ESM (Enterprise Security Manager). Since the system it's intended for requires batch actions, I used the command line tool. Guess what: when I use one of the commands, the GUI stops showing associated roles for the related Enterprise Role. Since it was obviously a coding bug, I logged an SR, thinking to myself that this bug, if reproducible (and it is), means no one has ever tested this functionality. About a month later I was notified that a fix was created and they're testing it. Two days later - bad news, the fix did not pass testing. And this is actually the current status (a few months have passed by now).
Last week I got tired of this whole thing, so I started digging for the source of the problem. Apparently, when you use the command line, the record generated in the OID has a case mistake. So my next move was to de-compile Oracle's code and fix it, and voila, it works!
When I told the support analyst about my fix, he said he knows the solution is easy but the problem is that they have some incompatibility issue with different database versions. I really don't care whether it's reasonable for such a small fix to take such a long time to implement, I'm just happy I know how to dig.

Friday, October 31, 2008

Life-Work Balance

It doesn't matter how intensive your work is, how time consuming your studies are and how tired you are when you get home after all of them. You should always make time for some other activity; it can be sports, some artistic hobby or whatever, as long as you enjoy it. Another important thing about this hobby is that it must be something you do constantly and don't skip no matter what. It might sound complicated but it's actually as simple as forcing it into your schedule and not giving up on it (or on yourself).

For me, practicing Judo is a great way to get my mind set on something other than work and everyday thoughts; it makes me concentrate on an entirely different thing. One of the things I like about it is that the great physical effort you have to put into a training session is a huge contrast to my work, which involves mostly sitting in front of a computer at different angles.
Yes, it can get tedious, making progress is hard and there are those days when I have to force myself not to give up on the training session. But the fact is that most of the time, and especially when showing up for a session after missing a few, I have real, genuine fun and that's what it's all about.

Tuesday, October 14, 2008

Traveling

I haven't blogged in a month, and for a good reason too. I was traveling in Ireland for two and a half weeks, and after physically landing it took some time to mentally land and even more time to "catch up with life" - organize my photos from Ireland, watch all the episodes I'd missed while abroad, etc.

This trip is obviously something I should write about, and my trip companions actually offered all kinds of ideas during the trip, but I don't really know what to write. As I write these lines I don't know yet what I will write about. Really. I guess I'll just write some outline of the trip and interesting highlights for those of you who plan to visit Ireland as well.

Part I - Shortened version of the Wicklow Way
Since we wanted to make the most of the time we had and the last part seemed a bit boring, we decided to start from Dublin and walk only for five days - up to Moyne/Tinahely and not up to Clonegal. 
The walk was really great, especially since we had no rain at all while walking (and great weather in general). We actually had lots of luck, since everyone we talked with before the trip told us we'd be totally soaked inside out; in reality the massive coat I brought ended up as just a burden, and the only times I took it out of my bag were when I moved it to another bag. We had some rain during the first two evenings and some rain on the last two days in Ireland - and that's it. Oh, and one evening when one of the guys hung his clothes to dry outside, but that wasn't me so it doesn't count. All in all this absence of rain is further proof of my belief that I have tons of excess luck. Anyway, words (except maybe "Green") won't really describe our walk so you can either check out the photos or walk the Wicklow Way yourself (recommended option). We've learned a few lessons as well:
Tip 1: Public transportation in Ireland sucks:
a. It's expensive - buses are expensive, taxis are expensive and trains I prefer to forget (31.5 euro for a train from Galway to Dublin).
b. It's hard to find. When planning to quit the Wicklow Way early we didn't consider the fact that we wouldn't be able to find a bus, and the "taxi" we eventually used was an elderly guy with a van (who robbed us nonetheless).
So what do you do? Whenever possible just rent a car. This method has its own limitations (drop-off points and the like) but it's much more convenient and cheaper.
Tip 2: Don't carry around your big bags. The places you stay at will be glad to transfer your belongings to your next stop for a fee (usually 20 euros), and it's worth it. And if you picked Ireland you can probably afford it.
Funny 1: We're lame. During the walk we met many people doing the Wicklow Way. Many of those people were more than twice our age and they didn't even bother listening to my second tip... embarrassment indeed.

Part II - The South
Next we rented a car and visited:
a. Around Cork: 
Blarney Castle - a nice castle which has the "Stone of Eloquence" on the top floor, kissing it is not a simple task, so unlike other visitors we decided to pass.
Fota Wildlife Park - a very nice wildlife park - unlike a zoo, animals are not behind bars and mostly (not the cheetahs though) are separated from humans only by water or modest wire fences. 
b. Around Killarney:
A whole day of cycling in the Killarney National Park - Ross Castle, the Gap of Dunloe and Muckross House, including sailing through the three lakes by boat.
This little adventure took some recovering from.

Part III - The West
In this part we dropped off the car and stayed for three days in Doolin - the place where everyone owns a B&B (at least one).
a. Walked the section of the Burren Way from its beginning to Doolin (we took a bus to the beginning) through the magnificent Cliffs of Moher.
Tip 3: The Burren Way. Looking now at the maps I have (we didn't have a map when we did the walk), it seems there are at least two versions of the Burren Way - one that starts in Lahinch and simply skips the Cliffs of Moher and one that starts at Liscannor and goes through the cliffs. We tried to do the second version but it was very poorly signed (unlike the Wicklow Way); luckily we realized going down at the cliffs didn't make sense.
b. Took a ferry to the largest of the Aran Islands - Inishmor - and cycled through it (this time it was much easier).
Tip 4: Unless you stay for the night you have only about three hours to visit the island before the return ferry arrives.
Tip 5: There's an option to take ferries for the islands from Galway as well, an option worth exploring.
c. Spent about half a day in Galway.

Part IV - Northern Ireland and around
After a brief stop in Dublin to celebrate Rosh-HaShana and rent a car we headed up north.
a. On the way north we visited the megalithic tombs at Bru na Boinne.
b. We stayed at Carlingford near the invisible border and did a short (about 11 km) walking route - poor signing again.
c. Drove to the very north of Northern Ireland and visited the Giant's Causeway.
Funny 2: When we finished with the Giant's Causeway it was already dark, we were hungry and we had a long long drive back to Carlingford, so we entered a nearby hotel to eat. I don't know why but we were total strangers there - everyone looked at us, they didn't know how to charge us for the beer and they asked us to move to another room to be served dinner, strange...

Part V - Dublin
We had about two days in Dublin - one before going up north and the last day, can't say the city is too exciting, the Guinness Storehouse is nice but unless you're a Guinness fanatic it's not that much of a thrill. Also there are some nice museums.

That's all folks.
Of course, if anyone is really planning on visiting Ireland, I'll be more than happy to answer any questions.

Monday, September 15, 2008

Buffering...

It's not really clear what I'm doing writing this post when I'm leaving for Ireland in an hour and a half, but I guess I have a buffer to waste...

Anyway, this post is about the single most important part of any IT (any?) project - the BUFFER. A big enough buffer should always be taken for whatever procedure you're planning - we usually take a 30% buffer. 
I'm not talking only about an installation process, but also about things like giving an answer to your manager as to when you'll have a new product checked or a patch installed for the first time on a development environment. I'd rather get a raised eyebrow, and maybe surprise my manager later, than work under unnecessary pressure and make up excuses. Same goes for actual installation processes: I'd rather announce a long downtime and get complaints from everyone than try to decide in a rush whether I should roll back now or hope I'm lucky enough to shrink an hour's work into fifteen minutes.
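As a small illustration of the arithmetic, here is a minimal sketch that pads an estimate by the 30% mentioned above. The function name and the four-hour example are made up for illustration; only the 30% figure comes from the post.

```python
from datetime import timedelta


def with_buffer(estimate: timedelta, buffer_ratio: float = 0.30) -> timedelta:
    """Return the estimate padded by buffer_ratio (0.30 means a 30% buffer)."""
    if buffer_ratio < 0:
        raise ValueError("buffer_ratio must be non-negative")
    return estimate * (1 + buffer_ratio)


# Hypothetical example: a patch expected to take 4 hours to install.
announced_downtime = with_buffer(timedelta(hours=4))
print(announced_downtime)  # 5:12:00
```

The number to announce is the padded one; the raw estimate stays in your own notes.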

Well, that was my little - written in blood - piece of advice before I drown myself in beer, whiskey and god knows what else...

Saturday, September 6, 2008

Ignoring My Issues

A big portion of being a sysadmin is solving problems. One of the first changes I've noticed when I stopped being a sysadmin is the fact that I suddenly have (much) more time on my hands, time previously consumed by solving - usually  minor - issues. But I think a sysadmin's job is also to know when to walk away from solving an issue, let me give you a couple examples:
  • There's a client side issue with the EBS system I used to administer. I was never able to reproduce it at will or find its cause, but I do know that resetting the Windows profile for the client solves the issue every time. So that was my policy: I decided that if the workaround is so simple and finding a real solution is so complicated (believe me, I've tried), I'd rather just ignore the problem.
  • Last week the new EBS sysadmin was trying to solve an issue with one of the EBS related custom applications we have; it had suddenly stopped working without any apparent reason. This application resides on an Oracle Application Server, so he was going over the logs trying to find the problem. When I saw it was taking too much time I came over and just restarted the Process Manager and voila, it works. Talking to him afterwards, he said he was aware of this solution but as it was past working hours he thought he'd better explore the issue - well, he had a point. But my point is that an issue that happens twice a year or so and that can be simply "killed" isn't worth wasting your time on, even if "killing" means affecting some other applications running from the same AS as well.
So how do you know when to ignore a problem? There's no real answer for that but you should probably consider the following aspects: Is there a workaround? How bad is it? What's the frequency? How long to work out a solution?
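One rough way to put numbers on those questions is to compare the time the workaround will cost over some horizon with the time a real fix is expected to take. The sketch below is only an illustration: the function, the two-year horizon and all the figures are assumptions, and it leaves out the "how bad is it?" question entirely, which still needs human judgment.

```python
def worth_fixing(occurrences_per_year: float,
                 workaround_minutes: float,
                 fix_estimate_minutes: float,
                 horizon_years: float = 2.0) -> bool:
    """Return True if a real fix is expected to cost less time than
    applying the workaround every time the issue shows up."""
    workaround_cost = occurrences_per_year * workaround_minutes * horizon_years
    return fix_estimate_minutes < workaround_cost


# Hypothetical numbers for the Process Manager case: twice a year,
# 10 minutes per restart, a full day (480 minutes) to investigate.
print(worth_fixing(2, 10, 480))    # False - just restart it

# Hypothetical daily annoyance: 250 times a year, same costs.
print(worth_fixing(250, 10, 480))  # True - worth digging
```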

Sunday, August 31, 2008

Upgrading + Blog Day

This time I won't be telling about a database upgrade.
Just wanted to tell that from tomorrow on I'm upgrading myself. From tomorrow on I'm no longer an Apps DBA, my replacement has completed his training and is ready to take over the job. Effectively, he's kinda been doing 99% of the tasks for at least two weeks now.
So now I'm the leader of a team of sysadmins - the team I was previously a part of. Besides an EBS sysadmin, the team also includes other sysadmins (MOSS, DBA, etc.). As a consequence the emphasis in my posts will shift a bit from EBS to more general stuff, but there'll probably still be a fair number of posts about Oracle products.

And to the second part of this post.
Didn't know about Blog Day until today, but since I'm a happy victim of its customs, I feel obliged to "celebrate" it as well, so here's my list:
  • techno-mosh. Mosh is a very versatile guy and has many areas of interest so it's always fascinating to read his blog. Besides, he frequently writes about all kinds of great tips (that get him fans from all around the internet :)), so it's definitely a most recommended blog.
  • Coding Horror. Definitely a great programming blog, always gives me insights and even if not it's still fun to read.
  • Steven Chan's Blog. Although, as mentioned, I'm no longer an Apps DBA this is still a great blog, that many times gave me new ideas and new features to research.
  • xkcd.com. Great (if somewhat twisted) humor.
  • Yevgeny. Gotta agree with Mosh on this one, Yevgeny does a great job sorting out the less interesting posts and sharing only what I want to read.
Happy Blog Day everyone!

Saturday, August 16, 2008

Bad Design

Yesterday I was shopping for clothes.
Usually (at least in Israel) a dressing booth is a small booth that has some hooks for hanging clothes and that's all, not fancy but good enough. One of the stores had a somewhat different design: its booths didn't have hooks and had a kind of bench instead, and they had mirrors on each booth wall as well - not full size mirrors but a pretty big round mirror on the wall opposite the door and strip-like mirrors on the other walls. It's probably not the only place that has those mirrors but I find this idea pretty bad. What are the mirrors for? Usually the mirrors are just outside the booths so the shop assistants can do their job - tell you how great you look in this shirt and that they have just the perfect pair of pants to go along. If I just try on everything inside the booth, isn't it a missed opportunity?

Well, actually in this specific shop, even when I used the mirrors outside the booth (the ones inside the booth just weren't comfortable), the shop assistants didn't do too good a job. Instead of a casual "Can I help you?" or "What are you looking for?" I got an awkward "Would you like to get service?" (from all of them, so it's gotta be some kind of weird policy) or even worse - for Hebrew speakers - "אתה רוצה לקבל שירות ממי?" (roughly, "You want to get service from whom?"). In addition to that, I could barely understand what one of the shop assistants was saying (mumbling actually). Don't they check for such things when hiring people for this shop?
I wonder why the bad service, I mean, other shops in the same mall have normal shopping assistants. Do they pay less? Are they less thorough in choosing their staff? They're lucky I love their clothes...

Thursday, August 14, 2008

adpatch VS. OPatch

So, in my previous post I introduced some issues with both patching tools, but the truth is that adpatch is usually pretty good. Usage is pretty trivial, you have some useful flags (I often use -nocompiledb and -prereq), logs are pretty coherent and you have good control over the workers with adctrl - although I don't really understand why the so useful "Skip" is hidden under the 8th non-visible option.

OPatch, on the other hand, I really don't like.
How come the architects behind adpatch thought about storing all the version information in the database and the architects of the database itself didn't come up with this same idea? Why do I need to maintain the inventory folders? It's another thing to take care of when cloning environments. Actually, in my case I just ignore it - I don't even know how to clone it properly (although I'm pretty sure there's some documented procedure for this).
Why, when I wanted to upgrade my 10.2.0.3 patchset X database to 10.2.0.4, did I first have to remove the patchset? And why, when I tried to do this, did I get weird OPatch errors? Isn't the whole idea of a patching mechanism to save me this kind of trouble?
It seems that OPatch is nice for checking what you have installed on your database, but as soon as it comes to installing any patch beyond the first one it's nothing but trouble.

So ok, when two patches make different modifications to oracle.exe it's a problem to resolve the conflict, but I refuse to believe it can't be done.

Thursday, August 7, 2008

Flexibility

Patching mechanisms such as Oracle's opatch and its more successful brother adpatch (this issue deserves a post of its own) are great stuff - you can introduce specific bug fixes into the system, install a whole new functionality or make your application compatible with a new database version, and all without having to upgrade a whole application version. The other side of this coin is the fact that you can ignore bugs you don't care about and you don't have to maintain and debug functionality you don't use. Well, sounds great, everyone gets a suit sewn to perfectly (or at least nearly perfectly) match his needs.

But those mechanisms frequently require strict dependency and versioning management, and sometimes, I think, Oracle just exaggerates.

Let me show you what I'm talking about. During installation of SCM_PF.J on 11.5.9 I got many failed workers, two of them relate to my point:

1. A worker failed, and searching Metalink I discovered that this bug can be fixed by applying a database patch. The README file for the patch says that I should apply the patch and run drkorean.sql; it also specifies very strictly that I should make sure my database is down during the process. Off I go and try to install the patch, but opatch complains about a bad inventory, which might actually be true since this is a replicated development environment and it might have a problematic inventory. But I'm not one to give up easily, so I take a look at the actions.xml file of the patch, and... big surprise (not really)! All the patch does is copy drkorean.sql to a designated folder. A more naive sysadmin could've started panicking, trying to fix his inventory or even opening an SR to cure the poor thing.
Wouldn't a simple "copy-run" instruction be a better option? It's not like there're problematic versioning issues here - it's a single sql script that fixes a specific bug - if you insist on having some supervision here, let me take care of this single script's version, believe me I'm up to it. And what's that about taking my database down? Guys, I was in the middle of applying a patch!

2. Another worker failed and Metalink pointed me to a "PROBLEM" document that said that I should apply a patch - unluckily the patch is for 11.5.10, no good for me. Like in the previous case a brief glance inside the .drv file proved my suspicions that all the script really does is to replace the sql script that the worker failed on with a newer version. I'm 100% sure that there's no problem running it on 11.5.9 - and that's actually what I did.
OK, I get it, the bug was originally discovered on 11.5.10 and that's probably the only setup it was ever tested on. But wouldn't arriving at the conclusion that it matches 11.5.9 as well save us all some time and blog lines? Wouldn't it be a shame if I had to open an SR and wait for a reply to handle this trivial case?

Saturday, August 2, 2008

Thank God for Internet

I sometimes think that the Internet - with emphasis on the ability to search it - to technology is something like zero ('0') to mathematics, it's a relatively late development and without it some things are just impossible to accomplish.

I've recently been working on some small C# project. I don't really know C#, yeah I did a one-week course and I do know Java and C++, but that doesn't really mean I know all the techniques and tricks. So, naturally each time I hit a bump, I google it, and voila! The first or second result is usually what I'm looking for.
Now, what would I do in a world without Internet? Call a friend? Look up the subject in a C# book? This will probably force me to keep a small library of programming books, which is nice to brag about but space and money consuming and still much less efficient. And neither solution allows me to instantly paste code snippets into my code. You get the idea, not having Internet, my task would have been much more frustrating and challenging.
But who cares about me, what about serious software development? Think how much time would've been wasted if every time a programmer had a problem he would've had to start searching books. Of course he has his colleagues as well but they don't usually provide a written code example, more wasted time. And time is money.
I'm not really a (programming-)linguist but I think that modern programming languages like Java, C#, Perl, etc. couldn't be what they are today without the Internet. They depend on too many different interfaces and classes (hey, where would I get my Perl packages from) and one can't expect to master them all. There's also a limit on how much information and how many examples you can put on an MSDN disk, even if you release a new version once in a while. Besides, you'll never have what professional forums supply.

Since I work mostly with Oracle products, the trivial analogy is Metalink. Yes, I know Oracle precedes the Internet but I really don't know how support worked in those days. Did they publish some kind of soft copy repository containing notes? Did you manage all your TARs on the phone? And what about uploading log files, did you have to send a diskette with the files to Mohender in India?
One thing I'm (pretty) sure of is that having an online support system like Metalink - however inefficient it may be at times - enables Oracle (and every other software company as well) to develop more complicated products and more sophisticated features. It's one thing to search Metalink or google once in a while and a totally different thing to search through documents and manage your TARs by phone.

Hail the Internet!

Thursday, July 24, 2008

Mass Changing

In some cases the domain name for an entire Oracle product has to be changed. Oracle tools have a pretty straightforward approach to this task: replace all occurrences of the domain name in all configuration files. Now, that's ok if the source domain name is lribinik.cool.blog, but what if I'm more enamored with Oracle than with myself and I happen to have a single label domain named ORA? Things are gonna get messy in this case (just use your imagination).

One of those cases is when you clone your EBS environment and run adcfgclone. Now, I've had this issue forever, but I'd only noticed it targeting files under rdbms\admin; after installing 11.5.9 CU2 (I think this is the patch to blame, but I might be wrong) I noticed every possible location is trampled by the huge Oracle monster: perl, jdk, jre, you name it. What da hell?! Why should perl\java code be changed AT ALL after a cloning process?
Had enough of this, really. To deal with this I've just removed the lines relevant to those folders from the instconf.drv file and now I can clone quietly.
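To make the single-label domain problem concrete, here is a minimal sketch contrasting a blanket search-and-replace with one that only rewrites the domain when it follows a known host name. This is not how adcfgclone works internally; the host name `ebsdev`, the target domain `NEWDOM` and both function names are hypothetical, chosen only for illustration.

```python
import re


def naive_replace(text: str, old_domain: str, new_domain: str) -> str:
    # Blanket replacement: every occurrence of the string, wherever it appears.
    return text.replace(old_domain, new_domain)


def replace_for_known_hosts(text: str, hosts: list[str],
                            old_domain: str, new_domain: str) -> str:
    # Only touch "<host>.<old_domain>" for hosts we explicitly list,
    # and only when the domain is not followed by more name characters.
    if not hosts:
        return text
    names = "|".join(re.escape(h) for h in hosts)
    pattern = re.compile(
        r"\b(" + names + r")\." + re.escape(old_domain) + r"(?![\w.-])"
    )
    return pattern.sub(lambda m: m.group(1) + "." + new_domain, text)


sample = "host=ebsdev.ORA\nerror=ORA-04030\nhome=D:\\ORACLE\\db\n"

print(naive_replace(sample, "ORA", "NEWDOM"))
# The error code becomes NEWDOM-04030 and the path becomes D:\NEWDOMCLE\db.

print(replace_for_known_hosts(sample, ["ebsdev"], "ORA", "NEWDOM"))
# Only the host line changes.
```

Even the second version is a heuristic: it still depends on someone supplying the right list of hosts, which is roughly the role a list of files to change plays in the documentation.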

Similar processes for an iAS domain name change are a bit more reasonable: they actually warn you about this in the documentation and let you know about a file that contains the list of files that will be changed, so they actually try to stick to relevant files.
I'm not sure though, if the chgip script (the one that changes the domain name) works with this file, because I think files not mentioned there are changed as well.

Anyway, I think the best way to avoid this problem is just to name your domain lribinik.cool.blog. It's a nice name.

Thursday, July 17, 2008

Education, Part II: Exams

I think I might have made a wrong impression in my previous post. Studying is not really such a nightmare to me, I do enjoy it, it's just that sometimes it can get very exhausting and frustrating.

One of the aspects of studying that leaves me with mixed emotions is exams.
On one hand, the exam period is usually the period during the semester when I have the most free time - no lessons, (usually) no assignments, and as my studying method is a quick revision of the lecture notes and solving previous tests I don't spend much time studying for an exam. Why not? Well, at least in my Master's I usually have a maximum of two previous exams to refer to and this doesn't leave me with much studying to do.
On the other hand, exams are annoying. One of the things that annoy me is that you would expect an exam to be designed to test your knowledge, but this is not always the case. That's why my studying method is as described above. You are not studying the course's material, you are studying for the exam.
For instance, I had three tests this last semester, each one of a different kind:
1. "A Literature exam", or at least this is the Hebrew term, meaning the exam isn't really designed to test your knowledge or understanding but how much you can write on a trivial subject and how accurately you can guess what negligible facts the lecturer expects you to mention.
2. An easy test but there was a catch... time shortage, and the lecturer won't add any time. So OK, it's a ranking system, the better students will probably solve the questions more quickly and will have a chance of answering all the questions, but I still think exams are supposed to test knowledge and not writing agility.
3. This one was an exam that really tested knowledge but we had only a small set of sample questions (without solutions) to study by.

Well, after telling everything I hate about studying I can now rest and not think about it for the rest of the summer.

Thursday, July 10, 2008

Education, Part I: Masochism

Know this thing when bugs rush towards the light that's gonna fry them? Well, I sometimes feel like one, except I have more than a single chance.

What am I talking about? Last week I finished the last test for the semester in my Master's studies and I can't help but feel a great relief and think to myself "pheeeeew, done at last, three months of freedom, here I come!". And then I recall how it all began...

Before starting my Bachelor's degree I was excited - "WOW, University, that's going to be interesting!" (a bit geeky I know, but no hiding the truth here) - I think it took something like a couple of weeks to get me totally frustrated and counting the weeks 'till the end of the semester.
Before the 2nd year I thought - "Done with (almost) all the math so it's definitely going to be better this year" - well, not really (maybe a bit), but at least I started getting higher grades.
Before the 3rd and last year it was something like - "Well, I'm done with most mandatory courses so now I choose most of the courses I take, how bad can it get?" - and indeed, it was better, but at the end of the year and the degree the only thing I could think regarding further education was "Nice it was, but a B.Sc. is more than enough for me. Thanks a lot dear university but I'll probably leave the Master's to someone else". I was truly sure there's no way I'll go for a Master's, at least not in the next five or so years.

Well, it took me one year to start considering the issue and another year to actually start studying, and again same old lies "It's a Master's so the courses are more interesting and less demanding", well it has some truth to it, but no matter what, at the end of the year I'm still exhausted and happy to get rid of all the courses I was so excited about taking.

So now I have three months to forget all I've written about here and get re-excited, you can count on me.

Talking about masochism, this week I've learned it can take the form of curiosity. I won't tell you what I'm talking about - that would be sadistic - so, as translated directly from Hebrew, "who understands, understands".

Thursday, July 3, 2008

Light at the End of the Tunnel, Part II: The End?

Well, it's time I conclude this whole upgrade story.

The other major problem I had, and actually still have, is related to memory consumption. Apparently 10g processes consume more PGA. I haven't really figured out yet whether that's planned behavior or a bug, but the fact is that just opening a forms session now takes something like 8MB of PGA instead of 3MB on 9i. Add to that the fact that I have more than a couple of concurrent users who do more than just open forms (they actually work with them), throw in my operating system (32-bit Windows), add to the mixture memory leaks in forms (mainly in the inventory module) and you might as well call Houston.
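
A back-of-the-envelope sketch of why the jump from 3MB to 8MB hurts, in Python. The two per-session figures are the ones mentioned above; the 1000MB budget for PGA is purely an assumption for illustration (the real headroom on 32-bit Windows depends on the SGA size and other settings), and it ignores everything users do after opening a form.

```python
def max_sessions(pga_budget_mb, pga_per_session_mb):
    """Return how many freshly opened sessions fit in the given PGA budget."""
    return int(pga_budget_mb // pga_per_session_mb)

# Hypothetical figure: memory left for PGA once everything else is allocated.
PGA_BUDGET_MB = 1000

sessions_9i = max_sessions(PGA_BUDGET_MB, 3)   # ~3MB per forms session on 9i
sessions_10g = max_sessions(PGA_BUDGET_MB, 8)  # ~8MB per forms session on 10g

print(f"9i:  about {sessions_9i} idle sessions")
print(f"10g: about {sessions_10g} idle sessions")
```

With the same (assumed) budget, the number of sessions that fit drops to well under half, and that's before anyone actually starts working.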

I got some advice from Oracle on this one but no real solution. I've tried reducing the stack size for processes using orastack. An Oracle expert (same one from here) suggested I ask the heavier users to close forms they aren't currently using, and he also found a still-unsolved bug that seems relevant. And I still have an open TAR regarding the issue.

So where do I stand now? I'm not really sure. I used to have the database crashing with ORA-7445 and ORA-04030 daily, and then it just stopped. Yeah, I do face an occasional crash once every week or two, but somehow the problem is not such a pain in the ass anymore. One possible cause is a reduced number of concurrent users; it seems that it's just enough to keep me below the fatal memory consumption limit.

I'll probably let you know if I have any insights and I do have a lot to say about the increased memory consumption of 10g but that's stuff for a post with a different title.

Saturday, June 28, 2008

Just Stupid

Well, I was actually going to write the second (and probably final) part of Light at the End of the Tunnel but some things are just too provoking (and I'm not at the end of the tunnel yet anyway).

My folks bought a new phone device today, probably the cheapest device in the whole store, but that's not really a justification for the piece of crap this device turned out to be.
What can be so terrible about a simple phone device? Well, I'll supply an example, and please keep in mind that this is after only ten minutes of interaction with the device. I'm actually a bit curious to discover what other flaws it has...
To the example then. My dad opens the box, connects the phone and tries to dial... nothing. Apparently someone had a "brilliant" idea: the device has a lock (see picture below) that somehow recalls the days when computers used to have locks (I didn't find an image of one, and didn't look too hard either). Anyway, the purpose of the lock is to block calls to certain numbers. It has three modes: dial any number, don't dial any number, and dial only numbers that don't start with a '0'. Those three modes are represented by three colors: red, yellow and green. At least it's obvious which color represents which mode, right? That's right, red is "dial anywhere" and green is "forget about dialing". Does anyone else think that's extremely stupid? If you take a close look at the picture you'll also notice that there's no real way to know which position the lock is in now (yeah, it's not green), very amusing. What adds to the amusement, considering the fact that the phone didn't come "red" by default, is that the Hebrew version of the user's guide doesn't have a single word about the lock.
And if that wasn't enough I have another example. The device has a speaker, so you'd probably expect to be able to dial a number, press the speaker button and have a conversation rolling, not with this one. Once you press the speaker button the number typed disappears and you're welcome to type in a new number, so you have to press the speaker button before dialing or use some commando techniques such as pressing the redial button after the number you typed has dissolved - that's a bit weird since you didn't really "dial" that number before, oh well.

A home phone device is supposed to be as simple as possible to use and trivial to understand for everyone, this one has definitely failed this test. Just Stupid.

Tuesday, June 24, 2008

Alerts

Last weekend I woke in the middle of the night to discover my arm totally numb, and when I say "totally" I really mean it - I couldn't feel it nor move it.
I don't know if I'm some kind of freak or maybe it happens to everyone occasionally, but it's definitely not the first time it has happened to me. What I usually do when it happens is grab my "dead" arm with my other arm and throw it off the side of the bed. I do it to hasten the blood flow back into my arm (I have no idea if this actually has any effect though). This is actually the creepiest part of the whole process, because when I grab my numb arm with my not-numb arm it doesn't feel like I'm touching myself, and the arm simply dangles like a dead piece of meat until blood flow is restored.
Up until now the arm has revived each time, but I always wonder what would've happened had I not woken up. I also wonder if what woke me up is some alerting (hence the title) mechanism of my body, and since I don't tend to wake up in the middle of the night this is probably the case.

What does this short tale have to do with system administration (except happening to a sysadmin)? Well, nothing really. But it did get me thinking that I should write a post about the different monitoring and administration tools I use to detect and analyze bad things in my system. Of course, I still care more about my own arm - can't type commands too fast with just one arm :).

Friday, June 13, 2008

Information Boom

I'll take another break from my upgrade aftermath tales to write about an issue I think many sysadmins face, and not only sysadmins. Handling information.

It seems that since I started this job I'm in a constant race to beat the accumulation of information, and I never seem to be on top.
When I started my training I had a notebook in which I wrote all my remarks. Since at the beginning you write down every minor note (although later it all seems trivial anyway), I soon started a second notebook. But notebooks aren't so efficient, so after starting the job for real I started to use OneNote. WOW! I can create plenty of pages in a single tab, this will definitely do the trick! But soon I needed more tabs and then more OneNote notebooks too.
To store documents my team (a team of sysadmins) uses a SharePoint portal. We used to have a folder hierarchy for our documents, but at some point it just got too messy, so we decided to just throw everything in the root folder (chaos indeed) and tag it. Then of course we created views and filters for the different tags.
We also have a Wiki to store and share important knowledge.

But even with all those mechanisms set in place it seems there's just too much information to handle:
I have my daily tasks to manage. Every new topic (regardless of how minor) I deal with is a page or a tab in my OneNote. A major upgrade or installation consists of my own notes (what I did, what errors I got), official installation/upgrade notes, known-bugs documents, new-features documents etc. Every problem has a related Metalink note or my own document describing how to fix it. And besides the EBS I'm also in charge of an OID server and an iAS server.
That's just a partial list and nothing in it can be skipped.
Both external (e.g. supplied by the vendor) and internal (written by the sysadmin) documentation are critical to every sysadmin - you can't memorize every parameter and functionality in your system and you can't expect to remember every error you handled three years ago.
But the problem starts when you need to retrieve this information. If you don't organize your information wisely, it doesn't matter that you have perfect documentation of how to fix a bug, because you just won't be able to find it. Too bad. I can think of many times when I was sure I'd seen something familiar before and had to start guessing where exactly - my OneNote? The portal? Maybe in my documents folder, from before I started using the portal? Maybe I never actually downloaded it from Metalink? Or maybe someone else showed it to me? Trust me, this can get pretty frustrating.
Having a good logic for storing your information is crucial but from my experience it's just not enough as information tends to outgrow any such solution.
As I see it, the only real solution is a powerful searching solution that has access to all the storage solutions you use.

I'm actually going to explore this field of enterprise search solutions so I might have more insight in following posts.

Friday, June 6, 2008

Light at the End of the Tunnel, Part I

This post should actually be titled "Upgrading Part IV", but my Roman math is not so good - I admit checking if I should write IV or VI - so I decided to stick to the simple numbers.

After bitching about it a lot, it's time I tell you how my inconsistent problem got solved.
This one was too much for me to solve on my own so I had an Oracle expert come over. After I described the issue I was facing, he said we should set up a trace on all client connections from the 8.0.6 home - we had to do this for all connections since we had no idea which concurrent would get stuck. It took some time before we realized that our pool of trace files was too small and that the trace files we were looking at no longer belonged to a stuck concurrent. After re-configuring the trace settings we finally found a trace file that belonged to a stuck concurrent.
Boy! The guy has patience. As someone on my team said, a good expert can be measured by the patience he demonstrates going over log and trace files.
Going over the entire trace file we discovered that at some point during connection establishment the process just stops: we could see that the client was waiting for some input and... nothing. Although it felt like we were closing in on the problem, it seemed that a long and tedious process through a TAR would be needed, since however good the expert I had on site is, it's just not his specialty, and it definitely looked like some internal bug.
Before leaving, the expert noticed I had NTS defined in my sqlnet.ora file for SQLNET.AUTHENTICATION_SERVICES. I explained I was using it to log on as sysdba without entering a password. He suggested I try removing it, so I did.
At this point we were towards the end of the day so I didn't expect any concurrents to get stuck anyway. But the next day the problem disappeared. I was thrilled, it seems the NTS setting was the problem all along! Thinking about it, there's no chance I would've thought of it as the root of all evil. I think the only reason the expert noticed it is that he's not used to dealing with installations on Windows; it seems to me that even he didn't believe it would solve anything and suggested it just as a shot in the dark.
Full disclosure: the expert suggested some other changes as well. Those were changes to DB parameters as specified in an Oracle document regarding DB parameters for EBS installations.
Those changes also took effect only on the next day (the day everything worked), so I might be wrong and what really fixed the problem was altering those parameters. Since I'm curious and somewhat suicidal I'll probably put NTS back and check if the problem recurs. I promise to let you know if my intuition here is wrong.
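
For anyone who wants to check their own sqlnet.ora for this setting, here's a minimal Python sketch. It assumes the parameter sits on a single line in the usual form SQLNET.AUTHENTICATION_SERVICES = (NTS); the function name and the sample file content are made up for illustration and are not taken from the system described here.

```python
import re

# Matches e.g. "SQLNET.AUTHENTICATION_SERVICES = (NTS)"; the name is matched case-insensitively.
_PATTERN = re.compile(
    r"sqlnet\.authentication_services\s*=\s*\(?([^)]*)\)?", re.IGNORECASE
)

def authentication_services(sqlnet_text):
    """Return the upper-cased services listed in SQLNET.AUTHENTICATION_SERVICES."""
    for raw_line in sqlnet_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        match = _PATTERN.match(line)
        if match:
            return [s.strip().upper() for s in match.group(1).split(",") if s.strip()]
    return []  # parameter not set in this file

# Made-up sample content (an assumption, not a real configuration).
sample = """
# sqlnet.ora
SQLNET.AUTHENTICATION_SERVICES = (NTS)
NAMES.DIRECTORY_PATH = (TNSNAMES)
"""

services = authentication_services(sample)
print("NTS enabled" if "NTS" in services else "NTS not enabled")
```

It only reports what the file says; whether NTS is actually behind a given hang is a separate question, as the disclosure above makes clear.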

The funny thing is that just after discovering this issue with NTS, I found another one, working on an entirely different thing. Try googling for "NTS sqlnet.ora".

Well, one problem down, one more to go. Probably will tell about it in my next post, although it's still not solved.

Saturday, May 24, 2008

Widening My Horizons

Well, I've decided to take a break from my series of upgrade related posts and write about something more fun.

This week I got a new 22'' (wide) monitor as a replacement for my old 17'' monitor. No doubt this is a great improvement, so many things look different now:
Most emails fit the screen without scrolling, I can accommodate more tabs in my Total Commander, I can comfortably rearrange my OneNote pages - it actually feels like each page is twice as wide, I can open two Terminal Services side by side and work with them in parallel, it's easier to read documents - less scrolling, and those are just the things I could think of on the spot after one week's work. I'm sure I'll discover many more advantages. Oh, the joy of having a new toy!

Of course, there's a downside as well (not that I really care, but...): I'm going to get spoiled.
I think it's relevant to many aspects of life: when you get a taste of something better, you can't really go back to what you previously did just fine with. And yes, I was pretty satisfied with my 17'', despite getting an occasional "man, how can you work with it?!" from guys with bigger monitors. And now I'm going to become such a guy.
It reminds me of when I was a kid and played Warcraft (the 1st one of course). I played it on my computer and was just fine with it, but then I was at a friend's house and he had a faster PC. After playing a few levels on his PC mine seemed so sloooow, it was impossible to play.

My conclusion is that such improvements are good only if you know you won't have to revert back to the previous situation, otherwise it's going to take a long time getting used to and might even be frustrating.
I guess I'm going to demand a 22'' monitor to be included in every contract I sign from now on...

Saturday, May 10, 2008

Upgrading Part III: Consistency

Well, I'm still battling problems after the database upgrade.
I think that the most annoying thing about the problems I'm facing is that they're not really consistent.

Of course, we'd all like a problem free world, but once we encounter a problem the first thing to do is to reproduce it, thus assuring ourselves we know what we're dealing with.
First you try to reproduce it on a development environment; if that doesn't succeed, you at least expect some consistency on the production environment.
When you don't have this either, you're in trouble. You might do something one day and hope to see results on the next day, and indeed the next day the problem still occurs but at a much lower rate, and you think to yourself "I must be heading in the right direction". But then the next day it's business as usual - the system is a total mess. Then you understand you don't understand anything. The next step is to try and reproduce the problem on your own terms.
In my case the issue is with concurrents randomly getting stuck. So I first tried to see if the problem is with all executables or is EBS related, starting with simple cmd scripts running non-stop and moving on to submitting multiple concurrents using CONCSUB - at last, one of my manually submitted concurrents hung. But I got no real insights from that either. No consistency whatsoever.
One of the main issues with inconsistent problems is that they're hard to describe - in a TAR for example - most directions for detecting the problem involve some major configuration problem, one that would prevent the system from working altogether, but remember, I don't have this kind of blessed consistency.
Another thing with inconsistent issues is that you come to work and hope things will go bad, so you can poke at the issue a bit more and not be left with that annoying voodoo feeling. Usually a sysadmin's hope is that nothing will go bad...
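
The "run something non-stop and see if it ever hangs" approach mentioned above can be sketched roughly like this in Python. This is not the script used here: the function name, the number of runs and the timeout are assumptions, and the stand-in command should be replaced with whatever you suspect.

```python
import subprocess
import sys

def count_hangs(command, runs, timeout_seconds):
    """Run `command` `runs` times; return how many runs exceeded the timeout."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                command,
                timeout=timeout_seconds,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1  # the child was killed after the timeout expired
    return hangs

if __name__ == "__main__":
    # Stand-in command that finishes immediately; swap in the suspect script.
    quick = [sys.executable, "-c", "pass"]
    print(count_hangs(quick, runs=5, timeout_seconds=30))
```

Counting hangs over many runs at least gives a rate to compare from one day to the next, which is about as much consistency as this kind of problem offers.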

Actually, this week I had some progress, so hopefully the next post will contain a bit of advice that can potentially save a serious headache to someone if he stumbles upon this blog while trying to resolve a similar issue.

Saturday, May 3, 2008

Upgrading Part II: Bad Job

As with any upgrade/major installation I've ever done (at least as far as I can remember), after upgrading the database to 10g the application has demonstrated some interesting errors. One of the problems is related to concurrents and the executables they spawn crashing in mid-air - I promise to tell about this in more detail when I myself have any idea what's going on.
At some point I suspected that the problem was with custom code running cmd scripts. Since some of those scripts are called from within PL/SQL code using a Java stored procedure, I thought that maybe I should try and use a "less custom" way to do that.

Luckily (or unluckily), I had just read the previous week about a new feature in 10g - the dbms_scheduler package, which is supposed to replace dbms_job and be much more powerful; for instance, it enables you to run cmd scripts. So I thought I'd try it out, since it sounded exactly like the built-in method I was looking for.
Well, I have only one thing I can say about this: it's better to leave a feature out of the release than to keep it in when it sucks, totally.

Really, my keyboard is still soaked with sweat from my efforts to run a single script that echoes some text.
I've already written about this but I guess the message didn't get through: if I'm supposed to start the Oracle Scheduler service to run jobs, then that's exactly what I expect to be written in the error message I get when not doing so - certainly not a "file not found" error. I also expect this to be written in any (well, at least some) documentation describing the new feature; that's not the kind of thing one is supposed to dig up only in forums. By the way, I read about a hilarious problem with similar symptoms while searching for a solution to my own issue (I actually tried the fix out, since at first I thought this was the problem I was experiencing): it appears that for some users (maybe in earlier 10g versions) just supplying a cmd script didn't work, they had to run cmd.exe with parameters. I can't even begin to understand how you manage to create this bug and release it.
Well, after completing the POC (if at the cost of my health) I got to the real thing: running a script with parameters. In some document I saw something like the following example:

-- n is the number of arguments the script expects
dbms_scheduler.create_job(job_name => 'JOB', job_type => 'EXECUTABLE',
    job_action => 'script.cmd', number_of_arguments => n);
-- one set_job_argument_value call per argument, positions 1..n
dbms_scheduler.set_job_argument_value(job_name => 'JOB',
    argument_position => 1, argument_value => '...');
...
dbms_scheduler.set_job_argument_value(job_name => 'JOB',
    argument_position => n, argument_value => '...');
dbms_scheduler.enable(name => 'JOB');
dbms_scheduler.run_job(job_name => 'JOB');

Well, maybe it's just my system that is a freak, but apparently the enable procedure erases the job. Exactly: I create a job, I run dbms_scheduler.enable and there's no job in the table - no error message, no nothing, I might as well have run drop_job instead. Well, apparently I don't need that line anyway. After some more struggle with cryptic error messages I got the package to do what I wanted it to do, but that's really not good enough.
I can't even start to imagine what kind of effort is needed to bootstrap all the advanced scheduling features - windows, chains etc.

I'm willing to bet money on the fact that most developers would have given up much earlier than I did, saying this stuff just doesn't work. You can develop useful and cool new features all you like, but if you can't cut/paste an example and see it just working, no one will use it; if all users get when something is wrong are misleading error messages, they'll just get frustrated (and you can see I am one such user).
I first read about this feature in a "10g Top 20 New Features" document, and indeed it sounded great, but if it's impossible to use, it's not really a new feature at all.
I'm sometimes not sure if I should attribute all those funny errors I deal with to the fact that my system is on MS Windows - a less common platform. Maybe I should. But that's not a good enough reason: if I'm in possession of a disk labeled 10.2.0.3 for Windows I expect it to work. I don't really mind having it released half a year later than the corresponding Linux version, I just want it to function.

As for me, I know about this feature, I can even make it work, but there's no chance I'll suggest it as a solution to any need unless as a last resort. Too bad.

Thursday, April 24, 2008

Upgrading Part I: Human Error

In my previous post I talked about a problem I experienced before upgrading my EBS database to 10g. Well, the upgrade has been performed and I think it has supplied me with ideas for at least three posts. Of course the bug experienced on the test environment didn't show itself on production (I intentionally didn't apply the fix, to see if it would recur), and what I have now is a very "nice" variety of other bugs. But why rush to the aftermath? Let's start with the upgrade process itself.

I had the whole upgrade procedure in soft and hard copies, had done about 4 upgrades on dev and test environments, and was pretty sure I would successfully handle any problem that popped up. Actually, many steps of the upgrade process came down to a Copy-Paste action from my upgrade document to the command-line. It's so frustrating that what eventually delayed me for about five(!) hours was a totally stupid mistake on my part.

Of course, before such an upgrade a backup should be taken. I have a script that runs every night: it brings down the system, a backup is taken (we use EMC's BCV) and then the system is brought up. All I had to do was re-schedule this script. Being a smart ass, I decided to save some time and disable the startup script, since I only need the database up for the upgrade process. I could think of plenty of ways to do this but decided that emptying the startup script would be the safest - this way I'm not touching the main backup script. So I made a copy of the startup script and... emptied the shutdown script. Yes, you read it right, no typo.
So the time comes, I'm watching the backup logs and WOW! That was fast. Usually it takes about five minutes for the shutdown process to complete and now the system is up after one minute, great! Hmmm... maybe not so great after all. So I take a closer look at the logs and discover what I have just done. It means that the BCV split was performed on an open database and is worth nothing (well, maybe not nothing, but it's a bad idea to start the upgrade with this kind of backup). Oh, I'm going to use some EMC terms but it's OK since many of them I myself discovered only during this upgrade.
Well, actually there's no reason to panic: the backup devices are mostly synchronized and all I have to do is shut down the database and take the backup again. I called the guy in charge of storage and asked him if I was right; he said I was and guided me through the process of taking the backup from the command line. But, alas, it's not that simple!
Apparently when EMC tells you that a split is complete, well, it's lying. The split process (again, if you have no idea what I'm talking about, just imagine some backup process) continues in the background, and until it's completed you can't take another backup. The more frustrating thing is that the storage guy didn't know about this (he's new, cut him some slack) and I didn't know how to check the progress of this background split. At some point - when only one disk was still splitting and the others were already re-synchronizing - I discovered how to check the progress of the split. Great, I'm at 99%, but boy, this 1% took its share of time to complete.
Phew! At least now I know where I stand. Well, not exactly. EMC has another "great" feature: when it comes to synchronizing the last parts of the disks it takes a totally disproportionate amount of time (remember the 1%), so when there were only 50MB left to synchronize the machine had to work forever for every 0.1MB, no kidding. So I didn't really have any estimate as to when I'd finally start the upgrade process, and I was already about 3.5 hours late. And this is downtime we're talking about here, we usually like to minimize that.
So I waited. I guess this is what they call "paying for your mistakes".

Eventually the synchronization completed and I got to the upgrade process. The upgrade was perfect, no unexpected errors and the entire process was completed (long) before scheduled time - thank God for buffers.

Well, a nice prelude indeed...

Wednesday, April 9, 2008

The Persistent Bug

My system has a bug that has been around since long before I became the sysadmin. It reincarnates each time in a slightly different form, but for me it's the same old bug.
I call it "the -107 bug" because its main symptom is an error message in the concurrent log that specifies a long number - -107... (I don't really remember all the digits, there are lots of them). It looks to me like a really big integer going over to the negative side or something like that, but I don't really know.
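
For what a "really big integer going over to the negative side" looks like in general, here's a small Python sketch of 32-bit two's-complement wraparound. It's only an illustration of the idea: the sample value is arbitrary and is not the number from the log, and nothing here confirms that this is what actually happens inside the concurrent.

```python
def as_signed_32(value):
    """Reinterpret the low 32 bits of `value` as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    # Values with the top bit set represent negative numbers.
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value

# Arbitrary example value, too big for a signed 32-bit integer.
print(as_signed_32(3_000_000_000))  # -1294967296
print(as_signed_32(42))             # 42
```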

I was first introduced to this bug after installing some major patch (maybe ATG_PF.H), when QA alerts started failing on every execution. Then other concurrents (with no pattern I could pinpoint) started to fail with this error as well - but not all the time.
I opened two separate SRs to handle those issues, and over time some actions were taken to improve the situation. Eventually a patch that was supposed to fix the problem for the concurrents pretty much fixed the problem for the QA alerts. In the last half year or so those errors became really rare.
But then, yesterday it hit again. During the testing session before upgrading production to a 10g database, QA alerts started to fail with this error (although sometimes doing their job), and eventually the errors stabilized to target only specific transactions. Getting desperate, I came to the point of telling my boss I believed the best action plan was to perform the upgrade anyway. I had several arguments:
1. This bug might skip the production.
2. We have a project that will alter the QA methodology to a new one in which this specific action won't be performed.
3. I've known this bug long enough to know that it won't be solved in a reasonable time frame.
4. Schedule considerations.

But something was different this time: the pattern was too consistent. Usually this bug is unpredictable and it had actually never been reproduced on a dev environment. Indeed, a few minutes after my conversation with the boss, one of the team members found the problem - total voodoo crap if you ask me.
The INSERT action this specific alert performs had a section that read "-1*&some_parameter"; changing it to "-&some_parameter" solved the issue.
OK, let's say the new and shiny 10g doesn't know how to digest a -1*x expression, but the alert won't fail when it's the only one triggered as part of a transaction (I don't really know the technicalities of this whole process so I hope my explanations are clear enough), it only fails when other alerts are triggered as well.

Anyway, I guess that's another score for me in the endless struggle against this stubborn bug. On the other hand I'm pretty embarrassed, I did tell my boss there's no way to solve the issue...

Saturday, March 29, 2008

Wake Up Call

Growing up is the process of discovering that all the great stuff you believe in is just fiction and illusion. Santa isn't real, your daddy doesn't really know everything, guys who write Tetris (Tris) for the iPhone don't really block the possibility of moving the pieces upwards and real-world software isn't really developed and tested like it's described here.

Actually, I've just experienced an example of what I'm talking about in the first few lines of this post. How do you place a link in a post? Very simple: you mark some text, push the link button and enter the desired URL. But then, everything you type after the link is considered part of the link. My way of overcoming this obstacle is to edit the HTML code directly to get outside the link tag; I'd really appreciate simpler solutions. Now, I AM using Blogger and it's pretty good, but this little bug is just pathetic.
When I was younger I really thought that these major software products (except Microsoft's, which were always a subject for mockery) are written and tested to the last bit and that they really do work - otherwise, why would someone use and even pay for them?
But when dealing with the internals of software like a sysadmin frequently does, you sometimes discover code sections that obviously weren't tested and maybe even not thought about for a single second. Just some of the examples I've encountered: code that exits instead of returning an error code, irrelevant localization code that slows down performance tremendously, code with zero error handling and much more.

It's a little sad, like every myth you discover to be only a myth.
But on second thought, that's why guys like me have a job.

Thursday, March 20, 2008

New Technology

Usually my tasks as a sysadmin are pretty standard: upgrade the database, fix the bug, deal with the backup issues and so on. While I might have not upgraded any production database yet, I still pretty much know what I'm dealing with.
But every once in a while I'm required to deal with something (almost) totally new. For the last couple of weeks my team leader and I have been working on implementing EUS - or more simply: database user authentication via OID using Kerberos. While we both pretty much know our OID, we haven't implemented this specific solution before (at least not one that worked :-)).
I really like those new challenges. I learn something new each day, but this kind of stuff makes you feel like you can add a new bullet to your CV. Like my team leader mentioned, it makes you want to work only on this new and shiny "toy" until you finally get it up and running and forget about the more mundane tasks. Unfortunately, those mundane tasks are often more pressing, and so we end up working between other tasks and making a little (if steady) progress each time. Well, life's tough. At least we ARE making progress...

Saturday, March 15, 2008

Overtraining

Metalink is one of the most useful tools for an Apps DBA: you can search it for articles or upgrade notes, download patches and search for bug resolution instructions.
The first time you search for an error message and get many results, you go over nearly every note. You don't really know how to distinguish the relevant from the irrelevant, and since you're probably new to the job you're afraid to miss the desired solution. Over time you develop a sense for what is relevant; even if the note doesn't mention the exact error message, you sometimes know that the issue is the same. On the other hand, just like with machine learning, you might get a bit too confident in your ability to distinguish relevant from irrelevant and miss valuable notes.

That's what happened to me this week. I was looking for a solution to a problem with an engineering form. I found a note that mentioned a patch that was supposed to install a new version of the form, but applying the patch didn't upgrade the form. So after doing a bit more searching I asked one of the analysts to log an SR (I wouldn't really know how to describe an engineering issue). Luckily, he was wise enough to search Metalink himself first, and indeed he stumbled upon a similar issue in which it was suggested to make a full installation of the engineering application. Now that I think about it, it's trivial and probably the first thing I should've thought about - a form might not get installed if the application is only in shared installation mode.
Anyway, this fresh perspective saved logging an SR and probably a couple of days' work.

Saturday, March 8, 2008

Less is More

This week I had to bring back to life an environment I'd kept "frozen" for more than a year because of a revived project, kinda like Demolition Man. I've cleared a server to host this environment and thought I had a creative idea:
Nowadays I use NetApp storage (through iSCSI) for this server, mainly because of the snapshot option, but this server also has a local disk that can accommodate the database. So I decided to use the local disk - I get to keep the environment I had on this server before, the robocopy is much faster and I won't need snapshots for this environment anyway, since it's for reference purposes only, not for development.

Do you know this guy Murphy?
A few days after the environment was up I discovered that a couple of ApEx applications that were developed on this environment weren't kept anywhere else, so I had to export them and then import them into the environment where the rest of the project will be developed. Yes, I know, someone has really bad development practices, but, as my boss told me, part of my job is to clean up the mess others make. So I tried to open ApEx - nothing, a blank page. Now, you see, when those ApEx applications were developed ApEx was relatively new. Actually, this old environment had an ApEx 2.0 schema (from when it was still referred to as HTMLDB) as well as an ApEx 2.2 schema, and the only version to ever reach production was 2.2, so at first I thought there might be some misconfiguration of the iAS server or something like that. After exhausting the few ideas I had regarding misconfiguration I tried to export the applications using APIs and got a strange error message about a missing datafile.
Well, you learn new things each day: it turns out that even if you can see the tablespace's definition and the package specs for packages in it, it doesn't necessarily mean that the datafile for the tablespace is present.
Apparently, when I recreated the control file for the database I didn't have the datafile for the ApEx tablespace in the template script (remember, it was new at the time). I'd expect the database to shout at me something like "man! you're missing a datafile" instead of just generating some nonexistent path for the missing datafile and going around acting like everything's OK.
God I wish I'd used the NetApp option.

I had two options:
1. Try to re-integrate the additional datafile into the database.
2. Restore the database again.
Now I get to the meaning of the title of this post. Re-integrating the datafile, if successful, would save precious copy time, and giving up on a challenge is against my nature (and against that of another DBA from my team who tried to help me). But during copy time I can do other things, and this re-integration thing didn't look like it had high odds of success, so we would probably end up with the second option anyway.
So, what I'm trying to say here is that being a good sysadmin doesn't always mean you have to solve every single problem. Sometimes you just have to pick the more efficient course of action even if it is the more trivial one, even if it means you leave a problem unsolved. In life you have to pick your fights wisely.

Tuesday, February 26, 2008

My Good Deed

Those of you who have keen eyes have noticed that I've added a new link in my favorite blogs section. I stumbled upon Aviad's blog this week when I encountered a problem while upgrading my EBS database to 10g. Actually that's the second time I've encountered this blog; the first time was when Steven Chan mentioned it in his "Notable Apps Blogs" list. That first time I was just proud to see an Israeli in Steven's list and didn't linger too much in the blog (although it did strike me as quite an impressive blog).
The second time, even after I discovered that the specific problem I had on my hands wasn't mentioned in the blog (although many others were), I did some more thorough browsing. The guy really does an excellent job; his blog might have saved me a lot of time if I had been aware of it before.
Anyway, this blog inspired me to stop complaining for at least one post and share some useful information, and the issue I had looks like a good place to start.

As mentioned, I'm in the process of upgrading my EBS (11i) database from a 9.2.0.5 database to a 10.2.0.3 database. During the upgrade phase (after the database software is already installed) I encountered an "ORA-12631: Username retrieval failed" error message. The problem, apparently, was that I didn't have a listener up and running. Having a listener up might seem very basic to some of you, but I was working with my own upgrade notes, which were compiled based on two successful upgrades. More than that, DBUA starts the database on its own and I'm pretty sure it did the same thing with the listener on my previous attempts. I don't really know what was wrong this time but I did get back to the original notes by Oracle to see if I might have misinterpreted something.
There's definitely some confusion here: Note 362203.1 (the EBS upgrade note) orders the stopping of any existing listener, while the database upgrade document (not EBS related) mentions the need to create a new listener using NetCA if I don't already have a listener. So, does the upgrade manual mean I should have some listener up? And if so, why does the EBS note tell me to stop all listeners?
Again, if the stopped listener was indeed the issue, I'm sure there was some more sophisticated way to phrase the error message.

Well, had some bitterness in my post after all.
At least this might be useful to someone.

Saturday, February 16, 2008

Power To The Numbers

Our IT department (which I'm part of) has an internal SharePoint based Wiki. Our aspiration is to make it, as Wikipedia is, the place you start from when looking for information. It's supposed to include all kinds of information: general technological issues like how to install a VM, more specific issues like how to change the infrastructure for an IAS and even non-work-related topics like takeaways to stay away from.
We're not at all there yet, and thinking about it I'm not sure we'll ever really get there. I can think of several reasons:

1. A Wiki page is not really that good a replacement for a real document, a Wiki page can contain summarized information and maybe a link to the original document, but unless you have trouble finding the document, you don't really have a reason to look in the Wiki in the first place.
When looking for some arbitrary subject on the internet it's not always trivial to find a good document about it, so you use Wikipedia. In an IT department, when looking for a documented issue it's simpler to just look for the document, since whoever documented the issue has probably written a document even if he wrote about it in the Wiki as well, and as mentioned, documents are better.

2. Specific issues like the mentioned example are too specific since they often address only one person, the one who wrote them down. Sure, people can forget and get replaced but still a document will do the job just as well.

3. The main reason as I see it.
A Wiki exemplifies the chicken-and-egg paradox. For people to use a Wiki you obviously have to have some content, but for people to add content they usually have to consider the Wiki a source of information people use; there's not much point in writing something if nobody reads it.
Well, Wikipedia DOES work.
But let's say you have a blank new Wikipedia. Right, according to my assumption no one will write anything, but after all, IT IS the internet. Even if one of every one thousand users writes something you get plenty of data, so now a higher percentage of the population is willing to donate their knowledge and so you get a plague-like effect where people use Wikipedia, find what they were looking for and add their own insights.
Now, let's get back to our department of about 30 people. Sure, we did write some pages at the beginning and even had a "Wiki week" in which we tried to add as much new content as possible, but as long as it's easier to ask the guy across the table or just browse for the right document, the Wiki usage seems pretty artificial. I've written some pages in our Wiki but I can't think of a situation where I'll actually think about looking there for answers.
That's why I'm trying to make the trainee document everything he learns in the Wiki. This actually achieves several goals: first, as a newbie he has a better sense of what should be documented; second, it's a good way to create a sort of "training Wiki"; and third, it makes him go over what he has learned - everybody wins.

On the optimistic side, maybe it'll just take time, maybe in several years our Wiki WILL be considered a central source for information.
We'll just have to wait and see.

Friday, February 8, 2008

Syntax Confusion

There are many programming languages, each with its strengths and weaknesses. And its own syntax. I don't really understand what it is good for. Sure, each language has its unique features, and a strongly-typed language will have different syntax from a weakly-typed language, but I'm not talking about this "advanced" stuff, I'm talking about the very basics. And there's no better way to show what I mean than an example:
You already know I'm not a programmer, but I write PL/SQL code for the application, cmd and Perl scripts for administration purposes and this last semester I took Computer Vision and Image Processing courses and had a lot of assignments to implement in Matlab. Four languages, four different ways to write a comment. At some point I got confused with all the '%','#','--' tags (REM stands out more clearly), is this really necessary? Why should there be three(!) different ways to write a simple "else if" statement? Ok, let's say "elseif" somehow simplifies parsing, but "elsif"?! That's really ridiculous, it just looks like someone was trying to prove his creativity.
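To make the comparison concrete, here is a small sketch that tabulates the comment marker and the "else if" spelling of the four languages mentioned above. Python is used only as a neutral notation here (it is not one of the four), and the `SYNTAX` table and the `distinct` helper are names made up for this illustration.

```python
# Comment marker and "else if" keyword for the four languages mentioned above.
SYNTAX = {
    "PL/SQL": {"comment": "--", "else_if": "ELSIF"},
    "cmd": {"comment": "REM", "else_if": "else if"},
    "Perl": {"comment": "#", "else_if": "elsif"},
    "Matlab": {"comment": "%", "else_if": "elseif"},
}


def distinct(feature):
    """Return the distinct spellings of one feature, compared case-insensitively."""
    return sorted({entry[feature].lower() for entry in SYNTAX.values()})


if __name__ == "__main__":
    for language, entry in SYNTAX.items():
        print(f"{language:7} comment: {entry['comment']:4} else-if: {entry['else_if']}")
    print(len(distinct("comment")), "comment markers,",
          len(distinct("else_if")), "else-if spellings")
```

Four languages give four comment markers and three else-if spellings, and Python itself would add a fourth spelling, `elif`.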

Why won't some standardization organization decide how the basic syntax (comments, loops, if-then-else statements) should look, and make the vendors - I don't really know how those things work - have the next versions of their compilers support the standard (unfortunately, backward compatibility is inevitable here)?

I don't really know if my idea is in any way feasible but this syntax confusion can get very... well, confusing.

Friday, February 1, 2008

Life According to Pinball

Since my days as a B.Sc. student I've had the habit of playing pinball while doing assignments or studying for a test. No, I'm not saying I'm solving questions with one hand and hitting the ball with the other.
It's just that after solving a hard question or completing a test I have the urge to somehow "reward" myself. Having a snack works as well but it's a bit more fattening. So when I want a short break (after all I have more questions to answer) I turn to pinball (Windows pinball). Well, I admit to playing spider solitaire for the same purpose but pinball has a big advantage: after playing for a while it's possible to play mostly on instinct, letting your mind wander around and sometimes creating a jogging-like effect. That's actually how I got the idea for this bizarre post.
I also think that pinball can be pretty educational. I'm not kidding (well maybe I'm not THAT serious).
Here are some lessons that I think can be taught playing pinball:

1. Think well about your options before leaping to the extreme solution.
Imagine the following situation: the ball is descending quickly and it seems like it's going for the middle (and your Center Post is not on). The reflex of most people is just to hit hard and fast with both flippers. WRONG! (at least most of the time). From my experience the best options are either to stay cool, since the ball might still hit one of the flippers, or to hit with only one flipper, hopefully diverting the ball onto the second flipper (remember, if you hit with both flippers the second flipper isn't there to catch the ball).

2. Timing is everything.
This one is pretty trivial in pinball, it's not enough to do the right thing, sometimes you must have perfect timing. It's not enough to hit the ball, if you're after the Reflex Award (and maybe completing the launching mission) you have to be very precise.

3. Live the now.
It doesn't matter if you've messed up the first two balls, you might still break a top score with your third ball. And vice versa, scoring 10 mil. with your first ball doesn't guarantee you anything about your next two balls.

Sunday, January 27, 2008

Looking At The Mirror (2)

About two weeks ago I started training a new Apps DBA (my replacement actually). However exciting, this mission (as I was warned) is very time consuming. It's not only the time spent on explaining what a 'system' means and the more specific niches of the system he will have to manage; as the new guy makes progress I also have to show him how to deal with day to day issues, how to apply a document to the system and how to deal with annoying developers.

Again, as in the first part of Looking At The Mirror, I find myself recalling my own training a couple of years ago and, as reminded by a friend, I remember being pretty sure there's no way I'll be able to really do something productive like my predecessor did, I thought I might be able to bring the system up at most. Well, as it is with those things, I was able to do a bit more than that. One day you just realize you're able to handle the system on your own and even more than just 'handle'.
By the way, talking about bringing up systems, today the new guy brought up two development environments on his own, I had to explain some things I've skipped before as he went along but eventually the environments were alive and kicking.

And that's another thing worth mentioning. When you know what you know, everything looks trivial. It's only when you have to transfer your knowledge to someone else that you realize how much experience you've accumulated and how tricky this whole system thing is.
The bottom line is that there's really no way of transferring all you know. You can pass on part of it in documents you've written, you can give oral explanations and tips, but eventually experience is the best teacher, and being thrown into the deep cold water of a system crash (which will probably happen the next time I'm on vacation - Voodoo part) is sometimes better than weeks of explanations.

I'm pretty curious to see how the new guy handles the real stuff. For instance, if everything goes on as planned, this week he will watch while I upgrade an EBS database to 10g and in the next upgrading phase I expect him to do it on his own (well, maybe with SOME help).
Things are definitely going to get interesting... for both of us.

Friday, January 18, 2008

Looking At The Mirror (1)

This week I've been interviewing candidates for a programmer's position.
Up until now I've always sat on the other side of the table, so it felt strange to be the one asking the questions and not the one trying to impress by answering them.

This is not a simple task.
When you grade a candidate at the end of the interview you (or at least I) not only consider his ability to answer your questions in a satisfactory way but, inevitably, also think about whether you would grade yourself high given the considerations you've applied. I think this shouldn't be a factor, and neither should the fact that you tend to sympathize more with people who have a background similar to your own. Well, we're all just human.
Actually the considerations I've applied in many of the interviews are even more "unprofessional":
The interview consists of two parts: in the first, an HR representative asks personality questions, and in the second the professional interviewer (me) asks professional questions. But actually, by the time we get to the professional part you've already formed quite a firm opinion of the candidate, since virtues like the ability to operate as part of a team or the ability to quickly grasp new technologies are all part of being a good programmer (and of most other technological positions as well). So sometimes the professional questions are asked only in order to validate what you've already realized about a candidate.
You can probably say it's not fair to decide on a person's future (a bit dramatic but true sometimes nonetheless) based on a first impression, that's true, but life's tough and in reality our interactions with other people are often based merely on first impressions. Besides, the interview has only limited time so first impression is mostly what you have.

Thursday, January 10, 2008

Time To Let Go

In job interviews you sometimes get asked about your bad qualities, and among other things I usually mention being a "too hard trier". It might seem like I'm trying to be a smart ass and slip in a good quality disguised as a bad one, but I really do think I sometimes try too hard.

I bet everyone knows this thing when you sit for hours trying to solve a problem just to get frustrated, go home and wake up with the now trivial solution the next morning. And that's exactly what I'm talking about, of course it's great to come up with a solution just when you're about to give up, but isn't it even better to skip the frustration part?
I don't really know the exact neurological explanation but a good night's sleep or even setting your mind to something else definitely triggers some processes in this head of ours that help us see the problem in a different light, and that's always a good thing. It's like when you try to remember the name of a song/movie/book and can't pinpoint it no matter how hard you try, but it will always pop into your mind later without the slightest effort.
This approach actually goes along well with what I've written in a previous post about counting on luck, but somehow when it comes to letting go and letting nature do its magic I seem to stay stubborn.

I'll try to continue my self-education, and my advice to everyone who has a similar issue is to do the same. Sometimes it just isn't worth the trouble (but sometimes it is...).

Friday, January 4, 2008

Just Tell Me What Did I Do Wrong

Know that thing when your girlfriend (or wife, or whatever) starts acting like a walking icicle because you've done something to upset her? So you also know that she will put you to the test of discovering your error yourself. No matter what exactly you did you will get the same chilly tone. Well, who understands women anyway?
But I expected code to be more friendly (it's mostly written by men after all).

OK, I get it, you supply me with a wonderfully written piece of code and if I'm unable to figure out the correct parameters or set the environment correctly then I'm the stupid one here and that's my problem to deal with. But really, isn't that what Error Handling is for?

Had a variety of error handling related issues lately, here are just some of them:

1. A few weeks ago I tried to install an R12 Vision database. After copying the installation CD (the very first phase) the installer window just vanished. The log did contain an error code but with no explanation whatsoever and I didn't find anything on Metalink, so I tried again and this time I saw that just before crashing the installer executes bootstrap.cmd. So I tried to run it myself and got a Java error stack with an error message saying something about "Unable to open a window" (I don't really remember the exact wording right now). With no other apparent choice I started decompiling the Java code together with my team leader (who used to be an Apps DBA as well). At last we discovered a code section where an error wasn't handled; pleased with ourselves, we logged an SR.
An unhandled error is bad enough, but apparently the real problem was that the Staging Area wasn't set correctly. That's definitely my bad, but how complicated is it to check the validity of the Staging Area before launching the installer, instead of throwing totally unrelated errors at me?
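Just to show how little such a check takes, here is a minimal sketch in Python. It is not Oracle's code: the function name and the sub-directory names in `EXPECTED_SUBDIRS` are hypothetical placeholders, since I'm not listing the real Staging Area layout here.

```python
import os

# Hypothetical sub-directories a complete Staging Area is assumed to contain;
# these are placeholders, not the installer's real layout.
EXPECTED_SUBDIRS = ["componentA", "componentB", "componentC"]


def check_staging_area(path, expected=EXPECTED_SUBDIRS):
    """Return a list of readable problems; an empty list means the area looks valid."""
    if not os.path.isdir(path):
        return [f"Staging Area '{path}' does not exist or is not a directory"]
    return [
        f"Staging Area '{path}' is missing the '{name}' directory"
        for name in expected
        if not os.path.isdir(os.path.join(path, name))
    ]


if __name__ == "__main__":
    # Report every problem in plain words before anything else gets launched.
    for problem in check_staging_area("stage"):
        print("ERROR:", problem)
```

An installer that ran something like this first could stop with a message that names the Staging Area, long before any window needs to be opened.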

2. I've noticed that it doesn't matter what's wrong: when running adcfgclone on the database tier it will probably fail with the same error stating something about "Error returned while creating control files" and an RW-50004 error code. If you're lucky the log file will have some clues, if not you'll have to start debugging Oracle's code. I got this same error for problems with an Oracle script which is part of the whole process, for problems with starting the database listener and even for a corrupted database, none of which are really related specifically to the creation of control files. And I'm pretty sure this error shows itself in many more cases. I think it's totally reasonable to expect an error message when the problem appears. Instead, all errors seem to propagate to a single line of code where adcfgclone just won't take it anymore.
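What I'd expect instead can be sketched in a few lines of Python. This has nothing to do with how adcfgclone is really built; the step names, `StepError` and `run_steps` are all made up for the illustration. The point is only that a failure gets reported at the step where it happens, with that step's name attached.

```python
class StepError(Exception):
    """Failure of one named step, keeping the original exception as the cause."""

    def __init__(self, step, cause):
        super().__init__(f"step '{step}' failed: {cause}")
        self.step = step
        self.cause = cause


def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first failure."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Report the failure where it happens, with the step's name attached.
            raise StepError(name, exc) from exc


def start_listener():
    raise RuntimeError("listener did not start")  # simulated failure


if __name__ == "__main__":
    steps = [
        ("run setup script", lambda: None),
        ("start listener", start_listener),
        ("create control files", lambda: None),
    ]
    try:
        run_steps(steps)
    except StepError as err:
        print(err)  # step 'start listener' failed: listener did not start
```

With this shape a listener problem is never reported as a control file problem, because the later steps simply never run.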

3. Had to modify the virtual IP (just the IP, host name stays the same) of my load balanced OID server - turns out that's not as simple as one would expect. From what I know up until now (haven't really resolved this issue yet) it seems that the IP is hardcoded in some configuration file or maybe even in the Meta-Repository database, which is odd. Maybe I'll write about it in another post when I have a better idea. Anyway, in the process I discovered a script that's supposed to do just that, it even has its own folder directly under ORACLE_HOME. This fact made me suspicious at once, since it makes this script look like a late add-on that wasn't given much thought. The code inside really lives up to the expectations: it seems no one has ever tested it. It looks something like this:

...
if
echo bad thing 1 happened
goto end
...
if
echo bad thing 2 happened
goto end
...
...
:end
exit

Doesn't take a cmd expert to understand that running the script as it is won't show you any error messages (unless you're a really quick reader) and it will actually close your cmd window (why?!).
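For contrast, here is a rough sketch of the behavior I'd expect, written in Python instead of cmd. The names `run_checks` and `main` and the check descriptions are made up for the example: the message goes to a stream that can be read or redirected to a log, and the caller gets back a distinct code per failure.

```python
import sys


def run_checks(checks):
    """Run (exit_code, description, predicate) checks; return the first failing code and message."""
    for exit_code, description, is_ok in checks:
        if not is_ok():
            return exit_code, f"{description} failed"
    return 0, "all checks passed"


def main(checks, stream=sys.stderr):
    code, message = run_checks(checks)
    # Write the message to a stream the caller can capture or redirect to a log,
    # and hand back a distinct code per failure instead of a bare exit.
    print(message, file=stream)
    return code
```

Whoever calls `main` decides what to do with the returned code, so the message is still there to read after the script has finished.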

Well, I think that concludes my case.