Saturday, October 28, 2006

Case Q41: Case of False Sense of Confidence (Final)

I don't know if this is bad luck or a misalignment of some planetary system. Just when I thought everything was fine and dandy, and I had given myself some breathing space, a storm was brewing over the horizon. Got a call from the service manager on the night of the first Raya day (Tues).

Out of the blue the system developed a problem again, after it had been running fine for a week since those eventful days. Or could this be nature's way of acting out and reaching for a state of equilibrium, sorting out its own kinks? At first I didn't pay much attention, as I was told it was just a hardware problem. My thought was: just replace the hardware and the system will be up and running. Anyway, there are four machines running, so the remaining three should keep taking the load, maybe working harder than usual, but still okay.

On the morning of the second day of Raya, I made myself some fried rice and stuffed myself silly with plenty of it. Knocked myself out after the meal. It wasn't until mid-afternoon, when I woke up, that I found a couple of missed calls on my mobile. I returned the call to the service manager. There was another crisis: the remaining three services on the three machines had decided to act up and refused to work properly. Darn!

He had escalated the issue to the platform provider, and the case was logged as Q41. After many calls placed to the top personnel of the platform provider, we finally managed to get them to set the severity of this case to the highest level. Due to the severity of the issue, resources from all over the world were roped in and the case received round-the-clock attention.

The personnel involved in this case are from Australia, China, Korea, Indonesia, Singapore, Malaysia, Pakistan and France, working according to time zone. By the end of the working day in one time zone, reports and case documents were prepared and handed over to the personnel in the next time zone to continue. This way, the case file had virtually circled planet Earth almost three complete times before we finally nailed it down last night.

Like a week earlier, I wasn't roped in by virtue of being knowledgeable in the domain of this issue. Being fully aware of that, I didn't really talk much in most of the conference calls, but rather just tried to ask certain questions which, in my opinion, might lead to the root cause of the problem and eventually a solution.

I had chosen to let all the experts lead the investigation, and to trust them completely (that was the best option for me, considering that the other option was to bury myself in all the documentation of the platform, which would take at the very best five days; we couldn't wait that long). Last night I discovered that this may not have been a good decision. All the earlier deductions and conclusions that I came to throughout the investigation and problem-solving process were based on my assumption that they know their system best. But it may also have been due to some difficulties in communication, as the personnel came from all sorts of diverse mother tongues and cultural backgrounds.

My initial and primary suspect was the application code. I had a bad feeling that the code was not correctly implemented, so my main push when I first joined the conference calls with the platform support team was to get them to review the code. And so that was done with, or so I thought. With a false sense of confidence that the code had been properly audited and all recommendations implemented, yet the problem still not going away, I trained my mind on other parts of the system that might be the root cause. Because of that false information, I made a few wrong deductions which delayed me from finding the correct solution.

One whole day was wasted looking for the problem in the wrong places. That was until, by a certain turn of fortunate events, I observed a certain pattern in the system's behaviour yesterday. I decided to throw away all the earlier deductions and restart the investigation from zilch, back to my primary suspect: the source code of the application. But this time I didn't try to find out what was wrong with the existing code; I couldn't, because I hardly knew where to look. I didn't have full knowledge of the API being used, and I didn't have time to study it all.

So I started all over again and, with some luck and maybe help from heaven, got hold of a set of example code that came with the platform. Instead of trying to understand the whole intricate API, I just used the example code as it was and, with my newly gained limited knowledge and insight, carried out tests on the live system (which is taboo in this industry, but fuck care... the system wasn't running anyway; either way we would be fried). Nine hours and nine iterations of the code later, we finally got the system up and running.

The implication of this action of mine is that the work carried out by the consultant in producing the problematic code went completely down the drain, except maybe for a few portions which I ported over to my new code. I guess I just have to put feelings aside in a crisis like this. I had given him many chances and opportunities to redeem himself by fixing his own code, yet when that was not forthcoming, I had no other choice.

Everyone let out a sigh of relief on the conference call last midnight, glad that we do not have to attend any more conference calls at ungodly hours. Yet we can't be certain that the new code is completely correct until the system has run for at least a few months without any problem. Still keeping my fingers crossed. Godspeed!