Saturday, December 17, 2005

Operation Clean $h*t IV

Yeah, it's finally done (but not over yet). We passed the UAT and are done with the migration to the production system. The migration was carried out on Thursday morning (farking 12am in the morning), and the maintenance window we had for it was 12am to 12pm, 12 hours in total. So far this is the longest migration I have ever been involved in.

Unfortunately, there were some glitches on the KimChi team's side during the migration. As for my team, by 10am everything was up and running and had been tested completely. We could have finished much earlier, but our systems are dependent on the KimChi system, so we had to wait for them. While waiting, I was so tired that I just lay flat on the server room floor. Luckily the server room was not very cold, about 23C.

By 11am, the KimChi team was still having problems bringing up and testing the standby unit. It seems some configuration and a communication path through the firewall had been overlooked. That's what they claimed. By right, by that time we should have initiated the rollback sequence. The migration had failed. Even though they had managed to bring up and test the active unit, the failure to bring up and test the standby unit meant that the whole migration had failed.

After an urgent meeting with the customer, the decision was finally made to just let the system run without a standby unit. The Doraemon agreed; I think they were just too desperate to get the upgrade and the added features up after many delays. That's why they were willing to take the risk of running the system without a standby unit.

During the urgent meeting in the server room, KimChi promised to fix the problem with their standby unit that very night. I disagreed. Our team needed rest, and I thought their team needed rest too. A fatigued body and mind raise the chances of mistakes being made. Luckily one of the Doraemon members had the sense to agree with me. So the next maintenance window was approved for 12am - 2am on Saturday morning.

Now the KimChi team was supposed to bring the active unit back up, since it had been brought down in order to test the standby unit. I hate Murphy, especially when his law is proven again. When shit can happen, it will happen. There were some complications in bringing up the active system again. The KimChi team had to request that the maintenance window be extended by another 2 hours. That meant it was gonna be farking 2pm before the whole process was completed.

I have no eye see anymore. I decided to pack up my stuff and go back to the vendor room, where another group from my team was carrying out the testing. Asked another member of my team in the server room to pack up too. Told the KimChi team I was leaving, and that if they needed any more help they should call me and I would remotely access the system from the vendor room.

Luckily at 2pm they finally managed to bring up the system, with minimal time left for our team to test again. Still, we did some tests. Our service manager and I asked the rest of our team (3 of them) to go back first while we waited for the KimChi team to come back for a progress update and an informal post-mortem on what went wrong. Only after that did we go home. It was about 3pm when I got home.

We had thought, and were pretty confident, that the migration could be completed within the day itself. I had even gotten my team to organize a karaoke session that night, as a small celebration and a way to unwind after a very damn hectic three weeks. So I only managed to nap for a meagre 3 hours before I had to wake up and head to the K.

Unfortunately, halfway through the K I got a call from our service manager about an issue with the systems. I had overlooked something during the migration. Most of the functions worked perfectly fine, but I hadn't checked one minor redirection page (which I was never aware was required). So for the next hour plus, I was outside making phone calls to investigate the issue and get one of the KimChi team to help us remotely change a single line, and voila! The issue was resolved. Communicating with the KimChi team in engerish proved to be difficult.

Went back into the K room and continued with the booze till 2am. Because of the second round of migration, I decided not to go to the office yesterday and took a rest at home instead. At noon I woke up and got a few calls from a colleague who needed a device that I had with me. She agreed to come over to my place to pick it up. Went out for lunch with her and another colleague who tagged along with her.

Got back home at 3pm, and just when I wanted to take a nap, I got another call from the service manager. Things were not well. The system had been having problems around 2pm. Even though the KimChi team had managed to bring up the system again, I was still asked to go back and investigate what caused the problem. F*ck!!!!

Dragged my sorry arse back to the site, and found out that the problem might have been caused by the customer's load balancer. Had a small discussion with them on what could be done to avoid this problem in future, and I got some clarification on their load balancer setup.

Then prepared for the morning migration and went to have dinner. Luckily the second session of the migration went through pretty smoothly. Reached home at around 3am and hit the sack. I was supposed to be relieved, but I don't have that kind of feeling now. Probably the feeling has not sunk in yet, and it might be because there is still some work that needs to be done and delivered in mid-Jan next year.