Summary …
Last week I upgraded a partitioned server in one of our remote offices in the Asia-Pacific region, using Windows Remote Desktop, from Lotus Domino R7.0.2 to R8.5.1. Things went very smoothly during the upgrade but it took longer than I had planned. So, once the partitions were upgraded, I started them and let them run because I didn’t want to bust my scheduled maintenance window.
Because the upgrade took longer, I knew that I had to schedule some more downtime this week to be able to wrap things up and run a compact on the databases with the server down to upgrade them to the newest On Disk Structure (ODS). After some discussions with the IT folks over in the Asia-Pacific region, the next window of opportunity to schedule some downtime happened to be today, Friday, November 13th, 2009.
Right away, something *should* have clicked in my head but I guess I’m so amazingly tired that none of my usual paranoia alarms went off. C’mon we’ve got movies such as “Friday the 13th” that clearly illustrate that it’s a bad day to do anything important (let alone server maintenance) … so something should have clicked in my head but alas … nothing … so read on for the horror story … or skip to the “How did I fix it?” part to read about what went wrong and how to fix it if you run into it.
But What Did I Really Need To Do On That Partitioned Server?
So, if you are still reading this, you may ask, “Sir, what did you really need to do on that particular server?” And the answer is quite simple … I just needed to uninstall one of the unused partitions on the server and run a compact on some of the databases of the 2 other partitions to bring them up to the latest-and-greatest-omg-it-slices-and-dices-but-wait-theres-more ODS level (51). Simple enough, right? Nothing that would scare the pants off your usual run-of-the-mill Lotus Domino Administrator, as far as I know.
Ok, So … What Happened?
Wow, you’re still reading this? Thanks! Well, I “remote desktopped” into the server and uninstalled the unused partition and that went well.
I also ran a compact on the databases to upgrade them to the latest ODS … and that also went well.
I then rebooted the server to complete the maintenance (I always like to do reboots just to clear up the Windows Server memory) … and I waited. And … I waited some more. And … some more. After 10 minutes of waiting without being able to remote desktop back into the server, it was clear that something was wrong. I tried to “ping” the server but it would not even respond … I thought “oh my, Windows must have Blue Screened” …
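For what it’s worth, the “wait and keep retrying Remote Desktop” part can be handed to a small script. The following is only a sketch, not what was used that day: it polls the default Remote Desktop TCP port (3389) and gives up after a fixed wait. The function names, the 10-minute limit, the 15-second interval and the host name in the usage comment are all assumptions made for illustration.

```python
import socket
import time


def port_is_open(host, port, connect_timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


def wait_for_port(host, port=3389, max_wait=600.0, interval=15.0,
                  probe=port_is_open, sleep=time.sleep):
    """Poll until the port answers or max_wait seconds of sleeping have passed.

    Returns True if the port answered, False if we gave up.
    Only the sleep time is counted, not the time spent inside each probe.
    """
    waited = 0.0
    while True:
        if probe(host, port):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Hypothetical usage (the host name is made up):
#   if not wait_for_port("domino-apac.example.com"):
#       print("Still down after 10 minutes, time to find the ILO login")
```

A script like this only tells you sooner that something is wrong. It does nothing for you if the out-of-band login details are missing.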
And It Got Worse … Right?
Yep, it did but not in the way that you’d expect. Long story short, I had (politely) asked for the login information for the ILO (HP’s Integrated Lights Out) to be added to the list of ILO information in a database that we have where I work. That usually covers our lower-back-part in case Windows crashes because ILO allows you to remote control the machine via another interface. I assumed, and trusted, that people would have done that already because I usually bend over backwards for them when they ask me for something … and sadly, my assumption was horribly wrong because the ILO information wasn’t anywhere to be found!
Knowing full well that it was lunch time for me and midnight for the folks over in the Asia-Pacific region, I said “dammit (jim) this is an emergency, so I have the right to wake one of the guys over there up” … and I tried to call the cell phone of the LAN Admin who I knew had the ILO information. But … no answer! I then tried his home phone number: no answer either. I re-tried his cell. Still no answer. Plan B: I decided to punish … er, call another LAN Admin in the Asia-Pacific region … also no answer!
At that point, I knew I was in deep doo doo! Finally, I asked the Senior LAN Admin for the Americas region of the company that I work for to try to find this info. Lucky for me, after about 30 minutes, he managed to find an old reference to it somewhere in his emails!
Phew … You Had Access To The Machine … What Was Wrong?
Once we ILO’ed back into the machine, we saw that it was stuck at “Applying Preferences”. After some more waiting, we ran out of patience and rebooted it. Too bad for us: it got stuck at the same place! After 2 more reboots for good measure and one final reboot in Safe Mode with Networking, the Senior LAN Admin for my region figured out what was wrong: the server was freezing when it was trying to start the Lotus Domino Server partitions!
So, he set the Domino services to start manually, rebooted and handed control back to me while he went back to fighting fires in the Americas region.
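If you ever need to do the same thing without clicking through the Services panel, the switch to manual startup can be scripted around the Windows `sc config` command, where `start= demand` means “manual”. This is a rough sketch only: the helper names are invented, the service names in the usage comment are hypothetical (check the real ones on your own server), and it has to be run from an administrator session.

```python
import subprocess


def manual_start_command(service_name):
    """Build the sc.exe command that sets a service to manual startup.

    sc.exe expects "start=" and its value as two separate arguments.
    """
    return ["sc", "config", service_name, "start=", "demand"]


def set_services_manual(service_names, run=subprocess.run):
    """Set every named service to manual start; raise if any command fails."""
    for name in service_names:
        run(manual_start_command(name), check=True)


# Hypothetical usage (these service names are assumptions, not the real ones):
#   set_services_manual([
#       "Lotus Domino Server (Partition1)",
#       "Lotus Domino Server (Partition2)",
#   ])
```

With the services on manual, Windows can finish booting and you get a chance to start each partition by hand and actually see the error.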
How did I fix it?
Once I was back on the server (again, via Remote Desktop), I went to the Windows Services panel and started one of the Lotus Domino partitions. An error that I had previously run into instantly reared its ugly head … and what was the error?
It was the good ol’ “An error occurred during license use management initialization. Ensure that you are running Domino with a valid license file” error. You can read about the nightmarish upgrade where I ran into this error for the first time in one of my first blog posts here.
And the solution when you run into that nice “An error occurred during license use management initialization …” error hasn’t changed since I last ran into it: simply re-run the Lotus Domino R8.5.1 installer and it will fix it automagically (see the IBM technote here).
So, now that everything is back up again … I promise never to do Server Maintenance on a Friday the 13th ever again …