Friday Started Out As A Quiet Day …
Then the phone rang … it was Patrick asking me to remove the cluster between ServerA and ServerB. The reason? We purchased a lot of shiny new video conferencing gear for all our offices and it does all the video in hi def. Unfortunately for us, it doesn’t down-scale nicely and the CEO, CIO, and many other people high-up complained (quite loudly) that they didn’t like the pixelization that happened during their video-conferences with remote offices.
After checking with the vendor, their only recommendation was to reduce the amount of traffic between the head office and the data center. The subsequent investigation showed that the constant chatter of the cluster between ServerA and ServerB was taking the majority of the juice on the data line. ServerA is located in the head office and ServerB is located in our data center. The VPN entry points for a lot of our telco equipment and data lines are right in the data center and then go out to the remote offices.
Are you starting to see where this is going? So this is why I got the order to remove the cluster (not just disable it temporarily).
The Untold Consequences of Un-Clustering a Cluster During the Day
Well, right out of the gate, if you look in the admin guide … the procedure is extremely simple. Too simple. 13 years of doing Domino should have made an alarm ring somewhere in my brain but alas … I plowed ahead. Bad idea.
Basically the admin guide says that you just need to open your Admin client, go to the Config tab, then go to the Clusters view, select the servers to remove from the cluster and then hit the “Remove from Cluster” button. Simple, right?
Too bad the Admin guide doesn’t say anything about what happens next and what you should do to fix it.
Now, keep in mind that this probably will not happen to you IF you schedule it properly, do the change at night and then reboot the servers before the next business day. But in the event that you are under duress and need to do this during the day, learn from my mistakes and see what’s going to go wrong … and how to fix it.
Lesson 1: Freetime Lookups Gone Bad … You Ain’t So Free Anymore
Yep, Freetime lookups, which were driven by the “clubusy.nsf” within the cluster, go “bye-bye” a few minutes after the cluster stops. Is the Domino server smart enough to rebuild a “busytime.nsf” right away? Nope, not unless you reboot the server. So, how do you get around it during the daytime?
Simple: issue the following commands on *both* servers.
- RESTART TASK SCHED
- RESTART TASK CALCONN
- RESTART TASK RNRMGR
While those tasks restart, you will see the server rebuilding the busytime.nsf … which is good … but if you have a gazillion users on your server … the Freetime lookups might be offline for a few minutes … or hours.
For good measure, you should also issue these two commands after the dust has settled:
- TELL SCHED CHECK
- TELL SCHED VALIDATE
Btw, if you don’t do this, your users will receive some strange error messages when trying to check the availability of others like, for example, “Can’t find schedule record for requested user”.
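For those who like a checklist, here is a minimal Python sketch of the sequence above. It only builds and prints the console commands from this post so you can paste them into each server console in the right order; it does not talk to Domino at all. The function name and the overall shape are my own illustration (assumptions), only the commands and the server names come from the post.

```python
# Hypothetical helper: it only assembles the text of the console commands
# listed in this post. It does NOT connect to or control a Domino server.

RESTART_TASKS = ["SCHED", "CALCONN", "RNRMGR"]            # restarted first
FOLLOW_UP = ["TELL SCHED CHECK", "TELL SCHED VALIDATE"]   # once the dust has settled


def freetime_recovery_commands(servers):
    """Return {server name: ordered list of console commands to issue}."""
    plan = {}
    for server in servers:
        commands = [f"RESTART TASK {task}" for task in RESTART_TASKS]
        commands.extend(FOLLOW_UP)
        plan[server] = commands
    return plan


if __name__ == "__main__":
    # The commands go on *both* former cluster mates.
    for server, commands in freetime_recovery_commands(["ServerA", "ServerB"]).items():
        print(f"--- {server} ---")
        for command in commands:
            print(command)
```

Keeping the follow-up commands at the end of each list is deliberate: they only make sense once the restarted tasks have finished rebuilding the busytime.nsf.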
Phew … ok … problem 1 fixed (if you do this during the day). Moving along …
Lesson 2: Replication Failures Galore When Using the Cluster Name …
Shortly after “un-clustering” your cluster, you’ll see a bunch of error messages on your servers about replication timing out … you might scratch your head a bit because everything is fine with the network.
What’s the problem in this case? Well, I think it’s written somewhere in the Admin guide (or it’s a tip that’s floating around the Yellowsphere) but you can set the “destination server name” field in any Replication Connection documents to the name of the cluster. So your “spoke” servers could replicate with either server of the cluster when replication ran. That was awesome when you *had* a cluster … but now that it doesn’t exist anymore you’ve got replication errors … see where this is going?
Yep, the solution to this is to open your Domino Directory and re-check your Connection documents to make sure that the cluster name is not mentioned anywhere in the “Destination Server Name” field. Once you’ve done that … replicate the Domino Directory to all the spoke servers and issue a good old RESTART TASK REPLICA on each server …
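If you have a lot of Connection documents, eyeballing them gets old fast. Here is a rough Python sketch of the check, assuming you have already exported the documents to a simple list of records by whatever means you prefer. The field names ("source", "destination"), the cluster name "MailCluster" and the spoke server names are all made up for the example; nothing here reads the Domino Directory directly.

```python
# Hypothetical check over exported Connection document data.
# The record layout and all names below are assumptions for illustration.

def find_cluster_connections(connections, cluster_name):
    """Return the records whose destination is still the old cluster name."""
    wanted = cluster_name.strip().lower()
    return [
        conn for conn in connections
        if conn.get("destination", "").strip().lower() == wanted
    ]


if __name__ == "__main__":
    exported = [
        {"source": "Spoke1", "destination": "MailCluster"},
        {"source": "Spoke2", "destination": "ServerA"},
        {"source": "Spoke3", "destination": "mailcluster "},
    ]
    for conn in find_cluster_connections(exported, "MailCluster"):
        print(f"{conn['source']} still replicates with the cluster name: fix it")
```

The comparison ignores case and stray spaces on purpose, since hand-typed server names in Connection documents are rarely consistent.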
And now … the final lesson … the most painful one …
Lesson 3: The Support Calls … a.k.a. Your Phone Will Ring Off The Hook and Melt (… if you are Lucky)
If you had your cluster around for a while (ours had been enabled 3 years ago if my memory serves me right), your users have become accustomed to the fact that they could check their emails on ServerA or ServerB and it would be exactly the same all the time.
On top of that, the BlackBerry Server would, of course, check your “home” server all the time, but you could use the other server and no matter what you did, your BlackBerry and your Inbox always looked the same.
A few hours after you remove the cluster, the phone will start ringing. Trust me: it will.
No matter who calls you, the following lines (or a variation) will be said at one point during the conversation:
- “I’ve got emails on my BlackBerry that aren’t in my Inbox”.
- “Mister X says they’ve emailed me something but I haven’t received it yet … it’s been 30 minutes already”.
- “No matter how many times I replicate, there is always a 2 hour delay before I get my emails”.
- “When I go into iNotes, my emails aren’t the same as when I go into my Lotus Notes Inbox”.
What causes it? Well, most of those calls are because the employee’s mail file doesn’t point to the right “home” server (ServerB instead of ServerA for example). Or it’s because it’s not replicating with the right server (ServerA instead of ServerB).
In *all* those cases, it’s because the workspace icon, the bookmark (or whatever) is not pointing to the home server of the user and points to the “old” cluster mate.
What about that last line about iNotes? Well, in the case of that user, our iNotes server runs on ServerA and his home server is on ServerB … now that the cluster is removed, the good ol’ replication documents have a schedule of replicating every 30 minutes so there is a delay before his mail file is updated on the iNotes server.
Anyhow … what’s the solution to this issue? Well, if you have some sort of Workspace management tool like Desktop Manager from Cooperteam or MarvelClient from panagenda, I’m pretty sure you can fix it “remotely” for your users.
In my case unfortunately, it’s a manual process for each call … it’s sad but it’s life.
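If you can at least get a list of who opens their mail file where, the mismatch behind most of those calls is easy to spot ahead of time. Below is a small Python sketch of that comparison. It assumes a hand-made or exported list with each user's home server and the server their icon or bookmark actually points to; the field names and the sample users are hypothetical, and the sketch does not query Domino or the Notes client.

```python
# Hypothetical triage helper: flags users whose workspace icon or bookmark
# points at the old cluster mate instead of their home server.
# The record layout and the sample users are assumptions for illustration.

def find_mismatched_users(users):
    """Return (name, opened_on, home_server) for every user pointing at the wrong server."""
    mismatched = []
    for user in users:
        home = user["home_server"].strip().lower()
        opened = user["opened_on"].strip().lower()
        if home != opened:
            mismatched.append((user["name"], user["opened_on"], user["home_server"]))
    return mismatched


if __name__ == "__main__":
    users = [
        {"name": "User One", "home_server": "ServerA", "opened_on": "ServerB"},
        {"name": "User Two", "home_server": "ServerB", "opened_on": "ServerB"},
    ]
    for name, opened_on, home in find_mismatched_users(users):
        print(f"{name}: opens mail on {opened_on}, should point to {home}")
```

It won't stop the phone from ringing, but it gives you a list of people to call first.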
Conclusion
Pilots have this saying that goes “Learn from the mistakes of others … you won’t live long enough to try them all” so I hope someone, somewhere will learn from this blog post.
And I really hope that by sharing this, I won’t end up on the Worst Practices slides at Lotusphere 2011 …
Thanks for reading!
Marc