Filed Under (Lotus Domino Server) by Marc Champoux on April-22-2010

To Make A Server Crash, You Must Find The Right Tool For The Job
   

This is going to be a short post. I have been dealing with a few server crashes recently. Some of them had to do with Tivoli; as for the others … well, the PMRs are still open. However, I got sick and tired of manually collecting the IBM_TECHNICAL_SUPPORT folder, the log.nsf, the notes.ini, the Event Viewer files and the WinMSD NFO file each time I opened a PMR for a crashed server.
   

So, I wrote myself a nice little batch file that does all that for me. Once I had the batch file done, I decided to try it out in the “Run this Script After Server Fault” field of one of the servers in the test environment. But, to test it out, I needed to “create” a server fault.
  

And here’s the gist of this post: for those who don’t know it, there is a small utility on the Lotus Developer Domain Sandbox, on a page titled “Utilities to crash client and server”. You can download it from here. The page might say that the platform is “AIX 64 bits”, but the zip file contains a build for every flavor of OS you can imagine. The version for Windows servers is at the root of the unzipped directory structure.
  

And what happens when you run it? Well, you get a nice “PANIC” error … your server crashes and NSD fires off. Simple is beautiful (in most cases).
   

So, for a guy like me testing out and debugging his “post-server-crash” script, this is very useful … and by posting it on this blog, maybe someone else will discover it too (hopefully not to wreak havoc in his own environment).
   

Side Notes
  

Well, truth be told, my script works fine when I run it manually from the command prompt. However, when the server fires it off … nothing appears on the screen but files are being copied to the right place by the batch file and then zipped. It’s all done silently.
   

The catch-22, however, is that I wanted my script to delete the log.nsf and run a fixup -q on the usual databases that go bonkers after a server fault (admin4.nsf, events4.nsf, mail.boxes, names.nsf, just to name a few). That way the server would come up a bit cleaner than when it restarts right away and has to catch up on the fixup of those databases while it’s trying to get back up.
  

So, because of that, I opened another PMR with Lotus Support to ask if there is some sort of switch or notes.ini setting that I should try to make my script run after NSD has done what it’s supposed to do but *before* the server is restarted … we’ll see what support has to say. However, if you are a guru when it comes to “scripts to run after a server fault”, please feel free to post your 2 cents in the comments section on why this is happening to me (and your ideas for solutions if you have any … thanks!).
  

Conclusion
  

Please use the utility responsibly! Don’t use it for your next April Fools’ prank … seriously, don’t. Friends don’t let Friends crash their Friends’ servers as an April Fools’ joke …
 

Thanks for reading!
  

Marc



Filed Under (Lotus Domino Server) by Marc Champoux on September-28-2009

Summary

 

Another late-night programming session … your vision is getting blurry and you ran out of Red Bull a few hours ago. Somewhere in your LotusScript code, there’s one line with a call to the “EndSection()” method of the NotesRichTextItem class … but for one reason or another you didn’t call “BeginSection()” beforehand (you thought you did but it’s late) … it doesn’t matter, right?

 

So you test your agent on the server and it Panics and Faults right away! The code doesn’t even go into your ErrorHandler routine (you have one, right?) … sooooo what gives?

 

Steps to Reproduce the Error

 

If you want to reproduce the error, simply create a new scheduled agent in a database on one of your Lotus Domino R8.5 Fix Pack 1 test servers (or on a production server if you enjoy the occasional lynch mob running after you around the office with torches and pitchforks – hey they say running is good for you) and paste this code into the Initialize section of the agent:

 

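On Error Goto ErrorHandler

 Dim Session As New NotesSession
 Dim NewEmail As NotesDocument
 Dim NewBody As NotesRichTextItem

 ' Create an in-memory document with a rich text Body field
 Set NewEmail = Session.CurrentDatabase.CreateDocument
 Set NewBody = New NotesRichTextItem ( NewEmail , "Body" )

 ' Deliberately no BeginSection call before this: this line triggers the fault
 Call NewBody.EndSection()

 Exit Sub

ErrorHandler:

 ' Never reached in this scenario: the server faults before the handler runs
 Print "An error occurred in the agent MCXTestAgent"
 Exit Sub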

 

Notice that there isn’t any call to the “BeginSection” method? Now, either let the agent run on its schedule and watch the server Fault OR issue a TELL AMGR RUN "YourDatabaseName.nsf" 'YourAgentName' command … and watch it Fault.

 

The Solution

 

While this is technically a problem with LotusScript and it should have gone into the ErrorHandler routine … it’s also, technically speaking, a problem with your code … i.e. you should have called “BeginSection” a couple of lines above somewhere in there. So just add the “BeginSection” call where it needs to be and enjoy.
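
For reference, here is a minimal sketch of the test agent with the “BeginSection” call in place before “EndSection”. The section title and the text inside the section are placeholder values chosen for illustration, not part of the original agent; the rest mirrors the code from the steps above.

```lotusscript
Sub Initialize

	On Error Goto ErrorHandler

	Dim Session As New NotesSession
	Dim NewEmail As NotesDocument
	Dim NewBody As NotesRichTextItem

	Set NewEmail = Session.CurrentDatabase.CreateDocument
	Set NewBody = New NotesRichTextItem ( NewEmail , "Body" )

	' Open the section first; "Details" is a placeholder title
	Call NewBody.BeginSection("Details")

	' Placeholder content that lives inside the section
	Call NewBody.AppendText("Text that belongs inside the section")

	' EndSection now has a matching BeginSection to close
	Call NewBody.EndSection()

	Exit Sub

ErrorHandler:

	Print "An error occurred in the agent MCXTestAgent"
	Exit Sub

End Sub
```

As with the faulting version, try it on a test server first, either on schedule or with the TELL AMGR RUN command.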

 

To be safe, I opened a ticket with Lotus Support to report this “behavior”. The support rep who called me back said he was able to reproduce the error quite easily and that he opened SPR #JSHN7WBRPM in regards to this issue.



Filed Under (Lotus Domino Upgrade) by Marc Champoux on August-11-2009

Summary …

 

This post is about getting faults like the following right after upgrading from Lotus Domino 7.0.2 FP3 to Lotus Domino 8.5. You can read the sad sad sad story below or simply skip to the solution at the end (and read the addendum if you want to cry).

 

FAULT REPORT: SERVERNAME/ORG (Release 8.5FP1 June 15, 2009) process nHTTP faulted at 08/08/2009 08:23:00 AM with ‘ACCESS_VIOLATION’

 

 

The Sad Sad Sad Story … (of cumulative epic failures) …

 

So … you’ve done your homework and configured a test environment that’s like your production environment. You’ve performed a test upgrade from your current version to Domino 8.5 with the Fix Pack 1. And the results? The upgrade of your test environment went super smoothly. Maybe too smoothly …

 

So, you confidently schedule some downtime and wake up early on a Saturday morning to upgrade your main hub. After running the Lotus Domino 8.5 installer and clicking Next a couple of times (after double-checking that the right paths are detected and that you are installing the right type of server) … the installation goes super smoothly. You also install the Lotus Domino 8.5 Fix Pack 1 on top just like in your test environment and once again everything goes super smoothly. Remember I wrote “maybe too smoothly” above? Read on …

 

Confidence runs high as you start your upgraded server for the 1st time … Services -> Lotus Domino (DLotusDominoData) -> Start …

 

Whoohoo … the console appears! You start to breathe again! And as usual, since this is your 1st server and the domain hub, you get asked to upgrade the Domino Directory and everything goes smoothly.

 

The minutes tick by as you look at your server upgrading the design of the system databases (ddm, mail.boxes, etc, etc).

 

Then … all hell breaks loose: the server faults! NSD runs! Panic ensues!

Once the server starts back up, you get an email … the fault message is this:

 

FAULT REPORT: SERVERNAME/ORG (Release 8.5FP1 June 15, 2009) process nHTTP faulted at 08/08/2009 08:23:00 AM with ‘ACCESS_VIOLATION’

 

You wonder what the h*ck went wrong … but the server comes back online, then it runs fixup on a couple of databases that were open (that server isn’t running Transaction Logging) and finally, it goes back to doing the usual post-upgrade tasks …

 

But then it faults again! With the same error too! So you try something: you edit the notes.ini and remove http from the ServerTasks= line just for the fun of it and start the server again.

 

Unfortunately, the server faults again … but this time the message is different:

 

FAULT REPORT: SERVERNAME/ORG (Release 8.5FP1 June 15, 2009) process nrunjava faulted at 08/08/2009 08:33:17 AM with ‘ACCESS_VIOLATION’

 

So … nothing makes sense anymore. You’ve never seen this error before (in 12 years). For the fun of it, you Google the error … nothing “good” really comes up. Sure, a few hits here and there but nothing that’s a 100% match to this issue. So, you check the IBM Support web site, the forums, planetlotus.org and even some of your favorite bloggers’ web sites … nothing … no solution to this weird problem.

 

But wait! You’ve got a support contract, right? So, you open a SEVERITY 1 ticket with Lotus Support via the ESR web site … which turns out to be an epic failure: the system says that it can’t give you your PMR number yet because it is being serviced/under maintenance or some sad excuse (the server is on fire maybe?). Unfazed by the epic failure of the Lotus Support ESR web site … you call up the 1-800 number to get some help. After a few minutes of punching keys on the phone to get to Lotus Support, you get to speak to a rep who will open the PMR. So, you speak with “Carol” (who happens to speak good English); she writes down your information, the topic for your ticket and the fact that it’s a SEVERITY 1 ticket (important fact). You start to see a glimmer of hope. But she waits until the end of the call to crush that hope by saying that she can’t give you a number either because their system is down. So, two epic failures in a row … it’s looking quite bleak at that point in time.

 

So, you politely ask Carol to escalate this to her manager and to find a way to get some help because this is a SEVERITY 1 ticket and you need help *now* … not in 4-5 hours like a Severity 3 or 4 ticket (it’s gotten that bad now that they are on call-back only). After what appears to be 30 minutes … you haven’t gotten a call back yet. So, you do what every good person-in-a-hurry does: you call back Lotus Support and, again, you speak with Carol, who tries her best to act surprised that you haven’t gotten a call back yet. After another discussion with Carol … you finally get a call back from a nice lady named “Regina Muller” … of Passport Advantage! What-the-h*ck? So you explain to Regina that you have no idea how she got involved in this but that she can’t help you with this fault issue. So … back on the line with Lotus Support to speak again with Carol. After what appears to be another 30 minutes … you get a call back from Mr. Patrick Rowan … the person who appears to be in charge of the Lotus Support helpdesk during that weekend. Phew … hope, which had completely fled the scene of the accident, starts to peek its head around the corner … it’s kinda hoping it’s not going to get hit again … but it never bothers to look behind it for the 18-wheeler barreling down at full speed!

 

During the conversation, Mr. Patrick Rowan proceeds to explain that he can’t give you a PMR number yet and that since they can’t give you a PMR number, they can’t ask anybody to call you back! What a sad excuse coming from a big company. So you explain to Mr. Rowan that this is a SEVERITY 1 problem, that you are down and that, since this is your main hub, LDAP isn’t running and it’s affecting other systems. Many other systems.

 

After what appears to be 1 more hour, Mr. Rowan calls you back … hooray! He’s got a PMR number for you … ##### branch 999 … in 12 years of working with Lotus Domino, this is the first time ever in your short life that you’re given a PMR number with 999 as the “branch” instead of 005. Deep down inside, you feel this is going to end badly. Anyhow, you decide to ignore that little voice screaming inside because finally … it feels like you’re going to get some help (and you’ve been on-and-off with Lotus Support for 2 hours so far).

 

Finally, at one point you get a call back from a gentleman named Mr. Dale Cole … and after sending him various NSD files and trying a couple of different things on the server … he can’t help you. So, Mr. Cole wakes up and pulls in another support rep named “Anur” (but you’re not sure about his name due to the accent and, sadly, he never mentions his last name). Both of them ask you to do various unholy things to your server … and nothing works. You are contemplating giving up and handing in your resignation letter (which you conveniently updated a few days before and stored on your network drive for such an occasion). You also consider declaring a disaster … but at this point you figure that with the combined brain power of 3 people on this conference call … someone will eventually find out what the h*ck is wrong with the server.

 

Sadly, that’s not the case … after trying so many different things … nothing works … so … the only solution left is quite simple (and sad by my standards).

 

The Solution … (the sad sad sad solution) …

 

After a lot of discussions with Dale and Anur … the only solution left is to:

 

(a) Back up the notes.ini and the server ID file.

(b) Uninstall the Lotus Domino server.

(c) Rename the left-over Lotus\Domino folder to Lotus\Domino-Old.

(d) Install Lotus Domino 8.5 and then install the Lotus Domino 8.5 Fix Pack 1.

(e) Copy back the notes.ini into the newly-recreated C:\Lotus\Domino.

(f) Edit the notes.ini to make sure that the “servicename=” line matches what’s in the registry.

(g) Start the server.

 

After this “clean” install … the server runs like a charm … no faults and no errors. Oh, I almost forgot to mention: http and runjava are also running fine.

 

Truth be told, Lotus Support was never able to tell me what went wrong or which file was “left-over” in the C:\Lotus\Domino\JVM folder and not overwritten by the Lotus Domino 8.5 installer. I took a look inside the C:\Lotus\Domino-Old folder that I still had on the hard drive but there were too many jar files left over … so which one was the cause? I have no clue. Why the 8.5 installer didn’t remove them or overwrite them is beyond me.
 

So … I hope this post helps someone somewhere.

 

Addendum

Wanna laugh? Or cry maybe? Remember the ticket that I *tried* to open via the ESR web site that was “in limbo” and that the system told me it couldn’t assign a number to it? Well, on Monday night … a full 2+ days after … I got a call from a lady (I didn’t get her name) from Lotus Support … she was asking me if my problem with our Lotus Domino 8.5 upgrade on our ZSeries server was fixed now! I explained to her that we don’t have any mainframe (ZSeries) and that I had opened a Sev 1 ticket for a Lotus Domino 8.5 upgrade issue on a Windows platform … and that a call back 48 hours later isn’t really good timing in my books. Oh well … it made me laugh.