Bug Story: Dig Deep

Vishal Chowdhary

Microsoft Corporation

November 2007

Sometimes, our tests fail during automation runs, and we do not seem to pinpoint the exact cause of failure accurately. We end up attributing the failures to "external causes," such as machine condition, configuration settings, and so forth. However, these failures actually might be the manifestation of a product issue that needs further investigation.

Say that your test timed out. Was it because the product that you were testing did not respond in a timely fashion? Rerunning the tests later to "make it pass" is usually not the right approach. A test is valuable when it fails and the failure uncovers a bug. The following bug describes a similar issue that we caught and fixed during the testing of Microsoft BizTalk Server 2006. It started out to be a simple machine-configuration issue, but it turned out to be much more than that.

When messages are processed by BizTalk Server (BTS), one can save the message body by explicitly choosing to track it. For example, one might choose to track all messages that are processed by an application. When a particular message is processed by this application, BTS will then mark the message for tracking. BTS uses two databases:

  • MessageBoxDb—Processes the live messages in the system
  • TrackingDb—Stores message bodies and properties of messages that have already been processed by BTS

When a message is marked for tracking, a Microsoft SQL Server job, which runs periodically, moves that message body from the MessageBoxDb database to the TrackingDb database. This job takes in as input parameters the server and database name of the TrackingDb database.

During Secure Deployment (SD) testing, when BTS is configured across firewalls and the config box is on a different domain from the SQL Server box, we noticed that the SQL Server job was failing to copy the tracked-message bodies to the tracking database. Upon further investigation, we found that, in this scenario, the firewall port mappings are different, which resulted in a misconfiguration of the job. The SQL Server job is assigned the IP address of the tracking server name on one side of the firewall, which isn't valid on the SQL Server side of the firewall. The bug that we had opened for this issue got resolved as a Doc issue; whereby, it was decided to document this issue as a KB article, with the resolution being to reconfigure the SQL Server job manually after configuration. (The tracked-message copy job must be edited, if NAT or SQL Aliasing is used with remote SQL Server.)

What troubled us is that the SQL Server job showed warnings instead of failure. Ideally, the Event log should have a failure event (event ID>20000?), instead of just a warning. Were we missing something here? What were the potential implications of this? We tried to reproduce the scenario in a normal environment, without any firewalls.

The tracking SQL Server was explicitly shut down, and then we let the application process messages. What we noticed was that, even after bringing up the tracking server after a while, the tracked-message bodies were still not getting copied to it. They were also no longer there in the MessageBox server, as the message had already been processed. The problem was that the message bodies had been lost from the system! This was a data-loss issue!

We now began a deep-down analysis of the problem. The SQL Server job seemed to be the culprit. Upon debugging, what we noticed was that the SQL Server job constructs the insert statements as strings and executes them. Now, if these inserts fail—which was happening in the SD test bed, because the configurations were incorrect—the stored procedure gladly ignores the error and marches ahead. Eventually, each of its inserts actually failed; but, at the end, it cleaned the message bodies from the MessageBox table. So, after an execution, the tracking information was cleaned and the data was lost. The fix was easy: After executing each of the insert statements, it should have checked the error explicitly by using the @@ERROR construct.

Thus, what started out as a firewall issue turned out to be a data-loss bug that could have happened very easily in a normal scenario in which the tracking database was inaccessible for any reason.

A couple of points about the bug:

  1. We had a negative test case for the feature, but we did not test it with tracked data inside the MessageBoxDb. Also, we were satisfied with the warnings in the Event log, where we should have inquired why it was not an error.
  2. The code for handling error scenarios was missing in the product.
  3. Most importantly, dig deep. Do not accept what you see at the surface and learn to debug the product code whenever it is required. If you sense that something is rotten, it usually is; hence, always look under the covers for the real issue.

About the author

Vishal Chowdhary is in his third year at Microsoft. He started with the BizTalk Server team, where he owned the new Operations OM API testing, along with parts of BizTalk Server tracking features. Presently, Vishal is involved with designing the test framework and test planning for future versions of BizTalk Server.