November 28, 2004
Getting The IP-PBX To Work: Fire, Ready, Aim
Part 1: "Fire!" or, "IP-PBXs are just Ethernet devices."
Implementing IP-PBXs is easy, but getting them to work consistently, all the time and for all the right reasons, isn't. That's not the answer most people want to hear, but the reality can be seen in the planning and initial homework that each organization completes before any purchase. Consider the tribulation and testimony of a real-life IP-PBX customer.
My customer is a professional office and has a remote site linked with a point-to-point T1 from MCI, Cisco routers at both ends, and a 3Com NBX in Maryland with a remote chassis at the far end (Virginia). The customer is what we call an "Orphaned Account," meaning the data VAR that sold them the system went out of business.
In October 2001, we agreed to take on the responsibility, with reservations, since our core strength lies in traditional telecom. Right away, the issues we encountered were numerous and the complaints were high.
Our contact at the customer was the office manager, "Ms. One." I'll refer to two successors of Ms. One later.
Ms. One was instructed to find out what it would take to make the phone system right. The original dealer that sold the NBX believed the marketing hype that converged cabling would save customers money, and that managing the NBX would be easier, too. The LAN consisted of hubs and one unmanaged LAN switch. All telephones used local power supplies, and the only protection on the NBX was a small uninterruptible power supply (UPS): no circuit (telecom), AC, Ethernet or other protection whatsoever.
The customer's data network was plagued by most of the common IP-PBX telephone and system issues. Problems were reported weekly and sometimes daily. On top of these was an older complaint involving the telephone lines connected to the system: lines would go into a Hung status, and eventually all lines would go Hung, leaving them useless for inbound or outbound calls until the system was rebooted.
Our solution was simple. Separate the IP telephones from the data network to increase network performance. To accomplish this, we installed separate cable drops where needed for the IP telephones. Then, we installed power protection using a managed UPS, managed 3Com LAN switches, 3Com Ethernet power supplies and separate cabling for the IP telephones.
We also hard-coded the speed and duplex on all the switch ports rather than leaving them to auto-negotiation. With VOIP, leaving auto-negotiation turned on in LAN switches invites duplex mismatches that create CRC errors, which increase as network traffic increases. Retransmissions of TCP traffic then increase, resulting in lost audio packets. The bottom line: auto-negotiation simply does not work well.
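To make the failure mode concrete, here is a rough Python sketch of how one might flag a port whose error counters suggest a duplex mismatch. The counter names, sample values and the 0.1-percent threshold are invented for illustration; they are not from any real switch API.

```python
# Hypothetical illustration: flag switch ports whose error counters
# suggest a duplex mismatch (the classic failure mode when one end
# auto-negotiates and the other is hard-coded). Names and thresholds
# are assumptions for this sketch.

def likely_duplex_mismatch(crc_errors: int, late_collisions: int,
                           rx_frames: int) -> bool:
    """A duplex mismatch typically shows CRC/alignment errors on the
    full-duplex side and late collisions on the half-duplex side."""
    if rx_frames == 0:
        return False
    error_rate = (crc_errors + late_collisions) / rx_frames
    # Anything above ~0.1 percent errored frames on a switched LAN is suspect.
    return error_rate > 0.001

ports = {
    "port1": {"crc_errors": 4_200, "late_collisions": 310, "rx_frames": 900_000},
    "port2": {"crc_errors": 2, "late_collisions": 0, "rx_frames": 1_200_000},
}
for name, counters in ports.items():
    print(name, "suspect" if likely_duplex_mismatch(**counters) else "clean")
```

Hard-coding both ends of every link removes the negotiation step that produces this mismatch in the first place.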
The next round of work was mostly programming and testing. We found numerous telephones with defective dial pads, handmade patch cords, and cabling mismatches, with terminations using AT&T 568A on one end and AT&T 568B on the other end.
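The two termination standards differ only in which pins the green and orange pairs occupy, so terminating one end 568A and the other 568B produces an accidental crossover cable. A small illustrative check:

```python
# Pin order 1-8 for the two TIA/EIA-568 termination standards
# (wg = white/green, o = orange, bl = blue, br = brown, etc.).
# 568A and 568B swap the green and orange pairs.
T568A = ["wg", "g", "wo", "bl", "wbl", "o", "wbr", "br"]
T568B = ["wo", "o", "wg", "bl", "wbl", "g", "wbr", "br"]

def is_straight_through(end1: list, end2: list) -> bool:
    """A patch cord is straight-through only if both ends use the same
    pinout; 568A on one end and 568B on the other is a crossover."""
    return end1 == end2

print(is_straight_through(T568A, T568A))  # True: valid patch cord
print(is_straight_through(T568A, T568B))  # False: accidental crossover
```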
Next, we attacked the old issue of Hung lines that had plagued the customer on their previous telephone systems. The customer's telephone lines are analog channels provided by an AT&T T1 through older channel banks. After a day of looking, testing and re-testing, we found that the "call forward disconnect" signal wasn't being sent to the 3Com NBX for the duration the system expected. If the call forward disconnect signal is not properly sent, the phone system will not release the call, and the line or channel goes into a Hung state, unavailable for inbound or outbound calling.
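The mechanism can be modeled in a few lines of Python. The durations below are invented for illustration; they are not the NBX's or the channel bank's actual settings, which you would need to confirm with each vendor.

```python
# Illustrative model of the "hung line" failure: the PBX releases an
# analog channel only if the far-end disconnect signal lasts at least
# as long as the PBX expects. All durations here are assumptions.

PBX_MIN_DISCONNECT_MS = 600   # what the PBX expects (assumed value)

def channel_released(disconnect_pulse_ms: int,
                     pbx_min_ms: int = PBX_MIN_DISCONNECT_MS) -> bool:
    """Return True if the PBX sees a long-enough disconnect and frees
    the channel; False means the line stays Hung until a reboot."""
    return disconnect_pulse_ms >= pbx_min_ms

# Old channel bank: short pulse, line goes Hung.
print(channel_released(350))   # False
# Replacement unit programmed for a longer disconnect: line released.
print(channel_released(800))   # True
```

This is exactly why a channel bank with a programmable, longer disconnect cured the problem: it moved the signal duration above the PBX's threshold.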
Ms. One gladly called AT&T, but fixing the problem took repeated calls, site visits (one by me followed by several by AT&T), and even 3Com forwarding detailed tech notes to me and then to the customer to pass on to AT&T management. The customer agreed to pay for AT&T's service call if they found this to be in error. AT&T replaced the older channel banks with an Adtran unit capable of a programmable, longer "call forward disconnect," solving the problem.
I don't fault AT&T or anyone else, but this series of events made it clear that telecom is changing, and so must we.
The Next Round Of Problems
By Jan. 1, 2002, the customer had a working telephone system requiring minimal support for programming and setting up some improved features and use of the dial plan. Then, in May 2002, our customer called us and their IT contractor to discuss implementing a remote office in Virginia. In August 2002, 3Com support and I spent most of the day providing the IT contractor with information to program the Cisco 1720 routers at the remote site. 3Com provided details above and beyond the call of duty, and what I'd deem more than acceptable support to convince the IT contractor that he needed more work on his end, programming the routers to support the IP telephones. My company checked the local cabling and found the same issues in the remote site as we'd seen in the corporate office regarding termination mismatches, i.e., mixing 568A and 568B.
In the months following, Ms. One moved to another company and Ms. Two came aboard briefly. Ms. Two required a lot of support in coming to terms with an "it's just an Ethernet device" telephone system. Ms. Two left for another company and Ms. Three took over and remains. (For her first-person account of the ordeal, see "I'm Ready, Ready, Ready! Will This End?")
Everything was stable for the next year until the customer relocated their remote office to a new site less than two miles away from the previous location. The relocation did not involve us, since it's just an "Ethernet device" telephone system.
From May 2003 until June 2004, two issues arose that couldn't be resolved even with repeated site visits and conference calls among Ms. Three (the customer), MCI, Cisco and 3Com support, the IT contractor and the voice contractor (yours truly).
The issues were noise and garbling of voice during calls, but only on calls between the far end and the host site, or calls that hopped off at the host site (head-end hop-off). The previous remote location had run fine with the same equipment and configuration.
We did the check/update and replace/double-check exercise, to no avail. We assumed our IP-PBX configuration was wrong. The IT contractor put in his hours and efforts, even replaced a router at the Maryland (headquarters) end, implemented numerous "test" configurations, and still found no solution. The local telco dispatched technicians at either end at various times; finally, after a lot of pressure from the customer, the telco did end-to-end testing.
MCI and the IT contractor agreed that the issue must be at the main office in Maryland, since I reported "delayed and out-of-order packets," and their logic was that Maryland is the source of transmission (originating packets). My reasoning, the opposite of theirs, was just plain basic logic: You just moved your "Ethernet device" without adequate protection, there were storms in the Virginia site's area over the weekend, and the following week there's a problem.
Let me say here that the converged demarcation is ugly: nothing really changes in the sense of responsibility, and it's definitely not easier when too many chefs are in the kitchen, so watch out. Ultimately the customer bears the burden, and until the problem is resolved, the finger-pointing doesn't subside.
Determining the cause in this particular case was hindered because the packet traces that 3Com requested proved fruitless, as did the system logs, network discovery and configurations. The Cisco routers at either end showed no errors of any kind (not even a dropped packet? I wondered). The MCI test center had nothing to report other than that the circuit was running "error free." The same symptoms were identified early on, and we went through the drill of checking, rechecking and hounding the configurations, questioning every detail: call detail records, reports from the users, wiring, software, hardware revisions, firmware revisions and everything else that went into turning a once-working configuration into a nightmare that just didn't want to end. The obvious was ignored, and why replacing the Cisco router in the Virginia office was such a big deal (other than that it showed no errors) remains a mystery.
The persistence of this "grey matter demarc" proved once again that the industry is going to go through another long and testy learning curve of VOIP and IP-PBXs.
We never found the precise solution, but this is what we discovered and reported to the customer:
The symptom was clear: "Virginia has poor audio: static, noise and garble on the receive end only of the WAN." The WAN connection is a point-to-point T1 (1.544 Mbps) with a Cisco router on either end. The cause and solution were not identified in a timely manner because everyone reported "all good on my end," even after several end-to-end tests, vendor meets and conference calls.
Seemingly small discrepancies were noted and corrected. For example, the original installation of the span was sloppy: The cable used was Category 3 unshielded and terminated on multiple blocks in a wiring closet feeding eventually to the riser and onto the smart jack in the basement. Soon after the T1 was installed, another local technician split one unused pair from the cable to run a new line for a fax. (A violation of basic installation practices.)
We did not expect the issue to go away once the cable was replaced with a shielded Category 5 drop; we simply wanted to remove one more possible source of trouble. (Note: Insist that the smart jack be installed in the same room as your IP-PBX when dealing with multi-tenant riser cables, and be sure to use Cat5 plenum or better.)
Another small fix: the remote hardware had been relocated without adequate power and circuit protection. The same configuration had worked without issue at the former location for nearly a year.
The issue was now isolated to the two Cisco routers and the span between them. In February 2004, MCI, with a technician at each site performing end-to-end testing, claimed that they made no changes; yet immediately after they left, delayed packets dropped from 7-10 percent to around 5 percent, as logged by Qovia.net, the hosted service we used for management and monitoring data.
The Virginia LAN gear (Cisco router) had not been swapped. In the previous remote location, there were only four 3Com telephones and one non-IP conference telephone. In the new location, we implemented a managed switch, eight 3Com telephones and the non-IP conference phone after the move. The issues remained before we deployed the new switch, after implementing the new switch, after adding the additional hardware, and when we added the remote chassis to provide local dial tone. Symptoms were present regardless of the configuration of telephone hardware.
Other Points Of View
As mentioned, the IT contractor and MCI support folks felt that the source of the problem was the host end (Maryland). However, replacing the host end Cisco router did not correct the problems.
By chance, the customer emailed our office alerting us to the issue, and we immediately logged into their site using Qovia.net. The UPS status showed it recovering from a power failure lasting approximately 30 minutes. The UPS also showed an overload condition of nearly 70 percent utilization, which is too high for a mission-critical telephone system; our other sites run no more than 30 percent utilization.
Within an hour of the power event, the customer's audio at the far end degraded. It turns out that, ever since the customer relocated, the UPS had barely been staying online, with a low output voltage condition. Not even Cisco routers are immune to low input voltage coming from an ailing UPS.
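This experience argues for a routine UPS health check along the following lines. The 30 percent load ceiling matches what our other sites run; the voltage tolerance and sample readings are assumptions for this sketch, not vendor specifications.

```python
# Illustrative UPS health check. Thresholds are assumptions for this
# sketch, not vendor specifications.

MAX_SAFE_LOAD_PCT = 30.0   # our healthier sites run no more than this
VOLTAGE_TOLERANCE = 0.08   # +/- 8 percent of nominal (assumed)

def ups_healthy(load_pct: float, output_volts: float,
                nominal_volts: float = 120.0) -> bool:
    """Flag a UPS that is overloaded or sagging on output voltage,
    either of which can destabilize the routers downstream."""
    overloaded = load_pct > MAX_SAFE_LOAD_PCT
    sagging = abs(output_volts - nominal_volts) / nominal_volts > VOLTAGE_TOLERANCE
    return not (overloaded or sagging)

print(ups_healthy(70.0, 104.0))  # False: overloaded and sagging
print(ups_healthy(25.0, 119.0))  # True: within target
```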
Once the routers were rebooted and then the IP-PBX rebooted afterward, the audio remained acceptable but still not stable or consistent for long periods. The router-to-router synchronization appears to have been an issue resulting from the overloaded UPS, but the noise/static problem remained until the router at the remote end was replaced and reprogrammed.
Voice quality immediately improved. Then, somehow, somewhere in between, with the IT contractor changing the router and software configuration, all the problems disappeared. Out-of-order packets immediately decreased from 7-10 percent to about 2-3 percent, which is what we've come to expect as normal for this particular IP-PBX. This was in June 2004.
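For reference, an out-of-order percentage like the ones logged here can be computed from packet sequence numbers roughly as follows. The sequence stream below is fabricated for illustration, and 16-bit wraparound handling is omitted for brevity.

```python
# Rough sketch of computing an out-of-order percentage from sequence
# numbers, in the spirit of the figures a VOIP monitor reports.

def out_of_order_pct(seq_numbers: list) -> float:
    """Percentage of packets arriving with a sequence number lower
    than one already seen (wraparound ignored for brevity)."""
    if not seq_numbers:
        return 0.0
    out_of_order = 0
    highest = seq_numbers[0]
    for seq in seq_numbers[1:]:
        if seq < highest:
            out_of_order += 1
        else:
            highest = seq
    return 100.0 * out_of_order / len(seq_numbers)

# Packets 3 and 6 arrive late: 2 of 10 packets out of order.
print(out_of_order_pct([1, 2, 4, 3, 5, 7, 6, 8, 9, 10]))  # 20.0
```

Sustained figures in the 7-10 percent range, as we saw before the fix, are far above the 2-3 percent this particular IP-PBX tolerates as normal.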
The initial poor installation practices, combined with the expectation that the IP-PBX is just an Ethernet device, helped my company to the tune of $20,000, but cost the customer not only that much money but also significant downtime, the expenditure of resources to manage a problem for 13 months, and disruption to their core business. The customer reports the initial purchase price of the originally installed system was approximately $40,000.
In addition, my company's resources were sorely burdened, as were those of all parties involved, all for something obvious that became so complicated. The IP-PBX may be "just an Ethernet device," but simplicity is relative.
Our next challenge was to replace the system hard drive that our Qovia.net logs revealed was failing. When it did fail, we restored the previous day's backup and discovered that the Ethernet protection device was missing from the Maryland site. The customer's contractor had felt it was unnecessary and had removed the protector some months before the hard drive failure. Unfortunately for the customer, within two weeks storms whipped through the area once again, and the new hard drive failed during a weekend.
Then, after a few months of running smoothly, on September 29 the call came from Ms. Three: "Virginia is down!" After 15 minutes on a conference call with Ms. Three (customer), MCI support (T1) and the Virginia customer contact, with me logged into Qovia.net, I easily isolated the cause: the UPS was offline due to, once again, the storms of the night before. In less than five minutes, we had the Virginia contact move the power cords from the failed UPS to the protector, bypassing the dead UPS and restoring local power. This restored the LAN and telephones until we could arrive to replace the UPS.
Not every story has a happy ending. There will be no shortage of service work in an industry segment begging for punishment. Customers will continue to be orphaned, vendors will walk, and manufacturers and dealmakers binding someone else's software to another's hardware will continue to create smoke.
There are many lessons herein. Discovery in the legal sense is finding out about all parties involved; in the same sense, discovery must be invoked when it comes to adopting an IP-PBX. Know the manufacturer, find out from other sources, and get to know your dealer/VAR and their work. Vendors that set customer expectations must then meet or exceed them, and there's no excuse for bad behavior.
Realistically look at the configuration before purchase, and fully exercise the decision making process. Many issues stem from the original configuration before delivery, when customers and vendors hash out the details of the planned installation. Giving the impression that the IP-PBX is "just an Ethernet device" leaves too much open and subjects the customer to unnecessary agony.
Next, decide when to call the telephone system dealer/VAR instead of using internal IT staff and IT contractors to move, relocate and/or reconfigure an IP-PBX. When implementing an IP-PBX across the network, pull together the necessary resources to ensure success. Then, in dealing with issues, always look at the obvious first and don't read too much into the problem until the obvious solutions are exhausted. Simple, basic troubleshooting steps, such as swapping out an unprotected router after a thunderstorm, should be a no-brainer and not a last resort.
Lastly, IP is here to stay, and there remains a lot of work to be done by everyone involved. The process must improve.
Matt Brunk is CEO of Telecomworx, specializing in telephony, traffic engineering and management. He can be reached at 301/865-8800 or [email protected]
Copyright Business Communications Review Nov 2004