How collecting, correlating, and surfacing the proper data identifies and solves issues faster
December 15, 2015
Cheesy Business School Anecdote
There is an old business school anecdote about a gentleman, we’ll call him Nick Tar, who retired after working at the same factory for decades. No one knew the assembly line or machinery better than Nick. Within a week of his retirement, the assembly line ground to a halt. Something was broken and no one could find the problem. After days of searching for the issue and losing money with an idle assembly line, the company finally asked Nick to come in and help.
Nick Tar walked onto the factory floor, immediately approached one of the machines, pulled a piece of chalk out of his pocket, put an “X” on the machine and walked out. The factory foreman opened the machine and quickly found the problem and fixed it. The assembly line was back up and running within minutes.
A week later, the company received an invoice from Nick for $10,000. This seemed like a lofty price tag for what amounted to about 5 minutes of work, so they thanked Nick again for his efforts, but asked him to itemize the invoice so they could justify the cost.
Nick Tar sent an invoice with two line items:
• Piece of Chalk: $1.00
• Knowing Where to Put it: $9,999.00
Real Life Example – The Setting
A large enterprise customer in the New York metro area had been experiencing issues with outbound PSTN calls going over their SIP trunks for months. Users were complaining that calls were either taking forever to complete or failing altogether.
The customer’s on-premises UC deployment was managed by the services division of one US-based tier-1 carrier. A rival US-based tier-1 carrier provided the SIP trunks; definitely not the ingredients for collaborative troubleshooting. A poor little Session Border Controller (SBC) sat in the middle, routing traffic and doing what SBCs do.
The Finger Pointing
When users complained, the customer would reach out to their managed service provider, who assured them their environment was clean and that the problem must be the SIP trunks. When the customer reached out to the SIP trunk provider, that carrier assured them the trunks were fine and that the problem must be the customer’s internal environment. Getting to the bottom of the issues would mean recreating calls until one failed, and capturing all of the signaling along the way. Neither tier-1 carrier was especially motivated to spend the time to resolve it.
Several months passed with calls continuing to fail. User satisfaction was quickly devolving.
In this customer’s internal environment, Voice over IP (VoIP) calls passed through a Session Border Controller (SBC) on their way to the WAN connection that carried the SIP trunks.
SBC – The Session Border Controller (SBC) sits on the border between two environments (in this case the private enterprise network and the public Internet) to help control the signaling and media streams for VoIP traffic. Control includes setting up and tearing down voice or video calls. SBCs also provide security, protocol consolidation/routing, and various quality of service functions.
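SDP – Session Description Protocol (SDP) describes the media for a session: which codecs, IP addresses, and ports each side will use. SDP bodies are carried inside SIP messages, and SIP (Session Initiation Protocol) is the signaling protocol that sets up, maintains, and tears down multimedia sessions. SIP responses also communicate any anomalies/errors in the session.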
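For readers who have not looked inside one, an SDP body is just a short block of "letter=value" text lines. The sketch below is a minimal illustration in Python, not a full parser: the sample body, its addresses and ports, and the helper names parse_sdp and audio_summary are all made up for this example.

```python
# Hypothetical SDP body, as it might appear inside a SIP INVITE.
# All values (address, port, session id) are illustrative only.
SDP_BODY = """v=0
o=- 12345 1 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
"""


def parse_sdp(body):
    """Collect SDP lines into a dict of field letter -> list of values."""
    fields = {}
    for line in body.splitlines():
        line = line.strip()
        # Every SDP line has the form <letter>=<value>; skip anything else.
        if len(line) < 2 or line[1] != "=":
            continue
        fields.setdefault(line[0], []).append(line[2:])
    return fields


def audio_summary(fields):
    """Pull out where the audio should be sent and which payload types are offered."""
    conn = fields["c"][0].split()  # e.g. ["IN", "IP4", "192.0.2.10"]
    media = next(m for m in fields["m"] if m.startswith("audio")).split()
    return {"address": conn[2], "port": int(media[1]), "payload_types": media[3:]}


summary = audio_summary(parse_sdp(SDP_BODY))
print(summary)
```

The summary is the part that matters when a call connects but has no audio: it shows the address, port, and codecs that one side asked the other to use.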
Enter, the Nectar
Nectar was engaged to collect, correlate and surface the data necessary to identify the root cause. They needed to know where to put the chalk. First, Nectar deployed Perspective to generate synthetic RTP traffic to test the customer’s internal network. After a day, the solution was able to validate that all internal VoIP traffic was routing across the network as expected.
So, the internal network was not likely the issue. We turned our attention outward.
We placed a Nectar UC Diagnostics (UCD) analyzer so that it bracketed the SBC separating the internal (private) environment from the external (public) SIP trunks. Within an hour, the analyzer captured a number of failed calls. Within minutes, we were able to capture actual signaling, correlate the legs of a failed call, and surface a SIP ladder diagram. When you put it into a single sentence, it seems simple. Let’s unpack it a bit:
Nectar analyzer bracketing customer SBC
Capture – The Nectar UCD analyzer was capturing signaling information from EVERY ACTUAL call the customer’s users were making to the PSTN. We were not trying to recreate the issue on our own and capture only that call. Only with a device at the SBC could we capture every call and filter through to find those that failed. Again, actual calls by users that failed.
Correlate – Technically speaking, a call through an SBC to a SIP trunk is at least 2 sessions. There is the session on the internal private network and the session over the public external network. The Nectar UCD analyzer is able to correlate these sessions to know which external SIP call correlates with the user’s failed call from the internal network. Only with this correlation can you produce the actual signaling between the carrier and the SBC that corresponds to the poor user experience.
Surface – All the capture and correlation in the world is useless unless you can produce output that articulates the actual problem. Nectar’s Packet Capture (PCAP) generation and export to Wireshark gives carriers the information they need to troubleshoot any signaling errors they may be having that are impacting their customers’ user experience.
Sample Wireshark SIP Ladder Diagram Produced from Nectar PCAP
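The capture, correlate, and surface steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how the Nectar UCD analyzer works: the message records, their field names, and the matching rule (same called number, start times within a couple of seconds) are all hypothetical, and real traffic has complications such as retransmissions, CANCELs, and calls that time out with no final response.

```python
from collections import defaultdict

# Hypothetical, already-parsed SIP messages seen on each side of the SBC.
# "side" marks which SBC interface observed the message; values are made up.
MESSAGES = [
    {"t": 0.00, "side": "internal", "call_id": "a1", "to": "+12125550100", "line": "INVITE"},
    {"t": 0.02, "side": "external", "call_id": "x9", "to": "+12125550100", "line": "INVITE"},
    {"t": 0.05, "side": "external", "call_id": "x9", "to": "+12125550100", "line": "100 Trying"},
    {"t": 8.00, "side": "external", "call_id": "x9", "to": "+12125550100", "line": "503 Service Unavailable"},
    {"t": 8.01, "side": "internal", "call_id": "a1", "to": "+12125550100", "line": "503 Service Unavailable"},
    {"t": 20.0, "side": "internal", "call_id": "b2", "to": "+12125550111", "line": "INVITE"},
    {"t": 20.1, "side": "external", "call_id": "y7", "to": "+12125550111", "line": "INVITE"},
    {"t": 21.0, "side": "external", "call_id": "y7", "to": "+12125550111", "line": "200 OK"},
    {"t": 21.1, "side": "internal", "call_id": "b2", "to": "+12125550111", "line": "200 OK"},
]


def build_legs(messages):
    """Capture: group every message into a leg keyed by (side, Call-ID)."""
    legs = defaultdict(list)
    for msg in sorted(messages, key=lambda m: m["t"]):
        legs[(msg["side"], msg["call_id"])].append(msg)
    return legs


def final_status(leg):
    """Return the last final response code (>= 200) on a leg, or None."""
    codes = [int(m["line"].split()[0]) for m in leg if m["line"][0].isdigit()]
    finals = [c for c in codes if c >= 200]
    return finals[-1] if finals else None


def correlate(legs, window=2.0):
    """Correlate: pair internal and external legs (assumed heuristic:
    same called number and start times within `window` seconds)."""
    pairs = []
    for (side, cid), leg in legs.items():
        if side != "internal":
            continue
        for (other_side, other_cid), other in legs.items():
            if (other_side == "external"
                    and other[0]["to"] == leg[0]["to"]
                    and abs(other[0]["t"] - leg[0]["t"]) <= window):
                pairs.append((cid, other_cid))
    return pairs


def ladder(legs, internal_id, external_id):
    """Surface: render both legs as one time-ordered text ladder."""
    merged = sorted(
        legs[("internal", internal_id)] + legs[("external", external_id)],
        key=lambda m: m["t"],
    )
    rows = []
    for m in merged:
        is_response = m["line"][0].isdigit()
        if m["side"] == "internal":
            arrow = "UC <-- SBC" if is_response else "UC --> SBC"
        else:
            arrow = "SBC <-- Carrier" if is_response else "SBC --> Carrier"
        rows.append(f"{m['t']:6.2f}s  {arrow:16} {m['line']}")
    return "\n".join(rows)


legs = build_legs(MESSAGES)
# Keep only pairs whose carrier-side leg ended in a 4xx/5xx/6xx response.
failed = [(i, e) for i, e in correlate(legs)
          if (final_status(legs[("external", e)]) or 0) >= 400]
for internal_id, external_id in failed:
    print(ladder(legs, internal_id, external_id))
```

The point of the sketch is the shape of the problem: the internal and external legs carry different Call-IDs, so the failure the user experienced only lines up with the carrier’s response once the two legs are paired.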
It took Nectar’s UC Diagnostics analyzers about an hour to identify, capture, correlate, and surface all the errors that were causing user-impacting issues. Two tier-1 carriers had not been able to identify those errors over the course of months, even with all of the resources at their disposal.
Some monitoring tools companies, who coincidentally cannot produce an analyzer, argue that analyzers are unnecessary and that they can do everything without “probes”. I would encourage you to ask these companies how they would have captured, correlated, and surfaced the data necessary to solve this enterprise’s very real issue.
Anybody can look at the assembly line, tell you which machines do what, and agree it’s not working. It takes a special level of capture, correlation, and surfacing to know where to put the chalk to actually fix the issue.