Obviously you know you have a problem when the support desk phones start ringing, or link monitoring statistics begin to display wildly errant values such as high latency or degraded SNR. But what, or who, is the culprit? The wireless troubleshooting techniques described here should only be followed once you have ruled out problems at the tower side, such as flaky cabling, flapping switch or router ports, routing issues, false positives (i.e., an unusual but normal increase in user traffic), or unscheduled upstream or backhaul provider outages.
Following a logical troubleshooting procedure limits the downtime needed to test a wireless issue and helps you arrive at a solution sooner. So first, think about and note the following. By answering the questions below, you will narrow your attention to a single area. And as I always stress, if you feel overwhelmed, don't be shy about calling your favorite IT consulting firm for help.
- Is the problem affecting the whole network, or a specific tower, sector, or link? Or is a small group of customers in a single area affected?
- Are you aware of any recent configuration or equipment changes made to the affected portion of the network? If so, what are they?
- What's going on in the environment? What is the adjacent competition doing, if anything? Are you aware of any new construction that could be presenting a new obstacle within the coverage area or link experiencing the issue?
- Is there a certain time that the problem occurs or started to occur? Is there a pattern to its occurrence, or is it more random in nature?
1. Baseline analysis: check RSS, SNR, scan, latency, and performance baseline data (log files, MRTG, SolarWinds, etc.). Note the average value for each, as well as the dates and times of any dramatic changes in these values.
2. Changes: check with colleagues to determine whether any equipment or configuration changes have occurred since the times noted in step 1 above.
3. Customer feedback (placed at step 3, but you may want to move it further down the list, depending on your situation): if you're a WISP operator, you are already well aware of how slippery the slope can get when gathering customers' perspectives. However, this step should not be overlooked, especially if the problem is affecting multiple customers. Correlating affected customers' stories can quickly eliminate false information while solidifying the facts. Pay particular attention to asking customers about any wireless equipment in use within their own homes and immediate vicinity.
4. Check radio configuration: access the affected radio(s) using the connection method with which you are familiar (telnet, HTTP). Check the obvious areas such as firmware version, power level, frequency, IP configuration (address, gateway, DNS, SNMP, SNTP, etc.), modulation schemes (if applicable), SSIDs, authentication (static keys, RADIUS, TACACS), encryption, QoS, VLANs, etc.
5. Check radio operation and monitoring tools' data: review link activity (TCP/IP stats such as Rx and Tx packets) and errors (discarded and retransmitted packets), conduct wireless performance and ping tests, and collect current RSS and SNR readings; compare these to the baseline from step 1.
6. Spectrum scanning: this requires taking the production system offline for the time it takes to perform the scans. Using the radio's built-in spectrum analyzer is the most practical and unobtrusive means of sampling what's going on in the RF environment. If the radio does not include a spectrum analyzer but the installation includes an RF cable run into an enclosure or the customer premises, a stand-alone spectrum analyzer (Anritsu, HP) can usually be attached to the antenna with the correct adapters (be sure to compensate for any loss added to the path between the radio and antenna). The benefits of a stand-alone spectrum analyzer are its flexibility and real-time viewing.
7. Hardware: inspect all connections for corrosion, water ingress, tearing, bending, cracking, etc. If suspect components are found, begin swapping out the easiest pieces first (RF jumpers, connectors, filters, lightning arrestors, PoE injectors), retesting as per steps 5 and 6 above after each change. The last item to swap is the radio itself; verify that the correct configuration and firmware are loaded on the replacement unit prior to the exchange.
8. Parallel system testing: at this point only a couple of components remain to be tested: the tower or customer RF cable (LDF4-50, LMR400) and the antenna. Since these items are not easily swappable, parallel system testing makes sense. It is best to conduct this testing with the same equipment as is installed in the production system. However, many operators and field services use a pre-configured 'test rig' that is relatively easy to deploy, to save time. In that case, calculations should be done to compensate for the loss or gain in the path caused by the dissimilar equipment before comparing results to a baseline. Also, the active system's radio(s) should be disabled while testing with the parallel system, to prevent masking the results. If possible, it is highly recommended to perform tests in both vertical and horizontal polarities and at different elevations. If the parallel test system shows characteristics similar to the affected installed system, noise mitigation will be necessary. Make sure you have recorded all stats, scans, and other test results so you can determine the course of action (e.g., swapping out a faulty sector antenna) that best reduces the effects of noise or interference while minimizing downtime.
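The baseline comparison in steps 1 and 5 can be sketched in code. This is a minimal illustration, not a tool the article prescribes: the metric names and sample values are hypothetical stand-ins for whatever your MRTG or SolarWinds exports contain, and the two-standard-deviation threshold is an arbitrary example tolerance.

```python
from statistics import mean, stdev

def flag_deviations(history, current, tolerance=2.0):
    """Flag metrics whose current reading falls more than
    `tolerance` standard deviations from the historical mean."""
    flagged = {}
    for metric, samples in history.items():
        mu, sigma = mean(samples), stdev(samples)
        reading = current[metric]
        if sigma and abs(reading - mu) > tolerance * sigma:
            flagged[metric] = (reading, round(mu, 1))
    return flagged

# Hypothetical baseline samples and a current snapshot.
history = {
    "rss_dbm":    [-62, -61, -63, -62, -61],
    "snr_db":     [28, 29, 27, 28, 29],
    "latency_ms": [4, 5, 4, 5, 4],
}
current = {"rss_dbm": -74, "snr_db": 16, "latency_ms": 5}
print(flag_deviations(history, current))
```

Here the RSS and SNR readings are flagged as departures from the baseline, while the latency reading falls within normal variation, which is exactly the kind of correlation step 5 asks you to make.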
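The configuration review in step 4 is easier when the radio's current settings are diffed against a known-good export rather than eyeballed. A rough sketch, assuming you can dump settings as key/value pairs; the setting names and values below are invented for illustration, not from any particular vendor's firmware.

```python
def diff_config(golden, current):
    """Return settings that differ from the known-good ('golden') config,
    as {setting: (expected, actual)} pairs."""
    return {
        key: (golden.get(key), current.get(key))
        for key in set(golden) | set(current)
        if golden.get(key) != current.get(key)
    }

# Hypothetical exports from a known-good backup and the affected radio.
golden  = {"firmware": "5.6.2", "tx_power_dbm": 24, "freq_mhz": 5780,
           "ssid": "wisp-sector3", "encryption": "AES"}
current = {"firmware": "5.6.2", "tx_power_dbm": 18, "freq_mhz": 5780,
           "ssid": "wisp-sector3", "encryption": "AES"}
print(diff_config(golden, current))  # -> {'tx_power_dbm': (24, 18)}
```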
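Step 6's note about compensating for loss between the radio and antenna is simple dB arithmetic: each jumper, adapter, and arrestor between the antenna and the stand-alone analyzer attenuates the signal, so the losses are added back to recover the power at the antenna port. The loss figures below are hypothetical examples; use your cable and adapter datasheets.

```python
def compensate_reading(measured_dbm, path_losses_db):
    """Add back known losses between the antenna and the analyzer
    so the reading reflects power at the antenna port."""
    return measured_dbm + sum(path_losses_db)

# Hypothetical path: 10 m jumper (1.5 dB) plus two adapters (0.3 dB each).
reading = compensate_reading(-78.0, [1.5, 0.3, 0.3])
print(reading)  # -> -75.9
```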
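The dissimilar-equipment compensation in step 8 works the same way: shift the production baseline by the difference in antenna gain and cable loss between the production system and the test rig before comparing RSS readings. The gains and losses below are made-up example figures for a hypothetical rig.

```python
def adjust_expected_rss(baseline_rss_dbm,
                        prod_ant_gain_dbi, rig_ant_gain_dbi,
                        prod_cable_loss_db, rig_cable_loss_db):
    """Shift the production baseline RSS by the test rig's net
    gain/loss difference so readings are directly comparable."""
    delta = (rig_ant_gain_dbi - prod_ant_gain_dbi) \
            - (rig_cable_loss_db - prod_cable_loss_db)
    return baseline_rss_dbm + delta

# Hypothetical rig: 19 dBi panel vs. 17 dBi production antenna,
# 1 dB short jumper vs. 3 dB of LDF4-50 on the tower.
expected = adjust_expected_rss(-62.0, 17.0, 19.0, 3.0, 1.0)
print(expected)  # -> -58.0
```

If the rig then reads well below this adjusted figure, the cable or antenna is suspect; if it reads close to it, the noise-mitigation path the article describes is the likely next step.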