Designing a High Availability Asterisk Infrastructure at RMU

This past year, one of my largest projects was to take ownership of RMU's Asterisk-based IP PBX and upgrade it from version 1.2 Business Edition to 1.8 Open Source (Long Term Support). This was no small task: nearly every config file needed to be completely rewritten, including the University's entire dialplan. This was both a blessing and a curse. It took significant time, but it also allowed me (with some help from my boss) to rethink and reengineer the system from the ground up, ultimately providing a less buggy, more reliable phone system with more features for our users and for me as an admin. With the upgrade, we also wanted to provide guaranteed high availability for 98% of the phone system's functions. Below, I've detailed some of the ways we made that possible.

High Level Overview

Our Asterisk phone system consists of seven servers. One is an FTP server that serves phone configuration data, and the other six run Asterisk and handle calls in one way or another. Two of these six are dedicated to call routing to the PSTN: one is the primary and the other is the secondary. The remaining four are dedicated to what we consider the application layer of the phone system. One physical server handles Admin/Faculty/Staff endpoint registrations, call queues, menus, etc., and a second physical server handles student dorm room endpoint registrations through AudioCodes VoIP gateways. The other two are virtual servers in our VMware environment, one serving as the virtual counterpart of the admin/faculty/staff server and one as the counterpart of the student dorm server. Running Asterisk in a virtual environment isn't recommended by Digium (or by RMU), but in a worst-case scenario it can be done and will function to a moderate degree. These virtual servers contain exact copies of the Asterisk and DAHDI configuration files from their physical counterparts. Any settings that are machine/host specific are set via file includes that are not synchronized between servers.
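
As a rough illustration of the include approach, here is a minimal sketch using Asterisk's #include directive in sip.conf. The file name sip_local.conf, the context name, and the address are made up for this example and are not taken from our actual configuration.

```
; sip.conf (identical on the physical server and its virtual counterpart)
[general]
context=default
; Pull in host-specific settings from a file that is NOT synchronized.
; The included lines are read as part of the current [general] section.
#include "sip_local.conf"

; sip_local.conf (hypothetical file name; different on every host)
; 192.0.2.10 is a documentation address used as a placeholder.
bindaddr=192.0.2.10
```

With this layout, the synchronized files can be copied between hosts wholesale while each host keeps its own copy of the included file.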

Multiple Registrations Per Endpoint

As mentioned above, we have two servers for each application layer user group (admin/faculty/staff and students). The admin/faculty/staff endpoints register to both the primary and the secondary server, so if the primary were to go down for any reason, the phones would simply send their calls to the virtual secondary. The AudioCodes VoIP gateways in the student dorms are configured in a hot-swap proxy mode with homing enabled. This allows them to fail over to the secondary virtual server if needed, but they will always try to fall back to the primary if/when it becomes available again.

Separation of Call Routing and Application Layer Servers

Separation of call routing to the PSTN from the application layer servers allows us to handle issues at each layer independently. Because administrative inter-campus calls never even reach the physical call routing layer, if we were to experience a total failure at the routing layer, administrative inter-campus phone calls would continue to function. This separation also allows us to integrate seamlessly with our legacy Nortel PBX. Connections to the Nortel are handled at the routing layer and call flows between Asterisk and the Nortel work as expected with just a slight reduction in caller ID information.

Geographically Dispersed Call Routing and Multiple Trunk Groups

The secondary server at the call routing layer lives about 25 miles away from the primary because it sits in a secondary data center. It's connected via our MAN, which runs from our Moon campus to the city and back, following a different path in each direction. This geographic separation adds significant failover capability in case of a disaster in or near our data center on our Moon campus. We worked with our PRI provider to enable failover between our trunk groups so that if our Moon trunk groups were to go down for any reason, University calls would immediately begin flowing into our backup trunk group, whose lines are connected to the secondary server downtown. Our primary call routing server also includes PRIs from two different trunk groups that leave the CO from different switches. If our provider were to have a failure with one of its switches, we would still receive calls on the other trunk group.

SIP Provider as a Last Resort

We utilize a registration to a SIP Provider to prepare for a worst-case scenario in which both of our call routing servers would become unavailable or our PRI provider would have a total failure on all of our trunk groups. In this worst-case scenario, a limited number of calls would still flow into and out of the University.
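
For readers unfamiliar with how such a registration is expressed, here is a minimal sip.conf sketch. Every value in it (hostname, username, secret, context and peer names) is a placeholder invented for this example, and a real provider will likely require additional settings.

```
; sip.conf on a routing layer server (sketch; all values are placeholders)
[general]
; Register with the provider so inbound calls can reach this server.
register => exampleuser:examplesecret@sip.provider.example

; Peer definition used when sending calls out through the provider.
[sip-provider]
type=peer
host=sip.provider.example
defaultuser=exampleuser
secret=examplesecret
context=from-sip-provider
```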

High Availability Voicemail

Before the upgrade, if we had to fail over the application layer servers for any reason, voicemail message waiting indicators and subscriptions became confused because the servers weren't sharing a central storage location for voicemail. For this most recent upgrade, we decided to place our voicemail on an NFS-mounted volume located on Tier 1 storage. We process a very large volume of voicemail every day, so placing it on a Tier 1 volume just makes sense. We haven't experienced a single hiccup since, and failover between the physical application layer servers and their virtual counterparts results in absolutely no voicemail confusion.

Intelligent Call Routing

We made major changes to the dialplan code with the upgrade, and I wrote most of it myself. The separation of call routing from the application layer allows the application layer servers to simply hand numbers for which they are not authoritative down to the routing layer servers, which route them to the correct place. Without disclosing too much information, we also make heavy use of nesting and Gotos to avoid repeating dialplan. Take the following as an example. Instead of writing three lines per local pattern match (one for each routing layer server and one for the SIP provider), the solution shown below reduces the overall number of lines. (This has been modified for this blog posting.)

[local]
exten => _412NXXXXXX,1,Goto(dial-out,9${EXTEN},1)
exten => _1412NXXXXXX,1,Goto(dial-out,9${EXTEN},1)
exten => _9412NXXXXXX,1,Goto(dial-out,${EXTEN},1)
exten => _91412NXXXXXX,1,Goto(dial-out,${EXTEN},1)

exten => _724NXXXXXX,1,Goto(dial-out,9${EXTEN},1)
exten => _1724NXXXXXX,1,Goto(dial-out,9${EXTEN},1)
exten => _9724NXXXXXX,1,Goto(dial-out,${EXTEN},1)
exten => _91724NXXXXXX,1,Goto(dial-out,${EXTEN},1)

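[dial-out]
; Try each route in order: routing layer server 1, routing layer server 2,
; then the SIP provider. If a Dial() attempt cannot complete the call,
; execution falls through to the next priority.
exten => _[0-9].,1,Dial(${GLOBAL(SERVER1)}/${EXTEN})
exten => _[0-9].,n,Dial(${GLOBAL(SERVER2)}/${EXTEN})
exten => _[0-9].,n,Dial(${GLOBAL(SIPProvider)}/${EXTEN})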
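
To make the example easier to follow, here is a sketch of the two pieces it leaves out. The [globals] entries show one way the three variables could be defined (the SIP technology and the peer names are assumptions made for this illustration), and the [local-repetitive] context shows what the three-lines-per-pattern approach would look like for a single pattern.

```
; Hypothetical [globals] entries; the peer names are placeholders.
[globals]
SERVER1=SIP/routing-primary
SERVER2=SIP/routing-secondary
SIPProvider=SIP/sip-provider

; The repetitive alternative: every pattern carries its own three Dial() lines.
[local-repetitive]
exten => _412NXXXXXX,1,Dial(${GLOBAL(SERVER1)}/9${EXTEN})
exten => _412NXXXXXX,n,Dial(${GLOBAL(SERVER2)}/9${EXTEN})
exten => _412NXXXXXX,n,Dial(${GLOBAL(SIPProvider)}/9${EXTEN})
```

Written this way, the eight patterns in [local] would take 24 lines, compared with 8 Goto lines plus the 3 shared lines in [dial-out].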

There are numerous other examples found within our dialplan, but unfortunately I cannot share many of them. If there is a particular piece of dialplan that you're having trouble writing, please comment or send me an e-mail and I may be able to help you.

CDR and Queue Logging to MySQL

A massive improvement that came with the upgrade to Asterisk 1.8 was the ability to reliably store our call detail records and queue data in MySQL databases in addition to flat files on the servers themselves. This allows an enormous amount of analysis to be performed against the data. Our administrative users can also now use applications such as Loway QueueMetrics to monitor the performance of our call queues. As an admin armed with a bit of SQL, I enjoy the ease with which I can collect call and usage statistics from the CDR database.
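
As one possible way to wire this up, here is a sketch of a cdr_mysql.conf for the cdr_mysql module that ships with the Asterisk 1.8 add-ons. It assumes that module is the backend in use (other backends such as ODBC exist), and the host, database, and credentials are placeholders invented for this example.

```
; cdr_mysql.conf (sketch; all values are placeholders)
[global]
hostname=db.example.edu
port=3306
dbname=asteriskcdrdb
table=cdr
user=asteriskcdruser
password=examplesecret
```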

Summary

We analyzed every possibility for failure in our phone system and engineered ways to overcome each one. High availability for all of our systems is always our top priority, and our phone system is no exception. By using Asterisk, we're free to design and engineer a system that meets our needs and functions consistently and reliably within our environment. Considering the stability that Asterisk has provided us over the years, and especially now, it would be silly not to consider Asterisk for any new phone system deployment, even if integrating it with an older legacy system is a requirement. Digium provides technical support for the open source version of Asterisk for a modest fee, similar to how Red Hat provides support for RHEL. Digium also now offers what they call "Certified Asterisk," a branch of Asterisk supported by Digium for commercial SLA customers, which entitles them to certain guaranteed support offerings.

If you have any questions or comments, please feel free to comment on this post.
