Monday, May 5, 2014

Juniper SRX High Availability (HA) & clustering: Part 1

SRX is an enterprise-grade firewall built by Juniper, one of the largest network equipment vendors. In our organisation we have used SRX extensively as a firewall and IPS device. One of its great features is the ability to run two devices in tandem as a cluster.

High Availability (HA)

Device failures, damaged links and bugs that crash the OS are only a tiny fraction of the things that terrorize a network/system administrator. Remember: things will fail. That is the mantra one has to live by. Systems will go down, and CPU spikes will crash your aggregation switch. This is why redundancy protocols such as HSRP (Cisco proprietary) and VRRP (an open standard) were introduced for routers. In the same spirit, Juniper switches make extensive use of Virtual Chassis to bundle switches together.

Juniper's SRX chassis cluster is a similar feature. Most of the features described in these articles have been tested on the SRX 550 and 650. Some details differ from one SRX model to another, so consult the manual for your model. I have posted important links at the bottom.

This particular article will discuss the theory behind clustering and how SRX uses fab and control links to manage a failover.

When can clustering be useful?

Protecting from routing engine failures:

You can configure the active routing engine role to shift from one SRX to the other. A routing engine can fail because of bugs in Junos. Sudden power surges or outages (like pulling the plug out of the SRX) can also take it down, and sustained high CPU load can freeze it.

Upstream and downstream link failures:

Suppose you are using an SRX as an inline firewall and the upstream link to your core router fails or becomes congested. Without clustering, all your traffic will be dropped.

Internet connectivity failure:

With the IP monitoring feature (not supported on every SRX) you can send ICMP probes to multiple IP addresses. If those probes fail, traffic is automatically shifted to the other SRX.

SRX redundancy

In the worst case, if an SRX device goes down entirely, the backup SRX assumes mastership and traffic automatically shifts to it. I have tested this, and it caused hardly any ping loss.

Before getting into the actual configuration there are a few terms that you absolutely need to know.

Control Links:

The control planes of the two SRX devices are synced over this link. Different ports are designated as control ports on different SRX models. For example, on the SRX 650 and 550, ge-0/0/1 is the control port. Remember that this port behaves as a control port only once you enable clustering.

Physically, ge-0/0/1 on one SRX has to be cabled to ge-0/0/1 on the other. Logically, once the cluster is formed, the interfaces of the second node are renumbered, so the link appears to run from ge-0/0/1 on SRX1 to ge-9/0/1 on SRX2. This is how SRX behaves. These are hardcoded values and cannot be changed. Once the devices are part of the cluster, the primary SRX continuously synchronizes all control plane information over this link.
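The renumbering of the second node's interfaces can be illustrated with a small Python sketch. It is only a model of the naming rule, not anything that runs on the SRX. The slot offset of 9 is an assumption taken from the ge-0/0/1 to ge-9/0/1 example above for the SRX 550 and 650; other models may use a different offset, and the function name is made up for illustration.

```python
import re

# Assumption: node 1's slot numbers are shifted by 9, as implied by the
# ge-0/0/1 -> ge-9/0/1 example for SRX 550/650. Other models may differ.
NODE1_SLOT_OFFSET = 9


def cluster_interface_name(physical_name, node, slot_offset=NODE1_SLOT_OFFSET):
    """Return the name a physical interface carries once it is in the cluster."""
    match = re.fullmatch(r"([a-z]+)-(\d+)/(\d+)/(\d+)", physical_name)
    if match is None:
        raise ValueError(f"unrecognised interface name: {physical_name!r}")
    if node not in (0, 1):
        raise ValueError("a chassis cluster has only node 0 and node 1")
    media, slot, pic, port = match.groups()
    slot = int(slot)
    if node == 1:
        # Only the second node is renumbered; node 0 keeps its names.
        slot += slot_offset
    return f"{media}-{slot}/{pic}/{port}"


print(cluster_interface_name("ge-0/0/1", node=0))  # ge-0/0/1
print(cluster_interface_name("ge-0/0/1", node=1))  # ge-9/0/1
```

Both control ports are the same physical port on each box; only the logical name differs, depending on which node the port belongs to.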

Note that only select models (high-end SRX) let you bundle two links as control links. On other models you can have only one control link. More about this in the next article.

FAB links:

Fab links (fabric links) are responsible for syncing data plane state, such as session information, between the nodes and for passing transit traffic across the SRX devices. Refer to the above diagram. If the link between SRX1 and VC1 has failed, all the traffic coming from RTR1 will reach SRX1, cross the fab link and pass downstream via SRX2.
Fortunately you can bundle two data links into the fabric link. The fabric link is defined through the fab0 and fab1 interfaces, one for each node. This will be configured in the next article.
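The detour over the fab link can be pictured with a short Python sketch. This is a simplified model of the scenario described above, not of the real forwarding logic; the function name and the dictionary of link states are made up for illustration.

```python
def transit_path(ingress_node, downstream_link_up):
    """Return the hops a packet takes through a two-node cluster.

    ingress_node: the node that receives the packet from upstream.
    downstream_link_up: maps each node name to True if its downstream link works.
    """
    if downstream_link_up[ingress_node]:
        # The ingress node can forward the packet downstream by itself.
        return [ingress_node]
    other = next(node for node in downstream_link_up if node != ingress_node)
    if not downstream_link_up[other]:
        raise RuntimeError("no working downstream link on either node")
    # The packet crosses the fab link and leaves through the other node.
    return [ingress_node, "fab link", other]


print(transit_path("SRX1", {"SRX1": True, "SRX2": True}))   # ['SRX1']
print(transit_path("SRX1", {"SRX1": False, "SRX2": True}))  # ['SRX1', 'fab link', 'SRX2']
```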

Redundancy groups

This is where Juniper SRX's HA feature scores heavily. A redundancy group is a set of objects and properties which move to the backup SRX in case of a failure. You get the flexibility to define what counts as a failure. It can be a link failing on either SRX or a ping check failing (known as ip-monitoring). Each redundancy group specifies which SRX node gets to be primary. For example, you can send traffic for one subnet via SRX1 by making SRX1 the primary node in one redundancy group. In another redundancy group you can make SRX2 the primary and associate a different subnet with it. This way each subnet can prefer a different node.
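As a rough illustration of the idea, here is a toy Python model of two redundancy groups that prefer different nodes. It is a sketch of the concept only, not of how Junos implements it: the class, its method names and the check names are all made up. The model also fails back as soon as the failed check recovers, which a real cluster may not do.

```python
from dataclasses import dataclass, field


@dataclass
class RedundancyGroup:
    """Toy redundancy group: a preferred node, a backup node and failure checks."""
    group_id: int
    preferred_node: str
    backup_node: str
    # Names of the failed checks per node, e.g. a monitored link or a ping target.
    failed_checks: dict = field(default_factory=dict)

    def report(self, node, check, ok):
        """Record the latest result of one monitored check on one node."""
        failures = self.failed_checks.setdefault(node, set())
        if ok:
            failures.discard(check)
        else:
            failures.add(check)

    def is_healthy(self, node):
        return not self.failed_checks.get(node)

    def primary(self):
        """Stay on the preferred node unless it is failing and the backup is healthy."""
        if self.is_healthy(self.preferred_node) or not self.is_healthy(self.backup_node):
            return self.preferred_node
        return self.backup_node


# One group per subnet, each preferring a different node.
rg1 = RedundancyGroup(1, preferred_node="SRX1", backup_node="SRX2")
rg2 = RedundancyGroup(2, preferred_node="SRX2", backup_node="SRX1")
print(rg1.primary(), rg2.primary())  # SRX1 SRX2

# A link monitored by group 1 fails on SRX1: only group 1 moves.
rg1.report("SRX1", "upstream link", ok=False)
print(rg1.primary(), rg2.primary())  # SRX2 SRX2
```

Each group carries its own checks, so a failure seen by one group does not move the other.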

Reth interface

A reth (Redundant ETHernet) interface is a pseudo interface which has a "child interface" on each node. This is important for maintaining high availability. A reth interface belongs to a particular redundancy group. All the traffic goes to the primary node of that redundancy group via the child interface configured on that node. Too hard to digest?

Refer to the network diagram above.

For high availability I need at least two interfaces, one connecting to each SRX node. Since these two interfaces serve a specific purpose (redundancy), Juniper groups them under a virtual interface known as a reth interface.

Remember that they are not bonded into a LAG. In a LAG all the member interfaces are used simultaneously. That is not the case here: traffic travels via only one of the interfaces. (We can have reth LAGs, but that is an advanced topic.)

Now how does the SRX decide which interface to send traffic through? It needs to know which node (node refers to an SRX) is primary and which is secondary. But that is exactly the purpose of a redundancy group. Thus, we make the reth interface part of a redundancy group and it inherits all of the group's properties.

For example, in the network diagram, if reth1 belongs to redundancy group 1 whose primary node is SRX1, all the traffic coming from RTR1 will travel over the leftmost link and reach SRX1.
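The reth1 example can be sketched in a few lines of Python. Again this is only a toy model: the class is made up, and the child interface names ge-0/0/3 and ge-9/0/3 are hypothetical placeholders rather than values from the topology.

```python
from dataclasses import dataclass


@dataclass
class Reth:
    """Toy reth: one child interface per node, tied to a redundancy group."""
    name: str
    redundancy_group: int
    children: dict  # node name -> child interface on that node

    def active_child(self, group_primaries):
        """Pick the child on whichever node is primary for this reth's group."""
        primary_node = group_primaries[self.redundancy_group]
        return self.children[primary_node]


# Child interface names are hypothetical.
reth1 = Reth("reth1", redundancy_group=1,
             children={"SRX1": "ge-0/0/3", "SRX2": "ge-9/0/3"})

# Redundancy group 1 is primary on SRX1: traffic uses the child on SRX1.
print(reth1.active_child({1: "SRX1"}))  # ge-0/0/3
# After a failover of group 1 the child on SRX2 takes over.
print(reth1.active_child({1: "SRX2"}))  # ge-9/0/3
```

Only one child carries traffic at any time, which is the difference from a LAG described earlier.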

If you are still confused, the configuration in the next article should clear things up.

I will be using the following network topology to set up a redundant SRX cluster which will automatically fail over if any of the upstream or downstream links fails. Some basic troubleshooting steps will also be discussed.

SRX provides a robust chassis clustering feature. It is fairly easy to configure once the key concepts are understood.

Important links:

1. Configuring SRX chassis clustering
2. Blog post on SRX clustering
3. Juniper techpub on SRX chassis clustering