One of the most deployed features in my experience with NSX-V was Cross vCenter NSX, which allows a multisite deployment.
To reduce licensing costs and gain automated failover, you can instead deploy a stretched cluster, requiring only a single vCenter and NSX license. This comes at the added cost of a stretched network, and potentially stretched storage, unless of course you use a DR solution such as SRM to fail over your vCenter and NSX Manager.
What prevented a lot of customers from deploying NSX-T until now was the lack of a fully functioning cross-site deployment.
NSX-T Multisite has been around for a while and was adopted by some, but it follows either the stretched cluster model or requires a manual backup and restore of the NSX Manager at the DC 2 site, so it has some limitations. That changed with NSX-T 3.0, which introduces NSX-T Federation. This is a game changer, and there is now nothing stopping customers from moving to NSX-T.
NSX-T Multisite is still a valid deployment option for NSX-T 3.0 and, as I understand it, only requires an Advanced license, so it's cheaper than going for a Federation build. However, it does have its limitations due to how Multisite functions.
NSX-T Data Center supports Multisite deployments where you can manage all the sites from one NSX Manager cluster. There are two types of multisite deployments that are supported:
- Disaster recovery
- Active-active

The diagram below shows a disaster recovery deployment.
In a disaster recovery deployment, NSX-T at the primary site handles networking for the enterprise. The secondary site is standing by to take over if a catastrophic failure occurs at the primary site.
All traffic egresses via the Primary site.
The diagram below shows an active-active deployment.
In an active-active deployment, all sites are active and layer 2 traffic crosses the site boundaries. Each site is configured with a Tier-0 and a Tier-1 gateway. VMs are connected either to the Primary site Tier-1 gateway and egress only out of the Primary site, or to the Secondary site Tier-1 gateway and egress only out of the Secondary site.
In both instances the system can be configured in two different ways: one allows an automated failover of the management and data planes, the other requires a manual/scripted failover.
Automated failover Management Plane
To allow an automated failover of the Management plane, the system must be configured as follows:
- A stretched vCenter cluster with HA across sites configured.
- A stretched management VLAN.
The NSX Manager cluster is deployed on the management VLAN and physically resides in the primary site. A single vCenter Server for management is also in the primary site.
If there is a primary site failure, vSphere HA will restart the NSX Managers and the vCenter Server in the secondary site.
All the transport nodes will reconnect to the restarted NSX Managers automatically. This takes about 10 minutes, and during this time the management plane is not available, but the data plane is not impacted.
The diagram below shows an automatic recovery of the management plane.
Automated failover Data Plane
To allow an automated failover of the Data plane, the system must be configured as follows:
- The maximum latency between Edge nodes is 10 ms.
- The HA mode for the tier-0 gateway must be active-standby, and the failover mode must be preemptive.
Note: The failover mode of the tier-1 gateway can be preemptive or non-preemptive.
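As a rough sketch, the requirements above map onto two gateway settings in the NSX-T policy data model. The gateway ids below are hypothetical, and this only shows the relevant fields, not a complete gateway body:

```python
# Sketch of the gateway settings the requirements above translate to
# (field names per the NSX-T 3.0 policy API; ids are hypothetical).
tier0 = {
    "id": "T0-Stretched",
    "ha_mode": "ACTIVE_STANDBY",     # required for automated DR failover
    "failover_mode": "PREEMPTIVE",   # required on the Tier-0
}
tier1 = {
    "id": "T1-Stretched",
    "failover_mode": "NON_PREEMPTIVE",  # Tier-1 may be either mode
}
print(tier0["ha_mode"], tier0["failover_mode"], tier1["failover_mode"])
```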
Most of the configuration is done via the API, including the creation of Failure Domains and an Edge cluster that is stretched across sites, such that the cluster has Edge nodes EdgeNode1A and EdgeNode1B in the primary site, and Edge nodes EdgeNode2A and EdgeNode2B in the secondary site.
The active Tier-0 and Tier-1 gateways will run on EdgeNode1A and EdgeNode1B.
The standby Tier-0 and Tier-1 gateways will run on EdgeNode2A and EdgeNode2B.
Each Edge node, and subsequently the Edge cluster, is then associated with the relevant Failure Domain.
Finally, the Tier-0 and Tier-1 gateways are deployed via either the API or the UI.
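The Failure Domain part of the API work can be sketched as below. The endpoint paths follow the NSX-T Manager API; the manager address and the idea of returning the calls as tuples instead of sending them are my own illustration, and a real script would send each call over an authenticated HTTPS session (and, for the transport-node PUT, would send the full node spec, not just one field):

```python
# Sketch only: builds (method, url, body) tuples for the Failure Domain
# configuration described above; nothing is actually sent to a manager.
import json

NSX = "https://nsx-mgr.corp.local"  # hypothetical NSX Manager address

def build_failure_domain_calls(sites, edges_per_site):
    calls = []
    # One Failure Domain per site.
    for site in sites:
        calls.append(("POST", f"{NSX}/api/v1/failure-domains",
                      {"display_name": f"FD-{site}"}))
    # Associate each Edge node with its site's Failure Domain. In practice
    # this is a PUT of the whole transport-node spec; only the relevant
    # field is shown here.
    for site, edges in edges_per_site.items():
        for edge in edges:
            calls.append(("PUT", f"{NSX}/api/v1/transport-nodes/{edge}",
                          {"failure_domain_id": f"FD-{site}"}))
    return calls

calls = build_failure_domain_calls(
    ["Primary", "Secondary"],
    {"Primary": ["EdgeNode1A", "EdgeNode1B"],
     "Secondary": ["EdgeNode2A", "EdgeNode2B"]})
for method, url, body in calls:
    print(method, url, json.dumps(body))
```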
The diagram below shows the automatic recovery of the data plane.
Since the Tier-0 and Tier-1 gateways are configured in Active/Standby mode, they fail over automatically during a Primary site failure.
Manual/Scripted failover Management Plane
To allow a manual/scripted failover of the Management plane, the system must be configured as follows:
- DNS for NSX Managers with a short TTL (for example, 5 minutes).
- Continuous backup.
vSphere HA, or a stretched management VLAN, is NOT required.
Each site has its own management vCenter Server, which stays in that site. These vCenters are added as compute managers to NSX-T.
NSX-T Managers must be associated with a DNS name with a short TTL.
All transport nodes (Edge nodes and hypervisors) must connect to the NSX Manager using their DNS name. To save time, you can optionally pre-install an NSX Manager cluster in the secondary site.
The recovery steps are:
- Change the DNS record so that the NSX Manager cluster has different IP addresses.
- Restore the NSX Manager cluster from a backup.
- Connect the transport nodes to the new NSX Manager cluster.
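The recovery steps above lend themselves to scripting; the sketch below just expresses them as an ordered runbook. The DNS name, IP addresses, and backup id are hypothetical, and each step would be replaced by real DNS, backup-restore, and verification tooling:

```python
# Hedged sketch of the manual/scripted management-plane recovery as an
# ordered runbook; names, IPs, and backup ids are hypothetical.
def recovery_runbook(dns_name, new_ips, backup_id):
    steps = []
    # 1. Repoint the short-TTL DNS record at the secondary-site managers.
    steps.append(f"update DNS: {dns_name} -> {', '.join(new_ips)}")
    # 2. Restore the NSX Manager cluster from the continuous backup.
    steps.append(f"restore NSX Manager cluster from {backup_id}")
    # 3. Transport nodes reconnect via the DNS name; verify rather than
    #    reconfigure each node individually.
    steps.append("verify transport nodes reconnect to the new cluster")
    return steps

for step in recovery_runbook("nsxmgr.corp.local",
                             ["10.2.0.11", "10.2.0.12", "10.2.0.13"],
                             "backup-latest"):
    print(step)
```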
The diagram below shows manual/scripted recovery of the management plane.
Manual/Scripted failover Data Plane
To allow a manual/scripted failover of the Data plane, the system must be configured as follows:
- The maximum latency between Edge nodes is 150 ms.
The Edge nodes can be VMs or bare metal.
The tier-0 gateway can be active-standby or active-active.
Edge node VMs can be installed in different vCenter Servers. No vSphere HA is required.
The recovery steps are:
- Move the Tier-1 gateways that were connected to the Primary site Tier-0 gateway to the Tier-0 gateway in the DR site.
- Connect the Tier-1 gateways to the Secondary site Tier-0 gateway.
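Re-homing the Tier-1 gateways can be sketched as a policy API call per gateway, patching the `tier0_path` field to point at the secondary site's Tier-0. The gateway ids are hypothetical and nothing is sent here:

```python
# Sketch: build the PATCH calls that move Tier-1 gateways onto the
# secondary site's Tier-0 (ids are hypothetical; nothing is sent).
def rehome_tier1_calls(tier1_ids, dr_tier0_id):
    return [("PATCH", f"/policy/api/v1/infra/tier-1s/{t1}",
             {"tier0_path": f"/infra/tier-0s/{dr_tier0_id}"})
            for t1 in tier1_ids]

calls = rehome_tier1_calls(["T1-App", "T1-Web"], "T0-Secondary")
for method, path, body in calls:
    print(method, path, body)
```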
The diagram below shows the manual/scripted recovery of the data plane.
NSX-T Federation
NSX-T Federation allows you to manage multiple NSX-T Data Center environments with a single pane of glass view.
You can create gateways and segments that span one or more locations, and configure and enforce firewall rules consistently across locations.
This is basically what we had with Cross vCenter NSX in NSX-V.
One of the major differences is how the distributed firewall works. With NSX-V we had the concept of Universal objects and rules, whereby the object or rule was created and managed only from the primary site and replicated to all secondary sites. The main use for this was Universal Security Tags for DR failover. There were other use cases, but otherwise the security group could only really use the VM name as a reference, and as VM names can be changed by any vCenter admin, this can lead to the firewall not functioning as desired.
With NSX-T Federation we no longer have, or need, Universal objects, as the Global Manager is connected to all Local Managers and has visibility of all VMs across the sites. As such, distributed firewall rules and objects can be referenced from anywhere, and rules are maintained when VMs migrate.
Once you have installed the Global Manager and have added locations, you can configure networking and security from Global Manager.
There are two types of NSX Managers in a federation deployment, both are deployed from the same OVA file.
- Global Manager: A system similar to NSX Manager that federates multiple Local Managers.
- Local Manager: An NSX Manager system in charge of network and security services for a location.
When you create a networking object from the Global Manager, it can span one or more locations. You can of course still create a networking object directly on the Local Manager, but that configuration will not appear on the Global Manager. The span of an object is one of the following:
- Local: the object spans only one location.
- Stretched: the object spans more than one location.
You do not directly configure the span of a segment. A segment has the same span as the gateway it is attached to.
Security objects have a region. The region can be one of the following:
- Location: a region is automatically created for each location. This region has the span of that location.
- Global: a region that has the span of all available locations.
- Custom Region: you can create regions that include a subset of the available locations.
In a Federation environment, there are two types of tunnel endpoints.
- Tunnel End Point (TEP): the IP address of a transport node (Edge node or Host) used for Geneve encapsulation within a location.
- Remote Tunnel End Points (RTEP): the IP address of a transport node (Edge node only) used for Geneve encapsulation across locations.
A Global Manager cluster is deployed in the Primary site and a Standby cluster in the Secondary site.
A Local Manager cluster is deployed in each site; it syncs data with the Global Manager and some data with the other Local Managers.
Configurations made on the Global Manager can be pushed out to all Local Managers, to only some of them, or to just a single Local Manager.
Security objects function in the same way.
Security groups can contain VMs that span more than one location and can include dynamic membership. To support this, membership information from each location must be synced to build the full group member list.
For example, Group1 has the following members:
- VM1 in Location 1
- VM2 in Location 2
- VM3 in Location 3
Each Local Manager syncs its dynamic group membership with the other Local Managers. As a result, each Local Manager has a complete list of group members. This is a big change from Cross vCenter NSX-V, where each NSX Manager could only see and reference its local objects!
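The Group1 example above can be illustrated as a simple merge: each Local Manager evaluates its dynamic criteria locally, and the per-location results are combined so every site ends up with the full list. VM and location names are the hypothetical ones from the example:

```python
# Toy illustration of the membership sync for Group1: per-location results
# are merged so every Local Manager sees the complete list.
local_membership = {
    "Location1": ["VM1"],
    "Location2": ["VM2"],
    "Location3": ["VM3"],
}

def full_group_membership(per_location):
    """The complete member list every Local Manager ends up with."""
    members = []
    for loc in sorted(per_location):
        members.extend(per_location[loc])
    return members

print(full_group_membership(local_membership))
```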
Tier-0 gateways, Tier-1 gateways, and segments can also span one or more locations in the Federation environment.
- Tier-0 and Tier-1 gateways can have a span of one or more locations.
- The span of a Tier-1 gateway must be equal to, or a subset of, the span of the Tier-0 gateway it is attached to.
- A segment has the same span as the Tier-0 or Tier-1 gateway it is attached to.
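The span rules above amount to a simple subset check, sketched below. The location names are hypothetical:

```python
# Minimal check of the span rules: a Tier-1's span must be equal to, or a
# subset of, its linked Tier-0's span; a segment inherits its gateway's span.
def valid_tier1_span(tier0_span, tier1_span):
    return set(tier1_span) <= set(tier0_span)

t0_span = {"Paris", "London", "NewYork"}
assert valid_tier1_span(t0_span, {"Paris", "London"})    # subset: allowed
assert not valid_tier1_span(t0_span, {"Paris", "Tokyo"}) # Tokyo not in T0 span
segment_span = sorted({"Paris", "London"})  # segment inherits gateway span
print(segment_span)
```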
You can configure Tier-0 gateways that exist in only a single location. This is the same as configuring them locally on the Local Manager, but as they are deployed via the Global Manager, they can only be managed from there.
Tier-0 gateways can be configured as Active/Active in a primary and secondary location. This is similar to an Active/Active non-local-egress deployment in NSX-V, whereby all traffic egresses via a single site.
Alternatively, they can be configured as Active/Active in all locations. Think Local Egress in NSX-V, where traffic leaves via the local site. Care must be taken to ensure that traffic returns to the correct site: this is asymmetric routing, and firewalls may drop the traffic if it comes back in via a different site.
You can deploy a Tier-1 gateway to provide distributed routing only, or you can configure services on it.
You can create a Tier-1 gateway in Federation for distributed routing only. This gateway has the same span as the Tier-0 gateway it is linked to.
The Tier-1 does not use Edge nodes for routing. All traffic is routed from host transport nodes to the Tier-0 gateway. However, to enable cross-location forwarding, the Tier-1 allocates two Edge nodes from the Edge cluster configured on the linked Tier-0 to use for that traffic.
If you need one of the following configurations, you will need to configure the Tier-1 gateway with Edge clusters:
- You want to run services on the tier-1 gateway.
- You want to deploy a Tier-1 gateway that has a different span than the linked Tier-0 gateway. You can remove locations, but you cannot add locations that are not already included in the span of the Tier-0 gateway.
You select one of the locations to be the primary location. All other locations are secondary. The HA mode for the Tier-1 gateway is Active Standby. All traffic passing through this Tier-1 gateway passes through the active edge node in the primary location.
If both the Tier-1 gateway and the linked Tier-0 gateway have primary and secondary locations, configure the same location to be primary for both gateways to reduce cross-location traffic.