Friday, February 28, 2014

SDN - A look at OpenFlow running on whitebox switches.

In my quest to understand SDN, I attended a presentation on OpenFlow running on whitebox switches by Pica8 Open Networking. Their goal is to commoditize switching hardware and let the control plane decisions be made by a controller using OpenFlow. Pica8 sells a cheap whitebox switch that runs Open vSwitch. The switches can be set up through zero-touch provisioning, meaning you take an unconfigured box, drop it into the network, and it will communicate with a server and automagically configure itself. They run a very lightweight open source operating system with hardly any functionality, which keeps costs low. I've read on the internet that their switches sell at half the cost of other vendors' equipment.

While this may sound enticing, you get what you pay for. The TCAM on some of their switches holds around 1K-2K entries, and around 10K on others. This is pretty low! Why is the TCAM important? Think of it as the table that holds your ACL-style entries. This is the 5-tuple flow where you "program" the switch based on source and destination IP address, source and destination port, and protocol. That is the gist of how OpenFlow programs the switches in your data center.
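As a rough illustration (this is not Pica8's actual tooling, and the bridge name, addresses, and port numbers below are made up), a single 5-tuple entry on an Open vSwitch based switch might look like this:

# Permit TCP from 10.0.1.10 to 10.0.2.20 on destination port 80, out physical port 2
ovs-ofctl add-flow br0 "priority=100,tcp,nw_src=10.0.1.10,nw_dst=10.0.2.20,tp_dst=80,actions=output:2"

Every entry like this burns a TCAM slot, so a 1K-2K table does not go very far.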

The low number of entries means that Pica8 has to limit the size of a data center. They work around this limitation by recommending that you group switches into "clusters". (Note: cluster is my name for it; they called it a unit of calculation.) They've calculated that a typical cluster can scale to a maximum of 12 racks of servers with two TOR switches, two AGG switches, and two CORE boxes. When you want to expand, you create another "cluster" of switches/servers and interconnect the clusters through the CORE boxes. This seems to be a waste of ports at the AGG layer.

I believe the reason for this is the TCAM limitation. If the AGG layer runs out of TCAM space, you are forced to build another cluster. I imagine each cluster is managed by a separate SDN controller. Theoretically a single SDN controller could manage all the clusters, but they didn't really talk about that.
           
Pica8, at the time of writing, has 200 customers but ZERO deployments, which means that even though SDN and OpenFlow are cool technologies, no customer in their right mind is going to put this in their production environment until the technology is more mature.

Now, one of the problems with this implementation is that you literally have to program your network. You first have to create drop-flow profiles. Don't want IPv6 traversing your switch? Create a drop flow. No multicast? Create a drop flow. Then you create your forwarding entries. Need to go from a VM on port 1 to a VM on port 2? Create a forwarding entry. Now imagine that you have 48 ports per switch and two TOR switches; that right there means 96 entries in a single direction. Add the reverse direction and you have 192 entries to program. As you can see, this could get very tedious.
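To make that concrete, here is a rough sketch (hypothetical bridge name, MAC addresses, and port numbers) of what those drop and forwarding entries might look like on an OVS-based switch:

# Drop all IPv6 and all IP multicast
ovs-ofctl add-flow br0 "priority=200,dl_type=0x86dd,actions=drop"
ovs-ofctl add-flow br0 "priority=200,ip,nw_dst=224.0.0.0/4,actions=drop"

# Forward between the VM on port 1 and the VM on port 2 (one entry per direction)
ovs-ofctl add-flow br0 "priority=100,in_port=1,dl_dst=52:54:00:00:00:02,actions=output:2"
ovs-ofctl add-flow br0 "priority=100,in_port=2,dl_dst=52:54:00:00:00:01,actions=output:1"

Multiply the forwarding entries by every pair of ports that needs to talk and the TCAM math above gets ugly fast.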
    
Hopefully someone has a controller that can do this automatically. Once you run out of TCAM space you'll have to move VMs to a new cluster, which brings me to the second issue: support for vMotion. Moving VMs between servers requires reprogramming flows, so vMotion will have to integrate with the SDN controller for this to work. This is also a flow-based mechanism, which means it can be susceptible to DDoS attacks. Just send a bunch of ARPs and the switch will punt them to the controller. Get enough ARPs and you can overwhelm the controller and bring down a cluster.

Next they discussed network diagnostics. This was a very interesting topic. Where do you put this? On a normal switch you have counters and can retrieve them, typically through SNMP. But on an OpenFlow switch the hardware is supposed to be dumb, so you need to put this on the controller. But how do you access a switch's counters without compromising performance? Do you retrieve them through OpenFlow? Is there another northbound connection that will be both lightweight and scalable? Pica8 also mentioned that some counters, such as ingress and egress port statistics, were not easily accessible. Another issue came up when an upstream AGG switch did not have a proper flow entry: it blackholed the packets and sent flow control frames downstream to the TOR, which filled up the buffers and prevented any packets from being forwarded from the TOR. They had to drop down to the switch's debug level to figure this out. In this scenario, where is the troubleshooting? You think it's a TOR issue, but in fact it's an AGG issue. This is a big concern I have about SDN: the lack of diagnostics to troubleshoot issues. There is not enough visibility into the network to trace down a problem.
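For what it's worth, an OVS-based box will at least give you per-port and per-flow counters locally (generic OVS commands with an example bridge name, not Pica8-specific tooling):

# Per-port packet, byte, drop, and error counters
ovs-ofctl dump-ports br0

# Per-flow packet and byte counters for every installed entry
ovs-ofctl dump-flows br0

That still leaves the question of how a controller collects this at scale without hammering every switch.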

While OpenFlow is an interesting technology, this implementation is not yet mature and requires a lot of customization. Because of the limitations of the switch hardware, this is not a scalable solution. You also need an intelligent controller that can automate your flow entries in a simple manner.


Now, to resolve this hardware issue I can imagine building a switch the way you build a bare metal server: make the parts swappable. Running out of TCAM? Pull out the current one and install a bigger one, just like you can swap out RAM and CPUs. Whiteboxes need to be built so that their network connections stay in place and you swap out the FRUs around them. There also needs to be a way to get the optics to the point where they are tri-rate like copper links. Need to upgrade from 1G to 10G to 40G to 100G? Just update the flash. However, this will also need some kind of backplane so the switch fabric can be upgraded. Commoditized hardware needs to be built modularly. Granted, this starts to sound like a chassis-based switch, not a TOR. But technology is constantly shrinking things down while packing more punch, so I can imagine that eventually this will happen.

Thursday, February 27, 2014

My research into SDN and Open vSwitch (OVS)

So I've been researching SDN with Open vSwitch (OVS) and I'm not impressed. I'm all for network virtualization and SDN, but this implementation is rather poor. For one, why are they building in antiquated technologies?

On their features page http://openvswitch.org they mention:

  • STP (IEEE 802.1D-1998)
WTF? Why? It's even an older version of spanning tree, to boot. If you're going to put spanning tree in OVS, why not use the much faster RSTP instead? What's the intention of STP? Backwards compatibility? Are you really going to deploy brand new technology (i.e., OVS and OpenStack) on a network that has old switches?
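For reference, here's the knob OVS exposes for this (standard ovs-vsctl commands; br0 is just an example bridge name):

# Turn on classic 802.1D spanning tree on an OVS bridge
ovs-vsctl set Bridge br0 stp_enable=true

# Verify the setting
ovs-vsctl get Bridge br0 stp_enable

Flip that switch and what you get is the 1998 flavor of the protocol.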

If you're going to create an SDN network, why not use a newer technology like Shortest Path Bridging or TRILL that allows all paths to be active? Better yet, program the network so all paths are active and, when a failure occurs, reroute around it. Isn't that what SDN is all about?

Also, OVS is flow based. The flow setup rate for GRE tunnels is low, roughly 24K connections/sec while utilizing more than 80% of the CPU, and this makes it vulnerable to DoS attacks. Compromise a VM, have the instance generate BUM frames (Broadcast, Unknown unicast, and Multicast), and you will bring OVS to its knees. And if you're going to use STP on OVS, beware of a hacker crafting packets with a superior BPDU. You wouldn't want ports going into a blocking state for no reason.
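If you leave STP off and just want to keep rogue BPDUs from a compromised VM off the wire, one crude mitigation (a sketch, not something OVS documents as best practice; bridge and port are hypothetical) is to drop BPDUs on the VM-facing ports with a high-priority flow:

# 802.1D BPDUs are addressed to 01:80:c2:00:00:00; drop them coming in from VM port 1
ovs-ofctl add-flow br0 "priority=300,in_port=1,dl_dst=01:80:c2:00:00:00,actions=drop"

It doesn't fix the flow setup rate problem, but it closes off the BPDU angle.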

There has to be a different approach to doing this without copying old technology and porting it to something new. 

Friday, February 14, 2014

How to summarize external routes learned from another IGP into the OSPF backbone.

For example, say I have an ASBR in Area 0 redistributing 192.168.10.0/24, 192.168.11.0/24, and 192.168.12.0/24 from RIP into OSPF, and I only want to advertise 192.168.8.0/21 into the backbone.

Junos can summarize at the ABR using the "area-range" or "nssa area-range" commands, but those summarize routes coming from another area into the backbone, not routes redistributed at an ASBR.

You can do one of the following:

1. Move the router into another area (not practical)

or

1. Create an aggregate/generated route covering the desired routes
2. Instead of exporting the individual IGP routes in the routing policy, export the aggregate/generated route

-----------------------------
user@R3# run show route protocol rip                                               

inet.0: 40 destinations, 40 routes (40 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.10.0/24     *[RIP/100] 22:17:43, metric 2, tag 0
                    > to 192.168.0.30 via ge-0/0/3.0
192.168.11.0/24     *[RIP/100] 22:17:43, metric 2, tag 0
                    > to 192.168.0.30 via ge-0/0/3.0
192.168.12.0/24     *[RIP/100] 22:17:43, metric 2, tag 0
                    > to 192.168.0.30 via ge-0/0/3.0
224.0.0.9/32       *[RIP/100] 02:05:02, metric 1
                      MultiRecv

inet6.0: 36 destinations, 39 routes (36 active, 0 holddown, 0 hidden)

[edit]
user@R3# show protocols ospf 
export RIP->OSPF;
area 0.0.0.0 {
    interface ge-2/0/0.0;
    interface ge-2/0/1.0;
    interface lo0.0;
}
area 0.0.0.2 {
    nssa;
    interface vlan.100 {
        bfd-liveness-detection {
            minimum-interval 100;
        }
    }
}

jnpr@R3# show routing-options 
aggregate {
    route 192.168.8.0/21;
}

jnpr@R3# show policy-options policy-statement RIP->OSPF 
term AGG {
    from {
        protocol aggregate;
        route-filter 192.168.8.0/21 exact;
    }
    then accept;
}
term LAST {
    then reject;
}




[edit]
user@R3# run show route protocol aggregate detail 

inet.0: 40 destinations, 40 routes (40 active, 0 holddown, 0 hidden)
192.168.8.0/21 (1 entry, 1 announced)
        *Aggregate Preference: 130
                Next hop type: Reject
                Address: 0x1147eec
                Next-hop reference count: 4
                State: <Active Int Ext>
                Age: 2:36:04 
                Task: Aggregate
                Announcement bits (2): 0-KRT 2-OSPF 
                AS path: I (LocalAgg)
                Flags:          Depth: 0 Active
                AS path list:
                AS path: I Refcount: 3
                Contributing Routes (3):
                192.168.10.0/24 proto RIP
                192.168.11.0/24 proto RIP
                192.168.12.0/24 proto RIP

Looking at the prefix from another router in area 0

user@R2# run show route 192.168.8/21  

inet.0: 37 destinations, 38 routes (37 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.8.0/21      *[OSPF/150] 02:32:53, metric 0, tag 0
                    > to 192.168.0.22 via ge-1/1/2.0

Looking at the prefix from another router in area 2 (NSSA)

user@R4# run show ospf database nssa    

    OSPF database, Area 0.0.0.2
 Type       ID               Adv Rtr           Seq      Age  Opt  Cksum  Len 
NSSA     0.0.0.0          192.168.255.1     0x80000004  1209  0x20 0xec70  36
NSSA     192.168.8.0       192.168.255.3     0x80000001    16  0x20 0x809e  36
NSSA    *192.168.20.0      192.168.255.7     0x8000001d   360  0x28 0x7d95  36
NSSA    *192.168.21.0      192.168.255.7     0x8000001c  1563  0x28 0x749e  36
NSSA    *192.168.123.252   192.168.255.7     0x8000001c   961  0x28 0x2c83  36

user@R4# run show route 192.168.8/21 

inet.0: 37 destinations, 37 routes (37 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.8.0/21      *[OSPF/150] 00:02:12, metric 0, tag 0
                    > to 192.168.0.60 via vlan.100

Thursday, February 13, 2014

SDN - Unbundling platforms via Cumulus Networks

I went to a Meetup the other night and saw a presentation by JR Rivers (Co-founder) of Cumulus Networks.

He made an interesting point that most enterprise networks have hardware that runs a single OS. For instance, a customer data center will run Cisco hardware running Cisco IOS. As it is, the network admin is stuck with that vendor and cannot move to another vendor without a forklift upgrade.

Who wants to reconnect links?

JR's proposition is to basically take what's currently going on in the server market and redefine the networking industry.


If you don't like your current vendor's OS, then you can re-image the gear and run the Network OS flavor of your choice.

I can see why Cumulus has taken this approach. In order to disrupt an incumbent, you need a way to remove them from the environment without disrupting the environment.

I can see this working on whitebox gear, but not so much on incumbent gear. A question was raised from the audience: who does the support? JR said that they would support software related issues. But what happens if Cumulus OS is running on top of a Cisco box? Good luck trying to call Cisco; they would automatically void the warranty in a situation like this. The one good thing about a single vendor environment is that there is support: you know the vendor will fix their bugs. In this new environment, you're not sure who's going to take responsibility for fixing a bug. Also, Cumulus is currently only targeting TOR boxes, so for now this unbundling of platforms will only happen in the data center.

The value that Cumulus adds is that their OS is Linux. DevOps folks are already very familiar with Linux and can program both the app and the network, hence the SDN part. If Cisco doesn't have a feature in IOS that you need, as a customer you have to file a feature request, give your SE a dollar value tied to the feature, and then wait x months for the new version to come out. Of course, that software upgrade costs money.

As a demo, JR showed us a Bash script that solved a problem. His point: he created the feature himself. It didn't cost him any money or require him to upgrade his OS.
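The point is that on a Linux switch, a "feature" can be a few lines of Bash built on ordinary Linux tooling. Here's a made-up example in that spirit (the port name and threshold are hypothetical, and this is not JR's actual script): shut down a port whose error counter is climbing.

#!/bin/bash
# Hypothetical: disable a switch port once its receive errors pass a threshold
PORT=swp1
ERRORS=$(cat /sys/class/net/$PORT/statistics/rx_errors)
if [ "$ERRORS" -gt 1000 ]; then
    ip link set dev $PORT down
    logger "Disabled $PORT after $ERRORS rx errors"
fi

Try doing that on a traditional switch OS without filing a feature request.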

Now, for some vendors like Cisco you could do scripting off-box using screen scrapes. But other vendors already have this capability built in. For instance, Arista has Python built into their OS, so JR's value-add doesn't really hold a lot of weight there. Juniper has a scripting language called SLAX which is based on XML/XSLT.

Another concern I have is that the OS is Linux and security becomes a big problem. This is the very reason some vendors don't want to open up a scripting language on their boxes. If a hacker is able to take control of your OS, they could create malicious code to do all sorts of things. I wouldn't want to be the network admin who has to figure out why their network is all of a sudden DDoSing their own servers and other networks. This is why Juniper removed Perl from their early implementation of Junos, and also why they support SLAX, which basically prevents users from messing around with the kernel.

An audience member asked for JR's take on OpenFlow. He was mostly against it. His response was that there would be lots of problems getting all the vendors to support it. I think this is an incorrect assumption. Networking vendors always update their OSes to support new protocols; as a vendor, you have to adapt to new things or you die.

I believe JR's reluctance comes from the threat OpenFlow poses. An SDN controller using OpenFlow can basically take the brains out of a switch. If you have 100 switches, you basically have 100 OSes, a separate OS running on each switch. With OpenFlow, you can theoretically control all 100 switches with a single SDN controller and not rely on the underlying OS to do the decision making. Program the whole network, not just each individual switch.

While Cumulus does make an interesting point in unbundling platforms, SDN using a controller that is able to program your whole network is way more interesting.