[Crowbar] Support for multiple Nova zones

Kevin Bringard kbringard at atti.com
Mon Jan 16 08:57:57 CST 2012

There are a few things at work here… first off, there are a few different single points of failure, each of which has its own degree of impact. The network controller is arguably the largest SPOF: in a "default" OpenStack setup it's the router for everything and there's just one of them, so if you lose it, you lose connectivity to your entire cloud. The HA stuff in crowbar (also known as "the Vishy method") puts a network controller on each compute node, with the idea being that if you lose that network controller, the VMs on it are gone until it comes back up anyway, so you're minimizing the impact (as an aside, this is a good article about networking in Nova: http://unchainyourbrain.com/openstack/13-networking-in-nova). Then you have the DBs and the messaging queue… setting up replication on the DB and using Rabbit's clustering stuff (http://www.rabbitmq.com/ha.html) will make those more resilient to failure. You can do this manually now (and I'd recommend it if you're planning to go to production). I think what Andi was saying is that they're working to add recipes to make the HA automatic in crowbar.
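
To make the difference in impact concrete, here is a minimal Python sketch. It is only an illustration, not anything Nova or Crowbar provides; the host names, VM names and the vms_affected helper are all made up. It compares which VMs lose connectivity when a network controller fails in the single-controller layout versus the one-controller-per-compute-node layout.

```python
# Hypothetical illustration: which VMs lose connectivity when a network
# controller fails. Host names, VM names and this helper are made up.

def vms_affected(vms_by_host, controller_for_host, failed_controller):
    """Return the sorted VMs whose network controller is the failed one."""
    return sorted(
        vm
        for host, vms in vms_by_host.items()
        if controller_for_host[host] == failed_controller
        for vm in vms
    )


vms_by_host = {
    "compute1": ["vm-a", "vm-b"],
    "compute2": ["vm-c"],
    "compute3": ["vm-d", "vm-e"],
}

# "Default" layout: one central network controller routes for every host.
central = {host: "controller" for host in vms_by_host}

# Per-node layout: each compute node runs its own network controller.
per_node = {host: host for host in vms_by_host}

# Losing the central controller cuts off every VM in the cloud.
print(vms_affected(vms_by_host, central, "controller"))

# Losing one per-node controller only cuts off the VMs on that node.
print(vms_affected(vms_by_host, per_node, "compute2"))
```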

Finally, to your direct question about compute nodes. Unfortunately even if you're using multiple zones, a single VM only runs on a single compute node. There's really no way to 100% guard against it, at least not that I'm aware of today. You can, however, do a few things to minimize the impact:

 *   Use a NAS or SAN to mount the instances directory (/var/lib/nova/instances by default, but it can be changed in nova.conf). This will allow you to nova-manage live migrate instances if you are experiencing a hardware problem… I don't believe you can migrate them with the live-migrate command once a compute node is unreachable; however, it is possible to do it manually.
 *   If you bring a dead compute node back up, the instances that were living on it will still be there (again, in /var/lib/nova/instances, unless you changed it). You can run a loop of euca-reboot-instances, or its nova equivalent, to reboot them and they should come back up.
 *   To Andi's point about application level HA (and this is a really big one): train your users to set up their applications with the understanding that VMs are volatile. Amazon loses compute nodes (and the VMs running on those nodes) all the time. If users' applications are designed to be HA, then the loss of N VMs shouldn't be a problem. Rightscale does some of these things if you want to pay for it, but it's a good idea for application/project/tenant owners to have some monitoring that keeps an eye on the overall capacity, usage and performance of their application(s) in the cloud. That monitoring then spins up and terminates VMs as necessary to maintain performance thresholds. If they have something like that in place and you lose a compute node, their monitoring will notice and automatically spin up more VMs to keep the performance levels. I know it sounds like a tough sell to tell someone they should work harder in case you lose a VM, but the fact remains that VMs should be considered even more volatile than bare metal, and they need to design/retrofit their applications to take that into account.
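
The monitoring idea in the last bullet can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not how Rightscale or any particular tool works: the function names, the 75% utilization target and the minimum of two VMs are all hypothetical, and the actual calls to start or terminate VMs are left out.

```python
import math


def instances_needed(total_load, capacity_per_vm, target_utilization=0.75, minimum=2):
    """VMs required to serve total_load with each VM at or below target_utilization.

    All parameters are hypothetical: load and capacity are in the same
    unit (for example requests per second).
    """
    if capacity_per_vm <= 0:
        raise ValueError("capacity_per_vm must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    needed = math.ceil(total_load / (capacity_per_vm * target_utilization))
    return max(minimum, needed)


def scaling_action(healthy_vms, total_load, capacity_per_vm, **kwargs):
    """Positive: number of VMs to start. Negative: number to terminate. Zero: no change."""
    return instances_needed(total_load, capacity_per_vm, **kwargs) - healthy_vms


# Eight VMs were serving 600 req/s; a compute node died and took two with it.
print(scaling_action(healthy_vms=6, total_load=600, capacity_per_vm=100))
```

A tenant's monitoring would call something like scaling_action on every check interval, so the loss of a compute node simply shows up as a positive number of VMs to start.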

Anyway, those are some of the things we're doing… the DR doc Andi references is very useful as well, so if you've not done so already you should look over it and use it for inspiration.

Hope that helps!

-- Kevin

From: "i3D.net - Tristan van Bokkem" <tristanvanbokkem at i3d.nl>
Date: Mon, 16 Jan 2012 01:51:33 -0800
To: Andi_Abes at Dell.com, csanburn at redwoodit.com
Cc: crowbar at lists.us.dell.com
Subject: Re: [Crowbar] Support for multiple Nova zones


Do I understand correctly that there will be a better HA solution with the Essex release? Because I don't see why you would want to lose VMs if a node goes down.

Best regards,

Tristan van Bokkem
Datacenter Operations

E-mail Personal: tristanvanbokkem at i3d.net
E-mail Support: info at i3d.net
E-mail NOC: noc at i3d.net
Website: http://www.i3d.net
Office:
Interactive 3D B.V.
Meent 93b
3011 JG Rotterdam
The Netherlands

From: Andi_Abes at Dell.com
To: csanburn at redwoodit.com
Cc: crowbar at lists.us.dell.com
Sent: Thu, 05 Jan 2012 22:02:59 +0100
Subject: Re: [Crowbar] Support for multiple Nova zones

You’re right – if you lose a nova-compute node, then you lose only the VMs that are running on it.
The impact of losing other services (network, api, schedulers etc) depends on your deployment options.

If you deploy nova-network in HA mode (with the crowbar scheme), effectively you get multiple nova-api and nova-network instances (1 per compute node). So there’s no central dependency.
We’re working on making other infrastructure services (mysql, rabbit etc) more highly available.

Re: automatic recovery of VMs running on a failed compute node, openstack offers various options, but they’ll need to be customized to your specific situation.
This might be a good start:


From: Chris Sanburn [mailto:csanburn at redwoodit.com]
Sent: Thursday, January 05, 2012 12:10 PM
To: Abes, Andi
Subject: RE: Support for multiple Nova zones

Great explanation Andi, thanks! I know it’s not actually a crowbar issue but when you said:
“if you lose that zone, and whatever  is running in that VM doesn’t have application level high-availability measures – you’re hosed.”
If just one nova node in the zone goes down, then it’s just the VMs that were assigned to that node that are hosed, assuming they had no application level HA, and the rest of the nodes continue on with their assigned VMs?

I’m just trying to confirm what I suspect is true. I believe my boss would like to, ideally, have us deploy an openstack cloud that can continue running when one or two nodes go down. But it appears to me that it doesn’t support that in all aspects. I haven’t seen any evidence that you can have a nova compute node drop and the VMs it was hosting automatically get activated on a remaining nova compute node.

If your novel is about openstack & crowbar I’ll have to get a copy :) Four months ago I’d never worked with either one, so I’ve got a lot to learn yet.


From: Andi_Abes at Dell.com [mailto:Andi_Abes at Dell.com]
Sent: Thursday, January 05, 2012 11:31 AM
To: Chris Sanburn
Cc: crowbar at lists.us.dell.com
Subject: RE: Support for multiple Nova zones

You are right that in swift, zones are the basic unit of availability – swift will guarantee that different replicas of a file are present in different zones (so if you have 3 replicas, each copy will be in a different zone). So, if you have a zone represent a rack and you have 5 racks, and swift has 3 replicas – a file will exist on 3 racks. If you lose a whole rack, you will still have at least 2 separate copies on the surviving racks.
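
The 5-racks, 3-replicas example can be checked with a small Python sketch. This is a toy placement only, not Swift's actual ring algorithm; the place_replicas function, the rack names and the object name are assumptions made up for the illustration. It picks 3 distinct zones for an object and then confirms that losing any single rack leaves at least 2 copies.

```python
import hashlib


def place_replicas(object_name, zones, replicas=3):
    """Pick `replicas` distinct zones for an object (toy scheme, not Swift's ring).

    The starting zone is derived from a hash of the object name, and the
    remaining replicas go to the zones that follow it, wrapping around.
    """
    if replicas > len(zones):
        raise ValueError("need at least as many zones as replicas")
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(zones)
    return [zones[(start + i) % len(zones)] for i in range(replicas)]


zones = ["rack1", "rack2", "rack3", "rack4", "rack5"]
placement = place_replicas("photo.jpg", zones)

# Every replica lands in a different zone.
assert len(set(placement)) == 3

# Whichever single rack is lost, at least two copies survive.
for lost_rack in zones:
    survivors = [zone for zone in placement if zone != lost_rack]
    assert len(survivors) >= 2
```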

Nova works a bit differently….
A VM will be running on 1 compute node. If that compute node fails, you’ve lost the VM (there’s some stuff around live-migration and various discussions on getting high-availability at the VM level – but that’s WIP).
What availability zones in nova allow you to do is choose which zone you put your VM in. If you lose that zone, and whatever is running in that VM doesn’t have application level high-availability measures – you’re hosed.

But let’s assume for a second that you’re running a cluster of web-servers. If you set up your cloud in a way that lets you put different instances of the web servers in different zones (i.e. racks), then if you lose a rack, there are still servers running in the other one. For this setup, you’d use nova zones. Each zone is

If you’re looking just to have high availability for a single VM, you don’t really need separate zones. You’d need to make your 1 nova zone as resilient as possible, i.e. deploy each of the components in a highly available fashion – e.g. 2 nova API, nova-network in HA mode, Rabbit MQ, Mysql and such. You will still not be protected against the failure of a compute node, but your overall cluster will be resilient.

During the Essex design session, there was an interesting discussion about nova-ha – generalizing the description above into 2 classes of cloud users, and their expectations:

- Legacy workloads – expect the same HA that high quality servers provide.

- Cloud Friendly workloads – e.g. web-server clusters or swift – where the application is designed to deal with failure

Nova currently doesn’t handle the legacy workloads too nicely. There were various arguments for and against handling them better, and some folks felt strongly that better support should be provided. There are options out there, and at an extreme you can set up some hypervisors to execute the same VM on more than 1 physical machine (VMWare calls it fault-tolerance; in some other contexts it’s referred to as lockstep execution).

I think I’ll go write a great American novel next….it’d be shorter.
Hope this helps.

From: Chris Sanburn [mailto:csanburn at redwoodit.com]
Sent: Thursday, January 05, 2012 9:44 AM
To: Abes, Andi
Subject: RE: Support for multiple Nova zones

Thanks for the information. Basically I think my boss wants to do this test to see if zones will give us redundancy when some of the compute nodes fail or are taken offline. Our end goal is to build a cloud that is highly available and continues to work in the event that a node fails or needs to be taken offline for some other reason. Perhaps we’re mistaken in our assumption that you need zones to accomplish this?
For instance, I’ve already tested my Swift by copying files to it and then removing one of my 3 nodes, one at a time, and confirming I could still access the files stored on it.

Chris Sanburn

From: Andi_Abes at Dell.com [mailto:Andi_Abes at Dell.com]
Sent: Wednesday, January 04, 2012 8:52 PM
To: Chris Sanburn; crowbar at lists.us.dell.com
Subject: RE: Support for multiple Nova zones

You are right that there’s no explicit support for nova zones – zones in nova can be used for lots of different reasons (as the page you’re pointing to lists). And as opposed to swift, where zones are integral to the operation of the system, for nova they’re mostly optional.

I’m curious what you’re trying to achieve w/ zones in nova?

All that said and done, you could probably achieve a multi-zone nova deployment with crowbar. Crowbar allows you to deploy multiple clusters of any openstack component – you would deploy 2 independent nova clusters, and then link them in a parent-child relationship.

You would need to futz with some of the configuration to add capabilities and use a zone-aware scheduler… all based on the page you referenced.
I’d be interested to hear of your experience.

From: crowbar-bounces On Behalf Of Chris Sanburn
Sent: Wednesday, January 04, 2012 1:14 PM
To: crowbar
Subject: [Crowbar] Support for multiple Nova zones

I was asked to set up child zones and test that on my crowbar openstack test deployment. But when I read the Zone documentation here:
It states:
“At the very least a Zone requires an API node, a Scheduler node, a database and RabbitMQ.”

It occurs to me that I’ve not seen anything in the Nova barclamp to set up zones, like there is in the Swift barclamp. Has anyone done this? Is it possible with Crowbar?

-Chris Sanburn
