FAILOVER management

Notation:

gnt# - command run on the master node
gntX# - command run on an ordinary node
gntY# - command run on the other node
# - command run on any node

Starting instances on one (master) node when the other is down¶

When a node boots and cannot find the other node, the cluster management daemon ganeti-masterd does not start automatically, even on the master node.
This is because the node cannot tell whether the other node is down or whether there is only a link problem while the instances on the other node are still running.

Starting the cluster management daemon:¶

gnt# ganeti-masterd --no-voting
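
Because a missing peer may mean either a dead node or a broken link, it can help to probe the peer before skipping the vote. The following is only a sketch: the helper name advise_no_voting and the PROBE variable are hypothetical, and a failed ping does not prove that the peer is down, so it only prints advice and never starts the daemon itself.

```shell
# Hypothetical helper, not part of Ganeti.
# PROBE is the command used to test reachability (assumption: ICMP ping on Linux).
PROBE="${PROBE:-ping -c 1 -W 2}"

advise_no_voting() {
    peer="$1"
    # PROBE is expanded unquoted on purpose so that its options are split into words.
    if $PROBE "$peer" >/dev/null 2>&1; then
        echo "peer $peer is reachable: do not use --no-voting"
        return 1
    fi
    echo "peer $peer is unreachable: confirm out of band that it is really down, then run: ganeti-masterd --no-voting"
}
```

Example: advise_no_voting gntY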

Transferring instances from a failed (or offline) node¶

This operation needs to be run only once. It changes the node on which each instance runs.
By default, the transferred instances are started automatically.

It is better to migrate or fail over the instances in the regular way before their node goes offline.
But if the node goes down unexpectedly, run:

gnt# gnt-node failover --ignore-consistency gntX

Master node change in normal mode¶

Both nodes are online, so the master is changed in normal mode.
On the master candidate (gntX):

gntX# gnt-cluster master-failover

Failure of master node¶

The master node (gnt1 in this example) is down because of a hardware failure.

Start the management daemon on the master candidate (gntX):

gntX# ganeti-masterd --no-voting

Activate new master node:

gntX# gnt-cluster master-failover --no-voting

Mark the broken node as offline so that the master node does not try to connect to it.
-C = master-candidate
-O = offline

gnt# gnt-node modify -C no -O yes gntY

Start all instances from the broken node on the backup node:

gnt# gnt-node failover --ignore-consistency gnt1
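
The steps above can be collected into one checklist for the node that takes over. This is a minimal sketch: the function name master_failure_plan is hypothetical, it only prints the commands from this section in order, and it executes nothing, so that each step can be reviewed and run by hand.

```shell
# Print the recovery steps for a failed master, in the order given above.
# $1: name of the broken (old master) node, e.g. gnt1 as in the text.
# Nothing is executed; the output is meant to be read and run manually
# on the master candidate that is taking over.
master_failure_plan() {
    broken="$1"
    cat <<EOF
ganeti-masterd --no-voting
gnt-cluster master-failover --no-voting
gnt-node modify -C no -O yes $broken
gnt-node failover --ignore-consistency $broken
EOF
}
```

Example: master_failure_plan gnt1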

Bringing the broken node back online¶

The old master node will not start the management daemon on boot in either of these cases:

it cannot find the other node,
it finds the other node already acting as master.

If the data on this node is intact, re-add it to the cluster.
Copy the new configuration to it from the new master node:

gnt# gnt-cluster redist-conf

Restart ganeti daemons:

gntX# /etc/init.d/ganeti restart

Planned node shutdown for maintenance¶

Migrate all instances from this node to the other one:

gnt# gnt-instance migrate INSTANCE
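
With many instances, the migrate commands can be generated from the instance list rather than typed one by one. A minimal sketch, assuming that gnt-instance list accepts --no-headers, --separator and -o name,pnode in the installed Ganeti version; the function name migrate_commands is hypothetical, and it only prints commands so that they can be reviewed first.

```shell
# Print one "gnt-instance migrate" command per instance whose primary node is $1.
# Input on stdin: lines of the form NAME:PNODE, for example produced by
#   gnt-instance list --no-headers --separator=: -o name,pnode
migrate_commands() {
    awk -F: -v node="$1" '$2 == node { print "gnt-instance migrate " $1 }'
}
```

Example: gnt-instance list --no-headers --separator=: -o name,pnode | migrate_commands gntX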

If the node being turned off is the master, you must first assign a new cluster master (see Master node change in normal mode above).

Set the node to offline and remove its master-candidate status:

gnt# gnt-node modify -C no -O yes NODE

Now you can simply turn off this node.

Returning node to online¶

After boot, set the node online and make it a master candidate again:

gnt# gnt-node modify -C yes -O no NODE

However, if you have any doubts about the node's health, run this instead:

gnt# gnt-node add --readd NODE

Either way, wait about 5 minutes for the watcher daemon to set up the DRBD resources, or trigger the setup by hand:

gnt# gnt-cluster verify-disks
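
To see whether the DRBD resources have actually come back, the kernel status file can be inspected. A minimal sketch, assuming the DRBD 8.x layout of /proc/drbd; the function name drbd_unsynced is hypothetical.

```shell
# List DRBD minor numbers whose disk state is not UpToDate/UpToDate.
# Input on stdin: the contents of /proc/drbd (DRBD 8.x layout assumed).
# Unconfigured minors are skipped; an empty result means everything is in sync.
drbd_unsynced() {
    awk '/^ *[0-9]+: / && $0 !~ /cs:Unconfigured/ && $0 !~ /ds:UpToDate\/UpToDate/ {
             sub(/:$/, "", $1)   # strip the trailing colon from the minor number
             print $1
         }'
}
```

Example: drbd_unsynced < /proc/drbd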

Replacing a node with a new one¶

Add the node to the cluster:

gnt# gnt-node add --readd gntX

For every instance that has the new node as its secondary:

gnt# gnt-instance replace-disks --submit -s INSTANCE
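
The list of affected instances can be derived from the snodes field of the instance list. A minimal sketch, assuming each instance has a single secondary node and that gnt-instance list accepts --no-headers, --separator and -o name,snodes in the installed version; the function name replace_disks_commands is hypothetical, and it only prints the commands.

```shell
# Print a replace-disks command for each instance whose secondary node is $1.
# Input on stdin: lines of the form NAME:SNODE, for example produced by
#   gnt-instance list --no-headers --separator=: -o name,snodes
replace_disks_commands() {
    awk -F: -v node="$1" '$2 == node { print "gnt-instance replace-disks --submit -s " $1 }'
}
```

Example: gnt-instance list --no-headers --separator=: -o name,snodes | replace_disks_commands gntX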

Re-add the node to Puppet:

gnt# gnt-instance console sci
sci# puppetca --clean gnt1.fqdn

gntX# rm -r /var/lib/puppet/ssl/*
gntX# /etc/init.d/puppet restart

Hard disk replacement¶

Copy the partition table from an existing disk (only valid for disks of the same model):

# sfdisk -d /dev/sda|sfdisk /dev/sdX

Check:

# fdisk -l

Add the new partitions to the RAID arrays:

# mdadm --manage /dev/md0 --add /dev/sdX1
# mdadm --manage /dev/md1 --add /dev/sdX2
# mdadm --manage /dev/md2 --add /dev/sdX3
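
After the partitions are added, the arrays rebuild in the background. A minimal sketch for listing arrays that still have a missing member, assuming the usual Linux /proc/mdstat layout in which the member map looks like [UU] or [U_]; the function name degraded_arrays is hypothetical.

```shell
# Print the names of md arrays that have a missing member.
# Input on stdin: the contents of /proc/mdstat.
# An underscore in the member map (e.g. [U_]) marks a missing or rebuilding disk.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }'
}
```

Example: degraded_arrays < /proc/mdstat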