{{>toc}}
h1. FAILOVER management
Designations:
<pre>
gnt#  - command executed on the master node
gntX# - command executed on an ordinary node
gntY# - command executed on the other node
#     - command executed on any node
</pre>
h2. Starting instances on one node when the other node is down
When a node boots and cannot find the other node, the cluster management daemon ganeti-masterd does not start automatically, even on the master node.
This is because it cannot tell whether the second node is really down or whether there is merely a link problem while the instances on the other node are still running.
h3. Starting the cluster management daemon
<pre>
gnt# ganeti-masterd --no-voting
</pre>
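To confirm that the master daemon is up and answering queries, you can ask it for the master name and the node list (a quick sanity check, not required by the procedure):
<pre>
gnt# gnt-cluster getmaster
gnt# gnt-node list
</pre>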
h3. Transferring instances from a failed (or offline) node
This operation is performed only once; it changes the primary node of the affected instances.
The transferred instances are started automatically by default.
It is better to migrate or fail over the instances in the regular way before their node goes offline.
But if the node went down unexpectedly, issue:
<pre>
gnt# gnt-node failover --ignore-consistency gntX
</pre>
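To see where the instances ended up after the failover, you can list them together with their primary and secondary nodes (the fields below are standard Ganeti output fields):
<pre>
gnt# gnt-instance list -o name,pnode,snodes,status
</pre>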
h2. Master node change in normal mode
Both nodes are online and the master role is changed in the normal way.
On the master candidate (gntX):
<pre>
gntX# gnt-cluster master-failover
</pre>
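Afterwards you can confirm which node is now the master:
<pre>
gntX# gnt-cluster getmaster
</pre>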
h2. Failure of the master node
The master node (gnt1 in this example) is down due to a hardware failure.
Start the management daemon on the master candidate (gntX):
<pre>
gntX# ganeti-masterd --no-voting
</pre>
Activate new master node:
<pre>
gntX# gnt-cluster master-failover --no-voting
</pre>
Set the broken node offline so that the master node does not try to contact it.
-C = master-candidate
-O = offline
<pre>
gnt# gnt-node modify -C no -O yes gnt1
</pre>
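To check that the flags took effect, you can list the node roles (output field names may differ slightly between Ganeti versions):
<pre>
gnt# gnt-node list -o name,master_candidate,offline
</pre>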
Start all instances from the broken node on the backup node:
<pre>
gnt# gnt-node failover --ignore-consistency gnt1
</pre>
h2. Setting the broken node back online
The old master node will not start the management daemon on boot:
* if it cannot find the other node,
* if it finds the other node already acting as master.
If the data on this node is intact, re-add it to the cluster.
Copy the fresh configuration to it from the new master node:
<pre>
gnt# gnt-cluster redist-conf
</pre>
Restart the Ganeti daemons on the node being brought back:
<pre>
gntX# /etc/init.d/ganeti restart
</pre>
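After the daemons are up again, it is worth running a full cluster check to make sure the re-added node is seen correctly:
<pre>
gnt# gnt-cluster verify
</pre>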
h2. Planned node shutdown for maintenance
Migrate all instances from this node to the other one:
<pre>
gnt# gnt-instance migrate INSTANCE
</pre>
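As an alternative to migrating instances one by one, all primary instances can be migrated off the node with a single command (gntX below stands for the node being emptied):
<pre>
gnt# gnt-node migrate gntX
</pre>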
If the node being shut down is the master, you must first assign a new cluster master (see *Master node change in normal mode* above).
Set the node offline and remove it from the master candidates:
<pre>
gnt# gnt-node modify -C no -O yes NODE
</pre>
Now you can simply power off this node.
h3. Returning the node to online
After the node boots, set it online and make it a master candidate again:
<pre>
gnt# gnt-node modify -C yes -O no NODE
</pre>
However, if you have any doubt about the node's health, re-add it instead:
<pre>
gnt# gnt-node add --readd NODE
</pre>
In either case, wait about 5 minutes until the watcher daemon brings up the DRBD resources, or trigger the setup by hand:
<pre>
gnt# gnt-cluster verify-disks
</pre>
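DRBD resynchronization progress can also be watched directly on the node (a generic DRBD check, not specific to Ganeti):
<pre>
gntX# cat /proc/drbd
</pre>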
h2. Replacing a node with a new one
Add the node to the cluster:
<pre>
gnt# gnt-node add --readd gntX
</pre>
For every instance for which the new node is the secondary, rebuild the secondary disks:
<pre>
gnt# gnt-instance replace-disks --submit -s INSTANCE
</pre>
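Since *--submit* only queues the jobs, their progress can be followed with the job commands (JOB_ID is the id printed by the previous command):
<pre>
gnt# gnt-job list
gnt# gnt-job watch JOB_ID
</pre>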
Re-add the node to Puppet:
<pre>
gnt# gnt-instance console sci
sci# puppetca --clean gnt1.fqdn
</pre>
On the new node, remove the old SSL certificates and restart the Puppet agent:
<pre>
gntX# rm -rf /var/lib/puppet/ssl/*
gntX# /etc/init.d/puppet restart
</pre>
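If autosigning is not enabled on the Puppet master, the new certificate request has to be signed there (same puppetca tool as in the clean command above):
<pre>
sci# puppetca --list
sci# puppetca --sign gnt1.fqdn
</pre>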
h2. Hard disk replacement
Copy the partition table from the existing disk (only valid for disks of the same size):
<pre>
# sfdisk -d /dev/sda | sfdisk /dev/sdX
</pre>
Check:
<pre>
# fdisk -l
</pre>
Add the new partitions to the RAID arrays:
<pre>
# mdadm --manage /dev/md0 --add /dev/sdX1
# mdadm --manage /dev/md1 --add /dev/sdX2
# mdadm --manage /dev/md2 --add /dev/sdX3
</pre>
Check:
<pre>
# cat /proc/mdstat
</pre>