As I stated in my previous article – MySQL InnoDB Cluster – Recovering and provisioning with mysqldump:
“As the administrator of a cluster, among other tasks, you should be able to restore failed nodes and grow (or shrink) your cluster by adding (or removing) new nodes”.
Well, I still agree with myself 🙂
MySQL customers using a Commercial Edition have access to MySQL Enterprise Backup (MEB), which provides enterprise-grade physical backup and recovery for MySQL.
MEB delivers hot, online, non-blocking backups on multiple platforms including Linux, Windows, Mac & Solaris.
More details here.
Note:
If you want to know how to recover a node and/or how to provision nodes with mysqldump please read this blog post.
Context
Let’s make it as simple as possible 🙂
I’m using MySQL Enterprise 8.0.16, available for MySQL customers on My Oracle Support or on Oracle Software Delivery Cloud.
I have an InnoDB Cluster setup, up and running.
So my main assumption is that you already know what MySQL Group Replication & MySQL InnoDB Cluster are.
Additionally, you can read this tutorial and this article from my colleague lefred, or this one on the Windows platform from my colleague Ivan.
All nodes must have the right MySQL Enterprise Backup privileges.
Details here.
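For illustration only, the grants look roughly like this – the user name and password are placeholders, and the exact privilege list depends on your MEB version and the features you use, so follow the MEB manual:
SQL> -- Hypothetical backup user; check the MEB manual for the authoritative privilege list
SQL> CREATE USER 'mysqlbackup'@'localhost' IDENTIFIED BY 'password';
SQL> GRANT RELOAD, PROCESS, SUPER, REPLICATION CLIENT, BACKUP_ADMIN ON *.* TO 'mysqlbackup'@'localhost';
SQL> GRANT CREATE, INSERT, DROP, UPDATE ON mysql.backup_progress TO 'mysqlbackup'@'localhost';
SQL> GRANT CREATE, INSERT, DROP, UPDATE ON mysql.backup_history TO 'mysqlbackup'@'localhost';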
All nodes must have the same values for log-bin & relay-log:
For example: log-bin=binlog & relay-log=relaylog (on all nodes)
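For instance, the corresponding my.cnf fragment – identical on all nodes – would be:
[mysqld]
log-bin=binlog
relay-log=relaylog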
Note:
Depending on how you configured your MySQL InnoDB Cluster, some steps could be slightly different.
Scenario 1 – Node Recovering
- A 3-node MySQL InnoDB Cluster – M1 / M2 / M3 – in single-primary mode
- MySQL Router is configured to enable R/W connections on 3306 and RO connections on 3307
- M1 is currently the primary (that is, in Read/Write mode)
- M2 & M3 are currently the secondaries (that is, in Read-Only mode)
- M1 failed! Assuming it is irreconcilably corrupted :'(
- M2 & M3 are now the (new temporary) cluster
The goal then is to rebuild M1 and put it back into the cluster.
So, as I said before, we have a 3-node MySQL Enterprise 8.0.16 InnoDB Cluster up and running:
$ mysqlsh clusterAdmin@{mysqlRouterIP}:3306 --cluster
...
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M1:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M1:3306"
}
Then node M1 crashed… (status is “MISSING“):
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
M1 was the primary.
The cluster initiated an automatic database failover to elect a new primary… blablabla
Well, you already know the story 🙂
After a while, M1 is fixed and ready to be part of the cluster again.
To minimize the recovery time, instead of using the last backup, we prefer to take a fresh one.
Speaking of backup, I recommend reading the excellent blog post from my colleague Jesper – MySQL Backup Best Practices.
Let’s take a fresh backup on a secondary node (we could also use the primary).
MySQL Enterprise Backup is a very versatile tool and has many different configuration options.
For clarity I’ll use a simplistic command. Please read the MEB documentation for a more “production-style” command.
The backup command looks roughly like this:
$ mysqlbackup --defaults-file=/etc/my.cnf --with-timestamp --messages-logdir=/data/backups/ --backup-image=/data/backups/db.mbi backup-to-image
...
-------------------------------------------------------------
Parameters Summary
-------------------------------------------------------------
Start LSN : 44603904
End LSN : 44607630
-------------------------------------------------------------
mysqlbackup completed OK! with 1 warnings
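Optionally, you can verify the integrity of the backup image with mysqlbackup‘s validate operation:
$ mysqlbackup --backup-image=/data/backups/db.mbi validate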
Please note that it is highly recommended, in addition to the my.cnf, to include in your backup process a copy of the auto.cnf and mysqld-auto.cnf configuration files of all nodes.
If you “lose” your auto.cnf file, don’t worry: the server will generate a new one for you.
However, the recovery process will be slightly different… (more on that below).
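For example, as part of the backup process, something like this on each node (a sketch – paths are illustrative and match the restore commands used below):
# Copy the configuration files next to the backup image
$ cp -p /var/lib/mysql/auto.cnf /data/backups/
$ cp -p /var/lib/mysql/mysqld-auto.cnf /data/backups/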
Now it’s time to restore this backup on node M1.
Because this server is part of a MySQL InnoDB Cluster, there are obviously some extra steps compared to a standalone server restoration.
Node Recovering
The node recovering process is simple:
- Delete all contents of the MySQL Server data directory
- Restore the backup
- Restore the auto.cnf file
- Restore the mysqld-auto.cnf file (if there is one)
- Start the MySQL instance
This gives us, on M1, something like this (simplified version, please adapt to your context):
# Delete all contents of the MySQL Server data directory
$ rm -rf /var/lib/mysql/*
# Restore the backup
$ mysqlbackup --backup-dir=/exp/bck --datadir=/var/lib/mysql --backup-image=/data/backups/db.mbi copy-back-and-apply-log
# Restore the auto.cnf file
$ cp -p /data/backups/auto.cnf /var/lib/mysql
# Restore the mysqld-auto.cnf file
$ cp -p /data/backups/mysqld-auto.cnf /var/lib/mysql
# Start the MySQL instance
$ service mysql start
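Optionally, before the node rejoins, you can sanity-check that the restored auto.cnf kept M1’s original server UUID – it is this UUID that lets the cluster recognize the node (output shown is illustrative):
$ grep server-uuid /var/lib/mysql/auto.cnf
server-uuid=6ad8caed-9d90-11e9-96e5-0242ac13000b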
Then you can connect to the cluster and… see that node M1 is recovering (“status: RECOVERING”) or, if you’re not fast enough, that the node is part of the cluster again (“status: ONLINE”):
$ mysqlsh clusterAdmin@{mysqlRouterIP}:3306 --cluster
...
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "RECOVERING",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
// After a while
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
Lost the auto.cnf file
As promised, here is the case where the auto.cnf configuration file is not restored.
In fact, in this case the cluster would see this node as a new node (because the server will have a new UUID).
So the process for putting it back is different.
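You can see the UUID currently used by the server with the query below; in this case it will no longer match the member_id stored in the cluster metadata:
SQL> SELECT @@server_uuid;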
Also note that if you lose the mysqld-auto.cnf file, you’ll probably need to configure the server (again) to be Group Replication aware.
To begin, you must stop the Group Replication plugin on the node that needs to be restored (M1):
$ mysqlsh root@M1 --sql -e"STOP GROUP_REPLICATION;"
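You can check Group Replication’s view of the member state at any time with a quick query on Performance Schema:
$ mysqlsh root@M1 --sql -e"SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;"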
Then, if necessary, you can check the configuration and/or configure the node:
MySQL JS> dba.checkInstanceConfiguration("root@M1:3306")
...
MySQL JS> dba.configureInstance("root@M1:3306")
...
You need to remove the old node from the InnoDB Cluster metadata:
MySQL JS> cluster.rescan()
Rescanning the cluster...
Result of the rescanning operation for the 'default' ReplicaSet:
{
"name": "default",
"newTopologyMode": null,
"newlyDiscoveredInstances": [],
"unavailableInstances": [
{
"host": "M1:3306",
"label": "M1:3306",
"member_id": "6ad8caed-9d90-11e9-96e5-0242ac13000b"
}
]
}
The instance 'M1:3306' is no longer part of the ReplicaSet.
The instance is either offline or left the HA group. You can try to add it to the cluster again with the cluster.rejoinInstance('M1:3306') command or you can remove it from the cluster configuration.
Would you like to remove it from the cluster metadata? [Y/n]: Y
Removing instance from the cluster metadata...
The instance 'M1:3306' was successfully removed from the cluster metadata.
Add the “new” node:
MySQL JS> cluster.addInstance("clusterAdmin@M1:3306")
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster ...
Validating instance at M1:3306...
This instance reports its own address as M1
Instance configuration is suitable.
The instance 'clusterAdmin@M1:3306' was successfully added to the cluster.
Let’s check – after the recovery stage, the “new” node is online:
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "RECOVERING",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
// After a while...
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
Et voilà!
A simple alternative to deal with this “unpleasantness“, if you don’t need to configure the node, is simply to remove the node and then add it again.
Below is an example with M3:
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK_PARTIAL",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
MySQL JS> cluster.removeInstance("root@M3:3306")
The instance will be removed from the InnoDB cluster. Depending on the instance
being the Seed or not, the Metadata session might become invalid. If so, please
start a new session to the Metadata Storage R/W instance.
Instance 'M3:3306' is attempting to leave the cluster...
The instance 'M3:3306' was successfully removed from the cluster.
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
MySQL JS> cluster.addInstance("root@M3:3306")
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster ...
Validating instance at M3:3306...
This instance reports its own address as M3
Instance configuration is suitable.
The instance 'root@M3:3306' was successfully added to the cluster.
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
Scenario 2 – Node Provisioning
- A 3-node MySQL InnoDB Cluster – M1 / M2 / M3 – in single-primary mode
- MySQL Router is configured to enable R/W connections on port 3306 and RO connections on port 3307
- M2 is currently the primary (that is, in Read/Write mode)
- M1 & M3 are currently the secondaries (that is, in Read-Only mode)
The goal then is to add 2 new nodes: M4 & M5
So we have the 3-node MySQL 8.0.16 InnoDB Cluster that we used in the first part of this article, up and running.
Adding new nodes is actually very close to what we have done previously.
The process is:
- Deploy the new MySQL instance, preferably already configured for Group Replication
- Restore the data in the way that we have seen previously
Checking the configuration and configuring the instance can be done with the dba.checkInstanceConfiguration() and dba.configureInstance() functions respectively (and it could also be useful to use cluster.checkInstanceState() – see this article).
e.g. on node M4:
$ mysqlsh clusterAdmin@M4:3306 -- dba checkInstanceConfiguration
Validating MySQL instance at M4:3306 for use in an InnoDB cluster...
This instance reports its own address as M4
Checking whether existing tables comply with Group Replication requirements...
No incompatible tables detected
Checking instance configuration...
Instance configuration is compatible with InnoDB cluster
The instance 'M4:3306' is valid for InnoDB cluster usage.
{
"status": "ok"
}
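As mentioned above, cluster.checkInstanceState() can also be used to verify whether the instance’s GTID set is consistent with (and recoverable from) the cluster:
MySQL JS> cluster.checkInstanceState("clusterAdmin@M4:3306")
...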
Restore the backup on M4, the new node:
# Restore the backup on M4
$ mysqlbackup --backup-dir=/exp/bck --datadir=/var/lib/mysql --backup-image=/data/backups/db.mbi copy-back-and-apply-log
And finally, add the new node (M4):
// Add the new instance
MySQL JS> cluster.addInstance("clusterAdmin@M4:3306")
// Check
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M4:3306": {
"address": "M4:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
Same process for the last node, M5.
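In short (abridged – same commands as for M4):
# Restore the backup on M5
$ mysqlbackup --backup-dir=/exp/bck --datadir=/var/lib/mysql --backup-image=/data/backups/db.mbi copy-back-and-apply-log
// Add the new instance
MySQL JS> cluster.addInstance("clusterAdmin@M5:3306")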
You end up with a 5-node MySQL InnoDB Cluster \o/:
MySQL JS> cluster.status()
{
"clusterName": "pocCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "M2:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to 2 failures.",
"topology": {
"M1:3306": {
"address": "M1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M2:3306": {
"address": "M2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M3:3306": {
"address": "M3:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M4:3306": {
"address": "M4:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
},
"M5:3306": {
"address": "M5:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.16"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "M2:3306"
}
This is one way to do node recovery and provisioning using MySQL Enterprise Backup.
You may also take a look at this article: InnoDB Cluster: Recovering an instance with MySQL Enterprise Backup – from my colleague Keith.
If you are not yet a MySQL customer, and therefore cannot enjoy our advanced features/tools and technical support, then mysqldump could probably fit here. Please read: MySQL InnoDB Cluster – Recovering and provisioning with mysqldump.
Note that some new features are coming in this area… 🙂
Stay tuned!
Misc
Node 3 – Group Replication configuration
SQL> SHOW VARIABLES LIKE 'group_replication%';
+-----------------------------------------------------+--------------------------------------+
| Variable_name | Value |
+-----------------------------------------------------+--------------------------------------+
| group_replication_allow_local_lower_version_join | OFF |
| group_replication_auto_increment_increment | 7 |
| group_replication_autorejoin_tries | 0 |
| group_replication_bootstrap_group | OFF |
| group_replication_communication_debug_options | GCS_DEBUG_NONE |
| group_replication_communication_max_message_size | 10485760 |
| group_replication_components_stop_timeout | 31536000 |
| group_replication_compression_threshold | 1000000 |
| group_replication_consistency | EVENTUAL |
| group_replication_enforce_update_everywhere_checks | OFF |
| group_replication_exit_state_action | READ_ONLY |
| group_replication_flow_control_applier_threshold | 25000 |
| group_replication_flow_control_certifier_threshold | 25000 |
| group_replication_flow_control_hold_percent | 10 |
| group_replication_flow_control_max_quota | 0 |
| group_replication_flow_control_member_quota_percent | 0 |
| group_replication_flow_control_min_quota | 0 |
| group_replication_flow_control_min_recovery_quota | 0 |
| group_replication_flow_control_mode | QUOTA |
| group_replication_flow_control_period | 1 |
| group_replication_flow_control_release_percent | 50 |
| group_replication_force_members | |
| group_replication_group_name | 28f66c86-9d66-11e9-876e-0242ac13000b |
| group_replication_group_seeds | M1:33061,M2:33061 |
| group_replication_gtid_assignment_block_size | 1000000 |
| group_replication_ip_whitelist | AUTOMATIC |
| group_replication_local_address | M3:33061 |
| group_replication_member_expel_timeout | 0 |
| group_replication_member_weight | 50 |
| group_replication_message_cache_size | 1073741824 |
| group_replication_poll_spin_loops | 0 |
| group_replication_recovery_complete_at | TRANSACTIONS_APPLIED |
| group_replication_recovery_get_public_key | OFF |
| group_replication_recovery_public_key_path | |
| group_replication_recovery_reconnect_interval | 60 |
| group_replication_recovery_retry_count | 10 |
| group_replication_recovery_ssl_ca | |
| group_replication_recovery_ssl_capath | |
| group_replication_recovery_ssl_cert | |
| group_replication_recovery_ssl_cipher | |
| group_replication_recovery_ssl_crl | |
| group_replication_recovery_ssl_crlpath | |
| group_replication_recovery_ssl_key | |
| group_replication_recovery_ssl_verify_server_cert | OFF |
| group_replication_recovery_use_ssl | ON |
| group_replication_single_primary_mode | ON |
| group_replication_ssl_mode | REQUIRED |
| group_replication_start_on_boot | ON |
| group_replication_transaction_size_limit | 150000000 |
| group_replication_unreachable_majority_timeout | 0 |
+-----------------------------------------------------+--------------------------------------+
Thanks for using MySQL!
Follow me on Linkedin
Watch my videos on my YouTube channel and subscribe.
My Slideshare account.
My Speaker Deck account.