- The storage for HA datbases is problematic. There is several ways to organize storage. I list major characteristics here (INNODB is generally faster, but takes about 20% more disk space. Initially it significantly faster and takes 5x disk space, but it normalizes...) Method Database Performance Clnt/Cache MySQL Gluster HA HostMount MyISAM/INNODB 8 MB/s fast 250% - Nope. But otherwise least problems to run. Gluster MyISAM (no logs) 1 MB/s unusable 150% 600-800% Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery! Gluster/Block MyISAM (no logs) 5 MB/s slow, but OK 200% ~ 50% No problems on reboot, but requires manual work if node crashes to detach volume. Galera INNODB 3.5 MB/s fast 3 x 200% - Should be perfect, but I am not sure about automatic recovery... MySQL Slaves INNODB 6-8 exp. fast Available data is HA, but caching is not. We can easily turn the slave to master. DRBD MyISAM (no logs) 4-6 exp. ? I expect it as an faster option, but does not fit complete concept. Gluster is a way too slow for anything. If node crashes, MyISAM tables may be left in corrupted state. The recovery will take ages to complete. The Gluster/Block is faster, but HA suffers. The volume is attached to the pod running on crashed node. It seems not detached automatically until the failed pod (in Unknown state) is killed with oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq Then, after some delay it is re-attached to the new running pod. Technically, we can run kind of monitoring service which will detect such nodes and restart. Still, this solution is limited to MyISAM with binary logging disabled. Unlike simple Gluster solution, the clients may use the system while caching is going, but is quite slow. The main trouble is MyISAM corruption, the recovery is slow. Galera is slower when Gluster/Block, but is fully available. The clients have also more servers to query data from. The cluster start-up is a bit tricky and I am not sure that everything will work smoothely now. Some tunning may be necessary. Furthermore, it seems if cluster is crashed, we can recover from one of the nodes, but all the data will be destroyed on other members and they would pull the complete dataset. The synchronization is faster when caching (~ 140 MB/s), but it wil still take about 10 hours to synchronize 5 TB of KATRIN data. So, there is no realy a full HA capable solution at the moment. The most reasonable seems compromising on caching HA. - MySQL with slaves. The asynchronous replication should be significantly faster when Galera. The passthrough to source databases will be working (i.e. status displays), current data is available. And we can easily switch the master if necessary. The other reasonable options have some problems at the moment and can't be used. - Galera. Is a fine solution, but would need some degree of initial maintenance to work stabily. Furthermore, the caching is quite slow. And the resync is a big issue. - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the MyISAM recovery is quite slow. - HostMount will be using our 3-node storage optimally. But if something crashes there is 1 week to recache the data. Gluster/Block ============= The idea is pretty simple. A standard gluster file system is used to store a 'block' files (just a normal files). This files are used as block devices with single-pod access policy. GFApi interface is used to access the data on Gluster (avoiding context switches) and is exposed over iSCSI to the clients. There are couple of problems with configuration and run-time. - The default Gluster containers while complain about rpcbind. We are using host networking in this case and the required ports (111) between container and the host system conflicts. We, however, are able just to use the host rpcbind. Consequently, the rpcbind should be removed from the Gluster container and the requirements removed from gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it should). We additionally also need 'iscsi-initiator-utils' in the container. - Only a single pod should have access to the block device. Consequnetly, when the volume is attached to the client, other pods can't use it any more. The problem starts if node running pod dies. It is not perfectly handled by OpenShift now. The volume remains attached to the pod in the 'Unknown' state until it manually killed. Only, then, after another delay it is detached and available for replacement pod (which will struggle in ConteinerCreating phase until then). The pods in 'Unknown' state is not easy to kill. oc delete --force --grace-period=0 pod/mysql-1-m4wcq - Heketi is buggy. * If something goes wrong, it starts create multitudes of Gluster volumes and finally crashes with broken database. It is possible to remove the volumes and recover database from backup, but it is time consuming and unreliable for HA solution. * Particularly, this happens if we try to allocate more disk-space when available. The OpenShift configures the size of Gluster file system used to back block devices. It is 100 GB by default. If we specify 500Gi in pvc, it will try to create 15 such devices (another maximum configured by openshift) before crashing. * Overall, I'd rather only use the manual provisioning. - Also without heketi it is still problematic (may be it is better with official RH container running on GlusterFS 3.7), but I'd not check... We can try again with GlusterFS 4.1. There are probably multiple problems, but * GlusterFS may fail on one of the nodes (showing it up and running). If any of the block services have problems communicating with local gluster daemon, most requests (info/list will still work, but slow) to gluster daemon will timeout. Galera ====== - To bring new cluster up, there is several steps. * All members need to initialize standard standalone databases * One node should perform initialization and other nodes join after it is completed. * The nodes will delete their mysql folders and re-synchronize from the first node. * Then, cluster will be up and all nodes in so called primary state. - The procedure is similar for crash recovery: * If a node leaves the cluster, it may just come back and be re-sycnronized from other cluster members if there is a quorum. For this reason, it is necessary to keep at le ast 3 nodes running. * If all nodes crashed, then again one node should restart the cluster and others join later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'. The new tries to automatize it and will recover automatically if 'safe_to_bootstrap' = 1 in 'grstate.dat' in mysql data folder. It should be set by Galera based on some heuristic, but in fact I always had to set it manually. IMIMPORTANT, it should be set only on one of the nodes. - Synchrinization only works for INNODB tables. Furthermore, binary logging should be turned on (yes, it is possible to turn it off and there is no complains, but only the table names are synchronized, no data is pushed between the nodes). - OpenShift uses 'StatefulSet' to perform such initialization. Particularly, it starts first node and waits until it is running before starting next one. * Now the nodes need to talk between each other. The 'headless' service is used for that. Unlinke standard service, the DNS does not load balance service pods, but returns IPs of all service members if appropriate DNS request is send (SRV). In Service spec we specify. clusterIP: None - old version For clients we still need a load-balancing service. So, we need to add a second service to serve their needs. * To decide if it should perform cluster initialization, the node tries to resolve members of the service. If it is alone, it initializes the cluster. Otherwise, tries to join the other members already registered in the service. The problem is that by default, OpenShift only will add member when it is ready (Readyness check). Consequently, all nodes will try to initialize. There is two methods to prevent it. One is working up to 3.7 and other 3.8 up, but it is no harm to use both for now). The new is to set in Service spec: publishNotReadyAddresses: True The old is to specify in Service metadata.annotations: service.alpha.kubernetes.io/tolerate-unready-endpoints: true * Still, we should quickly check for peers until other pods had chance to start. * Furthermore, there is some differneces to 'dc' definition. We need to specify 'serviceName' in the StatefulSet spec. serviceName: adei-ss There are few other minor differences. For instance, the 'selector' have more flexible notation and should include 'matchLabels' before specifying the 'pod' selector, etc. - To check current status of the cluster SHOW STATUS LIKE 'wsrep_cluster_size';