- The storage for HA databases is problematic. There are several ways to organize the storage; the major characteristics are listed below. (INNODB is generally faster, but takes about 20% more disk space. Initially it is significantly faster and takes 5x the disk space, but this normalizes over time...)

    Method          Database          Performance  Clnt/Cache    MySQL     Gluster   HA
    HostMount       MyISAM/INNODB     8 MB/s       fast          250%      -         No. But otherwise the least problematic to run.
    Gluster         MyISAM (no logs)  1 MB/s       unusable      150%      600-800%  Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
    Gluster/Block   MyISAM (no logs)  5 MB/s       slow, but OK  200%      ~50%      No problems on reboot, but requires manual work to detach the volume if a node crashes.
    Galera          INNODB            3.5 MB/s     fast          3 x 200%  -         Should be perfect, but I am not sure about automatic recovery...
    Galera/Hostnet  INNODB            4.6 MB/s     fast          3 x 200%  -
    MySQL Slaves    INNODB            5-6 MB/s     fast          2 x 250%  -         Available data is HA, but caching is not. We can easily turn a slave into the master.
    MySQL S.+Zlib   INNODB + ZLib     2-4 MB/s     normal        2 x 300%  -         At about a 35% compression level.
    DRBD            MyISAM (no logs)  4-6 (exp.)   ?                                 I expect it to be a faster option, but it does not fit the OpenShift concept that well.

  Optimized (large buffers, transactions in ADEI, etc.):

    Method          Database  Perf (Mst/Slv)  Clnt/Cache  MySQL     Gluster
    MySQL Slaves    INNODB    12 / 14 MB/s    fast        600-800%  -

Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take ages to complete.

Gluster/Block is faster, but HA suffers. The volume stays attached to the pod running on the crashed node. It does not seem to be detached automatically until the failed pod (in the 'Unknown' state) is killed with
    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay, it is re-attached to the new running pod. Technically, we could run a kind of monitoring service which detects such nodes and restarts the pods. Still, this solution is limited to MyISAM with binary logging disabled.
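Such a monitoring service could be a simple watchdog loop; a minimal sketch, assuming the 'adei' namespace and a 60-second polling interval (both are illustrative, not from a tested deployment):

```shell
#!/bin/sh
# Filter `oc get pods --no-headers` output for pods stuck in the Unknown state
# (columns: NAME READY STATUS RESTARTS AGE; STATUS is field 3).
list_unknown_pods() {
    awk '$3 == "Unknown" { print $1 }'
}

# Hypothetical watchdog: force-delete stuck pods so that their volumes are
# detached and can be re-attached to the replacement pods.
watch_unknown() {
    while true; do
        oc -n adei get pods --no-headers 2>/dev/null \
            | list_unknown_pods \
            | xargs -r -n 1 oc -n adei delete --force --grace-period=0 pod
        sleep 60
    done
}
```

The filtering is kept in a separate function so it can be checked against sample `oc get pods` output without touching a live cluster.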
Unlike the simple Gluster solution, clients may use the system while caching is in progress, but it is quite slow. The main trouble is MyISAM corruption; the recovery is slow.

Galera is slower than Gluster/Block, but is fully available, and the clients have more servers to query data from. The cluster start-up is a bit tricky and I am not sure everything will work smoothly now; some tuning may be necessary. Furthermore, it seems that if the cluster crashes, we can recover from one of the nodes, but all the data will be destroyed on the other members and they will pull the complete dataset. This synchronization is faster than caching (~140 MB/s), but it will still take about 10 hours to synchronize 5 TB of KATRIN data.

So, there is no really full HA-capable solution at the moment. The most reasonable approach seems to be compromising on caching HA.
- MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to the source databases will keep working (i.e. status displays), and current data is available. And we can easily switch the master if necessary.
The other reasonable options have some problems at the moment and can't be used.
- Galera. A fine solution, but the caching is still quite slow. If the networking problem is solved (see the performance section in network.txt) or host networking is used, it is more-or-less on par with Gluster/Block, but provides a much better service to the data-reading clients. However, extra investigation is required to understand the robustness of crash recovery. In some cases, after a crash, Galera performed a full resync of all data (but I was re-creating the statefulset, which is not recommended practice; I am not sure this happens if the software is maintained properly). Also, at some point one of the nodes was not able to join back (even after re-initializing from scratch), but again, this hopefully does not happen if the service is not periodically recreated.
- Gluster/Block would be a good solution if volume detachment is fixed.
As it stands, we don't have HA without manual intervention. Furthermore, the MyISAM recovery is quite slow.
- HostMount will use our 3-node storage optimally. But if something crashes, it takes about a week to re-cache the data.

Gluster/Block
=============
The idea is pretty simple. A standard Gluster file system is used to store 'block' files (just normal files). These files are used as block devices with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches) and is exposed over iSCSI to the clients.

There are a couple of problems with configuration and run-time.
- The default Gluster containers will complain about rpcbind. We are using host networking in this case, and the required port (111) conflicts between the container and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should be removed from the Gluster container and the requirement removed from the gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it should be). We additionally need 'iscsi-initiator-utils' in the container.
- Only a single pod should have access to the block device. Consequently, when the volume is attached to a client, other pods can't use it any more. The problem starts if the node running the pod dies. This is not handled perfectly by OpenShift now. The volume remains attached to the pod in the 'Unknown' state until that pod is manually killed. Only then, after another delay, is it detached and made available to the replacement pod (which will be stuck in the ContainerCreating phase until then). Pods in the 'Unknown' state are not easy to kill:
    oc delete --force --grace-period=0 pod/mysql-1-m4wcq
- Heketi is buggy.
  * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible to remove the volumes and recover the database from a backup, but it is time-consuming and unreliable for an HA solution.
  * In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster file system used to back the block devices; it is 100 GB by default. If we specify 500Gi in the pvc, Heketi will try to create 15 such devices (another maximum configured by OpenShift) before crashing.
  * Overall, I'd rather only use manual provisioning.
- Even without Heketi it is still problematic (maybe it is better with the official RH container running GlusterFS 3.7, but I'd not check...). We can try again with GlusterFS 4.1. There are probably multiple problems, but:
  * GlusterFS may fail on one of the nodes (while still showing it up and running). If any of the block services has problems communicating with the local gluster daemon, most requests to the gluster daemon will time out (info/list will still work, but slowly).

Galera
======
- To bring a new cluster up, there are several steps:
  * All members need to initialize standard standalone databases.
  * One node should perform the cluster initialization; the other nodes join after it is completed.
  * The joining nodes will delete their mysql folders and re-synchronize from the first node.
  * Then the cluster will be up, with all nodes in the so-called primary state.
- The procedure is similar for crash recovery:
  * If a node leaves the cluster, it may just come back and be re-synchronized from the other cluster members if there is a quorum. For this reason, it is necessary to keep at least 3 nodes running.
  * If all nodes crashed, then again one node should restart the cluster and the others join later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'. The new version tries to automate this and will recover automatically if 'safe_to_bootstrap' = 1 in 'grastate.dat' in the mysql data folder. If the cluster was shut down orderly, Galera sets this automatically on the last node to stop the service. In case of a crash, however, it has to be configured manually on the most up-to-date node. IMPORTANT: it should be set on only one of the nodes.
Otherwise, the cluster will become nearly unrecoverable.
  * So, to recover a failed cluster (unless automatic recovery works) we must revert to a manual procedure for now. There is a 'gmanager' pod which can be scaled to 3 nodes. We recover a full cluster in these pods in the required order. Then, we stop the first node and init a statefulSet. As soon as the first node in the statefulSet is ready, we stop the second node in 'gmanager', and so on.
- IMPORTANT: Synchronization only works for INNODB tables. Furthermore, binary logging should be turned on (yes, it is possible to turn it off and there are no complaints, but only the table names are synchronized; no data is pushed between the nodes).
- OpenShift uses a 'StatefulSet' to perform such initialization. In particular, it starts the first node and waits until it is running (and ready) before starting the next one.
  * Now the nodes need to talk to each other. A 'headless' service is used for that. Unlike a standard service, the DNS does not load-balance between the service pods, but returns the IPs of all service members if the appropriate DNS request (SRV) is sent. In the Service spec we specify:
      clusterIP: None
    For the clients we still need a load-balancing service, so we have to add a second service to serve their needs.
  * To decide whether it should perform the cluster initialization, a node tries to resolve the members of the service. If it is alone, it initializes the cluster; otherwise, it tries to join the members already registered in the service. The problem is that, by default, OpenShift only adds a member once it is ready (readiness check). Consequently, all nodes would try to initialize. There are two methods to prevent this: one works up to 3.7, the other from 3.8 on, but there is no harm in using both for now. The new one is to set in the Service spec:
      publishNotReadyAddresses: true
    The old one is to specify in the Service metadata.annotations:
      service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  * Still, we should keep re-checking for peers until the other pods have had a chance to start.
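The two services could look roughly as follows; a sketch only, with the names 'galera' / 'galera-lb' and the selector label assumed for illustration:

```yaml
# Headless service used by the Galera pods for peer discovery.
apiVersion: v1
kind: Service
metadata:
  name: galera
  annotations:
    # pre-3.8 way to publish not-yet-ready pods
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None                  # headless: SRV lookup returns all member IPs
  publishNotReadyAddresses: true   # 3.8+ way
  selector:
    app: galera
  ports:
  - name: mysql
    port: 3306
---
# Regular load-balancing service for the data-reading clients.
apiVersion: v1
kind: Service
metadata:
  name: galera-lb
spec:
  selector:
    app: galera
  ports:
  - name: mysql
    port: 3306
```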
  * Furthermore, there are some differences to the 'dc' definition. We need to specify 'serviceName' in the StatefulSet spec:
      serviceName: adei-ss
    There are a few other minor differences. For instance, the 'selector' has a more flexible notation and should include 'matchLabels' before specifying the 'pod' selector, etc.
- IMPORTANT: If we use hostPath (or even a hostPath-based pv/pvc pair), the pods will be assigned to the nodes randomly. This is not ideal if we want to shut down and restart the cluster: in general, we always want the first pod to end up on the same storage, as it will likely be the one able to bootstrap. Instead, we should use the 'local' volume feature (alpha in OpenShift 3.7; it has to be enabled in the origin-node and origin-master configurations). Then OpenShift binds the 'pvc' to a specific node, and the 'pod' is executed on the node where its 'pvc' is bound.
- IMPORTANT: The StatefulSet ensures ordering and local volume data binding. Consequently, we should not destroy the StatefulSet object, which saves this state information. Otherwise, the node assignments will change and the cluster would be hard to impossible to recover.
- Another problem of our setup is the slow internal network (since bridging over Infiniband is not possible). One solution is to run Galera using 'hostNetwork'. Then, however, the 'peer-finder' fails. It tries to match the service names against its 'hostname', expecting it to be in the form 'galera-0.galera.adei.svc.cluster.local', but with host networking enabled the actual hostname is used (i.e. ipekatrin1.ipe.kit.edu). I had to patch peer-finder to resolve the names to IPs and match the IPs instead.
- To check the current status of the cluster:
      SHOW STATUS LIKE 'wsrep_cluster_size';

Master/Slave replication
========================
- This configuration seems more robust, but strangely has a lot of performance issues on the slave side. The network is not a problem; the slave is able to get the logs from the master, but it is significantly slower in applying them.
The main performance killer is the disk sync operations triggered by 'sync_binlog', INNODB log flushing, etc. Disabling them brings performance to a reasonable level. Still, the master is caching at about 6-8 MB/s and the slave at only 4-5 MB/s (sometimes dropping below 2 MB/s).
- The trouble, I think, is that the slave performs a lot of disk writes: 'mysql-relay-bin.*', 'mysql-bin.*'. All together this amounts to ~18 MB/s. The solution is to disable binary logging on the slave side. We need the 'relay' log to perform replication, but the binary log on the slave would only be needed if another slave were to chain-replicate from it. However, it is better to just disable logging of the data replicated from the master by disabling 'log_slave_updates'. Then, if the slave is converted to a master, it will automatically start logging.
- Further improvements come from significantly increasing the main buffers (MYSQL_INNODB_BUFFER_POOL_SIZE and MYSQL_INNODB_LOG_FILE_SIZE) plus disabling FS caching (MYSQL_INNODB_FLUSH_METHOD=ODIRECT). At maximum, I got about 12 MB/s on the master and 14 MB/s on the slave (using round-robin access to the source MSSQL databases). Both ROW and MIXED binlog formats give more-or-less the same performance and memory footprint, but the CPU usage is significantly higher (500-800%) in MIXED mode; in ROW mode it was about 200%.
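The slave-side tuning above could be collected in a my.cnf fragment along these lines. This is a sketch: the MYSQL_* names in the text are presumably container environment variables mapping to the standard MySQL options below, and the buffer sizes are example values, not measured optima:

```ini
# my.cnf fragment for the replication slave (illustrative values)
[mysqld]
# do not fsync the binary log on every write (performance over durability)
sync_binlog = 0
# the relay log is required for replication to work...
relay_log = mysql-relay-bin
# ...but do not write master-replicated data into the slave's own binary log;
# logging starts automatically if the slave is promoted to master
log_slave_updates = 0
# large buffers plus O_DIRECT to avoid double-caching in the FS page cache
innodb_buffer_pool_size = 8G
innodb_log_file_size = 2G
innodb_flush_method = O_DIRECT
```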