diff --git a/docs/databases.txt b/docs/databases.txt
new file mode 100644
index 0000000..254674e
--- /dev/null
+++ b/docs/databases.txt
@@ -0,0 +1,118 @@
+- The storage for HA databases is problematic. There are several ways to organize the storage. I list the major
+  characteristics here (INNODB is generally faster, but takes about 20% more disk space. Initially it is
+  significantly faster and takes 5x the disk space, but this normalizes over time...)
+
+  Method          Database           Performance       Clients        MySQL CPU   Gluster CPU  HA
+  HostMount       MyISAM/INNODB      8 MB/s            fast           250%        -            No. But otherwise the fewest problems to run.
+  Gluster         MyISAM (no logs)   1 MB/s            unusable       150%        600-800%     Perfect, but far too slow (completely unusable if bin-logs are on). Slow MyISAM recovery!
+  Gluster/Block   MyISAM (no logs)   5 MB/s            slow, but OK   200%        ~50%         No problems on reboot, but a crashed node requires manual work to detach the volume.
+  Galera          INNODB             3.5 MB/s          fast           3 x 200%    -            Should be perfect, but I am not sure about automatic recovery...
+  MySQL Slaves    INNODB             6-8 MB/s (exp.)   fast                                    Available data is HA, but caching is not. We can easily turn a slave into the master.
+  DRBD            MyISAM (no logs)   4-6 MB/s (exp.)   ?                                       I expect it to be a faster option, but it does not fit the complete concept.
+
+
+ Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take ages to complete.
+The Gluster/Block approach is faster, but HA suffers. The volume stays attached to the pod that was running on the crashed node and does not seem to be
+detached automatically until the failed pod (in 'Unknown' state) is killed with
+    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
+Then, after some delay, it is re-attached to the new running pod. Technically, we can run a kind of monitoring service which detects such pods and
+restarts them (a rough sketch follows below). Still, this solution is limited to MyISAM with binary logging disabled. Unlike the plain Gluster solution,
+the clients may use the system while caching is in progress, but it is quite slow. The main trouble is MyISAM corruption; the recovery is slow.
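+
+ A minimal sketch of such a watchdog, assuming the 'adei' namespace and that pods stuck in the 'Unknown' state only need to be force-deleted
+ (the poll interval and script layout are illustrative):
+
+    #!/bin/bash
+    # Periodically force-delete mysql pods stuck in the 'Unknown' state so that the
+    # gluster-block volume can be re-attached to the replacement pod.
+    while true; do
+        oc -n adei get pods --no-headers 2>/dev/null | awk '$3 == "Unknown" {print $1}' | \
+        while read -r pod; do
+            echo "force-deleting stuck pod: ${pod}"
+            oc -n adei delete pod "${pod}" --force --grace-period=0
+        done
+        sleep 60
+    done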
+
+ Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up is a bit
+tricky and I am not sure that everything will work smoothly now; some tuning may be necessary. Furthermore, it seems that if the cluster crashes, we
+can recover from one of the nodes, but all the data will be destroyed on the other members and they will pull the complete dataset. The synchronization
+is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize 5 TB of KATRIN data.
+
+So, there is really no fully HA-capable solution at the moment. The most reasonable approach seems to be compromising on the HA of caching.
+ - MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to the source databases keeps working
+ (i.e. status displays), and the current data stays available. And we can easily switch the master if necessary (a rough sketch of the switch follows).
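+
+ A minimal sketch of the switch, assuming classic asynchronous replication; 'adei-db-2' is an illustrative name for the slave being promoted:
+
+    # 1. Stop replication on the chosen slave and drop its slave configuration.
+    mysql -h adei-db-2 -e "STOP SLAVE; RESET SLAVE ALL;"
+
+    # 2. Make sure the promoted node accepts writes.
+    mysql -h adei-db-2 -e "SET GLOBAL read_only = 0;"
+
+    # 3. Re-point the clients (e.g. the OpenShift service selector or the ADEI
+    #    configuration) to adei-db-2; any remaining slaves are re-pointed with
+    #    CHANGE MASTER TO ... followed by START SLAVE.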
+
+The other reasonable options have some problems at the moment and can't be used.
+ - Galera. A fine solution, but it would need some degree of initial maintenance to work stably. Furthermore, the caching is quite slow, and the
+ resync is a big issue.
+ - Gluster/Block would be a good solution if the volume detachment were fixed. As it stands, we don't have HA without manual intervention. Furthermore,
+ the MyISAM recovery is quite slow.
+ - HostMount will use our 3-node storage optimally. But if something crashes, it takes about a week to re-cache the data.
+
+Gluster/Block
+=============
+ The idea is pretty simple. A standard Gluster file system is used to store 'block' files (just normal files). These files are used as block devices
+ with a single-pod access policy. The GFAPI interface is used to access the data on Gluster (avoiding context switches) and the device is exposed over
+ iSCSI to the clients. A provisioning sketch follows below.
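+
+ A rough sketch of creating such a block device with the gluster-block CLI (volume name, node IPs and size are placeholders; the exact syntax should be
+ checked against the installed gluster-block version):
+
+    # Create a 500 GiB block file on the 'adei' Gluster volume, exported over iSCSI
+    # from three nodes (target-side HA).
+    gluster-block create adei/mysql-data ha 3 192.168.26.11,192.168.26.12,192.168.26.13 500GiB
+
+    # Show the resulting iSCSI target (IQN, portals) needed by the clients.
+    gluster-block info adei/mysql-data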
+
+ There are a couple of problems with configuration and run-time.
+ - The default Gluster containers will complain about rpcbind. We are using host networking in this case, and the required port (111) conflicts between
+ the container and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should be removed from the Gluster container
+ and the requirement dropped from the gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it
+ should be). We additionally need 'iscsi-initiator-utils' in the container. A sketch of these adjustments follows.
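+
+ A rough sketch of these adjustments inside the Gluster container image (assuming a CentOS-based image and that gluster-blockd.service declares the
+ rpcbind dependency there; package and unit paths should be verified):
+
+    # Drop the container-local rpcbind so it does not clash with the host one (port 111).
+    yum remove -y rpcbind
+
+    # Remove the rpcbind requirement from the gluster-blockd unit.
+    sed -i '/rpcbind/d' /usr/lib/systemd/system/gluster-blockd.service
+
+    # The iSCSI initiator tools are needed inside the container as well.
+    yum install -y iscsi-initiator-utils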
+
+ - Only a single pod should have access to the block device. Consequently, when the volume is attached to a client, other pods can't use it any more.
+ The problem starts if the node running that pod dies. This is not handled perfectly by OpenShift at the moment. The volume remains attached to the pod
+ in the 'Unknown' state until it is manually killed. Only then, after another delay, is it detached and made available to the replacement pod (which is
+ stuck in the ContainerCreating phase until then). Pods in the 'Unknown' state are not easy to kill:
+    oc delete --force --grace-period=0 pod/mysql-1-m4wcq
+
+ - Heketi is buggy.
+   * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible to remove the
+   volumes and recover the database from a backup, but this is time consuming and unreliable for an HA solution.
+   * In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster file system used
+   to back the block devices; it is 100 GB by default. If we specify 500Gi in the pvc, it will try to create 15 such devices (another maximum configured
+   by OpenShift) before crashing.
+   * Overall, I'd rather only use manual provisioning (a sketch of a manually provisioned PV follows).
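+
+ A minimal sketch of such a manually provisioned PV, registered with 'oc create -f mysql-block-pv.yaml'; the portal and IQN are placeholders to be taken
+ from 'gluster-block info':
+
+    # mysql-block-pv.yaml
+    apiVersion: v1
+    kind: PersistentVolume
+    metadata:
+      name: mysql-block
+    spec:
+      capacity:
+        storage: 500Gi
+      accessModes:
+        - ReadWriteOnce
+      persistentVolumeReclaimPolicy: Retain
+      iscsi:
+        targetPortal: 192.168.26.11:3260
+        iqn: iqn.2016-12.org.gluster-block:00000000-0000-0000-0000-000000000000
+        lun: 0
+        fsType: xfs
+        readOnly: false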
+
+ - Even without Heketi it is still problematic (maybe it is better with the official RH container running GlusterFS 3.7, but I have not checked). We
+ can try again with GlusterFS 4.1. There are probably multiple problems, but
+   * GlusterFS may fail on one of the nodes (while still showing itself as up and running). If any of the block services have problems communicating
+   with the local gluster daemon, most requests to the gluster daemon will time out (info/list will still work, but slowly).
+
+Galera
+======
+ - To bring a new cluster up, there are several steps (a command-level sketch follows this list).
+   * All members need to initialize standard standalone databases.
+   * One node should perform the cluster initialization, and the other nodes join after it is completed.
+   * The joining nodes will delete their mysql folders and re-synchronize from the first node.
+   * Then the cluster will be up and all nodes will be in the so-called primary state.
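+
+   A command-level sketch of the bring-up, assuming the usual wsrep settings (wsrep_provider, wsrep_sst_method, ...) are already present in my.cnf and
+   'node1..node3' stand for the real peer addresses:
+
+      # Each node first gets a standard standalone initialization (e.g. mysql_install_db).
+
+      # On the first node only: bootstrap a new cluster.
+      mysqld --wsrep-new-cluster
+
+      # On the remaining nodes: a normal start; they will wipe their local data and
+      # pull a full state transfer (SST) from the running peer.
+      mysqld --wsrep-cluster-address="gcomm://node1,node2,node3"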
+
+ - The procedure is similar for crash recovery:
+   * If a node leaves the cluster, it may just come back and be re-synchronized from the other
+   cluster members, provided there is a quorum. For this reason, it is necessary to keep at
+   least 3 nodes running.
+   * If all nodes crashed, then again one node should restart the cluster and the others join
+   later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
+   Newer versions try to automate this and will recover automatically if 'safe_to_bootstrap' is
+   set to 1 in 'grastate.dat' in the mysql data folder. It is supposed to be set by Galera based
+   on some heuristic, but in practice I always had to set it manually. IMPORTANT: it should be
+   set on only one of the nodes (a sketch follows below).
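+
+   A sketch of the full-crash recovery, assuming the MySQL data directory is /var/lib/mysql (adjust to the actual volume path):
+
+      # On exactly ONE node (ideally the one with the most recent data): mark it as
+      # safe to bootstrap; in practice this usually has to be done manually.
+      sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
+
+      # Bootstrap the cluster from that node, then start the remaining nodes normally;
+      # they will re-join and synchronize.
+      mysqld --wsrep-new-cluster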
+
+ - Synchronization only works for INNODB tables. Furthermore, binary logging should be turned
+ on (yes, it is possible to turn it off and there are no complaints, but then only the table
+ names are synchronized and no data is pushed between the nodes). A query to spot non-INNODB
+ tables follows below.
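+
+   A quick way to spot tables that would not be replicated (a standard information_schema query; system schemas are excluded):
+
+      SELECT table_schema, table_name, engine
+        FROM information_schema.tables
+       WHERE engine <> 'InnoDB'
+         AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');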
+
+ - OpenShift uses a 'StatefulSet' to perform such an initialization. In particular, it starts the first
+ node and waits until it is running before starting the next one.
+   * Now the nodes need to talk to each other. A 'headless' service is used for that.
+   Unlike a standard service, the DNS does not load-balance across the service pods, but returns the IPs
+   of all service members if the appropriate DNS request (SRV) is sent. In the Service spec we specify:
+       clusterIP: None
+   For the clients we still need a load-balancing service, so a second service has to be added
+   to serve their needs.
+   * To decide whether it should perform the cluster initialization, a node tries to resolve the members
+   of the service. If it is alone, it initializes the cluster. Otherwise, it tries to join the other
+   members already registered in the service. The problem is that, by default, OpenShift only
+   adds a member once it is ready (readiness check). Consequently, all nodes would try to
+   initialize. There are two methods to prevent this. One works up to 3.7 and the other from 3.8 on,
+   but there is no harm in using both for now.
+   The new one is to set in the Service spec:
+       publishNotReadyAddresses: True
+   The old one is to specify in the Service metadata.annotations:
+       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
+   * Still, we should keep re-checking for peers for a while, until the other pods have had a chance to start.
+   * Furthermore, there are some differences compared to a 'dc' definition. We need to specify 'serviceName'
+   in the StatefulSet spec:
+       serviceName: adei-ss
+   There are a few other minor differences. For instance, the 'selector' has a more flexible notation
+   and should include 'matchLabels' before specifying the 'pod' selector, etc. A condensed sketch of the
+   two services and the StatefulSet follows after this list.
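+
+   A condensed sketch of the pieces above; names, labels, ports and the image are illustrative, and the exact apiVersion of the StatefulSet depends on
+   the OpenShift release:
+
+      # Headless service used for peer discovery (SRV lookups return all member IPs).
+      apiVersion: v1
+      kind: Service
+      metadata:
+        name: adei-ss
+        annotations:
+          service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"   # pre-3.8
+      spec:
+        clusterIP: None
+        publishNotReadyAddresses: true                                     # 3.8 and up
+        selector:
+          app: adei-mysql
+        ports:
+          - name: mysql
+            port: 3306
+      ---
+      # Regular load-balancing service for the clients.
+      apiVersion: v1
+      kind: Service
+      metadata:
+        name: adei-mysql
+      spec:
+        selector:
+          app: adei-mysql
+        ports:
+          - name: mysql
+            port: 3306
+      ---
+      # The StatefulSet references the headless service by name.
+      apiVersion: apps/v1
+      kind: StatefulSet
+      metadata:
+        name: adei-mysql
+      spec:
+        serviceName: adei-ss
+        replicas: 3
+        selector:
+          matchLabels:
+            app: adei-mysql
+        template:
+          metadata:
+            labels:
+              app: adei-mysql
+          spec:
+            containers:
+              - name: mysql
+                image: centos/mysql-57-centos7   # placeholder image
+                ports:
+                  - containerPort: 3306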
+
+ - To check the current status of the cluster:
+ SHOW STATUS LIKE 'wsrep_cluster_size';
+ \ No newline at end of file