path: root/docs/databases.txt
- The storage for HA databases is problematic. There are several ways to organize storage. I list the major
  characteristics here (INNODB is generally faster, but takes about 20% more disk space. Initially it is
  significantly faster and takes 5x the disk space, but it normalizes over time...)

 Method         Database                Performance     Clnt/Cache      MySQL           Gluster         HA
 HostMount      MyISAM/INNODB           8 MB/s          fast            250%            -               No. But otherwise the least problematic to run.
 Gluster        MyISAM (no logs)        1 MB/s          unusable        150%            600-800%        Perfect, but too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
 Gluster/Block  MyISAM (no logs)        5 MB/s          slow, but OK    200%            ~ 50%           No problems on reboot, but requires manual work to detach the volume if a node crashes.
 Galera         INNODB                  3.5 MB/s        fast            3 x 200%        -               Should be perfect, but I am not sure about automatic recovery...
 MySQL Slaves   INNODB                  6-8 MB/s (exp.) fast                                            Available data is HA, but caching is not. We can easily turn a slave into the master.
 DRBD           MyISAM (no logs)        4-6 MB/s (exp.) ?                                               I expect it to be a faster option, but it does not fit the complete concept.
 

 Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take ages to complete.
Gluster/Block is faster, but HA suffers. The volume stays attached to the pod running on the crashed node and does not seem to be detached automatically until
the failed pod (in 'Unknown' state) is killed with
    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay, it is re-attached to the new running pod. Technically, we could run a kind of monitoring service which detects such nodes
and restarts the pods. Still, this solution is limited to MyISAM with binary logging disabled. Unlike the simple Gluster solution, clients may use the system
while caching is in progress, but it is quite slow. The main trouble is MyISAM corruption; the recovery is slow.
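The monitoring idea above can be sketched as a tiny watchdog. This is only a sketch: it assumes `oc get pods` output carries the status in the third column, and the `adei` namespace and pod name come from the example above; none of it was tested against a real cluster.

```shell
# Hypothetical watchdog sketch: find pods stuck in the 'Unknown' state and
# force-delete them so the block volume can be re-attached elsewhere.

# Extract names of pods whose STATUS column (3rd field) is 'Unknown'.
unknown_pods() {
    awk '$3 == "Unknown" { print $1 }'
}

# Example loop (not run here): feed it live cluster state, e.g.
#   while sleep 60; do
#       oc -n adei get pods --no-headers | unknown_pods | \
#           xargs -r -n1 oc -n adei delete --force --grace-period=0 pod
#   done
```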

 Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up is a bit
tricky and I am not sure that everything will work smoothly now; some tuning may be necessary. Furthermore, it seems that if the cluster crashes, we
can recover from one of the nodes, but all the data will be destroyed on the other members and they will pull the complete dataset. The synchronization
is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize 5 TB of KATRIN data.
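The ~10 hours figure follows from simple arithmetic (5 TB at ~140 MB/s):

```shell
# Back-of-the-envelope check of the ~10 h figure:
# 5 TB = 5,000,000 MB; at ~140 MB/s this is ~35,700 s, i.e. just under 10 hours.
seconds=$((5 * 1000 * 1000 / 140))
hours=$((seconds / 3600))
echo "$seconds s ~ $hours+ h"
```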
 
So, there is no really full-HA-capable solution at the moment. Compromising on caching HA seems the most reasonable:
 - MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to source databases will keep working
 (i.e. status displays), and current data is available. And we can easily switch the master if necessary.

The other reasonable options have some problems at the moment and can't be used.
 - Galera. A fine solution, but it would need some degree of initial maintenance to work stably. Furthermore, the caching is quite slow, and the
 resync is a big issue.
 - Gluster/Block would be a good solution if volume detachment were fixed. As it stands, we don't have HA without manual intervention. Furthermore,
 MyISAM recovery is quite slow.
 - HostMount would use our 3-node storage optimally, but if something crashes it takes about a week to re-cache the data.

Gluster/Block
=============
 The idea is pretty simple. A standard Gluster file system is used to store 'block' files (just normal files). These files are used as block devices
 with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches), and the device is exposed over iSCSI to the clients.
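For illustration, a consumer of such a block device would look roughly like the following PersistentVolume. The portal address, IQN, name, and size here are invented placeholders, not values from our setup:

```yaml
# Hypothetical PersistentVolume backed by a gluster-block iSCSI target.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-block
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce        # enforces the single-pod access policy
  iscsi:
    targetPortal: 192.168.1.10:3260
    iqn: iqn.2016-12.org.gluster-block:aafea465-9167-4880-b37c-2c36db8562ea
    lun: 0
    fsType: xfs
    readOnly: false
```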

 There are a couple of problems with configuration and run-time.
 - The default Gluster containers will complain about rpcbind. We are using host networking in this case, and the required port (111) conflicts between the container
 and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should be removed from the Gluster container
 and the corresponding requirements removed from the gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it
 should be). We additionally need 'iscsi-initiator-utils' in the container.

 - Only a single pod should have access to the block device. Consequently, when the volume is attached to a client, other pods can't use it any more.
 The problem starts if the node running the pod dies. This is not handled perfectly by OpenShift at the moment. The volume remains attached to the pod in the 'Unknown' state
 until that pod is manually killed. Only then, after another delay, is it detached and made available for the replacement pod (which will be stuck in the ContainerCreating
 phase until then). Pods in the 'Unknown' state are not easy to kill:
    oc delete --force --grace-period=0 pod/mysql-1-m4wcq

 - Heketi is buggy.
  * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible to remove the
  volumes and recover the database from a backup, but this is time-consuming and unreliable for an HA solution.
  * In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster file system used
  to back the block devices; it is 100 GB by default. If we specify 500Gi in the PVC, Heketi will try to create 15 such devices (another maximum configured by
  OpenShift) before crashing.
  * Overall, I'd rather use only manual provisioning.

 - Even without Heketi it is still problematic (maybe it is better with the official RH container running GlusterFS 3.7, but I'd not check it)... We
 can try again with GlusterFS 4.1. There are probably multiple problems, but:
  * GlusterFS may fail on one of the nodes (while still showing it up and running). If any of the block services have problems communicating with the local gluster
  daemon, most requests to the gluster daemon will time out (info/list will still work, but slowly).
  
Galera
======
 - To bring a new cluster up, there are several steps.
  * All members need to initialize standard standalone databases.
  * One node should perform the cluster initialization, and the other nodes join after it is completed.
  * The joining nodes will delete their mysql folders and re-synchronize from the first node.
  * Then the cluster will be up and all nodes in the so-called primary state.

 - The procedure is similar for crash recovery:
  * If a node leaves the cluster, it may just come back and be re-synchronized from the other
  cluster members if there is a quorum. For this reason, it is necessary to keep at
  least 3 nodes running.
  * If all nodes crashed, then again one node should restart the cluster and the others join
  later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
  The new version tries to automate this and will recover automatically if 'safe_to_bootstrap' = 1
  in 'grastate.dat' in the mysql data folder. It should be set by Galera based on some heuristic,
  but in fact I always had to set it manually. IMPORTANT: it should be set on only one of
  the nodes.
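The manual 'safe_to_bootstrap' fix can be sketched as follows. The demonstration runs on a sample file in the current directory; on a real node the file lives in the MySQL data folder (the usual path is an assumption to adjust):

```shell
# Sketch: mark ONE node as safe to bootstrap after a full-cluster crash.
GRASTATE=./grastate.dat   # e.g. /var/lib/mysql/grastate.dat on a real node

# Sample file standing in for the real grastate.dat (content abbreviated).
cat > "$GRASTATE" <<'EOF'
# GALERA saved state
version: 2.1
safe_to_bootstrap: 0
EOF

# Flip the flag so this node is allowed to bootstrap the cluster.
sed -i 's/^safe_to_bootstrap: *0/safe_to_bootstrap: 1/' "$GRASTATE"
grep safe_to_bootstrap "$GRASTATE"
```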
 
 - Synchronization only works for INNODB tables. Furthermore, binary logging should be turned
 on (yes, it is possible to turn it off and there are no complaints, but only the table names are
 synchronized; no data is pushed between the nodes).
 
 - OpenShift uses a 'StatefulSet' to perform such initialization. In particular, it starts the first
 node and waits until it is running before starting the next one.
  * Now the nodes need to talk to each other. A 'headless' service is used for that.
  Unlike a standard service, the DNS does not load-balance over the service pods, but returns the IPs of
  all service members if the appropriate DNS request (SRV) is sent. In the Service spec we specify:
        clusterIP: None                                 - old version
  For clients we still need a load-balancing service. So, we need to add a second service
  to serve their needs.
  * To decide whether it should perform cluster initialization, the node tries to resolve the members
  of the service. If it is alone, it initializes the cluster. Otherwise, it tries to join the other
  members already registered in the service. The problem is that, by default, OpenShift will only
  add a member once it is ready (readiness check). Consequently, all nodes would try to
  initialize. There are two methods to prevent this: one works up to 3.7 and the other from 3.8 on,
  but it does no harm to use both for now.
    The new one is to set in the Service spec:
        publishNotReadyAddresses: True
    The old one is to specify in the Service metadata.annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: true
  * Still, we should quickly re-check for peers until the other pods have had a chance to start.
  * Furthermore, there are some differences from a 'dc' definition. We need to specify 'serviceName'
  in the StatefulSet spec:
      serviceName: adei-ss
  There are a few other minor differences. For instance, the 'selector' has a more flexible notation
  and should include 'matchLabels' before specifying the 'pod' selector, etc.
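Putting the pieces together, the headless service and StatefulSet described above might look roughly like this. Apart from 'adei-ss', all names, labels, and the image are made-up placeholders:

```yaml
# Hypothetical sketch of the headless service plus the StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: adei-ss            # headless service used for peer discovery
  annotations:
    # old method, needed up to OpenShift 3.7
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None
  publishNotReadyAddresses: true    # new method, 3.8 and up
  selector:
    app: galera
  ports:
    - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: galera
spec:
  serviceName: adei-ss     # must reference the headless service
  replicas: 3
  selector:
    matchLabels:           # 'matchLabels' wraps the pod selector here
      app: galera
  template:
    metadata:
      labels:
        app: galera
    spec:
      containers:
        - name: mysql
          image: example/galera:latest   # placeholder image
          ports:
            - containerPort: 3306
```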
  
 - To check the current status of the cluster:
        SHOW STATUS LIKE 'wsrep_cluster_size';