author    Suren A. Chilingaryan <csa@suren.me>    2018-03-20 15:47:51 +0100
committer Suren A. Chilingaryan <csa@suren.me>    2018-03-20 15:47:51 +0100
commit    e2c7b1305ca8495065dcf40fd2092d7c698dd6ea (patch)
tree      abcaa7006a9c4b7a9add9bd0bf8c24f7f8ce048f /docs/performance.txt
parent    47f350bc3aa85a8bd406d95faf084df2abf74ae9 (diff)
Local volumes and StatefulSet to provision Master/Slave MySQL and Galera cluster
Diffstat (limited to 'docs/performance.txt')
-rw-r--r--  docs/performance.txt  |  54
1 file changed, 54 insertions(+), 0 deletions(-)
diff --git a/docs/performance.txt b/docs/performance.txt
new file mode 100644
index 0000000..b31c02a
--- /dev/null
+++ b/docs/performance.txt
@@ -0,0 +1,54 @@
+Divergence from the best practices
+==================================
+ Due to various constraints, I had to take some decisions that contradict the best practices. There were also some
+ hardware limitations resulting in a suboptimal configuration.
+
+ Storage
+ -------
+ - RedHat documentation strongly discourages running Gluster over a large RAID-60. The best performance is achieved
+ if the disks are organized as JBOD and each disk is assigned a brick. The problem is that heketi is not really ready
+ for production yet: I ran into numerous problems while testing it, and managing '3 x 24' Gluster bricks manually would
+ be a nightmare. Consequently, I opted for RAID-60 to simplify maintenance and to ensure no data is lost due to
+ mismanagement of Gluster volumes.
+
+ - In general, the architecture is better suited to many small servers than to a couple of fat storage servers: the
+ disk load would then be distributed across multiple nodes. Furthermore, we can't use all of the storage with 3 nodes.
+ We need 3 nodes to ensure quorum in case of a failure (or network outage), but even if the 3rd node only stores
+ checksums (the arbiter), we can't easily use it to store data. Technically, we could create 3 sets of 3 bricks and
+ place the arbiter brick of each set on a different node, but this again complicates maintenance: unless the proper
+ brick ordering is maintained, replication may end up happening between bricks on the same node (see the sketch after
+ this list). So, again, I decided to favor fault tolerance over performance. We can still use the space when the
+ cluster is scaled out.
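+
+ The following is a minimal Python sketch (node names and brick paths are hypothetical) of the brick ordering for a
+ 'replica 3 arbiter 1' volume: Gluster groups consecutive bricks into replica sets, the last brick of each set acting
+ as the arbiter, so the ordering below keeps every replica set on three distinct nodes and rotates the arbiter role.
+
+     #!/usr/bin/env python3
+     # Sketch: order bricks so each 'replica 3 arbiter 1' set spans three distinct nodes.
+     nodes = ["node1", "node2", "node3"]            # assumed host names
+
+     def brick(node, set_id):
+         return f"{node}:/mnt/bricks/set{set_id}"   # assumed brick mount layout
+
+     ordering = []
+     for set_id in range(len(nodes)):
+         data1 = nodes[set_id % 3]                  # first data copy
+         data2 = nodes[(set_id + 1) % 3]            # second data copy
+         arbiter = nodes[(set_id + 2) % 3]          # arbiter on the remaining node
+         ordering += [brick(data1, set_id), brick(data2, set_id), brick(arbiter, set_id)]
+
+     # Print the brick list in the order it would be passed to 'gluster volume create'.
+     for i, b in enumerate(ordering):
+         role = "arbiter" if (i + 1) % 3 == 0 else "data"
+         print(f"{b}  ({role})")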
+
+ Network
+ -------
+ - To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container
+ Native Routing. This is done by creating a bridge for Docker containers on the hardware network device instead of
+ the OpenVSwitch fabric. Unfortunately, IPoIB does not provide Ethernet L2 capabilities, so it is impossible to use
+ IB devices for bridging. It may still be possible to solve this somehow, but further research is required. The easier
+ solution is simply to switch the OpenShift fabric to Ethernet; we planned to separate the storage and OpenShift
+ networks anyway (see the check below).
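+
+ As a quick illustration (a sketch, not part of the deployment), the kernel exposes the link type of each interface
+ in sysfs: Ethernet devices report ARPHRD type 1 and can be enslaved to a bridge, while IPoIB devices report type 32
+ and carry no Ethernet L2 header.
+
+     #!/usr/bin/env python3
+     # Sketch: report the ARPHRD link type of every network interface on this node.
+     # 1 = ARPHRD_ETHER (bridgeable), 32 = ARPHRD_INFINIBAND (IPoIB, no Ethernet L2).
+     import os
+
+     ARPHRD = {1: "ethernet", 32: "infiniband (IPoIB)"}
+
+     for iface in sorted(os.listdir("/sys/class/net")):
+         with open(f"/sys/class/net/{iface}/type") as f:
+             link_type = int(f.read().strip())
+         kind = ARPHRD.get(link_type, f"other ({link_type})")
+         verdict = "can be bridged" if link_type == 1 else "cannot serve as a bridge port"
+         print(f"{iface}: {kind} - {verdict}")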
+
+ Memory
+ ------
+ - There are multiple Docker storage drivers. We are currently using the LVM-based 'devicemapper'. To build a
+ container, the data is copied from all image layers. The newer 'overlay2' provides a virtual file system (overlayfs)
+ joining all layers and performing copy-on-write only when data is modified. It saves space, but more importantly it
+ also enables page cache sharing, reducing the memory footprint when multiple containers share the same layers (and
+ they share the CentOS base image at minimum). Another advantage is a slightly faster startup of containers with
+ large images (as we don't need to copy all files). On the negative side, it is not fully POSIX compliant and some
+ applications may have problems because of that. For the major applications RedHat provides work-arounds, but again,
+ I opted for the more standard 'devicemapper' to avoid hard-to-debug problems (see the check below).
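+
+ A small sanity check (a sketch; it assumes the Docker CLI is available on the node and Python 3.7+) to confirm which
+ storage driver is actually in use before relying on either behaviour:
+
+     #!/usr/bin/env python3
+     # Sketch: query the Docker storage driver and flag a mismatch with what the deployment expects.
+     import subprocess
+
+     EXPECTED = "devicemapper"   # switch to "overlay2" once its issues are ruled out
+
+     driver = subprocess.run(
+         ["docker", "info", "--format", "{{.Driver}}"],
+         capture_output=True, text=True, check=True,
+     ).stdout.strip()
+
+     if driver != EXPECTED:
+         print(f"WARNING: storage driver is '{driver}', expected '{EXPECTED}'")
+     else:
+         print(f"storage driver: {driver}")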
+
+
+What is required
+================
+ - We need to add at least one more node. It will double the available storage and I expect a significant improvement
+ in storage performance. It would be even better to have 5-6 nodes to split the load.
+ - We need to switch the OpenShift network to an Ethernet fabric. Currently this is not critical and would only add
+ about 20% to ADEI performance. However, it may become an issue if we optimize ADEI database handling or get more
+ network-intensive applications in the cluster.
+ - We need to re-evaluate RDMA support in GlusterFS. Currently it is unreliable, causing pods to hang indefinitely.
+ If it is fixed, we can re-enable RDMA support for our volumes, which may further improve storage performance.
+ Similarly, Gluster block storage is significantly faster for the single-pod use case, but has significant stability
+ issues at the moment.
+ - We need to check whether OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs
+ should be good for our cron services and may reduce the memory footprint.
+
+