Divergence from the best practices
==================================

Due to various constraints, I had to take some decisions contradicting the best practices. There were also hardware limitations resulting in a suboptimal configuration.

Storage
-------
- RedHat documentation strongly discourages running Gluster over a large RAID-60. The best performance is achieved if the disks are organized as JBOD and each disk is assigned a brick. The problem is that heketi is not really ready for production yet; I ran into numerous problems while testing it. Managing '3 x 24' Gluster bricks manually would be a nightmare. Consequently, I opted for RAID-60 to simplify maintenance and to ensure that no data is lost due to mismanagement of Gluster volumes.
- In general, the architecture is more suitable for many small servers than for a couple of fat storage servers: the disk load is then distributed between multiple nodes. Furthermore, we can't use all the storage with 3 nodes. We need 3 nodes to ensure arbitration in case of failures (or network outages), and even if the 3rd node only stores the checksums, we can't easily use it to store data. Technically, we could create 3 sets of 3 bricks and place the arbiter bricks on different nodes (see the sketch below), but this again complicates maintenance: unless the proper brick ordering is maintained, replication may happen between bricks on the same node, etc. So, again, I decided to favour fault tolerance over performance. We can still use the space once the cluster is scaled.
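For illustration, a minimal sketch of the manual '3 x (2 + 1)' layout mentioned above, assuming hypothetical host names (node1..node3) and brick paths. Gluster groups the bricks into replica sets in the order they are listed, with every third brick acting as the arbiter, so each group has to be spelled out carefully to span all three nodes:

    # Hypothetical hosts and brick paths; every consecutive group of three
    # bricks forms one replica set, the last brick of each group being the
    # arbiter, so each group must span three different nodes.
    gluster volume create openshift-data replica 3 arbiter 1 \
        node1:/mnt/bricks/b1 node2:/mnt/bricks/b1 node3:/mnt/arbiter/b1 \
        node2:/mnt/bricks/b2 node3:/mnt/bricks/b2 node1:/mnt/arbiter/b2 \
        node3:/mnt/bricks/b3 node1:/mnt/bricks/b3 node2:/mnt/arbiter/b3
    gluster volume start openshift-data

Getting this ordering wrong (e.g. two data bricks of one set landing on the same node) silently removes the fault tolerance, which is exactly the kind of mistake the RAID-60 setup avoids.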
Network
-------
- To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container Native Routing. This is done by creating a bridge for the Docker containers on the hardware network device instead of the OpenVSwitch fabric. Unfortunately, IPoIB does not provide Ethernet L2/L3 capabilities and it is impossible to use IB devices for bridging. It may still be possible to solve this somehow, but further research is required. The easier solution is just to switch the OpenShift fabric to Ethernet; anyway, we had the idea to separate the storage and OpenShift networks.

Memory
------
- There are multiple Docker storage drivers. We are currently using the LVM-based 'devicemapper'. To build a container, the data is copied from all image layers. The newer 'overlay2' provides a virtual file system (overlayfs) joining all layers and performing copy-on-write when data is modified. It saves space, but more importantly it also enables page-cache sharing, which reduces the memory footprint when multiple containers share the same layers (and they share at least the CentOS base image). Another advantage is a slightly faster startup of containers with large images (as we don't need to copy all the files). On the negative side, it is not fully POSIX compliant and some applications may have problems because of that. For the major applications there are workarounds provided by RedHat, but, again, I opted for the more standard 'devicemapper' to avoid hard-to-debug problems.

What is required
================
- We need to add at least one more node. It will double the available storage and I expect a significant improvement in storage performance. It would be even better to have 5-6 nodes to split the load.
- We need to switch the OpenShift network to the Ethernet fabric (see the first sketch below). Currently, this is not critical and would only add about 20% to ADEI performance. However, it may become an issue if we optimize the ADEI database handling or get more network-intensive applications in the cluster.
- We need to re-evaluate RDMA support in GlusterFS. Currently, it is unreliable and causes pods to hang indefinitely. If this is fixed, we can re-enable RDMA support for our volumes (see the second sketch below); it may further improve storage performance. Similarly, Gluster block storage is significantly faster for the single-pod use case, but has significant stability issues at the moment.
- We need to check whether OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs should be good for our cron services and may reduce the memory footprint (see the last sketch below).
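For the switch to the Ethernet fabric, a rough sketch of what Container Native Routing could look like on a node is given below; the interface name (eth1), bridge name (br0) and per-node container subnet are hypothetical, and the exact procedure should be taken from the RedHat documentation. The idea is to bridge the Docker containers directly onto the hardware Ethernet device rather than onto the OpenVSwitch fabric (which, as noted above, is not possible with IPoIB):

    # Hypothetical device, bridge and per-node container subnet.
    # Create a Linux bridge on the Ethernet NIC and enslave the NIC to it.
    nmcli con add type bridge ifname br0 con-name br0 \
          ipv4.method manual ipv4.addresses 10.10.1.1/24
    nmcli con add type bridge-slave ifname eth1 master br0
    # Attach Docker containers to this bridge instead of the default docker0.
    echo 'DOCKER_NETWORK_OPTIONS="--bridge=br0"' >> /etc/sysconfig/docker-network
    systemctl restart docker
    # Routes to the container subnets of the other nodes still have to be
    # added on each node or on the upstream router.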
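If the RDMA issues get fixed upstream, re-enabling it is a per-volume setting; a sketch with a hypothetical volume name follows (the transport can only be changed while the volume is stopped):

    # Hypothetical volume name; the transport can only be changed on a
    # stopped volume.
    gluster volume stop openshift-data
    gluster volume set openshift-data config.transport tcp,rdma
    gluster volume start openshift-data
    gluster volume info openshift-data   # verify 'Transport-type: tcp,rdma'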
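And for the OverlayFS evaluation, switching a node from 'devicemapper' to 'overlay2' could look roughly as follows on CentOS; this is only a sketch and assumes the node can be drained first, since changing the storage driver discards the locally stored images:

    # Sketch only: changing the storage driver discards the local images,
    # the old devicemapper thin pool may need to be removed manually, and
    # overlay2 on XFS requires the backing filesystem to use ftype=1.
    systemctl stop docker
    # Adjust the existing docker-storage-setup configuration accordingly.
    echo 'STORAGE_DRIVER=overlay2' > /etc/sysconfig/docker-storage-setup
    docker-storage-setup
    systemctl start docker
    docker info | grep -i 'storage driver'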