The OpenShift cluster is up and running. I don't plan to re-install it unless a new severe problem turns up for Bora/KatrinDB. The system seems fine for ADEI; you can take a look at http://adei-katrin.kaas.kit.edu

So, what we have:

- Automatic kickstart of the servers. This is normally done over DHCP. Since I don't have direct control over DHCP, I made a simple system to kickstart over IPMI. Scripts instruct the servers to boot from a virtual CD and fetch the kickstart from the web server (ufo.kit.edu, actually). The kickstart is generated by a PHP script based on the server name and/or DHCP address.

- Ansible-based playbooks to configure the complete cluster. Kickstart produces minimal systems with an SSH server up and running. From there, I have scripts to build the complete cluster, including the databases and the ADEI installation. There are also playbooks for maintenance tasks like adding new nodes, etc.

- Upgrades do not always (actually, rarely) run smoothly. To test new configurations before applying them to the production system, I also support provisioning of a staging cluster in Vagrant-controlled virtual machines. This currently runs on ipepdvcompute3.

- Replicated GlusterFS storage and some security provisions to prevent containers in one project from destroying data belonging to another. A selected subset of the data can be made available over NFS to external hosts, but I'd prefer not to overuse this possibility.

- To simplify creating containers with complex storage requirements (like ADEI), there are also Ansible scripts to generate OpenShift templates based on the configured storage and the provided container specification.

- To ensure data integrity, database engines do a lot of locking, syncing, and small writes. This does not play well with network file systems like GlusterFS. It is possible to tune database parameters a bit and run databases with a low write intensity, but this is unsuitable for large and write-intensive workloads.
The alternative is to store data directly on local storage and use the replication engine of the database itself to ensure high availability. I have prepared containers to quickly bootstrap two options: Master-Master replication with Galera and standard MySQL Master-Slave replication. Master/Slave replication is asynchronous and therefore significantly faster, and I use it as a good compromise. It will take about 2 weeks to re-cache all Katrin data. Quite long, but it would be even longer with the other options. If the master server crashes, users will still have access to all the historical archives and will be able to proxy data requests to the source database (i.e. BORA will also work). And there is no need to re-cache everything, as the slave can easily be converted to a master. Master/Master replication is about 50% slower, but can still be used for smaller databases if uninterrupted writes are also crucial.

- Distributed ADEI. All setups are now completely independent (they use different databases, can be stopped and started independently, etc.). Each setup consists of 3 main components:

1) Frontends: There are 3 frontends: production, debugging, and logs. They are accessible individually, like:
   http://adei-katrin.kaas.kit.edu
   http://adei-katrin-debug.kaas.kit.edu
   http://adei-katrin-logs.kaas.kit.edu
   * The production frontend can be scaled to run several replicas. This is not required for performance, but if 2 copies are running, there will be no service interruption if one of the servers crashes. Otherwise, there is an outage of a dozen minutes while OpenShift detects that the node is gone for good.

2) Maintenance: There are cron-style containers performing various maintenance tasks. In particular, they analyze the current data source configuration and schedule the caching.

3) Cachers: The caching is performed by 3 groups of containers. One is responsible for current data, the second for archives, and the third for logs. Each group can be scaled independently. I.e.
in the beginning I run multiple archive-caching replicas to get the data in; then the focus shifts to getting current data faster. It also may differ significantly between setups: Katrin will run multiple caching replicas, but less important or small data sources will get only one.
   * This architecture also allows removing the 1-minute update latency. I am not sure we can be very fast with the large Katrin setup on the current minimalistic cluster, but technically the updates can be as fast as the hardware allows.

- There is an OpenShift template to instantiate all these containers in one go by providing a few parameters. The only requirement is to have a 'setup' configuration in my repository. I also included in the ADEI sources a bunch of scripts to start all known setups with preconfigured parameters and to perform various maintenance tasks. Like:
   ./provision.sh -s katrin      - to create and launch the katrin setup on the cluster
   ./stop-caching.sh -s katrin   - to temporarily stop caching
   ./start-caching.sh -s katrin  - to restart caching with the pre-configured number of replicas
   ...

- Load balancing and high availability using 2 ping-pong IPs. By default, the katrin1 & katrin2 IPs are assigned to both masters of the cluster. To balance load, kaas.kit.edu is resolved in round-robin fashion to one of them. If one master fails, its IP will migrate to the remaining master and no service interruption will occur. Both masters run OpenVPN tunnels to the Katrin network. The remaining node routes through one of the masters. This configuration is also highly available and should not suffer if one of the masters crashes.

What we are still missing:

- The Katrin database. Marco has prepared containers using the prototype I ran in the last years. Hopefully, Jan can run it on the new system with a minimal number of problems.
- BORA still needs to be moved.
- Then, I will decommission the old Katrin server.
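For reference, the Master/Slave bootstrap described above amounts to the standard MySQL asynchronous replication setup. A rough sketch follows; the host name, replication user, password, and binlog coordinates are placeholders, not the actual cluster values (the real configuration lives in the prepared containers):

```shell
# --- my.cnf fragment on the master: unique server id + binary log ---
# [mysqld]
# server-id = 1
# log_bin   = mysql-bin

# --- my.cnf fragment on the slave: unique server id, writes disabled ---
# [mysqld]
# server-id = 2
# read_only = ON

# On the slave: point it at the master and start asynchronous replication
mysql -e "CHANGE MASTER TO
            MASTER_HOST='mysql-master', MASTER_USER='repl',
            MASTER_PASSWORD='<secret>',
            MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
          START SLAVE;"

# If the master crashes, promote the slave instead of re-caching everything
mysql -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = OFF;"
```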
Fault-tolerance and high-availability
=====================================

- I have tested fault tolerance and recoverability a bit. Both GlusterFS and OpenShift work fine if a single node fails. All data is available and new data can be written without problems. There is also no service interruption, as ADEI runs two frontends and also includes a backup MySQL server. Only caching may stop if the master MySQL server is hit.

- If a node recovers, it will be re-integrated automatically. We may only need to manually convert the MySQL slave to a master. Adding replacement nodes also works quite easily using the provided provisioning playbooks, but the Gluster bricks need to be migrated manually. I also provide some scripts to simplify this task.

- The situation is worse if the cluster is completely turned off and on again. The storage survives quite well, but it is necessary to check that all volumes are fully healthy (sometimes a volume loses some bricks and needs to be restarted to reconnect them). Also, some pods running before the reboot may fail to start. Overall, it is better to avoid this. If a reboot is needed for some reason, the best approach is a rolling reboot, restarting one node after another to keep the cluster alive at all times.

Performance
===========

I'd say it is more or less on par with the old system (which is expected). The computing capabilities are quite a bit faster (though there is still a significant load on the master servers just to run the cluster), but network and storage have more or less the same speed.

- In fact, we have only a single storage node. The second is used for replication, and the third node is required for arbitration in split-brain cases. I would be able to use the third node for storage as well, but for that I need at least a 4th node in the cluster. The new drives are slightly faster, but the added replication slows down performance considerably.

- The containers can't use the Infiniband network efficiently. Bridging is used to allow fast networking in containers.
Unfortunately, IPoIB is a high-level network layer and does not provide the Ethernet L2 support required to create the bridge. Consequently, there is a lot of packet re-building going on, and the network performance is capped at about 3 Gbit/s for containers. It is not really a big problem now, as the host systems are not limited, so the storage is able to use the full bandwidth.

Upgrades
========

We may need to scale the cluster slightly if it is to be used beyond Katrin needs or with a significant increase of load by Katrin. Having 1-2 more nodes should help the storage system. It may also be worth adding a 40 Gbit Ethernet switch. The Mellanox cards work in both Ethernet and Infiniband modes. Their switch actually does too, but they want 5 kEUR for the license to enable this feature. I guess for this money we should be able to buy a new piece of hardware.

With more storage nodes, we could also prototype new data management solutions without disturbing Katrin tasks. One idea would be to run Apache Cassandra and try to re-use the data broker developed by the university group to get ADEI data into their cluster. Then we could add an analysis framework on top of ADEI using Jupyter notebooks with Cassandra. Furthermore, there is alpha-version support for NVIDIA GPUs in OpenShift, so we could also try to integrate some computing workloads and potentially run WAVe inside. I'd not do it on the production system, but if we get new nodes we may first set up a second OpenShift cluster for testing (a pair of storage nodes + a GPU node) and later re-integrate it with the main one.

As running without shutdowns is pretty crucial, another question is whether we can put the servers somewhere at SCC with reliable power supplies, air conditioning, etc. I guess we can't expect to run without shutdowns in our server room.
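As a footnote to the fault-tolerance section above, the rolling reboot can be sketched roughly as follows. The node names and the volume name are placeholders, and the exact invocations depend on the installed OpenShift/Gluster versions, so treat this as a hedged outline rather than a tested procedure:

```shell
# Reboot nodes one at a time so the cluster stays alive throughout
for node in node1 node2 node3; do
    # Move the pods away so OpenShift reschedules them on the other nodes
    oc adm drain "$node" --ignore-daemonsets
    ssh "$node" reboot

    # Wait until the node reports Ready again, then re-enable scheduling
    until oc get node "$node" | grep -q ' Ready'; do sleep 10; done
    oc adm uncordon "$node"

    # Verify that the Gluster bricks reconnected and healing has finished
    # before touching the next node ('adei' is a placeholder volume name)
    gluster volume status adei
    gluster volume heal adei info
done
```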