Configuration
=============
openshift_ip
    Infiniband IPs for fast communication (also used for the ADEI/MySQL bridge and therefore should reside on the fast network).
openshift_hostname
    The 'cluster' host name. Should match the real host name for certificate validation. So, it should be set if the default IP does not resolve to the host name.
openshift_public_ip
    We may either skip this or set it to our 192.168.26.xxx network. Usage is unclear.
openshift_public_hostname
    I guess it is also used for certificates, but while communicating with external systems.
openshift_master_cluster_hostname
    Internal cluster load-balancer or just a pointer to the master host.
openshift_public_master_cluster_hostname
    The main cluster gateway.

Complex Network
===============
Some things in the OpenShift ansible scripts are still implemented with the assumption that we have a simple network configuration with a single interface communicating with the world. There are several options to change this:
 - openshift_set_node_ip - This variable configures nodeIP in the node configuration. It is needed in cases where node traffic should go over an interface other than the default network interface.
 - openshift_ip - This variable overrides the cluster-internal IP address for the system. Use this when using an interface that is not configured with the default route.
 - openshift_hostname - This variable overrides the internal cluster host name for the system. Use this when the system's default IP address does not resolve to the system host name.

Furthermore, if we use Infiniband, which is not accessible to the outside world, we need to set:
 - openshift_public_ip - Use this for cloud installations, or for hosts on networks using network address translation (NAT).
 - openshift_public_hostname - Use this for cloud installations, or for hosts on networks using network address translation (NAT).

This is, however, not used throughout all system components. Some provisioning code and installed scripts still detect a kind of 'main system IP' to look for the services. This IP is identified either as 'ansible_default_ip' or by code looking for the IP used to send packets over the default route. Ansible, in the end, does the same thing. This plays badly for several reasons:
 - We have keepalived IPs moving between systems. The scripts actually catch these moving IPs instead of the fixed IP bound to the system.
 - There could be several default routes. While this is not a problem in itself, the scripts do not expect it and may fail. For instance, in the script '99-origin-dns.sh' in /etc/NetworkManager/dispatcher.d:
    * def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
      does not expect multiple default routes and will pick an arbitrary one. Then, the check
    * if [[ ${DEVICE_IFACE} == ${def_route_int} ]]; then
      may fail and resolv.conf will not be updated, because the interface that just came up is not on the selected default route, even though it actually does carry one. Furthermore,
    * def_route_ip=$(/sbin/ip route get to ${def_route} | awk '{print $5}')
      is ignorant of keepalived and may bind to a keepalived virtual IP.
   I am not sure the problems are limited to this script; there could be other places with the same logic.

Some details are here:
https://docs.openshift.com/container-platform/3.7/admin_guide/manage_nodes.html#manage-node-change-node-traffic-interface

Hostnames
=========
The Linux host name (uname -a) should match the host names assigned to the OpenShift nodes. Otherwise, the certificate verification will fail. It seems a minor issue as the system continues functioning, but it is better to avoid.
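A quick way to compare them (a minimal sketch; it assumes 'oc' is usable on a master and that the nodes were registered by host name rather than by IP):

    # kernel / FQDN host name on the node itself
    uname -n
    hostname -f

    # names under which the nodes are registered in OpenShift;
    # the NAME column should match the output above
    oc get nodes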
The check can also be performed with etcd:

    etcdctl3 --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379"

Performance
===========
 - Red Hat recommends using Native Container Routing for speeds above 1 Gb/s. A new bridge is connected to the fast fabric and docker is configured to use it instead of the docker0 bridge. While docker0 traffic is routed through the Open vSwitch fabric, the new bridge goes directly. Unfortunately, this does not work with Infiniband: IPoIB is not fully Ethernet-compatible and cannot serve as a slave in bridges.
    * There are projects providing full Ethernet compatibility on top of IPoIB (eipoib) by exposing Ethernet L2 interfaces, but there seems to be no really mature solution ready for production. It also penalizes performance (about 2x).
    * Mellanox cards work in both Ethernet and Infiniband modes. There is no problem selecting the current mode with:
        echo "eth|ib|auto" > /sys/bus/pci/devices/0000\:06\:00.0/mlx4_port1
      However, while the switch supports Ethernet, it requires an additional license costing basically 50% of the original switch price (about 4 kEUR for the SX6018). The license is called UPGR-6036-GW.
 - Measured performance (throughput / latency):
      Standard:                            ~ 3.2 Gb/s        28 us
      Standard (pods on the same node):    ~ 20 - 30 Gb/s    12 us
      hostNet (using cluster IP):          ~ 3.6 Gb/s        23 us
      hostNet (using host IP):             ~ 12 - 15 Gb/s    15 us
      Standard to hostNet:                 ~ 10 - 12 Gb/s    18 us
 - So, I guess the optimal solution is really to introduce a second router for the cluster, but with an Ethernet interface. Then we can reconfigure the second Infiniband adapter to Ethernet mode. The switch to native routing should also be possible on a running cluster with a short downtime. As a temporary solution, we may use hostNetwork.
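For the hostNetwork fallback, a minimal sketch of what is involved (the project 'myproject', the 'default' service account and the deployment config 'myapp' are placeholders for illustration, not names from our setup): the pods must be allowed to use the 'hostnetwork' SCC and the pod template must request the host network:

    # allow pods of the 'default' service account to use the host network
    oc adm policy add-scc-to-user hostnetwork -z default -n myproject

    # enable hostNetwork in an existing deployment config
    oc patch dc/myapp -n myproject \
        -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'

    # depending on the setup, DNS resolution of cluster services from such
    # pods may also require dnsPolicy set to ClusterFirstWithHostNet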