Configuration
=============
openshift_ip                                    Infiniband IP used for fast communication (it is also used for the ADEI/MySQL bridge
                                                and so should reside on the fast network).
openshift_hostname                              The 'cluster' host name. Should match the real host name for certificate validation,
                                                so it should be set if the default IP does not resolve to the host name.
openshift_public_ip                             We may either skip this or set it to our 192.168.26.xxx network. Usage is unclear.
openshift_public_hostname                       I guess this is also for certificates, but used while communicating with external systems.
openshift_master_cluster_hostname               Internal cluster load-balancer or just a pointer to the master host.
openshift_master_cluster_public_hostname        The main cluster gateway.
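
An example inventory entry combining these variables (the host names, IPs, and domain below are
made-up placeholders, not values from a real cluster):

    [masters]
    master.example.com openshift_ip=192.168.213.1 openshift_hostname=master.example.com openshift_public_ip=192.168.26.1 openshift_public_hostname=master.public.example.com

    [OSEv3:vars]
    openshift_master_cluster_hostname=master.example.com
    openshift_master_cluster_public_hostname=openshift.public.example.com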


Complex Network
===============
Some things in the OpenShift ansible scripts are still implemented under the assumption that we have
a simple network configuration with a single interface communicating with the world. There
are several options to change this:
  openshift_set_node_ip - This variable configures nodeIP in the node configuration. This 
  variable is needed in cases where it is desired for node traffic to go over an interface 
  other than the default network interface. 
  openshift_ip - This variable overrides the cluster internal IP address for the system. 
  Use this when using an interface that is not configured with the default route.
  openshift_hostname - This variable overrides the internal cluster host name for the system. 
  Use this when the system’s default IP address does not resolve to the system host name.
Furthermore, if we use Infiniband, which is not accessible to the outside world, we also need to set
  openshift_public_ip - Use this for cloud installations, or for hosts on networks using
  network address translation (NAT).
  openshift_public_hostname - Use this for cloud installations, or for hosts on networks
  using network address translation (NAT).
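
When openshift_set_node_ip is in effect, the resulting address ends up as 'nodeIP' in the node
configuration. Assuming the standard OpenShift 3.x layout, it can be verified on a node with:

    # check which address the node actually advertises for its traffic
    grep nodeIP /etc/origin/node/node-config.yaml
    # expected to show the Infiniband address, e.g.  nodeIP: 192.168.213.3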

 This, however, is not honoured by all system components. Some provisioning code and
installed scripts still try to detect a kind of 'main system IP' to look for the
services. This IP is identified either via the 'ansible_default_ipv4' fact or by code trying
to find the IP used to send packets over the default route. Ansible, in the end, does
the same thing. This works badly for several reasons:
 - We have keepalived IPs moving between systems. The scripts actually catch
 these moving IPs instead of the fixed IP bound to the system.
 - There could be several default routes. While this is not a problem by itself, the scripts do not
 expect it and may fail.
 
For instance, consider the script '99-origin-dns.sh' in /etc/NetworkManager/dispatcher.d:
    * def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
 1) It does not expect multiple default routes and will just pick one of them more or less at random. Then, the
    * if [[ ${DEVICE_IFACE} == ${def_route_int} ]]; then
  check may fail and resolv.conf will not be updated, because the interface that just came up
  is not recognized as being on the default route even though it actually is. Furthermore,
    * def_route_ip=$(/sbin/ip route get to ${def_route} | awk '{print $5}')
 2) is ignorant of keepalived and may bind to a floating keepalived IP instead of the fixed host IP.
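
 The behaviour is easy to reproduce by hand with plain iproute2 commands (the interface name below
 is just an example):
    /sbin/ip route list match 0.0.0.0/0     # may print more than one default route
    /sbin/ip route get 8.8.8.8              # the 'src' field is what such scripts extract; on a
                                            # node currently holding a keepalived VIP it may show
                                            # the VIP instead of the fixed address
    /sbin/ip -4 addr show dev ib0           # fixed (primary) vs. keepalived (secondary) addresses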
 
 But I am not sure the problems are limited to this script. There could be other places with
 the same logic. Some details are here:
 https://docs.openshift.com/container-platform/3.7/admin_guide/manage_nodes.html#manage-node-change-node-traffic-interface

Hostnames
=========
 The Linux host name (uname -a) should match the hostnames assigned to the OpenShift nodes. Otherwise, certificate verification
 will fail. It seems to be a minor issue, as the system continues functioning, but it is better to avoid it. The check can be performed with etcd:
    etcdctl3  --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379"
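 A slightly fuller variant of that check, assuming the same 'etcdctl3' helper is available and
 adding the peer certificate and an explicit subcommand (both seem to be omitted above):
    etcdctl3 --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt \
             --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379" endpoint health
    # the node names registered in OpenShift should match the real host names
    uname -n
    oc get nodes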

Performance
===========
 - Redhat recommends using Native Container Routing for speeds above 1 Gb/s. It creates a new bridge connected to the fast fabric, and docker
 is configured to use it instead of the docker0 bridge. docker0 is routed through the OpenVSwitch fabric, while the new bridge goes directly
 (a rough sketch of the scheme follows this list). Unfortunately, this does not work with Infiniband: IPoIB is not fully Ethernet compatible and cannot be used as a slave in bridges.
  * There are projects providing full Ethernet compatibility (eipoib) with Ethernet L2 interfaces, but there seems to be no really mature
  solution ready for production, and it also penalizes performance (about 2x).
  * Mellanox cards can work in both Ethernet and Infiniband modes. It is no problem to select the current mode with:
     echo "eth|ib|auto" >  /sys/bus/pci/devices/0000\:06\:00.0/mlx4_port1
  However, while the switch supports Ethernet, it requires an additional license costing basically 50% of the original switch price (about
  4 kEUR for an SX6018). The license is called UPGR-6036-GW.
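
 For reference, a rough sketch of the Native Container Routing scheme (bridge name, subnets and
 addresses are illustrative; the authoritative procedure is in the Red Hat documentation):
    # on each node: a dedicated pod subnet behind a local linux bridge (node 1 = 10.1.1.0/24, ...)
    ip link add cbr0 type bridge
    ip addr add 10.1.1.1/24 dev cbr0
    ip link set cbr0 up
    # point docker at the new bridge instead of docker0, e.g. in /etc/sysconfig/docker:
    #   OPTIONS='--bridge=cbr0 --fixed-cidr=10.1.1.0/24'
    # on the router (or on every node): route the other pod subnets via the nodes' fast-network IPs
    ip route add 10.1.2.0/24 via 192.168.213.2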

 - Measured performance (bandwidth / latency):
    Standard:                           ~ 3.2 Gb/s              28 us
    Standard (pods on the same node)    ~ 20 - 30 Gb/s          12 us
    hostNet (using cluster IP)          ~ 3.6 Gb/s              23 us
    hostNet (using host IP)             ~ 12 - 15 Gb/s          15 us
    Standard to hostNet                 ~ 10 - 12 Gb/s          18 us
  
  - So, I guess the optimal solution is really to introduce a second router for the cluster, but with an Ethernet interface. Then, we can
  reconfigure the second Infiniband adapter for Ethernet mode. The switch to native routing should also be possible on a running
  cluster with a short downtime. As a temporary solution, we may use hostNetwork (see the sketch below).
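
  A minimal sketch of the temporary hostNetwork workaround for a pod that needs fast communication
  (pod name and image are placeholders):
     apiVersion: v1
     kind: Pod
     metadata:
       name: fast-io-example
     spec:
       hostNetwork: true          # bypass the SDN, use the node's network stack directly
       containers:
       - name: worker
         image: example/worker:latest
  Note that on OpenShift the pod's service account must be allowed to use hostNetwork by an
  appropriate SCC (e.g. the 'hostnetwork' one).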