From f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Sun, 11 Mar 2018 19:56:38 +0100
Subject: Various fixes before moving to hardware installation

---
 docs/upgrade.txt | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 docs/upgrade.txt

diff --git a/docs/upgrade.txt b/docs/upgrade.txt
new file mode 100644
index 0000000..b4f22d6
--- /dev/null
+++ b/docs/upgrade.txt
@@ -0,0 +1,64 @@
+Upgrade
+-------
+ - The 'upgrade' may break things, causing long cluster outages, or may even require a complete
+   re-install. So far I have found problems with 'kube-service-catalog', but I am not sure the problems
+   are limited to it. Furthermore, we are currently using the 'latest' tag of several docker images
+   (heketi is an example of a critical service on the 'latest' tag). An update may break things.
+
+kube-service-catalog
+--------------------
+ - The update of 'kube-service-catalog' breaks the OpenShift health check
+        curl -k https://apiserver.kube-service-catalog.svc/healthz
+   It complains about 'etcd'. The specific etcd check
+        curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
+   reports that all servers are unreachable.
+
+ - In fact, etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the bug
+   described here:
+        https://github.com/kubernetes/kubernetes/issues/47131
+   The 'oc' queries become extremely slow and the healthz service reports that there are too many
+   connections. Killing the 'kube-service-catalog/apiserver' pod helps for a while, but the problem
+   returns occasionally.
+
+ - The information below is an attempt to understand the reason. In fact, it is a list of things that are
+   NOT the reason. The only solution found so far is to prevent the update of 'kube-service-catalog' by
+   setting
+        openshift_enable_service_catalog: false
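+   * A minimal sketch of where this setting can live (hypothetical example; the actual location depends
+     on how the openshift-ansible inventory and group_vars are organized in the local setup):
+        # group_vars/OSEv3.yml (hypothetical path) - keep the upgrade from touching the service catalog
+        openshift_enable_service_catalog: false
+     The same variable can also be passed directly to the upgrade playbook on the command line (the
+     inventory and playbook paths are placeholders):
+        ansible-playbook -i <inventory> <upgrade playbook> -e openshift_enable_service_catalog=false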
+
+ - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
+   miscommunication between 'apiserver' and/or 'controller-manager' and etcd. Still, the cluster stays
+   operational, so the connection is not completely lost, but it does not work as expected in some
+   circumstances.
+
+ - There are no significant changes. Exactly the same docker images are installed. The only change in
+   '/etc' is the updated certificates used by 'apiserver' and 'controller-manager'.
+   * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
+     'oc adm ca' is used for generation. However, the certificates in this folder are not used directly.
+     They are merely temporary files used to generate 'secrets/service-catalog-ssl', which is used by
+     'apiserver' and 'controller-manager'. The provisioning code is in
+        openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
+     It can't be disabled completely, as the registered 'apiserver_ca' variable is used in install.yml,
+     but the actual generation can be skipped and the old files re-used to generate the secret.
+   * I have tried to modify the role to keep the old certificates. The healthz check was still broken
+     afterwards. So, the certificate update is not the problem (or at least not the sole problem).
+
+ - The 'etcd' cluster seems OK. On all nodes, etcd can be verified using
+        etcdctl3 member list
+   * This command is actually a bash alias which executes
+        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
+     etcd serves two ports: 2379 (clients) and 2380 (peers). One idea was that the second port may have
+     problems. Changing 2379 to 2380 in the command above fails, but it also fails when the cluster is in
+     a healthy state, so this proves nothing.
+   * Another idea was that the certificates are re-generated for the wrong IPs/names and, hence,
+     certificate validation fails, or that the originally generated CA is registered with etcd. This is
+     certainly not the (only) issue, as the problem persists even if we keep the certificates intact.
+     I also verified that the newly generated certificates are identical to the old ones and contain the
+     correct hostnames.
+   * The last idea was that 'asb-etcd' is broken. It complains
+        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+     However, the same error is present in the log directly after install, while the cluster is
+     completely healthy.
+
+ - The networking also does not seem to be an issue. The configuration during install and upgrade is
+   exactly the same. All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are
+   resolved (and reverse-resolved) by the provided dnsmasq server. I.e. ipeshift1 resolves to 192.168.13.1
+   using nslookup, and 192.168.13.1 resolves back to ipeshift1. So, the configuration is indistinguishable
+   from a proper setup with correctly configured DNS.
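+
+ - For reference, both points above (etcd health and name resolution) can be re-checked quickly from any
+   master node. This is only a sketch: 'etcdctl3' is the bash alias described above (it forwards its
+   arguments to etcdctl together with the proper certificates and endpoint), and 'ipeshift1' /
+   192.168.13.1 are just the example names used in this document.
+        # etcd: list the members and check endpoint health (etcdctl API v3)
+        etcdctl3 member list
+        etcdctl3 endpoint health
+        # DNS: forward and reverse resolution served by the local dnsmasq
+        nslookup ipeshift1
+        nslookup 192.168.13.1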