DOs and DONTs
=============
Here we discuss things we should do and things we should not do!
 - Scaling up the cluster is normally painless. Both nodes and masters can be added quickly and without much trouble afterwards.
 - The upgrade procedure may cause problems. The main trouble is that many pods are configured to use the 'latest' tag,
   and the latest versions come with the latest problems (some of the tags can be pinned to an actual version, but finding
   out what is broken and why takes a lot of effort)...
    * Currently, there are problems if 'kube-service-catalog' is updated (see discussion in docs/upgrade.txt). While it seems
      nothing really changes, the connection between apiserver and etcd breaks down (at least for health checks). The
      installation remains pretty much usable, but not in a healthy state. This particular update is blocked by setting
        openshift_enable_service_catalog: false
      Then it is left in the 'Error' state, but can be easily recovered by deleting the pod and allowing the system to
      re-create a new one.
    * However, as the cause is unclear, it is possible that something else will break as time passes and new images are
      released. It is ADVISED to check the upgrade in staging first.
    * During the upgrade other system pods may also get stuck in the 'Error' state (as explained in troubleshooting) and
      block the flow of the upgrade. Just delete them and allow the system to re-create them to continue.
    * After the upgrade, it is necessary to verify that all pods are operational and to restart the ones in the 'Error' state.
 - Re-running the install will break on heketi and it will DESTROY the heketi topology! DON'T DO IT! Instead, individual
   components can be re-installed.
    * For instance, to reinstall 'openshift-ansible-service-broker' use openshift-install-service-catalog.yml
    * There is a way to prevent the plays from touching heketi; we need to define
        openshift_storage_glusterfs_is_missing: False
        openshift_storage_glusterfs_heketi_is_missing: False
      But I am not sure if this is the only major issue.
 - A few administrative tools can cause trouble. Don't run
    * oc adm diagnostics

Failures / Immediate
====================
 - We need to remove the failed node from the etcd cluster
        etcdctl3 --endpoints="192.168.213.1:2379" member list
        etcdctl3 --endpoints="192.168.213.1:2379" member remove
 - Further, the following is required on all remaining nodes if the node is gone forever
    * Delete the node
        oc delete node
    * Remove it also from ETCD_INITIAL_CLUSTER in /etc/etcd.conf on all nodes
    * Remove the failed node from the 'etcdClientInfo' section in /etc/origin/master/master-config.yaml and restart the master API
        systemctl restart origin-master-api.service

Scaling / Recovery
==================
 - A few important points.
    * If we lost the data on a storage node, it should be re-added with a different name (otherwise the GlusterFS recovery
      would be significantly more complicated).
    * If the Gluster bricks are preserved, we may keep the name. I have not tried it, but according to the documentation it
      should be possible to reconnect the node and synchronize. Still, it may be easier to use a new name again to simplify
      the procedure.
    * Simple OpenShift nodes may be re-added with the same name, no problem.
 - Next we need to perform all preparation steps (the --limit should not be applied, as we normally need to update CentOS on
   all nodes to synchronize software versions, list all nodes in the /etc/hosts files, etc.)
        ./setup.sh -i staging prepare
 - The OpenShift scaling is provided as several ansible plays (scale-masters, scale-nodes, scale-etcd).
    * Running 'masters' will also install the configured 'nodes' and 'etcd' daemons.
    * I guess running 'nodes' will also handle the 'etcd' daemons, but I have not checked.
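 - After the scale plays have finished, a quick sanity check is possible. A minimal sketch (assuming the etcdctl3 alias used
   above also provides the v3 'endpoint health' subcommand; the endpoint is the one from the examples above):
        # Verify that the new nodes have joined the cluster and are Ready
        oc get nodes
        # Verify etcd membership and health from one of the etcd hosts
        etcdctl3 --endpoints="192.168.213.1:2379" member list
        etcdctl3 --endpoints="192.168.213.1:2379" endpoint health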
Problems
--------
 - There should be no problems if a simple node crashed, but things may go wrong if one of the masters crashes. And things
   definitely will go wrong if the complete cluster is cut from power.
    * Some pods will get stuck pulling images. This happens if the node running docker-registry has crashed and persistent
      storage was not used to back the registry. It can be fixed by re-scheduling the build and rolling out the latest
      version from the dc.
        oc -n adei start-build adei
        oc -n adei rollout latest mysql
      OpenShift will trigger the rollout automatically after some time, but it will take a while. The builds, it seems, have
      to be started manually.
    * In case of a long outage some CronJobs will stop executing. The reason is some protection against excessive loads and
      missing defaults. The fix is easy: just set how much time the OpenShift scheduler allows a CronJob to start before
      considering it failed:
        oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 10 }}'
 - If we forgot to remove the old host from the etcd cluster, the OpenShift node will be configured, but etcd will not be
   installed. We then need to remove the node as explained above and run the scale play for the etcd cluster.
    * On multiple occasions, the etcd daemon has failed after a reboot and needed to be restarted manually. If half of the
      daemons are broken, 'oc' will block.

Storage / Recovery
==================
 - Furthermore, it is necessary to add the glusterfs daemons on the new storage nodes. This is not performed automatically by
   the scale plays. The 'glusterfs' play should be executed with additional options specifying that we are just
   re-configuring nodes. We can check if all pods are serviced
        oc -n glusterfs get pods -o wide
   Both the OpenShift and etcd clusters should be in a proper state before running this play. Fixing and re-running should
   not be an issue.
 - More details:
        https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html

Heketi
------
 - With heketi things are straightforward: we need to mark the node as broken and heketi will automatically move its bricks
   to other servers (as it sees fit).
    * Accessing heketi
        heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"
    * Getting the required ids
        heketi-cli topology info
    * Removing the node
        heketi-cli node info
        heketi-cli node disable
        heketi-cli node remove
    * That's it. A few self-healing daemons are running which should bring the volumes in order automatically.
    * The node will still persist in the heketi topology as failed, but will not be used ('node delete' could potentially
      destroy it, but it is failing).
 - One problem with heketi: it may start volumes before the bricks are ready. Consequently, it may run volumes with several
   bricks offline. This should be checked and fixed by restarting the affected volumes.

KaaS Volumes
------------
There are two modes.
 - If we migrated to a new server, we need to migrate the bricks (force is required because the source brick is dead and the
   data can't be copied); a hypothetical sketch is given below.
        gluster volume replace-brick commit force
    * There are healing daemons running and nothing else has to be done.
    * There are a play and scripts available to move all bricks automatically.
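    * A hypothetical sketch of such a migration (the volume name, server names, and brick paths are placeholders, not taken
      from the actual topology):
        # Move the brick of volume 'kaas-volume' from the dead server to the new one
        gluster volume replace-brick kaas-volume \
            ipeshift2:/mnt/kaas/kaas-volume \
            ipeshift4:/mnt/kaas/kaas-volume \
            commit force
        # Check that the self-heal daemon picks up the new brick
        gluster volume heal kaas-volume info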
 - If we kept the name and the data is still there, it should also be relatively easy to perform the migration (not
   checked). We should also have backups of all this data.
    * Ensure Gluster is not running on the failed node
        oadm manage-node ipeshift2 --schedulable=false
        oadm manage-node ipeshift2 --evacuate
    * Verify the gluster pod is not active. It may be running, but not ready. This can be double-checked with 'ps'.
        oadm manage-node ipeshift2 --list-pods
    * Get the original peer UUID of the failed node (by running on a healthy node)
        gluster peer status
    * And create '/var/lib/glusterd/glusterd.info' similar to the one on the healthy nodes, but with the found UUID.
    * Copy the peers from the healthy nodes to /var/lib/glusterd/peers. We need to copy from 2 nodes, as a node does not
      hold the peer information on itself.
    * Create the mount points and re-schedule the gluster pod. See more details at
        https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts
    * Start healing
        gluster volume heal VOLNAME full
 - However, if the data is lost, it is quite complicated to recover using the same server name. We should rename the server
   and use the first approach instead.

Scaling
=======
We currently have several assumptions which will probably not hold true for larger clusters.
 - Gluster
   To simplify matters we just reference the servers in the storage group manually.
   Arbiter may work for several groups and we should define several brick paths in this case; an illustration is given below.
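   As an illustration of the arbiter case, at the Gluster level such a volume is created by listing several brick paths
   explicitly. A minimal sketch (the volume name, servers, and paths are placeholders, not the actual configuration):
        # Two data bricks plus one arbiter brick (the arbiter stores only file names and metadata)
        gluster volume create example-volume replica 3 arbiter 1 \
            ipeshift1:/mnt/data/example-volume \
            ipeshift2:/mnt/data/example-volume \
            ipeshift3:/mnt/arbiter/example-volume
        gluster volume start example-volume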