Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r-- | docs/troubleshooting.txt | 14 |
1 files changed, 13 insertions, 1 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index 315f9f4..0621b25 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -151,8 +151,17 @@ nodes: domino failures
     * This might continue infinitely as one node is gets disconnected after another, pods get rescheduled, and process never stops
     * The only solution is to remove temporarily some pods, e.g. ADEI pods could be easily removed and, then, provivisioned back
 
-pods: very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
+pods: failed or very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
 ====
+ - LSDF mounts might cause pod-scheduling to fail
+    * It seems OpenShift tries to index (chroot+chmod) files on mount and times out if the LSDF volume has too many small files...
+    * Reducing number of files with 'subPath' doesn't help here, but setting more specific 'networkPath' in pv helps
+    * Suggestion is to remove fsGroup from 'dc' definition, but it is added automatically if pods use network volumes,
+    setting volume 'gid' (cifs mount parameters specified in 'mountOptions' in pv definition) to match fsGroup doesn't help either
+    * Timeout seems to be fixed to 2m and is not configurable...
+    * Later versions of OpenShift have 'fsGroupChangePolicy=OnRootMismatch' parameter, but it is not present in 3.9
+    => Honestly, solution is unclear besides reducing number of files or mounting a small share subset with few files
+
  - OpenShift has numerous problems with clean-up resources after the pods. The problems are more likely to happen on the heavily loaded systems: cpu, io, interrputs, etc.
     * This may be indicated in the logs with various errors reporting inability to stop containers/processes, free network
@@ -450,3 +459,6 @@ Various
  - IPMI may cause problems as well. Particularly, the mounted CDrom may start complaining.
    Easiest is just to remove it from the running system with
         echo 1 > /sys/block/sdd/device/delete
+
+ - 'oc get scc' reports the server doesn't have a resource type "scc"
+    Delete (will be restarted) 'apiserver-*' pod in the 'kube-service-catalog' namespace