Document latest problems with docker images and resource reclamation, add docker performance checks in the monitoring scripts, helpers to filter the logs

author: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
committer: Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
commit: ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree: 2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/logs.txt
parent: efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
download: ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.gz
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.bz2
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.tar.xz
ands-ba144fab071258a97cf3c42a0defeb0aae41a353.zip
1 files changed, 9 insertions, 1 deletions
diff --git a/docs/logs.txt b/docs/logs.txt
index e27b1ff..d33ef0a 100644
--- a/docs/logs.txt
+++ b/docs/logs.txt
@@ -2,6 +2,10 @@
 =================
  - Various RPC errors. 
     ... rpc error: code = # desc = xxx ...
+
+ - PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s
+    This is severe and indicates a communication problem (or at least high latency) with the docker daemon. As a result, the node can be
+    temporarily marked NotReady, which causes eviction of all resident pods.
  
  - container kill failed because of 'container not found' or 'no such process': Cannot kill container ###: rpc error: code = 2 desc = no such process"
     Despite the error, the containers are actually killed and pods destroyed. However, this error likely triggers
@@ -25,10 +29,14 @@
     There are no adverse effects to this.  It is a potential kernel issue, but should be just ignored by the customer.  Nothing is going to break.
         https://bugzilla.redhat.com/show_bug.cgi?id=1425278
 
-
  - E0625 03:59:52.438970   23953 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
     seems fine and can be ignored.
 
+ - E0926 09:29:50.744454   93115 mount_linux.go:172] Mount failed: exit status 1
+   Output: Failed to start transient scope unit: Connection timed out
+    It seems that too many parallel mounts (about 500 per node) may cause systemd to hang.
+    Details: https://github.com/kubernetes/kubernetes/issues/79194
+        * Suggested to use 'setsid' to mount volumes instead of 'systemd-run'
     
 /var/log/openvswitch/ovs-vswitchd.log
 =====================================
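
The log entries documented in this change can be picked out of a node log with a small filter. The following is a minimal sketch only, not the helper scripts added by this commit: the names `PATTERNS`, `classify` and `filter_log` are hypothetical, and the "warning" level for the mount timeout is an assumption (the notes above only call the PLEG message severe and the etcd compaction message ignorable).

```python
import re

# Hypothetical pattern table; message fragments are taken from docs/logs.txt.
# Severity of "mount-timeout" is an assumption, the other two follow the notes.
PATTERNS = [
    ("pleg", "severe", re.compile(r"PLEG is not healthy")),
    ("mount-timeout", "warning", re.compile(r"Failed to start transient scope unit")),
    ("etcd-compacted", "ignore", re.compile(r"required revision has been compacted")),
]


def classify(line):
    """Return (name, severity) for the first matching pattern, or None."""
    for name, severity, rx in PATTERNS:
        if rx.search(line):
            return name, severity
    return None


def filter_log(lines, keep=("severe", "warning")):
    """Yield (severity, name, line) for lines whose severity is in keep."""
    for line in lines:
        hit = classify(line)
        if hit is not None and hit[1] in keep:
            yield hit[1], hit[0], line.rstrip("\n")
```

Passing an open log file (or `sys.stdin`) as `lines` yields only the entries worth attention; messages documented as ignorable are dropped unless "ignore" is added to `keep`.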