author    Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
committer Suren A. Chilingaryan <csa@suren.me>    2019-10-06 05:00:55 +0200
commit    ba144fab071258a97cf3c42a0defeb0aae41a353 (patch)
tree      2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/maintenance.txt
parent    efe4b9bbe3c9cb950378de9697eed2030ac49ca2 (diff)
Document latest problems with docker images and resource reclamation; add docker performance checks to the monitoring scripts and helpers to filter the logs
Diffstat (limited to 'docs/maintenance.txt')
-rw-r--r--   docs/maintenance.txt   55
1 file changed, 55 insertions, 0 deletions
diff --git a/docs/maintenance.txt b/docs/maintenance.txt
new file mode 100644
index 0000000..9f52e18
--- /dev/null
+++ b/docs/maintenance.txt
@@ -0,0 +1,55 @@
+Unused resources
+================
+ ! Cleaning of images becomes necessary once the number of resident images grows above 1000. Everything else has not caused
+ problems yet and can be ignored unless it blocks other actions (e.g. the clean-up of old images).
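+ A quick way to check how many images are currently resident on a node (a hedged one-liner for illustration; the actual
+ performance checks live in the monitoring scripts):
+ docker images -q --no-trunc | sort -u | wc -l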
+
+ - Deployments. As is, this hasn't caused problems yet, but old versions of 'rc' may block removal of the old images, and this
+ may have a negative impact on performance.
+ oc adm prune deployments --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
+ oc adm prune builds --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
+ * This, however, does not clean old 'rc' controllers which are still allowed by 'revisionHistoryLimit' (and possibly by other
+ settings as well). There is a script included to clean such controllers, 'prunerc.sh'; a sketch of the idea follows below.
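+ A minimal sketch of such a clean-up (an assumption-based illustration, not necessarily what prunerc.sh does; it relies on the
+ OpenShift 3.x label 'openshift.io/deployment-config.name' being set on replication controllers):
+ keep=3
+ oc get dc --all-namespaces -o go-template='{{range .items}}{{.metadata.namespace}} {{.metadata.name}}{{"\n"}}{{end}}' | \
+ while read ns dc; do
+     # list the rc revisions of this dc oldest-first and drop all but the newest $keep
+     oc -n "$ns" get rc -l "openshift.io/deployment-config.name=$dc" --sort-by=.metadata.creationTimestamp -o name | \
+         head -n -"$keep" | while read rc; do
+             # only delete controllers which are already scaled down to zero replicas
+             [ "$(oc -n "$ns" get "$rc" -o jsonpath='{.spec.replicas}')" = "0" ] && oc -n "$ns" delete "$rc"
+         done
+ done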
+
+ - OpenShift sometimes fails to clean stopped containers. These containers, again, may block removal of images (and, if they
+ accumulate, will likely cause Docker performance penalties on their own).
+ * The lost containers can be identified by looking into /var/log/messages for lines like
+ PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+ * We can find and remove the corresponding container (the short id is just the first characters of the long id)
+ docker ps -a | grep aa28e9c76
+ docker rm <id>
+ * In general, however, any container which remains in the stopped state for a long time can be considered lost. We can remove
+ all of them, or just the ones related to a specific image (if we are cleaning images and something blocks deletion of an old
+ version); a filter-based variant is sketched below.
+ docker rm $(docker ps -a | grep Exited | grep adei | awk '{ print $1 }')
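+ A hedged variant relying on docker's own filters instead of grep ('ancestor' selects containers of a given image; 'adei' is
+ just the example image from above):
+ docker ps -aq --filter status=exited --filter ancestor=adei | xargs -r docker rm
+ The sandbox ids reported in /var/log/messages can be collected with a small helper like (GNU grep with PCRE assumed):
+ grep -oP 'PodSandbox "\K[0-9a-f]+' /var/log/messages | sort -u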
+
+ - When cleaning containers manually and/or forcing termination of pods, some remnants may be left in '/var/lib/origin/openshift.local.volumes/pods'
+ * This probably can happen in other cases as well. It can be detected by looking in /var/log/messages for something like
+ Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+ * If unknown, the location of the pod in question can be found with 'find . -name heketi*' or similar (the container names are
+ listed under this subdirectory, so they can be used in the search)...
+ * There may be problematic mounts which can be freed with a lazy umount ('umount -l')
+ * The folders of the removed pods may (and should) be removed; a sketch automating these steps follows below.
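+ A sketch of the clean-up (a hedged illustration, not an official procedure: it compares the pod UIDs on disk against the pods
+ known to the cluster, so make sure 'oc get pods --all-namespaces' really covers everything running on this node before
+ removing anything):
+ known=$(oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}')
+ cd /var/lib/origin/openshift.local.volumes/pods
+ for uid in *; do
+     if ! echo "$known" | grep -q "$uid"; then
+         # free possibly stuck mounts with a lazy umount before removing the folder
+         grep "$uid" /proc/mounts | awk '{ print $2 }' | xargs -r -n1 umount -l
+         echo rm -rf "$uid"    # drop the 'echo' once the printed list looks sane
+     fi
+ done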
+
+ - Pruning unused images. This is required: if a large number of images accumulates, additional latencies in the communication
+ with the docker daemon are introduced, resulting in severe penalties to scheduling performance. The official way to clean unused images is
+ oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
+ * This, however, will keep all images referenced by existing bc, dc, rc, and pods (see above). So, it is worth cleaning OpenShift
+ resources before proceeding with images. If the images still don't go away, it is also worth trying to clean the orphaned containers.
+ * Some images may also be orphaned by the OpenShift infrastructure. OpenShift supports 'hard' pruning to handle such images.
+ https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
+ First check if something needs to be done:
+ oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=check
+ If there are many orphans, the hard pruning can be executed. This requires additional permissions
+ for the service account running docker-registry
+ service_account=$(oc get -n default -o jsonpath=$'system:serviceaccount:{.metadata.namespace}:{.spec.template.spec.serviceAccountName}\n' dc/docker-registry)
+ oc adm policy add-cluster-role-to-user system:image-pruner ${service_account}
+ and should be done with the docker registry in read-only mode (this requires a restart of the default/docker-registry containers)
+ oc env -n default dc/docker-registry 'REGISTRY_STORAGE_MAINTENANCE_READONLY={"enabled":true}' # wait until new pods rolled out
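+ oc -n default rollout status dc/docker-registry # a hedged way to wait: blocks until the re-deployment completes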
+ oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=delete
+ oc env -n default dc/docker-registry REGISTRY_STORAGE_MAINTENANCE_READONLY-
+
+ - Cleaning old images which don't want to go.
+ * Investigating the image streams and manually deleting the old versions of the images (a helper listing the versions by age
+ is sketched at the end of this section)
+ oc get is adei -o yaml
+ oc delete image sha256:04afd4d4a0481e1510f12d6d071f1dceddef27416eb922cf524a61281257c66e
+ * Cleaning old dangling images using docker (on all nodes). This has been tried and, as far as it seems, caused no issues for
+ the operation of the cluster.
+ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
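+ To pick candidates for 'oc delete image', the digests referenced by an image stream can be listed oldest-first with a helper
+ like the following (hedged; it assumes the usual OpenShift 3.x image stream status layout):
+ oc get is adei -o jsonpath=$'{range .status.tags[*].items[*]}{.created}\t{.image}\n{end}' | sort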