Automatic merge from submit-queue.

Add the ability to specify a timeout for node drain operations

A timeout for draining pods from nodes can be specified to ensure that the
upgrade continues even if nodes fail to drain their pods in the allowed time.
The default value of 0 waits indefinitely, allowing the admin to investigate
the root cause and ensuring that disruption budgets are respected. In practice
the `oc adm drain` command will eventually error out; at least that is what we
have seen in our large online clusters. When that happens, a second attempt is
made to drain the node. If that fails too, the upgrade is aborted for that
node or for the entire cluster, depending on your defined
`openshift_upgrade_nodes_max_fail_percentage`.

`openshift_upgrade_nodes_drain_timeout=0` is the default and waits until all
pods have been drained successfully.

`openshift_upgrade_nodes_drain_timeout=600` waits 600s before moving on to the
tasks which forcefully stop pods, such as stopping docker, node, and
openvswitch.
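A minimal inventory sketch of how these variables fit together (the percentage
value is an arbitrary illustration, not a recommendation):

    [OSEv3:vars]
    # Give each node up to 600s to drain; 0 (the default) waits indefinitely.
    openshift_upgrade_nodes_drain_timeout=600
    # Abort the upgrade once this share of nodes fails to drain and upgrade.
    openshift_upgrade_nodes_max_fail_percentage=10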
Automatic merge from submit-queue.

Remove last of openshift_node role meta-depends

Remove the last non-taskless meta-depends from the openshift_node role.
Remove the last non-taskless meta-depends from the openshift_node role.

Remove the variable 'openshift_node_upgrade_in_progress' as it is no
longer used.
In Ansible 2.2, the include_role directive came into existence as a Tech
Preview. It is still a Tech Preview through Ansible 2.4 (and in the current
devel branch), but with a notable change: the default behavior switched from
static: true to static: false, because that functionality moved to the newly
introduced import_role directive (in order to stay consistent with `include*`
being dynamic in nature and `import*` being static in nature).

The dynamic include is considerably more memory intensive, as it dynamically
creates a role import for every host in the inventory. (Also worth noting: at
the time of this writing there is an object allocation inefficiency in the
dynamic include that can, in certain situations, amplify this effect
considerably.)

This change is meant to mitigate the pressure on memory for the Ansible
control host.

We need to evaluate where it makes sense to dynamically include roles and
revert back to dynamic inclusion if and where it makes sense to do so.
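For reference, the two forms being contrasted look like this at the task level
(the role name is taken from this repo; the snippet is only a sketch):

    # Static import: resolved once at parse time; easy on control-host memory.
    - import_role:
        name: openshift_node

    # Dynamic include: evaluated per host at run time, so a large inventory
    # multiplies the memory cost on the Ansible control host.
    - include_role:
        name: openshift_node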
We set these variables using facts during init; there is no need to
duplicate the logic throughout the codebase.
This commit moves the pulling of images and packages and the updating of
config files into a non-serialized play. The serialized play is now in charge
of marking nodes unschedulable, draining, stopping and restarting services,
and marking nodes schedulable again.

If the rpm install / container download takes 60s per host, this saves 3 hours
and 10 minutes at 200 hosts per cluster and forks of 20 hosts.
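The arithmetic behind that claim, assuming the old serialized play ran one
host at a time: 200 hosts x 60s serially is 12,000s (3h20m), while the same
pulls in a non-serialized play bounded by 20 forks take ceil(200/20) x 60s =
600s (10m), a difference of 3h10m. A rough sketch of the resulting play layout
(play names and task bodies are illustrative, not the commit's exact code):

    # Non-disruptive work runs unserialized, so up to 20 forks proceed at once.
    - name: Pre-pull images and packages, update config files
      hosts: oo_nodes_to_upgrade
      roles:
        - openshift_node

    # Disruptive work stays serialized so only one node is down at a time.
    - name: Drain, restart services, and mark schedulable again
      hosts: oo_nodes_to_upgrade
      serial: 1
      tasks:
        - name: Mark node unschedulable
          command: oc adm manage-node {{ openshift.node.nodename }} --schedulable=false
          delegate_to: "{{ groups.oo_first_master.0 }}"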
This commit removes openshift.common.service_type
in favor of openshift_service_type.
This commit also removes r_openshift_excluder_service_type
from plays in favor of using the role's defaults.
Replace `oadm` with `oc adm`
Switch to import_role for some required roles.
Currently, having openshift_node and openshift_node_upgrade as two distinct
roles has created duplication across handlers, templates, and some tasks.

This commit combines the roles to reduce that duplication and the bugs
encountered when code is not put in both places.
Drain is still pending in the files below without the fix:

playbooks/common/openshift-cluster/upgrades/docker/docker_upgrade.yml
playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml

Signed-off-by: jkaurredhat <jkaur@redhat.com>
Disable/reset excluders over requested hosts
Containerized upgrades of openvswitch are already handled by updating
the container images and pulling them again.
Wait for nodes to be ready before proceeding with upgrade.
Near the end of node upgrade, we now wait for the node to report Ready before
marking it schedulable again. This should help eliminate delays when pods need
to relocate as the next node in line is evacuated.

This happens near the end of the process; the only remaining task is to mark
the node schedulable again, so a node that never comes back Ready is easy for
admins to detect and recover from.
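A hedged sketch of what such a wait can look like in Ansible (the retry counts
and jsonpath are assumptions, not the commit's exact values):

    - name: Wait for node to report Ready
      command: >
        oc get node {{ openshift.node.nodename }}
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      delegate_to: "{{ groups.oo_first_master.0 }}"
      register: node_ready
      until: node_ready.stdout == 'True'
      retries: 30
      delay: 5

    - name: Mark node schedulable again
      command: oc adm manage-node {{ openshift.node.nodename }} --schedulable=true
      delegate_to: "{{ groups.oo_first_master.0 }}"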
Closes #3070
* https://trello.com/c/TeaEB9fX/307-3-deprecate-node-evacuation
* Added YAML linting checks to `make ci`
* Modified y(a)ml files to pass the lint checks
In 3.3 one of our services lays down a systemd drop-in configuring Docker
networking to use lbr0. In 3.4 this has changed, but the file must be cleaned
up manually by us. However, after removing the file Docker requires a restart.
This had big implications, particularly in containerized environments, where
an upgrade is a very fragile series of service upgrades and restarts.

To avoid double Docker restarts, and thus double service restarts in
containerized environments, this change does the following:

- Skip the restart during Docker upgrade, if one is required. We will restart
  on our own later.
- Skip containerized service restarts when we upgrade the services themselves.
- Cleanly shut down all containerized services.
- Restart Docker. (Always; previously this only happened if Docker needed an
  upgrade.)
- Ensure all containerized services are restarted.
- Restart rpm node services. (Always.)
- Mark the node schedulable again.

At the end of this process, docker0 should be back on the system.
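The cleanup step itself might look like the sketch below; the drop-in path is
an assumption for illustration:

    # Remove the 3.3-era drop-in that pointed Docker networking at lbr0.
    # Docker is restarted unconditionally later in the play, which is what
    # avoids the double restart described above, so no handler is notified.
    - name: Remove obsolete Docker networking drop-in
      file:
        path: /etc/systemd/system/docker.service.d/docker-sdn-ovs.conf
        state: absent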
containerized.
This reverts commit 1f2276fff1e41c1d9440ee8b589042ee249b95d7.
Found a bug syncing binaries to containerized hosts: if a symlink was
pre-existing but pointed to the wrong destination, it was not corrected.

Switched to using oc adm instead of oadm.
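A sketch of the symlink fix, assuming Ansible's file module (the paths are
illustrative):

    - name: Ensure the oc symlink points at the synced binary
      file:
        src: /usr/local/bin/openshift
        dest: /usr/local/bin/oc
        state: link
        force: yes   # also corrects a pre-existing link with the wrong target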
This can fail with a transient "object has been modified" error asking
you to re-try your changes on the latest version of the object.
Allow up to three retries to see if we can get the change to take
effect.
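The retry pattern might look like this; the specific command is illustrative,
since the message only says the change is retried up to three times:

    - name: Mark node schedulable
      command: oc adm manage-node {{ openshift.node.nodename }} --schedulable=true
      delegate_to: "{{ groups.oo_first_master.0 }}"
      register: node_sched
      until: node_sched.rc == 0
      retries: 3
      delay: 1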
This improves the situation further and prevents configuration changes from
accidentally triggering Docker restarts before we have evacuated nodes. In two
places we now skip the role entirely, instead of the previous implementation,
which only skipped upgrading the installed version (and therefore did not
catch config issues).