Status
======
 - RAID controller failed on ipekatrin1
 - The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation)
 - ipekatrin1 was temporarily converted into a master-only node (apps scheduling disabled, glusterfs stopped)
 - Heketi and gluster-blockd were disabled and will not be available further. Existing heketi volumes are preserved.
 - New disks (from ipekatrinbackupserv1) were assembled into the RAID, added to gluster, and a manual (file walk-through) healing
   is being executed. Expected to take about 2-3 weeks (at a rate of about 2TB per day). No LVM configured, direct mount.
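   A minimal sketch of what this walk-through and its monitoring could look like (assuming a fuse client mount of the volume at a placeholder path /mnt/<vol>):
     find /mnt/<vol> -noleaf -print0 | xargs --null stat > /dev/null   # stat every file to trigger self-heal of walked entries
     gluster volume heal <vol> info                                    # list entries still pending heal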
 - The application node will be recovered once we replace the system SSDs with larger ones (as there is currently no space for images/containers)
   and I don't want to put it on the new RAID.

Recovery Logs
=============
 2025.09.28
    - ipekatrin1:
        * RAID controller doesn't see 10 disks and behaves erratically.
        * Turned off the server and ordered a replacement.
    - Storage:
        * Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
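          A quick way to confirm that the remaining peers and bricks are online could be (sketch, run on one of the surviving nodes):
            gluster peer status
            gluster volume status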
        * Turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So, that data is gone.
        * Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume. It can be turned off.

 2025.10.23
    - ipekatrin1: 
        * Replaced the RAID controller. Made an attempt to rebuild, but disks are disconnected after about 30-40 minutes (recovered after a shutoff, not a reboot)
        * Checked for power issues: cabling bypassing the PSU and monitoring voltages (the 12V rail should not go below 11.9V). No change, voltages seemed fine.
        * Checked for cabling issues by disconnecting first one cable and then the other (supported mode, a single cable connects all disks). No change
        * Tried to improve cooling by setting fan speeds to maximum (kept) and even temporarily installing an external cooler. Radiators were cool, also checked reported temperatures. No change, still goes down in 30-40 minutes.
        * Suspect backplane problems. The radiators were quite hot before adjusting the cooling. There seem to be known stability problems due to bad signal management in the firmware if overheated. Firmware updates are suggested to stabilize it.
        * No support from SuperMicro. Queried Tootlec about the possibility of getting a firmware update and/or ordering a backplane [Order RG_014523_001_Chilingaryan from 16.12.2016, Angebot 14.10, Contract: 28.11]
          Hardware: Chassis CSE-846BE2C-R1K28B, Backplane BPN-SAS3-846EL2, 2x MCX353A-FCB ConnectX-3 VPI
        * KATRINBackupServ1 (3 years older) has a backplane with enough bays to mount the disks. We still need to be able to fit the RAID card and the Mellanox ConnectX-3 board/boards with 2 ports (can live with 1).
    - ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
        * No apparent problems at the moment. Temperatures are all in order. Battery reports healthy. System works as usual.
        * Set up temperature monitoring of the RAID card, currently 76-77C
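          One way this could be polled periodically (a sketch, assuming the MegaRAID storcli utility is installed; controller index 0 is a placeholder):
            storcli64 /c0 show temperature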

 2025.10.27
    - ipekatrin1:
        * Disconnected all disks from the server and started preparing it as an application node
    - Software:
        * I have temporarily suspended all ADEI cronJobs to avoid resource contention on ipekatrin2 (as a restart would be dangerous now) [clean (logs, etc.) / maintain (re-caching, etc.) / update (detecting new databases)]
    - Research:
        * DaemonSet/GlusterFS selects nodes based on the following nodeSelector
            $ oc -n glusterfs get ds glusterfs-storage -o yaml | grep -B 5 -A 5 nodeSelector 
                  nodeSelector:
                    glusterfs: storage-host
          All nodes have corresponding labels in their metadata:
            $ oc get node/ipekatrin1.ipe.kit.edu --show-labels -o yaml | grep  -A 20 labels:
                  labels:
                    ...
                    glusterfs: storage-host
                    ...
        * That label is now removed from ipekatrin1 and should be restored if we bring storage back
            oc label --dry-run node/ipekatrin1.ipe.kit.edu glusterfs-
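          The label could later be restored with something like (sketch; the value matches the nodeSelector above):
            oc label node/ipekatrin1.ipe.kit.edu glusterfs=storage-host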
        * We further need to remove 192.168.12.1 from 'endpoints/gfs' (per namespace) to avoid possible problems.
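          A possible way to locate and then fix the stale IP across namespaces (sketch; the grep pattern is just the node's storage IP):
            oc get endpoints --all-namespaces | grep 192.168.12.1
            oc -n <namespace> edit endpoints/gfs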
        * On ipekatrin1, the /etc/fstab glusterfs mounts should be changed from 'localhost' to some other server (or commented out altogether),
          probably just 192.168.12.2 as it is the only host containing the data and going via an intermediary makes no sense, e.g.
            192.168.12.2,192.168.12.3:<vol>  /mnt/vol  glusterfs  defaults,_netdev  0 0
        * All RAID volumes should also be temporarily commented out in /etc/fstab and systemd
            systemctl list-units --type=mount | grep gluster
        * Further configuration changes are required to run the node without glusterfs without causing damage to the rest of the system
            GlusterFS might be referenced via: /etc/hosts, /etc/fstab, /etc/systemd/system/*.mount, /etc/auto.*, scripts/cron,
                endpoints (per namespace), inline gluster volumes in PVs (global),
                gluster-block endpoints / tcmu gateway list, sc (heketi storageclass) and controllers (ds, deploy, sts); just in case check heketi cm/secrets
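          A quick grep sweep covering most of these places could look like (sketch; patterns and paths are illustrative only):
            grep -ril gluster /etc/hosts /etc/fstab /etc/auto.* /etc/systemd/system /etc/cron*
            oc get pv -o yaml | grep -i -B 2 -A 2 gluster
            oc get endpoints --all-namespaces | grep -i gluster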
    - Plan:
        * Prepare application node [double-check before implementing]
            + Adjust node label
            + Edit 'gfs' endpoints in all namespaces.
            + Check glusterblock/heketi, strange pv's.
            + Check Ands monitoring & maintenance scripts
            + Adjust /etc/fstab and check systemd-based mounts. Shall we do something with /etc/hosts?
            + /etc/nfs-ganesha on ipekatrin1 & ipekatrin2
            + Check/change cron & monitoring scripts
            + Check for backup scripts; backups are probably written onto the RAID.
            + Grep in OpenShift configs (and /etc globally) just in case
            + Google for other possible culprits.
            + Boot ipekatrin1 and check that all is fine
        * cronJobs
            > Set affinity to ipekatrin1. 
            > Restart cronJobs (maybe reduce intervals)
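            Pinning a cronJob to ipekatrin1 could be done with a nodeSelector patch along these lines (sketch; namespace and cronJob name are placeholders):
              oc -n <namespace> patch cronjob/<name> -p '{"spec":{"jobTemplate":{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"ipekatrin1.ipe.kit.edu"}}}}}}}'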
	* Copy cluster backups out
        * ToDo
            > Ideally, eliminate cronJobs altogether for the rest of the KaaS1 lifetime and replace them with a continuously running cron daemon inside a container
            > Rebuild ipekatrinbackupserv1 as a new gluster node (using the disks) and try connecting it to the cluster

 2025.10.28-31
    - Hardware
	* Re-assembled the ipekatrin1 disks in the ipekatrinbackupserv1 backplane using a new LSI 9361-8i RAID controller. The original LSI 9271-8i was removed.
	* Put the old (SAS2) disks from ipekatrinbackupserv1 into ipekatrin1. Imported the RAID configs; the RAID started and seems to work stably with the SAS2 setup.
    - Software
	* Removed glusterfs & fat_storage labels from ipekatrin1.ipe.kit.edu node
	    oc label node/ipekatrin1.ipe.kit.edu glusterfs-
	    oc label node/ipekatrin1.ipe.kit.edu fat_storage-
	* Identified all endpoints used in PVs. No PV hardcodes IPs directly (and it seems unsupported anyway).
	  Edited endpoints: gfs glusterfs-dynamic-etcd glusterfs-dynamic-metrics-cassandra-1 glusterfs-dynamic-mongodb glusterfs-dynamic-registry-claim glusterfs-dynamic-sharelatex-docker
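	  The endpoints referenced by glusterfs PVs could be listed with something like (sketch):
	    oc get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.glusterfs.endpoints}{"\n"}{end}'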
	* Verified that no glusterblock devices are used by pods or externally (no iSCSI devices). Checked that the heketi storageClass can be safely disabled without affecting existing volumes.
	  Terminated heketi/glusterblock services, removed the storageclasses
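	  The iSCSI check could be as simple as (sketch; expected to report no active sessions on each node):
	    iscsiadm -m session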
	* Checked ands-distributed scripts & crons. None refer to gluster. Monitoring checks the RAID status, but this is probably not critical as it would just report an error (which is true).
	* Set the nfs-ganesha cluster nodes to andstorage2 only on ipekatrin1/2 (no active server on ipekatrin3). The service is inactive at the moment.
	  Anyway, double-check that it is disabled on ipekatrin1 on the first boot.
	* Found an active 'block' volume in glusterfs. Checked that it is empty and is not used by any active 'pv'. Stopped and deleted it.
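	  The removal amounts to something like (sketch; 'block' is the volume name found above, both commands ask for confirmation):
	    gluster volume stop block
	    gluster volume delete block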
	* Backups are written to /mnt/provision, which should work in the new configuration. So, no changes are needed.
	* Mount points adjusted.
    - First Boot:
	* Disable nfs-ganesha on first boot on ipekatrin1
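	  For example (sketch, assuming the standard nfs-ganesha systemd unit):
	    systemctl stop nfs-ganesha
	    systemctl disable nfs-ganesha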
	* Verified that glusterfs is not started and gluster mounts are healthy
	* etcd is running and seems healthy
	    ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
	    curl -v --cert /etc/etcd/peer.crt      --key /etc/etcd/peer.key      --cacert /etc/etcd/ca.crt  -s https://192.168.13.1:2379/v2/stats/self
	* origin-master-api and origin-master-controllers are running
	* origin-node and docker failed. /var/lib/docker is on the RAID (mounted /var/lib/docker, but used via an LVM thin pool).
	* Created /var/lib/docker-local for now and configured docker to use overlay2 in /etc/sysconfig/docker-storage
	    DOCKER_STORAGE_OPTIONS="--storage-driver=overlay2 --graph=/var/lib/docker-local"
	* Adjusted selinux contexts
	    semanage fcontext -a -e /var/lib/docker /var/lib/docker-local
	    restorecon -R -v /var/lib/docker-local
	* Infrastructure pods are running on ipekatrin1
	* Checked that the status and monitoring scripts are working [ seems reasonable to me ]
	    > RAID is not optimal and low data space is reported (/mnt/ands is not mounted)
	    > Docker is not reporting available Data/Metadata space (as we are on a local folder)
	* Checked that /var/lib/docker-local space usage is monitored
	    > Via data space usage
    - Problems
	* We have '*-host' pvs bound to /mnt/hostdisk which are used by adei/mysql (nodes 2&3) and as the KATRIN temporary data folder. Currently keeping node1 as master, but scheduling is disabled:
	    oc adm cordon ipekatrin1.ipe.kit.edu
    - Backup
	* Backups from 'provision' volume are taken to 'kaas-manager' VM
    - Monitor
	* Usage in /var/lib/docker-local [ space usage ]
    - ToDo
	* Try building the storage RAID in ipekatrinbackupserv1 (an SFF-8643 to SFF-8087 cable is needed, RAID-to-backplane). Turn it on, check the data is accessible, and turn it off.
	* We shall order a larger SSD for docker (LVM) and KATRIN temporary files (/mnt/hostraid). Once done, uncordon ipekatrin1
	    oc adm uncordon ipekatrin1.ipe.kit.edu
	* We might try building a smaller RAID from the stable disk bays and move the ADEI replica here (discuss!), or a larger one from the SAS2 drives if it proves more stable.
	* We might be able to use an Intel RES2SV240 or LSISAS2x28 expander board to reduce SAS3 to SAS2 speeds...

 2025.11.01-03
    - Document attempts to recover the storage RAID
    - GlusterFS changes and replication