Hardware
--------
2024
- ipekatrin1: Replaced the disk in section 9. The LSI software reports everything is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.
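  * For reference, the controller-side view can be dumped and compared with the front-panel LED; a minimal sketch, assuming the
    LSI MegaCLI tool is installed and its binary is called 'MegaCli64' (binary name and options vary between installations):

        #!/usr/bin/env python3
        # Print the controller's view of every physical drive (state and error
        # counters) so it can be compared with the front-panel LED.
        import subprocess

        out = subprocess.run(["MegaCli64", "-PDList", "-aALL"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith(("Slot Number", "Firmware state",
                                "Media Error Count", "Predictive Failure Count")):
                print(line)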
2025.09 (early in the month)
- ipekatrin2: Replaced 3 disks (don't remember the slots); two of them had already been replaced once before.
- Ordered spare disks
2025.10.23
- ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
  * No apparent problems at the moment. Temperatures are all in order. The battery reports healthy. The system works as usual.
2025.09.28 - 2025.11.03
- ipekatrin1: The RAID controller failed. The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation).
- ipekatrin1: Temporarily converted into a master-only node (application scheduling disabled, GlusterFS stopped).
- ipekatrin1: New disks (from ipekatrinbackupserv1) were assembled into the RAID, added to the Gluster volumes, and a manual heal
  (file walk-through, see the sketch at the end of this entry) was started. Expected to take about 2-3 weeks (roughly 2TB per day).
  No LVM configured, direct mount.
- The application node will be recovered once we replace the system SSDs with larger ones (there is currently no space for
  images/containers and I don't want to put them on the new RAID).
- The original disks from ipekatrin1 are assembled in ipekatrinbackupserv1. The disconnect problem persists: some disks stop answering
  SENSE queries and the backplane then restarts a whole group of 10 disks. Still, all disks are accessible in JBOD mode and can be copied.
  * The XFS filesystem is severely damaged and needs repair. I tried accessing some files via the XFS debugger and it worked, so the
    directory structure and file contents are, at least partially, intact and repair should be possible (a read-only check sketch
    follows at the end of this entry).
  * If recovery becomes necessary: buy 24 new disks, copy them one by one, assemble the RAID, and recover the filesystem.
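  * The manual heal above is just a walk over the client (FUSE) mount: stat'ing every entry makes GlusterFS check it and queue
    self-heal where the bricks differ. A minimal sketch, with an illustrative mount point:

        #!/usr/bin/env python3
        # Walk the whole volume through the client mount and stat every entry;
        # accessing a file via the FUSE mount triggers self-heal if it is out of sync.
        import os

        MOUNT = "/mnt/gluster/katrin"   # illustrative mount point
        count = errors = 0
        for root, dirs, files in os.walk(MOUNT):
            for name in files + dirs:
                try:
                    os.lstat(os.path.join(root, name))
                    count += 1
                except OSError as exc:
                    errors += 1
                    print("failed:", os.path.join(root, name), exc)
        print(f"visited {count} entries, {errors} errors")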
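  * The XFS damage can be assessed read-only before any repair attempt; a minimal sketch, assuming a placeholder device /dev/sdX
    and that xfs_repair is available ('-n' only reports problems, it does not write):

        #!/usr/bin/env python3
        # Dry-run check of the damaged XFS filesystem: xfs_repair -n scans it
        # without modifying anything and reports the problems it would fix.
        import subprocess
        import sys

        DEVICE = "/dev/sdX"   # placeholder for the assembled RAID / copied disk
        result = subprocess.run(["xfs_repair", "-n", DEVICE])
        sys.exit(result.returncode)   # non-zero means corruption was found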
2025.12.08
- Copied the ipekatrin1 system SSDs to new 4TB drives and reinstalled them in the server (only 2TB is usable due to the MBR
  partition-table limit, see the note below).
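  * The 2TB figure is the 32-bit LBA limit of an MBR partition table with 512-byte sectors; a quick check of the arithmetic:

        # MBR stores sector addresses in 32 bits; with 512-byte sectors the addressable
        # capacity is 2**32 * 512 bytes, i.e. 2 TiB (about 2.2 TB).
        max_bytes = 2**32 * 512
        print(max_bytes, "bytes =", round(max_bytes / 10**12, 2), "TB =", max_bytes / 2**40, "TiB")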
Software
--------
2023.06.13
- Instructed the MySQL slave to ignore 1062 (duplicate entry) errors as well (I had skipped a few manually, but the errors kept
  appearing non-stop). A replication status-check sketch is at the end of this entry.
- The ADEI-KATRIN pod also got stuck. The pod was running, but Apache was hung and not replying. This caused the pod to report
  'not ready', but for some reason it was still considered 'live' and the pod was not restarted.
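  * The hang was only caught by the readiness check; a liveness-style probe that actually exercises Apache would have restarted
    the pod. A minimal sketch of such a check (URL and port are illustrative), usable e.g. as an exec liveness probe:

        #!/usr/bin/env python3
        # Exit 0 if Apache answers an HTTP request within the timeout; a hung or
        # erroring worker makes the probe fail and lets the platform restart the pod.
        import sys
        import urllib.request

        URL = "http://localhost:8080/"   # illustrative in-pod Apache address
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                sys.exit(0 if resp.status < 500 else 1)
        except Exception as exc:
            print("probe failed:", exc, file=sys.stderr)
            sys.exit(1)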
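  * The replication check for the 1062 problem is just 'SHOW SLAVE STATUS'; a minimal sketch, assuming the mysql client can reach
    the replica locally (the permanent ignore itself is typically the slave_skip_errors option in my.cnf):

        #!/usr/bin/env python3
        # Show the replication fields relevant to the duplicate-key problem: whether
        # the SQL thread runs, how far it lags, and the last error it hit.
        import subprocess

        out = subprocess.run(["mysql", "-e", "SHOW SLAVE STATUS\\G"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith(("Slave_SQL_Running", "Seconds_Behind_Master",
                                "Last_SQL_Errno", "Last_SQL_Error")):
                print(line)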
2025.09.28
- Restarted the degraded GlusterFS nodes and made the volumes run on the remaining 2 nodes (1 replica + metadata for most of our
  storage needs). A status-check sketch is at the end of this entry.
- It turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So that data is gone.
- Recovered the KDB database from backups and moved it to a GlusterFS/OpenShift volume. Nothing is left on the 'database' volume;
  it can be turned off.
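  * The status checks mentioned above, as a minimal sketch (the volume name is illustrative): 'volume status' shows which bricks
    are online, 'heal ... info' lists entries still waiting to be healed.

        #!/usr/bin/env python3
        # Check brick status and pending heal entries for one volume.
        import subprocess

        VOLUME = "katrin"   # illustrative volume name
        for cmd in (["gluster", "volume", "status", VOLUME],
                    ["gluster", "volume", "heal", VOLUME, "info"]):
            print("$", " ".join(cmd))
            subprocess.run(cmd, check=False)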
2025.09.28 - 2025.11.03
- GlusterFS endpoints were temporarily changed to use only ipekatrin2 (see details in the dedicated logs); a patch sketch follows below.
- Heketi and gluster-blockd were disabled and will not be available any more. Existing Heketi volumes are preserved.
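  * A sketch of how a static GlusterFS Endpoints object can be pointed at a single node with 'oc patch'; the endpoints name and
    the IP below are placeholders (the real names are in the dedicated logs):

        #!/usr/bin/env python3
        # Point a static GlusterFS Endpoints object at a single storage node.
        import json
        import subprocess

        ENDPOINTS = "gfs"          # placeholder endpoints name
        NODE_IP = "192.168.1.2"    # placeholder IP of ipekatrin2
        patch = {"subsets": [{"addresses": [{"ip": NODE_IP}],
                              "ports": [{"port": 1}]}]}
        subprocess.run(["oc", "patch", "endpoints", ENDPOINTS,
                        "-p", json.dumps(patch)], check=True)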
2025.12.09
- Re-enabled scheduling on ipekatrin1 (see the sketch at the end of this entry).
- Manually ran 'adei-clean' on katrin & darwin, but keeping the 'cron' scripts stopped for now.
- Restored configs: fstab and the */gfs endpoints. Heketi/gluster-block stays disabled. No other system changes.
- ToDo: Re-enable the 'cron' scripts if we decide to keep this system running in parallel with KaaS2.
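  * Re-enabling scheduling is an uncordon of the node; a minimal sketch, assuming a recent 'oc' client (older OpenShift 3.x
    releases used 'oc adm manage-node --schedulable=true' instead):

        #!/usr/bin/env python3
        # Mark ipekatrin1 schedulable again so application pods can be placed on it.
        import subprocess

        NODE = "ipekatrin1"
        subprocess.run(["oc", "adm", "uncordon", NODE], check=True)
        subprocess.run(["oc", "get", "node", NODE], check=True)   # SchedulingDisabled should be gone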