Hardware
--------
 2024
    - ipekatrin1: Replaced disk in section 9. LSI software reports all is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.

 2025.09 (early month)
    - ipekatrin2: Replaced 3 disks (don't remember the slots). Two of them had already been replaced once before.
    - Ordered spare disks

 2025.10.23
    - ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
        * No apparent problems at the moment. Temperatures are all in order. The battery reports healthy. The system works as usual.
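        * A minimal command sketch for such a check, assuming the LSI storcli utility is installed (binary name/path and controller index may differ):
            storcli64 /c0/bbu show all        # battery/BBU status for controller 0
            storcli64 /c0 show alarm          # current alarm state
            storcli64 /c0 set alarm=silence   # silence the active alarm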

 2025.09.28 - 2025.11.03
    - ipekatrin1: RAID controller failed. The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation).
    - ipekatrin1: Temporarily converted into a master-only node (app scheduling disabled, GlusterFS stopped).
    - ipekatrin1: New disks (from ipekatrinbackupserv1) were assembled into the RAID, added to Gluster, and a manual (file walk-through) healing
      was started; see the sketch after this entry. Expected to take about 2-3 weeks (at a rate of about 2TB per day). No LVM configured, direct mount.
    - The application node will be recovered once we replace the system SSDs with larger ones (there is currently no space for images/containers,
      and I don't want to put them on the new RAID).
    - The original disks from ipekatrin1 are assembled in ipekatrinbackupserv1. The disconnect problem persists: some disks stop answering
      SENSE queries and the backplane restarts a whole group of 10 disks. Still, all disks are accessible in JBOD mode and can be copied.
        * The XFS filesystem is severely damaged and needs repair. I tried accessing some files via the XFS debugger and it worked, so the directory
          structure and file contents are, at least partially, intact and a repair should be possible (also covered in the sketch after this entry).
        * If recovery becomes necessary: buy 24 new disks, copy them one by one, assemble the RAID, recover the FS.
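        * Rough command sketches for the two procedures above; the mount point, volume name, and device are placeholders, not the actual ones:
            # trigger Gluster self-heal by walking all files through a client mount
            find /mnt/<gluster-mount> -type f -exec stat {} \; > /dev/null
            gluster volume heal <volname> info           # watch healing progress
            # inspect / repair the damaged XFS only after the disks are copied
            xfs_db -r /dev/sdX                           # read-only inspection
            xfs_repair -n /dev/sdX                       # dry run, report problems only
            xfs_repair /dev/sdX                          # actual repair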

 2025.12.08
    - Copied the ipekatrin1 system SSDs to new 4TB drives and reinstalled them in the server (only 2TB is used due to MBR limitations)
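        * A sketch of the copy, assuming a plain dd clone (device names are placeholders):
            dd if=/dev/sdOLD of=/dev/sdNEW bs=64M conv=noerror,sync status=progress
            parted /dev/sdNEW print   # the 'msdos' disklabel explains the 2TB addressable limit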

Software
--------
 2023.06.13
    - Instructed the MySQL slave to ignore 1062 errors as well (I had skipped a few manually, but the errors kept appearing non-stop); config sketch below.
    - The ADEI-KATRIN pod also got stuck. The pod was running, but Apache was stuck and not replying. This caused the pod state to report 'not-ready', but for some reason it was still 'live' and the pod was not restarted.
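        * The config sketch referenced above, assuming the error list is set in my.cnf on the slave (a one-off manual skip is also shown):
            # my.cnf, [mysqld] section
            slave-skip-errors = 1062
            # one-off manual skip from the mysql client
            STOP SLAVE; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;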

 2025.09.28
    - Restarted the degraded GlusterFS nodes and made the volumes work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
    - It turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So, that data is gone.
    - Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume; it can be turned off.
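        * A minimal sketch of the checks used when bringing the nodes back (<volname> is a placeholder):
            systemctl restart glusterd
            gluster peer status
            gluster volume status
            gluster volume heal <volname> info
            gluster volume stop database    # once nothing references the 'database' volume anymore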

 2025.09.28 - 2025.11.03
    - GlusterFS endpoints temporarily changed to use only ipekatrin2 (see details in the dedicated logs)
    - Heketi and gluster-blockd were disabled and will not be available anymore. Existing Heketi volumes are preserved.
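        * A sketch of the endpoint change and service shutdown; the endpoints object name, namespace, and heketi deployment name are placeholders:
            oc edit endpoints <glusterfs-endpoints> -n <namespace>   # leave only the ipekatrin2 address
            oc scale dc/<heketi> --replicas=0 -n <namespace>         # if heketi runs as a deployment config
            systemctl disable --now gluster-blockd                   # on the storage nodes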

 2025.12.09
    - Re-enabled scheduling on ipekatrin1 (see the sketch below).
    - Manually ran 'adei-clean' on katrin & darwin, but kept the 'cron' scripts stopped for now.
    - Restored configs: fstab and the */gfs endpoints. Heketi/gluster-block stays disabled. No other system changes.
    - ToDo: Re-enable the 'cron' scripts if we decide to keep the system running in parallel with KaaS2.
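        * A sketch of the scheduling change, assuming the standard OpenShift node commands:
            oc adm uncordon ipekatrin1   # allow pods to be scheduled on the node again
            oc get nodes                 # verify the node no longer shows SchedulingDisabled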