 OpenShift cluster is up and running. I don't plan to re-install it unless new severe problems turn up with Bora/KatrinDB.
It seems the system is fine for ADEI. You can take a look at http://adei-katrin.kaas.kit.edu

So, what we have:
 - Automatic kickstart of the servers. This is normally done over DHCP. Since I don't have direct control over DHCP,
 I made a simple system to kickstart over IPMI. Scripts instruct the servers to boot from a virtual CD and fetch the
 kickstart from the web server (ufo.kit.edu, actually). The kickstart is generated by a PHP script based on the
 server name and/or DHCP address (see the ipmitool sketch after this list).
 - Ansible-based playbooks to configure the complete cluster. Kickstart produces minimal systems with an SSH server
 up and running. From there, I have playbooks to build the complete cluster, including databases and the ADEI
 installation. There are also playbooks for maintenance tasks like adding new nodes, etc.
 - Upgrades rarely run smoothly. To test new configurations before applying them to the production system, I also
 support provisioning of a staging cluster in Vagrant-controlled virtual machines (see the staging sketch after
 this list). This currently runs on ipepdvcompute3.
 - Replicated GlusterFS storage and some security provisions to prevent containers in one project from destroying
 data belonging to another (see the gluster sketch after this list). A selected subset of the data can be made
 available over NFS to external hosts, but I'd rather not overuse this possibility.
 - To simplify creating containers with complex storage requirements (like ADEI), there are also Ansible scripts
 to generate OpenShift templates based on the configured storage and the provided container specification.
 - To ensure data integrity, database engines do a lot of locking, syncing, and small writes. This does not
 play well with network file systems like GlusterFS. It is possible to tune database parameters a bit and run
 databases with a small write load, but this is unsuitable for large and write-intensive workloads. The
 alternative is to store data directly on local storage and use the replication engine of the database itself
 to ensure high availability. I have prepared containers to quickly bootstrap two options: master-master
 replication with Galera and standard MySQL master-slave replication. Master/slave replication is asynchronous
 and therefore significantly faster, and I use it as a good compromise. It will take about 2 weeks to re-cache
 all Katrin data. Quite long, but it is even longer with the other options. If the master server crashes, users
 will still have access to all the historical archives and will be able to proxy data requests to the source
 database (i.e. BORA will also work). And there is no need to re-cache everything, as the slave can easily be
 converted to a master (see the replication sketch after this list). Master/master replication is about 50%
 slower, but can still be used for smaller databases if uninterrupted writes are also crucial.
 - Distributed ADEI. All setups are now completely independent (they use different databases, can be stopped and
 started independently, etc.). Each setup consists of 3 main components:
    1) Frontends: There are 3 frontends: production, debugging, and logs. They are accessible individually,
    like:
            http://adei-katrin.kaas.kit.edu
            http://adei-katrin-debug.kaas.kit.edu
            http://adei-katrin-logs.kaas.kit.edu
      * The production frontend can be scaled to run several replicas. This is not required for performance,
      but if 2 copies are running, there will be no service interruption if one of the servers crashes. Otherwise,
      there is an outage of a dozen minutes until OpenShift detects that the node is gone for good.
    2) Maintenance: There are cron-style containers performing various maintenance tasks. In particular, they
    analyze the current data source configuration and schedule the caching.
    3) Cachers: The caching is performed by 3 groups of containers. One is responsible for current data,
    the second for archives, and the third for logs. Each group can be scaled independently. E.g., in the
    beginning I run multiple archive-caching replicas to get the data in; then the focus shifts to getting
    current data faster. It may also differ significantly between setups: Katrin will run multiple caching
    replicas, but less important or small data sources will get only one.
      * This architecture also allows removing the 1-minute update latency. I am not sure we can be very fast with
      the large Katrin setup on the current minimalistic cluster, but technically the updates can be as fast as the
      hardware allows.
 - There is an OpenShift template to instantiate all these containers in one go by providing a few parameters (see
 the oc sketch after this list). The only requirement is to have a 'setup' configuration in my repository. I also
 included in the ADEI sources a bunch of scripts to start all known setups with preconfigured parameters and to
 perform various maintenance tasks, like:
    ./provision.sh -s katrin            - to launch the katrin setup on the cluster
    ./stop-caching.sh -s katrin         - to temporarily stop caching
    ./start-caching.sh -s katrin        - to restart caching with the pre-configured number of replicas
    ...
 - Load balancing and high availability using 2 ping-pong IPs. By default, the katrin1 & katrin2 IPs are assigned to
 the two masters of the cluster. To balance the load, kaas.kit.edu is resolved in round-robin fashion to one of them.
 If one master fails, its IP will migrate to the remaining master and no service interruption will occur. Both
 masters run OpenVPN tunnels to the Katrin network. The remaining node routes through one of the masters. This
 configuration is also highly available and should not suffer if one of the masters crashes.
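
 A minimal sketch of the IPMI-based kickstart mentioned above (BMC address and credentials are placeholders;
 attaching the virtual CD and the PHP kickstart generator on ufo.kit.edu are handled by the scripts):

    # assumes the virtual CD with the boot image is already attached via the BMC
    ipmitool -I lanplus -H <server-bmc> -U <user> -P <password> chassis bootdev cdrom
    # power-cycle the server so it boots from the virtual CD and fetches its kickstart
    ipmitool -I lanplus -H <server-bmc> -U <user> -P <password> power cycle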
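
 The staging workflow on ipepdvcompute3 is roughly the following (a sketch; the inventory and playbook names are
 illustrative, the real ones are in the repository):

    vagrant up                                             # bring up the staging VMs
    ansible-playbook -i inventory/staging cluster.yml      # test the new configuration on staging
    ansible-playbook -i inventory/production cluster.yml   # apply it to production once verified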
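
 The replicated Gluster volumes are created along these lines (a sketch with illustrative node and volume names;
 the actual layout is driven by the playbooks):

    # two data copies plus an arbiter brick to settle split-brain situations
    gluster volume create adei replica 3 arbiter 1 \
        node1:/bricks/adei node2:/bricks/adei node3:/bricks/adei-arbiter
    gluster volume start adei
    # optionally expose the volume to external hosts over NFS
    gluster volume set adei nfs.disable off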
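
 Attaching the slave to the master and, if needed, promoting it is plain MySQL (a minimal sketch; host name,
 replication user, and binlog position are placeholders, the containers take care of this automatically):

    # on the slave: point it at the master and start replicating
    mysql -e "CHANGE MASTER TO MASTER_HOST='adei-mysql-master', MASTER_USER='repl', \
              MASTER_PASSWORD='***', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4"
    mysql -e "START SLAVE"
    # if the master is lost: promote the slave and point ADEI at it, nothing has to be re-cached
    mysql -e "STOP SLAVE; RESET SLAVE ALL"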
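
 On the OpenShift side, instantiating a setup from the template and scaling the individual components is standard
 oc usage (a sketch; the template and deployment names are illustrative, provision.sh wraps the real ones):

    # instantiate all containers of a setup in one go
    oc process -f adei-template.yml -p SETUP=katrin | oc create -f -
    # run a second production frontend so a single node failure causes no outage
    oc scale dc/adei-katrin-frontend --replicas=2
    # scale the archive cachers up while historical data is being imported
    oc scale dc/adei-katrin-cacher-archive --replicas=4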

What we are still missing:
 - Katrin database. Marco has prepared containers using the prototype I have been running over the last years.
 Hopefully, Jan can run it on the new system with a minimal number of problems.
 - BORA still needs to be moved.
 - Then, I will decommission the old Katrin server.

Fault-tolerance and high-availability
=====================================
 - I have tested fault tolerance and recoverability a bit. Both GlusterFS and OpenShift work fine if a single
 node fails. All data is available and new data can be written without problems. There is also no service
 interruption, as ADEI runs two frontends and also includes a backup MySQL server. Only caching may stop if
 the master MySQL server is hit.
 - If a node recovers, it will be re-integrated automatically. We may only need to manually convert the MySQL
 slave to a master. Adding replacement nodes also works quite easily using the provided provisioning playbooks,
 but the Gluster bricks need to be migrated manually. I also provide some scripts to simplify this task (see the
 sketch after this list).
 - The situation is worse if the cluster is completely turned off and on again. The storage survives quite well,
 but it is necessary to check that all volumes are fully healthy (sometimes a volume loses some bricks and needs
 to be restarted to reconnect them). Also, some pods running before the reboot may fail to start. Overall, it is
 better to avoid this. If a reboot is needed for some reason, the best approach is to perform a rolling reboot
 (also sketched below), restarting one node after another to keep the cluster alive at all times.
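
 Checking volume health and migrating bricks to a replacement node look roughly like this (a sketch; volume and
 brick paths are illustrative, my scripts wrap these calls):

    # verify that all bricks are connected and nothing is left to heal
    gluster volume status adei
    gluster volume heal adei info
    # move a brick from the failed node to its replacement
    gluster volume replace-brick adei oldnode:/bricks/adei newnode:/bricks/adei commit force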
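
 The rolling reboot itself is just draining and rebooting the nodes one at a time (a sketch using standard
 OpenShift commands; node names are placeholders, and in practice I wait for each node to report Ready before
 moving on):

    for node in node1 node2 node3; do
        oc adm drain "$node" --ignore-daemonsets --delete-local-data   # evacuate the pods
        ssh "$node" reboot
        # ... wait until "oc get node $node" shows Ready again ...
        oc adm uncordon "$node"                                        # allow scheduling again
    done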
 
Performance
===========
 I'd say it is more or less on par with the old system (which is expected). The computing capabilities are
 quite a bit faster (though there is a significant load on the master servers just to run the cluster), but the
 network and storage have more or less the same speed.
 - In fact, we have only a single storage node. The second is used for replication, and the third node is required
 as an arbiter in split-brain situations. I would be able to use the third node for storage as well, but I need at
 least another, 4th node in the cluster to do it. The new drives are slightly faster, but the added replication
 slows down performance considerably.
 - The containers can't use the Infiniband network efficiently. Bridging is used to allow fast networking in
 containers. Unfortunately, IPoIB is a high-level network layer and does not provide the Ethernet L2 support
 required to create the bridge. Consequently, there is a lot of packet re-building going on and the network
 performance is capped at about 3 Gbit/s for containers (easy to reproduce with iperf3, see below). It is not
 really a big problem now, as the host systems are not limited, so the storage is able to use the full bandwidth.
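
 For what it's worth, the container vs. host ceiling is easy to reproduce with iperf3 (a sketch; host names are
 placeholders):

    iperf3 -s                        # on one storage node
    iperf3 -c storage-node1 -t 30    # on another host: close to the full IPoIB bandwidth
    # the same client run from inside a container tops out around 3 Gbit/s,
    # since the bridge has to rebuild the packets on top of IPoIB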

Upgrades
========
 We may need to scale the cluster slightly if it is to be used beyond Katrin needs or if the load from Katrin
 increases significantly. Having 1-2 more nodes should help the storage system. It may also be worth adding a
 40 Gbit Ethernet switch. The Mellanox cards work in both Ethernet and Infiniband modes. Their switch actually
 does too, but they want 5 kEUR for the license to enable this feature. I guess for this money we should be
 able to buy a new piece of hardware.
 
 With more storage nodes, we could also prototype new data management solutions without disturbing Katrin
 tasks. One idea would be to run Apache Cassandra and try to re-use the data broker developed by the university
 group to get ADEI data into their cluster. Then, we could add an analysis framework on top of ADEI using
 Jupyter notebooks with Cassandra. Furthermore, there is an alpha version of NVIDIA GPU support for OpenShift,
 so we could also try to integrate some computing workloads and potentially run WAVe inside. I'd not do it on
 the production system, but if we get new nodes we may first try to set up a second OpenShift cluster for
 testing (a pair of storage nodes + a GPU node) and later re-integrate it with the main one.
 
 As running without shutdowns is pretty crucial, another question is whether we can put the servers somewhere
 at SCC with reliable power supplies, air conditioning, etc. I guess we can't expect to run without shutdowns
 in our server room.