From 0b0b9954c2d0602b1e9d0a387d2a195a790f8084 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Thu, 22 Mar 2018 04:37:46 +0100
Subject: Various fixes and provide ADEI admin container...

---
 docs/status.txt | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 docs/status.txt

diff --git a/docs/status.txt b/docs/status.txt
new file mode 100644
index 0000000..681c953
--- /dev/null
+++ b/docs/status.txt
@@ -0,0 +1,119 @@
 The OpenShift cluster is up and running. I don't plan to re-install it unless new severe problems turn up
for Bora/KatrinDB. The system seems fine for ADEI. You can take a look at http://adei-katrin.kaas.kit.edu

So, what we have:
 - Automatic kickstart of the servers. This is normally done over DHCP. Since I don't have direct control
   over DHCP, I made a simple system to kickstart over IPMI. Scripts instruct the servers to boot from a
   virtual CD and fetch the kickstart from the web server (ufo.kit.edu, actually). The kickstart is generated
   by a PHP script based on the server name and/or DHCP address.
 - Ansible-based playbooks to configure the complete cluster. The kickstart produces minimal systems with an
   SSH server up and running. From there, I have scripts to build the complete cluster, including the
   databases and the ADEI installation. There are also playbooks for maintenance tasks such as adding new
   nodes.
 - Upgrades rarely run smoothly. To test new configurations before applying them to the production system,
   I also support provisioning of a staging cluster in Vagrant-controlled virtual machines. This currently
   runs on ipepdvcompute3.
 - Replicated GlusterFS storage and some security provisions to prevent containers in one project from
   destroying data belonging to another. A selected subset of the data can be made available over NFS to
   external hosts, but I'd rather not overuse this possibility.
 - To simplify creating containers with complex storage requirements (like ADEI), there are also Ansible
   scripts to generate OpenShift templates based on the configured storage and the provided container
   specification.
 - To ensure data integrity, the database engines do a lot of locking, syncing, and small writes. This does
   not play well with network file systems like GlusterFS. It is possible to tune database parameters a bit
   and run databases with a low write intensity, but this is unsuitable for large and write-intensive
   workloads. The alternative is to store the data directly on local storage and use the replication engine
   of the database itself to ensure high availability. I have prepared containers to quickly bootstrap two
   options: Master-Master replication with Galera and standard MySQL Master-Slave replication. Master/Slave
   replication is asynchronous and therefore significantly faster, and I use it as a good compromise. It will
   take about 2 weeks to re-cache all Katrin data. Quite long, but it is even longer with the other options.
   If the master server crashes, users will still have access to all the historical archives and will be able
   to proxy data requests to the source database (i.e. BORA will also work). And there is no need to re-cache
   everything, as the slave can easily be converted to a master (a rough sketch of this is shown below).
   Master/Master replication is about 50% slower, but can still be used for smaller databases if uninterrupted
   writes are also crucial.
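   For reference, roughly what the Master/Slave wiring and a later slave-to-master promotion look like at the
   MySQL level. Host names, credentials, and binlog coordinates are placeholders, not the actual container
   configuration; the prepared containers are meant to take care of this, so treat it only as an illustration:

       # on the slave: point it at the master and start replicating
       mysql -e "CHANGE MASTER TO MASTER_HOST='adei-mysql-master', MASTER_USER='repl',
                 MASTER_PASSWORD='***', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;"
       mysql -e "START SLAVE;"
       mysql -e "SHOW SLAVE STATUS\G"            # check Slave_IO_Running / Slave_SQL_Running

       # promotion if the master is lost: stop replication and make the slave writable
       mysql -e "STOP SLAVE; RESET SLAVE ALL;"
       mysql -e "SET GLOBAL read_only = 0;"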
 - Distributed ADEI. All setups are now completely independent (they use different databases, can be stopped
   and started independently, etc.). Each setup consists of 3 main components:
   1) Frontends: There are 3 frontends: production, debugging, and logs. They are accessible individually,
      like:
          http://adei-katrin.kaas.kit.edu
          http://adei-katrin-debug.kaas.kit.edu
          http://adei-katrin-logs.kaas.kit.edu
      * The production frontend can be scaled to run several replicas. This is not required for performance,
        but if 2 copies are running, there will be no service interruption if one of the servers crashes.
        Otherwise, there is an outage of a dozen minutes while OpenShift detects that the node is gone for
        good.
   2) Maintenance: There are cron-style containers performing various maintenance tasks. In particular, they
      analyze the current data source configuration and schedule the caching.
   3) Cachers: The caching is performed by 3 groups of containers. One is responsible for current data, the
      second for archives, and the third for logs. Each group can be scaled independently, i.e. in the
      beginning I run multiple archive-caching replicas to get the data in, and then the focus shifts to
      getting current data faster. It may also differ significantly between setups: Katrin will run multiple
      caching replicas, but less important or small data sources will get only one.
      * This architecture also allows removing the 1-minute update latency. I am not sure we can be very fast
        with the large Katrin setup on the current minimalistic cluster, but technically the updates can be
        as fast as the hardware allows.
 - There is an OpenShift template to instantiate all these containers in one go by providing a few parameters.
   The only requirement is to have a 'setup' configuration in my repository. I also included in the ADEI
   sources a bunch of scripts to start all known setups with preconfigured parameters and to perform various
   maintenance tasks, e.g.
       ./provision.sh -s katrin       - to create and launch the katrin setup on the cluster
       ./stop-caching.sh -s katrin    - to temporarily stop caching
       ./start-caching.sh -s katrin   - to restart caching with the pre-configured number of replicas
       ...
 - Load balancing and high availability using 2 ping-pong IPs. By default, the katrin1 & katrin2 IPs are
   assigned to the two masters of the cluster. To balance the load, kaas.kit.edu is resolved in round-robin
   fashion to one of them. If one master fails, its IP will migrate to the remaining master and no service
   interruption will occur. Both masters run OpenVPN tunnels to the Katrin network. The remaining node routes
   through one of the masters. This configuration is also highly available and should not suffer if one of
   the masters crashes.

What we are still missing:
 - The Katrin database. Marco has prepared containers using the prototype I ran last year. Hopefully, Jan can
   run it on the new system with a minimal number of problems.
 - BORA still needs to be moved.
 - Then, I will decommission the old Katrin server.

Fault-tolerance and high-availability
=====================================
 - I have tested fault tolerance and recoverability a bit. Both GlusterFS and OpenShift work fine if a single
   node fails. All data is available and new data can be written without problems. There is also no service
   interruption, as ADEI runs two frontends and also includes a backup MySQL server. Only caching may stop if
   the master MySQL server is hit.
 - If a node recovers, it will be re-integrated automatically. We may only need to manually convert the MySQL
   slave to a master. Adding replacement nodes also works quite easily using the provided provisioning
   playbooks, but the Gluster bricks need to be migrated manually. I also provide some scripts to simplify
   this task (a sketch of a manual brick migration follows below).
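   Roughly, a manual brick migration looks like this. Volume name, host names, and brick paths are
   placeholders, and my scripts may do additional checks, so this is only an illustration of the idea:

       # move the brick of the failed node to its replacement and let Gluster re-sync it
       gluster volume replace-brick adei-db \
           failed-node.kaas.kit.edu:/mnt/gluster/adei-db \
           new-node.kaas.kit.edu:/mnt/gluster/adei-db \
           commit force
       gluster volume heal adei-db info    # wait until the heal queue is empty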
 - The situation is worse if the cluster is completely turned off and on again. The storage survives quite
   well, but it is necessary to check that all volumes are fully healthy (sometimes a volume loses some
   bricks and needs to be restarted to reconnect them). Also, some pods running before the reboot may fail
   to start. Overall, it is better to avoid this. If a reboot is needed for some reason, the best approach
   is a rolling reboot, restarting one node after another to keep the cluster alive at all times (a command
   sketch is at the end of this note).

Performance
===========
 I'd say it is more or less on par with the old system (which is expected). The computing capabilities are
 quite a bit faster (though there is a significant load on the master servers to run the cluster), but the
 network and storage have more or less the same speed.
 - In fact, we have only a single storage node. The second is used for replication and the third node is
   required for arbitration in split-brain cases. I would be able to use the third node for storage as well,
   but I need at least a 4th node in the cluster to do it. The new drives are slightly faster, but the added
   replication slows down performance considerably.
 - The containers can't use the Infiniband network efficiently. Bridging is used to allow fast networking in
   the containers. Unfortunately, IPoIB is a high-level network layer and does not provide the Ethernet L2
   support required to create the bridge. Consequently, there is a lot of packet re-building going on and the
   network performance is capped at about 3 Gbit/s for containers. It is not really a big problem now, as the
   host systems are not limited, so the storage is able to use the full bandwidth.

Upgrades
========
 We may need to scale the cluster slightly if it is to be used beyond Katrin needs or if the load from Katrin
 increases significantly. Having 1-2 more nodes would help the storage system. It may also be worth adding a
 40 Gbit Ethernet switch. The Mellanox cards work in both Ethernet and Infiniband modes; their switch actually
 does too, but they want 5 kEUR for the license to enable this feature. I guess for this money we should be
 able to buy a new piece of hardware.

 With more storage nodes, we could also prototype new data management solutions without disturbing Katrin
 tasks. One idea would be to run Apache Cassandra and try to re-use the data broker developed by the
 university group to get ADEI data into their cluster. Then, we could add an analysis framework on top of
 ADEI using Jupyter notebooks with Cassandra. Furthermore, there is an alpha version of NVIDIA GPU support
 for OpenShift, so we could also try to integrate some computing workloads and potentially run WAVe inside
 as well. I'd not do it on the production system, but if we get new nodes we may first try to set up a
 second OpenShift cluster for testing (a pair of storage nodes + a GPU node) and later re-integrate it with
 the main one.

 As running without shutdowns is pretty crucial, another question is whether we can put the servers somewhere
 at SCC with reliable power supplies, air conditioning, etc. I guess we can't expect to run without shutdowns
 in our server room.
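 Referring back to the rolling reboot mentioned under fault tolerance, the per-node procedure is roughly the
 following. Node and volume names are placeholders and the exact flags may need to be adjusted to the
 installed OpenShift version:

     # repeat for one node at a time
     oc adm drain node1.kaas.kit.edu --ignore-daemonsets --delete-local-data
     ssh node1.kaas.kit.edu reboot
     # once the node is back up:
     oc adm uncordon node1.kaas.kit.edu
     gluster volume heal adei-db info    # wait for healing to finish before touching the next node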