From 0b0b9954c2d0602b1e9d0a387d2a195a790f8084 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Thu, 22 Mar 2018 04:37:46 +0100
Subject: Various fixes and provide ADEI admin container...

---
 docs/status.txt | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 docs/status.txt

diff --git a/docs/status.txt b/docs/status.txt
new file mode 100644
index 0000000..681c953
--- /dev/null
+++ b/docs/status.txt
@@ -0,0 +1,119 @@
 The OpenShift cluster is up and running. I don't plan to re-install it unless new severe problems turn up
for Bora/KatrinDB. The system seems fine for ADEI. You can take a look at http://adei-katrin.kaas.kit.edu

So, what we have:
 - Automatic kickstart of the servers. This is normally done over DHCP. Since I don't have direct control
   over DHCP, I made a simple system to kickstart over IPMI. Scripts instruct the servers to boot from a
   virtual CD and fetch the kickstart from the web server (ufo.kit.edu, actually). The kickstart is generated
   by a PHP script based on the server name and/or DHCP address.
 - Ansible-based playbooks to configure the complete cluster. The kickstart produces minimal systems with an
   SSH server up and running. From there, I have scripts to build the complete cluster, including the
   databases and the ADEI installation. There are also playbooks for maintenance tasks such as adding new
   nodes.
 - Upgrades rarely run smoothly. To test new configurations before applying them to the production system,
   I also support provisioning of a staging cluster in Vagrant-controlled virtual machines. This currently
   runs on ipepdvcompute3.
 - Replicated GlusterFS storage and some security provisions to prevent containers in one project from
   destroying data belonging to another. A selected subset of the data can be made available over NFS to
   external hosts, but I'd rather not overuse this possibility.
 - To simplify creating containers with complex storage requirements (like ADEI), there are also Ansible
   scripts to generate OpenShift templates based on the configured storage and the provided container
   specification.
 - To ensure data integrity, the database engines do a lot of locking, syncing, and small writes. This does
   not play well with network file systems like GlusterFS. It is possible to tune database parameters a bit
   and run databases with a low write intensity, but this is unsuitable for large and write-intensive
   workloads. The alternative is to store the data directly on local storage and use the replication engine
   of the database itself to ensure high availability. I have prepared containers to quickly bootstrap two
   options: Master-Master replication with Galera and standard MySQL Master-Slave replication. Master/Slave
   replication is asynchronous and therefore significantly faster, and I use it as a good compromise. It will
   take about 2 weeks to re-cache all Katrin data. Quite long, but it is even longer with the other options.
   If the master server crashes, users will still have access to all the historical archives and will be able
   to proxy data requests to the source database (i.e. BORA will also work). And there is no need to re-cache
   everything, as the slave can easily be converted to a master (a rough sketch of this is shown below).
   Master/Master replication is about 50% slower, but can still be used for smaller databases if uninterrupted
   writes are also crucial.
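   For reference, roughly what the Master/Slave wiring and a later slave-to-master promotion look like at the
   MySQL level. Host names, credentials, and binlog coordinates are placeholders, not the actual container
   configuration; the prepared containers are meant to take care of this, so treat it only as an illustration:

       # on the slave: point it at the master and start replicating
       mysql -e "CHANGE MASTER TO MASTER_HOST='adei-mysql-master', MASTER_USER='repl',
                 MASTER_PASSWORD='***', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;"
       mysql -e "START SLAVE;"
       mysql -e "SHOW SLAVE STATUS\G"            # check Slave_IO_Running / Slave_SQL_Running

       # promotion if the master is lost: stop replication and make the slave writable
       mysql -e "STOP SLAVE; RESET SLAVE ALL;"
       mysql -e "SET GLOBAL read_only = 0;"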
 - Distributed ADEI. All setups are now completely independent (they use different databases, can be stopped
   and started independently, etc.). Each setup consists of 3 main components:
   1) Frontends: There are 3 frontends: production, debugging, and logs. They are accessible individually,
      like:
          http://adei-katrin.kaas.kit.edu
          http://adei-katrin-debug.kaas.kit.edu
          http://adei-katrin-logs.kaas.kit.edu
      * The production frontend can be scaled to run several replicas. This is not required for performance,
        but if 2 copies are running, there will be no service interruption if one of the servers crashes.
        Otherwise, there is an outage of a dozen minutes while OpenShift detects that the node is gone for
        good.
   2) Maintenance: There are cron-style containers performing various maintenance tasks. In particular, they
      analyze the current data source configuration and schedule the caching.
   3) Cachers: The caching is performed by 3 groups of containers. One is responsible for current data, the
      second for archives, and the third for logs. Each group can be scaled independently, i.e. in the
      beginning I run multiple archive-caching replicas to get the data in, and then the focus shifts to
      getting current data faster. It may also differ significantly between setups: Katrin will run multiple
      caching replicas, but less important or small data sources will get only one.
      * This architecture also allows removing the 1-minute update latency. I am not sure we can be very fast
        with the large Katrin setup on the current minimalistic cluster, but technically the updates can be
        as fast as the hardware allows.
 - There is an OpenShift template to instantiate all these containers in one go by providing a few parameters.
   The only requirement is to have a 'setup' configuration in my repository. I also included in the ADEI
   sources a bunch of scripts to start all known setups with preconfigured parameters and to perform various
   maintenance tasks, e.g.
       ./provision.sh -s katrin       - to create and launch the katrin setup on the cluster
       ./stop-caching.sh -s katrin    - to temporarily stop caching
       ./start-caching.sh -s katrin   - to restart caching with the pre-configured number of replicas
       ...
 - Load balancing and high availability using 2 ping-pong IPs. By default, the katrin1 & katrin2 IPs are
   assigned to the two masters of the cluster. To balance the load, kaas.kit.edu is resolved in round-robin
   fashion to one of them. If one master fails, its IP will migrate to the remaining master and no service
   interruption will occur. Both masters run OpenVPN tunnels to the Katrin network. The remaining node routes
   through one of the masters. This configuration is also highly available and should not suffer if one of
   the masters crashes.

What we are still missing:
 - The Katrin database. Marco has prepared containers using the prototype I ran last year. Hopefully, Jan can
   run it on the new system with a minimal number of problems.
 - BORA still needs to be moved.
 - Then, I will decommission the old Katrin server.

Fault-tolerance and high-availability
=====================================
 - I have tested fault tolerance and recoverability a bit. Both GlusterFS and OpenShift work fine if a single
   node fails. All data is available and new data can be written without problems. There is also no service
   interruption, as ADEI runs two frontends and also includes a backup MySQL server. Only caching may stop if
   the master MySQL server is hit.
 - If a node recovers, it will be re-integrated automatically. We may only need to manually convert the MySQL
   slave to a master. Adding replacement nodes also works quite easily using the provided provisioning
   playbooks, but the Gluster bricks need to be migrated manually. I also provide some scripts to simplify
   this task (a sketch of a manual brick migration follows below).
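   Roughly, a manual brick migration looks like this. Volume name, host names, and brick paths are
   placeholders, and my scripts may do additional checks, so this is only an illustration of the idea:

       # move the brick of the failed node to its replacement and let Gluster re-sync it
       gluster volume replace-brick adei-db \
           failed-node.kaas.kit.edu:/mnt/gluster/adei-db \
           new-node.kaas.kit.edu:/mnt/gluster/adei-db \
           commit force
       gluster volume heal adei-db info    # wait until the heal queue is empty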
 - The situation is worse if the cluster is completely turned off and on again. The storage survives quite
   well, but it is necessary to check that all volumes are fully healthy (sometimes a volume loses some
   bricks and needs to be restarted to reconnect them). Also, some pods running before the reboot may fail
   to start. Overall, it is better to avoid this. If a reboot is needed for some reason, the best approach
   is a rolling reboot, restarting one node after another to keep the cluster alive at all times (a command
   sketch is at the end of this note).

Performance
===========
 I'd say it is more or less on par with the old system (which is expected). The computing capabilities are
 quite a bit faster (though there is a significant load on the master servers to run the cluster), but the
 network and storage have more or less the same speed.
 - In fact, we have only a single storage node. The second is used for replication and the third node is
   required for arbitration in split-brain cases. I would be able to use the third node for storage as well,
   but I need at least a 4th node in the cluster to do it. The new drives are slightly faster, but the added
   replication slows down performance considerably.
 - The containers can't use the Infiniband network efficiently. Bridging is used to allow fast networking in
   the containers. Unfortunately, IPoIB is a high-level network layer and does not provide the Ethernet L2
   support required to create the bridge. Consequently, there is a lot of packet re-building going on and the
   network performance is capped at about 3 Gbit/s for containers. It is not really a big problem now, as the
   host systems are not limited, so the storage is able to use the full bandwidth.

Upgrades
========
 We may need to scale the cluster slightly if it is to be used beyond Katrin needs or if the load from Katrin
 increases significantly. Having 1-2 more nodes would help the storage system. It may also be worth adding a
 40 Gbit Ethernet switch. The Mellanox cards work in both Ethernet and Infiniband modes; their switch actually
 does too, but they want 5 kEUR for the license to enable this feature. I guess for this money we should be
 able to buy a new piece of hardware.

 With more storage nodes, we could also prototype new data management solutions without disturbing Katrin
 tasks. One idea would be to run Apache Cassandra and try to re-use the data broker developed by the
 university group to get ADEI data into their cluster. Then, we could add an analysis framework on top of
 ADEI using Jupyter notebooks with Cassandra. Furthermore, there is an alpha version of NVIDIA GPU support
 for OpenShift, so we could also try to integrate some computing workloads and potentially run WAVe inside
 as well. I'd not do it on the production system, but if we get new nodes we may first try to set up a
 second OpenShift cluster for testing (a pair of storage nodes + a GPU node) and later re-integrate it with
 the main one.

 As running without shutdowns is pretty crucial, another question is whether we can put the servers somewhere
 at SCC with reliable power supplies, air conditioning, etc. I guess we can't expect to run without shutdowns
 in our server room.
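 Referring back to the rolling reboot mentioned under fault tolerance, the per-node procedure is roughly the
 following. Node and volume names are placeholders and the exact flags may need to be adjusted to the
 installed OpenShift version:

     # repeat for one node at a time
     oc adm drain node1.kaas.kit.edu --ignore-daemonsets --delete-local-data
     ssh node1.kaas.kit.edu reboot
     # once the node is back up:
     oc adm uncordon node1.kaas.kit.edu
     gluster volume heal adei-db info    # wait for healing to finish before touching the next node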