From e0b1b53f21095707af87a095934e971d788a90c7 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan" <csa@suren.me>
Date: Wed, 11 Dec 2019 22:01:13 +0100
Subject: Document further possible problems with MySQL replication

---
 docs/troubleshooting.txt | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index a47d133..fd57150 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -319,11 +319,40 @@ MySQL
 =====
  - MySQL may stop replicating from the master. There is some kind of deadlock in multi-threaded SLAVE SQL.
  This can be seen by exexuting (which should show a lot of slave threads waiting on coordinator to provide
- load).
+ the load).
         SHOW PROCESSLIST;
  The remedy is to restart slave MySQL with 'slave_parallel_workers=0', give it a time to go, and then
- restart back in the standard multithreading mode.
+ restart back in the standard multithreading mode. This can be achieved by editing 'statefulset/mysql-slave-0'
+ and setting environmental vairable 'MYSQL_SLAVE_WORKERS' to 0 and, then, back to original value (16 currently).
 
+- This could be not end of this. The execution of statments from the log could 'stuck' because of the some "bad"
+  transaction. This can be detected by looking into the replication status on the slave database:
+        SHOW SLAVE STATUS\G
+ It may happen what while 'Retrieved_Gtid_Set' increments (binary log is successfully transfered from the master
+ and contains new transactions), the 'Executed_Gtid_Set' staying constant, i.e. new transactions are not applied 
+ to the slave server. Also, in this case 'Relay_Log_Pos' and 'Exec_Master_Log_Pos' will not advance. And the
+ 'Slave_SQL_Running_State' will likely stuck in the 'Reading event from the relay log' state. The solution is to 
+ skip misbehaving transaction. Basically, we need to skip problematic transaction and to continue from the next 
+ one. The following considers that GTID_MODE is 'on' as we have it on the current KaaS platform.
+  * First, we need to find the next transaction. Likely it the just an increment of the 'Executed_Gtid_Set', i.e.
+    if 'Executed_Gtid_Set=3beaee24-2e55-11e8-9612-0a580a80000c:1-953787306' on the slave, the next GTID would be:
+    'Executed_Gtid_Set=3beaee24-2e55-11e8-9612-0a580a80000c:953787307' (yes "1-" also goes away). But this is not
+    guaranteed. The more reliable way to find the misbehaving transaction is to look into the 'bin-log' on the master
+    server (information is taken from SHOW SLAVE STATUS executed on the slave). In '/var/lib/mysql/data' run
+        mysqlbinlog --start-position=<Exec_Master_Log_Pos> <Relay_Master_Log_File> | head -n 50
+    Here you will find which transaction has likely caused the problem. Furhermore, there will be the line looking like
+        SET @@SESSION.GTID_NEXT='4ab8feff-5272-11e8-9320-08002715584a:201840'
+    This is the gtid of the next transaction.
+ * So, the following commands should be executed on the slave MySQL server (see details, https://www.thegeekdiary.com/how-to-skip-a-transaction-on-mysql-replication-slave-when-gtids-are-enabled/)
+        SLAVE STOP;
+        SET @@SESSION.GTID_NEXT='<found_gtid_of_next_transaction>';
+        BEGIN;
+        COMMIT;
+        SET GTID_NEXT='AUTOMATIC';
+        SLAVE START;
+ * It is also possible to review the stuck transaction on the slave mysql node. In the '/var/lib/mysql/data' run 
+        mysqlbinlog --start-position=<Relay_Log_Pos> <Relay_Log_File>
+    
 Administration
 ==============
  - Some management tasks may require to login on ipekatrin* nodes. Thereafter, the password-less execution of
-- 
cgit v1.2.1