Tuesday, 9 December 2014

MaxScale, manual control, external monitors and notification methods


One of the nice things about the "plug and play" approach of MaxScale is that people constantly find ways of using it that were not originally envisaged when we designed MaxScale. One such configuration that I have heard of from multiple sources is using monitoring outside of MaxScale itself. This post will discuss a little about how monitoring works and how it can be moved outside of MaxScale. In particular a simplified example will be presented which shows how to use the notification mechanism in Galera to control MaxScale's use of the nodes in a Galera cluster.

Monitoring Within MaxScale

Perhaps it is best to start with a little background as to what the function of monitor plugins is within MaxScale, how they work and how they communicate with the other components of MaxScale.

MaxScale monitors the servers for one reason only: so that it can feed the routing algorithms with the state they require in order to decide to which server a given request should be sent. In particular the monitoring is not designed to be a mechanism that administrators would use to determine the health of the database cluster or to provide critical alarms; there are other tools that are much more focused on performing that task than MaxScale.

Monitor plugins have one of the least complex interface requirements of any plugin module within MaxScale; other than the entry points used to configure options and for diagnostic purposes there are essentially four entry points: a pair to start and stop the monitor and a pair to register and unregister a server that should be monitored. The monitor implementation is expected to create a thread of its own on which to run; it then monitors the servers that are registered with it asynchronously from any other operations within MaxScale.

The monitor threads communicate with the rest of MaxScale by setting flag bits on the individual servers to reflect the state of each server and by adding values to some monitor owned fields within the servers. These "monitor owned" fields are related to the measurement of specific quantities, namely the replication lag within a cluster. The mechanism of more interest here is the set of status bits within the server structure. The bit values that are of interest are:
  • Running - the server is running the database, connections can be made to the server and SQL statements executed.
  • Joined - the server is a member of a multi-master cluster and is able to accept both read and write statements to execute on that cluster.
  • Master - the server is a member of a cluster and can be considered as a master of that cluster, i.e. it can accept both read and write requests.
  • Slave - the server is a member of a cluster and is able to accept read requests for that cluster only.
  • Maintenance - the server is running, however it is in maintenance mode; new connections or operations should not be sent to the server.

It is possible to set all of these bits manually; however, under normal circumstances, when a monitor module has been added to the configuration file, these will be changed back periodically by the monitor to reflect the situation it observes. Therefore, in order to effectively control the state of the servers either manually or externally to MaxScale, it is important that no monitor module within MaxScale is monitoring the same set of hosts.
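If you want to check which monitors are configured, and to see the current status bits on a server before changing them, the maxadmin interface can be queried first; the commands below are meant purely as an illustration of that interface.

    $ maxadmin -pskysql list monitors
    $ maxadmin -pskysql show server dbserver1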

Manually Setting Server Status

The MaxScale command line interface application, maxadmin, allows the server status bits to be set and cleared manually using the set server command. MaxAdmin will also take commands as arguments on the command line; therefore a simple command can be used to set the status of a database server.

    $ maxadmin -pskysql set server dbserver1 master

The above command will set the server which has a section name of dbserver1 to be a master server. Note that this has no impact on any other status bits of this server or on the bits of other servers: if dbserver1 previously had the slave bit set it would remain set, and likewise if the server dbserver2 had the master bit set before the call then it will still be set after this call.

To clear a status bit on a server the clear server command should be used.

    $ maxadmin -pskysql clear server dbserver1 master

Once this manual method of control is available via maxadmin, it also provides a mechanism to integrate with third party monitors or with any state change notification that exists within your cluster management software.
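As an illustration of that kind of integration, a third party health check could call a small wrapper script such as the hypothetical one below, which flags a server as down in MaxScale when an external check fails and marks it running again when the check succeeds; the check command and server name are assumptions for the example.

#!/bin/sh
# Hypothetical hook for an external monitor: adjust the MaxScale view of
# dbserver1 based on the result of a separate health check command.
if /usr/local/bin/check_db_health dbserver1
then
    maxadmin -pskysql set server dbserver1 running
else
    maxadmin -pskysql clear server dbserver1 running
fi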

Using Galera Notify Command Option

Galera Cluster provides a mechanism to have the cluster call an external application whenever the cluster state changes. This can be utilised to have the Galera Cluster directly control MaxScale, rather than have MaxScale monitor the status of the Galera Cluster and respond to the monitored state values. This approach has the advantage that MaxScale will be informed immediately of a state change within the Galera Cluster; the disadvantage is that at least one node in the cluster must be running in order to trigger the notification process.

The wsrep_notify_cmd option in Galera Cluster can be used to define the name of an application or script that is called when the status of the node or the membership of the cluster changes. This script is called with a number of parameters that need to be parsed in order to determine what has happened in the cluster. At the simplest level this can be used to set the status of the node on which it is executed. When the status of the current node changes the wsrep_notify_cmd is called with the --status argument. A very simple script might be:

#!/bin/sh
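# Minimal wsrep_notify_cmd handler: extract the --status argument passed by
# Galera and map the node status onto MaxScale server status bits via maxadmin.
# PASSWD and SERVER must be set before use (see the note after the script).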

while [ $# -gt 0 ]
do
    case $1 in
    --status)
        STATUS=$2
        shift
        ;;
    esac
    shift
done

if [ "$STATUS" == "Undefined" ]
then
    maxadmin -p$PASSWD clear server $SERVER synced
    maxadmin -p$PASSWD clear server $SERVER running
else
    case $STATUS in
    Joiner)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    Donor)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD set server $SERVER synced
        ;;
    Joined)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    Synced)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD set server $SERVER synced
        ;;
    Error)
        maxadmin -p$PASSWD clear server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    esac
fi

Note that the above script will mark a server as running and joined to the cluster if the Galera status says that it is either a fully synchronised node or a donor node. This differs from the default Galera Monitor plugin, which will remove a donor node from consideration when routing traffic.

In order to make this script useful two more things must be done: the variable PASSWD should be set to the password of the MaxScale admin user and the variable SERVER should be set to the name by which this server is known in the MaxScale configuration file.
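For example, the top of the script could simply hard code both values; the password and server name below are just placeholders for whatever is used in your own configuration.

    PASSWD=skysql
    SERVER=dbserver1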

Other Notification Methods

Although the above example is based around the Galera wsrep_notify_cmd mechanism, the principles still hold for other environments. For example the MHA scripts can be enhanced in order to allow MHA to control the server status bits of replication clusters. The MonYog HTTP based API can be used such that when MonYog observes a change of state of a node it makes an API call to an HTTP server which executes the set of maxadmin commands needed to reflect the monitored change within MaxScale. This style of manual (or external) control of MaxScale could also be integrated into Pacemaker resource agents in order to synchronise the changes to the database with the changes in MaxScale.

Monday, 29 September 2014

Releases, profiling and scheduling

The MaxScale team have been working hard fixing bugs and improving performance. On Friday we released an update of MaxScale, the pluggable proxy for MySQL and MariaDB, and I wanted to write a little about a few of those changes. I will not mention every change, as there are release notes that give the list of bugs fixed in this version, but rather highlight a couple of performance related changes and describe the rationale behind them. However, before I start on the two items I wanted to discuss, just a quick note to say that this version introduces cmake as the means to build MaxScale. The previous Makefiles are still available in this version, but will be withdrawn in future versions as we transition to cmake for build, installation and packaging.

Connection Starvation

In one of the test scenarios that I was running I observed an issue that manifested itself as pauses in the result set returned via MaxScale. In this particular situation a very large, multi-megabyte result set was returned, and sometimes during the result set stream there would be a measurable pause during which the client would receive no new data. There was clearly some blocking or starvation issue within MaxScale that needed to be investigated and resolved.

I added some extra instrumentation to poll.c and noted that we often had no threads available to poll. Increasing the number of threads did not solve the problem, so clearly we had an issue with threads being blocked. This turned out to be the mutexes that had been added to prevent multiple threads processing reads and writes for the same DCB. The problem was that this caused other threads to block whilst the first thread worked on a read or a write, starving other DCB's of the processor. So these mutexes had to be removed and replaced by a simple queuing mechanism that allows subsequent threads to be freed back to the polling loop immediately rather than blocking.

This has made a big difference and is key to the MaxScale architecture - the guiding principle is that threads should not block; if they do then we will never be able to handle the load, as the number of threads we require will grow uncontrollably.


Now that threads were more available, and the setting of the number of threads made a bigger difference to performance, I went on to look at synchronisation issues. I added instrumentation to the spinlock to detect hot locks that might need decomposing or that might indicate poor parallelisation in the code. What I found was interesting…

Looking at the Descriptor Control Block (DCB) that was being used to write the results back to the client, I saw that the write queue spinlock was the only lock suffering any obvious contention. However this was isolated to a single DCB, so it was not an issue of contention between router sessions, but contention within a single session.


       Write Queue Lock Statistics:
               Spinlock acquired                       1635627
               Total no. of spins                      69365889
               Average no. of spins (overall)          42
               Average no. of spins (when contended)   11336
               Maximum no. of spins                    68855
               Maximim no. of blocked threads          1
               Contended locks                         6119
               Contention percentage                   0


The spinlock was often acquired and mostly did not contend; however there were a significant number of calls, 6119 in this sample, that did. When there was contention the number of spins required was high, an average of 11336. So the conclusion was that sometimes, when the queue is locked, a significant amount of work is being done with the lock held. There are two things it could be: either the appending of buffers to the queue or the processing of the queue to send data. Looking at the show dcb output in the debug interface, the queue could become very large.

The next approach tried was to profile the code, with gprof, to see if this problem could be pinned down further. This yielded some more interesting data.

Each sample counts as 0.01 seconds.
 %   cumulative   self  self     total
time   seconds   seconds    calls  ms/call  ms/call  name
94.00    314.29   314.29 26007877 0.01     0.01  gwbuf_append
 4.43    329.10    14.82 1116143246   0.00     0.00  atomic_add
 0.43    330.55     1.45                 dcb_drain_writeq
 0.28    331.49     0.94 26397028 0.00     0.00  spinlock_acquire
 0.21    332.18     0.69                 dcb_write
 0.16    332.73     0.55       34  16.18    16.18  gwbuf_alloc
 0.13    333.15     0.42                 frame_dummy
 0.12    333.56     0.41 42439118 0.00     0.00  gwbuf_consume
 0.07    333.80     0.24 33654544 0.00     0.00  gwbuf_free
 0.05    333.98     0.18 40316849 0.00     0.00  gw_write
 0.05    334.16     0.18 25997764 0.00     0.00  gwbuf_length
 0.03    334.25     0.09 26390425 0.00     0.00  spinlock_release



So it seems gwbuf_append was taking a significant amount of CPU time, at least in this scenario. I made a simple change to maintain a tail pointer in the linked list and remove the need to walk the list in order to append an item. This made a difference, yielding the following profile data…

Each sample counts as 0.01 seconds.
 %   cumulative   self  self     total
time   seconds   seconds    calls  ms/call  ms/call  name
46.26      0.87     0.87 73798949   0.00     0.00  atomic_add
11.77      1.09     0.22      226   0.97     0.97  gwbuf_alloc
10.70      1.29     0.20                   dcb_drain_writeq
 8.56      1.45     0.16 21028798   0.00     0.00  gwbuf_consume
 5.88      1.56     0.11                   dcb_write
 4.81      1.65     0.09 10261279   0.00     0.00  gwbuf_append
 3.48      1.71     0.07 10581718   0.00     0.00  spinlock_acquire
 2.14      1.75     0.04 21027433   0.00     0.00  gwbuf_free
 1.34      1.78     0.03                   frame_dummy
 1.07      1.80     0.02 21048962   0.00     0.00  gw_write
 1.07      1.82     0.02        1  20.00    33.49  poll_waitevents


This shows that gwbuf_append is no longer a major issue. The top function is now atomic_add, which is used for gathering stats and implementing the spinlock primitives. This is not too surprising since atomic_add needs to lock the bus and is thus an expensive instruction.

It also had an impact on the write queue spinlock, with a sample of similar duration now showing no contention with a single session running.

       Write Queue Lock Statistics:
               Spinlock acquired                     1176242
               Total no. of spins                    0
               Average no. of spins (overall)        0
               Maximum no. of spins                  0
               Maximim no. of blocked threads        0
               Contended locks                       0
               Contention percentage                 0



These changes are encouraging, but they currently only relate to heavy write loads on MaxScale. They do not give universal improvements in performance, but they do yield some significant gains and reduce the CPU overhead for handling large result sets, turning a previously CPU bound process into an IO bound one.

Blocking Zombies

MaxScale is designed to use several independent threads to process requests, and each connection has a descriptor control block (DCB) associated with it. These DCB's can only be destroyed when MaxScale is sure that none of the threads is currently using the DCB; if a DCB is destroyed prematurely there is a risk that another thread is still using it. Therefore MaxScale implements a two stage process to destroy DCB's. When a thread wishes to destroy a DCB it marks it as a "zombie" and places it on a zombie queue. Part of the DCB is a bitmask, with a bit per thread running in MaxScale. Every bit is set while the DCB is in the non-zombie state; once it is placed in the zombie state each thread clears the bit corresponding to its thread ID when it enters the polling loop. Once the last thread clears its bit the DCB can finally be destroyed.

This mechanism relies on threads frequently entering the polling loop in order to clean up DCB's in a timely fashion. However the change that was introduced above, to queue requests for a DCB rather than block, was preventing a thread from re-entering the polling loop when a large request or response was being sent via MaxScale. The result was that on systems with result sets of many gigabytes a large number of DCB's would build up in the zombie queue and connections would not be closed. This was particularly noticeable when using the binlog router for MySQL replication.

Looking at this problem also highlighted a different issue that could occur: the polling scheme allowed a thread to pick up all events available at the time the poll call was made. That thread would then be responsible for processing that set of events while other threads polled and collected their own sets of events to work on. If any event other than the last in the set either took significant time to process or, more likely, was one of these connections streaming a large quantity of data, the subsequent events in that set would not be processed for some considerable time.

It was decided that the best approach to both of these problems would be to separate the processing of events from the polling of events. The implementation now polls for events within a thread, adds these events to a queue that is common to all threads and then picks the first event from this queue to process. If another event comes in for a DCB that is currently being processed then the DCB is marked as having pending events. When the current processing completes that DCB is re-queued at the end of the event queue to be picked up later, as opposed to the previous scheme which would allow that pending event to be processed immediately to the detriment of other DCB's. This also means that threads re-enter the polling loop more frequently and zombie processing can be done more often under these workloads.

The result was that under high load much fairer event scheduling took place, every connection tended to see more predictable performance and, surprisingly, the processor load MaxScale imposed on the system decreased.

These are just two of the changes that can be found in the current release of MaxScale; there are many others, some mundane, others more important. I would encourage anybody that has not already done so to take a look at the release notes to see what has changed. As with any software we still have open issues, and we continue to work on these, increase the test coverage and respond to issues raised by users of MaxScale.

Thursday, 14 August 2014

Which SQL queries take all the time? Using MaxScale to answer that age old question.

Here we take a look at how one of the example filters supplied with the MaxScale 1.0 beta can answer that simplest of profiling questions - "Which of my database queries run within the MySQL server for the longest time?".

Why A MaxScale Filter?

MaxScale filters, although they can be thought of as simply a way to modify the SQL requests that pass through MaxScale, are also a good place to insert logging or monitoring. They have the advantage that they receive not just the SQL statements that are sent to the server, but may also receive the returned results. This means that MaxScale can not only log the query information, but can also record timestamps related to when it observes different parts of the request or response. As a way to demonstrate this a very simple example filter was constructed for the 1.0 beta release of MaxScale - the top filter.

What Does Topfilter Do?

The topfilter filter observes both requests being sent to the server and responses being sent back. It sends both the requests and the responses onwards unmodified; however it takes note of the request contents, provided it is a SQL request, along with a timestamp of when it saw the request. When it sees the first response come back from the server it also takes a timestamp and uses the difference as a measure of the execution time of the statement. The longest running queries are retained, along with the execution time data. The number of retained queries can be configured within the MaxScale configuration of the filter. When the client connection is closed a report is written to file with the longest running queries that were observed for that connection.

How Do I Set Up My System

The first step in creating a system that allows you to measure your queries is to configure a MaxScale service that is suitable for your application. That may use any of the supported routers, e.g. the readconnroute or Read/Write Splitter routers. Here we will assume that we only have one server; we are not using MaxScale for its scale-out abilities, merely as a simple performance tuning aid. Therefore we will use the readconnroute router with the option to connect to any server, the only server we have. The basic configuration for the service is as follows:

[Top10]
type=service
router=readconnroute
servers=dbnode1
user=maxadmin
passwd=6628C50E07CCE1F0392EDEEB9D1203F3

The service also requires a listener, to define the port on which MaxScale will listen for requests to connect to the service and to link the service with the protocol module.

[Top10 Listener]
type=listener
service=Top10
protocol=MySQLClient
port=7706

We have used port 7706 for MaxScale to listen for incoming requests; this allows MaxScale to be located on the same machine as the MySQL/MariaDB server. Any port may be chosen by the user provided it is not already in use; if no MySQL instance is running on the same machine the default MySQL port of 3306 may be used.
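To verify that the listener is working you can simply point the standard mysql command line client at the MaxScale port instead of the database; the host name and user below are placeholders.

    $ mysql -h maxscale-host -P 7706 -u myuser -p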

In this example we will assume we have a single database node, which we will call dbnode1.

The database node should be configured as follows:

[dbnode1]
type=server
address=…
port=3306
protocol=MySQLBackend

The address= entry should contain the IP address or a hostname for the server. The assumption here is that the MySQL server is using the default MySQL port of 3306 to listen for incoming requests.

For completeness we should also add a monitor module that can determine if the node is in the correct state to communicate. This is more important if the backend is not a single database server, but a cluster of servers. It is good practice to always include a monitor even when only a single node is configured as it allows for ease of extension to a cluster and provides more information as to the state of the system.

[MySQLMonitor]
type=monitor
module=mysqlmon
servers=dbnode1
user=maxmonitor
passwd=6628C50E37CCE1F0392EDEEB9D1403F3

We now have a configuration that basically allows MaxScale to be used as a pure proxy for accesses to the database. There is no intelligence within MaxScale being used, as we have no option to route to more than a single database. The principles here remain true if multiple backend databases are given and the more usual scale-out properties of MaxScale are used; in order to reduce the complexity and illustrate the technique only a single backend database has been used here.

At this point the application should work if connected to port 7706 on the MaxScale host, but no monitoring data will be collected. We must now add the filter into the MaxScale configuration.

To add the filter we must first create a filter definition that links our filter module to the parameters we want for this particular case. Assuming we want to capture the 10 longest running queries and write them into a log file in the directory /var/log/myapp with filenames of the form session.$$, a filter section as follows can be used.

[top10]
type=filter
module=topfilter
count=10
filebase=/var/log/myapp/session.

We then need to add this filter to the service by adding an extra line to the service definition.

[Top10]
type=service
router=readconnroute
router_options=running
servers=dbnode1
user=maxadmin
passwd=6628C50E07CCE1F0392EDEEB9D1203F3
filters=top10

Now if we connect to port 7706 with our application, a number of files will be created in /var/log/myapp; these files will contain reports of the statements executed. An example of such a report is shown below.

Session started Thu Jun 19 17:06:34 2014
Connection from 127.0.0.1
Username        mark

Total of 24 statements executed.
Total statement execution time      35.653 seconds
Average statement execution time     1.486 seconds
Total connection time               80.443 seconds
Top 10 longest running queries in session.
==========================================

Time (sec) | Query
-----------+-----------------------------------------------------------------
    22.856 |  select sum(salary), year(from_date) from ...
     5.301 |  select d.dept_name as "Department", y.y1 ...
     2.898 |  select year(now()) - year(birth_date) as age, ...
     2.221 |  select dept_name as "Department", sum(salary) / 12 ...
     0.839 |  select dept_name as "Department", avg(year(now()) - ...
     0.665 |  select year(hire_date) as "Hired", d.dept_name, ...
     0.337 |  select moves.n_depts As "No. of Departments",...
     0.245 |  select year(now()) - year(birth_date) as age, ...
     0.175 |  select year(hire_date) as "Hired", count(*) as ...
     0.160 |  select year(hire_date) - year(birth_date) as ...
-----------+-----------------------------------------------------------------

Please note that the queries have been truncated in this blog entry; an actual run would include the complete query text.

Refining The Selection Criteria

Other options may be given to the top filter to reduce the subset of queries it will consider: regular expression patterns may be used to include or exclude particular statements, the client address may be used to limit the sessions to only those originating from that address, or the client user name may be used to limit the functionality to only that user.

I am only interested in select performance

In this case you can use the regular expression capabilities to limit the report to show only select statements; other statements will not be included in the ranking. To do this modify the filter section with the extra parameter match=

[top10]
type=filter
module=topfilter
count=10
match=select
filebase=/var/log/myapp/session.

I want to see what data modification statements take a long time

In this case, rather than building a complex match expression, the exclude= option can be used to exclude any select statements.

[top10]
type=filter
module=topfilter
count=10
exclude=select
filebase=/var/log/myapp/session.

I only want select statements from a particular database user

The match= option can be used to restrict the logging to just select statements and the additional user= parameter can be used to further limit the statements that are included in the ranking.

[top10]
type=filter
module=topfilter
count=10
match=select
user=reporting
filebase=/var/log/myapp/session.

Conclusion

MaxScale was not designed as a tool for analysing the performance of your SQL application, and it is not as good or as extensive in this role as some of the purpose written tools, but the filters and the general plugin architecture of MaxScale mean that it can be used to perform tasks it was not originally conceived to do. The filter presented here is really only a demonstration of what can be achieved; it was written very quickly and did not require very much code to implement. However it can hopefully be useful in its own right and also stimulate more complex and useful filters to be created.

Monday, 28 July 2014

MaxScale, from proxy to replication relay. Part 2, the slave side

Part 1 of this blog post told the story of creating a binlog router for MaxScale that could connect to a MySQL Replication Master, download binlog files from that master and store them locally on the MaxScale server. This post will concentrate on the other side of the router, the interaction with the MySQL slaves that will see MaxScale as the replication master.

Router Structure

In some ways the role of master for MaxScale is much closer to the expected style of interaction that MaxScale was designed to deliver: a connection originates from a client to a MaxScale service, MaxScale processes the request and returns a result back to the client. The most obvious difference is of course that the processing does not involve forwarding the request on to another server; rather it involves sending back a result already cached within MaxScale. However if you delve a little deeper you find that this is too simplistic a view and, as with the master interaction, the way a slave interacts with MaxScale does not follow this simple pattern.

It is necessary to understand a little about how the slave interaction works in order to highlight the differences and illustrate why and how changes had to be made to the MaxScale model to facilitate the slave side of the binlog router in MaxScale. It should be noted however that the aim was not to create a MaxScale variant that would just act as a binlog relay server or to have facilities in the core that would not be generally useful for other router plugins. The aim, as always with MaxScale, was to keep the specifics of the binlog routing problem within the router plugin, whilst enhancing the core with general functionality that could benefit other router, filter or protocol plugins.

Phases

The slave interaction can really be thought of as consisting of a number of phases; the first of these is the registration phase. The registration phase is where the slave and master exchange the information they need in order to set up the replication channel. Once set up, the master will stream replication events to the slave until the connection is lost or the slave disconnects from the master.

The second phase is the catchup phase; this is when a slave connects to a master and requests binlog records from a position in the binlog file which is before the current leading edge of the binlog. The master must stream the binlog events from this point up to the current insert point in the current binlog file. The slave itself sees no difference between these historic binlog events and new binlog events for current database updates; it is merely a convenience of implementation to consider this as a separate phase. Indeed this phase may not exist if the slave that connects is already up to date with the master. What makes this phase different from an implementation perspective, however, is that there is no external trigger to send these events, only the slave registration message. There is also the potential for the number and size of messages to be massive; in testing several tens or hundreds of gigabytes have been sent during this phase.

The final phase is the steady state phase; the client is at the leading edge of the binlog records and is merely sent new binlog events when database updates occur. In this case the sending of new events is triggered by the arrival of events from the real database master. This is an example of events on one connection, the connection to the master, causing a reaction on one or more other connections, the connections to every up-to-date slave. Once a slave has entered this third phase it is possible for it to go back to the catchup phase if for some reason the slave connection is unable to keep up with the rate at which events arrive from the master. Therefore it is normal to see slave connections move between phase 3 and phase 2 for brief periods.

Registration Phase

The first, registration, phase fits very easily into the MaxScale event driven model; a slave connects to MaxScale and sends requests in the form of queries. These requests are parsed by a "mini-parser" in the router plugin and answered with the stored responses that were obtained when MaxScale registered with the real master server. The router implements a state machine for this slave registration process, with each successful query exchange advancing the state of the slave connection until the state machine reaches the registered state. Upon reaching the registered state either the catchup phase or the steady state phase is entered, depending upon the binlog position requested by the slave.

Catchup Phase

The catchup phase is entered when a slave completes registration but asks for a binlog position and/or file which is before the latest binlog event which MaxScale holds. MaxScale must send all of the events from the requested position up to the current latest position. These events are streamed by MaxScale to the slave server, with no messages being sent from the slave to MaxScale. The architecture of MaxScale is however event driven; it receives an event and then fully processes that event before returning the executing thread back to the thread pool to process more events.

This model does not fit well with the streaming operation required when an out of date slave connects to MaxScale. The potential exists for the processing thread to read and stream vast amounts of data before it returns to the thread pool. This would mean that the thread would not be available to process other requests and would potentially starve them. However returning to the thread pool sooner is not possible, as the slave will not send any further requests in order to receive the remainder of the binlog. Some alternative mechanism is required if MaxScale is to support this kind of operation without either using large numbers of threads or suffering starvation issues.

The solution chosen was to add a new mechanism to the descriptor control block (DCB) to allow the definition of low and high water marks for the queue of data waiting to be sent. The router sends binlog events which create a queue of outstanding write requests in the DCB; when this queue reaches a certain size, the high water mark, the processing stops and the thread returns to the thread pool. Once the write queue drains below the low water mark for the DCB a synthetic event is generated for that DCB. This event triggers the router to send more binlog events, until it once again hits the high water mark for the DCB or until all stored binlog events have been sent.

This approach allows the MaxScale processing model to be satisfied, gives a configurable throttling mechanism to the bandwidth used and also provides a way to limit the amount of memory each slave connection uses to buffer outgoing binlog events.

Steady State Phase

In the steady state phase a slave connection is registered to receive binlog events and currently has the most up-to-date event that MaxScale is aware of. Any new event that arrives from the master must be saved to the MaxScale binlog file and then forwarded to each host that is currently in the steady state phase. This is done by cycling around the servers and sending a copy of the event to each that is currently up to date. It is done on the same thread that receives the event from the master server, since, following the MaxScale event driven rule, there are no other events to trigger this transmission. Whilst this works well with a small number of slaves, as the number of slaves increases the processing time for each incoming binlog record also increases; as this occurs it becomes more likely that a new incoming event arrives before the processing of the previous one completes. This can eventually lead to starvation of the MaxScale thread pool and poor performance. Ideally a mechanism should exist to allow a single record to be sent to multiple slaves using multiple threads. In order to facilitate this a mechanism is required to allow worker threads to be invoked from within an event processing thread. This is one issue that still needs to be resolved in the current proof of concept.

Outstanding Issues

As well as the lack of a worker thread mechanism within MaxScale there are a few other limitations that need to be overcome with the proof of concept as it stands:

  • It is not possible to connect a slave to MaxScale until after MaxScale has connected to a master server. This limitation exists because, until MaxScale has connected to a master, it does not have the responses it needs to answer the registration queries that the slave makes to MaxScale.
  • MaxScale only supports slaves of the same version as the master to which it attaches. There is no provision for converting the binlog events and protocol between MySQL versions.
  • Only MySQL 5.6 has been tested currently with the proof of concept.

Thursday, 17 July 2014

MaxScale - from proxy to MySQL replication relay

During the first part of the year I spent a lot of time working on a proof of concept to use MaxScale as a way to distribute MySQL binlogs for large replication installations. I have to admit that when I first heard the idea from Booking.com my reaction was "MaxScale is a proxy for client applications, it can't do this". However I was wrong, proving that versatile, configurable software can throw up surprises even for the people that design it.

The Problem

There have been posts elsewhere about the problem this is trying to solve, so I will not go into too much detail. Suffice to say that with large numbers of slaves connected to a single master the load on the master becomes too high, and using intermediate relay servers causes other issues because of the way MySQL replication re-executes the statements on the relay server and then sends the binlog records for that re-executed SQL rather than the original binlog records.

MaxScale's Structure

As followers of MaxScale are probably bored of hearing by now, MaxScale is built as a general purpose core that provides support facilities for a proxy, plus a number of plugins of different types. Hence the original idea of using it as the basis for a replication relay service came about. My problem was how to fit replication into something that was designed to act as a proxy for database applications and forward queries from those applications to a backend server. The most obvious feature that needs to be utilised is the query router within MaxScale. Normally these routers take requests in and forward them to one of a number of database servers, very much a push model of interaction. In replication however the slave servers each register with the master and then the master will stream changes, in the shape of binlog records, to the slaves - not the classical request/response model.

The conventional structure of a query router plugin within MaxScale is to have an instance of the router per service, with each client that connects being allocated a new session within the router. Each request that then arrives at the router for a given session is routed to an outbound server using rules in the router. Replies from that backend server are sent back to the client connection that was opened when the session was created. Replication however calls for a somewhat different structure.

In replication we need the router module to register with the master server and request binlog records from the master. This should be done once, and should probably not be in response to any event that comes in from the clients of MaxScale - in this case the clients are the slave servers. The other difference is that we don't take a single binlog record, the equivalent of a response in a more traditional router, and return it to a single client. Rather we may send it nowhere or to multiple connections; it depends how many slave servers we have attached and the current binlog positions of those slave servers. We may also need to send this record to a slave at some undetermined time in the future, if the slave is lagging behind or not connected at the time. Therefore the router has to act as a store and forward relay rather than a mere proxy forwarding request and response packets.

So after spending a little time thinking about the problem and listening to Jean-Francois at Booking.com explain just why something like this would be useful, I decided that what seemed like a crazy idea at first was indeed a very good idea. A germ of an implementation plan began to form and I started to construct a proof of concept of a router plugin module that would route binlog records. The rest of this post is the story of how I put that proof of concept together.

The Prototype Router

The requirements for this new router plugin were now becoming clearer:
  • It must request and receive binlog records from the master autonomously of any slave activity.
  • Binlog records must be stored on permanent or semi-permanent storage to allow replaying of those records to the slave servers.
  • The slave servers must be able to request historical binlog records without sending any additional traffic to the master server.
  • Binlog records received from the master must be relayed to the slaves that are able to accept them, i.e. not lagging behind the master, in a timely fashion.
I was also very keen that while doing this I did not do anything that would specialise MaxScale; I wanted the binlog router to work with the same MaxScale core, the same protocol plugins, monitors and filters, as any other plugin. It is fine to enhance the core to provide new generic services that are not already available, but wholesale adaptation of the core for this one requirement would not be desirable.

With these requirements and constraints in place I decided the best thing to do, as always, was to divide and conquer. Hence the first part of the implementation would concentrate on the interaction between MaxScale and the master: registering for the binlog records, retrieving them and storing them in a file. Once this was working the next step would be to move to the slave side.

Master Interaction

The first step in the process was to examine a typical interaction between a slave and the master during registration. Although the protocol defines the registration process itself, the slaves run a number of queries in order to determine the server version and the settings in place on the master that might have an impact on the slave. Since MaxScale would need to act as a master it must also be able to support these queries from slaves of MaxScale. The choice was that either MaxScale should know how to respond to these queries, or it could merely proxy these slave requests to the real master. The problem with the latter approach is that it would require a different connection from the one that MaxScale uses to receive binlog records; it is not possible to use the same connection, since once the connection is receiving binlog records you cannot send another query on that connection without halting the flow of binlog records. The decision was therefore taken not to use this method, but rather to have MaxScale respond directly to these requests without forwarding them on to the master.

The method of providing these query responses is fairly simple: a list of the requests that slaves may make was built up by observing the message traffic for a number of registration sessions. These queries are saved in the binlog router; the router executes each of them itself during MaxScale's own registration process and stores the responses. When a slave makes a request some time later, the saved response the master gave to MaxScale is simply replayed to the slave.

Registration Process

Having settled on the method of dealing with the registration queries from the slave servers, the next issue was how to get MaxScale to register with the master as a slave. The normal flow of router interactions is for a client session to be created, with traffic being forwarded to the backend databases only when that client session makes a request. In this case the client session is a slave server connection, and the backend server is the master from which MaxScale is receiving the binlog records. Ideally MaxScale should register with the master and start the process of receiving and storing binlog records before the first slave connects to MaxScale, or indeed it should collect binlog records even if no slaves are connected; so the existing workflow of a router would not be acceptable.

Fortunately there was a solution: as part of the process of creating a listener for a service, MaxScale will load the router that the service uses and create an instance of the router. This involves calling the createInstance() entry point of the router. In most routers this simply sets up any router wide data structures, checks any router options, etc. In the case of the binlog router, however, we use this as the trigger for MaxScale to connect to the master. The connection to the master is not a simple process; it requires several interactions with the master server. The rule of MaxScale implementation, that no thread should block, means that we cannot complete the process in the createInstance() call, as it would have to block waiting for a response from the master.

The solution to this problem was for the createInstance() call to create a dummy client and establish a MaxScale session to connect to the master. This allows the usual proxy calls to be used, but with an internal client rather than an external client application. Requests are sent via this dummy client, and responses received back. A finite state machine is built to execute the various queries required in the registration process.

The trigger to move from each state to the next is the reception of the response to the previous query or, in the case of the last two states, the COM_REGISTER_SLAVE and COM_BINLOG_DUMP messages. This allows the registration process to be implemented as a non-blocking process, with the clientReply() entry point of the router triggering the transition to the next state and sending the next request. After each message is sent, control of the thread is returned to the MaxScale core, thus satisfying the "no blocking" rule.

Upon completion of the state machine the master server will start to stream binlog records to MaxScale. These messages arrive asynchronously at MaxScale, as replies to the COM_BINLOG_DUMP message that was sent by MaxScale during the final state transition. As far as MaxScale is concerned it is receiving an endless stream of responses to the COM_BINLOG_DUMP message that it sent to the master. MaxScale saves these into a local binlog file maintained by the router plugin. The router must examine each of the binlog records in order to determine whether it is a record that affects the management of the binlog file itself, e.g. a binlog rotate event, or a binlog record that should not be stored in the file.

With the completion of this phase of the router we now had a MaxScale setup that could connect to a master server, register for replication updates and save binlog files locally on the MaxScale server. This enabled us to test the process and confirm that the contents of the MaxScale maintained binlog files matched those on the master server. The next phase in the implementation of the router was to create the slave side of the interaction, which will be covered in another post.

Friday, 11 July 2014

Using MaxScale to segregate load in a MySQL server cluster

The server weighting feature of MaxScale may initially be seen as a way to influence load balancing for unequal servers, but it can also be useful for other reasons. Here we look at how we can use it to segregate different workloads within a cluster of MySQL or MariaDB servers.

Concept

The concept is based around a scenario in which you might have a number of applications that share the same database, but that have different workload profiles on that data: for example an online order processing system, with a fulfilment application and a reporting application. Each of these places different demands on the database and each has different priorities within the business. It is probably more important for a business to accept orders and fulfil them than to have up-to-the-minute reporting if failures start to occur within the infrastructure.

The Implementation

Before delving into the configuration there are a few things to understand about MaxScale which make the implementation here possible.

Services

An application accesses databases via MaxScale using services that are defined within the MaxScale configuration file. These services define a combination of the protocols used, the databases that support the service and the servers to which the service will route. A single instance of MaxScale can support multiple services and the same server can appear in multiple services within MaxScale.

Server Weighting

The server weighting is not based on a single fixed weight parameter for every server; instead every service can define the parameter it uses to weight the connections. This means that a server can have many weights associated with it, and each service that includes the server can weight the distribution on a different server parameter.

Example

Returning to the example in the introduction, assume we define three parameters for each server:
  • processing_weight - the order processing weight
  • fulfilment_weight - the fulfilment weight
  • reporting_weight - the reporting weight
We also define three services within MaxScale:
  • processing - used by the online order processing application
  • fulfilment - used by the fulfilment application 
  • reporting - used by the reporting application
Each of these services uses the weight associated with it from the above list.

[processing]
type=service
weightby=processing_weight

[fulfilment]
type=service
weightby=fulfilment_weight

[reporting]
type=service
weightby=reporting_weight

If we assume we have six servers, called db1 to db6, then we can define the set of servers for each service to be all six servers. Each server can be assigned a weight to be used by each service.

[db1]
type=server
processing_weight=1000
fulfilment_weight=1
reporting_weight=1

[db2]
type=server
processing_weight=1000
fulfilment_weight=1
reporting_weight=1

[db3]
type=server
processing_weight=1000
fulfilment_weight=1
reporting_weight=1

[db4]
type=server
processing_weight=1
fulfilment_weight=1000
reporting_weight=1

[db5]
type=server
processing_weight=1
fulfilment_weight=1000
reporting_weight=1

[db6]
type=server
processing_weight=1
fulfilment_weight=1
reporting_weight=1000
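
The service sections shown earlier were abbreviated to highlight only the weightby parameter. A full service definition would also name the router, the list of servers and the credentials MaxScale uses to fetch user data; a sketch of the processing service in full, with the router choice and credentials as illustrative assumptions, might be:

[processing]
type=service
router=readconnroute
router_options=running
servers=db1,db2,db3,db4,db5,db6
user=maxadmin
passwd=maxpwd
weightby=processing_weight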

Using the values above, the order processing application, connecting to the processing service, would have 1000 connections to each of db1, db2 and db3 before any connections would be routed to db4, db5 or db6. The fulfilment application would have 1000 connections routed to db4 and db5 before any other servers are used. The reporting application would have 1000 connections to db6 before any other servers are used. So what do you gain by doing this over simply having services that include only the subset of hosts that you wish the particular application to connect to? The answer is that you gain high availability.

If in our example the server db6 fails, connections from the reporting application would be shared amongst the remaining servers. In this case we have made it equally likely that these go to the ordering system servers and the fulfilment system servers, but the weight can be adjusted to make reporting favour the fulfilment database servers if that is a more desirable failure mode.

In this example I have used the term connections, but the same mechanism is available for both connection routing and routing using the Read/Write Splitter. The segregation method can also be applied to both replication clusters and Galera Clusters. In the former only read operations (or connections) can be segregated; however in a Galera Cluster both read and write operations can be segregated.

Tuesday, 8 July 2014

MaxScale Modules - What's In 1.0Beta?


MaxScale is a modular proxy application; the modules can be considered as the building blocks of your proxy implementation within your MySQL database environment, so it is important to know what building blocks you have at your disposal. The release of version 1.0 as a beta means that the number of available modules has grown once again. Normally I post about the incremental changes in what is available, but I thought it was a good time to post a short summary of all the modules. This is not designed to be comprehensive documentation of the functionality available, merely a summary of what is available.

Routing Modules

Routers are perhaps the most important modules within MaxScale, since they make the decisions as to where to send requests. However they are incapable of functioning autonomously and require monitor and protocol modules in order to fulfil a useful function.

There are two classes of router within MaxScale: real query routers, and pseudo routers that are used as ways to expose internal MaxScale state. There are two important query routers included in the release, the readconnroute and readwritesplit modules.

Readconnroute

The readconnroute module should perhaps be renamed; it is a general purpose routing module that is used for all forms of connection routing, not just for read only connections. It will distribute connections amongst a set of servers using the constraints passed in the routing options. The following routing options are available:
  • master - only route to a server which has the master status bit set
  • slave - only route to a server that has the slave status bit set
  • synced - only route to a server that has the synced status bit set
If no routing options are given then the router will distribute the connections over all the servers that are currently running. Servers that are down or in maintenance mode will not have connections routed to them.

In a hierarchical MySQL replication tree the master node is treated as the node nearest the root of the tree that has the master bit set. Any intermediate nodes, which have both the master and slave bits set, will be treated as slave nodes.

The synced option is reserved for use with Galera clusters and ensures that connections are not routed to servers that are not fully fledged members of the cluster. In addition the Master/Slave states may be used with Galera, along with the special monitor module that elects one of the Galera nodes as a master.

The distribution of connections within a set of servers may not be even, but will honour any weighting bias that is introduced via the weightby setting of the service and the individual server weights.
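
Putting these pieces together, a readconnroute service that spreads read only connections across the slaves of a replication cluster could be defined as below; the server names and credentials are placeholders.

[Read Service]
type=service
router=readconnroute
router_options=slave
servers=dbserv1,dbserv2,dbserv3
user=maxuser
passwd=maxpwd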

Read/Write Splitter

The Read/Write Splitter is a much more sophisticated routing module; it examines the incoming request and determines if that request should be sent to a read/write server or if it may be sent to a read only server, i.e. a master or a slave in a MySQL replication environment. The Read/Write Splitter is not however limited to MySQL replication; it may also be used with Galera Clusters.

When used with Galera the monitor module will elect one of the nodes in the Galera cluster as the master to which all writes will be routed. The read load may then be spread across all of the other nodes within the Galera Cluster. This provides a means to use Galera in a pure HA mode, with read load balancing, rather than a true multi-master configuration.

The Read/Write Splitter is able to balance the load on the slave servers using several different criteria; these are chosen by the use of a router option setting in the MaxScale service. The options available are:

  • least number of connections from all MaxScale services
  • least number of connection from this MaxScale service
  • least number of current operations currently in progress
  • least lagging behind the master
The server weighting mechanism may also be applied to the first three of these methods if a non-uniform distribution within the chosen balancing criterion is required.

In addition it is possible to define a maximum replication lag, in terms of seconds behind the master, which a slave must adhere to in order to be chosen as an acceptable destination for read only statements.
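
As a sketch only, a Read/Write Splitter service using one of these criteria together with a lag limit might look like the following; the slave_selection_criteria and max_slave_replication_lag names are taken from my recollection of the readwritesplit documentation rather than from this post, so treat the exact spelling and placement as assumptions to verify, and the other values are placeholders.

[Split Service]
type=service
router=readwritesplit
servers=dbserv1,dbserv2,dbserv3
user=maxuser
passwd=maxpwd
max_slave_replication_lag=30
slave_selection_criteria=LEAST_CURRENT_OPERATIONS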

Pseudo Routers

MaxScale 1.0 beta also comes with a number of pseudo routers.

CLI

The CLI module is a simple information provider used by the maxadmin client interface in order to execute statistics gathering and administration commands. This is not a router in the true sense, but rather a way to expose internal MaxScale data.

Debugcli

The debugcli router is closely related to the CLI router and provides an interface for developers to access internal MaxScale data structures.

Monitors

Monitors provide data to the rest of the modules within MaxScale which is used by them to determine the current state of the servers within the system and hence the best destination to send requests. It is possible to use MaxScale without any monitors, setting the server states manually; however this loses many of the advantages of using MaxScale. There are currently two monitors available in the 1.0 beta version of MaxScale.

Mysqlmon

A monitor module that is designed to monitor MySQL replication configurations. It will detect the master and slave status of each server it is monitoring and will build the relationships between these nodes in order to give MaxScale modules access to the replication hierarchy information. In addition the mysqlmon module will measure the replication lag between data being written to the master and it becoming available on each of the slaves.

Galeramon

This monitor module is designed to monitor Galera clusters and provide information regarding the membership of the Galera cluster for each of the servers. In addition the Galera monitor is able to nominate a single server within a Galera cluster as a master node and the remainder as slaves. This allows Galera to be used as a highly available database with a single writable master and read only slaves with a very fast failover response.
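
Configuring the Galera monitor follows the same pattern as the mysqlmon configuration shown in the top filter post earlier, simply naming the galeramon module; the server names and credentials below are placeholders.

[Galera Monitor]
type=monitor
module=galeramon
servers=dbserv1,dbserv2,dbserv3
user=maxmonitor
passwd=monitorpwd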

Filters

Filters provide a means to extend MaxScale functionality by adding small modules into the processing pipeline of a MaxScale service. They allow for examination and modification of SQL requests; four such modules are included in the 1.0 beta release.

QLA

The qlafilter is a simple query logging filter that writes copies of the queries to a log file per user connection. It has mechanisms to limit the queries that are logged, using regular expressions, the connection source address and the connection user name.
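
A minimal qlafilter definition might look like the one below; the filebase prefix is a placeholder and the parameter names mirror those used for the top filter earlier, which is how I recall the qlafilter options, so verify them against the documentation.

[QLA]
type=filter
module=qlafilter
filebase=/var/log/myapp/qla.
match=select
user=appuser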

Regex

The regexfilter provides a means to alter SQL statements as they pass through MaxScale. It uses regular expressions to match and replace content as it traverses MaxScale.
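
A small example: a regexfilter definition that rewrites references to one table name into another might be written as below, with match holding the pattern and replace the substitution; the table names are purely illustrative.

[ReplaceTable]
type=filter
module=regexfilter
match=oldtable
replace=newtable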

Tee

A filter that will duplicate all or some of the requests to a second service within MaxScale. The tee filter also has mechanisms to limit the queries that are duplicated, using regular expressions, the connection source address and the connection user name.

Top

The top filter is a logging filter that will record the top N longest running SQL statements in your connection. When the connection is closed a log file for that connection will be written. The number of statements retained by this filter is controllable via a configuration parameter. The top filter uses the same mechanisms to limit the queries that are measured as are available for the QLA and Tee filters: regular expressions, the connection source address and the connection user name.

Protocol Modules

Protocol modules are, as the name suggests, responsible for the protocol aspects of the interfaces between external systems and MaxScale. The most important protocol modules are those used for communication from the client application to MaxScale and for the MaxScale to database server communication path.

MySQLClient

The protocol module used by client applications, which would normally connect directly to a MySQL database, to connect to MaxScale.

MySQLBackend

The protocol module MaxScale uses to connect to the backend MySQL databases.

Telnet

Terminal protocol used to connect the debug interface to MaxScale. This is only used for the debug interface and will never be used to carry database traffic.

Maxscaled

The protocol used by the MaxScale administration interface to connect to MaxScale.

HTTPD

An experimental HTTP protocol for use by HTTP requests for REST style interfaces.




Bugzilla: https://bugs.skysql.com

IRC: #maxscale on freenode