Friday, 19 December 2014

A Year In the Life of MaxScale

This time of year it is traditional, at least in the UK, to look back and reflect on the year that is coming to a close. Since we have just produced the release candidate for MaxScale and are looking forward to the GA release early in the New Year, it seems like a good time to reflect on the events that have brought us to this stage in the story of MaxScale.

Going Public

The start of 2014 also marked the start for MaxScale, with the first public announcements regarding MaxScale and the first downloadable binaries. MaxScale itself had been started internally before that, but we wanted to hold off on letting it out into "the wild" until there was enough functionality in place to do more than just make "what it might be" type promises. At first we only had tar files available, and only for CentOS/RedHat Linux distributions; we also made the source code available on GitHub for the first time.

That first version of MaxScale contained the fundamental modules it needed to demonstrate the functionality we wanted to show, namely the MySQL protocol modules that allowed connections using the MySQL client protocol and allowed MaxScale to connect to the backend MariaDB and MySQL databases. It also contained the first iterations of the monitors that allowed us to monitor the basic state of a Galera cluster or a Master/Slave setup using MySQL replication. The two classes of router that MaxScale supports were represented by the read connection router and an early version of the read/write splitter router. This early version placed many restrictions on the SQL that would work with the splitter. In the first few months of the year we worked on improving the functionality of these modules and talking about MaxScale at events such as MySQL meetup groups, FOSDEM and Percona Live. We also created the MaxScale Google Group as a way to allow users to communicate with us and for the MaxScale team to announce improvements and bug fixes within MaxScale, and we launched the public bug site.

We had always planned MaxScale around five plugin classes: protocols, monitors, routers, filters and authentication. At this time two of those plugin classes were missing completely, filters and authentication. The filters we could do without for the time being, but we needed some form of authentication, so we built that into the protocol module as a short term expediency. We built what we called "transparent authentication": we loaded the user data from the backend database at startup and used this to authenticate the users that logged into MaxScale. We had to do this since it is key to the design of MaxScale that we have the ability to create multiple connections from MaxScale to the backend databases, and to be able to create those at different times in the lifecycle of the client session. This meant we could not simply proxy the packet exchange for the login sequence; we had to do more than that. While this is a vital part of the design, it does lead to some challenges, and it is perhaps one area in which users need to be aware of MaxScale when creating authentication configurations within the underlying database cluster. A considerable amount of time has been invested in answering questions in this area and providing documentation; however, it still remains one of those areas that needs more time invested in explaining fully how to set up the required authentication models.

Improved Galera Support

We were considering better support for Galera clusters, in particular we wanted to do something that would allow us to segregate the write operations in order to remove write conflicts. We considered writing a Galera specific router that would send all writes to a single node whilst distributing the reads across all nodes. It had been my plan to use this as a tutorial on the subject of writing your own router. It was while doing this that I realised this was not really necessary, since our existing read/write splitter could do this task for us, it just needed a way to determine which node should receive the writes.

The Read/Write splitter at the time used the Master bit in the server status to determine which database should receive the writes, therefore all that was needed was for the Galera monitor to select one of the running nodes to be the nominal master and set the master bit for that node. We wanted the selection to be predictable, so that if two MaxScales were front ending the same Galera cluster they would both choose the same node as the master. Since Galera allocates each node an ID, we select the node with the lowest ID as the master. The result was a solution that provides a Galera cluster as a high availability solution with read scale out for MySQL/MariaDB. Failover was very quick, since there was no database activity or binlog manipulation to perform as there would be with a MySQL replication cluster. What this solution does not give you is any write scale out ability.
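As a sketch of how this fits together in the MaxScale configuration file (the section names, server names, user and password here are illustrative placeholders, not a definitive configuration), a Galera monitor plus a read/write split service might look like this:

```ini
[Galera Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=maxmon
passwd=maxmonpwd

[Galera Service]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxuser
passwd=maxpwd
```

The galeramon monitor promotes the running node with the lowest Galera node ID by setting its Master status bit, and the read/write splitter then routes all writes to that node with no Galera specific routing code at all.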

Users and Surprises

It had always been my hope that users would find MaxScale useful and want to extend the MaxScale functionality to fit the particular needs they have; this is part of the reason for the plugin nature of MaxScale and the division of responsibilities between those plugins. However it was somewhat surprising to be contacted with the idea of using MaxScale not as a proxy between the database clients and the databases but as a proxy between the master and slaves in a MySQL replication cluster. I have to admit that my first reaction was that this was really not something MaxScale was designed to do, but upon meeting with the people proposing this I was convinced not just that it was an extremely good idea, but also that it was something that we could fit into the MaxScale architecture without having to make changes to the core or jeopardise the original MaxScale concept or design. So began a period of working closely with them to produce a specialist router module for MaxScale that allows MaxScale to act as an intermediate master within a replication tree. It was particularly nice to have such an unexpected use case presented to us and to find that MaxScale was flexible enough to facilitate it. It was also nice to then see such a well respected member of the user community publicly talk about this use of our work and even present it at conferences.

We have also had a number of other requests from users through the year; these have resulted in either the modification of existing modules to fit better in a given environment or the creation of completely new modules. This has included a number of different monitor modules to cover more database deployment architectures, MMM multi-master and MySQL NDB Cluster being two cases of this. We also produced a hinting mechanism at the request of a user, such that the original SQL text could include hints as to the destination to which statements should be sent.
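As an illustration of the hinting mechanism, a hint is embedded in a comment within the SQL text itself, so the application's queries need only minimal changes (the table and column here are invented; the hint wording follows the MaxScale hint documentation):

```sql
-- Force this statement to the master, overriding the normal routing decision
SELECT balance FROM accounts WHERE account_id = 42; -- maxscale route to master
```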

We have also come across other users that wanted additional functionality who have written or are writing plugin modules of their own. This again has been a vindication of our original concept of the pluggable architecture and the flexibility that we had hoped to achieve.

Filters & Other New Features

Throughout the year we have also added many new features to MaxScale, the biggest of these probably being the introduction of the filter API and a set of "example" filters for various tasks. This has provided us with some simple logging mechanisms, query rewriting and query duplication functionality. Other features that have been added include the ability to not just monitor a single level master/slave implementation, but a fully hierarchical replication cluster.

The concept of weighting in the routing decision has been added; this provides a way to allow MaxScale to manage a set of servers that have dissimilar configurations or to segregate load between servers within a cluster. A client application has been added to allow for some management and monitoring of MaxScale to be undertaken. The read/write splitter has been enhanced not just to remove the limitations that existed when it was first launched, but also to add new facilities to control the routing and support constructs not previously available.

Authentication has also undergone some change, with the addition of a mechanism to allow MaxScale to track new or changed users within the database and the ability to wildcard match IP addresses in the MySQL user table. Support has also been added for users that have been given access to only a subset of databases.

The MySQL Replication cluster monitor has also been enhanced to measure slave lag within the cluster. This slave lag is then made available to the routing modules so that the Read/Write splitter can be configured to disregard slaves that are more than a certain amount behind the master, or to always send read operations to the slave that is most up to date.
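As a configuration sketch (the section, server, user and password names are illustrative), the splitter can be told to disregard slaves lagging more than 30 seconds behind the master with the max_slave_replication_lag service parameter:

```ini
[Split Service]
type=service
router=readwritesplit
servers=server1,server2,server3
user=maxuser
passwd=maxpwd
max_slave_replication_lag=30
```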

As well as new features and enhancements we have also spent a lot of time fixing bugs, writing test cases and generally trying to improve MaxScale and the MaxScale documentation.

Packaging is another area that has benefited from more attention during the year, with RPM and Debian packages being introduced, allowing for much easier installation of MaxScale and also reflecting support for more Linux distributions.

Beta Testing & Towards First GA

The last few months have seen the focus of the MaxScale team shift towards the upcoming GA release, with new features playing less of a role in the development. That is not to say there have not been any new features, but these have been targeted for release after the first GA release. This allows us to concentrate on stabilising the features in that GA release and have them more fully tested, without preventing new features from moving forwards.

We also engaged with a number of users that expressed an interest in becoming beta testers for the MaxScale GA release. This, together with the internal testing which we have been performing is helping improve the quality of the MaxScale core and those modules that will be included in the first GA release.

Thank You

MaxScale of course is not a "one man band" and I would like to thank the developers that have been working with me over the year, Massimiliano and Vilho, plus those that have joined the team during the later part of the year, Markus, Martin & Timofey. They have all put an enormous amount of effort in to bring MaxScale to the brink of GA release. I should also not forget Ivan Zoratti, with whom a lot of the early ideas for MaxScale were developed; although he is no longer working with the team, his influence is still felt in a lot of areas.

The other group I would like to thank are the users that have given their time to try MaxScale and give us feedback and encouragement that what we are working on is useful and has a place in the database infrastructure. It has also been very gratifying to get personal feedback at events and to see others outside of the group start to talk about MaxScale and share some of the same thoughts we have had within the group as to what MaxScale can do within a database deployment - even if some of those have surprised us.

Tuesday, 9 December 2014

MaxScale, manual control, external monitors and notification methods

One of the nice things about the "plug and play" approach of MaxScale is that people constantly find ways of using it that were not originally envisaged when we designed MaxScale. One such configuration that I have heard of from multiple sources is using monitoring outside of MaxScale itself. This post will discuss a little about how monitoring works and how it can be moved outside of MaxScale. In particular a simplified example will be presented which shows how to use the notification mechanism in Galera to control MaxScale's use of the nodes in a Galera cluster.

Monitoring Within MaxScale

Perhaps it is best to start with a little background as to what the function of the monitor plugins is within MaxScale, how they work and how they communicate with the other components of MaxScale.

MaxScale monitors the servers for one reason only: so that it is able to feed the routing algorithms with the state they require in order to decide to which server the request the router is dealing with should be sent. In particular the monitoring is not designed to be a mechanism that administrators would use to determine the health of the database cluster or to provide critical alarms. There are other tools that are much more focused on performing that task than MaxScale.

Monitor plugins within MaxScale have one of the least complex interface requirements of any plugin module within MaxScale, other than the entry points used to configure options and for diagnostic purposes there are essentially four entry points. A pair of entry points to start and stop the monitor and a pair of entry points to register and unregister a server that should be monitored. The monitor implementation is expected to create a thread of its own on which to run, it will then monitor the servers that are registered with it asynchronously from any operations within MaxScale.
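As an illustrative sketch (the names here are indicative, not the exact MaxScale API), a monitor module exports a table of function pointers covering those four entry points, together with dummy implementations standing in for a real monitor:

```c
#include <stddef.h>

typedef struct server SERVER;   /* opaque server handle */

/* The four main monitor entry points described above: start/stop the
 * monitor thread and register/unregister the servers it should watch. */
typedef struct monitor_object {
    void *(*startMonitor)(void *instance);
    void  (*stopMonitor)(void *instance);
    void  (*registerServer)(void *instance, SERVER *server);
    void  (*unregisterServer)(void *instance, SERVER *server);
} MONITOR_OBJECT;

/* Dummy implementations: a real module would spawn its own thread in
 * startMonitor and poll the registered servers asynchronously. */
static int started;
static int registered;

static void *demo_start(void *instance) { started = 1; return instance; }
static void  demo_stop(void *instance)  { (void)instance; started = 0; }
static void  demo_register(void *instance, SERVER *s)
{
    (void)instance; (void)s;
    registered++;
}
static void  demo_unregister(void *instance, SERVER *s)
{
    (void)instance; (void)s;
    registered--;
}

static MONITOR_OBJECT demo_monitor = {
    demo_start, demo_stop, demo_register, demo_unregister
};
```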

The monitor threads communicate with the rest of MaxScale by setting flag bits on the individual servers to reflect the state of the server and adding values to some monitor owned fields within the servers. These "monitor owned" fields are related to the measurement of specific quantities, namely the replication lag within a cluster. The mechanism of more interest here is the status bits within the server structure. The bit values that are of interest are:
  • Running - the server is running the database, connections can be made to the server and SQL statements executed.
  • Joined - the server is a member of a multi-master cluster and is able to accept both read and write statements to execute on that cluster.
  • Master - the server is a member of a cluster and can be considered as a master of that cluster, i.e. it can accept both read and write requests.
  • Slave - the server is a member of a cluster and is able to accept read requests for that cluster only.
  • Maintenance - the server is running however it is in a maintenance mode, new connections or operations should not be sent to the server.

It is possible to set all of these bits manually; however, under normal circumstances, when a monitor module has been added to the configuration file, these will be changed back periodically by the monitor to reflect the situation as observed by the monitor. Therefore, in order to effectively control the state of the servers either manually or externally of MaxScale, it is important that no monitor module within MaxScale is monitoring the same set of hosts.

Manually Setting Server Status

The MaxScale command line interface application, maxadmin, allows the server status bits to be set and cleared manually using the set server command. MaxAdmin will also take commands as arguments on the command line, therefore a simple command can be used to set the status of a database server.

    $ maxadmin -pskysql set server dbserver1 master

The above command will set the server which has a section name of dbserver1 to be a master server; note that this has no impact on any other server status bits or on the bits of other servers. If the dbserver1 server previously had the slave bit set it would remain set. Likewise if the server dbserver2 had the master bit set before the call then it will still be set after this call.

To clear a status bit on a server the clear server command should be used.

    $ maxadmin -pskysql clear server dbserver1 master

As soon as a manual method of control is available, using these maxadmin commands, there is a mechanism to integrate with third party monitors or to use any state change notification that exists within your cluster management software.

Using Galera Notify Command Option

The Galera Cluster provides a mechanism to have the cluster make a call to an external application whenever the cluster state changes. This can be utilised to have the Galera Cluster directly control MaxScale, rather than have MaxScale monitor the status of the Galera Cluster and respond to the monitored state values. This approach has the advantage that MaxScale will be informed immediately of a state change within the Galera Cluster; the disadvantage of this mechanism is that at least one node in the cluster must be running in order to trigger the notification process.

The wsrep_notify_cmd option in the Galera Cluster can be used to define the name of an application or script that is called when the status of the node changes or the membership of the cluster changes. This script is called with a number of parameters that need to be parsed in order to determine what has happened in the cluster. At the simplest level this can be used to set the status of the node on which it is executed. When the status of the current node changes, the wsrep_notify_cmd is called with the --status= argument. A very simple script might be:


#!/bin/sh
# PASSWD and SERVER must be set as described below

# Parse the arguments passed by wsrep_notify_cmd; we only need --status
while [ $# -gt 0 ]
do
    case $1 in
    --status)
        STATUS=$2
        shift
        ;;
    esac
    shift
done

if [ "$STATUS" = "Undefined" ]
then
    maxadmin -p$PASSWD clear server $SERVER synced
    maxadmin -p$PASSWD clear server $SERVER running
else
    case $STATUS in
    Joiner)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    Donor)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD set server $SERVER synced
        ;;
    Joined)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    Synced)
        maxadmin -p$PASSWD set server $SERVER running
        maxadmin -p$PASSWD set server $SERVER synced
        ;;
    *)
        maxadmin -p$PASSWD clear server $SERVER running
        maxadmin -p$PASSWD clear server $SERVER synced
        ;;
    esac
fi

Note that the above script will mark a server as running and joined to the cluster if the Galera status says that it is either a fully synchronised node or a donor node. This differs from the default Galera Monitor plugin, which will remove a donor node from consideration for traffic to be routed to it.

In order to make this script useful two more things must be done: the variable PASSWD should be set to the password of the maxadmin admin user, and the variable SERVER should be set to the name by which this server is known in the MaxScale configuration file.

Other Notification Methods

Although the above example is based around the Galera wsrep_notify_cmd mechanism, the principles still hold for other environments; for example, the MHA scripts can be enhanced in order to allow MHA to control the server status bits of replication clusters. The MonYog HTTP based API can be used such that when MonYog observes a change of state of a node it makes an API call to an HTTP server, which executes the set of maxadmin commands needed to reflect the monitored change within MaxScale. This style of manual (or external) control of MaxScale could also be integrated into Pacemaker resource agents in order to synchronise the changes to the database with the changes in MaxScale.

Monday, 29 September 2014

Releases, profiling and scheduling

The MaxScale team have been working hard fixing bugs and improving performance. On Friday we released an update of MaxScale, the pluggable proxy for MySQL and MariaDB, and I wanted to write a little about a few of those changes. I will not mention every change (there are release notes that give the list of bugs fixed in this version) but rather highlight a couple of performance related changes and describe the rationale behind them. However, before I start on the two items I wanted to discuss, just a quick note to say that this version introduces cmake as the means to build MaxScale. The previous Makefiles are still available in this version, but will be withdrawn in future versions as we transition to cmake for build, installation and packaging.

Connection Starvation

In one of the test scenarios that I was running I observed an issue that manifested itself as pauses in the result set returned via MaxScale. In this particular situation a very large, multi-megabyte, result set was returned, and sometimes during the result set stream there would be a measurable pause during which the client would receive no new data. There was clearly some blocking or starvation issue within MaxScale that needed to be investigated and resolved.

I added some extra instrumentation to poll.c and noted that we often had no threads available to poll. Increasing the number of threads did not solve the problem, so clearly we had an issue with threads being blocked. This turned out to be the mutexes that had been added to prevent multiple threads processing reads and writes for the same DCB. The problem was that this caused other threads to block whilst the first thread worked on a read or a write, starving other DCBs of processor time. So these mutexes had to be removed and replaced by a simple queuing mechanism that allows subsequent threads to be freed back to the polling loop immediately rather than blocking.
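The non-blocking scheme can be sketched roughly as follows (the names and structure are illustrative, not the actual MaxScale symbols): instead of blocking on a mutex, a thread that finds the DCB already being worked on just records the extra request and returns to the polling loop, and the owning thread notices the count and processes the outstanding work itself.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int busy;   /* number of outstanding processing requests */
} demo_dcb;

static int processed;  /* stands in for the real read/write handling */

static void process_event(demo_dcb *dcb)
{
    (void)dcb;
    processed++;
}

void dcb_event(demo_dcb *dcb)
{
    /* If busy was already non-zero another thread owns this DCB; it
     * will see our increment and handle the event for us, so we return
     * to the polling loop immediately instead of blocking. */
    if (atomic_fetch_add(&dcb->busy, 1) > 0)
        return;

    /* We own the DCB: keep processing until no more work is queued */
    do {
        process_event(dcb);
    } while (atomic_fetch_sub(&dcb->busy, 1) > 1);
}
```

The key property is that no thread ever waits: it either takes ownership of the DCB or hands its work to the current owner and goes back to polling.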

This has made a big difference and is key to the MaxScale architecture - the guiding principle is that threads should not block, if they do then we have the problem that we will never be able to handle the load as the number of threads we require will grow uncontrollably.

Now that threads were more available, and the setting of the number of threads made a bigger difference to performance, I went on to look at synchronisation issues. I added instrumentation to the spinlock to detect hot locks that might need decomposing or might indicate poor parallelisation in the code. What I found was interesting…

Looking at the Descriptor Control Block (DCB) that was being used to write the results back to the client I saw the write queue spinlock was the only lock that was suffering any obvious contention, however this was isolated to a single DCB, so was not an issue with contention between router sessions, but within a single session.

       Write Queue Lock Statistics:
               Spinlock acquired                       1635627
               Total no. of spins                      69365889
               Average no. of spins (overall)          42
               Average no. of spins (when contended)   11336
               Maximum no. of spins                    68855
               Maximim no. of blocked threads          1
               Contended locks                         6119
               Contention percentage                   0

The spinlock was often acquired and mostly did not contend; however, there were a significant number of calls, 6119 in this sample, that did. When there was contention the number of spins required was high, an average of 11336. So the conclusion was that sometimes when the queue is locked there is a significant amount of work being done with the lock held. There were two things it could be: either the appending of buffers to the queue or the processing of the queue to send data. Looking at the show dcb output in the debug interface, the queue could become very large.

The next approach tried was to profile the code, with gprof, to see if this problem could be pinned down further. This yielded some more interesting data.

Each sample counts as 0.01 seconds.
 %   cumulative   self  self     total
time   seconds   seconds    calls  ms/call  ms/call  name
94.00    314.29   314.29 26007877 0.01     0.01  gwbuf_append
 4.43    329.10    14.82 1116143246   0.00     0.00  atomic_add
 0.43    330.55     1.45                 dcb_drain_writeq
 0.28    331.49     0.94 26397028 0.00     0.00  spinlock_acquire
 0.21    332.18     0.69                 dcb_write
 0.16    332.73     0.55       34  16.18    16.18  gwbuf_alloc
 0.13    333.15     0.42                 frame_dummy
 0.12    333.56     0.41 42439118 0.00     0.00  gwbuf_consume
 0.07    333.80     0.24 33654544 0.00     0.00  gwbuf_free
 0.05    333.98     0.18 40316849 0.00     0.00  gw_write
 0.05    334.16     0.18 25997764 0.00     0.00  gwbuf_length
 0.03    334.25     0.09 26390425 0.00     0.00  spinlock_release

So it seems the gwbuf_append is taking significant amounts of CPU time, at least in this scenario. So I made a simple change to maintain a tail pointer in the linked list and remove the need to walk the linked list in order to insert an item. This made a difference, yielding the following profile data…
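The fix can be illustrated with a minimal sketch (illustrative types, not the real gwbuf code): by keeping a tail pointer alongside the head, appending becomes a constant-time operation instead of a walk of the whole linked list on every call.

```c
#include <stddef.h>

typedef struct buf {
    struct buf *next;
    /* payload omitted for brevity */
} buf_t;

typedef struct {
    buf_t *head;
    buf_t *tail;   /* the added tail pointer */
} queue_t;

/* O(1) append: link the new buffer after the current tail rather than
 * walking the list from the head to find the end. */
void queue_append(queue_t *q, buf_t *b)
{
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;        /* queue was empty */
    q->tail = b;
}
```

With multi-megabyte result sets queued as many small buffers, turning each append from O(n) into O(1) is exactly what removed gwbuf_append from the top of the profile.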

Each sample counts as 0.01 seconds.
 %   cumulative   self  self     total
time   seconds   seconds    calls  ms/call  ms/call  name
46.26      0.87     0.87 73798949   0.00     0.00  atomic_add
11.77      1.09     0.22      226   0.97     0.97  gwbuf_alloc
10.70      1.29     0.20                   dcb_drain_writeq
 8.56      1.45     0.16 21028798   0.00     0.00  gwbuf_consume
 5.88      1.56     0.11                   dcb_write
 4.81      1.65     0.09 10261279   0.00     0.00  gwbuf_append
 3.48      1.71     0.07 10581718   0.00     0.00  spinlock_acquire
 2.14      1.75     0.04 21027433   0.00     0.00  gwbuf_free
 1.34      1.78     0.03                   frame_dummy
 1.07      1.80     0.02 21048962   0.00     0.00  gw_write
 1.07      1.82     0.02        1  20.00    33.49  poll_waitevents

This shows that gwbuf_append is no longer a major issue. The top function is now the atomic_add that is used for gathering stats and implementing the spinlock primitives. This is not too surprising, since atomic_add needs to lock the bus and is thus an expensive instruction.

It also had an impact on the write queue spinlock, with a similar duration sample now showing no contention with a single session running.

       Write Queue Lock Statistics:
               Spinlock acquired                     1176242
               Total no. of spins                    0
               Average no. of spins (overall)        0
               Maximum no. of spins                  0
               Maximim no. of blocked threads        0
               Contended locks                       0
               Contention percentage                 0

These changes are encouraging, but currently only relate to heavy write loads on MaxScale. They do not give universal improvements in performance, but they do yield some significant gains and reduce the CPU overhead for handling large result sets, making a previously CPU bound process now IO bound.

Blocking Zombies

MaxScale is designed to use several independent threads to process requests; each connection has a descriptor control block (DCB) associated with it. These DCBs can only be destroyed when MaxScale is sure that none of the threads are currently using the DCB. If a DCB is destroyed prematurely by a thread there is a risk that another thread is still using it. Therefore MaxScale implements a two stage process to destroy the DCBs. When a thread wishes to destroy a DCB it marks it as a "zombie" and places it on a zombie queue. Part of the DCB is a bitmask, with a bit per thread running in MaxScale. This bit is set for all threads when the DCB is in a non-zombie state; once it is placed in the zombie state, each thread clears the bit in this bitmask that corresponds to its thread ID when it enters the polling loop. Once the last thread clears its bit the DCB can be finally destroyed.
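The bitmask part of this two stage collection can be sketched as follows (illustrative names and a plain integer mask; the real code would use the actual DCB structure and atomic operations for thread safety):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t thread_mask;   /* one bit per polling thread */
} zombie_dcb_t;

/* Called when a DCB is placed on the zombie queue: set one bit for
 * every thread currently running in MaxScale. */
void zombie_init(zombie_dcb_t *dcb, int nthreads)
{
    dcb->thread_mask = (nthreads >= 32) ? UINT32_MAX
                                        : ((1u << nthreads) - 1);
}

/* Called by each thread as it re-enters the polling loop; returns true
 * when the caller was the last thread and the DCB may be destroyed. */
bool zombie_clear(zombie_dcb_t *dcb, int thread_id)
{
    dcb->thread_mask &= ~(1u << thread_id);
    return dcb->thread_mask == 0;
}
```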

This mechanism relies on threads frequently entering the polling loop in order to clean up DCBs in a timely fashion. However, the change that was introduced above to queue requests for a DCB was preventing the thread from reentering the polling loop when a large request or response was being sent via MaxScale. The result was that on systems with result sets of many gigabytes a large number of DCBs would build up in the zombie queue and connections would not be closed. This was particularly noticeable when using the binlog router for MySQL replication.

Looking at this problem also highlighted a different issue that could occur: the polling scheme would allow a thread to pick up all events available at the time the poll call was made. That thread would then be responsible for processing that set of events while allowing other threads to poll and collect sets of events of their own to work on. If the first event, or any event other than the last in this set, either took significant time to be processed or, more likely, was one of these connections streaming a large quantity of data, the subsequent events in that set would not be processed for some considerable time.

It was decided that the best approach to both of these problems would be to separate the processing of the events from the polling of the events. The implementation now polls for events within a thread, adds these events to a queue that is common to all threads and then picks the first event from this queue to process. If another event comes in for a DCB that is currently being processed then the DCB is marked as having pending events. When the current processing completes, that DCB is re-queued at the end of the event queue to be picked up later, as opposed to the previous scheme which would allow that pending event to be processed immediately to the detriment of other DCBs. This also meant that threads would reenter the polling loop more frequently and zombie processing could be done more often for these workloads.
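The shared queue and pending flag can be sketched like this (again illustrative and single-threaded for clarity; the real implementation must of course protect the queue against concurrent access):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct dcb {
    struct dcb *next;
    bool processing;   /* a thread is currently working on this DCB */
    bool pending;      /* an event arrived while it was being processed */
} dcb_t;

typedef struct {
    dcb_t *head, *tail;
} event_queue_t;

static void enqueue(event_queue_t *q, dcb_t *d)
{
    d->next = NULL;
    if (q->tail) q->tail->next = d; else q->head = d;
    q->tail = d;
}

/* Called when poll reports an event for a DCB */
void event_arrived(event_queue_t *q, dcb_t *d)
{
    if (d->processing)
        d->pending = true;      /* will be re-queued when processing ends */
    else
        enqueue(q, d);
}

/* Called by a worker thread to take the next DCB to process */
dcb_t *next_event(event_queue_t *q)
{
    dcb_t *d = q->head;
    if (d) {
        q->head = d->next;
        if (!q->head) q->tail = NULL;
        d->processing = true;
    }
    return d;
}

/* Called when a worker finishes with a DCB: a pending event sends the
 * DCB to the back of the queue, not the front, keeping scheduling fair. */
void event_done(event_queue_t *q, dcb_t *d)
{
    d->processing = false;
    if (d->pending) {
        d->pending = false;
        enqueue(q, d);
    }
}
```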

The result was that under high load situations a much fairer event scheduling took place, every connection would tend to see a more predictable performance and surprisingly the processor load MaxScale imposed on the system was decreased.

These are just two of the changes that can be found in the current release of MaxScale; there are many others, some mundane, others more important. I would encourage anybody that has not already done so to take a look at the release notes to see what has changed. As with any software we still have issues that are open, and we continue to work on these, increase the test coverage and respond to issues raised by users of MaxScale.

Thursday, 14 August 2014

Which SQL queries take all the time? Using MaxScale to answer that age old question.

Here we take a look at how one of the example filters supplied with the MaxScale 1.0 beta can answer that simplest of profiling questions - "Which of my database queries run within the MySQL server for the longest time?".

Why A MaxScale Filter?

MaxScale filters, although they can be thought of as simply a way to modify the SQL requests that pass through MaxScale, are also a good place to insert logging or monitoring. They have the advantage that they receive not just the SQL statements that are sent to the server, but may also receive the returned results. This means that MaxScale can not just log the query information, it can also record timestamps related to when it observes different parts of the request or response. As a way to demonstrate this a very simple example filter was constructed for the 1.0 Beta release of MaxScale - the top filter.

What Does Topfilter Do?

The topfilter filter observes both requests being sent to the server and responses being sent back. It sends both the requests and the responses onwards unmodified; however, it takes note of the request contents, provided it is a SQL request, along with a timestamp when it saw the request. When it sees the first response come back from the server it also takes a timestamp and uses the difference as a measure of the execution time of the statement. The longest running queries are retained, along with the execution time data. The number of retained queries can be configured within the MaxScale configuration of the filter. When the client connection is closed a report will be written to a file with the longest running queries that were observed for that connection.
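A filter definition for this in the MaxScale configuration file might look like the following sketch (the count and filebase parameters follow the topfilter documentation; the section name and file path are illustrative):

```ini
[top10]
type=filter
module=topfilter
count=10
filebase=/var/log/maxscale/top10.session
```

The count parameter sets how many of the longest running queries are retained, and filebase gives the base name for the per-session report files.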

How Do I Set Up My System?

The first step in creating a system to allow you to measure your queries is to configure a MaxScale service that is suitable for your application. That may be any of the supported routers, e.g. the readconnroute or Read/Write Splitter routers. Here we will assume that we only have one server; we are not using MaxScale for its scale-out abilities, merely as a simple performance tuning aid. Therefore we will use the readconnroute router with the option to connect to any server, the only server we have. The basic configuration for the service is as follows:
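A service section along the following lines matches this description (the service name Top10, the server name dbnode1 and the user credentials shown are placeholders; router_options=running tells the router it may use any running server):

```ini
[Top10]
type=service
router=readconnroute
router_options=running
servers=dbnode1
user=maxscale_user
passwd=maxscale_pwd
```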


The service also requires a listener, to define the port on which MaxScale will listen for requests to connect to the service and to link the service with the protocol module. Assuming the service section is named Top10, the listener definition takes the following form:

[Top10 Listener]
type=listener
service=Top10
protocol=MySQLClient
port=7706

We have used port 7706 for MaxScale to listen for incoming requests; this allows MaxScale to be located on the same machine as the MySQL/MariaDB server. Any port may be chosen by the user, provided it is not already in use. If no MySQL instance is running on the same machine the default MySQL port of 3306 may be used.

In this example we will assume we have a single database node, which we will call dbnode1.

The database node should be configured as follows:
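A server section along these lines matches the description (the address shown is a placeholder for your server's IP address or hostname):

```ini
[dbnode1]
type=server
address=192.168.0.10
port=3306
protocol=MySQLBackend
```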


The address= entry should contain the IP address or a hostname for the server. The assumption here is that the MySQL server is using the default MySQL port of 3306 to listen for incoming requests.

For completeness we should also add a monitor module that can determine if the node is in the correct state to communicate. This is more important if the backend is not a single database server, but a cluster of servers. It is good practice to always include a monitor even when only a single node is configured as it allows for ease of extension to a cluster and provides more information as to the state of the system.

A monitor section of the following form can be used (the section name and the monitor user name here are placeholders; the passwd= value is the encrypted form of the monitor user's password):

[MySQL Monitor]
type=monitor
module=mysqlmon
servers=dbnode1
user=monitor_user
passwd=6628C50E37CCE1F0392EDEEB9D1403F3

We now have a configuration that basically allows MaxScale to be used as a pure proxy for accesses to the database. There is no intelligence within MaxScale being used, as we have no options to route to more than a single database. Although the principles here are still true if multiple backend databases are given and the more usual scale-out properties of MaxScale are used, in order to reduce the complexity and illustrate the technique only a single backend database has been used here.

At this point the application should work if connected to port 7706 on the MaxScale host, but no monitoring data will be collected. We must add the filter to the MaxScale configuration.

To add the filter we must first create a filter definition that links our filter module to the parameters we want for this particular case. Assuming we want to capture the 10 longest running queries and write them into a log file in the directory /var/log/myapp with filenames of the form session.$$, a filter section as follows can be used.
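Based on this description, a filter section along the following lines can be used (the section name top10 is a placeholder; count= sets the number of queries retained and filebase= the prefix used for the per-session report files):

```ini
[top10]
type=filter
module=topfilter
count=10
filebase=/var/log/myapp/session.
```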


We then need to add this filter to the service by adding an extra line to the service definition.
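Assuming the filter section is named top10, the extra line in the service section would be:

```ini
filters=top10
```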


Now if we connect to port 7706 with our application we will get a number of files created in /var/log/myapp; these files will contain reports of the statements executed. An example of such a report is shown below.

Session started Thu Jun 19 17:06:34 2014
Connection from
Username        mark

Total of 24 statements executed.
Total statement execution time      35.653 seconds
Average statement execution time     1.486 seconds
Total connection time               80.443 seconds
Top 10 longest running queries in session.

Time (sec) | Query
    22.856 |  select sum(salary), year(from_date) from ...
     5.301 |  select d.dept_name as "Department", y.y1 ...
     2.898 |  select year(now()) - year(birth_date) as age, ...
     2.221 |  select dept_name as "Department", sum(salary) / 12 ...
     0.839 |  select dept_name as "Department", avg(year(now()) - ...
     0.665 |  select year(hire_date) as "Hired", d.dept_name, ...
     0.337 |  select moves.n_depts As "No. of Departments",...
     0.245 |  select year(now()) - year(birth_date) as age, ...
     0.175 |  select year(hire_date) as "Hired", count(*) as ...
     0.160 |  select year(hire_date) - year(birth_date) as ...

Please note that the queries have been truncated in this blog entry; an actual run would include the complete query text.

Refining The Selection Criteria

Other options may be given to the top filter to reduce the subset of queries it will consider: regular expression patterns may be used to include or exclude particular statements, the client address may be used to limit reporting to sessions originating from that address, or the client user name may be used to limit it to that user.

I am only interested in select performance

In this case you can use the regular expression capabilities to limit the report to show only select statements; other statements will not be included in the ranking. To do this, modify the filter section with the extra parameter match=
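For example, a pattern anchored on the select keyword (the exact expression is up to you) could be added to the filter section:

```ini
match=^select
```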


I want to see what data modification statements take a long time

In this case, rather than build a complex match expression, the exclude= option can be used to exclude any select statements.
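For example, an exclusion pattern along these lines could be added to the filter section:

```ini
exclude=^select
```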


I only want select statements from a particular database user

The match= option can be used to restrict the logging to just select statements, and the additional user= parameter can be used to further limit the statements that are included in the ranking.
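For example, to rank only the select statements issued by the user mark, the filter section could be extended with:

```ini
match=^select
user=mark
```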



MaxScale was not designed as a tool to analyse the performance of your SQL application, and it is not as good or as extensive in this role as some of the purpose-written tools, but the filters and the general plugin architecture of MaxScale mean that it can be used to perform tasks it was not originally conceived to do. The filter presented here is only really a demonstration of what can be achieved; it was written very quickly and did not require very much code to implement. However, it can hopefully be useful in its own right and also stimulate the creation of more complex and useful filters.

Monday, 28 July 2014

MaxScale, from proxy to replication relay. Part 2, the slave side

Part 1 of this blog post told the story of creating a binlog router for MaxScale that could connect to a MySQL Replication Master, download binlog files from that master and store them locally on the MaxScale server. This post will concentrate on the other side of the router, the interaction with the MySQL slaves that will see MaxScale as the replication master.

Router Structure

In some ways the role of master for MaxScale is much closer to the more expected style of interaction that MaxScale was designed to deliver: a connection originates from a client to a MaxScale service, MaxScale then processes that request and returns a result back to the client. The most obvious difference is of course that the processing does not involve forwarding the request on to another server, rather it involves sending back a result already cached within MaxScale. However, if you delve a little deeper you find that this is too simplistic a view and, as with the master interaction, the way a slave interacts with MaxScale does not follow this simple pattern.

It is necessary to understand a little about how the slave interaction works in order to highlight the differences and illustrate why and how changes had to be made to the MaxScale model to facilitate the slave side of the binlog router in MaxScale. It should be noted however that the aim was not to create a MaxScale variant that would just act as a binlog relay server or to have facilities in the core that would not be generally useful for other router plugins. The aim, as always with MaxScale, was to keep the specifics of the binlog routing problem within the router plugin, whilst enhancing the core with general functionality that could benefit other router, filter or protocol plugins.


The slave interaction can really be thought of as consisting of a number of phases; the first of these is the registration phase. The registration phase is where the slave and master exchange the information they need in order to set up the replication channel. Once set up, the master will stream replication events to the slave until such time as the connection is lost or the slave disconnects from the master.

The second phase is the catchup phase; this is when a slave connects to a master and requests binlog records from a position in the binlog file which is before the current leading edge of the binlog file. The master must stream the binlog events from this point up to the current insert point in the current binlog file. The slave itself sees no difference between these historic binlog events and new binlog events for current database updates; it is a mere convenience of implementation to consider this as a separate phase. Indeed this phase may not exist if the slave that connects is already up to date with the master. What makes this phase different from an implementation perspective, however, is that there is no external trigger to send these events; only the slave registration message. There is also the potential for the number and size of messages to send to be massive. In testing, several tens or hundreds of gigabytes have been sent during this phase.

The final phase is the steady state phase; the client is at the leading edge of the binlog records and is merely sent new binlog events when database updates occur. In this case the sending of new events is triggered by the arrival of events from the real database master. This is an example of events on one connection, the connection to the master, causing a reaction on one or more other connections, the connections to every up-to-date slave. Once a slave has entered this third phase it is possible for it to go back to the catchup phase if for some reason a particular slave connection is unable to maintain the rate required to match the incoming master arrival rate. Therefore it is normal to see slave connections move between phase 3 and phase 2 for brief periods.

Registration Phase

The first, registration phase fits very easily into the MaxScale event driven model; a slave connects to MaxScale and sends requests, in the form of queries. These requests are parsed by a "mini-parser" in the router plugin and answered with the stored responses that were obtained when MaxScale registered with the real master server. The router implements a state machine for this slave replication registration process, with each successful query exchange advancing the state of the slave connection until the state machine reaches the registered state. Upon reaching the registered state, either the catchup phase or the steady state phase is entered, depending upon the binlog position requested by the slave.

Catchup Phase

The catchup phase is entered when a slave completes registration but asks for a binlog position and/or file which is before the latest available binlog event which MaxScale holds. MaxScale must send all of the events, starting from the requested position, up to the current latest position. These events are streamed by MaxScale to the slave server, with no messages being sent from the slave to MaxScale. The architecture of MaxScale, however, is event driven; it receives an event and then fully processes that event before returning the executing thread back to the thread pool to process more events.

This model does not fit well with the streaming operation required when an out of date slave connects to MaxScale. The potential exists for the processing thread to read and stream vast amounts of data before it returns to the thread pool. This would mean that the thread would not be available to process other requests and would potentially starve those other requests. However, returning to the thread pool sooner is not possible, as the slave will not send any further events in order to receive the remainder of the binlog. Some alternate mechanism is required if MaxScale is to support this kind of operation without either using large numbers of threads or suffering starvation issues.

The solution chosen was to add a new mechanism to the descriptor control block (DCB) to allow the definition of low and high water marks for the queues of data waiting to be sent. The router sends binlog events which will create a queue of outstanding write requests in the DCB; when this queue reaches a certain size, the high water mark, sending will stop and the thread will return to the thread pool. Once the write queue drains to below the low water mark for the DCB a synthetic event is generated for the DCB. This event is used to trigger the router to send more binlog events, up to the point of once again hitting the high water mark for the DCB or until all stored binlog events have been sent.

This approach allows the MaxScale processing model to be satisfied, gives a configurable throttling mechanism to the bandwidth used and also provides a way to limit the amount of memory each slave connection uses to buffer outgoing binlog events.
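The flow of control described above can be sketched in Python. This is a toy simulation only; the class, method names and thresholds are illustrative, not MaxScale's actual C API:

```python
# Illustrative sketch of high/low water mark flow control for a slave
# connection catching up on stored binlog events. All names here are
# hypothetical, not MaxScale's real DCB implementation.

HIGH_WATER = 10  # stop queueing once this many events are pending
LOW_WATER = 3    # resume streaming once the queue drains below this


class SlaveDCB:
    """A toy stand-in for MaxScale's descriptor control block."""

    def __init__(self, binlog, start_pos):
        self.binlog = binlog          # stored binlog events
        self.next_pos = start_pos     # next event to send
        self.write_queue = []         # events queued but not yet flushed

    def send_burst(self):
        """Queue events until the high water mark is hit or the log is
        exhausted, then give the thread back to the pool (i.e. return)."""
        while self.next_pos < len(self.binlog):
            if len(self.write_queue) >= HIGH_WATER:
                return False          # throttled; wait for the drain event
            self.write_queue.append(self.binlog[self.next_pos])
            self.next_pos += 1
        return True                   # fully caught up

    def drain(self, n):
        """Simulate the network layer flushing n queued events; fire the
        synthetic low-water event when the queue drops below the mark."""
        del self.write_queue[:n]
        if len(self.write_queue) < LOW_WATER:
            self.send_burst()         # synthetic event resumes streaming


dcb = SlaveDCB(binlog=list(range(25)), start_pos=0)
dcb.send_burst()                      # first burst stops at the high water mark
while dcb.next_pos < len(dcb.binlog) or dcb.write_queue:
    dcb.drain(5)                      # each drain may trigger another burst
print(dcb.next_pos)                   # all 25 events eventually sent
```

No single call queues more than the high water mark's worth of data, so the thread is never held for an unbounded stream, yet every stored event is eventually delivered.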

Steady State Phase

In the steady state phase a slave connection is registered to receive binlog events and currently has the most up-to-date event that MaxScale is aware of. Any new event that arrives from the master must be saved to the MaxScale binlog file and then forwarded to each host that is currently in the steady state phase. This is done by cycling around each slave and sending a copy of the event to it if it is currently up to date. This is done on the same thread that receives the event from the master server, since, following the MaxScale event driven rule, there are no other events to trigger this transmission. Whilst this works well with a small number of slaves, as the number of slaves increases the processing time for each incoming binlog record also increases; as this occurs it becomes more likely that a new incoming event arrives before the processing of the previous one completes. This can eventually lead to starvation of the MaxScale thread pool and poor performance. Ideally a mechanism should exist to allow a single record to be sent to multiple slaves using multiple threads. In order to facilitate this, a mechanism to allow worker threads to be invoked from within an event processing thread is required. This is one issue that still needs to be resolved in the current proof of concept that has been developed.

Outstanding Issues

As well as the lack of a worker thread mechanism within MaxScale, there are a few other limitations that need to be overcome with the proof of concept as it stands:

  • It is not possible to connect a slave to MaxScale until after MaxScale has connected to a master server. This limitation exists because, until MaxScale can connect to a master, it does not have the responses it needs to answer the registration queries that the slave makes to MaxScale.
  • MaxScale only supports slaves of the same version as the master to which it attaches. There is no provision for converting the binlog events and protocol between MySQL versions.
  • Only MySQL 5.6 has been tested currently with the proof of concept.

Thursday, 17 July 2014

MaxScale - from proxy to MySQL replication relay

During the first part of the year I spent a lot of time working on a proof of concept to use MaxScale as a way to distribute MySQL binlogs for large replication installations. I have to admit that when I first heard the idea my reaction was - "MaxScale is a proxy for client applications, it can't do this". However I was wrong, proving that making versatile, configurable software can throw up surprises even for the people who design it.

The Problem

There have been posts elsewhere about the problem this is trying to solve, so I will not go into too much detail. Suffice it to say that with large numbers of slaves connected to a single master the load on the master becomes too high. Using intermediate relay servers causes other issues, because of the way MySQL replication re-executes the statements on the relay server and then sends the binlog records for that re-executed SQL rather than the original binlog records.

MaxScale's Structure

As followers of MaxScale are probably bored of hearing by now, MaxScale is built as a general purpose core that provides support facilities for a proxy, plus a number of plugins of different types. Hence the original idea of using it as the basis for a replication relay service came about. My problem was how to fit replication into something that was designed to act as a proxy for database applications and forward queries from those applications to a backend server. The most obvious feature that needs to be utilised is the query router within MaxScale. Normally these routers take requests in and forward them to one of a number of database servers, very much a push model of interaction. In replication however the slave servers each register with the master and then the master will stream changes, in the shape of binlog records, to the slaves - not the classical request/response model.

The conventional structure of a query router plugin within MaxScale is to have an instance of the router per service; each client that connects is allocated a new session within the router. Each request that then arrives at the router for a given session is routed to an outbound server using rules in the router. Replies back from that backend server are sent to the client connection that was opened when the session was created. Replication calls for a somewhat different structure however.

In replication we need the router module to register with the master server and request binlog records from the master. This should be done once, and should probably not be in response to any event that comes in from the clients of MaxScale - in this case the clients are the slave servers. The other difference is that we don't get a single binlog record, the equivalent of a response in a more traditional router, and return it to a single client. Rather we may send it nowhere or to multiple connections; it depends on how many slave servers we have attached and the current binlog positions of those slave servers. We also may need to send this record to a slave at some undetermined time in the future, if the slave is lagging behind or not connected at the time. Therefore the router has to act as a store and forward relay rather than a mere proxy forwarding request and response packets.

So after spending a little time thinking about the problem and listening to Jean-Francois explain just why something like this would be useful, I decided that what seemed like a crazy idea at first was indeed a very good idea. A germ of an implementation plan began to form and I started to construct a proof of concept of a router plugin module that would route binlog records. The rest of this post is the story of how I put that proof of concept together.

The Prototype Router

The requirements for this new router plugin were now becoming clearer:
  • It must request and receive binlog records from the master autonomously of any slave activity.
  • Binlog records must be stored on permanent or semi-permanent storage to allow replaying of those records to the slave servers.
  • The slave servers must be able to request historical binlog records without sending any additional traffic to the master server.
  • Binlog records received from the master must be relayed to the slaves that are able to accept them, i.e. not lagging behind the master, in a timely fashion.
I was also very keen that while doing this I did not do anything that would specialise MaxScale, I wanted the binlog router to work with the same MaxScale core, with the same protocol plugins, monitors and filters, as any other plugin. It is fine to enhance the core to provide new generic services that are not already available, but wholesale adaptation of the core for this one requirement would not be desirable.

With these requirements and constraints in place I decided the best thing to do, as always, was to divide and conquer. Hence the first part of the implementation would concentrate on the interaction between MaxScale and the master. Registering for the binlog records, retrieving them and storing them in a file. Once this was working the next step would be to move to the slave side.

Master Interaction

The first step in the process was to examine a typical interaction between a slave and the master during the registration. Although the protocol defines the registration process itself, the slaves run a number of queries in order to determine the server version and settings in place on the master that might have an impact on the slave. Since MaxScale would need to act as a master it must also be able to support these queries from slaves of MaxScale. The choice was that either MaxScale should know how to respond to these queries or it merely proxies these slave requests to the real master. The problem with the latter approach is that it would require a different connection from the one that MaxScale uses to receive binlog records. It is not possible to use the same connection, since once the connection is receiving binlog records you cannot send another query on that connection without halting the flow of binlog records. The decision was therefore taken not to use this method, but rather to have MaxScale respond directly to these requests without forwarding them on to the master.

The method of providing these query responses is fairly simple, a list of the requests that slaves may make was built up by observing the message traffic for a number of registration sessions. These queries are then saved in the binlog router; the router executes each of these queries itself during MaxScale's own registration process and the responses are stored within the router. When a slave makes a request some time later the saved response the master made to MaxScale is simply replayed to the slave.
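The record-and-replay idea can be sketched as follows. This is a toy model: the query list and every name in it are illustrative, not the actual set of registration queries or MaxScale's API:

```python
# Sketch of the "record and replay" approach: the router runs the
# well-known registration queries against the master once, stores the
# raw responses, and later replays them verbatim to any slave that
# asks. All names and queries here are illustrative.

REGISTRATION_QUERIES = [
    "SELECT VERSION()",
    "SELECT @@version_comment",
    "SELECT @@GLOBAL.binlog_checksum",
]


class BinlogRouter:
    def __init__(self, master):
        # Executed once, during MaxScale's own registration to the master.
        self.saved_responses = {
            q: master.query(q) for q in REGISTRATION_QUERIES
        }

    def handle_slave_query(self, sql):
        # The mini-parser: recognised registration queries are answered
        # from the cache and never forwarded to the master.
        if sql in self.saved_responses:
            return self.saved_responses[sql]
        raise ValueError("unexpected slave query: %s" % sql)


class FakeMaster:
    def query(self, sql):
        return "result-of(%s)" % sql


router = BinlogRouter(FakeMaster())
print(router.handle_slave_query("SELECT VERSION()"))
```

Because the cached responses came from the real master, every slave sees exactly what it would have seen had it registered with the master directly.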

Registration Process

Having settled on the method of dealing with the registration queries from the slave servers, the next issue was how to get MaxScale to register with the master as a slave. The normal flow of router interactions is for a client session to be created, with traffic being forwarded to the backend databases only when that client session makes a request. In this case the client session is a slave server connection, and the backend server is the master from which MaxScale is receiving the binlog records. Ideally MaxScale should register with the master and start the process of receiving and storing binlog records before the first slave connects to MaxScale. Or indeed it should collect binlog records even if no slaves are connected; so the existing workflow of a router would not be acceptable.

Fortunately there was a solution: as part of the process of creating a listener for a service, MaxScale will load the router that the service uses and create an instance of the router. This involves calling the createInstance() entry point of the router. In the case of most routers this will simply set up any router-wide data structures, check any router options etc. However in the case of the binlog router we use this as the trigger for MaxScale to connect to the master. The connection to the master is not a simple process however; it requires several interactions with the master server. The rule of MaxScale implementation, that no thread should block, means that we cannot complete the process in the createInstance() call, as it would have to block waiting for a response from the master.

The solution to this problem was for the createInstance() call to create a dummy client and establish a MaxScale session to connect to the master. This allows the usual proxy calls to be used, but with an internal client rather than an external client application. Requests are sent via this dummy client, and responses received back. A finite state machine is built to execute the various queries required in the registration process.

The trigger to move from each state to the next is the reception of the response to the previous query or in the case of the last two states the COM_REGISTER_SLAVE and COM_BINLOG_DUMP messages. This allows the registration process to be implemented as a non-blocking process, with the clientReply() entry point of the router triggering the transition to the next state and sending the next request. After each message is sent control of the thread is returned back to the MaxScale core, thus satisfying the "no blocking" rule.
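The non-blocking registration state machine can be sketched as follows. This is a toy model; the states and names are illustrative and greatly simplified compared with the router's real C implementation:

```python
# Sketch of the non-blocking registration state machine: each reply
# from the master advances the state and sends the next request, then
# control returns to the core. The state list is illustrative.

STATES = [
    "SELECT VERSION()",          # one of several info-gathering queries
    "SELECT @@server_id",
    "COM_REGISTER_SLAVE",        # register as a slave of the master
    "COM_BINLOG_DUMP",           # request the binlog stream
]


class Registration:
    def __init__(self, master):
        self.master = master
        self.state = 0
        self.registered = False

    def start(self):
        # Triggered from createInstance(); sends the first request only.
        self.master.send(STATES[0])

    def client_reply(self, reply):
        # Called by the core when a response arrives; never blocks.
        self.state += 1
        if self.state < len(STATES):
            self.master.send(STATES[self.state])
        else:
            self.registered = True   # binlog events now stream in


class FakeMaster:
    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)


master = FakeMaster()
reg = Registration(master)
reg.start()
while not reg.registered:
    reg.client_reply("ok")        # simulate a response to the last send
print(master.sent[-1])            # the final request is COM_BINLOG_DUMP
```

Each transition sends exactly one message and returns, so no thread is ever held waiting on the master.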

Upon completion of the state machine the master server will then start to stream binlog records to MaxScale. These messages arrive asynchronously at MaxScale, as replies to the COM_BINLOG_DUMP message that was sent by MaxScale during the final state transition.  As far as MaxScale is concerned it is receiving an endless stream of responses to the COM_BINLOG_DUMP message that it sent to the master. MaxScale then saves these into a local binlog file maintained by the router plugin. The router must examine each of the binlog records in order to determine if these are records that affect the management of the binlog file itself, i.e. a binlog rotate event, or if it is a binlog record that should not be stored in the file.

With the completion of this phase of the router we now had a MaxScale setup that could connect to a master server, register for replication updates and save binlog files locally on the MaxScale server. This enabled us to test the process and confirm the contents of the MaxScale maintained binlog files matched those on the master server. The next phase in the implementation of the router was to create the slave side of the interaction, this will be covered in another post.


Friday, 11 July 2014

Using MaxScale to segregate load in a MySQL server cluster

The server weighting feature of MaxScale may initially be seen as a way to influence load balancing for unequal servers, but it can also be useful for other reasons. Here we look at how we can use it to segregate different workloads within a cluster of MySQL or MariaDB servers.


The concept is based around a scenario in which you might have a number of applications that share the same database, but that have different workload profiles on that data. For example an online order processing system, with a fulfilment application and a reporting application. Each of these has different demands it places on the database and each has different priorities within the business. It is probably more important for a business to accept orders and fulfil them than to have up-to-the-minute reporting if failures start to occur within the infrastructure.

The Implementation

Before delving into the configuration, there are a few things to understand about MaxScale which make the implementation here possible.


An application accesses databases via MaxScale using services that are defined within the MaxScale configuration file. These services define a combination of the protocols used, the databases that support the service and the servers to which the service will route. A single instance of MaxScale can support multiple services and the same server can appear in multiple services within MaxScale.

Server Weighting

The server weighting is not performed on a fixed weight parameter for every server; instead every service can define the parameter used by the service to weight the connections. This means that a server can have many weights associated with it, and each service that includes the server can weight the distribution on different server parameters.


Returning to the example in the introduction, assume we define three parameters for each server, 
  • processing_weight - the order processing weight
  • fulfilment_weight - the fulfilment weight
  • reporting_weight - the reporting weight
We also define three services within MaxScale
  • processing - used by the online order processing application
  • fulfilment - used by the fulfilment application 
  • reporting - used by the reporting application
Each of these services uses the weight associated with it from the above list.




If we assume we have six servers, called db1 to db6, then we can define the set of servers for each service to be all six servers. Each server can be assigned a weight to be used by each service.
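In configuration terms the idea can be sketched as follows (the weight values are chosen to reproduce the behaviour described here, only one service and one server section are shown in full, and the address is a placeholder; each service names the server parameter it weights by via weightby=, and each server section defines a value for each such parameter):

```ini
[processing]
type=service
router=readconnroute
servers=db1,db2,db3,db4,db5,db6
weightby=processing_weight

# db1-db3 favour order processing; db4,db5 fulfilment; db6 reporting.
# The remaining server sections follow the same pattern.
[db1]
type=server
address=192.168.0.11
port=3306
protocol=MySQLBackend
processing_weight=1000
fulfilment_weight=1
reporting_weight=1
```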







Using the values above, the order processing application, connecting to the processing service, would have 1000 connections to db1, db2 and db3 before any connections would be routed to db4, db5 or db6. The fulfilment application would have 1000 connections routed to db4 and db5 before any other servers are used. The reporting application would have 1000 connections to db6 before any other servers are used. So what do you gain by doing this over simply having services that just include the subset of hosts that you wish to have the particular application connect with? The answer is high availability.
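The skew this produces can be illustrated with a small simulation. This is a toy model of weighted routing, not MaxScale's actual implementation, and the exact algorithm MaxScale uses may differ:

```python
# Toy model of weight-based connection routing: each new connection
# goes to the candidate server with the lowest ratio of connections
# to its weight for this service. All names are illustrative.

def pick_server(servers, weight_param):
    return min(servers,
               key=lambda s: (s["connections"] + 1) / s[weight_param])

servers = [
    {"name": "db%d" % i, "connections": 0,
     # heavily favour db1-db3 for the order processing service
     "processing_weight": 1000 if i <= 3 else 1}
    for i in range(1, 7)
]

for _ in range(900):
    s = pick_server(servers, "processing_weight")
    s["connections"] += 1

# With these weights the first 900 connections land on db1-db3 only.
print([s["connections"] for s in servers])
```

If db1-db3 were removed from the pool, the same rule would immediately start routing new connections to db4-db6, which is exactly the high-availability behaviour described above.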

If in our example the server db6 fails, connections from the reporting application would be shared amongst the remaining servers. In this case we have made it equally likely that these go to the ordering system servers and the fulfilment system servers, but the weight can be adjusted to make reporting favour the fulfilment database servers if that is a more desirable failure mode.

In this example I have used the term connections, but the same mechanism is available for both connection routing and routing using the Read/Write Splitter. The segregation method can also be applied to both replication clusters and Galera Clusters. In the former only read operations (or connections) can be segregated; however in a Galera Cluster both read and write operations can be segregated.