Click here to Skip to main content
13,199,852 members (65,998 online)
Click here to Skip to main content
Add your own
alternative version

Tagged as

Stats

42.8K views
2 downloads
19 bookmarked
Posted 19 Jun 2016

Writing a MySQL storage engine from scratch

, 19 Jun 2016
Rate this:
Please Sign up or sign in to vote.
Describes how to write a MySQL storage engine - a plugin which is responsible for persisting MySQL table data.

Introduction

This article summarizes the things that I learned when I wrote a new MySQL storage engine (“SE”). MySQL’s storage engines are plugins which are responsible for actually storing the data on disk, and providing access to the data. The SEs usually are implemented as key/value databases, but not necessarily have to. Oracle’s MySQL comes with a variety of SEs, and the whole SE landscape can be a bit overwhelming. The default is InnoDB, a key/value database library. Other SEs can read and write CSV files (“Tina”) or were written specifically for archiving the data (“archive”). MariaDB, a fork, is more progressive and includes SEs for accessing Cassandra NoSQL databases, Percona’s TokuDB key/value library and xtradb, an improved version of InnoDB.

In total, MySQL comes with 13 SEs, MariaDB has about 20. That’s clearly not enough. Let’s add another one!

Downloading code and first impressions

You can download the MySQL code from GitHub. I am working with version 5.7.12.

git clone https://github.com/mysql/mysql-server.git
cd mysql-server
git checkout mysql-5.7-12

Let’s have a brief look at the source code structure. The amount of files and directories is a bit intimidating, but most of them are not relevant for us. There are just a few directories which are worth a closer look:

include           Stores global include files. Here you will find mysql.h, which has many important macros and declarations.
sql               Contains the actual sql related code: parser, query engine etc.
sql/field.h       The Field class describes a MySQL column
sql/item.h        The Item class is a node in the abstract syntax tree which is generated by the parser
sql/sql_*.cc      These files implement the actual SQL operations
sql/handler.h     The base class for a storage engine
storage           This directory has all the storage engines
storage/innobase  The InnoDB storage engine
storage/example   A storage engine example project

After browsing through the code, it becomes clear that parts of MySQL’s code base are relatively old. While most code is implemented as C++, it’s actually a “C with classes”. More advanced C++ features like templates, exceptions or the standard library are not used. Linked lists are implemented as structures wrapping a “void *” object (my_list.h). MySQL also has its own string implementation (string.c) instead of using std::string. RAII is also not used. Instead, you will find many “goto” instructions which are responsible for cleaning up allocated resources.

Functions can be extremely long (more than 1000 lines of code, with many levels of nested “if” clauses - mysql_prepare_create_table), and some of the classes are also huge. Most methods are stored in the base class instead of derived classes, and in addition the classes store a lot of state.

MySQL uses several memory allocators (memroot_allocator.h). They are not STL compatible. Many classes overwrite operator new and operator delete to use a custom allocator.

The code has no consistent coding style. Indention varies, structure names are sometimes capitalized (“struct THD”), lower case with underscores (“st_mem_root”) or start with a capital letter and continue with underscores (“Field_long”). The use of spaces varies as well (foo= 3; foo = 3; foo=3).

I have never spoken to someone working for Oracle or MySQL AB, but we can try to make a few guesses about their corporate culture based on the code we have seen. The lack of a coding style means that each engineer can pick the one he’s most familiar with. Coding styles vary a lot, even in the same file, which can be a sign that there is no “code ownership” and everyone can work on every part of the code. I have worked with similar code in the past, in a similar environment, and it was a great place to work.

The use of “C with classes”, the refusal to use the STL and the archaic long and complex functions will require lots of refactoring if MySQL wants to get rid of its technical debt and attract new developers.

Compile, install, run, debug

Let’s continue with the installation. If you followed the steps above, then you already checked out the code into the “mysql-server” directory. In here we will run cmake, which will generate the Makefiles. We will create a Debug build which is slower but can be used for debugging.

cd mysql-server
cmake -DCMAKE_BUILD_TYPE=Debug
make -j 5
sudo make install

Yourcompiled MySQL files are now installed in /usr/local/mysql. You can run cmake in a subdirectory to have separate builds for Debug and Release, and you can choose different installation directories. Run “cmake --help” for a list of options. For this article I’ll just keep everything as simple as possible.

Now there are just a few more things to do till we can successfully start the MySQL daemon. We need to create a new data directory (which stores the tables) and initialize it (note that you will have to change the path names in the following commands):

mkdir /home/chris/tmp/mysql-data # must be empty, or else...
cd /usr/local/mysql/bin
./mysqld --initialize --datadir=/home/chris/tmp/mysql-data

The last command will generate a root password. Write it down, you will need it in a moment. I’ve set up my development environment with an empty root password - it’s just more convenient while testing. You can reset the password with the following commands (replace <PASSWORD> with the generated root password from above):

# start the server
./mysqld --datadir=/home/chris/tmp/mysql-data

# in a separate terminal we can now change the password
./mysqladmin  password --user=root --password=<password>

Now we can create a new database.

./mysqladmin create test --user=root --password

The client can be started as well, and then you can finally create tables, insert data etc.

./mysql --user=root test

Bootstrapping a new storage engine

A storage engine is implemented as a dynamic library (a “.so” file), and the sources are stored in the “mysql-server/storage” directory. If you just want to toy around then you can modify the existing “example” SE. I chose to create my own called “upscaledb” with the following steps:

cd mysql-server/storage
# copy the "ha_example" directory to "ha_upscaledb"
cp -r ha_example ha_upscaledb
# rename the files
cd ha_upscaledb
mv ha_example.h ha_upscaledb.h
mv ha_example.cc ha_upscaledb.cc

As a last step, use your favourite IDE (I use vim) to replace “example” with “upscaledb” and “EXAMPLE” with “UPSCALEDB”. Don’t forget to change the CMakeLists.txt file as well. Then go to the root directory of mysql-server and once more run “cmake” and “make”. The new upscaledb storage engine is now built, the file name is mysql-server/storage/upscaledb/ha_upscaledb.so.

Now we have to inform MySQL about the new storage engine. First we create a symbolic link from the installation directory to the new .so file. With this link our server will always use the newest version of the .so file. Whenever we make changes we simply compile the storage engine and restart the MySQL server.

cd /usr/local/mysql/lib/plugin
sudo ln -s ~/prj/mysql-server/storage/upscaledb/ha_upscaledb.so ha_upscaledb.so

The final step is to update MySQL’s internal system table. We can do this with the MySQL client. Make sure that the MySQL server is still running!

cd /usr/local/mysql/bin
./mysql --user=root test

mysql> INSTALL  PLUGIN upscaledb SONAME ‘ha_upscaledb.so’;

You can now start using the new SE and try to create tables (CREATE TABLE test (value INTEGER) ENGINE=upscaledb;). Since our SE is just a skeleton without any implementation, we will get an error. Before we start adding logic I’ll show you how to debug the MySQL server. The following command starts the server in gdb, and lets gdb catch all signals (i.e. CTRL-C breaks into the debugger):

cd /usr/local/mysql/bin
gdb --args ./mysqld --gdb --datadir=/home/chris/tmp/mysql-data

Try setting a breakpoint in the “create()” method of your SE, and run the CREATE TABLE command from above once more!

Adding functionality

It’s now time to fill in the various methods of our Handler class. It would need too much time to describe every method in detail. Besides, some of them are still unknown to me. But I will give an overview, and provide links to actual implementations.

It is important to be aware that there can be multiple instances of a Handler working on the same table. The actual table data therefore needs to be stored in a separate object (a “Share”), and the Handlers then acquire the Share. If your data is stored in a plain file then the file handle would be a member of the Share, not of your Handler.

You can look at the ExampleShare here and my UpscaledbShare here. You see that the UpscaledbShare stores the actual database handles, and additional metadata about the databases.

Creating tables

ExampleHandler::create and UpscaledbHandler::create are called whenever a table is created. Immediately afterwards, MySQL will call the open() method on that table. Your create() method therefore can prepare the table, but does not actually have to open it.

The implementation of UpscaledbHandler::create() is straight forward. It creates an upscaledb Environment, then creates a database for each index. If the user created a table without a primary key then an index is generated. The database configuration depends on the actual type of the column and a few other parameters (i.e. whether it is unique or not).

Opening tables

ExampleHandler::open and UpscaledbHandler::open are called whenever a table is opened. As mentioned above, this can (and will) happen many times for the same table. You therefore have to store the actual table data in your “Share”.

After retrieving a pointer to the Share object, the UpscaledbHandler::open() method checks if the upscaledb environment was already opened. If yes, it returns immediately. If not then it proceeds with opening the file, and stores the environment’s handle in the “Share”.

Closing tables

The ExampleHandler::close and UpscaledbHandler::close methods are supposed to close the table. If your table data is stored in the Share then you might use reference-counting to figure out when to destroy the Share object. In my UpscaledbHandler I never destroy the share; after all, the Share will be required again sooner or later.

INSERTing rows

Whenever you invoke the INSERT SQL statement, the write_row() method of your handler is invoked. Its only parameter is the new row, serialized in a byte array. This array stores the actual columns in different order than the one you specified when calling the CREATE TABLE statement. The primary key is always at the beginning, followed by all other indexed columns, and finally followed by the non-indexed column.

This byte array usually starts with an (optional) bitmap describing the null values in the current column. It is followed by either fixed length columns, or columns with variable length (a TINYTEXT, MEDIUMTEXT, TEXT, LONGTEXT or one of the corresponding BLOB columns). The variable length columns start with one or two bytes storing the size of the column, followed by the data (which can be stored in a separate memory chunk; in this case the byte array contains an encoded pointer to the actual data). If you persist a row in a file it makes sense to “condense” variable length rows in a more compact format, in order to reduce space.

My UpscaledbHandler caches the indexed Field objects to quickly extract the indexed columns. The following code can then be used to extract the key for an indexed column (“index” is the numerical ID of the index; the primary index is always 0).

static inline ups_key_t
key_from_row(TABLE *table, const uchar *buf, int index)
{
  KEY_PART_INFO *key_part = table->key_info[index].key_part;
  uint16_t key_size = (uint16_t)key_part->length;
  uint32_t offset = key_part->offset;

  if (key_part->type == HA_KEYTYPE_VARTEXT1
          || key_part->type == HA_KEYTYPE_VARBINARY1) {
    key_size = buf[offset];
    offset += 1;
  }
  else if (key_part->type == HA_KEYTYPE_VARTEXT2
          || key_part->type == HA_KEYTYPE_VARBINARY2) {
    key_size = *(uint16_t *)&buf[offset];
    offset += 2;
  }

  ups_key_t key = ups_make_key((uchar *)buf + offset, key_size);
  return key;
}

Having extracted the index keys, you can then store the row’s data in your file. You will have to deal with three cases:

  1. The user did not specify ANY index or primary key
  2. The user specified a primary key but no other indexes
  3. The user specified a primary key and additional indexes

If that’s not complex enough then also keep in mind that an index can be “virtual”, i.e. it combines multiple columns (as in the following statement).

CREATE TABLE test (
    id         INT NOT NULL,
    last_name  CHAR(30) NOT NULL,
    first_name CHAR(30) NOT NULL,
    PRIMARY KEY (id),
    INDEX name (last_name, first_name)    -- creates a virtual index!
);

Here is the implementation of the UpscaledbHandler::write_row method.

<h3deleteing rows="">

The DELETE SQL command ends up calling the delete_row() method of your handler. You have to make sure that ALL indices are deleted, not just the primary one. However, deleting the primary key is actually quite simple because MySQL will use a database cursor to locate it. Here is the UpscaledbHandler::delete_row implementation.

UPDATEing rows

This is the most complex one - at least if you want to make it fast. The update_row() method receives two parameters; the old row’s value and the new row’s value. The naive solution is to call delete_row() for the old row, then write_row() for the new row. This works, but is really slow because you end up updating all columns, even if just one was updated. It is much faster to only update those columns that were modified.

Cursors

For a lot of tasks, the MySQL core will just create a database cursor, locate a key (either on the primary or on a secondary index) and then move forward while working with the actual data. There are a few methods that you have to implement to support the cursors.

  • index_init(): creates a cursor for a secondary index
  • index_end(): can close the cursor
  • index_read_map(): positions the cursor on a row
  • index_next(): moves cursor to the next key, retrieves the row
  • index_prev(): moves cursor to the previous key, retrieves the row
  • index_last(): moves cursor to the last key, retrieves the row
  • index_first(): moves cursor to the first key, retrieves the row
  • index_next_same(): moves cursor to the next duplicate of the current key
  • rnd_init(): creates a cursor for the primary index
  • rnd_end(): closes the cursor
  • rnd_next(): moves cursor to the next key, retrieves the row
  • rnd_pos(): moves cursor to a specified row

Other methods

A few other methods which are important or interesting enough to be mentioned.

rename_table(): renames a table (and all its files). This is used whenever your schema is changed, i.e. because you add a column. MySQL copies all the data to a temporary table, then uses the rename_table() method to rename the temporary table into your original table.

delete_table(): deletes a table (and all its files).

table_flags(): returns a set of flags which describe the capabilities of your handler. A lot of these are not well documented, and many are geared towards InnoDB. My guess is that only InnoDB implements all capabilities.

index_flags(): returns a set of flags which describe the capabilities of your handler - i.e. whether your index_prev() and index_next() methods are provided etc.

Conclusion

Writing your own MySQL Storage Engine sounds like a complex endeavor, but actually it is not. You do not have to implement all the methods that I described above. For some methods, MySQL will deal with a missing implementation and provide its own. For others, it will simply abort some SQL commands with an error. The ExampleHandler is basically empty and does not provide any functionality. But nevertheless you can load it, place breakpoints in a debugger and start adding functionality piece by piece. If you get stuck then you can have a look at existing SEs of MySQL and MariaDB. And the developers on the mysql-internals mailing list are also very helpful.

Ideas for interesting SEs coming to my mind:

  • An append-only database which persists its data in HDFS; users can then use Spark or Hadoop’s Map/Reduce jobs to further process the data
  • In-memory tables based on std::map or std::multi_map
  • SEs which are backed by XML files

What other ideas do you have?

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

cruppstahl
Software Developer
Germany Germany
I am the author of upscaledb (http://upscaledb.com), a fast embedded key-value database with built-in analytical functions.

You may also be interested in...

Pro

Comments and Discussions

 
QuestionEngine idea Pin
Chris Seline15-Jun-17 9:46
memberChris Seline15-Jun-17 9:46 
AnswerRe: Engine idea Pin
cruppstahl15-Jun-17 10:07
membercruppstahl15-Jun-17 10:07 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.171020.1 | Last Updated 20 Jun 2016
Article Copyright 2016 by cruppstahl
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid