BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Erlang's Mnesia - a distributed DBMS for highly scalable apps

Erlang's Mnesia - a distributed DBMS for highly scalable apps

2007 could well be dubbed "the year the normalized database myth was debunked". We have seen interesting discussions on how large websites have moved away from the traditional relational database approach in order to service the millions of hits per hour they receive. As Joe Gregorio recently observed, common themes are emerging (emphasis mine):
If you want to scale to the petabyte level, or the billion requests a day, you need to be:
  • Distributed. The data has to be distributed across multiple machines.
  • Joinless. No joins, and no referential integrity, at least at the data store level.
  • De-Normalized. No one said this explicily, but I presume there is a lot of de-normalization going on if you are avoiding joins.
  • Transcationless. No transactions.
Those constraints represent something fundamentally different from a relational database.
These attributes can be used to describe Mnesia , the Erlang distributed DBMS, that supports high scalability and fault tolerance through replication, with the ability to lookup records without the need for a traditional relational database join. Mnesia is part of Erlang/OTP and designed to be used with Erlang language applications running on the Erlang VM (there are interfaces available for C/C++ and Java).

In the white paper "Mnesia - A Distributed Robust DBMS for Telecommunications Applications ", published in 1999, the authors (Håkan Mattsson, Hans Nilsson, Claes Wikstrom) introduce the astonishingly prescient concepts and key differences in the database. From the abstract:
The Mnesia DBMS runs in the same adress space as the application owning the data, yet the application cannot destroy the contents of the data base. This provides for both fast accesses and efficient fault tolerance, normally conflicting requirements. The implementation is based on features in the Erlang programming language, in which Mnesia is embedded.
Section 1 of the paper explains that the stringent fault tolerance and highly availability requirements of the telecommunications applications were the motivation behind building the DBMS. The requirements were:
  1. Fast realtime key/value lookup.
  2. Complicated non realtime queries mainly for operation and maintenance.
  3. Distributed data due to distributed applications.
  4. High fault tolerance.
  5. Dynamic re configuration.
  6. Complex objects.
Section 2 of the paper provides an overview of Mnesia components. Mnesia is comprised of several Erlang applications that perform the essential DBMS services such as locking, transaction management and replication. The authors note that Erlang is well suited to implementing the system, and that it comprises around 20,000 lines of code. The query syntax is part of Erlang itself, while the data model is akin to an object-relational DBMS:
Mnesia is also interesting due to its tight coupling to the programming language Erlang, thus almost turning Erlang into a database programming language. This has many benefits, the foremost being that the impedance mismatch between data format used by the DBMS and data format used by the programming language which is being used to manipulate the data, completely disappears.
The data model supports the concept of a table, where records equate to rows, and with each column able to store "arbitrarily complex compound data structures such as trees, functions, closures, code etc". An example of an Erlang complex record is provided:
X = #person{
    name = klacke,
    data = {male, 36, 971191},
    married_to = eva,
    children = [marten, maja, klara]
}.
Mnesia also supports a similar concept to Views known as Rules.

Replication in Mnesia is one of the mechanisms used to provide a fault tolerant DBMS. Tables can be replicated to nodes in a heterogeneous network, however this is transparent to the application. Replicated tables are peers, i.e. it does not use a master-slave model.

Section 3 of the paper details several unique features of Mnesia:
  • Complex values. Mnesia supports the ability to handle complex values naturally and efficiently (i.e. via a single lookup operation). "Organizing the telecommunications data in third (or even first) normal form is usually not possible."  Storing complex values in columns means that no joins need to be made.
  • Data format and address space. The DBMS runs in the same process space as the application. This allows a lookup to return a pointer to the object without the need to marshall the object to or from different data formats, or access it over the wire. The paper covers a common criticism of this approach where an application crash can corrupt the database by explaining that in Erlang an application can not crash in a way that would impact the DBMS. "Erlang processes have the efficiency advantage of running in the same address space but they do not have the possibility to explicitly read or write each others memory."
  • Fault Tolerance. Tables can be replicated to several compute nodes. Write operations apply to all replicas within the context of a transaction, with the ability to update replicas that are not available through updates upon recovery. "This mechanism makes it possible to design systems where several geographically distinct systems cooperate to provide a continuously running nonstop system."
  • Distribution and location transparency. Provide the ability for the application developer to access tables transparently - i.e. regardless of if it is remote or local, or a replica. However, there also exists a means to determine the location of the table and have code execution occur close to the data in the case where performance is critical.
  • Transactions and ACID. Mnesia supports ACID (Atomicity, Consistency, Isolation and Durability), but also offers the ability to perform in memory only operations on tables (at the expense of durability).
  • The ability to bypass the transaction manager. The concept of a "dirty interface" is introduced as a light weight locking mechanism that avoids the overhead of using the transaction manager. This capability is useful for performance critical operations, such as performing a record lookup. "These dirty operations are true realtime DBMS operations: they take the same predictable amount of time regardless of the size of the database."
  • Queries. Queries are expressed using a so-called "list comprehension syntax". For example, to find the names of people with more than X children the expression is formulated as: query [P.name || P < table(person), length(P.children) > X] end.
  • Schema alteration. The Erlang language itself supports the ability to change code of executing processing without stopping the process. This allows the Mnesia database to alter schema dynamically. "Since Mnesia is intended for nonstop applications, all system activities such as performing a backup, changing the schema, dumping tables to secondary storage and copying replicas have to be performed in the background while still allowing the applications to access and modify tables as usual."
In Section 4 of the paper, the authors explore implementation aspects such as persistence, lock management, query implementation and the fact that nodes in the distributed application may run on different endian computers, thus allowing a the system to work in a heterogeneous environment.

Section 5 discusses performance and observes that the "dirty interface" is considerably quicker than the transactional counterpart. As expected, the cost of replication adds significantly to the execution time of a transaction. The time (microseconds) to perform the transaction for a single node were 1877 with regular locking, 1225 with an explicit write lock, and 181 using a "dirty interface". For 3 nodes, the measured times were 13372, 12185 and 1121 for each of the transaction types.

The paper concludes with:
The Mnesia system is currently being used to build real products in Ericsson today, thus it is no longer a mere prototype system, it has matured enough to be labeled a product.
Mnesia development has continued since the paper, adding features such as fragmented tables (similar to shards, but handled at the database level), and is used in open source projects such as YAWS (Erlang Web Framework) and ejabberd (XMPP server).

While Mnesia excels at scalability and low latency in transactions on horizontally fragmented data, one remaining challenge may be how it will scale in terms of very large datasets. Examples to greater than 60 million rows have been cited. However, as Bill de hÓra recently wrote, increasing volumes of data will force us to rethink our database strategy:
I think that increased data volumes will impact day to day programing work far more than multicore will. A constant theme in the work I've done in the last few years has been dealing with larger and larger datasets. What Joe Gregorio calls "Megadata" (but now wishes he didn't). Large data sets are no longer esoteric concerns for a few big companies, but are becoming commonplace.
and concludes that:
The big volumes mean you need to be able to write data and not care where it went. And you need keyed lookup for reads built on top of the FS, not in the RDBMS (on the basis that an RDBMS with no joins, constraints or triggers is an indexed filesystem). That will end looking looking something like hadoop, mogilefs or S3 - a data parallel architecture.
Given the uptick in interest in Erlang/OTP (including recent InfoQ coverage), that could well end up looking like Mnesia as well.

Rate this Article

Adoption
Style

BT