
Running Axon Server in a Virtual Machine

Key Takeaways

  • Your choice of platform comes down to how much flexibility you want, and more flexibility usually implies more preparatory work.
  • Modern infrastructure best practices support automation in rolling out your deployments. These can be used to great effect for your Axon Server cluster.
  • When using VMs rather than containers, the directory layout used for installation becomes more important, and it is advisable to adhere to Linux standards.
  • Security on the installation is imperative, but it may bite you in unexpected ways if you’re not careful. That should never be read as advice to bypass it, however.
  • Persistent volumes come in many shapes and sizes, and your infrastructure provider has a big influence. Be prepared to compromise on standards or even usability if they come at the cost of performance. It pays to have an idea of what your future usage will bring.

Introduction: Freedom vs Responsibility

In this series, we’ve been looking at running Axon Server locally, in Docker, and Kubernetes, and if there is one obvious conclusion we can draw, it’s about the differences in the support we get for deployment from the environment as we progress down that list. So far we’ve seen an increasingly predictable result from deployments, and with Kubernetes, it is pretty easy to have multiple deployments that we can set up and tear down in a jiffy.

However, we also saw big differences in the amount of preparation needed; to run it locally, we only need a JDK and the JAR file, whereas for running it in Kubernetes we need a container image, a Kubernetes cluster, and some way to hook storage to its Persistent Volumes. We also need to teach Kubernetes how it can determine if an Axon Server node is feeling happy, so it can restart it if needed.

So, what happens if we use a VM as a platform? Naturally, we need to do more work to get everything set up correctly, because instead of sharing a part of the Operating System, we now have to consider everything from the machine and upwards.

The benefit is the increased collection of “knobs” we can turn to tune our environment. Not all OS distributions give us exactly the same environment, so for many applications the installation instructions on a full server tend to either explode into a lot of choices to handle the differences, or reduce to the same basic instructions as for running locally, leaving it to the user to figure out the details. However, one anti-pattern we want to avoid is a manually-installed application that requires an operator to log on to the server and “do stuff”. This will inevitably lead to a situation where the server is simply left untouched for a long time because anything you do may make it unstable.

Luckily, nowadays there’s a focus on automation that can get us out of this, keeping the day-to-day work to a minimum, thanks to DevOps and CI/CD.

Buzzwords to the rescue!

The phrases I’d like to draw attention to are: “Infrastructure as Code”, “Ephemeral Infrastructure”, and “Immutable Infrastructure”. For those who aren’t too familiar with them, I’ll start with a bit of explanation.

“Infrastructure as Code” is used to signify the use of code, or at least structured and human-readable data, to represent the environment. This code is then interpreted by tools to realize it, either by following the steps in this description (imperative) or by comparing the current state to the description of the intended result (declarative). Dockerfiles and Packer descriptors are good examples of the imperative approach, while Kubernetes and Terraform use the declarative approach.

One immediate difference is that the imperative approach tends to work well for constructing building blocks, leaving no doubt about the result, while the declarative approach works well for composing the actual deployment based on those blocks, allowing the actual installation details to be worked out by tools optimized for the base they’re building it on. The only thing we should be aware of is that these tools do not free us from the initial investment of time to determine the details of the installation, e.g. where to install, what infrastructural components to build on, and how to combine everything into a smoothly working setup. However, once we have these worked out, we are freed from the actual execution details.
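To make the distinction concrete, here is a minimal sketch of how the two styles are typically driven from the command line; the template and plan file names are hypothetical placeholders:

#!/bin/bash
# Imperative: Packer executes the build steps in the template from top to
# bottom and produces a VM image as its end result.
packer build axonserver-image.pkr.hcl

# Declarative: Terraform compares the desired state in the current directory's
# .tf files with what actually exists, and only applies the difference.
terraform init
terraform plan -out=axonserver.tfplan
terraform apply axonserver.tfplan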

The term “Ephemeral Infrastructure” is used to describe the infrastructure that is set up temporarily, on-demand, and cleaned up immediately after use. For the Operations world, this was a pretty radical change, as the Quality of Service demands on infrastructure tended to make it pretty expensive to set up “just a VM”. However, thanks to a determined shift towards Infrastructure as Code, it started to become possible to set up not just a single server but a complete environment, with just a single push of a button. Combine that with the lowered cost of infrastructure components thanks to virtualization and a move to the cloud, and soon you could add a “while you wait” to the delivery catalog.

For containerized environments such as Kubernetes, it is the new normal. As a result, test environments, a traditional bane from a cost perspective (or development perspective, due to the concessions to budget constraints), can become an on-demand feature, set up at the start of an automated build and torn down as soon as all the tests are done. If you don’t mind running big tests outside of office hours, because you have done your homework and they are fully automated, you can even use off-peak hours and get really big savings in the cloud.

When you hear the phrase “Immutable Infrastructure”, your first thought might be that I am going to contradict the previous paragraph on “Ephemeral Infrastructure”, but nothing could be further from the truth. To introduce this concept, let me ask if you have ever heard someone, in anger, frustration, or even plain panic, searching for whoever changed something on the server. When your organization’s skills in Operations grow, installations tend to progress from pure handwork by a small team, through a standardization phase with documentation and strict rules on who can do what, towards full automation and a no-hands policy. The idea is that any installation must be based on some kind of structured description, and an automated process to realize it using that description. With this approach, you should be able to replace the currently running installation with a single push of a button and get the same situation back again. Also, upgrades should not require you to log on to the server, move new files in place, copy and edit some others, and start the new version. Instead, you should be able to change the source (“Infrastructure as Code”!) and let it continue from there automatically. The even bigger advantage comes into play in a disaster recovery situation: your entire environment can be rebuilt with no handwork! However, that does mean your installation needs to make a strict separation between the immutable part, OS and application, and the application’s state. Naturally, you’ll need to ensure that state is safely stored where you won’t lose it together with the rest of the installation, or you’ll have to make regular backups, but the net gain is tremendous.

It should be clear that these three concepts together are what we need to strive for. With a platform like Kubernetes, you are pushed towards this quite naturally, while for VMs we need to script this ourselves, either literally or by using a tool. This last bit may come as a surprise, but tools like Terraform, even though they integrate quite well with many platforms, only give you the capability. As said, there are lots of knobs you can turn, and that means there is no objectively “best” approach. Worse actually: if I have a configuration that works well with e.g. Azure, I cannot tell it to “do the same thing” on AWS, simply because of the differences in capabilities and how these are structured. So whichever way we go, we’ll need to work through the setup by hand at least once, to discover which knobs we want turned, and how.

Step 1: Let there be Linux!

In a container, we could focus on having just enough pre-installed commands to run a Java application using e.g. Google’s “Distroless” images. On a VM we need a complete OS, which means we have to choose a base distribution. To get that out of the way: choose your favorite! For Axon Server, it makes no difference at all. The important bit is that all your required logging and management tools are supported, and a Java 11 JDK is installed. (We want to stay with LTS versions.) To start with that last one, depending on your distribution, the name to search for is “java-11-openjdk-headless” or “openjdk-11-jdk-headless”, but there are also commercial alternatives. Given that we’re not going to use X-Windows, OpenJDK is the variant I would choose, but you sometimes need to also add the “dejavu-sans-fonts” and “urw-fonts” packages because the Java runtime contains code that expects some basic fonts to be available. If you are basing your image on a Debian distribution (including Ubuntu Server), you’ll be using “apt” to install the missing pieces, while the Red Hat world (including Fedora, CentOS, and whatever is going to replace it) uses “yum”, and Suse uses YaST. We’ll get into the details a bit further down.

The second thing to think about is where to install Axon Server so it integrates nicely into the Linux directory structure. A common practice is to install services under “/usr/lib”, “/var/lib” or “/usr/local/lib”. The first should be read-only, the second is for the corresponding state, the third for “local” installations. Strictly speaking, Axon Server is not a standard Linux tool, which argues for local, but recently many non-default applications choose “/usr/lib” or “/var/lib”, so we’ll follow suit and use “/usr/lib/axonserver”. Also, because we want Axon Server not to run as the superuser, we’ll create a user and group for it, with that directory as its home. Depending on how you want to deal with the logging, you could configure Axon Server to directly send its output to e.g. StackDriver, but for now, we’ll let it put everything in “/var/log/axonserver”. Now we could go further and also use “/var/lib/axonserver” for storing the state (event store, control-db, replication logs, and such), but we want those to survive re-installation and upgrades, so most will live on a separate disk mounted under “/mnt”.

Now, if you are somewhat familiar with the layout of the Linux filesystem, you may ask why I made no mention of “/opt”, which is often used for third-party applications. I wanted to stress, however, that we’re going to aim for a read-only installation, with state separated out, and so the approach chosen above is more “proper”. I also want to stress that we do not see the bare server as the immutable part, with the installation of Axon Server as a mutable addition: a complete server, with Axon Server installed, is my immutable starting point.

Step 2: Creating the image

Now that we have chosen where to install, we need to decide on how to create a VM image that will be the model for all our installations. Eventually, this process will be fully automated, but first, we need to research the steps and verify the result will work as intended. So open a browser to your cloud provider’s console and create a VM based on your favorite or most stable distribution, while keeping an editor handy to record everything you do. In my case, I’m using Google Cloud, and will create a GCE instance based on CentOS 8, which is the Open Source version of Red Hat Enterprise Linux aka RHEL. If you are using another cloud provider, the steps needed to create a base image will differ only in the details of creating the initial VM and the clean-up when all is ready. Please note that the hardware choices for this initial VM, the amount of memory and CPU cores, as well as the network connectivity, will have no effect on the VMs you will be creating using the prepared image. The only choice you have to be careful about is the CPU type: If you start with ARM, all your eventual servers will need to use that type as well. The same sometimes holds for Intel vs AMD: although modern OS’s should be able to adapt if they have both alternatives installed, it’s better to keep things predictable.
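As an illustration, creating such a scratch VM on GCE could look like the sketch below; the instance name, zone, and machine type are arbitrary example values, and the sizing has no effect on the final image:

#!/bin/bash
# Create a temporary build VM from the stock CentOS 8 image family.
gcloud compute instances create axonserver-image-build \
    --image-family=centos-8 \
    --image-project=centos-cloud \
    --machine-type=e2-medium \
    --zone=europe-west4-a

# Log in to start preparing the installation.
gcloud compute ssh axonserver-image-build --zone=europe-west4-a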

When the VM is ready, start an SSH session into it. A common practice is that you’ll get a standard user (such as “ec2-user” on AWS) or one based on your own account, and this user will have the right to perform commands as “root” using the “sudo” command. Once in, the first thing we do is to make sure the latest security updates are applied, and while we’re using the package installer, we’ll also install Java:

$ sudo yum -y update
…[output removed]
$ sudo yum -y install java-11-openjdk-headless dejavu-sans-fonts urw-fonts
…[more output removed]
$ java -version
openjdk version "11.0.9" 2020-10-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.9+11-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.9+11-LTS, mixed mode, sharing)
$ 

On Debian-based distributions, you would have used something like “sudo apt update” to first refresh the local package lists, and then “sudo apt upgrade -y” to actually perform the updating. Installing Java is then “sudo apt install -y openjdk-11-jdk-headless”.
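For reference, the Debian/Ubuntu equivalent of the session above would be along the following lines (the font packages mentioned earlier have different names in the Debian repositories, so check your distribution’s package list for those):

$ sudo apt update
…[output removed]
$ sudo apt upgrade -y
…[more output removed]
$ sudo apt install -y openjdk-11-jdk-headless
$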

Our next step is to create a user who’ll be the owner of Axon Server, and we use the “adduser” command for that:

$ sudo adduser -d /usr/lib/axonserver -U axonserver
$ 

This will create the “/usr/lib/axonserver” directory, a group named “axonserver” and a user named “axonserver” belonging to that group. The directory is made the new user’s home directory and (group) ownership is set to it. Next, we prepare the additional directories:

$ sudo mkdir -p /var/log/axonserver /var/lib/axonserver
$ sudo chown axonserver:axonserver /var/log/axonserver /var/lib/axonserver
$ 

Other things to arrange are the subdirectories for Axon Server’s state, which we already said will live on external disks. In the example configuration below I’ll be quite generous and use several disks, but you can rearrange this to fit your local strategy; a sketch of preparing and mounting such a disk follows the list. The important points are:

  • The Event Store (events and snapshots) will grow quickly and should therefore get the largest disk. There is the option to use a smaller fast disk for recent events, and a larger, slower (and thus cheaper) option for the full store, using the “PRIMARY” and “SECONDARY” roles in replication groups. This is a new feature since release 4.4.
  • The replication logs are regularly cleaned of older entries, as they cache changes sent to (or received from) other members of replication groups. It is useful to employ a smaller and faster disk here.
  • ControlDB and PID-file (where the Process ID is kept) are even smaller and could even be kept on server-local storage. Of these only the ControlDB needs to be backed up and you can do that using the REST API.
  • Properties file and system token file are (from Axon Server’s perspective) read-only, and you can either retrieve them from a vault or VM metadata, or put them on a shared disk. We use the shared disk approach here.
  • An honorable mention goes to the license file, which used to be read-only, but can now be updated through the Axon Server UI. Also, when building a cluster only the first node needs to get the license file, as the others will be automatically updated using that first copy. In that case, the location for the license file needs to be writable.
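A minimal sketch of preparing such an external data disk, assuming it shows up as “/dev/sdb” and will hold the event store; the device name, filesystem choice, and mount options are assumptions that will differ per provider:

#!/bin/bash
# Assumption: the attached data disk is visible as /dev/sdb.
# Format it (only the very first time!), mount it, and hand it over to the
# axonserver user.
sudo mkfs.xfs /dev/sdb
sudo mkdir -p /mnt/axonserver-events
sudo mount /dev/sdb /mnt/axonserver-events
sudo chown axonserver:axonserver /mnt/axonserver-events

# Add an fstab entry so the mount survives a reboot.
echo '/dev/sdb /mnt/axonserver-events xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab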

Now we need to register all these settings in the “axonserver.properties” or “axonserver.yml” file so we have the basic configuration ready. We’ll use YAML here:

logging:
  file: '/var/log/axonserver/axonserver.log'

axoniq:
  axonserver:
    replication:
      log-storage-folder: '/mnt/axonserver-logs'
    event:
      storage: '/mnt/axonserver-events'
    snapshot:
      storage: '/mnt/axonserver-events'
    controldb-path: '/var/lib/axonserver'
    pid-file-location: '/var/lib/axonserver'
    enterprise:
      license-directory: '/var/lib/axonserver'
    accesscontrol:
      systemtokenfile: '/var/lib/axonserver/axonserver.token'

The network disks for replication logs, the event store with events and snapshots, and the control-db will need to be writable. The configuration volume can be made read-only, and use subdirectories for the individual servers that need access to it. An alternative for the “config” folder is to use a “Secrets Vault” and download server-specific files from it during startup. We’ll get back to this in step 5.
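As an illustration of the vault approach, on Google Cloud a startup script could pull server-specific files from Secret Manager; the secret names below are hypothetical, and the VM’s service account is assumed to have read access to them:

#!/bin/bash
# Assumption: the properties file and system token were stored in Secret
# Manager under these (hypothetical) names.
gcloud secrets versions access latest --secret=axonserver-properties \
    > /var/lib/axonserver/axonserver.properties
gcloud secrets versions access latest --secret=axonserver-token \
    > /var/lib/axonserver/axonserver.token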

Finally, we need to add the Axon Server jar and optionally the CLI jar, after which we are ready to dive a bit into running services when the VM starts up.
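A simple sketch of putting the jar files in place, assuming they have already been copied to the build VM (for example with “gcloud compute scp”); the file names match those used in the rest of this article:

#!/bin/bash
# Move the jars into the read-only installation directory and make them
# owned by the axonserver user.
sudo cp axonserver.jar axonserver-cli.jar /usr/lib/axonserver/
sudo chown axonserver:axonserver /usr/lib/axonserver/*.jar
sudo chmod 755 /usr/lib/axonserver/*.jar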

Step 3: Running Axon Server as a Service

In Docker, a container has a command that is run and which determines its lifetime. Because we now have a full VM, with several services being run at various points during the server’s startup sequence, we need to look at the support Linux provides for this. If you’re a developer and only deal with macOS or Linux (as a subsystem of Windows) from that perspective, you may not have noticed the turf-war around a process called “init”. When processors were using 32 bits or less (yes, it was once that way, just don’t ask me to return there), the first program to run on any Unix system at startup was called “init”. Init is responsible for bringing the system up (or down) to the target “runlevel”, which is a number from 0 (the stopped state) to 5 (full multi-user). Two additional levels are “s” for “single-user” and “6” for “reboot”, but we can leave those out of scope. Init uses the target “runlevel” to move through the different levels up (or down) to that target, at every step executing the commands configured for entering (when going up) or exiting (when going down) that level. For runlevel 5 an important task is to start the login process on all terminal ports (physical ports, not network ports), of which the “console” port is probably the only one that has survived, even for VMs. There is a lot more I could tell about this, but for the moment this will have to do.

The implementation of this initialization process came to Linux from Unix by way of Minix version 2 and has been kept relatively unchanged for quite a long time, although the possibilities for customizing the commands executed at different levels did increase. Most commercial Linux distributions recognized the limitations and inserted a step with their own, more flexible initialization process. In 2010 a successor to init was proposed in the form of “systemd”, and in good OS-nerd tradition, the throne was defended with a fierce fight. Nowadays most distributions have happily adopted systemd, although init-diehards often get the option (during the installation of the OS) to choose their favorite process. What is important for us, is that rather than choosing a runlevel and a filename starting with a number (to insert it at the correct point of startup order) we can now “simply” indicate the “Before” and “After” dependencies of our service, and leave the ordering to systemd.

The choice for systemd has been made by the provider of our base image, and all cloud providers I checked go with the new default, so I can feel safe in suggesting we use a systemd “service” file to take care of starting and stopping Axon Server. A service file uses the INI file format, where we have three sections:

  • The “Unit” section introduces the service and its place in the service order. (think “Before”, “After”, “Wants”, “Requires”)
  • The “Service” section provides information on how to run the service.
  • The “Install” section tells systemd where to place the service in the global phasing of the services, comparable to the runlevels of init.

If we add a small shell script that performs any necessary initializations and then runs Axon Server (not in the background), this will be a class of service called “simple”. However, we can also let systemd monitor a PID file (a file containing the process ID) allowing the script to start Axon Server in the background. This type is called “forking”. We’ll also tell it to ignore standard output, as logging is already being sent to “/var/log/axonserver”:

# /etc/systemd/system/axonserver.service
[Unit]
Description=Axon Server Service
Requires=google-startup-scripts.service
After=google-startup-scripts.service

[Service]
Type=forking
User=axonserver
Group=axonserver
ExecStart=/var/lib/axonserver/start-axonserver.sh
PIDFile=/var/lib/axonserver/axonserver.pid
StandardOutput=null
StandardError=null
TimeoutStartSec=10

[Install]
WantedBy=multi-user.target

This piece of magic has a declared dependency on the Google Compute Engine, in the form of the last process to run before ours, specified as “Requires” (we need this one to have started) and “After”. (we want it to have finished starting before we get a shot) For Amazon Linux on AWS we can use a service named “cloud-final”, which has a comparable value for us.

The startup script, which starts Axon Server in the background and creates a PID file, becomes:

#!/bin/bash

# Locations of the (read-only) installation and the writable working directory.
USR_AXONSERVER=/usr/lib/axonserver
VAR_AXONSERVER=/var/lib/axonserver
cd ${VAR_AXONSERVER}

# Our own PID file, plus the one Axon Server writes itself.
PIDFILE=${VAR_AXONSERVER}/axonserver.pid
AXONIQ_PIDFILE=${VAR_AXONSERVER}/AxonIQ.pid

if [ -s ${PIDFILE} ] ; then
    PID=$(cat ${PIDFILE})
    if ps -p ${PID} > /dev/null ; then
        # The process is still alive, so there is nothing left for us to do.
        echo "Axon Server is already running (PID=${PID})"
        exit 0
    fi

    # A stale PID file would prevent Axon Server from starting again.
    echo "Cleaning up old PID files"
    rm -f ${PIDFILE} ${AXONIQ_PIDFILE}
fi

# Start Axon Server in the background and record its process id for systemd.
java -jar ${USR_AXONSERVER}/axonserver.jar &
PID=$!

echo ${PID} > ${PIDFILE}

This script first checks if there already is a pid file, and if so, whether the process with that id is still present. If it is, we know Axon Server is already running and our work is done. If not, Axon Server did not exit normally and we need to clean up the remains, or it will refuse to start. The “AxonIQ.pid” file is made by Axon Server itself, in a format determined by the Spring Boot code that generates the value, whereas systemd only wants the number; we do however need to remove it if it exists. The last steps are to start Axon Server and store the process id in the pid file. Systemd will be able to use that when it needs to stop Axon Server, and the format difference explains why we cannot use Axon Server’s own pid file.
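With the unit file and the startup script in place, registering and starting the service is a matter of the usual systemd commands:

$ sudo systemctl daemon-reload
$ sudo systemctl enable axonserver
$ sudo systemctl start axonserver
$ systemctl status axonserver
$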

Step 4: Securing the Installation

Starting Axon Server manually works fine, and using the configuration discussed in the previous step we can even start it using systemd... on some Linux distributions. There is a vital component in Linux that we haven’t discussed yet, and that is SELinux, short for “Security-Enhanced Linux”. Whether or not it is enabled (let’s hope it is) and to what level, is up to the person first installing the OS, but since we are using a provider-supplied base image, we may not know in advance. On the GCE standard CentOS 7 image, everything will have worked just fine with minimal trouble. However, if you use a CentOS 8 based image, you may find yourself unable to start the Axon Server service no matter what you try. Checking the information that a “systemctl status axonserver” command can give you, you could get a distinct impression that it is somehow not allowed. This is exactly what SELinux does for you, and you should be thankful for it. Perhaps not happy, maybe very unhappy, but definitely thankful. If you search the Internet to find out how to change SELinux into permissive mode, meaning that it will complain but not disallow, you will see warnings appear, but everything will work again. Another effect could have been, for example, that if we had set the HTTP port to one of its usual defaults, 80 or 443, the application would start but be unable to claim that port, even if it ran as user “root”.

If, like me, you’re on the software-building side of things, you tend to know “just enough” of operations to get your stuff done, but SELinux is rarely part of that segment of knowledge. If you want to use it, you’ll have to study it, and it ain’t simple. Now in principle, I think security should be as tight as possible, but I had “Study SELinux” somewhat further down my to-do list, and within a few hours I went, frustrated, from “this should be possible” to “let’s just disable it”. Explanations are there, but they assume you already know what you want to achieve in SELinux terminology, and for an outsider the details are rapidly lost in translation. In the current world, disabling security is not a good idea, and luckily I found help for the hunt. What SELinux does is add extra “labels” to files and processes. These labels are then used to determine what an application is allowed to do, above and beyond the normal file modes and ownership rules.

To get more details on what is happening, some additional tools are needed, which for CentOS 8 are in the packages named “setools-console” and “policycoreutils-python-utils”. If, like me, you want to know more about the how and why, I definitely advise you to look up “semanage” and “sesearch”. The approach I eventually chose was suggested in a post advising to keep executables in “/usr/local/bin”, as that would let them inherit the “bin_t” type, which systemd is allowed to execute. It appears there are actually three kinds of labels, with a name ending in “_u” for “user”, “_r” for “role”, and “_t” for “type”. You can view these by adding the “-Z” option to the “ls” command:

$ ls -l
total 95232
-rw-rw-r--. 1 axonserver axonserver     1026 Feb  4 11:01 axoniq.license
-rwxr-xr-x. 1 axonserver axonserver  3482022 Jan 22 13:56 axonserver-cli.jar
-rwxr-xr-x. 1 axonserver axonserver 93958618 Jan 22 13:56 axonserver.jar
-rw-rw-r--. 1 axonserver axonserver      961 Feb  4 11:01 axonserver.properties
-rw-r--r--. 1 axonserver axonserver      396 Jan 22 13:56 axonserver.service
-rw-r--r--. 1 axonserver axonserver     1157 Jan 22 13:56 logback-spring.xml
-rwxr-xr-x. 1 axonserver axonserver     3108 Jan 22 13:56 startup.sh
$ ls -lZ
-rw-rw-r--. axonserver axonserver system_u:object_r:user_home_t:s0   axoniq.license
-rwxr-xr-x. axonserver axonserver unconfined_u:object_r:user_home_t:s0 axonserver-cli.jar
-rwxr-xr-x. axonserver axonserver unconfined_u:object_r:user_home_t:s0 axonserver.jar
-rw-rw-r--. axonserver axonserver system_u:object_r:user_home_t:s0   axonserver.properties
-rw-r--r--. axonserver axonserver unconfined_u:object_r:user_home_t:s0 axonserver.service
-rw-r--r--. axonserver axonserver unconfined_u:object_r:user_home_t:s0 logback-spring.xml
-rwxr-xr-x. axonserver axonserver unconfined_u:object_r:user_home_t:s0 startup.sh
$

The problem shown above is the type “user_home_t”, which makes all files here “normal” contents of a user’s home and definitely not meant to be started as a system service! So we are going to use two commands: first “restorecon” to set all files to their default security context, so we have a known starting point, and then “chcon” to give only the startup script the “bin_t” type. Both these commands are in the default packages, so we can use them without problems:

$ sudo restorecon /var/lib/axonserver/*
Relabeled /var/lib/axonserver/axonserver-cli.jar from unconfined_u:object_r:user_home_t:s0 to unconfined_u:object_r:var_lib_t:s0
Relabeled /var/lib/axonserver/axonserver.jar from unconfined_u:object_r:user_home_t:s0 to unconfined_u:object_r:var_lib_t:s0
Relabeled /var/lib/axonserver/start-axonserver.sh from unconfined_u:object_r:user_home_t:s0 to unconfined_u:object_r:var_lib_t:s0
Relabeled /var/lib/axonserver/axonserver.properties from unconfined_u:object_r:user_home_t:s0 to unconfined_u:object_r:var_lib_t:s0
$ sudo chcon -t bin_t /var/lib/axonserver/start-axonserver.sh
$

The “unconfined_u” user is allowed unlimited network access, which will allow Axon Server to open ports and accept incoming connections. Please note that this is still not enough if you want to use the lower numbered ports; for that, yet more settings are needed, possibly up to creating a custom SELinux user that has those rights.
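If you want to explore what the policy currently allows before changing anything, the tools from the packages mentioned above can help; a couple of illustrative, read-only queries (the exact flags may differ between setools versions):

$ sudo semanage port -l | grep http_port_t
$ sudo sesearch -A -s init_t -t bin_t -c file -p execute
$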

To round off security, we need to add some simpler things such as enabling access control and adding certificates, and this was discussed in the first article on running Axon Server.

Step 5: Persistent Data

Now, this may sound a bit strange, as we’re already using a Virtual Machine with disks, but we haven’t yet made sure our data will survive updates and other configuration changes. With Kubernetes, you do this by creating Persistent Volumes, and with VMs you need to go and choose storage from your provider. Mostly there will be two types, depending on the abstraction level provided: block storage or file storage. The first is best compared to an actual disk, but connected through the network rather than attached physically. These come in different shapes and sizes, for example based on a traditional drive or a solid-state one, and tend to have a strong relation to a geographical region. They may also provide some kind of redundancy, and allow for easy backups and migrations. The other group, file stores, adds the “filesystem” layer, so we can directly talk about files and directories. This added functionality not only makes it easier to hide a layer of redundancy but also allows the actual storage medium used to be hidden, including restrictions on size and geographical distribution. The consequence is also that file storage tends to be a managed service, while block storage only manages availability, but not space. A further split often seen with Public Cloud providers is between Network Filesystems and Buckets. The latter option provides simpler and more bulk-oriented access, at the cost of a specialized interconnection layer.

A VM will have at least one disk, used as the “root” or “boot” device. This is the disk where the operating system is installed and creating a new VM generally starts with a boot disk that is a copy of a distribution image. When we build an image for Axon Server, we use this same boot disk as the base, so we can start a fresh server the same way we would start any other VM. The consequence is of course that in the event of an upgrade we want to be able to simply replace this boot disk with one based on the new version, so we need to ensure that all Axon Server’s state is safely stored elsewhere. File storage is probably easiest, but it also tends to be the most expensive, and in some cases has different limits on usage. Amazon’s EFS for example uses rate-limiting since the service is potentially used by several customers at the same time (even if they can’t see each other) and that again impacts the performance limits of Axon Server. Using Amazon EBS solves that problem, at the cost of having to perform some of the management tasks ourselves.

For this example, it doesn’t matter which way we choose, as that is not visible in the configuration of Axon Server. We’ll assume we have three volumes mounted:

  • A big one for the Event Store, where we’ll put the events and snapshots.
  • A smaller one for the replication logs and control database, and some other small items.
  • A shared small one for read-only files that we want to keep in a shared location.

The reasoning here is that the event store needs large and -if possible- cheap storage, as it may grow into Terabytes depending on usage. For the replication logs and control database, we use a smaller and faster disk. We can use this also to store the license file, which over time will be updated, although it is possible to use the read-only configuration volume, as long as we ignore the complaints that Axon Server is unable to distribute and update it itself. The last volume would actually be perfect for a file storage solution, where we create e.g. a single folder per server that only the DevOps team has write access to.

The resulting setup is:

  1. A large disk, for example, mounted on “/mnt/eventstore”, with subdirectories “events” and “snapshots”.
  2. A small fast disk, for example, mounted on “/mnt/data”, with subdirectories “controldb” and “logs”.
  3. A file store location, for example, mounted on “/mnt/config”.

The Axon Server configuration will point directly to this location, or use symbolic links in “/var/lib/axonserver”. In this example, the configuration volume will contain the license file, a system token, and a properties file.
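To give an impression of how such volumes come into being on GCE, here is a sketch of creating the large event store disk and attaching it to a node; the disk name, size, type, instance name, and zone are example values only:

#!/bin/bash
# Create a large, relatively cheap standard persistent disk for the event
# store and attach it to the first Axon Server node.
gcloud compute disks create axonserver-1-events \
    --size=2TB --type=pd-standard --zone=europe-west4-a
gcloud compute instances attach-disk axonserver-1 \
    --disk=axonserver-1-events --zone=europe-west4-a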

“Your Mileage may vary…”

There are a lot of things not covered here, and throughout this article I purposefully used words like “in this example” or “here”. The approach described in this article will work and is in fact the basis for what we use at AxonIQ, but it may not fit your situation. You may want to use GCE’s Managed Instance Group or AWS’s Auto-Scaling. Not to create multiple instances when the load increases, because that is not how Axon Server clusters work, but to be able to automatically restart unresponsive instances, and get automation support for upgrade rollouts. How to implement this involves so many details and choices, that I won’t even pretend to give you the ultimate best solution.

To name just a few things you should look at, also when using a container-based platform:

  • How will you deal with new instances and fresh disks? Are those disks already formatted? Mounted at the correct point? Kubernetes PersistentVolumes and provider File Storage solutions are normally ready-to-use.
  • How about the network configuration? Load Balancers for the UI/REST API? Will Axon Framework clients connect from outside the network segment that the Axon Server instances are, and need some form of Proxy or Load Balancer to punch through?
  • How will you go from a description of your required deployment to realization? Some form of CI/CD pipeline, a provider-specific tool, or something more generic like Ansible and Terraform?
  • How will you manage your “secrets”? Will you keep credentials and certificates in a vault? How and when do the instances retrieve those secrets?

You don’t have to solve everything upfront, as long as you eventually do move forward and improve the setup. There are so many moving parts that automation is essential and I promise you, there is nothing more satisfying than to see a cluster being replaced, node by node, while the clients adjust their connections, and keep working. We have a long-running test where we stop random nodes and restart them after a bit, and the test is not considered successful unless all client transactions are. Very useful to test your deployment procedures and hypnotizing to watch.

About the Author

Bert Laverman is a Senior Software Architect and Developer at AxonIQ, with over 25 years of experience, in recent years mainly with Java. He was co-founder of the Internet Access Foundation, a nonprofit organization that was instrumental in unlocking the north and east of the Netherlands to the Internet. Starting as a developer in the nineties he moved to Software Architecture, with a short stint as a strategic consultant and Enterprise Architect at an insurer. Now at AxonIQ, the startup behind the Axon Framework and Axon Server, he works on the development of its products, with a focus on Software Architecture and DevOps, as well as helping customers.
