Migrating LIVE to a Multi-Disk Clickhouse Setup to Increase Operability and Decrease Cost

Trouble Reducing Unneeded Disk Space

Recently, my colleague Yoann blogged about our efforts to reduce the storage footprint of our Clickhouse cluster by using the LowCardinality data type. But reducing the actual usage of your storage is only one part of the journey; the next step is to get rid of the excess capacity if possible. As every engineer who has worked in a cloud environment knows, growing a virtual disk is easy, but shrinking it back once you no longer need the capacity unfortunately isn’t. Luckily for us, with version 19.15, Clickhouse introduced multi-volume storage, which also allows for easy migration of data to new disks. As this is still a somewhat new feature, we figured writing down our migration journey might be interesting for others, so here we go.

Multi-disk basics

Before we start, let’s first dive into the basics of multi-volume storage in Clickhouse. After upgrading Clickhouse from a version prior to 19.15, there are a few new concepts for how storage is organized.

Clickhouse now has the notion of disks/mount points, with the old data path configured in server.xml being the default disk. Disks can be grouped into volumes, and a default volume has been introduced that contains only the default disk. Where table data is stored is determined by the storage policy attached to the table; after the upgrade, all existing tables have the default storage policy attached to them, which stores all data on the default volume.

All this is reflected by the respective tables in the system database in Clickhouse:

clickhouse :) select * from system.disks;

SELECT *
FROM system.disks

┌─name────┬─path──────────┬────free_space─┬───total_space─┬─keep_free_space─┐
│ default │ /mnt/data/ch/ │ 1040199974912 │ 2641145643008 │               0 │
└─────────┴───────────────┴───────────────┴───────────────┴─────────────────┘
clickhouse :) select policy_name, volume_name, disks, max_data_part_size, move_factor from system.storage_policies;

SELECT 
    policy_name,
    volume_name,
    disks,
    max_data_part_size,
    move_factor 
FROM system.storage_policies

┌─policy_name─┬─volume_name─┬─disks───────┬─max_data_part_size─┬─move_factor─┐
│ default     │ default     │ ['default'] │                  0 │         0.1 │
└─────────────┴─────────────┴─────────────┴────────────────────┴─────────────┘

clickhouse :) select name, data_paths, storage_policy from system.tables where storage_policy='default';

SELECT 
    name,
    data_paths, 
    storage_policy 
FROM system.tables 
WHERE storage_policy = 'default'

┌─name───┬─data_paths────────────────┬─storage_policy─┐ 
│ foo    │ ['/mnt/data/ch/db/foo/']  │ default        │
│ bar    │ ['/mnt/data/ch/db/bar/']  │ default        │
└────────┴───────────────────────────┴────────────────┘

More details on the multi-volume feature can be found in the introduction article on the Altinity blog, but two parameters worth noting here are max_data_part_size and move_factor, which we can use to influence the conditions under which data is stored on one disk or the other.
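
For completeness, a table can also be bound to a specific policy at creation time via the storage_policy setting; everything created without it ends up on the default policy. A minimal sketch, assuming a hypothetical table db.events and a hypothetical policy named fast_and_slow defined in the server configuration:

CREATE TABLE db.events
(
    event_time DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY event_time
SETTINGS storage_policy = 'fast_and_slow';

This is exactly the binding that, as we will see below, cannot be changed once the table exists.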

Migrating our Clickhouse data

With this information in place, how can we now move our existing data off the under-utilized disks onto a new setup? Previously we had 12TB volumes on our data nodes, which were only ~30% utilized after we finished our project to optimize disk usage and query performance. So we decided to go for a two-disk setup with 2.5TB per disk.

The natural thought would be to create a new storage policy and adjust all necessary tables to use it. But the documentation states that

“Once a table is created, its storage policy cannot be changed.”

Even if it were possible, this would not be ideal for our scenario, as we use the same foundation for our SaaS platform and our self-hosted installations. That means we would not only need to run schema migrations on our own clusters, but would also have to ship them to our self-hosted customers, or diverge from our goal of having a single source of truth for both worlds.

So, if you can’t change a table’s storage policy after the fact, how about changing the default storage policy itself to model the new setup and give you a path for migrating data locally on each node without noteworthy downtime?

First of all, to achieve this we need to make the new disks that we mounted under /mnt/data1 and /mnt/data2 known to the system, by introducing the storage_configuration section in our server.xml (or a referenced config file):


  <storage_configuration>
    <disks>
      <data1>
        <path>/mnt/data1/ch/</path>
      </data1>
      <data2>
        <path>/mnt/data2/ch/</path>
      </data2>
    </disks>
  </storage_configuration>

Note the trailing / in the paths! It is required by Clickhouse, otherwise the server will not come back up!

With this change alone, Clickhouse would know about the disks after a restart, but of course not use them yet, as they are not part of any volume or storage policy. To achieve this, we enhance the default storage policy that Clickhouse created as follows:


  <storage_configuration>
    <disks>
      <data1>
        <path>/mnt/data1/ch/</path>
      </data1>
      <data2>
         <path>/mnt/data2/ch/</path>
      </data2>
    </disks>
    <policies>
      <default>
        <volumes>
          <default>
            <disk>default</disk>
            <max_data_part_size_bytes>50000000</max_data_part_size_bytes>
          </default>
          <data>
            <disk>data1</disk>
            <disk>data2</disk>
          </data>
        </volumes>
        <move_factor>0.97</move_factor>
      </default>
    </policies>
  </storage_configuration>

We leave the default volume, which points to our old data mount, in place, but add a second volume called data which consists of the newly added disks. By setting max_data_part_size_bytes on the default volume, we make sure Clickhouse doesn’t create new parts bigger than 50MB there; those are created on the new disks right away. By setting a move_factor of 0.97 on the default storage policy, we instruct Clickhouse that, as soon as a volume has less than 97% free space, it should start moving parts from that volume to the next volume in the policy’s order. So Clickhouse will start to move data away from the old disk until it has 97% free space again. During tests we tried to go directly with a move_factor of 1.0, but found that allowing Clickhouse to still write and merge smaller data parts on the old volume takes pressure off the local node until all the big parts have finished moving.
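
As a side note, parts and partitions can also be moved by hand with ALTER TABLE … MOVE, which can be handy if a few stragglers refuse to move on their own; a sketch using the (anonymized) names from the example table and output shown below, while for the actual migration we relied purely on the automatic background moves:

-- move a whole partition to the new volume
ALTER TABLE db.foo MOVE PARTITION ID '18439' TO VOLUME 'data';
-- or move a single part to a specific disk
ALTER TABLE db.foo MOVE PART '12345' TO DISK 'data1';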

With this configuration in place, after a restart Clickhouse will start doing the work and you will see log messages like this:


2020.05.27 13:15:26.387541 [ 15 ] {} <Information> db.foo: Got 2 parts to move.

During the movement, progress can be checked by looking into the system.parts table to see how many parts are still residing on the old disk:


clickhouse :) select partition, name, formatReadableSize(bytes_on_disk) as disk_size, path from system.parts where disk_name='default' and active=1 order by bytes_on_disk desc;

SELECT
    partition,
    name,
    formatReadableSize(bytes_on_disk) AS disk_size,
    path
FROM system.parts
WHERE (disk_name = 'default') AND (active = 1)
ORDER BY bytes_on_disk DESC
┌─partition─┬─name───────────┬─disk_size──┬─path─────────────────────────────┐
│ 18439     │ 12345          │ 1.07 GiB   │ /mnt/data/ch/data/db/foo/12345/  │
│ 18317     │ 67890          │ 387.00 B   │ /mnt/data/ch/data/db/foo/67890/  │
└───────────┴────────────────┴────────────┴──────────────────────────────────┘

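A quick aggregate of how much data is still sitting on the old disk can be handy as well; a small sketch along the same lines:

SELECT
    count() AS parts_on_old_disk,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_old_disk
FROM system.parts
WHERE (disk_name = 'default') AND (active = 1)
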
The number of active parts on the old disk will go down as Clickhouse moves parts away, starting with the small parts and working its way up to the bigger ones. As we still ingest new data, this process can take a few hours to complete. Eventually, when only a few bigger parts are left to move, you can adjust the storage policy to a move_factor of 1.0 and a max_data_part_size_bytes in the kilobyte range to make Clickhouse move the remaining data after a restart. Again, use the query above to make sure all parts have been moved away from the old disk. Once this is the case, the procedure can be finished with these steps:

1. Metadata is not moved by Clickhouse, so you need to copy over Clickhouse’s metadata folder from your old default disk to one of your new disks (data1 here). Make sure to update the file system permissions if you run these commands as a different user, otherwise Clickhouse will not come back up after a restart:

  $ cp -r /mnt/data/ch/metadata /mnt/data1/ch/
  $ chown -R clickhouse:clickhouse /mnt/data1/ch/metadata

2. Only MergeTree data gets moved, so if you have other table engines in use, you need to move that data over yourself. In our case we only had a TinyLog table that holds our migration state, which luckily doesn’t receive any live data:

  $ cp -r /mnt/data/ch/data/default /mnt/data1/ch/data/
  $ chown -R clickhouse:clickhouse /mnt/data1/ch/data/default

3. Adjust your server.xml to remove the old disk and make one of your new disks the default disk (holding metadata, tmp, etc.). There are three elements in the config pointing to the default disk (where path is what Clickhouse will consider to be the default disk):

  $ grep '/mnt/data/ch' /etc/clickhouse/server.xml 
  <path>/mnt/data/ch/</path>
  <tmp_path>/mnt/data/ch/tmp/</tmp_path>
  <format_schema_path>/mnt/data/ch/format_schemas/</format_schema_path>

Adjust these to point to the disk (data1) where you copied the metadata in step 1.

4. Then also remove the old disk from the default storage policy:


<storage_configuration>
  <disks>
    <data1>
      <path>/mnt/data1/ch/</path>
    </data1>
    <data2>
      <path>/mnt/data2/ch/</path>
    </data2>
  </disks>

  <policies>
    <default>
      <volumes>
        <data>
          <disk>data1</disk>
          <disk>data2</disk>
        </data>
      </volumes>
    </default>
  </policies>
</storage_configuration>

Now, after restarting Clickhouse, your old disk will not be in use anymore and you can safely remove it.
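
Before actually detaching the old disk, a final sanity check that no active part still references the old path doesn’t hurt; a sketch, with the path being our old default disk:

SELECT count() AS parts_on_old_path
FROM system.parts
WHERE (active = 1) AND (path LIKE '/mnt/data/ch/%')

If this returns 0, the disk can go.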

Troubleshooting Clickhouse Migration

Although the process worked mostly great, it seemed to us that the automatic moving isn’t 100% stable yet and errors do occur occasionally, so it is advisable to keep an eye on the logs while the migration is running. In some cases we saw the following error, although there was no obvious shortage of either disk space or memory. Restarting Clickhouse normally solved the problem if we caught it early on. If the data had diverged too much from its replica, we needed to restart Clickhouse with the force_restore_data flag. Once it was back up, it picked up where it left off.


2020.05.29 14:56:31.035573 [ 20 ] {} <Error> shared.website_monitoring_beacons:
DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree:
:LogEntryPtr&)>: Code: 243, e.displayText() = DB::Exception: Cannot
reserve 46.87 MiB, not enought space., Stack trace
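
To spot replicas that have fallen behind or gone read-only because of such errors, a look at system.replicas helps; a small sketch:

SELECT
    database,
    table,
    is_readonly,
    absolute_delay,
    queue_size
FROM system.replicas
WHERE (is_readonly = 1) OR (absolute_delay > 0)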

Epilogue – Performing a Live Clickhouse Migration

With this procedure, we managed to migrate all of our Clickhouse clusters to a new multi-disk setup (almost) frictionlessly and without noticeable downtime. As a bonus, the migration happens locally on each node, so we could keep the impact on other cluster members close to zero. With these capabilities in place, growing storage in the future has become as easy as adding a new disk or volume to the storage policy, which greatly improves the operability of Clickhouse. I expect more interesting features to come in this area, as has already been the case with the TTL moves introduced in a recent version of Clickhouse.
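
To give an idea of where this is heading: with TTL moves, Clickhouse can shift older parts to another volume automatically. A sketch, reusing the hypothetical events table from earlier and assuming a policy that contains a volume named cold (not part of the setup described above):

-- move parts older than 90 days to the slower 'cold' volume
ALTER TABLE db.events
    MODIFY TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold';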
