Perhaps you are running an existing cluster, or maybe you have even set up your own. Fantastic. Running a Ceph cluster can be a surprisingly straightforward task, until it isn’t. Here are some general tips for keeping your cluster in top shape and ensuring it operates as smoothly as possible.
If you require some ongoing assistance in managing your cluster, don’t forget I am available to support clusters of all types and sizes anywhere in the world. Please see my Consultancy page for further information!
Never use 2x Replication
When designing your CRUSH map, do not use 2x replication unless you understand the inherent risks and problems that come with it. It will not provide the HA solution that Ceph is designed to offer. If you are happy to run the risk of 2x replication, your use case is likely better suited to RAID than to Ceph. If cost is a factor, looking into Erasure Coding instead is definitely recommended for Luminous (12.X+) clusters.
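To see whether any existing pool falls below 3x replication, a small check along these lines may help. It is a minimal sketch: `pools_below_size` is a hypothetical helper name, and it assumes `ceph osd pool get <pool> size` prints a line of the form `size: 3`. Note that erasure-coded pools report a size of k+m, so this only makes sense as a rough first pass.

```shell
# List every pool whose replica count ("size") is below the wanted value.
# pools_below_size is a hypothetical helper, not part of Ceph.
pools_below_size() {
    wanted=$1
    for pool in $(rados lspools); do
        # Assumed output format: "size: 3"
        size=$(ceph osd pool get "$pool" size | awk '{print $2}')
        if [ "$size" -lt "$wanted" ]; then
            echo "$pool"
        fi
    done
}

# Usage: pools_below_size 3
```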
Don’t leave “noout” set
Flags such as noout are fantastic to set/unset and are as easy as running:
ceph osd set noout
ceph osd unset noout
But knowing the uses and limits of this flag is important. Fundamentally, this flag prevents Ceph from marking OSDs as out of your cluster, thus halting the cluster’s efforts to heal automatically in the event of a failure.
This flag has great utility during maintenance work when hosts or even racks need to be taken down for updates or major changes, but this should only ever be set for as long as this maintenance is ongoing. At the end of every work period, ensure Ceph has returned to HEALTH_OK before proceeding, and never leave noout set for extended unsupervised periods.
If the cluster encounters further OSD failures in another failure domain while noout is set, PGs will start to be marked as down and/or inactive.
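One way to avoid forgetting the flag is to wrap the maintenance step so that noout is always unset afterwards. The sketch below assumes a POSIX shell on an admin node; `with_noout` and the `./patch-host.sh` script in the usage comment are hypothetical names used purely for illustration.

```shell
# Run a maintenance command with noout set, then always unset the flag,
# even if the command fails. with_noout is a hypothetical helper.
with_noout() {
    ceph osd set noout || return 1
    "$@"
    status=$?
    ceph osd unset noout
    # Report the exit status of the maintenance command itself.
    return "$status"
}

# Usage (hypothetical maintenance script and host name):
#   with_noout ./patch-host.sh osd-host-01
#   ceph health    # wait for HEALTH_OK before moving to the next host
```

The wrapper does not wait for recovery to finish, so the health check after each host remains a manual step here.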
Don’t skip the CPU/Memory recommendations
Running a Ceph cluster has elements which are more an art than a science, such as capacity management and choosing the best hardware for your use case. In this art, however, always be sure to follow the latest hardware recommendations from the Ceph developers. Two rules are particularly notable:
- 1GB of RAM per 1TB of OSD storage
- 1 CPU core (1GHz+) per OSD
When the cluster is in use, these resources will sit at very low utilisation for more than 95% of the time. However, during failover events, scrubbing and orphan searches, for example, your cluster will use all the resources available to it. Failing to provision enough CPU can lead to an outage, as OSDs are unable to process requests quickly enough and are incorrectly marked as down, while too little memory can cause a cascade of OOM (“out of memory”) errors.
In all of these failure scenarios, the additional OSD maps that need to be distributed throughout the cluster and acted upon will further contribute to the load. I call these events cascading failures because, unless the vicious cycle can be broken, you are likely to encounter a full outage of the cluster and potential data loss.
As a rule of thumb, I never recommend running more than 18 OSDs per host, even if the resources are available to do so.
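The two rules above, plus the 18-OSD ceiling, reduce to simple arithmetic per host. The following is a rough sketch only: `host_requirements` is a hypothetical helper, and it assumes every OSD in the host has the same size in whole terabytes.

```shell
# Print the minimum RAM and CPU cores for one OSD host, using
# 1GB of RAM per 1TB of OSD storage and 1 core per OSD.
# host_requirements is a hypothetical helper, not part of Ceph.
host_requirements() {
    osds=$1      # number of OSDs in the host
    size_tb=$2   # size of each OSD in whole TB (assumed identical)
    ram_gb=$((osds * size_tb))
    cores=$osds
    echo "ram_gb=$ram_gb cores=$cores"
    if [ "$osds" -gt 18 ]; then
        echo "warning: more than 18 OSDs in one host" >&2
    fi
}

# Usage: host_requirements 12 8
```

For 12 OSDs of 8TB each this gives 96GB of RAM and 12 cores as a floor, before accounting for the operating system or any co-located daemons.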
When diving into building a cluster, it is all too tempting to immediately scale the cluster vertically to get the best bang per buck from your budget. However, keep in mind the basic principle behind Ceph, which is to provide distributed, fault-tolerant storage. If you scale up instead of scaling out, you risk concentrating much of your performance within only a few hosts and exposing your cluster to severely impacted performance if even a single host fails.
Ceph will always perform at the speed of its slowest component. If you have 3 hosts per failure domain and one fails, the remaining 2 hosts within the failure domain will ultimately have to carry an additional 50% of load on top of their baseline workload, while also running a recovery, which is one of the most intensive operations in the lifecycle of a cluster.
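The 50% figure generalises: if load is spread evenly across n hosts in a failure domain, each carries 1/n of it, and after one host fails each survivor carries 1/(n-1), an increase of 1/(n-1) over its baseline. The sketch below works this out for any host count. `extra_load_percent` is a hypothetical helper, the even-spread assumption is mine, and shell integer division rounds the result down.

```shell
# Percentage of extra load each surviving host carries after one of
# `hosts` evenly loaded hosts in a failure domain fails: 100 / (hosts - 1).
# extra_load_percent is a hypothetical helper for illustration.
extra_load_percent() {
    hosts=$1
    if [ "$hosts" -lt 2 ]; then
        echo "need at least 2 hosts" >&2
        return 1
    fi
    echo $((100 / (hosts - 1)))
}

# Usage: extra_load_percent 3
```

With 3 hosts the result is 50, matching the example above; with 11 hosts it drops to 10, which is the argument for scaling out.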
Protect yourself from human error
Mistakes are unavoidable when running any infrastructure, and they are difficult to fully mitigate. Fortunately, there are a few options within Ceph to help reduce the risk of these mistakes occurring.
Firstly, to prevent accidental pool deletion, always be sure to disable this action from the monitors. You can achieve this through the ceph.conf files on your monitor hosts using the below. This configuration is now the default setting from Luminous (12.X.X onwards):
mon_allow_pool_delete = false
Further to this, you can set the nodelete flag on each of your pools using the below:
for pool in $(rados lspools); do ceph osd pool set "$pool" nodelete true; done
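After setting the flag, it can be worth confirming that no pool was missed, for instance one created later. This is a sketch under one assumption: that `ceph osd pool get <pool> nodelete` prints a line such as `nodelete: true`. `unprotected_pools` is a hypothetical helper name.

```shell
# Print the name of every pool that does not have nodelete set.
# unprotected_pools is a hypothetical helper, not part of Ceph.
unprotected_pools() {
    for pool in $(rados lspools); do
        # Assumed output format: "nodelete: true" or "nodelete: false"
        if ! ceph osd pool get "$pool" nodelete | grep -q 'true'; then
            echo "$pool"
        fi
    done
}

# Usage: unprotected_pools
```

An empty result means every pool currently listed by `rados lspools` reports the flag as set.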
Size your pool PGs conservatively
PGs are a tricky concept for new Ceph users, so much so that there are even tools designed to calculate them for you. If in doubt, always stick to the lowest number of PGs you require. You can always increase the number of PGs in a pool, but on Luminous and earlier you can’t reduce them (PG merging only arrived with Nautilus).
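As an illustration of what such calculators do, the sketch below uses the commonly quoted starting point of roughly 100 PGs per OSD divided by the replica count. Rounding down to a power of two is my own choice here, made to match the conservative advice above; `suggest_pg_count` is a hypothetical helper, and the result is a total to share across all pools rather than a per-pool figure.

```shell
# Conservative PG estimate: (OSDs * 100) / replicas, rounded DOWN to a
# power of two. Rounding down is an assumption made for this sketch.
# suggest_pg_count is a hypothetical helper, not part of Ceph.
suggest_pg_count() {
    osds=$1
    replicas=$2
    target=$((osds * 100 / replicas))
    pgs=1
    # Double until the next step would exceed the target.
    while [ $((pgs * 2)) -le "$target" ]; do
        pgs=$((pgs * 2))
    done
    echo "$pgs"
}

# Usage: suggest_pg_count 12 3
```

For 12 OSDs at 3x replication the target is 400, so the function settles on 256.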
Stay (mostly) up to date on Ceph
Keeping your Ceph deployment up to date is crucial for its smooth operation. With so many moving parts to the software, even small bugs which do not impact you immediately can quickly become cluster-breaking if you fail to stay up to date.
That being said, in the latest LTS release series (Luminous), a number of the releases contained serious issues, so upgrading immediately is not always the best course of action.
I have previously recommended on the mailing lists that individuals with production clusters wait at least 1 month following a release before upgrading, and thoroughly read the changelog first, so that any issues which have slipped through the test net can be discovered and resolved in the meantime. If you aren’t already signed up to the ceph-users mailing list, this is the perfect time to do so, as there is no better way of staying informed of the latest releases and any issues with them.
So, there you have my 7 tips for running your first Ceph cluster. Experience is difficult to transfer into the written word, however, so I’d recommend getting in touch and reading through my Consultancy page if you require any further advice on getting started with Ceph or operating your cluster, as I can offer everything from personalised advice right through to hardware recommendations and support in operating it.