Experiments in Ceph (with Proxmox)

HTTP_404_NotFound@lemmyonline.com · 1 year ago

eros@lemmy.world · 1 year ago

Nice writeup. As long as you can throw fast drives, fast networking and plenty of RAM at it, Ceph is happy.

Ceph seems to work fine on my cluster at work. For less than $40k I replaced my whole VMware vSAN cluster, and we’re saving as much again in software licensing over the next 5 years, even with buying support from Proxmox. It’s also much lighter on the administrative tasks needed to keep it up to date and running well.

3x Supermicro SSG-110P-NTR10

Intel Xeon Gold 5713
256 GB RAM
10 Intel D7-P5510 3.84TB NVMe
2 Micron 5400 Max
Onboard dual 10GbE
Mellanox ConnectX4 Dual SFP28 25GbE
5 year NBD parts warranty

HTTP_404_NotFound@lemmyonline.com · 1 year ago

Have you done any measurements of IOPS? Just curious to know.

eros@lemmy.world · 1 year ago

I haven’t, but I’ll run some and try to remember to post back.

MangoPenguin@lemmy.blahaj.zone · 1 year ago

Ceph seems neat, but the fact that it can’t even function with normal SSDs points to something very wrong with how it’s designed. It seems like it has an absurd overhead.

HTTP_404_NotFound@lemmyonline.com · 1 year ago

I believe it’s a data-safety thing, similar to how ZFS’s ZIL works.

That is, a write isn’t completed until it’s actually written. In the case of consumer SSDs, this means waiting for the write to reach the flash itself. In the case of enterprise SSDs, it only means reaching the write cache, because PLP (power loss protection) keeps that cache safe if power is lost.

As with anything, though, you can disable those safety features.
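To get a rough feel for the cost of waiting on every write, here is a minimal Python sketch that times buffered 4 KiB writes against writes followed by `os.fsync()`. It is only an illustration of the "write isn't done until it's on the device" idea, not how Ceph issues I/O and not a substitute for a proper benchmark tool such as fio. The function names (`time_writes`, `compare`), the block size and the write count are all assumptions made up for this example.

```python
import os
import tempfile
import time


def time_writes(path, count=200, block_size=4096, sync=True):
    """Return the average seconds per write for `count` blocks written to `path`.

    With sync=True each write is followed by os.fsync(), so the loop only
    continues once the OS reports the data as flushed to the device.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


def compare(directory=None):
    """Time buffered vs. fsync'd writes in `directory` (default: system temp dir)."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "probe.bin")
        buffered = time_writes(path, sync=False)
        synced = time_writes(path, sync=True)
    return buffered, synced


if __name__ == "__main__":
    buffered, synced = compare()
    print(f"buffered: {buffered * 1e6:.1f} us/write")
    print(f"fsync'd:  {synced * 1e6:.1f} us/write")
```

Pointing `compare()` at a directory on the drive you care about (instead of the temp dir) should show how much that drive slows down once every write has to be flushed. The numbers depend heavily on the filesystem and the drive, so treat them as a rough indication only.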

“absurd overhead.”

Actually, a massive understatement. I threw together over 5 million IOPS worth of disks, to barely squeeze 100k IOPS out of the cluster! It’s EXTREMELY inefficient compared to… well, pretty much any other option. I mean, writing encrypted zip files to SD card storage can be faster in some circumstances. lol

But it’s reliable, fault-tolerant storage, which is instantly available (i.e., no waiting on replication, syncing, etc.).
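To put the overhead figures above into a single number, here is a small Python sketch. The helper name `cluster_efficiency` is made up for this example, and the "raw" figure is just the sum of the drives' rated IOPS as quoted above, not a measured value.

```python
def cluster_efficiency(raw_iops, cluster_iops):
    """Fraction of the drives' combined rated IOPS that the cluster delivers."""
    if raw_iops <= 0:
        raise ValueError("raw_iops must be positive")
    return cluster_iops / raw_iops


if __name__ == "__main__":
    # Figures quoted in the comment above: ~5 million rated, ~100k delivered.
    eff = cluster_efficiency(5_000_000, 100_000)
    print(f"{eff:.1%} of rated IOPS")  # prints "2.0% of rated IOPS"
```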

30021190@lemmy.cloud.aboutcher.co.uk · edit-2 1 year ago

Ceph works best if you have identical OSDs (quantity, type and capacity) across the cluster. It also works best on a cluster of 3+ nodes.

I ran a mixed SATA SSD/HDD 256GB/4TB cluster and it was always a bit pants. Now I have 7x1TB SSDs per node (4 nodes) and it works fantastically.

Proxmox defaults to 3/2 replication (size 3, min_size 2) with the failure domain at host level, but you may find that EC (erasure coding) works better for your mixed infra. As you noticed, you can’t meet the three-host failure domain, and setting the failure domain to OSD level means all copies of some data may be kept on a single host, so anything running on another machine would need to traverse the network to reach it.
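As a rough illustration of the capacity side of the replica-versus-EC (erasure coding) trade-off, here is a minimal Python sketch using the 4-node, 7x1TB layout mentioned above. The EC profile (k=2, m=1) is purely an assumption picked for the example, the function names are hypothetical, and the maths ignores metadata overhead and the free-space headroom a real cluster needs.

```python
def usable_replicated(raw_tb, size=3):
    """Usable capacity of a replicated pool that keeps `size` full copies."""
    return raw_tb / size


def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


if __name__ == "__main__":
    raw = 4 * 7 * 1.0  # 4 nodes x 7 x 1 TB SSDs
    print(f"replica 3: {usable_replicated(raw):.2f} TB usable")
    # k=2, m=1 is an assumed profile, chosen only for illustration.
    print(f"EC 2+1:    {usable_erasure_coded(raw, 2, 1):.2f} TB usable")
```

With these assumptions, replica 3 leaves about a third of the raw space usable, while the assumed 2+1 profile leaves about two thirds, at the cost of tolerating fewer simultaneous failures.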

You may also need more than a single 10Gb NIC, as you might start hitting bandwidth issues.

HTTP_404_NotFound@lemmyonline.com · 1 year ago

“Proxmox uses replica 3/2 failure at host level”

I ended up having to set the failure domain to OSD rather than host… at least until the next group of 5 enterprise SSDs arrives. Once they do, I’ll be able to set up a fairly even distribution of data across all three 10G nodes.

“You may also need more than a single 10Gb NIC, as you might start hitting bandwidth issues.”

Knock on wood, I don’t “think” I have enough heavy bandwidth loads for this to be a huge issue, at least, with the exception of when the backups are running. Most of my workloads use fast random I/O. (databases, kubernetes, etc.)

BUT… I do have 40G networking on the R730xd already, and I have enough 40G NICs lying around to build a full-mesh 40G network between those three nodes if needed.

30021190@lemmy.cloud.aboutcher.co.uk · 1 year ago

So my production setup is 2x10Gb bonded NICs for networking and 2x10Gb bonded NICs for Ceph/cluster traffic. I suspect that when Ceph is being heavily used you may see bottlenecks; however, once you have a host-based failure domain, in theory your data should be closer to the correct host and this shouldn’t be an issue. But on a basic level it’s like having 3 copies of the data, one on each host, so it doesn’t save you any storage; it just reduces the risk during a failure.

Thinking about it, you may actually see better results with ZFS and replication jobs, as there’s less overhead and ZFS send is incremental. You’d obviously lose X minutes of data on a failure, versus X seconds with Ceph.

HTTP_404_NotFound@lemmyonline.com · 1 year ago

“you may actually see better results with ZFS and replicate jobs”

Oh, I know the performance is drastically better doing that. I did play with it, and it works for the most part. But with Ceph I have peace of mind knowing that if a host just magically craps itself, the data is already ready to go and the machine has already fired up on the new host without any issues.

Also, there is something fun about literally tossing over 6 million IOPS worth of SSDs into my cluster, just to barely squeeze 50k IOPS out of Ceph!

I have 5 more “enterprise” NVMes arriving Tuesday, which will complete my Ceph cluster.

Currently, I have 4 of the enterprise SATA SSDs in place, and a single 980 as a placeholder.

Nothing at all to write home about. BUT, I do think the lack of distributed drives is making an impact. My most powerful host doesn’t have any OSDs yet; I’m still waiting on the NVMe drives to arrive.

During heavy benchmarking, the limitations of the consumer 980 evo became pretty apparent when its latency spiked through the roof.

The addition of the 5 new NVMe drives should make a pretty dramatic difference. If I can squeeze out 100k IOPS, I will be happy. (Despite… over 6 million IOPS worth of SSDs…)

redcalcium@lemmy.institute · 1 year ago

How is Ceph’s latency compared to plain old NFS on the same (single) hardware? Especially when your apps require reading a lot of small files, where latency matters more than raw speed? NFS is pretty awful for this, so I’m interested in whether there are any good alternatives.

HTTP_404_NotFound@lemmyonline.com · 1 year ago

I am going to guess that normal NFS is going to be faster…

There is really nothing about Ceph that even remotely says “fast” to me.

One alternative might be MinIO, if object storage works for you. In my experience, it performs pretty well.

ShatteredScales@lemmy.world · 1 year ago

Well that’s some weird behavior on the latency.

I have several Samsung 870 Evos across three hosts, and they’re all ~7ms.

HTTP_404_NotFound@lemmyonline.com · 1 year ago

Might be due to the load?

Or perhaps a cache setting. I think one of the issues consumer drives have is the lack of PLP.

https://forum.proxmox.com/threads/vm-i-o-performance-with-ceph-storage.120929/

This particular thread had some really good info around halfway down.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2P5ROTWHR5Y2VWI6MA3IKQKUTC3WKYFB/
