Simulating Slow Disks and Network Faults on Ceph Nodes
- Throttle the ceph-osd process's disk I/O with cgroup v2 to simulate a slow-disk scenario
- Use tc to simulate increased network latency and packet loss (both mechanisms are previewed right after this list and walked through in the later sections)
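Each fault injection reduces to a single command. A quick preview using the same values as the detailed steps later (the device numbers and interface name are specific to this lab):
# cgroup v2: cap write IOPS on one disk (MAJ:MIN 8:16) for the cgroup that holds the OSD process
echo "8:16 wiops=50" > /sys/fs/cgroup/ceph_osd/io.max
# tc/netem: add 100 ms of latency on the node's network interface
tc qdisc add dev ens192 root netem delay 100ms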
References
- 详解Cgroup V2
- cgroups(7) — Linux manual page
- systemd.resource-control — Resource control unit settings
- 探究 Rootless Containers:通过 systemd 配置 cgroup v2
- 容器原理之cgroup
- linux下使用tc(Traffic Control) 流量控制命令模拟网络延迟和丢包
Ceph and Linux environment
- OS: Debian 12.4 (bookworm), kernel 6.1.0-35-amd64 (a quick version check follows below).
- Ceph version: 19.2.2 (Squid). The Ceph deployment itself is not repeated here; see the earlier article.
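A minimal way to confirm these versions on each node; the expected outputs are the versions listed above:
cat /etc/debian_version     # 12.4
uname -r                    # 6.1.0-35-amd64
ceph versions               # all daemons on 19.2.2 squid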
Hostname | IP | CPU/RAM | Data disks |
---|---|---|---|
ceph1 | 192.168.1.121 | 4C16G | 2 * 100G |
ceph2 | 192.168.1.122 | 4C16G | 2 * 100G |
ceph3 | 192.168.1.123 | 4C16G | 2 * 100G |
root@ceph2:~# ceph -s
cluster:
id: 3e07d43f-688e-4284-bfb7-3e6ed5d3b77b
health: HEALTH_WARN
noout flag(s) set
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 98s)
mgr: ceph3(active, since 90s), standbys: ceph2, ceph1
mds: 1/1 daemons up
osd: 6 osds: 6 up (since 65s), 6 in (since 4h)
flags noout
data:
volumes: 1/1 healthy
pools: 4 pools, 161 pgs
objects: 16.75k objects, 38 GiB
usage: 8.2 GiB used, 592 GiB / 600 GiB avail
pgs: 161 active+clean
root@ceph2:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.58612 root default
-3 0.19537 host ceph1
0 hdd 0.09769 osd.0 up 1.00000 1.00000
5 hdd 0.09769 osd.5 up 1.00000 1.00000
-5 0.19537 host ceph2
1 hdd 0.09769 osd.1 up 1.00000 1.00000
4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.19537 host ceph3
2 hdd 0.09769 osd.2 up 1.00000 1.00000
3 hdd 0.09769 osd.3 up 1.00000 1.00000
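Before injecting any faults, it helps to record the baseline OSD latencies for comparison with the output captured later during the slow-disk test:
# Baseline commit/apply latencies (compare with the later "ceph osd perf" output)
ceph osd perf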
Slow OSD simulation test
- After a fair amount of trial and error and reading: the ceph-osd processes are managed by systemd, and recent systemd versions automatically generate the cgroup configuration for the processes they manage (a systemd-native alternative is sketched right below).
Do not hand-edit the cgroups that systemd manages, or misfortune will follow!!!
GitHub: systemd/docs/CGROUP_DELEGATION.md
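For reference, the systemd-native way to throttle a unit's disk I/O is systemctl set-property, which writes a drop-in (like the 50-IOAccounting.conf / 50-MemoryMax.conf visible in the status output below) instead of touching cgroup files by hand. A hedged sketch, assuming /dev/sda is the disk backing osd.2; see systemd.resource-control(5) for the property names:
systemctl set-property ceph-osd@2.service IOAccounting=yes
systemctl set-property ceph-osd@2.service IOWriteIOPSMax="/dev/sda 50"
# Clear the limit again by assigning an empty value
systemctl set-property ceph-osd@2.service IOWriteIOPSMax=
This walkthrough instead stops the systemd unit, restarts the OSD by hand, and throttles it in a manually created cgroup, as shown next.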
- Stop one OSD and restart it manually in the background (see the note on the noout flag after this block)
# Stop osd.2
root@ceph3:~# systemctl status ceph-osd@2.service
* ceph-osd@2.service - Ceph object storage daemon osd.2
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /etc/systemd/system.control/ceph-osd@2.service.d
`-50-IOAccounting.conf, 50-MemoryMax.conf
Active: active (running) since Sun 2025-06-29 01:04:31 CST; 1min 14s ago
Process: 1197 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 2 (code=exited, status=0/SUCCESS)
Main PID: 1217 (ceph-osd)
IO: 650.6M read, 874.9M written
Tasks: 58
Memory: 307.6M (max: 9.3G available: 9.0G)
CPU: 6.755s
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@2.service
`-1217 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
Jun 29 01:04:31 ceph3 systemd[1]: Starting ceph-osd@2.service - Ceph object storage daemon osd.2...
Jun 29 01:04:31 ceph3 systemd[1]: Started ceph-osd@2.service - Ceph object storage daemon osd.2.
Jun 29 01:04:38 ceph3 ceph-osd[1217]: 2025-06-29T01:04:38.666+0800 7f028b97a880 -1 osd.2 238 log_to_monitors true
Jun 29 01:03:43 ceph3 ceph-osd[1217]: 2025-06-29T01:03:43.868+0800 7f0282ec76c0 -1 osd.2 238 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
root@ceph3:~# systemctl stop ceph-osd@2.service
root@ceph3:~# ps -aux | grep ceph-osd
ceph 1218 3.2 1.0 697160 173472 ? Ssl 01:03 0:06 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
root 2529 0.0 0.0 3328 1444 pts/0 S+ 01:06 0:00 grep ceph-osd
# Restart it manually in the background with nohup
root@ceph3:/sys/fs/cgroup/ceph_osd# nohup /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph > /home/allen/ceph-osd@2.log &
[1] 3892
root@ceph3:/sys/fs/cgroup/ceph_osd# nohup: ignoring input and redirecting stderr to stdout
root@ceph3:/sys/fs/cgroup/ceph_osd#
root@ceph3:/sys/fs/cgroup/ceph_osd# ps -aux |grep ceph-osd
ceph 1218 1.6 1.0 697160 179652 ? Ssl 01:03 0:11 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph 3892 32.7 1.0 669364 167284 pts/0 Sl 01:14 0:03 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
root 4350 0.0 0.0 3328 1532 pts/0 S+ 01:14 0:00 grep ceph-osd
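Note: the noout flag visible in the earlier ceph -s output was presumably set before stopping any OSD, so the cluster does not mark it out and start rebalancing; for completeness:
# Set before stopping OSDs; clear it again once all tests are finished
ceph osd set noout
ceph osd unset noout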
- Create the cgroup directories and configure the I/O throttle (a quick sanity check follows this block)
# Switch to the /sys/fs/cgroup hierarchy and create a test cgroup
# Two levels are used because, for reasons I could not pin down, setting the io quota directly on the cgroup holding the OSD process reports an error; placing the OSD process in a child cgroup and setting the io quota on the parent works.
root@ceph3:/sys/fs/cgroup# mkdir ceph_osd
root@ceph3:/sys/fs/cgroup# mkdir ceph_osd/ceph_osd
# Move the OSD process PID into the child cgroup
root@ceph3:/sys/fs/cgroup# echo 3892 > ceph_osd/ceph_osd/cgroup.procs
root@ceph3:/sys/fs/cgroup# cat ceph_osd/ceph_osd/cgroup.procs
3892
# Find the disk-to-OSD mapping
# Note: this listing was added after a node reboot, so the device names have drifted and do not match the next step
root@ceph3:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 100G 0 disk
`-ceph--9cfa8e9f--085c--4d93--b45c--b5faa1ad5bb7-osd--block--709ef864--5921--482a--848d--e69f2c951448 254:0 0 100G 0 lvm
sdb 8:16 0 50G 0 disk
|-sdb1 8:17 0 487M 0 part /boot
|-sdb2 8:18 0 1K 0 part
`-sdb5 8:21 0 49.5G 0 part
|-debian12--vg-root 254:1 0 16.8G 0 lvm /
|-debian12--vg-swap_1 254:2 0 976M 0 lvm [SWAP]
`-debian12--vg-home 254:3 0 31.8G 0 lvm /home
sdc 8:32 0 100G 0 disk
`-ceph--0f54cf95--c06f--417a--9cb2--4d4598a70efe-osd--block--8c38f4ba--fa4e--4257--b723--1c0f90c44668 254:4 0 100G 0 lvm
sr0 11:0 1 1024M 0 rom
root@ceph3:~# ls -l /var/lib/ceph/osd/ceph-2/
total 28
lrwxrwxrwx 1 ceph ceph 93 Jun 29 12:28 block -> /dev/ceph-9cfa8e9f-085c-4d93-b45c-b5faa1ad5bb7/osd-block-709ef864-5921-482a-848d-e69f2c951448
-rw------- 1 ceph ceph 37 Jun 29 12:28 ceph_fsid
-rw------- 1 ceph ceph 37 Jun 29 12:28 fsid
-rw------- 1 ceph ceph 55 Jun 29 12:28 keyring
-rw------- 1 ceph ceph 6 Jun 29 12:28 ready
-rw------- 1 ceph ceph 3 Jun 29 12:28 require_osd_release
-rw------- 1 ceph ceph 10 Jun 29 12:28 type
-rw------- 1 ceph ceph 2 Jun 29 12:28 whoami
# Add the write-IOPS limit in the parent cgroup
# From the mapping, osd.2's data disk is MAJ:MIN 8:16
root@ceph3:/sys/fs/cgroup# echo "8:16 wiops=50" > ceph_osd/io.max
root@ceph3:/sys/fs/cgroup# cat ceph_osd/io.max
8:16 rbps=max wbps=max riops=max wiops=50
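A quick sanity check that the process really landed in the child cgroup (same PID as above):
# Expect: 0::/ceph_osd/ceph_osd
cat /proc/3892/cgroup
# Only needed if io.max were missing in the new cgroup; here the write above succeeded, so the io controller is already enabled at the root
# echo "+io" > /sys/fs/cgroup/cgroup.subtree_control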
- Apply load and watch for slow ops (cleanup steps follow after the outputs)
# rbd bench result without the throttle
root@ceph1:~# rbd ls
img
root@ceph1:~# rbd bench rbd/img --io-type write --io-size 4K --io-pattern rand --io-threads 32 --io-total 100M
bench type write io_size 4096 io_threads 32 bytes 104857600 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 3264 3309.12 13 MiB/s
2 6016 3048.28 12 MiB/s
3 10048 3368.87 13 MiB/s
4 14496 3642.8 14 MiB/s
5 19104 3836.28 15 MiB/s
6 24192 4185.46 16 MiB/s
elapsed: 6 ops: 25600 ops/sec: 3836.8 bytes/sec: 15 MiB/s
root@ceph1:~#
# With the throttle enabled
root@ceph1:~# rbd bench rbd/img --io-type write --io-size 4K --io-pattern rand --io-threads 32 --io-total 100M
bench type write io_size 4096 io_threads 32 bytes 104857600 pattern random
SEC OPS OPS/SEC BYTES/SEC
7 1792 240.245 961 KiB/s
12 1920 158.797 635 KiB/s
17 2080 122.02 488 KiB/s
22 2144 97.6277 391 KiB/s
27 2432 89.2723 357 KiB/s
32 2496 28.0379 112 KiB/s
37 2720 31.2489 125 KiB/s
42 2880 31.5148 126 KiB/s
47 3040 35.1416 141 KiB/s
# Observe iostat and ceph osd perf at this point
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.00 0.76 24.43 0.00 74.30
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 20.00 0.19 26.00 56.52 0.25 9.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40
sdb 0.00 0.00 0.00 0.00 0.00 0.00 10.00 0.20 40.00 80.00 0.40 20.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
root@ceph3:/sys/fs/cgroup# ceph osd perf
osd commit_latency(ms) apply_latency(ms)
5 8 8
4 5 5
3 2 2
2 6642 6642
1 4 4
0 10 10
# The cluster health now reports slow requests
root@ceph2:~# ceph -s
cluster:
id: 3e07d43f-688e-4284-bfb7-3e6ed5d3b77b
health: HEALTH_WARN
1 OSD(s) experiencing slow operations in BlueStore
noout flag(s) set
8 slow ops, oldest one blocked for 67 sec, osd.2 has slow ops
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 23m)
mgr: ceph3(active, since 22m), standbys: ceph2, ceph1
mds: 1/1 daemons up
osd: 6 osds: 6 up (since 11m), 6 in (since 5h)
flags noout
data:
volumes: 1/1 healthy
pools: 4 pools, 161 pgs
objects: 22.97k objects, 60 GiB
usage: 9.7 GiB used, 590 GiB / 600 GiB avail
pgs: 127 active+clean
34 active+clean+laggy
io:
client: 62 KiB/s rd, 179 KiB/s wr, 48 op/s rd, 51 op/s wr
root@ceph2:~# ceph health detail
HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore; noout flag(s) set; 6 slow ops, oldest one blocked for 82 sec, osd.2 has slow ops
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
osd.2 observed slow operation indications in BlueStore
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] SLOW_OPS: 6 slow ops, oldest one blocked for 82 sec, osd.2 has slow ops
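Cleanup after the slow-disk test, mirroring the steps above: lift the throttle, stop the manually started OSD, and hand it back to systemd.
# Reset the write-IOPS limit, stop the nohup'd ceph-osd (PID 3892 above), then restart it via systemd
echo "8:16 wiops=max" > /sys/fs/cgroup/ceph_osd/io.max
kill 3892
systemctl start ceph-osd@2.service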
Network latency and packet loss simulation test
- Use the native Linux tc traffic control directly for the simulation
- Increased-latency scenario
root@ceph3:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:0c:29:47:93:f7 brd ff:ff:ff:ff:ff:ff
altname enp11s0
inet 192.168.1.123/24 brd 192.168.1.255 scope global ens192
valid_lft forever preferred_lft forever
inet6 240e:3ba:30e9:2d30:20c:29ff:fe47:93f7/64 scope global dynamic mngtmpaddr
valid_lft 259183sec preferred_lft 172783sec
inet6 fe80::20c:29ff:fe47:93f7/64 scope link
valid_lft forever preferred_lft forever
# Add a 100 ms delay on ens192
root@ceph3:~# tc qdisc add dev ens192 root netem delay 100ms
# ping test: the first run below is the pre-delay baseline (~0.4 ms), the second shows the added ~100 ms
root@ceph1:~# ping ceph3
PING ceph3 (192.168.1.123) 56(84) bytes of data.
64 bytes from ceph3 (192.168.1.123): icmp_seq=1 ttl=64 time=0.320 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=2 ttl=64 time=0.426 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=3 ttl=64 time=0.369 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=4 ttl=64 time=0.418 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=5 ttl=64 time=0.416 ms
^C
--- ceph3 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4091ms
rtt min/avg/max/mdev = 0.320/0.389/0.426/0.040 ms
root@ceph1:~# ping ceph3
PING ceph3 (192.168.1.123) 56(84) bytes of data.
64 bytes from ceph3 (192.168.1.123): icmp_seq=1 ttl=64 time=100 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=2 ttl=64 time=100 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=3 ttl=64 time=100 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=4 ttl=64 time=100 ms
# Apply an rbd bench load
root@ceph1:~# rbd bench rbd/img --io-type write --io-size 4K --io-pattern rand --io-threads 32 --io-total 2G
bench type write io_size 4096 io_threads 32 bytes 2147483648 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 1536 1567.95 6.1 MiB/s
2 2464 1235.6 4.8 MiB/s
3 3904 1303.27 5.1 MiB/s
4 6048 1516.91 5.9 MiB/s
5 8864 1779.14 6.9 MiB/s
6 13248 2344.19 9.2 MiB/s
7 18560 3234.62 13 MiB/s
8 23648 3964.52 15 MiB/s
9 27008 4201.94 16 MiB/s
10 28864 3996.67 16 MiB/s
11 31584 3661.22 14 MiB/s
12 34688 3225.49 13 MiB/s
13 39296 3126.99 12 MiB/s
14 44128 3423.88 13 MiB/s
15 48800 3990.26 16 MiB/s
16 53120 4307.05 17 MiB/s
17 57792 4620.64 18 MiB/s
# The cluster now reports slow OSD heartbeats
root@ceph1:~# ceph -s
cluster:
id: 3e07d43f-688e-4284-bfb7-3e6ed5d3b77b
health: HEALTH_WARN
noout flag(s) set
Slow OSD heartbeats on back (longest 1252.515ms)
Slow OSD heartbeats on front (longest 1322.394ms)
services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 6m)
mgr: ceph3(active, since 6m), standbys: ceph1, ceph2
mds: 1/1 daemons up
osd: 6 osds: 6 up (since 6m), 6 in (since 5h)
flags noout
data:
volumes: 1/1 healthy
pools: 4 pools, 161 pgs
objects: 26.03k objects, 95 GiB
usage: 19 GiB used, 581 GiB / 600 GiB avail
pgs: 148 active+clean
13 active+clean+laggy
io:
client: 748 KiB/s wr, 0 op/s rd, 186 op/s wr
root@ceph1:~# ceph health detail
HEALTH_WARN noout flag(s) set; Slow OSD heartbeats on back (longest 1252.515ms); Slow OSD heartbeats on front (longest 1322.394ms)
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1252.515ms)
Slow OSD heartbeats on back from osd.3 [] to osd.4 [] 1252.515 msec
Slow OSD heartbeats on back from osd.3 [] to osd.5 [] 1158.306 msec
Slow OSD heartbeats on back from osd.3 [] to osd.0 [] 1158.168 msec
Slow OSD heartbeats on back from osd.3 [] to osd.1 [] 1158.110 msec
Slow OSD heartbeats on back from osd.1 [] to osd.2 [] 1044.267 msec
Slow OSD heartbeats on back from osd.1 [] to osd.3 [] 1044.239 msec
Slow OSD heartbeats on back from osd.5 [] to osd.3 [] 1026.766 msec
Slow OSD heartbeats on back from osd.5 [] to osd.2 [] 1026.008 msec
Slow OSD heartbeats on back from osd.2 [] to osd.5 [] 1001.730 msec
Slow OSD heartbeats on back from osd.4 [] to osd.2 [] 1001.571 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 1322.394ms)
Slow OSD heartbeats on front from osd.3 [] to osd.4 [] 1322.394 msec
Slow OSD heartbeats on front from osd.3 [] to osd.5 [] 1306.912 msec
Slow OSD heartbeats on front from osd.3 [] to osd.0 [] 1160.421 msec
Slow OSD heartbeats on front from osd.3 [] to osd.1 [] 1159.052 msec
Slow OSD heartbeats on front from osd.1 [] to osd.3 [] 1044.372 msec
Slow OSD heartbeats on front from osd.1 [] to osd.2 [] 1044.159 msec
Slow OSD heartbeats on front from osd.5 [] to osd.3 [] 1028.001 msec
Slow OSD heartbeats on front from osd.5 [] to osd.2 [] 1026.569 msec
Slow OSD heartbeats on front from osd.2 [] to osd.0 [] 1002.276 msec
Slow OSD heartbeats on front from osd.4 [] to osd.3 [] 1001.902 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
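Before the packet-loss scenario, the 100 ms delay qdisc has to be removed first: only one root qdisc can exist per interface, so the next tc qdisc add would otherwise fail.
# Remove the netem delay qdisc added earlier
tc qdisc del dev ens192 root
# (alternatively, swap it in one step: tc qdisc replace dev ens192 root netem loss 80%)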
- Packet-loss scenario
# Add an 80% packet loss rate on ens192
root@ceph3:~# tc qdisc add dev ens192 root netem loss 80%
# ping test shows the packet loss
root@ceph1:~# ping ceph3
PING ceph3 (192.168.1.123) 56(84) bytes of data.
64 bytes from ceph3 (192.168.1.123): icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=3 ttl=64 time=0.397 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=8 ttl=64 time=0.365 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=11 ttl=64 time=0.341 ms
64 bytes from ceph3 (192.168.1.123): icmp_seq=16 ttl=64 time=0.429 ms
^C
--- ceph3 ping statistics ---
18 packets transmitted, 5 received, 72.2222% packet loss, time 17406ms
rtt min/avg/max/mdev = 0.294/0.365/0.429/0.046 ms
root@ceph1:~#
# Checking the cluster at this point: the ceph3 daemons are effectively down
root@ceph1:~# ceph -s
cluster:
id: 3e07d43f-688e-4284-bfb7-3e6ed5d3b77b
health: HEALTH_WARN
1/3 mons down, quorum ceph1,ceph2
noout flag(s) set
2 osds down
1 host (2 osds) down
Slow OSD heartbeats on back (longest 1833.074ms)
Degraded data redundancy: 26026/78078 objects degraded (33.333%), 156 pgs degraded
services:
mon: 3 daemons, quorum ceph1,ceph2 (age 34s), out of quorum: ceph3
mgr: ceph1(active, since 3s), standbys: ceph2
mds: 1/1 daemons up
osd: 6 osds: 4 up (since 23s), 6 in (since 5h)
flags noout
data:
volumes: 1/1 healthy
pools: 4 pools, 161 pgs
objects: 26.03k objects, 95 GiB
usage: 12 GiB used, 388 GiB / 400 GiB avail
pgs: 26026/78078 objects degraded (33.333%)
156 active+undersized+degraded
5 active+undersized
root@ceph1:~# ceph health detail
HEALTH_WARN 1/3 mons down, quorum ceph1,ceph2; noout flag(s) set; 2 osds down; 1 host (2 osds) down; Slow OSD heartbeats on back (longest 1833.074ms); Degraded data redundancy: 26026/78078 objects degraded (33.333%), 156 pgs degraded
[WRN] MON_DOWN: 1/3 mons down, quorum ceph1,ceph2
mon.ceph3 (rank 2) addr [v2:192.168.1.123:3300/0,v1:192.168.1.123:6789/0] is down (out of quorum)
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] OSD_DOWN: 2 osds down
osd.2 (root=default,host=ceph3) is down
osd.3 (root=default,host=ceph3) is down
[WRN] OSD_HOST_DOWN: 1 host (2 osds) down
host ceph3 (root=default) (2 osds) is down
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1833.074ms)
Slow OSD heartbeats on back from osd.4 [] to osd.3 [] (down) 1833.074 msec
[WRN] PG_DEGRADED: Degraded data redundancy: 26026/78078 objects degraded (33.333%), 156 pgs degraded
pg 2.5 is active+undersized+degraded, acting [0,1]
pg 2.6 is active+undersized+degraded, acting [5,4]
pg 2.7 is active+undersized+degraded, acting [4,5]
pg 2.10 is active+undersized+degraded, acting [0,4]
pg 2.11 is active+undersized+degraded, acting [4,5]
pg 2.12 is active+undersized+degraded, acting [5,1]
pg 2.13 is active+undersized+degraded, acting [4,5]
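Once the tests are done, remove the netem qdisc on ceph3 and clear the noout flag so the ceph3 daemons rejoin and the cluster can recover; a final cleanup sketch:
# On ceph3: drop the packet-loss qdisc
tc qdisc del dev ens192 root
# On any mon node: clear the flag that was set for the tests, then watch recovery
ceph osd unset noout
ceph -s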