Notes on some relatively rare Ceph problems encountered in day-to-day operations.
Ceph OSD wait_auth_rotating issue notes (Part 1)
- During a hardware driver upgrade on a storage node, a node co-hosting MON and OSD was rebooted, after which its OSDs would not come up properly.
Symptoms
- Ceph version 14.2.8; MON and OSD co-located on the same node.
- Because of hardware driver and kernel problems, the node was rebooted several times within a short period, so its OSD and MON processes were also restarted repeatedly.
- The OSD error log is shown in the screenshot below. The process started normally, but ceph osd tree showed the OSD as down; restarting the OSD process or rebuilding the OSD did not help.
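For reference, the kind of checks involved at this point (a rough sketch; the OSD id is a placeholder, not taken from the original incident):
$ ceph osd tree | grep <osd_id>                          # OSD shows as down even though the process is running
$ sudo systemctl status ceph-osd@<osd_id>                # process reports active
$ sudo tail -n 100 /var/log/ceph/ceph-osd.<osd_id>.log   # look for wait_auth_rotating / rotating service key errors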
Resolution
- First tried restarting the OSD, with no effect. Since the OSD was on an NVMe SSD and recovery would be fast, we also tried rebuilding the OSD; that did not help either.
- Searched Google for the keyword wait_auth_rotating and found a matching bug issue (link). The reproduction conditions quoted from the issue:
1. Change auth_service_ticket_ttl to 120s to speed up the secret rotation.
2. Repeatedly restart the MON on one node and the MDS on another node; neither node is the leader MON.
3. When the leader MON happens to be pushing the updated secret to the other MONs and one MON restarts at the same moment, the MONs must re-elect a leader, and the following occurs:
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
- Root cause described in the issue:
1. When trigger_propose fails, the leader MON's secret version differs from the other MONs': the leader already has the new secret version while the other MONs still hold the old one.
2. When an MDS or OSD then refreshes its rotating secret from a non-leader MON, it receives the stale version, which triggers the bug.
3. The next time the leader MON rotates the secret, trigger_propose succeeds, the other MONs catch up to the new version, the MDS and OSD refresh as well, and the cluster returns to normal.
- Based on point 3 of the root cause, the leader MON needs to be made to push a fresh secret. We restarted the leader MON directly; after the re-election the secret was propagated and the OSD recovered.
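A minimal sketch of that step (the MON name mon-a and its host below are placeholders, not the actual cluster names):
# Identify the current leader MON (quorum_leader_name in the output)
$ ceph quorum_status -f json-pretty | grep quorum_leader_name
# Restart the leader's ceph-mon service on its host to force a new election and secret update
$ ssh mon-a sudo systemctl restart ceph-mon@mon-a
# The down OSDs should register again shortly afterwards
$ ceph osd tree | grep down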
Ceph OSD aio_submit issue notes (Part 2)
- While rebuilding an OSD after a disk failure, the ceph-volume lvm create command completed normally and the OSD process started, but the OSD never registered with the cluster; its log showed aio requests being retried over and over.
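As a rough illustration of that rebuild flow (the device names, cache-set UUID and DB partition are assumptions; the actual helper script may differ):
# Format the replacement HDD as a bcache backing device and register it
$ sudo make-bcache -B /dev/sdX
$ echo /dev/sdX | sudo tee /sys/fs/bcache/register
# Attach it to the existing NVMe cache set
$ echo <cset-uuid> | sudo tee /sys/block/bcache0/bcache/attach
# Create the OSD on top of the bcache device
$ sudo ceph-volume lvm create --bluestore --data /dev/bcache0 --block.db /dev/nvme0n1pY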
Symptoms
- Ceph version 14.2.8.
- An OSD in the cluster had a hardware failure. Because the OSDs run on bcache, a hot-swapped HDD sometimes leaves bcache unable to be created, so the on-call engineer now reboots the node once after each HDD replacement.
- The bcache device and OSD service were created automatically by a helper script. The OSD process started normally, but after a long wait the OSD was still DOWN; checking the OSD log showed it stuck in aio_submit.
$ tail -f /var/log/ceph/ceph-osd.1093.log
2024-01-02 10:13:17.752 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 128
2024-01-02 10:14:24.535 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 41
2024-01-02 10:14:32.609 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 28
2024-01-02 10:14:35.386 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 175
2024-01-02 10:14:38.071 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 178
2024-01-02 10:14:40.764 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 176
2024-01-02 10:14:43.546 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 184
2024-01-02 10:14:46.277 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 201
2024-01-02 10:16:05.497 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 35
2024-01-02 10:16:08.178 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 12
2024-01-02 10:16:21.943 7fbdc53a3700 -1 bdev(0x55d60e6d6000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 14
- Tried restarting the OSD process; no effect.
$ sudo systemctl restart ceph-osd@1093
$ tail -f /var/log/ceph/ceph-osd.1093.log
2024-01-02 10:16:43.006 7f65f59d0a80 0 osd.1093 983984 crush map has features 2738472248550883328 was 8705, adjusting msgr requires for mons
2024-01-02 10:16:43.006 7f65f59d0a80 0 osd.1093 983984 crush map has features 3314933000854323200, adjusting msgr requires for osds
2024-01-02 10:16:43.006 7f65f59d0a80 0 osd.1093 983984 load_pgs
2024-01-02 10:16:43.006 7f65f59d0a80 0 osd.1093 983984 load_pgs opened 0 pgs
2024-01-02 10:16:43.006 7f65f59d0a80 0 osd.1093 983984 using weightedpriority op queue with priority op cut off at 64.
2024-01-02 10:16:43.007 7f65f59d0a80 -1 osd.1093 983984 log_to_monitors {default=true}
2024-01-02 10:16:43.011 7f65f59d0a80 1 bluestore(/var/lib/ceph/osd/ceph-1093) collect_metadata devices span numa nodes 0
2024-01-02 10:16:43.021 7f65f59d0a80 -1 osd.1093 983984 mon_cmd_maybe_osd_create fail: 'osd.1093 has already bound to class 'hdd', can not reset class to 'ssd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
2024-01-02 10:16:43.035 7f65f59d0a80 0 osd.1093 983984 done with init, starting boot process
2024-01-02 10:16:43.035 7f65f59d0a80 1 osd.1093 983984 start_boot
2024-01-02 10:16:44.836 7f65dd968700 -1 bdev(0x56200b33c000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 33
Resolution
- Searched for the keyword aio_submit retries and found the following references:
[ceph-users] Log message –> "bdev(/var/lib/ceph/osd/ceph-x/block) aio_submit retries"
[Distributed storage] Ceph keeps reporting aio_submit retries in messages during 4K random writes
Quoted from the second reference:
bdev_aio_max_queue_depth  # maximum queue depth for the libaio io_setup API, default 1024
The retries appear when the libaio queue is exhausted as the test load increases, and can be resolved by raising the libaio queue depth.
To check the libaio queue depth an OSD is currently using:
ceph daemon osd.0 config show | grep bdev_aio_max_queue_depth
To increase the queue depth:
ceph daemon osd.0 config set bdev_aio_max_queue_depth 4096
- Tried changing the parameter:
# Check the currently effective value
$ sudo ceph daemon osd.1094 config show | grep aio_max
"bdev_aio_max_queue_depth": "1024",
# Change the value online
$ sudo ceph daemon osd.1093 config set bdev_aio_max_queue_depth 4096
{
"success": "bdev_aio_max_queue_depth = '4096' (not observed, change may require restart) "
}
# Watch the OSD log
$ tail -f /var/log/ceph/ceph-osd.1093.log
2024-01-02 10:23:48.696 7f65dd968700 -1 bdev(0x56200b33c000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 118
2024-01-02 10:23:51.494 7f65dd968700 -1 bdev(0x56200b33c000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 116
2024-01-02 10:23:54.218 7f65dd968700 -1 bdev(0x56200b33c000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 143
2024-01-02 10:23:56.952 7f65dd968700 -1 bdev(0x56200b33c000 /var/lib/ceph/osd/ceph-1093/block) aio_submit retries 111
2024-01-02 10:23:59.302 7f65e785d700 -1 osd.1093 990571 set_numa_affinity unable to identify public interface 'bond1.353' numa node: (2) No such file or directory
2024-01-02 10:23:59.302 7f65e785d700 1 osd.1093 990571 set_numa_affinity not setting numa affinity
2024-01-02 10:23:59.303 7f65e785d700 1 bluestore(/var/lib/ceph/osd/ceph-1093) collect_metadata devices span numa nodes 0
2024-01-02 10:23:59.303 7f65e785d700 1 bluestore(/var/lib/ceph/osd/ceph-1093) collect_metadata devices span numa nodes 0
2024-01-02 10:23:59.933 7f65ea062700 1 osd.1093 990571 tick checking mon for new map
2024-01-02 10:24:01.172 7f65db845700 1 osd.1093 990572 state: booting -> active
- After the OSD registered with the cluster, restarted the OSD process once more.
# The OSD process now restarts and boots normally
$ sudo systemctl restart ceph-osd@1093
$ tail -f /var/log/ceph/ceph-osd.1093.log
2024-01-02 10:26:24.094 7ff620e53700 4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 3] Generated table #18: 243188 keys, 65851900 bytes
2024-01-02 10:26:24.094 7ff620e53700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1704162384095613, "cf_name": "default", "job": 3, "event": "table_file_creation", "file_number": 18, "file_size": 65851900, "table_properties": {"data_size": 64086203, "index_size": 1156768, "filter_size": 608069, "raw_key_size": 10899898, "raw_average_key_size": 44, "raw_value_size": 58454723, "raw_average_value_size": 240, "num_data_blocks": 16949, "num_entries": 243188, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: [db/compaction_job.cc:1395] [default] [JOB 3] Compacted 4@0 files to L1 => 65851900 bytes
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: (Original Log Time 2024/01/02-10:26:24.096631) [db/compaction_job.cc:768] [default] compacted to: files[0 1 0 0 0 0 0] max score 0.25, MB/sec: 241.4 rd, 241.4 wr, level 1, files in(4, 0) out(1) MB in(62.8, 0.0) out(62.8), read-write-amplify(2.0) write-amplify(1.0) OK, records in: 243204, records dropped: 10 output_compression: NoCompression
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: (Original Log Time 2024/01/02-10:26:24.096647) EVENT_LOG_v1 {"time_micros": 1704162384096640, "job": 3, "event": "compaction_finished", "compaction_time_micros": 272843, "compaction_time_cpu_micros": 220360, "output_level": 1, "num_output_files": 1, "total_output_size": 65851900, "num_input_records": 243198, "num_output_records": 243188, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0, 0, 0, 0]}
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1704162384096714, "job": 3, "event": "table_file_deletion", "file_number": 13}
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1704162384096735, "job": 3, "event": "table_file_deletion", "file_number": 10}
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1704162384096750, "job": 3, "event": "table_file_deletion", "file_number": 7}
2024-01-02 10:26:24.095 7ff620e53700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1704162384096765, "job": 3, "event": "table_file_deletion", "file_number": 4}
2024-01-02 10:26:24.879 7ff627e61700 1 osd.1093 990572 tick checking mon for new map
2024-01-02 10:26:26.422 7ff619644700 1 osd.1093 990574 state: booting -> active
^C
$ ceph osd tree | grep -A 20 `hostname`
-364 88.03198 host abc
1092 hdd 7.33600 osd.1092 up 1.00000 1.00000
1093 hdd 7.33600 osd.1093 up 0 1.00000
1094 hdd 7.33600 osd.1094 up 1.00000 1.00000
1095 hdd 7.33600 osd.1095 up 1.00000 1.00000
1096 hdd 7.33600 osd.1096 up 1.00000 1.00000
1097 hdd 7.33600 osd.1097 up 1.00000 1.00000
1098 hdd 7.33600 osd.1098 up 1.00000 1.00000
1099 hdd 7.33600 osd.1099 up 1.00000 1.00000
1100 hdd 7.33600 osd.1100 up 1.00000 1.00000
1101 hdd 7.33600 osd.1101 up 1.00000 1.00000
1102 hdd 7.33600 osd.1102 up 1.00000 1.00000
1103 hdd 7.33600 osd.1103 up 1.00000 1.00000
# bdev_aio_max_queue_depth was not persisted, so it falls back to the default after the restart
$ sudo ceph daemon osd.1094 config show | grep aio_max
"bdev_aio_max_queue_depth": "1024",
Open questions
- According to the references above, the retries occur when the cluster is under heavy benchmark load and the aio queue cannot keep up, so requests wait in the queue.
- In this case the OSD was newly created and carried no production workload, and iostat at the time showed the HDD utilization was not high.
- The problem only appeared on the OSD's first registration with the MONs; once registration succeeded, later restarts did not hit it.
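One further check that is sometimes relevant to aio_submit retries, though it was not confirmed as the cause here, is the kernel-wide async-IO limit (the sysctl value below is only an example):
# Compare the aio contexts currently allocated system-wide against the cap
$ cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr
# Raise the cap if aio-nr is close to aio-max-nr (hosts with many OSDs may need this)
$ sudo sysctl -w fs.aio-max-nr=1048576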