Last October (unrelated to Singles' Day), a Fusion-io (fio) card went read-only and caused MySQL to crash. The incident is written up below.
【Machine configuration】
System | Dell Inc.; PowerEdge R710
Processors | physical = 2, cores = 12, virtual = 24, hyperthreading = yes
# Memory #####################################################
Total | 94.40G
Free | 555.50M
Swappiness | vm.swappiness = 0
# Disk #####################################################
2 ioMemory devices in this system
Fusion-io driver version: 3.1.5 build 126
Fusion-io ioDrive 640GB *2 -> mdadm -> /dev/md0
ibdata,ib_logfile,bin_log,relay_log on SAS 600GB raid1
【Symptoms】
At 13:28, monitoring raised the first db_ping alert.
MySQL's alert (error) log showed the following:
/u01/mysql/libexec/mysqld: Can't create/write to file '/u01/mysql/tmp/ibU5kXB4' ( Errcode: 30)
121104 13:28:10 InnoDB: Error: unable to create temporary file; errno: 30
121104 13:28:10 [ERROR] Plugin 'InnoDB' init function returned error.
121104 13:28:10 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
121104 13:28:10 [ERROR] Aborting
InnoDB: Error: tried to read 16384 bytes at offset 0 41517056.
InnoDB: Was only able to read -1.
121104 13:14:59 InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html
InnoDB: File operation call: 'read'.
InnoDB: Cannot continue operation.
mysqld: my_new.cc:51: int __cxa_pure_virtual(): Assertion `! "Aborted: pure virtual method called."' failed.
121104 13:14:59 - mysqld got signal 6 ;
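The log above reports operating system error numbers in three different phrasings (`Errcode: 30`, `errno: 30`, `Operating system error number 5`). As a rough sketch of how these could be pulled out and decoded automatically, the snippet below uses regular expressions modelled only on this excerpt. The function names `extract_errnos` and `describe` are made up for illustration, and the patterns are an assumption that may not cover other MySQL versions.

```python
import errno
import os
import re

# Patterns for the three phrasings seen in the log excerpt (an assumption:
# other MySQL versions may word these messages differently).
ERRNO_PATTERNS = [
    re.compile(r"Errcode:\s*(\d+)"),
    re.compile(r"errno:\s*(\d+)"),
    re.compile(r"Operating system error number (\d+)"),
]


def extract_errnos(log_text):
    """Return the sorted, de-duplicated OS error numbers found in log_text."""
    found = set()
    for pattern in ERRNO_PATTERNS:
        for match in pattern.finditer(log_text):
            found.add(int(match.group(1)))
    return sorted(found)


def describe(code):
    """Format an error number with its symbolic name and the local OS message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


if __name__ == "__main__":
    sample = (
        "Can't create/write to file '/u01/mysql/tmp/ibU5kXB4' ( Errcode: 30)\n"
        "121104 13:28:10 InnoDB: Error: unable to create temporary file; errno: 30\n"
        "121104 13:14:59 InnoDB: Operating system error number 5 in a file operation.\n"
    )
    for code in extract_errnos(sample):
        print(describe(code))
```

On Linux, error 30 is EROFS (read-only file system) and error 5 is EIO (input/output error), so both numbers in this log point at the storage device rather than at MySQL itself.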
From the above, the I/O device was judged to be at fault. Running touch /u01/mysql/tmp/ibd at this point gave:
touch: cannot touch `/u01/mysql/tmp/ibd': Read-only file system
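Because this is a core cluster with a strong data-consistency requirement, the DBA manually forced a master/standby switchover, which resolved the outage.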
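The manual touch check used during the incident could be automated as a periodic write probe. The following is a minimal sketch under that assumption, not what the team actually ran: `probe_writable` and `classify` are hypothetical names, and the probe simply creates, syncs, and deletes a small file in the target directory.

```python
import errno
import os
import tempfile


def classify(exc):
    """Turn an OSError into a short status string."""
    if exc.errno == errno.EROFS:
        return "read-only"   # same condition as "Errcode: 30" in the log
    if exc.errno == errno.EIO:
        return "io-error"    # same condition as "error number 5" in the log
    return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))


def probe_writable(directory):
    """Create, sync, and remove a small file in directory.

    Returns "ok", "read-only", "io-error", or "error:<NAME>".
    """
    try:
        fd, path = tempfile.mkstemp(prefix="probe_", dir=directory)
    except OSError as exc:
        return classify(exc)
    try:
        os.write(fd, b"x")
        os.fsync(fd)  # push the write down to the device, not just the page cache
    except OSError as exc:
        return classify(exc)
    finally:
        # Cleanup is best effort: a failing device may refuse these calls too.
        for cleanup in (lambda: os.close(fd), lambda: os.unlink(path)):
            try:
                cleanup()
            except OSError:
                pass
    return "ok"


if __name__ == "__main__":
    # The path is the MySQL tmp directory from this incident; adjust as needed.
    print(probe_writable("/u01/mysql/tmp"))
```

Calling `os.fsync` matters here: without it a write can succeed against the page cache even though the underlying device is already rejecting I/O.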
【Root cause】
The Fusion-io card went read-only.
/var/log/messages:
Nov 4 13:14:59 my160130.cm6 kernel: : fioerr Fusion-io ioDrive 640GB 0000:07:00.0: Single Bit Event Upset Error Detected - interrupt: val[0]: 000ff16
fio-status -a:
fct1 Failed: DEVICE IS OFFLINE. ALL READS AND WRITES WILL FAIL!
ioDrive 640GB MLC, Product Number:2TTK9, SN:436946
!! ---> There are active errors or warnings on this device! Read below for details.
ioDrive 640GB MLC, PN:00214401201
Located in slot 0 Center of Pseudo Low-Profile ioDIMM Adapter SN:436946
WARNING: READ-ONLY MODE. ALL WRITES WILL FAIL!
ACTIVE ERRORS:
The ioMemory has encountered an internal error and has been temporarily disabled. All reads and writes will fail.
The ioMemory is not allowing write operations.
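The fio-status output above contains a few fixed marker strings that a health check could key on. The sketch below parses already-captured text and looks only for strings that appear in this article's excerpt. `parse_fio_status` is a hypothetical helper, and treating the absence of all markers as "healthy" is an assumption, since the full range of fio-status messages is not shown here.

```python
import re

# Marker strings copied from the fio-status excerpt in this article.
STATE_MARKERS = {
    "offline": "DEVICE IS OFFLINE",
    "read_only": "READ-ONLY MODE",
    "active_errors": "ACTIVE ERRORS:",
}
SERIAL_RE = re.compile(r"\bSN:(\w+)")


def parse_fio_status(text):
    """Summarise captured fio-status text as a dict of flags."""
    result = {name: marker in text for name, marker in STATE_MARKERS.items()}
    # Assumption: no marker present means no problem was reported.
    result["healthy"] = not any(result[name] for name in STATE_MARKERS)
    result["serials"] = sorted(set(SERIAL_RE.findall(text)))
    return result


if __name__ == "__main__":
    sample = (
        "fct1 Failed: DEVICE IS OFFLINE. ALL READS AND WRITES WILL FAIL!\n"
        "ioDrive 640GB MLC, Product Number:2TTK9, SN:436946\n"
        "WARNING: READ-ONLY MODE. ALL WRITES WILL FAIL!\n"
        "ACTIVE ERRORS:\n"
    )
    print(parse_fio_status(sample))
```

In practice the text would come from running the `fio-status -a` command shown above, for example through Python's `subprocess.run` with `capture_output=True`.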
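【Analysis】
•SEUs are transient soft errors, and are non-destructive. A reset or rewriting of the device results in normal device behavior thereafter.
The fio control module runs on an FPGA, and its metadata is stored in DRAM and on the SSD, so it can be recovered after a power loss. When this error occurs under the 2.x driver, the driver repairs it by rewriting. The 3.x driver is stricter about safety: when the error occurs it resets the device directly, and the card stays read-only until it is power-cycled.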
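•SEU class errors are caused by cosmic ray particles making their way into the NAND controller or by a failing NAND controller.
Either physical damage to the FPGA itself or cosmic rays can trigger this error. In May, similar Write Path errors appeared in the messages log. The 2.x driver repaired them automatically by rewriting, whereas the 3.x driver, with its higher safety level, resets the device and sets it read-only.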
•Write Path Parity Error
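This error is a precursor of SEU errors and can be repaired in the vast majority of cases. Three machines in the same cluster have hit it and repaired themselves automatically.
•An FPGA costs far less than taping out a dedicated chip and allows fast programming iterations, but it is less robust than a dedicated chip.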
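【Data loss】
The undo, redo, and binlog files all live on the SAS disks under u02, the logs were intact, and the standby had almost no replication lag, so no data was lost.
However, the SEU may have corrupted blocks that were being written at the time and left the data inconsistent, so to be safe the standby was rebuilt and all data was resynchronized from the binlog.
•SEU class of error may result in data on the device being corrupted. The database should be verified or restored from backup.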
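【Improvements】
As an FPGA ages, there is some probability of Single Event Upset errors, so cards in core databases should be replaced in good time;
FPGAs are sensitive to cosmic rays, so the machine-room environment needs to be controlled and the machines spread across different racks;
Add more sensitive alerting on /var/log/messages and dmesg.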
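One way to approach the last item is a small scanner over kernel log lines that flags the messages seen in this incident. The sketch below is only an illustration: the rule names, the severities, and the exact patterns are assumptions based on the text quoted in this article (the real wording of Write Path messages in the log is not shown here), and `scan_lines` is a hypothetical helper.

```python
import re

# Assumed patterns, derived from message text quoted in this article.
ALERT_RULES = [
    # (rule name, severity, pattern)
    ("seu", "critical", re.compile(r"Single Bit Event Upset", re.IGNORECASE)),
    ("read_only", "critical",
     re.compile(r"READ-ONLY MODE|Read-only file system", re.IGNORECASE)),
    # Write Path errors were described as a usually recoverable precursor.
    ("write_path", "warning", re.compile(r"Write Path", re.IGNORECASE)),
]


def scan_lines(lines):
    """Yield (severity, rule, line) for each line that matches a rule."""
    for line in lines:
        for rule, severity, pattern in ALERT_RULES:
            if pattern.search(line):
                yield severity, rule, line.rstrip("\n")
                break  # report each line at most once


if __name__ == "__main__":
    sample = [
        "Nov 4 13:14:59 my160130.cm6 kernel: : fioerr Fusion-io ioDrive 640GB "
        "0000:07:00.0: Single Bit Event Upset Error Detected\n",
        "Nov 4 13:15:00 my160130.cm6 kernel: unrelated message\n",
    ]
    for severity, rule, line in scan_lines(sample):
        print(severity, rule, line)
```

The same function can be fed an open file handle on /var/log/messages or the split output of dmesg, since both are just iterables of lines. Raising Write Path matches as warnings gives operators a chance to act before a card resets into read-only mode.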