FPs

RAID Common Operations Quick Reference (Part 1)


Array initialization

Use the lspci command or the PCI slot information to determine the RAID card type, then pick the matching tool.

RAID controllers

  • LSI RAID controllers: use the megacli tool.
    • 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
  • Adaptec RAID controllers: use the arcconf tool.
    • 02:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
  • Compaq RAID controllers: use the hpacucli tool.
    • 02:01.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)
  • Dell PowerEdge Expandable RAID controllers: use the megacli tool.
    • 02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 5
  • Certain Dell array controllers: use the omreport tool.
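
The mapping in the list above can be sketched as a small shell helper. This is only an illustration: the function name raid_tool_for is hypothetical, the patterns are taken from the sample lspci lines shown here, and other controller models may need extra patterns.

```shell
# Hypothetical helper: map one lspci line to the CLI tool named in the list above.
# The patterns only cover the sample lines from this page.
raid_tool_for() {
  case "$1" in
    *"RAID bus controller"*LSI*)                         echo megacli ;;
    *"RAID bus controller"*"Dell PowerEdge Expandable"*) echo megacli ;;
    *"RAID bus controller"*Adaptec*)                     echo arcconf ;;
    *"RAID bus controller"*Compaq*)                      echo hpacucli ;;
    *)                                                   echo unknown ;;
  esac
}

# Sample lines copied from the list above.
raid_tool_for "03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)"
raid_tool_for "02:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)"
```

On a real host the argument would come from the lspci output, one line at a time. A line that matches none of the patterns (for example the Atto SAS adapter) falls through to unknown.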

Non-RAID controllers

  • Atto SCSI controller
    • 05:00.0 Serial Attached SCSI controller: Atto Technology SAS Adapter

Search the dmesg log for the disk model: dmesg | grep -i "ata\|scsi"
View disk information behind a SCSI controller:

sudo lsscsi

[0:0:0:0]    cd/dvd  Dell     Virtual  CDROM   123   /dev/scd0
[1:0:0:0]    disk    Dell     Virtual  Floppy  123   /dev/sdc
[2:0:0:0]    disk    ATA      Maxtor 7L250S0   1G10  /dev/sda
[2:0:4:0]    disk    SEAGATE  ST3300555SS      T105  /dev/sdb
cat /proc/scsi/scsi

Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: ATA      Model: Maxtor 7L250S0   Rev: 1G10
Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: Dell     Model: Virtual  CDROM   Rev: 123
Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: Dell     Model: Virtual  Floppy  Rev: 123
Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 04 Lun: 00
Vendor: SEAGATE  Model: ST3300555SS      Rev: T105
Type:   Direct-Access                    ANSI SCSI revision: 05
sudo smartctl -a /dev/sda

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor MaXLine III family (ATA/133 and SATA/150)
Device Model:     Maxtor 7L250S0

sudo smartctl -a /dev/sdb

Device: SEAGATE  ST3300555SS      Version: T105
Serial number: 3LM09L9B

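LSI series initialization

The megasasctl/megactl tools give a fairly direct view of the Enclosure Device ID and the Slot Number, which identify the enclosure and the slot of each disk.

Get disk information: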

>sudo megasasctl
a0       PERC H700 Integrated     encl:1 ldrv:1  batt:good
a0d0      1675GiB RAID 10  3x2  optimal
a0e32s0     558GiB  a0d0  online
a0e32s1     558GiB  a0d0  online
a0e32s2     558GiB  a0d0  online
a0e32s3     558GiB  a0d0  online
a0e32s4     558GiB  a0d0  online
a0e32s5     558GiB  a0d0  online
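
As a rough sketch, the physical-disk IDs in this output can be turned into the enclosure:slot pairs that megacli expects inside [E:S]. It assumes the ID layout a<adapter>e<enclosure>s<slot> seen in the sample above; the function name megasasctl_to_es is made up for this example.

```shell
# Read megasasctl output on stdin and print "enclosure:slot" for every
# physical-disk line. Assumes IDs of the form a<adapter>e<enclosure>s<slot>.
megasasctl_to_es() {
  sed -n -E 's/^a[0-9]+e([0-9]+)s([0-9]+)[[:space:]].*/\1:\2/p'
}

# Sample lines copied from the output above.
printf '%s\n' \
  'a0       PERC H700 Integrated     encl:1 ldrv:1  batt:good' \
  'a0d0      1675GiB RAID 10  3x2  optimal' \
  'a0e32s0     558GiB  a0d0  online' \
  'a0e32s1     558GiB  a0d0  online' | megasasctl_to_es
```

Adapter and logical-drive lines do not match the pattern and are skipped, so only the disk lines produce output.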

Array operations:

Single-disk RAID 0:
  megacli -CfgLdAdd -r0[E0:S0] -a0

Two-disk RAID 1

Multi-disk RAID 6
  megacli -CfgLdAdd -r6[17:0,17:1,17:2,17:3,17:4,17:5,17:6,17:7,17:8,17:9,17:10,17:11,17:12,17:13,17:14] -a0

Multi-disk RAID 5

Multi-disk RAID 10
  2x2 RAID 10
    megacli -CfgSpanAdd -r10 -Array0[32:2,32:3] -Array1[32:4,32:5] -a0
  4x2 RAID 10
    megacli -CfgSpanAdd -r10 -Array0[32:2,32:3] -Array1[32:4,32:5] -Array2[32:6,32:7] -Array3[32:8,32:9] -a0

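Adaptec series initialization

The Reported Channel,Device numbers are what identify a physical disk, and each pair is unique. They can be read with arcconf getconfig 1 pd. See also arcconf create -h.

Get disk information: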

>sudo arcconf getconfig 1 pd
Device #0
Device is a Hard drive
State                              : Online
Supported                          : Yes
Transfer Speed                     : SATA0 Gb/s
Reported Channel,Device(T:L)       : 0,8(8:0)  # the ID to use
Reported Location                  : Enclosure 0, Slot 0
Reported ESD(T:L)                  : 2,0(0:0)
Vendor                             : Hitachi
Model                              : HDS722020ALA330
Firmware                           : JKAOA3EA
Serial number                      : JK11A4B8KBT5YW
Size                               : 1907729 MB
Write Cache                        : Enabled (write-back)
FRU                                : None
S.M.A.R.T.                         : No
S.M.A.R.T. warnings                : 0
Power State                        : Full rpm
Supported Power States             : Full rpm,Powered off,Reduced rpm
SSD                                : No
MaxCache Capable                   : No
MaxCache Assigned                  : No
NCQ status                         : Enabled

Array operations:

Single-disk RAID 0
  sudo arcconf create 1 logicaldrive max volume 0 31 noprompt

Multi-disk RAID 60
arcconf create 1 LOGICALDRIVE MAX 60 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 0 20 0 21 0 22 0 23 0 24 0 25 0 26 0 27 0 28 0 29 0 30 0 31 noprompt

Logical drive initialization

Partitioning

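An MBR partition table only supports partitions up to 2 TB, so anything larger needs a GPT partition table. Partitions smaller than 2 TB can of course use GPT as well.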
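
The size rule can be written down as a small check. This sketch assumes 512-byte logical sectors, where the MBR limit is 2^32 sectors (2 TiB); the function name label_for_size is hypothetical, and msdos is the name parted uses for an MBR label.

```shell
# Pick a partition-table type from a disk size in bytes.
# Assumption: 512-byte logical sectors, so MBR tops out at 2^32 sectors.
label_for_size() {
  local bytes="$1"
  local mbr_limit=$((512 * 4294967296))   # 2199023255552 bytes (2 TiB)
  if [ "$bytes" -gt "$mbr_limit" ]; then
    echo gpt
  else
    echo msdos
  fi
}

label_for_size 250000000000     # a 250 GB disk
label_for_size 3000592982016    # a 3 TB disk
```

The result can be passed to parted's mklabel. Since GPT also works for small disks, always using gpt is an equally valid choice.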

cfdisk/fdisk
parted

Choose the partition table

sudo parted -s /dev/sdb mklabel gpt

Create the partition

sudo parted -s -- /dev/sdb mkpart primary 0 -1s
parted -s -- /dev/sda mkpart primary 0 100%

Change the partition table and create the partition in one step

sudo parted -s /dev/sda -- mklabel gpt mkpart primary 0 -1

partprobe/partx
testdisk: recover the partition table

Formatting

mkfs.ext3/ext4 can set the UUID at format time. mkfs.xfs (at least in older xfsprogs releases) cannot, so the UUID has to be changed afterwards by hand with xfs_admin.

sudo xfs_admin -U cd1b9cdb-fa95-4bbe-9a93-57196ec5510c /dev/sdg1
Edit fstab

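Disks are normally mounted by UUID. Mounting by device name is not recommended: a UUID is unique and stays the same, while device names can change when the system reboots.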
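
A minimal sketch of building such an fstab entry. The function name fstab_line is hypothetical, the UUID and mount point are illustrative values borrowed from elsewhere on this page, and the trailing "0 0" (no dump, no fsck pass) is an assumption to adjust as needed.

```shell
# Build an fstab entry that mounts by UUID.
# Arguments: uuid, mount point, filesystem type, optional mount options.
fstab_line() {
  local uuid="$1" mountpoint="$2" fstype="$3" opts="${4:-defaults}"
  printf 'UUID=%s %s %s %s 0 0\n' "$uuid" "$mountpoint" "$fstype" "$opts"
}

# Illustrative values only.
fstab_line cd1b9cdb-fa95-4bbe-9a93-57196ec5510c /srv/0 xfs noatime
```

On a real system the UUID would come from a tool such as blkid, and the printed line would be reviewed before it is appended to /etc/fstab.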

Create the mount point
Mount and fix the directory ownership:

mount -a
chown root:root /srv/0

Mounting an in-memory filesystem:

mount -t tmpfs -o nosuid,nodev,size=10G,mode=1777 tmpfs /mnt/

Mounting a removable disk:

mount -t ntfs -o uid=music,gid=netease,fmask=133,dmask=022 /dev/sdz1 /mnt/usb/

Using bind:

/vicepa/unfs /home/unfs none defaults,bind 0 0
/vicepa/dfs /home/dfs none defaults,bind 0 0
sudo mount --bind /mnt/ssd/0/ /data/

Mounting with nobarrier:

/dev/sdb1 on /mnt/ndir type xfs (rw,noatime,attr2,delaylog,noquota)
UUID=74e18540-397a-469d-9599-3e404ca23ed2 /mnt/ndir xfs noatime,nobarrier 0  0
/dev/sdb1 on /mnt/ndir type xfs (rw,noatime,attr2,delaylog,nobarrier,noquota)

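Troubleshooting

LSI series: failed disk

  • Check the dmesg output
  • Check the mount point: whether the directory is readable and writable, and whether it is in use
    • sudo fuser -mv /srv
  • umount the directory
  • Find the failed disk
    • megasasctl / megactl
  • Ask the data center to replace the disk
  • Initialize the new disk

Adaptec series: failed disk

  • Check the dmesg output for offline messages
  • Check the mount point: whether the directory is readable and writable, and whether it is in use
    • fuser or lsof
  • umount the directory
  • Find the failed disk
    • sudo arcconf getconfig 1 pd|grep "Slot"
  • Ask the data center to replace the disk
  • Initialize the new disk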

Data recovery

  • dd_rescue: recover data
  • testdisk: recover the partition table
  • ufsxsci: recover data

Replacing a disk that has bad blocks but has not failed yet

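LSI series
  • Locate the physical disk with bad sectors
    • Use megasasctl or sudo megacli -CfgDsply -a0
  • Find the physical disks with bad sectors by searching for 'Media Error Count'
  • Locate the logical drive that contains the physical disk
    • sudo megacli -CfgDsply -a0
    • Virtual Drive: 1 (Target Id: 1)
  • Take the logical drive out of service (this command deletes it)
    • sudo megacli -CfgLdDel -L1 -a0
  • Locate the bad disk and light its LED so the data center can identify it
  • Locate a specific disk (by driving the matching indicator LED on the enclosure)
    • sudo megacli -PdLocate -start -PhysDrv[0:5] -a0 // 0:5 is the Enclosure ID and Slot Number of the disk to locate
    • sudo megacli -PDOffline -PhysDrv[21:0] -a0
  • Data center handles the replacement
  • umount the partition on the bad disk
  • Initialize the new disk
Adaptec series
  • umount the partition on the bad disk
  • Check for bad sectors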
$ sudo arcconf GETLOGS 1 DEVICE
Controllers found: 1
<ControllerLog controllerID="0" type="0" time="1414569452" version="4" tableFull="false">
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2MX9S" wwn="0000000000000000" deviceID="17" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="30" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2MWQS" wwn="0000000000000000" deviceID="16" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="35" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DTLF" wwn="0000000000000000" deviceID="18" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="1" mediumErrors="10" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DTRL" wwn="0000000000000000" deviceID="20" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="42" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2MYGP" wwn="0000000000000000" deviceID="12" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="39" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DVC4" wwn="0000000000000000" deviceID="31" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="9" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2MXSG" wwn="0000000000000000" deviceID="9" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="2" mediumErrors="10" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2D2YF" wwn="0000000000000000" deviceID="8" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="1" mediumErrors="10" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DTXL" wwn="0000000000000000" deviceID="22" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="4" mediumErrors="31" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DVJ3" wwn="0000000000000000" deviceID="14" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="17" smartWarning="0" />
</ControllerLog>

Command completed successfully.
  • Identify the affected logical drive

    • arcconf cannot delete a logical drive that is still mounted; attempting it fails with an error
  • While mounted

    • sudo arcconf delete 1 logicaldrive 15
Controllers found: 1
Logical device 15 is mounted on /mnt/dfs/15 and cannot be deleted.
Command aborted.
  • While unmounted

sudo arcconf delete 1 logicaldrive 14

Controllers found: 1

WARNING: logical device 14 may contain a partition.
All data in logical device 14 will be lost.
Delete the logical device?
Press y, then ENTER to continue or press ENTER to abort:
  • Blink the LED of the bad disk
    • sudo arcconf IDENTIFY 1 LOGICALDRIVE 14
Controllers found: 1
The specified device is blinking.
Press any key to stop the blinking.
  • Data center replaces the disk
  • Initialize the new disk
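
For the "check for bad sectors" step in the Adaptec list above, the GETLOGS output is easier to read once it is ranked by mediumErrors. A minimal sketch, assuming the attribute order shown in the sample log (serialNumber, then deviceID, then mediumErrors); the function name medium_errors_by_device is hypothetical.

```shell
# Read "arcconf GETLOGS 1 DEVICE" output on stdin and print
# "<mediumErrors> <deviceID> <serialNumber>", highest error count first.
# Assumes the attribute order seen in the sample log above.
medium_errors_by_device() {
  sed -n -E 's/.*serialNumber="([^"]*)".* deviceID="([0-9]+)".* mediumErrors="([0-9]+)".*/\3 \2 \1/p' |
    sort -k1,1nr
}

# Two entries copied from the sample log above.
printf '%s\n' \
  '<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DVC4" wwn="0000000000000000" deviceID="31" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="9" smartWarning="0" />' \
  '<driveErrorEntry smartError="false" vendorID="" serialNumber="W1F2DTRL" wwn="0000000000000000" deviceID="20" productID="ST3000DM" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="42" smartWarning="0" />' |
  medium_errors_by_device
```

Lines that are not driveErrorEntry elements produce no output, so the whole command output can be piped in unchanged.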

Adaptec series: replacing a JBOD disk

  • umount the partition on the bad disk
  • Use smartctl to read the disk serial number (Serial number)
    • sudo smartctl -a /dev/sde
Vendor:               WDC
Product:              WD3000FYYZ-01UL1
Revision:             01.0
User Capacity:        2,995,729,203,200 bytes [2.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x50014ee2b3ed2863
Serial number:             WD-WCC131257088
Device type:          disk
Local Time is:        Wed Sep 17 12:47:29 2014 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
  • Use the serial number to find the disk's channel ID
    • sudo arcconf getconfig 1 pd|less
Device #3
Device is a Hard drive
State                              : Online (JBOD)
Supported                          : Yes
Transfer Speed                     : SATA 6.0 Gb/s
Reported Channel,Device(T:L)       : 0,11(11:0) (needed to blink the LED)
Reported Location                  : Enclosure 1, Slot 3
Reported ESD(T:L)                  : 2,1(1:0)
Vendor                             : WDC
Model                              : WD3000FYYZ-01UL1
Firmware                           : 01.01K02
Serial number                      : WD-WCC131257088
Size                               : 2861588 MB
Write Cache                        : Enabled (write-back)
FRU                                : None
S.M.A.R.T.                         : No
S.M.A.R.T. warnings                : 0
Power State                        : Full rpm
Supported Power States             : Full rpm,Powered off,Reduced rpm
NCQ status                         : Enabled
  • Blink the LED
    • sudo arcconf IDENTIFY 1 DEVICE 0 11
Controllers found: 1
The specified device is blinking.
Press any key to stop the blinking.
  • Data center replaces the disk

  • Initialize the new disk

    • sudo arcconf create 1 JBOD 0 11 noprompt
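
The serial-number lookup in the steps above can be sketched with awk instead of paging through the output by hand. This assumes the field layout of the arcconf getconfig 1 pd sample shown here, with the Reported Channel,Device line appearing before the Serial number line of the same device; the function name channel_device_for_serial is hypothetical.

```shell
# Read "arcconf getconfig 1 pd" output on stdin and print "<channel> <device>"
# for the disk whose serial number equals $1.
# Assumes "Reported Channel,Device" precedes "Serial number" in each device block.
channel_device_for_serial() {
  awk -v want="$1" '
    /Reported Channel,Device/ {
      v = $0
      sub(/^[^)]*\)[ \t]*:[ \t]*/, "", v)   # drop the "Reported Channel,Device(T:L) :" label
      sub(/\(.*/, "", v)                     # drop the "(T:L)" value and anything after it
      gsub(/[ \t]/, "", v)
      split(v, cd, ",")
      chan = cd[1]; dev = cd[2]
    }
    /Serial number/ {
      s = $0
      sub(/^[^:]*:[ \t]*/, "", s)
      sub(/[ \t]+$/, "", s)
      if (s == want) print chan, dev
    }
  '
}

# Sample lines in the format shown above.
printf '%s\n' \
  'Device #3' \
  'Reported Channel,Device(T:L)       : 0,11(11:0)' \
  'Serial number                      : WD-WCC131257088' |
  channel_device_for_serial WD-WCC131257088
```

The two numbers it prints are the channel and device arguments used by the arcconf IDENTIFY and arcconf create commands in the list above.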

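SCSI controller series: failed disk

  • umount the partition on the bad disk
  • Use smartctl to read the disk serial number (Serial number)
    • sudo smartctl -a /dev/sde
Vendor:               WDC
Product:              WD3000FYYZ-01UL1
Revision:             01.0
User Capacity:        2,995,729,203,200 bytes [2.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x50014ee2b3ed2863
Serial number:             WD-WCC131257088
Device type:          disk
Local Time is:        Wed Sep 17 12:47:29 2014 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
  • Shut down and ask the data center to replace the disk
    • A SCSI controller has no LED-locate mechanism, so the data center cannot easily tell which disk is which and may pull the wrong one. Providing the disk serial number lets the staff check it by hand and avoid pulling the wrong disk.
  • Data center replaces the disk
  • Initialize the new disk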

Common problems

LSI series

  • Creating a vdisk fails with the following error
    • sudo megacli -CfgLdAdd -r0[17:6] -a0
Adapter 0: Configure Adapter Failed
FW error description:
The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.
Exit Code: 0x54
  • Fix:
sudo megacli -GetPreservedCacheList -a0
sudo megacli -DiscardPreservedCache -L6 -a0
  • Result:
>sudo megacli -GetPreservedCacheList -a0
Adapter #0
Virtual Drive(Target ID 06): Missing.
Exit Code: 0x00
>sudo megacli -DiscardPreservedCache -L6 -a0
Adapter #0
Virtual Drive(Target ID 06): Preserved Cache Data Cleared.
Exit Code: 0x00

  • A newly installed disk is in the JBOD state, so creating a vdisk fails
    • sudo megacli -CfgLdAdd -r0[21:7] -a0
The specified physical disk does not have the appropriate attributes to complete the requested command.
Exit Code: 0x26
  • Check whether the disk is in the JBOD state

    • sudo megacli -PDList -a0|grep JBOD
  • Fix:

>sudo megacli -PDMakeGood -PhysDrv[21:7] -force -a0

Adapter: 0: EnclId-21 SlotId-7 state changed to Unconfigured-Good.
Exit Code: 0x00
>sudo megacli -CfgLdAdd -r0[21:7] -a0

Adapter 0: Created VD 8
Adapter 0: Configured the Adapter!!
Exit Code: 0x00

  • megacli crashes while clearing the cache, so the cache cannot be cleared; ask the data center to handle it manually
>sudo megacli -DiscardPreservedCache -L5 -a0

Adapter #0
Segmentation fault

  • Setting the RAID card cache policy
  • Check the RAID cache status
>sudo megacli -LDGetProp -Cache -Lall -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteThrough, ReadAdaptive, Direct, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Direct, Write Cache OK if bad BBU
Exit Code: 0x00
  • Set the RAID card cache policy
>sudo megacli -LDSetProp ForcedWB -L2 -a0

Set Write Policy to Forced WriteBack on Adapter 0, VD 2 (target id: 2) success
Exit Code: 0x00
  • Forcing the cache policy to ForcedWB while the RAID card battery is faulty is risky: if the machine crashes or loses power, the data still in the cache cannot be flushed to disk and is lost. This trades safety for performance and is not recommended.
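
To audit a controller for this risk, the cache-policy listing can be filtered for virtual drives that keep write cache enabled with a bad BBU. A minimal sketch, assuming the line format of the -LDGetProp -Cache output shown above; the function name risky_cache_vds is hypothetical.

```shell
# Read "megacli -LDGetProp -Cache -Lall -a0" output on stdin and print the
# target id of every virtual drive reporting "Write Cache OK if bad BBU".
risky_cache_vds() {
  sed -n -E 's/.*\(target id: ([0-9]+)\).*Write Cache OK if bad BBU.*/\1/p'
}

# Sample lines in the format shown above.
printf '%s\n' \
  'Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAdaptive, Direct, No Write Cache if bad BBU' \
  'Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Direct, Write Cache OK if bad BBU' |
  risky_cache_vds
```

Drives reporting "No Write Cache if bad BBU" do not match, so an empty result means no virtual drive is in the forced write-back mode.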

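Checking Intel SSD lifetime

LSI RAID cards

Approach:

  • Get the Slot Number
    • sudo megacli -pdlist -a0
    • sudo megasasctl -v
  • Use smartctl to read the SSD lifetime information
    • sudo smartctl -a -d megaraid,8 /dev/sdd
    • This reads the SMART data of the disk in Slot Number 8 through the device node /dev/sdd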
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct   0x0032   096   096   000    Old_age   Always         -       0
9 Power_On_Hours          0x0032   100   100   000    Old_age   Always         -       544
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always         -       9
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always         -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always         -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always         -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always         -       6
175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always         -       13044744823
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always         -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always         -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always         -       0
190 Airflow_Temperature_Cel 0x0022   084   071   000    Old_age   Always         -       16 (Min/Max 10/29)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always         -       6
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always         -       16
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always         -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always         -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always         -       357286
226 Load-in_Time            0x0032   100   100   000    Old_age   Always         -       849
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always         -       50
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always         -       12518
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always         -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always         -       0
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always         -       0
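  • Pick out the value you need
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always         -       0

A value of 100 100 means no wear. The number drops as wear accumulates; once it reaches 1 it stops decreasing and the SSD has reached the end of its life.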
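
Extracting that value for monitoring can be sketched as follows. It assumes the attribute table layout shown above, where the fourth column is the normalized VALUE; the function name media_wearout_value is hypothetical.

```shell
# Read "smartctl -a" output on stdin and print the normalized VALUE column
# of attribute 233 (Media_Wearout_Indicator) as a plain integer.
media_wearout_value() {
  awk '$1 == 233 && $2 == "Media_Wearout_Indicator" { print $4 + 0 }'
}

# Sample line copied from the table above.
echo '233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always         -       0' |
  media_wearout_value
```

Adding 0 strips leading zeros, so a reading such as 096 is printed as 96 and can be compared numerically against an alert threshold.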


Adaptec series

Collect RAID card information for the vendor
sudo arcconf SAVESUPPORTARCHIVE
tar -zpcv -f Support.tar.gz /var/log/Support/
2016-04-06 Linux Raid Disk