Monitoring disks with smartctl behind an LSI MegaRAID controller

28.03.2016, category: Linux

For this we need the megacli utility, which tells us the IDs of the physical disks and the logical drives they belong to. Let's begin.
First, list the IDs of all physical disks behind the MegaRAID controller together with the numbers of the corresponding logical drives.

root@il-nv-s06:~# megacli -LdPdInfo -aALL | grep Id
Virtual Drive: 0 (Target Id: 0)
Device Id: 0
Device Id: 1
Device Id: 2
Device Id: 3
Virtual Drive: 1 (Target Id: 1)
Device Id: 13
Device Id: 12
Virtual Drive: 2 (Target Id: 2)
Device Id: 11
Device Id: 10
Device Id: 9
Device Id: 6
Device Id: 7
Device Id: 8

Here is what the command means:

-LdPdInfo – get information (Info) on logical (Ld) and physical (Pd) devices ...
-aALL – ... on all adapters
Now we can see that there are three logical (virtual) drives, each made up of several physical disks with the corresponding IDs. Let's check how many disks the server itself sees:

root@il-nv-s06:~# ls /dev/sd[a-z]
/dev/sda /dev/sdb /dev/sdc

That matches: the system has three logical disks. Lining them up with the megacli output:

Virtual Drive: 0 == /dev/sda, made up of 4 physical disks with ID=0,1,2,3
Virtual Drive: 1 == /dev/sdb, made up of 2 physical disks with ID=13,12
Virtual Drive: 2 == /dev/sdc, made up of 6 physical disks with ID=6,7,8,9,10,11
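The grouping above can also be produced mechanically. The following is only a sketch: the function name vd_map is made up for illustration, and it assumes the output keeps the "Virtual Drive:" / "Device Id:" line format shown earlier. Mapping a virtual drive number to /dev/sdX still has to be checked by hand, as done above.

```shell
#!/bin/bash
# Group physical Device Ids under their Virtual Drive number.
# Reads `megacli -LdPdInfo -aALL` output on stdin and prints one line
# per virtual drive in the form "<vd> <id>,<id>,...".
# vd_map is a hypothetical helper name, not part of megacli.
vd_map() {
    awk '
        /^Virtual Drive:/ {
            if (vd != "") print vd, ids      # flush the previous drive
            vd = $3; ids = ""
        }
        /^Device Id:/ {
            ids = (ids == "" ? $3 : ids "," $3)
        }
        END { if (vd != "") print vd, ids }
    '
}

# Demonstration on a shortened copy of the output shown above.
vd_map <<'EOF'
Virtual Drive: 0 (Target Id: 0)
Device Id: 0
Device Id: 1
Virtual Drive: 1 (Target Id: 1)
Device Id: 13
Device Id: 12
EOF
```

On a live server the function would be fed directly: megacli -LdPdInfo -aALL | vd_map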
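All that remains is to run a SMART check on each disk using the collected data. The -d megaraid,N option tells smartctl to address physical disk N behind the controller; the device name is the logical disk that the physical disk belongs to.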

root@il-nv-s06:~# cat smartcheck.sh
#!/bin/bash
echo "============================================="
echo "================== /dev/sda ================="
echo "============================================="
smartctl -d megaraid,0 -a /dev/sda
smartctl -d megaraid,1 -a /dev/sda
smartctl -d megaraid,2 -a /dev/sda
smartctl -d megaraid,3 -a /dev/sda
echo "============================================="
echo "================== /dev/sdb ================="
echo "============================================="
smartctl -d megaraid,13 -a /dev/sdb
smartctl -d megaraid,12 -a /dev/sdb
echo "============================================="
echo "================== /dev/sdc ================="
echo "============================================="
smartctl -d megaraid,11 -a /dev/sdc
smartctl -d megaraid,10 -a /dev/sdc
smartctl -d megaraid,9 -a /dev/sdc
smartctl -d megaraid,6 -a /dev/sdc
smartctl -d megaraid,7 -a /dev/sdc
smartctl -d megaraid,8 -a /dev/sdc
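
The same list of commands can be generated with a loop instead of being typed by hand. This is a minimal sketch, assuming the device-to-ID mapping collected above; smart_cmds is a hypothetical helper name, and it only prints the commands rather than running them.

```shell
#!/bin/bash
# Print (not run) the smartctl command for every physical disk behind
# one logical device.
# Usage: smart_cmds <logical device> <physical id> [<physical id> ...]
smart_cmds() {
    local dev=$1; shift
    local id
    for id in "$@"; do
        echo "smartctl -d megaraid,$id -a $dev"
    done
}

smart_cmds /dev/sda 0 1 2 3
smart_cmds /dev/sdb 13 12
smart_cmds /dev/sdc 11 10 9 6 7 8
```

Printing first makes it easy to review the commands; piping the output to bash would then execute them.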

Take the first disk as an example.

root@il-nv-s06:~# smartctl -d megaraid,0 -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.8.0-26-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor: SEAGATE
Product: ST31000424SS
Revision: 0005
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c5002130cd0b
Serial number: 9WK1D0420000C1051TRW
Device type: disk
Transport protocol: SAS
Local Time is: Fri Feb 7 20:24:25 2014 IST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 29 C
Drive Trip Temperature: 68 C
Manufactured in week 32 of year 2010
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 30
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 2
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 920579338
Blocks received from initiator = 3734205770
Blocks read from cache and sent to initiator = 2669309657
Number of read and write commands whose size <= segment size = 101596876
Number of read and write commands whose size > segment size = 1211
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 24230.63
number of minutes until next internal SMART test = 20

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3033913199      210         0  3033913409   3033913469      39052.656          60
write:           0        0         0           0            0       4141.743           0
verify:   75533051       10         0    75533061     75533061       1001.100           0

Non-medium error count: 14

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   24200                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]

As we can see, there are 60 read errors that the error-correction mechanism could not fix.
A few words on the error output. The error counter log (if available) is displayed on separate lines:

write error counters – write errors
read error counters – read errors
verify error counters (shown only when non-zero) – errors during verify operations
non-medium error counter (a single number) – the number of recoverable errors other than write/read/verify errors
A detailed description of the most recent errors, with their codes, may also be printed if the device supports it (if not, the message “Error Events logging not supported” is shown). For example (this sample is in the format smartctl prints for ATA drives):

Error 3 occurred at disk power-on lifetime: 23855 hours (993 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 4c 08 0f e0 Error: IDNF at LBA = 0x000f084c = 985164

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 4c 08 0f 00 08 19d+06:08:39.873 WRITE DMA
ca 00 08 5c 05 0f 00 08 19d+06:08:39.873 WRITE DMA
c8 00 10 9c a0 25 00 08 19d+06:08:39.866 READ DMA
c8 00 08 94 a0 25 00 08 19d+06:08:39.866 READ DMA
c8 00 08 8c a0 25 00 08 19d+06:08:39.862 READ DMA
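
The LBA in the "Error: IDNF at LBA" line can be recomputed from the register dump. The sketch below assumes the standard 28-bit ATA layout (SN holds LBA bits 7..0, CL bits 15..8, CH bits 23..16, and the low nibble of DH bits 27..24); lba_from_regs is a hypothetical helper name.

```shell
#!/bin/bash
# Rebuild a 28-bit LBA from the ATA registers printed by smartctl.
# Arguments are the hex values of SN, CL, CH and DH, in that order.
lba_from_regs() {
    local sn=$((16#$1)) cl=$((16#$2)) ch=$((16#$3)) dh=$((16#$4))
    # Only the low nibble of DH carries LBA bits 27..24.
    echo $(( ((dh & 0x0f) << 24) | (ch << 16) | (cl << 8) | sn ))
}

# Registers from "Error 3" above: SN=4c CL=08 CH=0f DH=e0
lba_from_regs 4c 08 0f e0   # prints 985164, i.e. 0x000f084c
```

The result agrees with the value smartctl itself reports in the sample, which is a quick way to confirm which sector a failed command was addressing.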

Each error counter has its own parameter code. The descriptions below are quoted from the Seagate SCSI manual:

Errors Corrected by ECC, fast [Errors corrected without substantial delay: 00h]. An error correction was applied to get perfect data (a.k.a. ECC on-the-fly). “Without substantial delay” means the correction did not postpone reading of later sectors (e.g. a revolution was not lost). The counter is incremented once for each logical block that requires correction. Two different blocks corrected during the same command are counted as two events.

Errors Corrected by ECC: delayed [Errors corrected with possible delays: 01h]. An error code or algorithm (e.g. ECC, checksum) is applied in order to get perfect data with substantial delay. “With possible delay” means the correction took longer than a sector time so that reading/writing of subsequent sectors was delayed (e.g. a lost revolution). The counter is incremented once for each logical block that requires correction. A block with a double error that is correctable counts as one event and two different blocks corrected during the same command count as two events.

Error corrected by rereads/rewrites [Total (e.g. rewrites and rereads): 02h]. This parameter code specifies the counter counting the number of errors that are corrected by applying retries. This counts errors recovered, not the number of retries. If five retries were required to recover one block of data, the counter increments by one, not five. The counter is incremented once for each logical block that is recovered using retries. If an error is not recoverable while applying retries and is recovered by ECC, it isn’t counted by this counter; it will be counted by the counter specified by parameter code 01h – Errors Corrected With Possible Delays.

Total errors corrected [Total errors corrected: 03h]. This counter counts the total of parameter code errors 00h, 01h and 02h (i.e. error corrected by ECC: fast and delayed plus errors corrected by rereads and rewrites). There is no “double counting” of data errors among these three counters. The sum of all correctable errors can be reached by adding parameter code 01h and 02h errors, not by using this total. [The author does not understand the previous sentence from the Seagate manual.]

Correction algorithm invocations [Total times correction algorithm processed: 04h]. This parameter code specifies the counter that counts the total number of retries, or “times the retry algorithm is invoked”. If after five attempts a counter 02h type error is recovered, then five is added to this counter. If three retries are required to get stable ECC syndrome before a counter 01h type error is corrected, then those three retries are also counted here. The number of retries applied to unsuccessfully recover an error (counter 06h type error) are also counted by this counter.

Gigabytes processed {10^9} [Total bytes processed: 05h]. This parameter code specifies the counter that counts the total number of bytes either successfully or unsuccessfully read, written or verified (depending on the log page) from the drive. If a transfer terminates early because of an unrecoverable error, only the logical blocks up to and including the one with the uncorrected data are counted. [smartmontools divides this counter by 10^9 before displaying it with three digits to the right of the decimal point. This makes this 64 bit counter easier to read.]

Total uncorrected errors [Total uncorrected errors: 06h]. This parameter code specifies the counter that contains the total number of blocks for which an uncorrected data error has occurred.
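
As a worked check of the definition of counter 03h, the sample error counter log from earlier can be added up: the "Total errors corrected" column should equal the sum of the fast, delayed and rereads/rewrites columns. A minimal sketch (total_corrected is a hypothetical helper name):

```shell
#!/bin/bash
# Sum counters 00h (ECC fast), 01h (ECC delayed) and 02h (rereads/rewrites);
# the result should equal counter 03h (total errors corrected).
total_corrected() {
    echo $(( $1 + $2 + $3 ))
}

total_corrected 3033913199 210 0   # read row,   prints 3033913409
total_corrected 75533051 10 0      # verify row, prints 75533061
```

Both results match the "Total errors corrected" column of the sample output, so for this drive the three counters do not overlap.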

Of all these, the one we care about is Total uncorrected errors, which shows the number of errors that could not be corrected. If this number is large, run a long self-test and additionally check the physical disk's parameters in the MegaRAID controller.
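
Pulling that one column out of the smartctl output is easy to script. The following is a sketch only: it assumes the read:/write:/verify: row format shown in the sample above, and both the helper name uncorrected and the LIMIT threshold are made up for illustration (what counts as "large" is a judgment call, not something smartmontools defines).

```shell
#!/bin/bash
# Print the "Total uncorrected errors" value (last column) of the
# read:, write: and verify: rows from `smartctl -a` output on stdin.
uncorrected() {
    awk '$1 == "read:" || $1 == "write:" || $1 == "verify:" { print $1, $NF }'
}

# LIMIT is a hypothetical threshold chosen for this example.
LIMIT=0

# Rows copied from the sample output above.
sample='read: 3033913199 210 0 3033913409 3033913469 39052.656 60
write: 0 0 0 0 0 4141.743 0
verify: 75533051 10 0 75533061 75533061 1001.100 0'

printf '%s\n' "$sample" | uncorrected | while read -r kind count; do
    if [ "$count" -gt "$LIMIT" ]; then
        echo "WARNING: $kind $count uncorrected errors"
    fi
done
```

On a live system the sample would be replaced by real output, for example: smartctl -d megaraid,0 -a /dev/sda | uncorrected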
