Thursday, May 8, 2014

UCS B - Disk failure Troubleshooting via IPMI

Introduction

This document describes several command-line interface (CLI) commands, as well as other troubleshooting techniques, that can help troubleshoot hard disk drive (HDD) issues on UCS B-Series blade servers. It focuses on a fast and accurate method for troubleshooting HDD issues using the IPMI sensor output from the CLI, followed by a post-check using the show tech-support output.



Troubleshooting Steps


UCSM
- Start in UCSM and check what the Faults tab reports about the failure.
- Go to Equipment > Chassis > Server > Faults tab.
- In the example below, Local disk 1 on server 4/1 reports a drive fault.

[Screenshot: UCSM Faults tab]



IPMI (Intelligent Platform Management Interface) sensor readings
- You can check the status of each HDD from the IPMI sensor readings. This method is quick and especially useful during live troubleshooting. The steps are as follows.

Step 1. Connect to the CIMC Debug Firmware Utility Shell:

connect cimc <chassis/blade number>

[Screenshot: connect cimc session]


Step 2. Type "sensors fault" (as shown above) to display the disk status.

In this case there are two hard disks, each with a different status: one reports 0x2202 and the other 0x0101.
What do these values mean?


Code interpretation:

Bit[15:10] - Unused
Bit[9:8]   - Fault
Bit[7:4]   - LED Color
Bit[3:0]   - LED State
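Each field above is a contiguous bit range, so it can be isolated with a mask. The Python sketch below is illustrative only: the helper names field_mask, field and LAYOUT are hypothetical and not part of any Cisco tool. It derives each mask from its Bit[hi:lo] range and leaves the result in place instead of shifting it down, so it compares directly with values such as 0x200 and 0x02 in the tables below.

```python
# Bit layout of the HDD status reading, as listed above: name -> (high bit, low bit).
LAYOUT = {
    "unused": (15, 10),
    "fault": (9, 8),
    "led_color": (7, 4),
    "led_state": (3, 0),
}


def field_mask(hi: int, lo: int) -> int:
    """Mask covering bits hi..lo inclusive, e.g. Bit[9:8] -> 0x300."""
    return ((1 << (hi - lo + 1)) - 1) << lo


def field(value: int, hi: int, lo: int) -> int:
    """Return Bit[hi:lo] of value, left in place (not shifted down)."""
    return value & field_mask(hi, lo)


if __name__ == "__main__":
    reading = 0x2202  # one of the two readings from "sensors fault"
    for name, (hi, lo) in LAYOUT.items():
        print(f"{name:<10} Bit[{hi}:{lo}]  mask=0x{field_mask(hi, lo):04X}  "
              f"value=0x{field(reading, hi, lo):04X}")
```

For 0x2202 this gives Fault 0x0200 and LED State 0x0002, plus 0x2000 in Bit[15:10], a range the layout marks as unused and this article does not explain further.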

Fault:
0x100 - On Line
0x200 - Degraded

LED Color:
0x10 - GREEN
0x20 - AMBER
0x40 - BLUE
0x80 - RED

LED State:
0x01 - OFF
0x02 - ON
0x04 - FAST BLINK
0x08 - SLOW BLINK


Examples:

1. 0x0101
Fault: 0x100 indicates On Line
LED state: 0x01 indicates OFF

So HDD1 is in a normal state.

2. 0x2202
Fault: 0x200 indicates Degraded
LED state: 0x02 indicates ON

So HDD2 is degraded and should be replaced.
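The manual decoding in these examples can be scripted. The Python sketch below is a hypothetical helper, not a Cisco utility: decode_hdd_status applies the masks for the Fault, LED Color and LED State fields and looks each result up in the tables above, and needs_replacement treats a Degraded fault as the replace condition, matching the conclusion for HDD2. Bits outside those three fields are ignored, and a zero field (such as LED Color in both sample readings) decodes to None.

```python
# Masks for Bit[9:8] (Fault), Bit[7:4] (LED Color) and Bit[3:0] (LED State).
FAULT_MASK = 0x0300
COLOR_MASK = 0x00F0
STATE_MASK = 0x000F

# Value tables copied from the code interpretation above.
FAULT = {0x100: "On Line", 0x200: "Degraded"}
LED_COLOR = {0x10: "GREEN", 0x20: "AMBER", 0x40: "BLUE", 0x80: "RED"}
LED_STATE = {0x01: "OFF", 0x02: "ON", 0x04: "FAST BLINK", 0x08: "SLOW BLINK"}


def _name(table: dict, raw: int):
    """None when the field is zero, else its table name (or 'unknown 0x..')."""
    if raw == 0:
        return None
    return table.get(raw, f"unknown 0x{raw:X}")


def decode_hdd_status(value: int) -> dict:
    """Decode a 16-bit HDD status reading; bits outside the three fields are ignored."""
    return {
        "fault": _name(FAULT, value & FAULT_MASK),
        "led_color": _name(LED_COLOR, value & COLOR_MASK),
        "led_state": _name(LED_STATE, value & STATE_MASK),
    }


def needs_replacement(value: int) -> bool:
    """True when the Fault field reads Degraded (0x200)."""
    return (value & FAULT_MASK) == 0x200


if __name__ == "__main__":
    # The two readings from the examples above.
    for disk, value in (("HDD1", 0x0101), ("HDD2", 0x2202)):
        verdict = "replace" if needs_replacement(value) else "ok"
        print(disk, f"0x{value:04X}", decode_hdd_status(value), verdict)
```

Running it flags HDD2 (0x2202) for replacement and reports HDD1 (0x0101) as ok, the same result reached by hand.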


If you already have a show tech-support bundle, look for the following output:

1. "show fault" in sam_techsupportinfo

Severity: Major
Code: F0181
Last Transition Time: 2014-03-08T14:45:06.209
ID: 1263592
Status: None
Description: Local disk 1 on server 4/1 operability: inoperable. Reason: Firmware Detected Drive Fault    <<<<<----
Affected Object: sys/chassis-4/blade-1/board/storage-SAS-1/disk-1
Name: Storage Local Disk Inoperable
Cause: Equipment Inoperable
Type: Equipment
Acknowledged: No
Occurrences: 1
Creation Time: 2014-03-08T14:45:06.209
Original Severity: Major
Previous Severity: Major
Highest Severity: Major
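On a large tech-support bundle, the "show fault" output can list many faults. The Python sketch below uses hypothetical helpers (parse_fault_records, local_disk_faults) to split the text into "Key: Value" records and keep F0181 local-disk faults like the one above. It assumes, based only on the single record shown here, that a record begins at a blank line or at a new "Severity:" line, and it splits each line at the first colon so values such as the Description keep their own colons.

```python
from typing import Dict, List

# Excerpt of the "show fault" record shown above.
SAMPLE = """\
Severity: Major
Code: F0181
Last Transition Time: 2014-03-08T14:45:06.209
ID: 1263592
Description: Local disk 1 on server 4/1 operability: inoperable. Reason: Firmware Detected Drive Fault
Affected Object: sys/chassis-4/blade-1/board/storage-SAS-1/disk-1
Name: Storage Local Disk Inoperable
"""


def parse_fault_records(text: str) -> List[Dict[str, str]]:
    """Group 'Key: Value' lines into one dict per fault record.

    Assumption: a new record starts at a blank line or a new 'Severity:' line.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition(":")  # split at the first colon only
        starts_new = not line or key.strip() == "Severity"
        if starts_new and current:
            records.append(current)
            current = {}
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def local_disk_faults(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep F0181 (Storage Local Disk Inoperable) faults."""
    return [r for r in records if r.get("Code") == "F0181"]


if __name__ == "__main__":
    for fault in local_disk_faults(parse_fault_records(SAMPLE)):
        print(fault["Affected Object"], "-", fault["Description"])
```

To use it on a real bundle, pass only the "show fault" section of sam_techsupportinfo, since other command output in that file can also contain "Key: Value" lines.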





2. obfl-log 

5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 38 02 00 00 01 02 00 00 3A 92 1A 53 20 00 04 0D 97 00 00 00 7F 01 FF FF # 238 | 03/08/2014 14:44:58 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED is on | Asserted
5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 39 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 05 FF FF # 239 | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED color is amber | Asserted
5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 3A 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 09 FF FF # 23a | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | Degraded | Asserted
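Each obfl-log entry ends with the decoded SEL record, and the same "|"-separated fields appear in the SEL log and messages log entries below. The Python sketch below (hypothetical parse_sel_line and degraded_drive_events, with the field layout inferred only from these samples) splits a line on "|" and keeps asserted Degraded events on drive-slot sensors, which is a quick way to scan a full log for the event shown here.

```python
from typing import Iterable, List, NamedTuple, Optional


class SelEvent(NamedTuple):
    record_id: str   # e.g. "23a"
    timestamp: str   # e.g. "03/08/2014 14:44:59"
    source: str      # e.g. "CIMC"
    sensor: str      # e.g. "Drive slot(Bay) HDD0_INFO #0x97"
    event: str       # e.g. "Degraded"
    state: str       # e.g. "Asserted"


# The three obfl-log lines shown above.
SAMPLE_LINES = [
    "5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 38 02 00 00 01 02 00 00 3A 92 1A 53 20 00 04 0D 97 00 00 00 7F 01 FF FF # 238 | 03/08/2014 14:44:58 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED is on | Asserted",
    "5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 39 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 05 FF FF # 239 | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED color is amber | Asserted",
    "5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 3A 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 09 FF FF # 23a | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | Degraded | Asserted",
]


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse the decoded '... # <id> | time | source | sensor | event | state' tail."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 6:
        return None  # not a decoded SEL line
    # The record id is the token after the last '#' in the first field.
    record_id = parts[0].rsplit("#", 1)[-1].strip()
    return SelEvent(record_id, *parts[1:])


def degraded_drive_events(lines: Iterable[str]) -> List[SelEvent]:
    """Keep asserted 'Degraded' events reported by Drive slot(Bay) sensors."""
    events = (parse_sel_line(line) for line in lines)
    return [
        e for e in events
        if e is not None
        and e.sensor.startswith("Drive slot")
        and e.event == "Degraded"
        and e.state == "Asserted"
    ]


if __name__ == "__main__":
    for ev in degraded_drive_events(SAMPLE_LINES):
        print(ev.record_id, ev.timestamp, ev.sensor)
```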



3. SEL log

 # 38 02 00 00 01 02 00 00 3A 92 1A 53 20 00 04 0D 97 00 00 00 7F 01 FF FF # 238 | 03/08/2014 14:44:58 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED is on | Asserted
 # 39 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 05 FF FF # 239 | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED color is amber | Asserted
 # 3A 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 09 FF FF # 23a | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | Degraded | Asserted
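The hex bytes before the decoded text appear to carry the same information. Comparing the three records suggests that the record ID is stored little-endian in the first two bytes (38 02 for record 238), the timestamp as a little-endian Unix time in bytes 8-11, and the IPMI sensor type and sensor number in bytes 15 and 16 (0x0D is the IPMI Drive Slot (Bay) sensor type, and 0x97 matches "#0x97"). The Python sketch below (hypothetical decode_sel_hex) relies on these offsets, which are inferred from this sample rather than taken from a Cisco specification. The first timestamp decodes to 2014-03-08 03:44:58 UTC, 11 hours earlier than the 14:44:58 in the decoded text, which suggests the text is rendered in a local time zone; treat that as an observation from this sample only.

```python
from datetime import datetime, timezone


def decode_sel_hex(hex_dump: str) -> dict:
    """Pull a few fields from the raw SEL hex shown before the decoded text.

    Offsets are inferred from the three sample records (an assumption):
      bytes 0-1   record ID, little-endian (38 02 -> 0x238)
      bytes 8-11  timestamp, little-endian Unix seconds
      byte 15     IPMI sensor type (0x0D = Drive Slot (Bay))
      byte 16     sensor number (0x97)
    """
    raw = bytes.fromhex(hex_dump)  # spaces between byte pairs are accepted
    if len(raw) < 17:
        raise ValueError("expected at least 17 bytes")
    seconds = int.from_bytes(raw[8:12], "little")
    return {
        "record_id": int.from_bytes(raw[0:2], "little"),
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sensor_type": raw[15],
        "sensor_number": raw[16],
    }


if __name__ == "__main__":
    rec = decode_sel_hex("3A 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 09 FF FF")
    print(hex(rec["record_id"]), rec["timestamp"].isoformat(),
          hex(rec["sensor_type"]), hex(rec["sensor_number"]))
```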



4. /var/log/messages

5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 38 02 00 00 01 02 00 00 3A 92 1A 53 20 00 04 0D 97 00 00 00 7F 01 FF FF # 238 | 03/08/2014 14:44:58 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED is on | Asserted
5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 39 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 05 FF FF # 239 | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | LED color is amber | Asserted
5:2014 Mar  8 14:44:59:2.1(1a):selparser:-: selparser.c:667: # 3A 02 00 00 01 02 00 00 3B 92 1A 53 20 00 04 0D 97 00 00 00 7F 09 FF FF # 23a | 03/08/2014 14:44:59 | CIMC | Drive slot(Bay) HDD0_INFO #0x97 | Degraded | Asserted

