How to Monitor Software RAID on Linux Servers
Monitoring your RAID array helps to identify potential failures early, ensuring data integrity and system stability. Regular checks using tools like *mdadm* and *smartmontools* provide insights into disk health, performance, and potential failures.
By proactively monitoring RAID arrays, you can increase the chances of preventing unexpected downtime and the time-consuming data recovery procedures that follow.
# Instructions to Monitor Software RAID on Linux Servers
Before monitoring your RAID array, it is essential to identify its configuration. Use the following commands to determine your RAID setup. Identifying your RAID setup helps you understand the type of redundancy and performance improvements it provides.
# Step 1: Identify Your RAID Array
- Check active RAID devices.
Open the terminal and run the following command to check active RAID devices for any degradation or array failures:
```
cat /proc/mdstat
```
Here is an example output with healthy disks:
Output
```
root@content-kit:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      249916416 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
```
To explain this output further:
- Personalities: Lists the available RAID types supported on the system. In this case, the system supports RAID1, RAID0, RAID6, RAID5, RAID4, and RAID10.
- md0: Indicates the active RAID array, in this case, md0 is configured as a RAID 1 (mirroring) array.
- Devices: The array consists of two NVMe drive partitions: nvme1n1p2 and nvme0n1p2. The numbers inside the square brackets [1] and [0] indicate their order in the array.
- Blocks and version: The RAID array contains 249916416 data blocks and uses the super 1.2 metadata format.
- [2/2] [UU]: This section shows the RAID member count and their status. [2/2] indicates that both disks are active, and [UU] means both disks are functioning correctly. If one disk fails, it will show [U_] or [_U], indicating which disk is degraded.
- Bitmap: The bitmap helps track changes to the RAID set, speeding up re-synchronization by reducing unnecessary data copying. In this example, the bitmap size is 8KB, with a chunk size of 65536KB.
- Unused devices: Indicates that no additional devices are currently unused within the RAID setup.
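If you want to turn the [UU] check described above into something scriptable, a minimal sketch is to grep /proc/mdstat for the underscore that marks a missing disk. The log path /var/log/raid_alert.log below is only an illustrative choice.
```
#!/bin/bash
# Minimal sketch: report degraded arrays by looking for a "_" inside the
# status brackets of /proc/mdstat (e.g. [U_] or [_U]).
# Run as root so the log file under /var/log is writable.
if grep -q '\[.*_.*\]' /proc/mdstat; then
    echo "$(date): a RAID array appears degraded" >> /var/log/raid_alert.log
else
    echo "All RAID arrays report a healthy status."
fi
```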
- Identify RAID partitions.
To identify RAID partitions and their layout, run:
```
lsblk
```
This will visualize your disk layout, showing RAID devices, partitions, and how storage is allocated. An example is:
Output
```
root@content-kit:~# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1       259:0    0 238.5G  0 disk
├─nvme0n1p1   259:1    0     1M  0 part
└─nvme0n1p2   259:2    0 238.5G  0 part
  └─md0         9:0    0 238.3G  0 raid1 /
nvme1n1       259:3    0 238.5G  0 disk
├─nvme1n1p1   259:4    0     1M  0 part
└─nvme1n1p2   259:5    0 238.5G  0 part
  └─md0         9:0    0 238.3G  0 raid1 /
```
Further explained, this shows:
- NAME: Lists devices and their partitions. Here, nvme0n1 and nvme1n1 are NVMe drives, each with partitions (nvme0n1p2 and nvme1n1p2) forming the RAID array md0.
- SIZE: Displays device capacity. Both disks are 238.5G, and md0 reflects the combined RAID size.
- TYPE: Identifies the device type - disk for physical drives, part for partitions, and raid1 for the RAID array.
- MOUNTPOINTS: Shows where devices are mounted. The RAID array md0 is mounted at /.
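As a small convenience, lsblk can be limited to just the columns discussed above with the -o option (on older util-linux releases the last column is named MOUNTPOINT rather than MOUNTPOINTS):
```
# Show only the columns relevant to RAID layout
lsblk -o NAME,SIZE,TYPE,MOUNTPOINTS
```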
- Gather detailed RAID information.
To gather detailed information about a specific RAID array, run this command, and replace /dev/md0 with your actual RAID device to retrieve crucial information such as RAID level, disk health, and recovery status:
```
sudo mdadm --detail /dev/md0
```
An example of this would be:
Output
```
root@content-kit:~# sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Jan 21 09:26:48 2025
        Raid Level : raid1
        Array Size : 249916416 (238.34 GiB 255.91 GB)
     Used Dev Size : 249916416 (238.34 GiB 255.91 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Jan 22 06:56:07 2025
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : 246013:0
              UUID : fd3e2b9a:da14efcd:73e749f8:50e44710
            Events : 911

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p2
       1     259        5        1      active sync   /dev/nvme1n1p2
```
Explanation of the output:
- Version: The RAID metadata version, here 1.2, which defines the format used to store RAID information.
- Creation Time: Indicates when the RAID array was created.
- RAID Level: Specifies the type of RAID configuration; in this case, RAID 1 (mirroring).
- Array Size: Displays the total capacity of the RAID array, which is 238.34 GiB.
- Used Dev Size: Shows the storage utilized by each device.
- Raid Devices / Total Devices: Number of active and total devices in the RAID setup.
- Persistence: Confirms that the RAID superblock is persistent, meaning it retains configuration across reboots.
- State: Displays the current status of the array, clean indicates no issues.
- Active / Working / Failed Devices: Provides counts of functioning, operational, and failed devices, respectively.
- Consistency Policy: Indicates that a bitmap is used to track changes and speed up rebuilds.
- Device list: Shows associated storage devices with their respective RAID roles.
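For use in scripts, mdadm can also signal array health through its exit code: when --detail is combined with --test, mdadm exits non-zero if the array is degraded, failed, or missing. A minimal sketch, assuming your array is /dev/md0:
```
#!/bin/bash
# Minimal sketch: rely on mdadm's exit status instead of parsing text.
# With --detail --test, a non-zero exit code indicates an unhealthy array.
if ! sudo mdadm --detail --test /dev/md0 > /dev/null; then
    echo "WARNING: /dev/md0 is not in a clean state" >&2
fi
```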
- Verify the RAID configuration file.
To verify and check RAID configurations stored on your system, run:
```
sudo cat /etc/mdadm/mdadm.conf
```
Which should return:
Output
```
root@content-kit:~# sudo cat /etc/mdadm/mdadm.conf
ARRAY /dev/md0 metadata=1.2 name=246013:0 UUID=fd3e2b9a:da14efcd:73e749f8:50e44710
MAILADDR alerts@internal-mx.cherryservers.com
```
An explanation of the output is:
- ARRAY /dev/md0: Specifies the RAID array device managed by *mdadm*. In this case, the array is identified as /dev/md0.
- metadata=1.2: Indicates the metadata version used to store RAID configuration details. The metadata helps the system recognize and rebuild the RAID array upon reboots.
- name=246013:0: This field assigns a unique name to the RAID array, which can help track and manage multiple RAID arrays.
- UUID=fd3e2b9a:da14efcd:73e749f8:50e44710: The unique identifier assigned to the RAID array. This UUID identifies the correct array, even if the device name changes.
- MAILADDR alerts@internal-mx.cherryservers.com: Defines the email address where notifications and alerts regarding RAID events (such as failures or degradations) will be sent.
Importance of the configuration:
The mdadm.conf file ensures that the RAID array is assembled automatically during system boot. The MAILADDR setting allows system administrators to receive critical RAID alerts proactively, helping to prevent data loss. For more details on creating and managing RAID arrays, refer to our dedicated guide to creating different types of RAID arrays.
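A quick sanity check, sketched below, is to compare the ARRAY line stored in mdadm.conf with what the running array actually reports; if the UUIDs differ, the file is stale and should be regenerated.
```
# Print the stored definition and the live definition; the UUIDs should match.
grep '^ARRAY' /etc/mdadm/mdadm.conf
sudo mdadm --detail --scan
```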
# Step 2: Monitor Your RAID Arrays
Once you have identified your RAID setup, the next step is to monitor it continuously to ensure optimal performance and prevent unexpected failures.
- Install monitoring tools.
To monitor RAID health, you will need to install the necessary tools using the package manager for your Linux distribution. The commands for popular Linux distributions are:
- Debian/Ubuntu-based distributions:
```
sudo apt update && sudo apt install mdadm smartmontools -y
```
- RHEL/CentOS-based distributions:
```
sudo dnf install mdadm smartmontools -y
```
Or for older CentOS versions:
```
sudo yum install mdadm smartmontools -y
```
- Arch Linux:
```
sudo pacman -S mdadm smartmontools --noconfirm
```
- openSUSE:
```
sudo zypper install mdadm smartmontools
```
Once the tools are installed, you can check the status and health of your RAID array. The following are crucial commands for monitoring your RAID status.
- Check RAID sync and failures.
To detect any degraded or syncing issues in the RAID array in real time, run:
```
cat /proc/mdstat
```
- Check disk health with smartmontools.
The smartctl utility provides detailed health reports for individual RAID disks. Run this command, replacing /dev/sda with the appropriate disk identifier for your system (e.g., /dev/nvme0n1 or /dev/sdb):
```
sudo smartctl -a /dev/sda
```
Some key things to watch for here are:
- Overall health status (e.g., PASSED or FAILED)
- Disk temperature and SMART attributes
- Reallocated sectors and potential failure indicators
Other useful smartctl options include:
- -H – Quick health check of the disk.
- -i – View basic disk information (model, serial, firmware).
- -t short|long – Perform self-tests to detect errors.
- -l error – Display recent error logs.
You can identify your drives using the lsblk command:
```
lsblk
```
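To check every member of the array in one pass, you can loop smartctl over the underlying disks. The sketch below hard-codes the two NVMe drives from the earlier examples; substitute the device names that lsblk reports on your system.
```
#!/bin/bash
# Minimal sketch: quick SMART health verdict for each RAID member disk.
# Device names are taken from the example array above; adjust as needed.
for disk in /dev/nvme0n1 /dev/nvme1n1; do
    echo "=== SMART health for $disk ==="
    sudo smartctl -H "$disk"
done
```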
# Step 3: Automate Monitoring with Cron Jobs
- To ensure regular monitoring, you can automate checks using cron jobs. To start, run:
```
crontab -e
```
If it's your first time using crontab, you will be prompted to select an editor.
Output
```
root@content-kit:~# crontab -e
no crontab for root - using an empty one

Select an editor.  To change later, run 'select-editor'.
  1. /bin/nano        <---- easiest
  2. /usr/bin/vim.basic
  3. /usr/bin/vim.tiny
  4. /bin/ed

Choose 1-4 [1]:
```
- Add the following entry to check RAID health daily at 3 AM, and log it. Ensure that you replace /dev/md0 with your actual RAID array (e.g., /dev/md127):
```
0 3 * * * /usr/sbin/mdadm --detail /dev/md0 >> /var/log/raid_status.log
```
To break this down:
- 0 3 * * * – This specifies the schedule for running the command: 0 – minute (0 minutes past the hour); 3 – hour (3 AM); * * * – every day of the month, every month, and every day of the week.
- /usr/sbin/mdadm --detail /dev/md0 – This command checks the detailed status of the RAID array.
- >> /var/log/raid_status.log – This appends the output to the specified log file for later review. You may opt to change the log location by modifying /var/log/raid_status.log to any preferred path (e.g., /home/user/raid_log.txt).
An example configuration would look like this:
```
  GNU nano 7.2                /tmp/crontab.J3M99T/crontab
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
0 3 * * * /usr/sbin/mdadm --detail /dev/md0 >> /var/log/raid_status.log
```
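You can extend the same crontab with additional checks. For example, the entries below (illustrative device names and log path) would run a weekly SMART short self-test on each member disk every Sunday morning, using the -t short option described earlier:
```
# Illustrative extra crontab entries: weekly SMART short self-tests.
0 4 * * 0 /usr/sbin/smartctl -t short /dev/nvme0n1 >> /var/log/smart_test.log 2>&1
5 4 * * 0 /usr/sbin/smartctl -t short /dev/nvme1n1 >> /var/log/smart_test.log 2>&1
```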
# Step 4: OPTIONAL - Set up Email Alerts
If desired, you can configure email notifications in the mdadm.conf file to receive automatic alerts in case of RAID issues.
- Edit the configuration file by running:
```
sudo nano /etc/mdadm/mdadm.conf
```
- Add or modify the following line to specify an email address for alerts. Replace the example with your desired email:
```
MAILADDR alerts@yourdomain.com
```
- Save and update the RAID configuration using:
```
sudo mdadm --detail --scan >> /etc/mdadm/mdadm.conf
sudo update-initramfs -u
```
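After updating the configuration, you can confirm that alert mail is actually delivered by asking mdadm to send a one-off test message for each array (this assumes a working mail setup on the server):
```
# Send a TestMessage alert for every array to the MAILADDR address, then exit.
sudo mdadm --monitor --scan --test --oneshot
```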
By implementing these monitoring solutions and automation methods, you can effectively ensure that your RAID arrays remain healthy and perform optimally. For further guidance on replacing failed disks, please visit our dedicated removing, replacing, and resyncing a disk guide.