# Proxmox Cluster Debugging Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Debug the Proxmox cluster issue where nodes `mii01` and `naruto01` show up with a `?` status in the Web UI.

**Architecture:** The plan checks Proxmox VE versions, cluster health, Corosync service status and logs, node connectivity, and time synchronization, then documents the findings, applies fixes, and verifies the result.

**Tech Stack:** Proxmox VE, Corosync, SSH, Bash

---
### Task 1: Check Proxmox VE Versions

**Files:**
- N/A (SSH commands)

**Step 1: Check Proxmox VE version on all nodes**

Run the following commands, one per node:
```bash
ssh aya01 "pveversion"
ssh lulu "pveversion"
ssh inko01 "pveversion"
ssh naruto01 "pveversion"
ssh mii01 "pveversion"
```

Expected: Output showing the Proxmox VE version for each node.

**Step 2: Document the versions**

Record the versions in a file:
```bash
echo "Proxmox VE Versions:" > /tmp/proxmox_versions.txt
echo "aya01: $(ssh aya01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "lulu: $(ssh lulu "pveversion")" >> /tmp/proxmox_versions.txt
echo "inko01: $(ssh inko01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "naruto01: $(ssh naruto01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "mii01: $(ssh mii01 "pveversion")" >> /tmp/proxmox_versions.txt
```

Expected: File `/tmp/proxmox_versions.txt` with the versions of all nodes.
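
One likely use of this file is spotting version drift between nodes. The following is a minimal sketch of that comparison, not part of the original plan: the helper name `distinct_pve_versions` is hypothetical, and it assumes each recorded line looks like `node: pve-manager/<version>/<hash> (...)`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the distinct pve-manager versions recorded in a
# file of "node: pve-manager/<version>/<hash> ..." lines (assumed format).
distinct_pve_versions() {
  # Keep only lines that carry a version, take the field between the first
  # two slashes, and de-duplicate. Lines from failed SSH calls are skipped.
  grep 'pve-manager/' "$1" | cut -d/ -f2 | sort -u
}

# Usage (not run here): more than one output line means the nodes differ.
#   distinct_pve_versions /tmp/proxmox_versions.txt
```

A node whose SSH call failed leaves an empty entry in the file, so it is worth checking that all five nodes actually reported a version before trusting the comparison.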
### Task 2: Check Cluster Health

**Files:**
- N/A (SSH commands)

**Step 1: Check cluster status**

Run the following command on `aya01`:
```bash
ssh aya01 "pvecm status"
```

Expected: Output showing the cluster status and quorum.

**Step 2: Check node membership**

Run the following command on `aya01`:
```bash
ssh aya01 "pvecm nodes"
```

Expected: Output showing the list of active members in the cluster.
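
To turn the quorum check into a yes/no answer, the status output can be parsed. This is a sketch under assumptions: the helper name `is_quorate` is hypothetical, and it relies on `pvecm status` printing a `Quorate:` line with `Yes` or `No` in its quorum information block.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `pvecm status` output on stdin and succeed only if
# it contains a "Quorate: Yes" line (assumed output format).
is_quorate() {
  # Only the line whose first field is exactly "Quorate:" is considered, so
  # the separate "Flags: Quorate" line does not count as a match.
  awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit !found }'
}

# Usage (not run here):
#   ssh aya01 "pvecm status" | is_quorate && echo "cluster is quorate"
```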
### Task 3: Check Corosync Logs

**Files:**
- N/A (SSH commands)

**Step 1: Check Corosync service status**

Run the following commands, one per node:
```bash
ssh aya01 "systemctl status corosync pve-cluster"
ssh lulu "systemctl status corosync pve-cluster"
ssh inko01 "systemctl status corosync pve-cluster"
ssh naruto01 "systemctl status corosync pve-cluster"
ssh mii01 "systemctl status corosync pve-cluster"
```

Expected: Output showing the status of the Corosync and pve-cluster services on each node.

**Step 2: Analyze Corosync logs**

Run the following commands, one per node:
```bash
ssh aya01 "journalctl -u corosync -n 500 --no-pager"
ssh lulu "journalctl -u corosync -n 500 --no-pager"
ssh inko01 "journalctl -u corosync -n 500 --no-pager"
ssh naruto01 "journalctl -u corosync -n 500 --no-pager"
ssh mii01 "journalctl -u corosync -n 500 --no-pager"
```

Expected: The last 500 Corosync log lines from each node, ready for analysis.
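
Five hundred lines per node is a lot to read by eye, so a filter for the messages that usually matter can help. The sketch below is an illustration only: the helper name `corosync_events` is hypothetical, and the patterns (link state changes, retransmit lists, token timeouts, membership changes) are assumptions about what is worth looking at, not a complete list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read Corosync journal lines on stdin and keep only the
# ones matching a few assumed patterns of interest.
corosync_events() {
  grep -E -i 'link: .* is (down|up)|retransmit list|token has not been received|new membership'
}

# Usage (not run here):
#   ssh mii01 "journalctl -u corosync -n 500 --no-pager" | corosync_events
```

An empty result only means none of these patterns appeared in the sampled lines; the full log is still the reference.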
### Task 4: Verify Node Connectivity

**Files:**
- N/A (SSH commands)

**Step 1: Verify SSH connectivity**

Run the following commands to verify SSH connectivity from `aya01` to every other node:
```bash
ssh aya01 "ssh lulu 'echo SSH to lulu from aya01'"
ssh aya01 "ssh inko01 'echo SSH to inko01 from aya01'"
ssh aya01 "ssh naruto01 'echo SSH to naruto01 from aya01'"
ssh aya01 "ssh mii01 'echo SSH to mii01 from aya01'"
```

Expected: Each command prints its message, confirming SSH connectivity from `aya01` to that node.
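
The commands above only cover hops that start at `aya01`. If connectivity between every pair of nodes is of interest, the check can be generated rather than typed out. This is a sketch, not part of the original plan: `node_pairs` and `check_mesh` are hypothetical names, and the `BatchMode` and `ConnectTimeout` options are added on the assumption that an unreachable node should fail quickly instead of hanging or prompting.

```bash
#!/usr/bin/env bash
NODES="aya01 lulu inko01 naruto01 mii01"

# Print every ordered "source target" pair of distinct nodes.
node_pairs() {
  local src dst
  for src in $NODES; do
    for dst in $NODES; do
      [ "$src" = "$dst" ] || printf '%s %s\n' "$src" "$dst"
    done
  done
}

# Hypothetical wrapper (not invoked here): try each hop and report the result.
# The outer ssh uses -n so it does not consume the pair list on stdin.
check_mesh() {
  local src dst
  node_pairs | while read -r src dst; do
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$src" \
         "ssh -o BatchMode=yes -o ConnectTimeout=5 $dst true"; then
      echo "OK   $src -> $dst"
    else
      echo "FAIL $src -> $dst"
    fi
  done
}
```

SSH reachability does not by itself show that Corosync traffic gets through. Running `corosync-cfgtool -s` on a node reports the link status as Corosync sees it, which may be a useful complement.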
### Task 5: Check Time Synchronization

**Files:**
- N/A (SSH commands)

**Step 1: Check time synchronization**

Run the following commands, one per node:
```bash
ssh aya01 "timedatectl"
ssh lulu "timedatectl"
ssh inko01 "timedatectl"
ssh naruto01 "timedatectl"
ssh mii01 "timedatectl"
```

Expected: Output showing the time synchronization status for each node.
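
The relevant part of the `timedatectl` output is a single line, so the check can be reduced to a pass/fail per node. This is a minimal sketch: the helper name `clock_synced` is hypothetical, and it assumes the output contains a `System clock synchronized: yes` line, which may be worded differently on older systemd versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `timedatectl` output on stdin and succeed only if
# the "System clock synchronized" line says "yes" (assumed wording).
clock_synced() {
  grep -q -E '^[[:space:]]*System clock synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Usage (not run here):
#   for node in aya01 lulu inko01 naruto01 mii01; do
#     ssh "$node" timedatectl | clock_synced || echo "$node: clock not synchronized"
#   done
```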
### Task 6: Document Findings

**Files:**
- Create: `/tmp/cluster_debugging_findings.txt`

**Step 1: Document findings**

Collect the findings in a file:
```bash
echo "Cluster Debugging Findings:" > /tmp/cluster_debugging_findings.txt
# /tmp/proxmox_versions.txt already starts with its own "Proxmox VE Versions:" header
cat /tmp/proxmox_versions.txt >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Cluster Status:" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "pvecm status" >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Node Membership:" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "pvecm nodes" >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Corosync Logs (aya01):" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "journalctl -u corosync -n 500 --no-pager" >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Time Synchronization:" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "timedatectl" >> /tmp/cluster_debugging_findings.txt
ssh lulu "timedatectl" >> /tmp/cluster_debugging_findings.txt
ssh inko01 "timedatectl" >> /tmp/cluster_debugging_findings.txt
ssh naruto01 "timedatectl" >> /tmp/cluster_debugging_findings.txt
ssh mii01 "timedatectl" >> /tmp/cluster_debugging_findings.txt
```

Expected: File `/tmp/cluster_debugging_findings.txt` with all findings.
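
The step above records Corosync logs from `aya01` only and appends the `timedatectl` outputs without saying which node each came from. A loop with a per-node label is one way to address both. This is a sketch under assumptions: `run_on`, `per_node`, and `collect_findings` are hypothetical names, and the SSH options are added so that an unreachable node is reported rather than blocking the run.

```bash
#!/usr/bin/env bash
NODES="aya01 lulu inko01 naruto01 mii01"

# Run a command on a node; fail fast if the node is unreachable.
run_on() {
  ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" "$2"
}

# Print one titled section containing the command's output for every node,
# with each block labelled by the node it came from.
per_node() {
  local title="$1" cmd="$2" node
  echo "== $title =="
  for node in $NODES; do
    echo "--- $node ---"
    run_on "$node" "$cmd" 2>&1 || echo "(command failed on $node)"
  done
  echo ""
}

# Hypothetical entry point (not invoked here).
collect_findings() {
  per_node "Corosync Logs" "journalctl -u corosync -n 500 --no-pager"
  per_node "Time Synchronization" "timedatectl"
}

# Usage (not run here):
#   collect_findings >> /tmp/cluster_debugging_findings.txt
```

Keeping the failure message inside the file means a node that could not be reached shows up in the findings instead of silently leaving a gap.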
### Task 7: Analyze and Propose Fixes

**Files:**
- N/A (Analysis)

**Step 1: Analyze findings**

Analyze the findings documented in `/tmp/cluster_debugging_findings.txt` to identify the root cause of the issue.

**Step 2: Propose fixes**

Based on the analysis, propose fixes to resolve the issue. Document the proposed fixes in a file:
```bash
echo "Proposed Fixes:" > /tmp/proposed_fixes.txt
# Add proposed fixes here
```

Expected: File `/tmp/proposed_fixes.txt` with proposed fixes.
### Task 8: Apply Fixes

**Files:**
- N/A (SSH commands)

**Step 1: Apply fixes**

Apply the proposed fixes to resolve the issue. Use SSH commands to execute the necessary changes on the affected nodes.

Expected: Issue resolved and cluster functioning as expected.
### Task 9: Verify Resolution

**Files:**
- N/A (SSH commands)

**Step 1: Verify resolution**

Verify that the issue is resolved by checking the Web UI and running the following commands:
```bash
ssh aya01 "pvecm status"
ssh aya01 "pvecm nodes"
```

Expected: All nodes visible and operational in the Web UI, cluster status showing quorum, and all nodes listed as active members.
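
The "all nodes listed" part of this expectation can be checked mechanically against the five node names used throughout this plan. The sketch below is an assumption-laden illustration: `missing_nodes` is a hypothetical helper, and it assumes `pvecm nodes` prints membership rows that start with a numeric node ID and carry the node name in the third column.

```bash
#!/usr/bin/env bash
EXPECTED_NODES="aya01 lulu inko01 naruto01 mii01"

# Hypothetical helper: read `pvecm nodes` output on stdin and print every
# expected node that is absent from the membership list.
missing_nodes() {
  local listed node
  # Membership rows start with a numeric node ID; the name is column three.
  listed=$(awk '$1 ~ /^[0-9]+$/ { print $3 }')
  for node in $EXPECTED_NODES; do
    printf '%s\n' "$listed" | grep -qx "$node" || echo "$node"
  done
}

# Usage (not run here): no output means every expected node is a member.
#   ssh aya01 "pvecm nodes" | missing_nodes
```

This complements the Web UI check rather than replacing it, since a node can be a cluster member and still show `?` in the interface.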
### Task 10: Document Changes

**Files:**
- Create: `/tmp/cluster_debugging_changes.txt`

**Step 1: Document changes**

Document the changes made to resolve the issue:
```bash
echo "Changes Made:" > /tmp/cluster_debugging_changes.txt
# Add changes here
```

Expected: File `/tmp/cluster_debugging_changes.txt` with documented changes.
### Task 11: Commit Documentation

**Files:**
- Modify: `/home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md`

**Step 1: Update design document**

Append the findings, proposed fixes, and changes made to the design document:
```bash
# Path of the design document, set once to avoid repeating it on every line
DESIGN_DOC=/home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
{
  echo "## Findings"
  echo ""
  cat /tmp/cluster_debugging_findings.txt
  echo ""
  echo "## Proposed Fixes"
  echo ""
  cat /tmp/proposed_fixes.txt
  echo ""
  echo "## Changes Made"
  echo ""
  cat /tmp/cluster_debugging_changes.txt
} >> "$DESIGN_DOC"
```

Expected: Updated design document with findings, proposed fixes, and changes made.

**Step 2: Commit changes**

Commit the changes to the design document. `git -C` runs the command from the given directory, so this works regardless of the current working directory:
```bash
git -C /home/tudattr/workspace/ansible add docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
git -C /home/tudattr/workspace/ansible commit -m "docs: update Proxmox cluster debugging design with findings and fixes"
```
Expected: Changes committed to the repository.

---

**Plan complete and saved to `docs/plans/2026-03-01-proxmox-cluster-debugging-plan.md`. Two execution options:**

**1. Subagent-Driven (this session)** - I dispatch a fresh subagent per task, review between tasks, fast iteration

**2. Parallel Session (separate)** - Open a new session with executing-plans, batch execution with checkpoints

**Which approach?**