Proxmox Cluster Debugging Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Debug the Proxmox cluster issue in which nodes mii01 and naruto01 appear with a question mark (?) status icon in the Web UI.
Architecture: The plan checks Proxmox VE versions, cluster health, Corosync service state and logs, node connectivity, and time synchronization, then documents the findings, applies fixes, and verifies the result.
Tech Stack: Proxmox VE, Corosync, SSH, Bash
Task 1: Check Proxmox VE Versions
Files:
- N/A (SSH commands)
Step 1: Check Proxmox VE version on all nodes
Run the following commands, one per node:
ssh aya01 "pveversion"
ssh lulu "pveversion"
ssh inko01 "pveversion"
ssh naruto01 "pveversion"
ssh mii01 "pveversion"
Expected: Output showing the Proxmox VE version for each node.
Step 2: Document the versions
Document the versions in a file:
echo "Proxmox VE Versions:" > /tmp/proxmox_versions.txt
echo "aya01: $(ssh aya01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "lulu: $(ssh lulu "pveversion")" >> /tmp/proxmox_versions.txt
echo "inko01: $(ssh inko01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "naruto01: $(ssh naruto01 "pveversion")" >> /tmp/proxmox_versions.txt
echo "mii01: $(ssh mii01 "pveversion")" >> /tmp/proxmox_versions.txt
Expected: File /tmp/proxmox_versions.txt with the versions of all nodes.
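The two steps above can also be done in one pass. The following is a minimal sketch, assuming key-based SSH access to every node. The function names `collect_versions` and `count_distinct_versions` are hypothetical helpers, not part of Proxmox. An unreachable node is recorded rather than aborting the run, which matters here because two nodes are already misbehaving.

```shell
# Node list copied from the hosts named in this plan.
NODES="aya01 lulu inko01 naruto01 mii01"

# Print one "node: version" line per node; mark nodes that cannot be reached.
collect_versions() {
  for node in $NODES; do
    # BatchMode avoids hanging on a password prompt; ConnectTimeout bounds the wait.
    if version=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" pveversion 2>/dev/null); then
      printf '%s: %s\n' "$node" "$version"
    else
      printf '%s: UNREACHABLE\n' "$node"
    fi
  done
}

# Read collect_versions output on stdin and count the distinct version
# strings among reachable nodes (1 means every reachable node matches).
count_distinct_versions() {
  grep -v ': UNREACHABLE$' | cut -d' ' -f2- | sort -u | wc -l | tr -d ' '
}
```

A possible invocation is `collect_versions | tee /tmp/proxmox_versions.txt | count_distinct_versions`. A result greater than 1 suggests a version mismatch worth noting in the findings.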
Task 2: Check Cluster Health
Files:
- N/A (SSH commands)
Step 1: Check cluster status
Run the following command on aya01:
ssh aya01 "pvecm status"
Expected: Output showing the cluster status and quorum.
Step 2: Check node membership
Run the following command on aya01:
ssh aya01 "pvecm nodes"
Expected: Output showing the list of active members in the cluster.
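To turn the quorum check into a pass/fail result, the status output can be parsed. This is a minimal sketch that assumes `pvecm status` prints a line of the form `Quorate: Yes` or `Quorate: No`. Verify that against the actual output on aya01 before relying on it. `is_quorate` is a hypothetical helper name.

```shell
# Read `pvecm status` output on stdin.
# Succeeds only if a "Quorate:" line is present and its value is "Yes".
# Assumption: the status output contains such a line.
is_quorate() {
  awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
       END { exit !(found && ok) }'
}
```

For example: `ssh aya01 "pvecm status" | is_quorate && echo "cluster is quorate"`.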
Task 3: Check Corosync Logs
Files:
- N/A (SSH commands)
Step 1: Check Corosync service status
Run the following command against each node:
ssh aya01 "systemctl status corosync pve-cluster"
ssh lulu "systemctl status corosync pve-cluster"
ssh inko01 "systemctl status corosync pve-cluster"
ssh naruto01 "systemctl status corosync pve-cluster"
ssh mii01 "systemctl status corosync pve-cluster"
Expected: Output showing the status of Corosync and pve-cluster services.
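The full `systemctl status` output is verbose. A compact per-node summary can be sketched with `systemctl is-active`, which prints a single state word. The helper name `service_states` is hypothetical. Including `pvestatd` is an assumption that goes beyond the step above: it is the daemon that reports node status to the Web UI, so it may be relevant to the question-mark symptom.

```shell
NODES="aya01 lulu inko01 naruto01 mii01"
# corosync and pve-cluster come from the step above; pvestatd is an added assumption.
SERVICES="corosync pve-cluster pvestatd"

# Print "node service state" for every node/service pair.
# An empty reply (SSH failure) is reported as "unreachable".
service_states() {
  for node in $NODES; do
    for svc in $SERVICES; do
      state=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" systemctl is-active "$svc" 2>/dev/null)
      printf '%s %s %s\n' "$node" "$svc" "${state:-unreachable}"
    done
  done
}
```

Running `service_states | grep -v ' active$'` would list only the pairs that need attention.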
Step 2: Analyze Corosync logs
Run the following command against each node:
ssh aya01 "journalctl -u corosync -n 500 --no-pager"
ssh lulu "journalctl -u corosync -n 500 --no-pager"
ssh inko01 "journalctl -u corosync -n 500 --no-pager"
ssh naruto01 "journalctl -u corosync -n 500 --no-pager"
ssh mii01 "journalctl -u corosync -n 500 --no-pager"
Expected: Output showing the Corosync logs for analysis.
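Five hundred log lines per node is a lot to read by eye. A rough first-pass filter might look like the sketch below. The pattern list is an assumption about which messages tend to matter (link state changes, retransmits, token problems, membership and quorum changes) and is not exhaustive. `filter_corosync_log` is a hypothetical helper name.

```shell
# Case-insensitive patterns to surface; adjust to what the logs actually contain.
COROSYNC_PATTERNS='link.*down|retransmit|token|new membership|quorum'

# Read log text on stdin and print only the lines matching the patterns above.
filter_corosync_log() {
  grep -Ei "$COROSYNC_PATTERNS"
}
```

For example: `ssh naruto01 "journalctl -u corosync -n 500 --no-pager" | filter_corosync_log`. An empty result does not prove the node is healthy; it only means none of the assumed patterns matched.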
Task 4: Verify Node Connectivity
Files:
- N/A (SSH commands)
Step 1: Verify SSH connectivity
Run the following commands to verify SSH connectivity from aya01 to each of the other nodes:
ssh aya01 "ssh lulu 'echo SSH to lulu from aya01'"
ssh aya01 "ssh inko01 'echo SSH to inko01 from aya01'"
ssh aya01 "ssh naruto01 'echo SSH to naruto01 from aya01'"
ssh aya01 "ssh mii01 'echo SSH to mii01 from aya01'"
Expected: Each command prints its echo message, confirming SSH connectivity from aya01 to that node.
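The commands above test only the paths that start at aya01. If a full mesh is wanted, a sketch like the following checks every ordered pair of nodes. It assumes key-based SSH between all nodes, and `check_ssh_mesh` is a hypothetical helper name. Note that working SSH does not by itself prove the Corosync links are healthy; `corosync-cfgtool -s` on a node reports link status directly.

```shell
NODES="aya01 lulu inko01 naruto01 mii01"

# Try SSH from every node to every other node.
# Prints one ok/FAIL line per ordered pair; returns non-zero if any pair fails.
check_ssh_mesh() {
  failures=0
  for src in $NODES; do
    for dst in $NODES; do
      [ "$src" = "$dst" ] && continue
      if ssh -o BatchMode=yes -o ConnectTimeout=5 "$src" \
           "ssh -o BatchMode=yes -o ConnectTimeout=5 $dst true" >/dev/null 2>&1; then
        printf 'ok   %s -> %s\n' "$src" "$dst"
      else
        printf 'FAIL %s -> %s\n' "$src" "$dst"
        failures=$((failures + 1))
      fi
    done
  done
  [ "$failures" -eq 0 ]
}
```

With five nodes this makes 20 attempts, so the 5-second timeout keeps the worst case bounded.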
Task 5: Check Time Synchronization
Files:
- N/A (SSH commands)
Step 1: Check time synchronization
Run the following command against each node:
ssh aya01 "timedatectl"
ssh lulu "timedatectl"
ssh inko01 "timedatectl"
ssh naruto01 "timedatectl"
ssh mii01 "timedatectl"
Expected: Output showing the time synchronization status for each node.
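Reading five `timedatectl` outputs side by side makes small clock differences easy to miss. The sketch below collects each node's epoch time and its `NTPSynchronized` property, then reports the spread. `clock_report` and `max_skew` are hypothetical helper names. Because the nodes are queried one after another, SSH latency inflates the number, so treat it as a rough upper bound rather than a precise measurement.

```shell
NODES="aya01 lulu inko01 naruto01 mii01"

# Print "node epoch_seconds ntp_synchronized" per node, or "node unreachable".
clock_report() {
  for node in $NODES; do
    line=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" \
      'echo "$(date +%s) $(timedatectl show -p NTPSynchronized --value)"' 2>/dev/null)
    printf '%s %s\n' "$node" "${line:-unreachable}"
  done
}

# Read clock_report output on stdin and print the difference in seconds
# between the highest and lowest reported clocks (0 if none were readable).
max_skew() {
  awk '$2 ~ /^[0-9]+$/ {
         v = $2 + 0
         if (!seen || v < min) min = v
         if (!seen || v > max) max = v
         seen = 1
       }
       END { print (seen ? max - min : 0) }'
}
```

For example: `clock_report | tee /dev/stderr | max_skew`.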
Task 6: Document Findings
Files:
- Create:
/tmp/cluster_debugging_findings.txt
Step 1: Document findings
Document the findings in a file:
echo "Cluster Debugging Findings:" > /tmp/cluster_debugging_findings.txt
echo "Proxmox VE Versions:" >> /tmp/cluster_debugging_findings.txt
cat /tmp/proxmox_versions.txt >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Cluster Status:" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "pvecm status" >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Node Membership:" >> /tmp/cluster_debugging_findings.txt
ssh aya01 "pvecm nodes" >> /tmp/cluster_debugging_findings.txt
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Corosync Logs:" >> /tmp/cluster_debugging_findings.txt
# Include the affected nodes (naruto01, mii01) as well as aya01, and label each log.
for node in aya01 naruto01 mii01; do
  echo "--- $node ---" >> /tmp/cluster_debugging_findings.txt
  ssh "$node" "journalctl -u corosync -n 500 --no-pager" >> /tmp/cluster_debugging_findings.txt
done
echo "" >> /tmp/cluster_debugging_findings.txt
echo "Time Synchronization:" >> /tmp/cluster_debugging_findings.txt
# Label each block so the output can be attributed to a node.
for node in aya01 lulu inko01 naruto01 mii01; do
  echo "--- $node ---" >> /tmp/cluster_debugging_findings.txt
  ssh "$node" "timedatectl" >> /tmp/cluster_debugging_findings.txt
done
Expected: File /tmp/cluster_debugging_findings.txt with all findings.
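The repeated `>>` redirections can be reduced with a small wrapper that labels each command's output. This is one possible sketch; `section` is a hypothetical helper. It also records a failed command in the file instead of leaving a silent gap, which is useful when some nodes may not respond.

```shell
# Run a command and print its output (stdout and stderr) under a labelled header.
# If the command fails, note its exit status in the output.
section() {
  title=$1
  shift
  printf '== %s ==\n' "$title"
  "$@" 2>&1 || printf '(command failed with exit status %s)\n' "$?"
  printf '\n'
}
```

The calls can then be grouped under a single redirection, for example: `{ section "Cluster Status" ssh aya01 pvecm status; section "Node Membership" ssh aya01 pvecm nodes; } > /tmp/cluster_debugging_findings.txt`.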
Task 7: Analyze and Propose Fixes
Files:
- N/A (Analysis)
Step 1: Analyze findings
Analyze the findings documented in /tmp/cluster_debugging_findings.txt to identify the root cause of the issue.
Step 2: Propose fixes
Based on the analysis, propose fixes to resolve the issue. Document the proposed fixes in a file:
echo "Proposed Fixes:" > /tmp/proposed_fixes.txt
# Add proposed fixes here
Expected: File /tmp/proposed_fixes.txt with proposed fixes.
Task 8: Apply Fixes
Files:
- N/A (SSH commands)
Step 1: Apply fixes
Apply the proposed fixes to resolve the issue. Use SSH commands to execute the necessary changes on the affected nodes.
Expected: Issue resolved and cluster functioning as expected.
Task 9: Verify Resolution
Files:
- N/A (SSH commands)
Step 1: Verify resolution
Verify that the issue is resolved by checking the Web UI and running the following commands:
ssh aya01 "pvecm status"
ssh aya01 "pvecm nodes"
Expected: All nodes visible and operational in the Web UI, cluster status showing quorum, and all nodes listed as active members.
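The membership check can be made mechanical by counting member rows and comparing against the five nodes in this plan. The sketch below assumes that each member row of `pvecm nodes` begins with a numeric node ID followed by a numeric vote count; confirm that against the real output first. `count_members` and `EXPECTED_NODES` are hypothetical names.

```shell
# Number of nodes named in this plan.
EXPECTED_NODES=5

# Read `pvecm nodes` output on stdin and print how many member rows it contains.
# Assumption: member rows start with two numeric columns (node ID, votes).
count_members() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}
```

For example: `[ "$(ssh aya01 "pvecm nodes" | count_members)" -eq "$EXPECTED_NODES" ] && echo "all members present"`. This complements, but does not replace, looking at the Web UI.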
Task 10: Document Changes
Files:
- Create:
/tmp/cluster_debugging_changes.txt
Step 1: Document changes
Document the changes made to resolve the issue:
echo "Changes Made:" > /tmp/cluster_debugging_changes.txt
# Add changes here
Expected: File /tmp/cluster_debugging_changes.txt with documented changes.
Task 11: Commit Documentation
Files:
- Modify:
/home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
Step 1: Update design document
Update the design document with the findings, proposed fixes, and changes made:
echo "## Findings" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
cat /tmp/cluster_debugging_findings.txt >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "## Proposed Fixes" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
cat /tmp/proposed_fixes.txt >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "## Changes Made" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
echo "" >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
cat /tmp/cluster_debugging_changes.txt >> /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
Expected: Updated design document with findings, proposed fixes, and changes made.
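The same update can be sketched with one helper that appends a heading and a file's contents. `append_section` is a hypothetical name; the `DOC` path is the one used above. Unlike the plain `cat` calls, it refuses to append a section whose source file is missing or empty, so the design document does not end up with bare headings.

```shell
DOC=/home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md

# Append "## <heading>", a blank line, the contents of <src>, and a trailing
# blank line to <doc>. Skips (and returns non-zero) if <src> is missing or empty.
append_section() {
  heading=$1
  src=$2
  doc=$3
  if [ ! -s "$src" ]; then
    echo "skipping '$heading': $src is missing or empty" >&2
    return 1
  fi
  { printf '## %s\n\n' "$heading"; cat "$src"; printf '\n'; } >> "$doc"
}
```

For example: `append_section "Findings" /tmp/cluster_debugging_findings.txt "$DOC"`, followed by the equivalent calls for the proposed fixes and the changes made.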
Step 2: Commit changes
Commit the changes to the design document:
git add /home/tudattr/workspace/ansible/docs/plans/2026-03-01-proxmox-cluster-debugging-design.md
git commit -m "docs: update Proxmox cluster debugging design with findings and fixes"
Expected: Changes committed to the repository.
Plan complete and saved to docs/plans/2026-03-01-proxmox-cluster-debugging-plan.md. Two execution options:
1. Subagent-Driven (this session) - I dispatch a fresh subagent per task and review between tasks, for fast iteration
2. Parallel Session (separate) - Open a new session with executing-plans for batch execution with checkpoints
Which approach?