A VM can sometimes get stuck in a boot cycle and fail to boot properly, which leaves you unable to access it via SSH. This article shows how to confirm from the boot logs that a VM is stuck in a boot cycle, and how to regain access to the VM.
Prerequisites
Solution
1. With the CLI installed, Crusoe lets you connect to your VMs through serial console access. This acts as a "backdoor" when the VM cannot be reached via SSH.
crusoe compute vms serial-console --name <vm-name> --project-id <project-id>
2. If the VM is stuck in a boot cycle, the serial console shows boot logs like the following (note the `watchdog: BUG: soft lockup` line):
[4524639.525398] kthread+0x127/0x150
[4524639.525711] ? set_kthread_struct+0x50/0x50
[4524639.526096] ret_from_fork+0x1f/0x30
[4524639.526430] </TASK>
[4524639.785670] watchdog: BUG: soft lockup - CPU#145 stuck for 3413s! [sshd:3415922]
[4524639.786425] Modules linked in: nfsv3 nfsv4 nfs fscache netfs veth rpcsec_gss_krb5 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc gdrdrv(POE) cuse nvme_fabrics overlay rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) kvm_intel binfmt_misc nls_iso8859_1 kvm mlx5_ib(OE) ib_uverbs(OE) nvidia_uvm(POE) nfsd tcp_bbr sch_fq auth_rpcgss nvidia_peermem(POE) nfs_acl ib_core(OE) lockd dm_multipath grace scsi_dh_rdac scsi_dh_emc scsi_dh_alua knem(OE) efi_pstore sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 multipath linear nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) raid0 crct10dif_pclmul crc32_pclmul drm_kms_helper syscopyarea ghash_clmulni_intel mlx5_core(OE) sha256_ssse3 sysfillrect
[4524639.786472] sysimgblt sha1_ssse3 pci_hyperv_intf aesni_intel mlxdevm(OE) fb_sys_fops mlx_compat(OE) crypto_simd nvme cec cryptd tls rc_core mlxfw(OE) nvme_core psample drm virtio_rng virtio_blk
[4524639.796066] CPU: 145 PID: 3415922 Comm: sshd Tainted: P D OEL 5.15.0-107-generic #117-Ubuntu
[4524639.796937] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[4524639.797526] RIP: 0010:__pv_queued_spin_lock_slowpath+0x78/0x2e0
[4524639.798080] Code: 00 00 00 48 c7 03 00 00 00 00 31 f6 65 8b 05 ef 7f 2d 5d c7 43 08 00 00 00 00 89 43 10 c6 43 14 00 eb 07 80 e6 ff 75 23 f3 90 <41> 8b 14 24 66 85 d2 75 10 89 f0 f0 41 0f b0 0c 24 84 c0 0f 84 bd
[4524639.799861] RSP: 0018:ff6a0a4316eabcb8 EFLAGS: 00000246
[4524639.800359] RAX: 0000000000000091 RBX: ff1d683a5f672240 RCX: 0000000000000001
3. Reset the VM using the following command:
crusoe compute vms reset <vm-name> --project-id <project-id>
4. Wait for the reset to complete, then try SSH access again.
5. If the issue persists, reach out to Crusoe Support.
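The log check in step 2 can be sketched as a small shell helper. This is a minimal sketch, not part of the Crusoe CLI: it assumes you saved the serial console output to a local file (for example with your terminal's session logging), and the function name `count_soft_lockups` and the file name `console.log` are hypothetical.

```shell
#!/usr/bin/env bash

# Count "soft lockup" reports in a saved serial console capture.
# Usage: count_soft_lockups <log-file>
# Assumption: <log-file> is a plain-text copy of the console output.
count_soft_lockups() {
  local log_file="$1"
  # grep -c prints the number of matching lines. It exits non-zero when
  # there are no matches, so "|| true" keeps the function's status at 0.
  grep -c -F 'watchdog: BUG: soft lockup' "$log_file" || true
}
```

For example, `count_soft_lockups console.log` prints the number of soft lockup reports in the capture. A non-zero count matches the log signature shown in step 2.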
Additional Resources
- Review the startup and shutdown lifecycle scripts to see whether they are causing any of the boot issues. These scripts can be found in the following directory:
# ls -al /usr/local/bin/crusoe/
- Avoid adding ephemeral disks to `/etc/fstab`. Misconfigured disk mounts can cause boot issues and can make the VM boot into emergency mode. See How-To Recover a VM booting in Emergency Mode.
- Please ensure third-party drivers are tested for compatibility with the VM's kernel version before you install them.
- If you do not have the serial console set up, or would like to set it up for your VMs in the future, refer to the following article:
How To Setup Serial Console
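To audit `/etc/fstab` for mounts that could block boot, a helper along these lines may be useful. It is a minimal sketch under stated assumptions: the function name `list_required_mounts` is hypothetical, and it relies on the standard fstab behavior that an entry without the `nofail` option is treated as required at boot. It only lists candidates and does not decide which entries are ephemeral disks.

```shell
#!/usr/bin/env bash

# Print the mount point of every fstab entry that lacks the "nofail" option.
# Usage: list_required_mounts [path-to-fstab]   (defaults to /etc/fstab)
list_required_mounts() {
  local fstab="${1:-/etc/fstab}"
  # Skip blank lines and comments. In fstab, field 2 is the mount point
  # and field 4 is the comma-separated option list.
  awk '
    /^[ \t]*(#|$)/ { next }
    NF >= 4 && $4 !~ /(^|,)nofail(,|$)/ { print $2 }
  ' "$fstab"
}
```

The root filesystem will normally appear in the output, which is expected. Any ephemeral disk that shows up here is worth a second look.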