
Unable to provision WEF - stalls at "shutting down" #517

Open · OpalSec opened this issue Sep 16, 2020 · 32 comments

@OpalSec commented Sep 16, 2020

  • Operating System Version: Server 2019
  • Deploying via (VirtualBox/VMWare/AWS/Azure/ESXi): VirtualBox
  • Vagrant Version (if applicable): 2.2.9

Please verify that you are building from an updated Master branch before filing an issue.

Description of the issue:

Vagrant provisioning stalls when attempting to shut down the wef host. I've tried destroying and reloading with --provision several times, and rebooting the host in case it was in a bad state, but nothing has helped. There's no debug output to help figure out why, either.

[screenshot]

@OpalSec (Author) commented Sep 16, 2020

Upgraded to Vagrant 2.2.10 - issue persists.

The VM itself appears to have frozen. This keeps happening even if I kill the process and rebuild with vagrant reload --provision, or destroy it and build from scratch with vagrant up.
[screenshot]

@clong (Owner) commented Sep 16, 2020

Hi @OpalSec, do you mind sharing which version of VirtualBox you're using so I can try to reproduce?

@clong added the triage label Sep 16, 2020
@OpalSec (Author) commented Sep 17, 2020

@clong VirtualBox v6.1.12

@clong (Owner) commented Sep 17, 2020

@OpalSec can you try:

Edit Vagrant/scripts/join-domain.ps1 and add the following lines after line 37:

```powershell
Set-Service tiledatamodelsvc -StartupType Disabled
Stop-Service tiledatamodelsvc
```

The file should then show:

```powershell
# Stop Windows Update
Write-Host "$('[{0:HH:mm}]' -f (Get-Date)) Disabling Windows Updates and Windows Module Services"
Set-Service wuauserv -StartupType Disabled
Stop-Service wuauserv
Set-Service TrustedInstaller -StartupType Disabled
Stop-Service TrustedInstaller
Set-Service tiledatamodelsvc -StartupType Disabled
Stop-Service tiledatamodelsvc
```

I ask you to try this because I've seen tiledatamodelsvc be a service that hangs during shutdown. You'll have to destroy and recreate the wef host, though:

```shell
vagrant destroy -f wef
vagrant up wef
```
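For what it's worth, a quick way to confirm from inside the guest that these services actually ended up stopped and disabled. A minimal sketch using the same service names as above; run from an elevated PowerShell prompt in the wef guest:

```powershell
# Show the status and startup type of the services disabled above.
Get-Service -Name wuauserv, TrustedInstaller, tiledatamodelsvc |
    Select-Object Name, Status, StartType
```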
@OpalSec (Author) commented Sep 17, 2020

It gets to the point of executing the Set-Service command but is blocked by a permission error:
[screenshot]

Running it interactively from an Administrator PowerShell terminal fails the same way:
[screenshot]

Attempting to stop the service before setting the startup type also didn't work:
[screenshot]

@OpalSec (Author) commented Sep 17, 2020

I should also note that Issue #510 persists, even though it's been closed. Commenting out those lines fixed the problem for me, but it took a while to find this solution.

@clong (Owner) commented Sep 17, 2020

Are you sure you're on a recent commit? That definitely shouldn't still be happening.

@OpalSec (Author) commented Sep 17, 2020

Yeah, sorry, I had reverted to a working snapshot that was on an older commit. I fetched the latest and re-attempted, and the error is the same as above:
[screenshot]

But yeah, disregard my comment on #510; that was my mistake.

@clong (Owner) commented Sep 21, 2020

Hi @OpalSec - sorry, I haven't been able to reproduce this on my system yet and haven't yet had a chance to try it on a Win2019 machine. Will try to get to it this week!

@OpalSec (Author) commented Sep 22, 2020

No worries. Sorry I'm not being super helpful, let me know if you need me to run more tests in my environment!

@ZeArioch (Contributor) commented Sep 22, 2020

For what it's worth, I'm encountering the same issue with the following setup:

  • Freshly git pull'ed
  • OS: Ubuntu 20.04
  • Provider: VirtualBox 6.1.14 (running in headless mode)
  • Vagrant: 2.2.10

Haven't been able to pinpoint the issue but I'm still poking around.

@liviurosioara commented Oct 14, 2020

I encountered the same scenario several times a few months back, on both Windows and macOS, only with VirtualBox. I recall trying the 6.1 and 6.0 versions. Specifically, WEF remained in the "shutting down" status and I needed to kill the process in order to progress further.

I opened an issue back in March and posted some logs and screenshots; perhaps it helps: #400

@clong (Owner) commented Oct 14, 2020

Hi folks, I'm really sorry for the delay! See this post for my excuse :) https://twitter.com/DetectionLab/status/1311432542289948672

I'm spinning up a bare-metal Server 2019 machine in an attempt to reproduce this, as I'm unable to do so with the Win10 host I have at home.

I don't believe whatever is happening here is directly Vagrant- or VirtualBox-related. I imagine something in Windows is preventing the OS from shutting down correctly, but it's not always easy to narrow that down.

I just pushed a code change to master for join-domain.ps1 that may or may not help with this; can anyone pull the latest and retest?

@clong (Owner) commented Oct 15, 2020

Also, would anyone be willing to jump on a video call with me and screen-share so we can potentially pinpoint the issue together? I'm having trouble setting up an environment that reproduces this issue.

@liviurosioara commented Oct 15, 2020

@clong, this is excellent news! Congratulations on the new addition to the team!!!

I personally think this is not OS-related, because the Windows guest shuts down properly but the VirtualBox VM window somehow gets stuck (POWERING_OFF_LS failed, because the VM state is actually OFF).

I have the following log from the previously opened issue:

```
00:02:00.973624 Changing the VM state from 'POWERING_OFF' to 'OFF'
00:02:00.974201 VMR3PowerOff:
00:02:00.974234 RUNNING -> POWERING_OFF, SUSPENDED -> POWERING_OFF, DEBUGGING -> POWERING_OFF, LOAD_FAILURE -> POWERING_OFF, GURU_MEDITATION -> POWERING_OFF, FATAL_ERROR -> POWERING_OFF, CREATED -> POWERING_OFF, RUNNING_LS -> POWERING_OFF_LS, DEBUGGING_LS -> POWERING_OFF_LS, GURU_MEDITATION_LS -> POWERING_OFF_LS, FATAL_ERROR_LS -> POWERING_OFF_LS failed, because the VM state is actually OFF
00:02:00.974393 VMSetError: F:\tinderbox\win-6.0\src\VBox\VMM\VMMR3\VM.cpp(3617) int __cdecl vmR3TrySetState(struct VM *,const char *,unsigned int,...); rc=VERR_VM_INVALID_VM_STATE
00:02:00.975025 VMSetError: VMR3PowerOff failed because the current VM state, OFF, was not found in the state transition table (old state FATAL_ERROR_LS)
00:02:00.982269 ERROR [COM]: aRC=VBOX_E_VM_ERROR (0x80bb0003) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={Could not power off the machine. (Error: VERR_VM_INVALID_VM_STATE)}, preserve=false aResultDetail=-1907
```
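This failure signature can be spotted mechanically in a VBox.log. A minimal sketch: the scratch-file path is illustrative, and in real use you would point grep at the VM's actual VBox.log (it normally lives in the VM's Logs folder):

```shell
# Write a two-line excerpt of the log above to a scratch file so the
# pipeline is self-contained; in real use, use the VM's actual VBox.log.
cat > /tmp/vbox_excerpt.log <<'EOF'
00:02:00.975025 VMSetError: VMR3PowerOff failed because the current VM state, OFF, was not found in the state transition table (old state FATAL_ERROR_LS)
00:02:00.982269 ERROR [COM]: aRC=VBOX_E_VM_ERROR (0x80bb0003) ... Could not power off the machine. (Error: VERR_VM_INVALID_VM_STATE)
EOF

# A non-zero count means the guest already reached OFF but VirtualBox's
# power-off request failed, i.e. the stuck-at-"shutting down" condition:
grep -c -e 'VERR_VM_INVALID_VM_STATE' -e 'FATAL_ERROR_LS' /tmp/vbox_excerpt.log
# prints 2
```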

@clong (Owner) commented Oct 15, 2020

@liviurosioara thank you! You are totally right; I'm so used to the issues being OS-related that I totally glossed over the VirtualBox logs. I came across this forum post, which seems to outline a similar-ish issue (although it's like 10 years old): https://www.virtualbox.org/ticket/8294

Apparently enabling the Host I/O Cache on the storage controller fixed their problem. This is a total hail mary, but can you check whether Host I/O Cache is enabled on the storage controller for your WEF host?

[screenshot]
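For what it's worth, this can also be checked and toggled from the command line with VBoxManage. A sketch, assuming the VM is registered as "wef"; the controller name "SATA Controller" is a guess, so substitute whatever the first command reports:

```shell
# Inspect the storage controller settings for the VM (name "wef" assumed):
VBoxManage showvminfo "wef" --machinereadable | grep -i 'storagecontroller'

# Enable Host I/O Cache on the controller; "SATA Controller" is a placeholder,
# substitute the name reported above. The VM must be powered off first.
VBoxManage storagectl "wef" --name "SATA Controller" --hostiocache on
```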

@liviurosioara commented Oct 16, 2020

Hi,

I wanted to give it a try today with VirtualBox 6.1 on macOS 10.15.7. However, I constantly run into the same error while WEF is being provisioned:

```
[14:42] Now join the domain...

HasSucceeded : True
ComputerName : wef

WARNING: The changes will take effect after you restart the computer wef.

AutoEndTasks : 1
PSPath       : Microsoft.PowerShell.Core\Registry::HKEY_CURRENT_USER\Control Panel\Desktop
PSParentPath : Microsoft.PowerShell.Core\Registry::HKEY_CURRENT_USER\Control Panel
PSChildName  : Desktop
PSDrive      : HKCU
PSProvider   : Microsoft.PowerShell.Core\Registry

[14:42] Disabling Windows Updates and Windows Module Services
```

Stderr from the command:

```
powershell.exe : Set-ItemProperty : A parameter cannot be found that matches parameter name 'PropertyType'.
    + CategoryInfo          : NotSpecified: (Set-ItemPropert...'PropertyType'.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
At C:\vagrant\scripts\join-domain.ps1:27 char:129
+ ... anager\Power' -Name 'HiberbootEnabled' -Value 0 -PropertyType DWord - ...
+                                                     ~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Set-ItemProperty], ParameterBindingException
    + FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.SetItemPropertyCommand
```
@clong (Owner) commented Oct 16, 2020

My fault; I just introduced this, and it's fixed now.

Set-ItemProperty uses -Type as its argument, while New-ItemProperty uses -PropertyType.

PowerShell 🙄
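To illustrate the distinction, a throwaway sketch against an arbitrary writable registry key (HKCU:\Environment and the value name DLDemo are just illustrative):

```powershell
# New-ItemProperty creates a registry value and names its type with -PropertyType:
New-ItemProperty -Path 'HKCU:\Environment' -Name 'DLDemo' -Value 0 -PropertyType DWord -Force

# Set-ItemProperty updates the value and takes -Type instead of -PropertyType:
Set-ItemProperty -Path 'HKCU:\Environment' -Name 'DLDemo' -Value 1 -Type DWord

# Clean up the demo value:
Remove-ItemProperty -Path 'HKCU:\Environment' -Name 'DLDemo'
```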

@liviurosioara commented Oct 16, 2020

I tried provisioning with VirtualBox again and it's exactly the same scenario: WEF cannot complete the restart and remains stuck in the "Shutting down" status (the VM window is grey, so it is not stuck in the OS).

I checked, and the Host I/O Cache setting on the controller is enabled on WEF as well as on the two already-provisioned hosts, Logger and DC.

Vagrant output:

```
    wef: [16:02] Now join the domain...
    wef: HasSucceeded : True
    wef: ComputerName : wef
    wef: WARNING: The changes will take effect after you restart the computer wef.
    wef: AutoEndTasks : 1
    wef: PSPath       : Microsoft.PowerShell.Core\Registry::HKEY_CURRENT_USER\Control Panel\Desktop
    wef: PSParentPath : Microsoft.PowerShell.Core\Registry::HKEY_CURRENT_USER\Control Panel
    wef: PSChildName  : Desktop
    wef: PSDrive      : HKCU
    wef: PSProvider   : Microsoft.PowerShell.Core\Registry
    wef: [16:02] Disabling Windows Updates and Windows Module Services
==> wef: Running provisioner: reload...
==> wef: Attempting graceful shutdown of VM...
==> wef: Forcing shutdown of VM...
```

VirtualBox WEF log:

```
00:02:41.256982 Console: Machine state changed to 'Stopping'
00:02:41.257742 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0)
00:02:43.128403 PDMR3PowerOff: 59 978 420 231 ns run time
00:02:43.128422 Changing the VM state from 'POWERING_OFF' to 'OFF'
00:02:43.128442 VMR3PowerOff:
00:02:43.128457 RUNNING -> POWERING_OFF, SUSPENDED -> POWERING_OFF, DEBUGGING -> POWERING_OFF, LOAD_FAILURE -> POWERING_OFF, GURU_MEDITATION -> POWERING_OFF, FATAL_ERROR -> POWERING_OFF, CREATED -> POWERING_OFF, RUNNING_LS -> POWERING_OFF_LS, DEBUGGING_LS -> POWERING_OFF_LS, GURU_MEDITATION_LS -> POWERING_OFF_LS, FATAL_ERROR_LS -> POWERING_OFF_LS failed, because the VM state is actually OFF
00:02:43.128504 VMSetError: /Users/vbox/tinderbox/6.1-mac-rel/src/VBox/VMM/VMMR3/VM.cpp(3338) int vmR3TrySetState(PVM, const char *, unsigned int, ...); rc=VERR_VM_INVALID_VM_STATE
00:02:43.128523 VMSetError: VMR3PowerOff failed because the current VM state, OFF, was not found in the state transition table (old state FATAL_ERROR_LS)
00:02:43.129379 ERROR [COM]: aRC=VBOX_E_VM_ERROR (0x80bb0003) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={Could not power off the machine. (Error: VERR_VM_INVALID_VM_STATE)}, preserve=false aResultDetail=-1907
```

@clong (Owner) commented Oct 17, 2020

@liviurosioara just to double-check a few things:

  • Are you using the latest 1.8 boxes?
  • Is this happening for you on both Windows and macOS?
  • Which exact version of VirtualBox are you using? 6.1.x?
  • Is the "host" machine a VM or a physical host?
  • Is VT-x enabled in the BIOS?

I'm really at a loss here; this seems like a VirtualBox issue, but I don't have a good sense of what's causing it. I'll keep trying to reproduce.

@clong closed this Oct 17, 2020
@clong (Owner) commented Oct 17, 2020

Oops, clicked the wrong button.

@clong reopened this Oct 17, 2020
@liviurosioara commented Oct 17, 2020

Hi,

  • Are you using the latest 1.8 boxes? Yes, all are 1.8.
  • Is this happening on both Windows and macOS? Yesterday I tested only on macOS, but I had exactly the same issue on Windows in the past.
  • Which exact version of VirtualBox? It is 6.1.14 now, but I recall that 6.1.4 and 6.0 behaved the same.
  • Is the "host" machine a VM or a physical host? A physical iMac (a Mac mini in the past).
  • Is VT-x enabled in the BIOS? On the iMac I can't touch this setting, but it appears to be enabled.

I also have a feeling this is a VirtualBox issue. Because it almost always failed in my case (except perhaps once, which makes things look even stranger), I purchased a Vagrant license and now mainly use VMware Fusion. It's no longer critical for me, but I hope we can help others as well.

I recall that provisioning WEF individually succeeds; I think that was one workaround.

@clong (Owner) commented Oct 17, 2020

@liviurosioara oddly enough I feel like I ran into this exact issue at some point in time months ago, but it seems to have self-resolved somehow. Thanks so much for providing the information, it definitely helps me figure out how I can reproduce the error and narrows down the cause.

@liviurosioara commented Oct 17, 2020

I just ran another attempt; this is the timeline:

  1. vagrant up --provider=virtualbox

     Logger: ok
     DC: ok
     WEF: stuck at "shutting down"; the only way to move on is to "Force Quit" the WEF VM, otherwise it stays in this state (the close button doesn't work)

  2. vagrant destroy wef, to clean up (left Logger & DC up and running)

  3. vagrant up wef --provider=virtualbox

     This time the reboot after joining the domain is fine, and WEF is provisioned.

[screenshot]

@liviurosioara commented Oct 17, 2020

I am wondering what the difference could be between provisioning in sequence versus one by one, e.g. vagrant up vs. vagrant up wef. The latter seems to be fine.
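If that observation holds, one workaround until the root cause is found is to bring the hosts up one at a time rather than with a single vagrant up. A minimal sketch (host names are the ones mentioned in this thread; adjust to your lab):

```shell
# Workaround sketch: provision each host individually, since `vagrant up wef`
# on its own seems to succeed where a combined `vagrant up` stalls.
for host in logger dc wef; do
  vagrant up "$host" --provider=virtualbox
done
```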

@clong (Owner) commented Oct 20, 2020

I was finally able to repro this!
[screenshot]

@OpalSec (Author) commented Oct 20, 2020

Hey @clong sorry for not responding - I've been caught up with other things.

Not sure if you'd like me to try anything else, but for the record, I did upgrade Vagrant to the latest version and pull the latest commit before trying a full re-deploy, and it stalled trying to shut down the VM again. I also checked that the Host I/O Cache option was selected on wef, and it was.

Unlike in @liviurosioara's case, destroying the wef VM and spinning it up again with vagrant up wef --provider=virtualbox didn't resolve the issue for me.

Let me know if you'd like any specific logs or for me to try anything on my environment to help out.

Congrats on the kid by the way; I hope you're brushed up on your dad jokes, it's your time to shine!

@mohi1305 commented Nov 4, 2020

I have experienced this issue too, on macOS. I think it is a VirtualBox problem. In any case, destroying WEF and bringing it up again works; it may not succeed on the first attempt, but it eventually will.

@clong (Owner) commented Nov 14, 2020

Yeah, I'm continuing to see this problem and the root cause definitely appears to be an issue with VirtualBox. I'll go ahead and open up an issue on their bug tracker, but I don't expect this to be resolved anytime soon :(

@clong (Owner) commented Nov 14, 2020

Bug report is here: https://www.virtualbox.org/ticket/20035

@clong added the vbox-bug label and removed the triage label Nov 14, 2020
@iGeek098 commented Nov 30, 2020

Dears @OpalSec @clong @liviurosioara,

I had this exact same issue and tried many things, and it works now. Specifically, I converted the CentOS .rpm version of Vagrant to .deb (using alien), uninstalled Vagrant, and then reinstalled the converted package (someone on Medium warned about this with Splunk's attack_range_local), and it seems to work fine now.

(For what it's worth, the machine had other instability issues before I did the above; I don't know why. It hung occasionally while I was troubleshooting on it, etc.)

@iGeek098 commented Nov 30, 2020

Unfortunately, the error came back. Sorry.
