Skip to content
🗿 Archive

Better Signal Handling #29480

🗿 Archive
1y ago
· 24 replies
🗿 Archive
1y ago
· 24 replies

I am not sure whether this is a question, bug report, or feature request, but here we go.

At the moment, it is really difficult to handle exits/signals correctly. Consider this:

async function someTask() {
  try {
    await someOtherTask();
  } finally {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  }
}

This async function does not work "correctly". If the process received a signal causing it to terminate, or, for some reason, node decided to exit due to the event loop becoming empty while waiting for the other task, the finally block do not run at all.

In "synchronous" languages like Java, Ruby, Python, signals that are reasonable to handle are sometimes converted into exceptions and delivered to the "currently running code". However, maybe that doesn't quite work in Node/JavaScript's model, since the function above is in fact not "running", and it's unclear who should be getting these exceptions.

I understand that there is no such thing as "surefire cleanup", since your process can always be terminated forcefully without a chance to run any code (SIGKILL, etc), so you must be prepared to deal with the consequence of an unclean exit anyway.

However, the user pressing Ctrl-C (SIGINT) isn't that unusual of a situation, and is usually expected to be as a "clean" exit, so it would be nice if we can properly cleanup after our code. Maybe in our function, we created a PID file that we would like to remove, or maybe we are supposed to shutdown some children processes.

Okay, so maybe we try something like this:

async function fixedFinally(task, cleanup) {
  process.on('SIGINT', cleanup);
  process.on('SIGTERM', cleanup);
  // ...

  try {
    await task();
  } finally {
    process.off('SIGINT', cleanup);
    process.off('SIGTERM', cleanup);
    // ...

    await cleanup();
  }  
}

async function someTask() {
  await fixedFinally(someOtherTask, async () => {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  });
}

But now, we caused another problem: when we do receive these signals (e.g. Ctrl-C), the program no longer terminates. This is because by installing any handler for these signal events, it suppresses the default handler's behavior (terminate the program) and, as far as I know, there is no way to invoke that deafult behavior.

Okay, so maybe we can just mimic the behavior of the default handler ourselves?

async function fixedFinally(task, cleanup) {
  async function handler() {
    await cleanup();
    process.exit(0);
  }

  process.on('SIGINT', handler);
  process.on('SIGTERM', handler);
  // ...

  try {
    await task();
  } finally {
    process.off('SIGINT', handler);
    process.off('SIGTERM', handler);
    // ...

    await cleanup();
  }  
}

async function someTask() {
  await fixedFinally(someOtherTask, async () => {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  });
}

This caused multiple new problems.

First, the default behavior of the signal handlers actually exit with a non-zero exit code, according to the docs, 128 + signal value. Presumably, this is for good reason and a behavior we want to match. Maybe this is not such a big deal, we could probably read the docs or source code to copy the logic.

But more importantly, by calling process.exit(...) in the signal handler, the rest of the handlers (if any) will be skipped over! So that means if more than one async task wanted to perform cleanup in this manner, only one of them can succeed.

More over, maybe some other code intentionally installed a signal handler here to avoid exiting in the first place. This, in practice, means that it is unsafe for libraries to install signal handlers to perform cleanup.

Okay, so since all we really care about is to perform our cleanup on exit, so perhaps this is all we needed:

async function fixedFinally(task, cleanup) {
  process.on('exit', cleanup);
  // ...

  try {
    await task();
  } finally {
    process.off('exit', cleanup);
    // ...

    await cleanup();
  }  
}

async function someTask() {
  await fixedFinally(someOtherTask, async () => {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  });
}

Unfortunately, this does not work either, because the exit event is not fired when the process is terminated by the default SIGINT handler (maybe this is a bug?).

So at this point, I feel like I am not left with any good option to do async cleanup properly, especially if you are writing a library and do not want to influence the global shutdown behavior just to accomplish your clanup goals.

As far as concrete suggestions on what should be done, I'm not quite sure, but here are a few points to consider.

  • From a DX perspective, it would be extremely nice if the async try { ... } finally { ... } Just Worked™ as it would in other environments. I understand why this is perhaps a poor fit for the Node/JavaScript programming model though.

  • It would have been nice if the events handlers here are designed like the browser's event APIs, where you are just attaching some extra logic to run, and does not cancel the browser's default behavior by default, but there are ways to do that if that's what you wish to do (preventDefault/stopPropagation). However, that ship has probably sailed.

  • Under the current design, some way to invoke what-would-have-been the default handler's behavior would be nice. The current behavior is like click handlers in the browser have preventDefault() on by default with no way to opt back in.

  • However, while the above is useful for "apps", where get to decide what shutdown behavior you want, and just want to run some extra logic before handing things back to the default handler, it is not enough to solve the problem for libraries.

  • For libraries, you really need a way to say "Okay, I did what I needed to do, now just keep going as if I wasn't here", which means 1) calling the rest of the signal handlers and 2) exit if no one else objects.

  • Maybe the last point can be accomplished by just firing the exit event more consistently. But I'm worried that there are other edge cases that causes the exit event to not fire, other than the SIGINT example I discovered.

  • It is also possible that I just missed some common patterns or idioms to solve this problem?

Replies

I have finished typing! Reopening!

1y

@devsnek devsnek
Collaborator

It's worth noting that SIGINT doesn't necessarily mean the program will exit. It means the system has requested that the program interrupt whatever work it's actively doing. For example in the REPL, sending SIGINT stops whatever is currently evaluating, it doesn't close the REPL. This is also why SIGINT doesn't guarantee the exit event.

I think a key point here is that SIGINT is top-level, big public behavior, so individual functions shouldn't have a concept of SIGINT for cleanup, since other things could also cause cleanup. You'd want to bundle that all into some sort of "cleanup" interface and then flow your various conditions for cleanup into that.

Also, there is also some work going into defining better ways of cleaning up resources: https://github.com/tc39/proposal-explicit-resource-management

The specific example of SIGINT not causing is this:

console.log(`PID ${process.pid}`);

process.on("exit", (code, signal) => {
  console.log("exit called!", { code, signal });
});

setInterval(() => {
  console.log('Still alive!')
}, 1000);

setTimeout(() => {
  console.log("Goodbye!");
  process.exit(0);
}, 10000);

process.stdin.resume();

If you run this and then send Ctrl-C (or kill -2), it does not fire the exit event. Same for SIGTERM.

Also, there is also some work going into defining better ways of cleaning up resources: https://github.com/tc39/proposal-explicit-resource-management

I was there for the presentation, but I don't think it helps. It's not that try/finally in JavaScript does not work in correctly in general, it's that signals and exits don't cause an exception (or return, or any other kind of completion event) from the perspective of the async function, so the finally block is not called. I don't think the proposal changes anything in this regard.

@devsnek devsnek
Collaborator

your example there is interesting. in general we can't fire an event for sigint exits if js is still running, the idea is to interrupt it after all, and js doesn't really work with reentrancy. I would, however, expect the event to be fired it if happens while waiting on the event loop. Perhaps it doesn't for consistency reasons.

I would, however, expect the event to be fired it if happens while waiting on the event loop.

Right, that's why I added the setInterval to illustrate that it is not stuck in some blocking sync code.

1y

@chancancode chancancode
Author

I think a key point here is that SIGINT is top-level, big public behavior, so individual functions shouldn't have a concept of SIGINT for cleanup, since other things could also cause cleanup. You'd want to bundle that all into some sort of "cleanup" interface and then flow your various conditions for cleanup into that.

What should libraries do?

Let's say, you want to provide a withTmpDir that creates a temp directory, pass it to the user's callback, and the remove the temp directory afterwards. Something like this:

async withTmpDir(callback) {
  let dir = createTmpDir();
  try {
    return await callback(dir);
  } finally {
    removeTmpDir(dir);
  }
}

We already know that this doesn't cleanup the temp directory if you exit/SIGINT/SIGTERM before callback resolves/rejects, the question is, if it's really import for this function to remove the temp directory "whenever possible" – namely handling "graceful" exit/SIGINT/SIGTERM, is there a way to do it correctly, and so far my conclusion is there isn't.

The nature of the cleanup work here isn't really that important, you can substitute "removing a temp directory" with a different task, like shutting down detached child processes, closing database connections, etc.

It would be great if Node can provide the coordination needed to "bundle that all into some sort of "cleanup" interface. If the exit event does fire on SIGINT and SIGTERM, that is perhaps sufficient for this particular case.

@devsnek devsnek
Collaborator

it's that signals and exits don't cause an exception (or return, or any other kind of completion event) from the perspective of the async function

ah i see the pattern you're going for now. yes, we wouldn't deliver such occurrences as exceptions because of a sort of "cardinal rule" which the ecosystem generally follows which is that the world isn't supposed to mutate while js is running, you'd have to wait until after the js finishes running to know if the outside world requested that the js stop running.

What should libraries do?

It's a good question. I've had the same issue working on libraries that mount virtual filesystems, and will easily crash the OS if not unmounted correctly. Lots of things actually have this issue, for example I have vim set up to run eslint automatically, and every once in a while eslint leaves some orphaned processes behind because i killed vim. Here is prior art on trying to make this less messy that I searched up, hopefully it can provide additional information: #20804 #2853 #9050

@bnoordhuis bnoordhuis
Collaborator

There's no real solution here. Either you accept that:

  1. the process terminates immediately, or that

  2. termination can be postponed indefinitely because cleanup code can spawn new tasks, which spawns new tasks, which spawns new tasks, ad infinitum.

If (2) is an acceptable middle ground for you, then you can already accomplish that through the process.on('beforeExit') event.

The one thing Node.js can (perhaps/arguably) do better is wait for promises to resolve on normal exit, when there is no more work to do.

That's being discussed in #29355 and doesn't need changes to Node.js core, strictly speaking.

Even if you accept (2), or a softer variant of (2) (up to some time limit, etc), I'm don't think it can really be done today.

For one thing, signals and explicit exit does not emit 'beforeExit'. It's still the case that I believe there is no way to "do some cleanup work if the SIGINT is going to cause termination, but do nothing otherwise".

@bnoordhuis bnoordhuis
Collaborator

signals and explicit exit does not emit 'beforeExit'.

For the record: that's by design. process.exit() means "stop right now."

SIGINT works the same way (unless you handle it) because that's the traditional/expected behavior for a UNIX application.

@Fishrock123 Fishrock123
Collaborator

signals and explicit exit does not emit 'beforeExit'.

Correct, 'beforeExit' is not named well. It should be something like (on) 'eventLoopEmpty'.

1y

@benjamingr benjamingr
Maintainer

There's no real solution here.

Well, the "real solution" here is promise cancellation, which would have solved this - unfortuunately that was rejected by tc39 for a more explcit proposal that does no let us do this.

Other languages do "creative" things like running finally blocks on generator GC (ahem python).

I would also like to challenge the assumption we can't fix this - we can (as the platform and from the V8 side) throw an AsyncFunctionInterruptedError error into the current async function which would cause the finally to run. Unfortunately without third-state semantics that would also cause catch blocks to run. I don't think we can do this without breaking JavaScript which none of us want to do. I think we need to take this use case up to TC39.


On a side note good issue and reproducing example @chancancode thanks 🙏

The more I think about this the more I become convinced this is a footgun in how we deal with async functiions. If in this case:

async function someTask() {
  try {
    await someOtherTask();
  } finally {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  }
}

Sending a SIGINT (when no SIGINT signal handler is set up) could reject the promise returned from someOtherTask() which would cause the cleanup to run.

@devsnek devsnek
Collaborator

I agree that we should continue looking for creative ways to solve this. Generally though, we should avoid growing the event loop (in fact we ignore new tasks during a normal exit), so i don't think we can rely on the promises themselves in the exit flow.

@devsnek I am really not sure what we can or should do but I am really not a fan of the user stories in our current behavior and I would love to see how this can be fixed from the language side and not from the platform side because this is essentially the same issues web-pages have if the user navigates from them during an operation or when a browser stops running JavaScript on a page to save power.

@devsnek devsnek
Collaborator

Perhaps one route could be registering https://github.com/tc39/proposal-explicit-resource-management scopes in a big list and calling them. We still end up with the problem that we can't dispatch this in a sync context due to design and implementation choices, and we can't dispatch this in an async context because we can't put tasks on the queue...

we still end up with the problem that we can't dispatch this in a sync context due to design and implementation choices, and we can't dispatch this in an async context because we can't put tasks on the queue...

I mean, I get why we might not want to do either and I am not even sure we should propose either (I opened an issue in that repo) but I am not sure why we can't do either?

@devsnek devsnek
Collaborator

the decisions made in the past are: we don't want to do it synchronously because the state of the world can't change while js is running. we don't want to do it asynchronously because queuing more tasks means the process lives longer and we're trying to end the process.

Obviously we can change our minds about these decisions, but they've been in place for a while.

1y

@benjamingr benjamingr
Maintainer

Ok, so I think I understand, your point is that cleanup for stuff like this is typically asynchronous in nature anyway and we cannot (safely) delay the process exiting by waiting for asynchronous actions to happen. That makes sense and I agree.

We theoretically could also have different behavior for a "clean" exit (like a SIGINT) and a "dirty" exit (like a process.exit(1)) but given that cleanup is "best effort" anyway I can totally also see not wanting to do anything here anyway.

It appears that Python does the same thing we do in async functions. (You would do loop.add_signal_handler() on your event loop and handle it there but it doesn't get stuff thrown onto the async function). In C# you would be expected to pass a CancelToken throughout your app in a patten that looks like this:

const token = { cancel() { throw new CancelError('cancelled')} };

async function someTask(token) {
  try {
    await someOtherTask(token);
  } finally {
    // MUST PERFORM VERY IMPORTANT CLEAN UP!
  }
}

And your interrupt handler would be process.on('SIGINT', () => token.cancel()).

And users would be expected to know to filter the CancelError (well CancelException since this example is in C#) in their instrumentation and logs.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

I think at its heart this issue is about the guarantees provided by the finally clause, and therefore about its suitability for important resource cleanup tasks, especially after unexpected failures.

In Python, we love using the finally clause precisely because of its guarantees. The code you put in there is run, and it can be an incredibly powerful and easy-to-reason-about technique.

I am still new to this ecosystem, but the question whether I should do the same or not in NodeJS/JavaScript was one of the first ones I researched more deeply. I found the following discussion to be incredibly insightful: https://esdiscuss.org/topic/garbage-collection-in-generators

Andreas Rosberg said about

function getResults*() {
     try {
         var resource = acquire();
         for(const item of resource) yield process(item);
     } finally {
         release(resource);
     }

that it is a

"bogus forms of resource management". This is an anti-pattern in ES6. It won't work correctly. We should never have given the illusion that it does.

and Benjamin Gruenbaum (@benjamingr) was kind of representing my perspective in this discussion, saying that

try/finally is commonly used for resource management and garbage collection is a form of automatic resource management. There is also the assumption that finally is always run. Two other languages have opted into the "run finally" behavior so I wouldn't call it crazy

and that

I'm very curious why languages like Python have chosen to call finally blocks in this case - this was not a hindsight and according to the PEP. They debated it and explicitly decided to call release.

The fundamental difference between the approaches might be as of garbage collection, seemingly, as Mark S miller says:

JavaScript GC is not specified to be precise, and so should be assumed conservative. [...] C++ RAII and Python refcounting are completely different: they are precise, prompt, predictable, and deterministic.

but on the other hand Andreas Rossberg said

Try-finally has nothing to do with GC, it's just control flow

I don't think the discussion was concluded in a really satisfying way, but my big takeaway was one should not rely on the finally clause in generators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
7 participants
Converted from issue
Beta
You can’t perform that action at this time.