When I last wrote about async cancellation in Rust, I touched briefly on the question of how cancellation interacts with panic. Mostly I left it as an exercise for the reader and left a rough sketch for how I thought it would work. More recently, Boats touched on the issue in a little more detail, but I think there are still a lot of open questions. In this post, I'd like to experiment with unwinding using my cancellation prototype and build on some of the previous work in this area.

It's not as easy as I thought

In the sketch I laid out before, I expected the core idea of supporting cancellation during unwinding would be for the executor, and any mini-executors like race and join, to wrap calls to poll with catch_unwind, then in the Err case, call poll_cancel to completion and then call resume_unwind. In pseudo-code, that would look something like:

loop {
    match catch_unwind(|| task.poll(cx)) {
        Ok(Poll::Ready(x)) => return x,
        Ok(Poll::Pending) => continue,
        Err(panic) => {
            while let Poll::Pending = task.poll_cancel(cx) {}
            resume_unwind(panic);
        }
    }
}

Unfortunately this doesn't work. It turns out I had some inkling this might be the case when I wrote:

There are other challenges though. One is that the poll_cancel functions will need to be written to be aware of the fact that they might be called during unwinding, which means the internal state for the future might be inconsistent.

To understand what's wrong, recall that I desugared cancellation-aware async blocks into coroutines. Rust coroutines only have one entry point, which is the resume method. I simulated two entry points (poll and poll_cancel) by passing another argument into resume. The thing is, once resume panics, coroutines cannot be resumed again and they will panic if you try. Since poll and poll_cancel are backed by the same resume method, this means we can't call poll_cancel after poll panics.
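The poisoning behavior is easy to observe with ordinary async functions on stable Rust, since they are compiled to the same kind of state machine. The following is a minimal sketch and not code from the prototype: `always_panics`, `NoopWake`, and `poll_twice_after_panic` are names made up for this illustration. It polls a future whose first poll panics, then polls it a second time and reports whether each poll panicked.

```rust
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

/// A waker that does nothing; enough to build a `Context` by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Illustrative future: its very first poll panics.
async fn always_panics() {
    panic!("first poll panics");
}

/// Polls the same future twice and reports whether each poll panicked.
fn poll_twice_after_panic() -> (bool, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(always_panics());

    // The first poll unwinds with the future's own panic.
    let first = catch_unwind(AssertUnwindSafe(|| fut.as_mut().poll(&mut cx))).is_err();
    // The state machine is now poisoned, so resuming it panics again.
    let second = catch_unwind(AssertUnwindSafe(|| fut.as_mut().poll(&mut cx))).is_err();
    (first, second)
}
```

Both polls panic: the first with the original message, the second because the state machine refuses to be resumed after a panic. With poll and poll_cancel sharing one resume method, the same thing happens to a poll_cancel call made after poll has unwound.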

Some of this is an artifact of the way this experiment is structured. If we had proper compiler support for multiple entry points to a coroutine, we might be able to make this work. But I think it's more composable and more in line with existing precedent to follow a rule where all unwinding or cancellation work needs to finish before a panic leaves the poll call.

An approach that actually works

This realization that we need to process any cancellations before unwinding out of poll felt constraining at first, but it actually simplifies a lot of the design. I thought we'd need to wrap basically every call to poll in catch_unwind, but in most cases this is unnecessary and we can instead let the usual unwinding machinery proceed as normal. The places where we do care are those where we know of multiple futures and need to cancel the rest if one of them panics.

Let's do on_cancel as an example. While I don't think on_cancel would be a great API to support in production, it is useful to focus on the specifics of cancellation behavior.

In the last post, I was thinking of on_cancel almost as an approximation of an exception handler. For our purposes today, I think it's more useful to think of it as a kind of future combinator. In this view, on_cancel produces a new future from two others, one that is the normal execution path, and another future that is run only when the future is cancelled.¹

Looking at it this way, we can see what we should do when the poll function on the main future panics. We aren't allowed to poll the future that's panicking anymore, because its internal state might be inconsistent. We have to trust that as poll was unwinding, the future ran any cancellation handlers that were on the stack. But, since we want cancel-on-unwind semantics, the on_cancel combinator needs to catch the panic, run the cancellation future to completion, and then resume unwinding.

Deriving the implementation

Now let's see how to add cancellation on panic behavior to our existing on_cancel implementation. My last post didn't really go into the details on this, so let's start with a rough sketch of the previous on_cancel implementation. Throughout this section I'm going to ignore details like pinning and unsafe so we can focus on the main idea. I have a complete working implementation of the ideas in this section available at https://github.com/eholk/explicit-async-cancellation.

The on_cancel method returns a future that carries a cancellation handler. While the details are hidden in the surface API, the struct and future implementation it returns look like this:

struct OnCancel<F, H> {
    future: F,
    on_cancel: Option<H>,
}

impl<F, H> Future for OnCancel<F, H>
where
    F: Future,
    H: Future,
{
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        self.future.poll(cx)
    }

    fn poll_cancel(self: Pin<&mut Self>, cx: &mut Context) -> Poll<()> {
        // run the cancellation handler if it's still present
        if let Some(on_cancel) = self.on_cancel {
            match on_cancel.poll(cx) {
                // if cancellation is complete, clear the handler so we won't try to run it again
                Poll::Ready(()) => self.on_cancel = None,
                // cancellation is not finished, so yield to the caller.
                Poll::Pending => return Poll::Pending,
            }
        }

        // run any cancellation handlers on the inner future
        self.future.poll_cancel(cx)
    }
}

The poll function is pretty uninteresting. We just forward it to the inner future. The poll_cancel function is a little more subtle. The main thing we need to do is run the cancellation handler, which we do by calling poll on it. However, the inner future might also have nested cancellation handlers, so we need to call poll_cancel on it as well. This is also why I chose to wrap the cancellation hook in an Option, since I can use that as a flag to indicate whether the cancellation hook is finished.

As an aside, I chose to do outside-in cancellation semantics here since drop also runs outside-in. I'm not sure this was the right choice. For example, unwinding is inside-out instead. I think it's worth thinking harder about what the right ordering is, but for now it's easy to change and independent of our focus today.
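To make the two orderings concrete, here is a small sketch using only synchronous code. The `Noisy` type and both functions are invented for this illustration. `drop_order` drops a value that owns another value, and `unwind_order` panics through two nested scopes, each holding a guard; both return the order in which the destructors ran.

```rust
use std::cell::RefCell;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Records its name in the shared log when dropped.
struct Noisy {
    name: &'static str,
    log: Log,
    _child: Option<Box<Noisy>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Dropping a nested value: the outer `Drop::drop` runs before its fields are dropped.
fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let inner = Noisy { name: "inner", log: log.clone(), _child: None };
    let outer = Noisy { name: "outer", log: log.clone(), _child: Some(Box::new(inner)) };
    drop(outer);
    let order = log.borrow().clone();
    order
}

/// Unwinding through nested scopes: the innermost guard is dropped first.
fn unwind_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let result = catch_unwind(AssertUnwindSafe(|| {
        let _outer = Noisy { name: "outer", log: log.clone(), _child: None };
        let inner_frame = || {
            let _inner = Noisy { name: "inner", log: log.clone(), _child: None };
            panic!("unwind through both frames");
        };
        inner_frame();
    }));
    assert!(result.is_err());
    let order = log.borrow().clone();
    order
}
```

Here `drop_order()` yields outer then inner, while `unwind_order()` yields inner then outer, which is the mismatch the aside is pointing at.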

Okay, so now that we have a basic on_cancel implementation, let's handle what happens if the call to the nested future's poll panics. In short, we need to wrap the call to poll in catch_unwind.

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
    match catch_unwind(|| self.future.poll(cx)) {
        Ok(poll) => poll,
        Err(panic) => todo!("run the cancellation hook and then resume unwinding"),
    }
}

Now let's think about the Err case. Basically, we need to cancel ourselves, which we can do by calling poll_cancel. Then we need to resume unwinding. Because poll_cancel might take several tries to finish, we need to save the panic information so we can resume unwinding after it's done. So we'll add another field to OnCancel to optionally store the panic information.

struct OnCancel<F, H> {
    future: F,
    on_cancel: Option<H>,
    panic: Option<Box<dyn Any + Send + 'static>>,
}

impl<F, H> Future for OnCancel<F, H>
where
    F: Future,
    H: Future,
{
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        match catch_unwind(|| self.future.poll(cx)) {
            Ok(poll) => poll,
            Err(panic) => {
                self.panic = Some(panic);
                match self.poll_cancel(cx) {
                    Poll::Ready(()) => resume_unwind(self.panic.take().unwrap()),
                    Poll::Pending => Poll::Pending,
                }
            },
        }
    }

    fn poll_cancel(self: Pin<&mut Self>, cx: &mut Context) -> Poll<()> {
        todo!("we'll come back to this in a minute")
    }
}

We're part of the way there, but we still have some problems. Assuming poll_cancel were correct (it's not, but we'll get there), we'd be okay if cancellation finished promptly. But if not, it will return Pending, which we'll bubble up to the caller. The caller doesn't know we're panicking, since we've hidden the panic information away in our panic field, so it will eventually call poll on us again. Unfortunately, this means we'll poll the inner future, which we've previously said is not allowed. So we need to make a small change to check if we're in the process of panicking when we're polled.

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
    if self.panic.is_some() {
        match self.poll_cancel(cx) {
            Poll::Ready(()) => resume_unwind(self.panic.take().unwrap()),
            Poll::Pending => return Poll::Pending,
        }
    }

    match catch_unwind(|| self.future.poll(cx)) {
        Ok(poll) => poll,
        Err(panic) => {
            self.panic = Some(panic);
            match self.poll_cancel(cx) {
                Poll::Ready(()) => resume_unwind(self.panic.take().unwrap()),
                Poll::Pending => Poll::Pending,
            }
        },
    }
}

And now we're all set. If we're polled when there's panic information present then we never get to the call to self.future.poll(cx).

Now it's time to revisit poll_cancel. To share some logic, I had the panic path in poll call into poll_cancel, but this means we need to update poll_cancel to recognize that it can be called while panicking. Here's how:

fn poll_cancel(self: Pin<&mut Self>, cx: &mut Context) -> Poll<()> {
    // run the cancellation handler if it's still present (this part stays the same)
    if let Some(on_cancel) = self.on_cancel {
        match on_cancel.poll(cx) {
            // if cancellation is complete, clear the handler so we won't try to run it again
            Poll::Ready(()) => self.on_cancel = None,
            // cancellation is not finished, so yield to the caller.
            Poll::Pending => return Poll::Pending,
        }
    }

    // if we aren't panicking, run any cancellation handlers on the inner future
    // otherwise, resume unwinding
    match self.panic {
        None => self.future.poll_cancel(cx),
        Some(_) => resume_unwind(self.panic.take().unwrap()),
    }
}

The first part, where we run the cancellation hook, stays the same as before. In the second part, we would normally cancel the inner future, but remember that if we are panicking we aren't allowed to poll it again.

It's worth asking what we should do in the Some line though. At this point we know we are in the process of unwinding, and all cleanup code has finished. One option is to return Poll::Ready(()) here, and if we're called from poll then we could count on it calling resume_unwind. However, it could also be that while we were waiting on the cancellation to finish, the executor decided to cancel us. In this case, if we returned Poll::Ready(()) then we would swallow the exception. So instead, the right answer is to resume_unwind here as well.

So there we have it: how to cancel a future when polling it panics.
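For readers who want something that compiles on stable Rust, the following is a rough sketch of the pieces above put together. It is not the implementation from the linked repository. It assumes a hypothetical `CancelFuture` trait standing in for a `Future` with `poll_cancel`, takes `&mut self` instead of `Pin<&mut Self>` to sidestep pinning, and wraps the closure in `AssertUnwindSafe` because a closure capturing `&mut` references is not `UnwindSafe` on its own. `Boom`, `TwoStep`, `NoopWake`, and `run` are test doubles invented for the example.

```rust
use std::any::Any;
use std::future::Future;
use std::panic::{catch_unwind, resume_unwind, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a `Future` trait that also has `poll_cancel`.
/// It takes `&mut self` so the sketch can skip pin projection.
trait CancelFuture {
    type Output;
    fn poll(&mut self, cx: &mut Context<'_>) -> Poll<Self::Output>;
    fn poll_cancel(&mut self, cx: &mut Context<'_>) -> Poll<()>;
}

struct OnCancel<F, H> {
    future: F,
    on_cancel: Option<H>,
    panic: Option<Box<dyn Any + Send + 'static>>,
}

impl<F: CancelFuture, H: Future<Output = ()> + Unpin> CancelFuture for OnCancel<F, H> {
    type Output = F::Output;

    fn poll(&mut self, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Only touch the inner future if no panic has been parked yet.
        if self.panic.is_none() {
            match catch_unwind(AssertUnwindSafe(|| self.future.poll(cx))) {
                Ok(poll) => return poll,
                Err(panic) => self.panic = Some(panic),
            }
        }
        // A panic is parked: poll_cancel either yields or resumes unwinding.
        match self.poll_cancel(cx) {
            Poll::Pending => Poll::Pending,
            Poll::Ready(()) => unreachable!("poll_cancel resumes the parked panic"),
        }
    }

    fn poll_cancel(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        // Run the cancellation handler if it is still present.
        if let Some(handler) = self.on_cancel.as_mut() {
            match Pin::new(handler).poll(cx) {
                Poll::Ready(()) => self.on_cancel = None,
                Poll::Pending => return Poll::Pending,
            }
        }
        // Not panicking: cancel the inner future. Panicking: resume unwinding.
        match self.panic.take() {
            None => self.future.poll_cancel(cx),
            Some(panic) => resume_unwind(panic),
        }
    }
}

/// Test double: an inner future whose `poll` always panics.
struct Boom;
impl CancelFuture for Boom {
    type Output = ();
    fn poll(&mut self, _: &mut Context<'_>) -> Poll<()> { panic!("boom") }
    fn poll_cancel(&mut self, _: &mut Context<'_>) -> Poll<()> { Poll::Ready(()) }
}

/// Test double: a cancellation handler that is pending once, then ready.
struct TwoStep(bool);
impl Future for TwoStep {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<()> {
        if std::mem::replace(&mut self.0, true) { Poll::Ready(()) } else { Poll::Pending }
    }
}

struct NoopWake;
impl Wake for NoopWake { fn wake(self: Arc<Self>) {} }

/// Returns (first poll yielded Pending, second call unwound with the parked panic).
fn run(executor_cancels: bool) -> (bool, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = OnCancel { future: Boom, on_cancel: Some(TwoStep(false)), panic: None };
    let first = catch_unwind(AssertUnwindSafe(|| fut.poll(&mut cx)));
    let second = catch_unwind(AssertUnwindSafe(|| {
        if executor_cancels { fut.poll_cancel(&mut cx) } else { fut.poll(&mut cx) }
    }));
    (matches!(first, Ok(Poll::Pending)), second.is_err())
}
```

Calling `run(false)` models a caller that simply polls again, and `run(true)` models an executor that cancels the task while the handler is still pending. In both cases the first poll yields Pending and the second call unwinds with the parked panic, so the panic is not swallowed on either path.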

Should we do this?

Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.

We've shown that it's at least somewhat possible to support async cleanup code while unwinding. I'll admit, beyond a basic smoke test, I haven't really probed the limits of this design. For example, what happens if we panic while running the cancellation handler as a result of another panic? Or what actually happens if the executor cancels us while we are cleaning up before resuming a panic? If we were to RFC something like this, these are all questions that we'd need to explore.

I decided to go ahead and write this post without answering those questions because I think we've already learned enough to start evaluating this design and to inform future options.

First of all, something about suspending while in the process of unwinding just feels fundamentally weird and uncomfortable. That said, I think we can develop a reasonable semantics for this behavior if we decide we want it.

But this also leads to a shortcoming that I'm not sure how to resolve. This prototype cannot work in #![no_std] environments, because catch_unwind and resume_unwind represent panic information as a Box<dyn Any + Send + 'static>, meaning we need an allocator. This is a non-starter for something that we'd want to consider building in as a core Rust language feature. The whole async/await system has been carefully designed not to need an allocator, and we need to preserve this property. After all, async/await has found a lot of success in microcontroller environments!

Is this necessary though? Or is it an artifact of trying to prototype a system purely in library code without compiler support? As an analogy, we could imagine prototyping destructors using catch_unwind, but rustc is able to generate code to run destructors during unwinding without needing to reify the exception.

Unfortunately I don't think we can avoid the issue in the same way. The problem is that normal unwinding doesn't suspend the execution at all, while we very much need to be able to do that to await in the unwinding path. This means the exception does need to be stored somewhere (presumably with the future), and we need to be able to resume unwinding later. If you're using a work-stealing executor, this means it's even possible that your task could start unwinding on one thread and finish on another. So we need somewhere to store the exception that's not ephemeral in the way that it is during the Rust-generated unwind code.

There might be other options that could work. For example, the executor could reserve some space for each task that's large enough to hold most panics. Most likely the way we'd accomplish this is by attaching something to the Context that gives access to it. Maybe it'd be specific to panics, or maybe it'd be a more general task-local bump allocator or something like that. At any rate, we could add API surface for a minimal allocator to support awaiting while unwinding without needing a full-blown global allocator. These could be made optional, which would give executors the option of aborting if they cannot or don't want to support async unwinding.
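As a very rough picture of the reserved-space idea, here is a sketch of a fixed-capacity slot that holds a panic message without touching the heap. Everything in it is hypothetical: `PanicSlot` is not an existing API, it stores only a string message rather than an arbitrary `dyn Any` payload, and how a `Context` would hand out such a slot is left open.

```rust
/// Hypothetical fixed-capacity, allocation-free slot for a parked panic message.
pub struct PanicSlot<const N: usize> {
    buf: [u8; N],
    len: Option<usize>,
}

impl<const N: usize> PanicSlot<N> {
    pub const fn new() -> Self {
        Self { buf: [0; N], len: None }
    }

    /// Copies `msg` into the slot. Returns `false` if it does not fit, in
    /// which case an executor might choose to abort instead.
    pub fn store(&mut self, msg: &str) -> bool {
        let bytes = msg.as_bytes();
        if bytes.len() > N {
            return false;
        }
        self.buf[..bytes.len()].copy_from_slice(bytes);
        self.len = Some(bytes.len());
        true
    }

    /// Returns the parked message, if any, and marks the slot as empty.
    pub fn take(&mut self) -> Option<&str> {
        let len = self.len.take()?;
        core::str::from_utf8(&self.buf[..len]).ok()
    }
}
```

The sketch only uses items from `core`, so nothing in it depends on a global allocator. The price is the fixed capacity `N`: a payload that does not fit has to be rejected, which is where the option of aborting would come in.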

Another option would be to have the compiler not automatically generate calls to poll_cancel while unwinding, and instead provide something like an async version of catch_unwind. I think something like this is what Boats was proposing. The nice thing about this option is that we can completely give up on supporting #![no_std]. Furthermore, we don't have to worry about being "zero cost," since the fact that the user called async_catch_unwind signals that they're willing to pay the cost that's needed.
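In library form, an async catch_unwind can be sketched as a wrapper future that runs every poll under std::panic::catch_unwind and turns a panic into an Err output. This is a minimal sketch under assumptions, not a proposed API: `AsyncCatchUnwind`, `poll_once`, `boom`, and `fine` are illustrative names, and the inner future is boxed to avoid unsafe pin projection, which fits this option's willingness to allocate. The futures crate provides a similar combinator as `FutureExt::catch_unwind`.

```rust
use std::any::Any;
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical wrapper: resolves to `Err(payload)` if the inner future panics.
struct AsyncCatchUnwind<F> {
    // Boxing keeps the sketch free of unsafe pin projection.
    inner: Pin<Box<F>>,
}

fn async_catch_unwind<F: Future>(fut: F) -> AsyncCatchUnwind<F> {
    AsyncCatchUnwind { inner: Box::pin(fut) }
}

impl<F: Future> Future for AsyncCatchUnwind<F> {
    type Output = Result<F::Output, Box<dyn Any + Send + 'static>>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let inner = self.inner.as_mut();
        match catch_unwind(AssertUnwindSafe(|| inner.poll(cx))) {
            Ok(Poll::Pending) => Poll::Pending,
            Ok(Poll::Ready(value)) => Poll::Ready(Ok(value)),
            // The panic becomes an ordinary value; the caller can await its
            // cleanup and then decide whether to call `resume_unwind`.
            Err(payload) => Poll::Ready(Err(payload)),
        }
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future a single time with a waker that does nothing.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    fut.as_mut().poll(&mut cx)
}

async fn boom() -> i32 {
    panic!("boom")
}

async fn fine() -> i32 {
    2
}
```

With this shape, `poll_once(async_catch_unwind(fine()))` is ready with an Ok value, while `poll_once(async_catch_unwind(boom()))` is ready with an Err carrying the panic payload. Because the panic is now data, the caller is free to await any async cleanup before passing the payload to resume_unwind.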

That said, it's not clear how that should interact with do ... final blocks if we were to add them.² For example, the final block would presumably run during unwinding in sync code, so it seems like we'd also need to do it while unwinding in async code. Unfortunately, as far as I can tell that will run into the same allocation problems.

So to go back to the question of whether we should do this, I think we need more exploration. There are some options, but from my exploration here it seems like it's hard to satisfy all our requirements. But maybe one of these, or some other option, can strike a decent compromise.

¹ With a small tweak, we could approximate a finally clause by making it so we run the cancellation future even if the main future completes successfully.

² I really like the idea of do ... final! I had hoped to explore that some in this post, but I felt there was enough material here without it.