As I've written about before, an important goal for async Rust is to support async functions everywhere, including in trait objects (dyn Trait). To this end, we are adding a new experimental type called dyn* that will give us more flexibility to support dynamic dispatch for async methods. We now have experimental support for dyn* in nightly Rust, so we can start to kick the tires and use our experience to inform future development.

One thing we'd like to ensure is that using dyn* does not impose significant costs above what's already incurred by dyn. Ideally we'd be able to generate essentially the same code for dyn* Trait that we do for dyn Trait. With that in mind, in this post, I'd like to look at some of the code we currently generate. We'll start by looking at dyn Trait objects, and then we'll see how things change for dyn* Trait.

# dyn Trait

Let's start off by looking at how Rust represents trait objects. Trait objects (aka dyn Trait) in Rust are kind of weird.1 They are an example of an unsized type, which means in practice you generally don't work directly with them. For example, I might want to write something like:

fn debug_print(x: dyn Debug) {
    println!("{x:?}")
}


but if we do this the compiler gives us a quite helpful error message2:

error[E0277]: the size for values of type (dyn Debug + 'static) cannot be known at compilation time
--> src/lib.rs:3:16
|
3 | fn debug_print(x: dyn Debug) {
|                ^ doesn't have a size known at compile-time
|
= help: the trait Sized is not implemented for (dyn Debug + 'static)
help: function arguments must have a statically known size, borrowed types always have a known size
|
3 | fn debug_print(x: &dyn Debug) {
|                   +


The compiler tells us something we can do to fix this, so let's give it a try.

fn debug_print(x: &dyn Debug) {
    println!("{x:?}")
}


This compiles and works just fine.

The compiler told us that borrowed types always have a known size, so how big are they? Well, & is a pointer, and on 64-bit machines like mine, pointers are 64 bits or 8 bytes. But if I print out the size of x instead, I get 16 bytes. What's going on?

Well, pointers in Rust aren't always 8 bytes. Sometimes they are bigger so you can attach extra data to them. In particular, this happens with pointers to unsized data, like str, [T], and our focus right now, dyn Trait. In each of these cases, the size of &str, &[T], and &dyn Trait is two words, or 16 bytes. We sometimes call these wide pointers or fat pointers because they are twice the size of regular pointers.

But what is that extra word used for? For &str and &[T], the first word points to the actual data (either the characters in the string, or the array of Ts), while the second word stores the length, which the compiler uses for things like dynamic bounds checks. Trait objects are a little different though. They still store a pointer to the underlying data in the first field, but the second field stores a pointer to a vtable. The vtable looks sort of like this:
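We can check these sizes directly on stable Rust; a quick sketch:

```rust
use std::fmt::Debug;
use std::mem::size_of;

fn main() {
    let word = size_of::<usize>(); // 8 bytes on a 64-bit target

    // Thin pointers to sized types are one word...
    assert_eq!(size_of::<&u64>(), word);

    // ...while pointers to unsized types carry an extra word:
    // a length for str and slices, a vtable pointer for dyn Trait.
    assert_eq!(size_of::<&str>(), 2 * word);
    assert_eq!(size_of::<&[u32]>(), 2 * word);
    assert_eq!(size_of::<&dyn Debug>(), 2 * word);
}
```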

struct DebugVtable {
    drop: fn(*mut ()),
    size: usize,
    alignment: usize,
    fmt: fn(*const (), &mut Formatter<'_>),
}


Vtables always have at least three entries: a pointer to a drop function, the size of the underlying value3, and the underlying value's alignment. The vtable will then have additional entries for each method in the trait. Since our example has been looking at the Debug trait, we have only one extra entry, for the fmt method.

The compiler uses this vtable to call methods on the underlying type without knowing any additional particulars about it.

## Virtual method calls

In our examples so far we've been using the Debug trait. This is handy in some ways, since it's readily available in std, is widely implemented, and also widely used. That said, it's mostly used in the context of format strings like format!("{x:?}") which hides the actual calls to Debug methods behind macro-generated code. So instead, let's make a new trait to use as our example. We'll use a Counter trait that has two methods: one to read the counter, and one to add a value to the counter. Together these should be enough to explore a lot of what goes on with trait objects. Here's our Counter trait:

trait Counter {
    fn get_value(&self) -> usize;
    fn increment_by(&mut self, amount: usize);
}


We can think of a pointer to a dyn Counter object as a tuple or a struct, like this:

struct DynCounter {
    this: *mut (),
    vtable: *const CounterVtable,
}

struct CounterVtable {
    drop: fn(*mut ()),
    size: usize,
    alignment: usize,
    get_value: fn(*const ()) -> usize,
    increment_by: fn(*mut (), usize),
}


Now let's look at some code that uses a counter:

fn use_counter(counter: &mut dyn Counter) -> usize {
    counter.increment_by(42);
    counter.get_value()
}
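There's nothing exotic needed at the call site, either; any concrete implementation coerces to &mut dyn Counter. A runnable sketch (BasicCounter is a made-up implementation, just for illustration):

```rust
trait Counter {
    fn get_value(&self) -> usize;
    fn increment_by(&mut self, amount: usize);
}

// A hypothetical concrete implementation.
struct BasicCounter { value: usize }

impl Counter for BasicCounter {
    fn get_value(&self) -> usize { self.value }
    fn increment_by(&mut self, amount: usize) { self.value += amount; }
}

fn use_counter(counter: &mut dyn Counter) -> usize {
    counter.increment_by(42);
    counter.get_value()
}

fn main() {
    let mut c = BasicCounter { value: 0 };
    // &mut BasicCounter unsizes to &mut dyn Counter at the call site,
    // building the wide pointer (data pointer + vtable pointer).
    assert_eq!(use_counter(&mut c), 42);
}
```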


Rust doesn't know what the underlying type of counter is and therefore cannot generate a direct call to the method. Instead, it uses the vtable to find which method it should use. To see how Rust lowers use_counter to use the vtable, let's start by rewriting the calls using universal function call syntax (UFCS).

fn use_counter(counter: &mut dyn Counter) -> usize {
    <dyn Counter as Counter>::increment_by(counter, 42);
    <dyn Counter as Counter>::get_value(counter)
}


Making the self parameter explicit makes the transformation to use the vtable pretty much a find and replace affair. Basically, anywhere we see counter, we replace it with counter.this and anywhere we see <dyn Counter as Counter>::, we replace it with counter.vtable.

fn use_counter(counter: DynCounter) -> usize {
    counter.vtable.increment_by(counter.this, 42);
    counter.vtable.get_value(counter.this)
}


In the interest of clarity, I left out some of the casts and explicit dereferences that make this legal.4 Here it is with all the extra pieces in place:

fn use_counter(counter: DynCounter) -> usize {
    unsafe {
        ((*counter.vtable).increment_by)(counter.this, 42);
        ((*counter.vtable).get_value)(counter.this)
    }
}
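This hand-compiled version is close enough to real Rust that we can actually run it, if we also build the vtable by hand for one concrete type. Here's a runnable sketch (BasicCounter and its vtable are made up for illustration; drop sits in slot 0 to match rustc's actual layout):

```rust
use std::mem;

// A hypothetical concrete type to dispatch to.
struct BasicCounter { value: usize }

struct DynCounter {
    this: *mut (),
    vtable: *const CounterVtable,
}

struct CounterVtable {
    drop: fn(*mut ()),
    size: usize,
    alignment: usize,
    get_value: fn(*const ()) -> usize,
    increment_by: fn(*mut (), usize),
}

// Type-erased shims that cast the opaque pointer back to the concrete type.
fn get_value_impl(this: *const ()) -> usize {
    unsafe { (*(this as *const BasicCounter)).value }
}

fn increment_by_impl(this: *mut (), amount: usize) {
    unsafe { (*(this as *mut BasicCounter)).value += amount }
}

fn drop_impl(_this: *mut ()) { /* BasicCounter has no drop glue */ }

static BASIC_COUNTER_VTABLE: CounterVtable = CounterVtable {
    drop: drop_impl,
    size: mem::size_of::<BasicCounter>(),
    alignment: mem::align_of::<BasicCounter>(),
    get_value: get_value_impl,
    increment_by: increment_by_impl,
};

fn use_counter(counter: DynCounter) -> usize {
    unsafe {
        ((*counter.vtable).increment_by)(counter.this, 42);
        ((*counter.vtable).get_value)(counter.this)
    }
}

fn main() {
    let mut c = BasicCounter { value: 0 };
    let dyn_counter = DynCounter {
        this: &mut c as *mut BasicCounter as *mut (),
        vtable: &BASIC_COUNTER_VTABLE,
    };
    assert_eq!(use_counter(dyn_counter), 42);
}
```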


Looking at this code we can guess at how many memory references are needed, but the most precise way to find out is to look at the generated assembly. Fortunately, both the original version of use_counter and our "manually compiled" version that receives a DynCounter compile to identical assembly, which is shown below.

 1  push r14
 2  push rbx
 3  push rax
 4  mov r14, rsi
 5  mov rbx, rdi
 6  mov esi, 42
 7  call qword ptr [r14 + 32]
 8  mov rdi, rbx
 9  mov rax, r14
10  add rsp, 8
11  pop rbx
12  pop r14
13  jmp qword ptr [rax + 24]

It's worth talking about the generated assembly for a bit. Two instructions here count as indirections for our purposes: the call instruction on line 7 and the jmp instruction on line 13, both of which are indirect calls to methods through the vtable. The second call is a jmp because LLVM was able to apply a tail call optimization.

I expected to see an additional pair of loads to read the this and vtable fields of the DynCounter value (or the equivalent in the case of &mut dyn Counter), but it looks like instead Rust has chosen to pass this through a pair of registers, with rsi holding the vtable pointer and rdi holding the this pointer.

Essentially the rest of the instructions are moving data around to make sure we respect the calling conventions, have the data in the right place to pass as arguments, and so on.

When we come back to dyn*, ideally we will be able to generate assembly that looks essentially identical to this.

## Ownership and drop

Let's take a look at one more variant:

fn use_counter(mut counter: Box<dyn Counter>) -> usize {
    counter.increment_by(42);
    counter.get_value()
}


Compared to our original &mut dyn Counter variant, this version has an important difference. Box is an owned type, meaning when it falls out of scope we are responsible for dropping whatever value is stored inside the box and then freeing the box's allocation. Let's see how this looks in assembly language.
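To see that ownership concretely, here's a runnable sketch where dropping the Box runs the underlying type's destructor through the vtable (BasicCounter and the DROPS counter are made up for illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

trait Counter {
    fn get_value(&self) -> usize;
    fn increment_by(&mut self, amount: usize);
}

// Counts destructor runs so we can observe the drop happening.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct BasicCounter { value: usize }

impl Counter for BasicCounter {
    fn get_value(&self) -> usize { self.value }
    fn increment_by(&mut self, amount: usize) { self.value += amount; }
}

impl Drop for BasicCounter {
    fn drop(&mut self) { DROPS.fetch_add(1, Ordering::SeqCst); }
}

fn use_counter(mut counter: Box<dyn Counter>) -> usize {
    counter.increment_by(42);
    counter.get_value()
    // counter falls out of scope here: drop runs through the vtable's
    // slot 0, then the heap allocation is freed.
}

fn main() {
    let result = use_counter(Box::new(BasicCounter { value: 0 }));
    assert_eq!(result, 42);
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
}
```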

 1  push r15
 2  push r14
 3  push rbx
 4  sub rsp, 16
 5  mov rbx, rsi
 6  mov r14, rdi
 7  mov qword ptr [rsp], rdi
 8  mov qword ptr [rsp + 8], rsi
 9  mov esi, 42
10  call qword ptr [rbx + 32]
11  mov rdi, r14
12  call qword ptr [rbx + 24]
13  mov r15, rax
14  mov rdi, r14
15  call qword ptr [rbx]
16  mov rsi, qword ptr [rbx + 8]
17  test rsi, rsi
18  je .LBB4_5
19  mov rdx, qword ptr [rbx + 16]
20  mov rdi, r14
21  call qword ptr [rip + __rust_dealloc@GOTPCREL]
22  .LBB4_5:
23  mov rax, r15
24  add rsp, 16
25  pop rbx
26  pop r14
27  pop r15
28  ret

There are some parts that should look familiar. Line 10 has a call to [rbx + 32], which corresponds to the call to increment_by, and line 12 has a call to [rbx + 24], which corresponds to the call to get_value. But this time the call to get_value is not a tail call; there's more work still to do.

The next interesting part is line 15, where we call [rbx]. This is a call through slot 0 of the vtable, which according to rustc's vtable layout code holds drop_in_place for that type. Next, in lines 16 and 17, we load the size of the type out of the vtable and, if it is not zero, free the allocation. This check is necessary because Box<dyn Counter> might wrap a zero-sized type (ZST), and zero-sized values don't get a heap allocation, so there would be nothing to free. Finally, we move the return value from get_value, which we had previously saved in r15, back into the return register (rax) and return to the caller.

This will become relevant in dyn* too, because dyn* will also be an owned type.

### What about the stores on lines 7 and 8?

You might have noticed that lines 7 and 8 seem out of place: we store a couple of registers to the stack and never touch them again. The reason they are there is a bit of a tangent from this post, so feel free to skip to the next section, but if you're curious you can find the answer below.

To start with, I didn't include the whole disassembly in the snippet above. Here's the complete code:

 1  push r15
 2  push r14
 3  push rbx
 4  sub rsp, 16
 5  mov rbx, rsi
 6  mov r14, rdi
 7  mov qword ptr [rsp], rdi
 8  mov qword ptr [rsp + 8], rsi
 9  mov esi, 42
10  call qword ptr [rbx + 32]
11  mov rdi, r14
12  call qword ptr [rbx + 24]
13  mov r15, rax
14  mov rdi, r14
15  call qword ptr [rbx]
16  mov rsi, qword ptr [rbx + 8]
17  test rsi, rsi
18  je .LBB4_5
19  mov rdx, qword ptr [rbx + 16]
20  mov rdi, r14
21  call qword ptr [rip + __rust_dealloc@GOTPCREL]
22  .LBB4_5:
23  mov rax, r15
24  add rsp, 16
25  pop rbx
26  pop r14
27  pop r15
28  ret

31  mov r15, rax
32  mov rsi, qword ptr [rbx + 8]
33  mov rdx, qword ptr [rbx + 16]
34  mov rdi, r14
35  call alloc::alloc::box_free
36  jmp .LBB4_7

38  mov r15, rax
39  mov rdi, rsp
40  call core::ptr::drop_in_place>
41  .LBB4_7:
42  mov rdi, r15
43  call _Unwind_Resume@PLT
44  ud2

46  call qword ptr [rip + core::panicking::panic_no_unwind@GOTPCREL]
47  ud2

The extra code here relates to unwinding. The actual call edges don't show up, but what happens is that the Rust compiler generates some extra data that says, essentially, "if this call panics, return to here to run cleanup code." This happens through LLVM's invoke instruction, which takes a to label, which is the normal return path, and an unwind label that is used when the call throws an exception (a panic, in Rust terminology) and we need to unwind.

It looks to me like there are two different landing pads, one starting at line 31 and the other starting at line 38. I'm guessing which one is used depends on which call throws the exception. For example, the first one, on line 31, appears to only call box_free to deallocate the box without calling drop on the value inside it. On the other hand, the landing pad on line 38 calls drop_in_place to both drop the contents of the box and deallocate the box itself. Note that these steps are the same ones that happen in lines 14-21, except that there LLVM has inlined the call to drop_in_place. LLVM seems to be much more conservative about inlining in landing pads, which seems reasonable: landing pads are rarely executed, so saving code size is more important than unwinding as fast as possible.

At any rate, it's in that second landing pad where we see the mystery stores from lines 7 and 8 get used. On line 39 we move the stack pointer into rdi, which is the register used for the first argument to a function. If we look at the signature of drop_in_place, we see that it takes a *mut to the value we are dropping. Originally the Box<dyn Counter> was passed in the register pair rdi:rsi, but registers don't have a memory address. Thus, in order to be able to run drop during unwinding, we need to copy the box onto the stack so that stack slot can serve as its memory address.
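The signature is what forces this: ptr::drop_in_place takes a *mut T, so the value being dropped has to live somewhere addressable. A small runnable sketch (Noisy and the DROPPED flag are made up for illustration):

```rust
use std::mem::ManuallyDrop;
use std::ptr;
use std::sync::atomic::{AtomicBool, Ordering};

// Flag so we can observe the destructor running.
static DROPPED: AtomicBool = AtomicBool::new(false);

#[derive(Debug)]
struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) { DROPPED.store(true, Ordering::SeqCst); }
}

fn main() {
    let b: Box<dyn std::fmt::Debug> = Box::new(Noisy);
    // ManuallyDrop prevents a second, automatic drop at end of scope.
    let mut slot = ManuallyDrop::new(b);
    // drop_in_place needs a *mut to the Box, i.e. a memory address --
    // which is why the compiler spills the box to the stack before it
    // can be dropped during unwinding.
    unsafe { ptr::drop_in_place(&mut *slot as *mut Box<dyn std::fmt::Debug>) }
    assert!(DROPPED.load(Ordering::SeqCst));
}
```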

# Method calls with dyn*

Now that dyn* is available in nightly Rust, let's compare the assembly it generates.

Here's a variation of our example that uses dyn*.

fn use_counter(mut counter: dyn* Counter) -> usize {
    counter.increment_by(42);
    counter.get_value()
}


This compiles into the following assembly code:

 1  push rbx
 2  sub rsp, 16
 3  mov rax, rsi
 4  mov qword ptr [rsp], rdi
 5  mov qword ptr [rsp + 8], rsi
 6  mov rdi, rsp
 7  mov esi, 42
 8  call qword ptr [rax + 32]
 9  mov rax, qword ptr [rsp + 8]
10  mov rdi, rsp
11  call qword ptr [rax + 24]
12  mov rbx, rax
13  mov rax, qword ptr [rsp + 8]
14  mov rdi, rsp
15  call qword ptr [rax]
16  mov rax, rbx
17  add rsp, 16
18  pop rbx
19  ret

Much of this should look familiar. We have calls to something + 32 and something + 24, which correspond to the two method calls in earlier examples. Additionally, at the end we have a call to [rax], which corresponds to the call to the destructor, like we saw with the Box example.

But here the interesting part seems to be lines 4-6. It appears that we're copying our input argument that was stored across rdi and rsi onto the stack, and then passing that pointer as the self argument when we make the call through the vtable. But why?

To make it a little more clear, let's rewrite this example in UFCS:

fn use_counter(mut counter: dyn* Counter) -> usize {
    <dyn* Counter as Counter>::increment_by(&mut counter, 42);
    <dyn* Counter as Counter>::get_value(&counter)
}


Notice that before, we took counter as &mut dyn Counter, but in the function call we wrote <dyn Counter as Counter>:: without the &mut part. This time, counter comes in as a dyn* Counter and we use <dyn* Counter as Counter>:: to find the method. The two types match exactly this time, but this means the compiler has to insert an auto-borrow, because dyn* Counter is not a reference type while the methods take self by reference. This is why in the argument position we had to add &mut for increment_by and & for get_value, while we did not have to add those the last time we converted to UFCS.

The compiler inserts these auto-borrows basically any time you do a method call. In many cases they end up being completely free due to things like inlining, calling conventions, data layout, etc. In this case, they are not free and we end up copying the dyn* onto the stack so we can take its address. There are several reasons, but the main thing is that the compiler has no information about the underlying type and therefore it can't inline the method call.
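You can see the same auto-borrow with an ordinary sized type, where the inserted borrow typically ends up free; a quick sketch (BasicCounter is made up for illustration):

```rust
trait Counter {
    fn get_value(&self) -> usize;
}

struct BasicCounter { value: usize }

impl Counter for BasicCounter {
    fn get_value(&self) -> usize { self.value }
}

fn main() {
    let c = BasicCounter { value: 7 };
    // `c.get_value()` desugars to `Counter::get_value(&c)`: the compiler
    // inserts the borrow because the method takes &self but c is owned.
    // Here the compiler knows the concrete type, so it can inline the
    // call; with dyn* it cannot, and the borrow forces a stack copy.
    assert_eq!(c.get_value(), Counter::get_value(&c));
}
```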

# Conclusion: Can we make this better?

I'm going to wrap up this post by posing the question, "can we make this better?" I'm going to punt on answering that for now and hopefully save that for a later post. In short, we don't know, but there are some ideas that might help.

One challenge is that currently dyn* supports inline storage for small values. In other words, if you have a value that implements the trait and is pointer sized, like usize, then we can store the value directly in the dyn*'s data field rather than storing a pointer to it. This means that unlike &dyn Trait and friends, we don't know that a dyn* always wraps a pointer. Sometimes it does and sometimes it doesn't, and the two cases need to be handled differently. Currently we treat everything as an inline value, so we support the value case but at the cost of unnecessary indirection when the thing in a dyn* is already a pointer.

We could get around this by requiring that dyn* always wrap an actual pointer rather than just a pointer-sized value. That would probably make the implementation more complex, but it might be worth it: it makes the small value case more expensive, but given that the main use case for dyn* is to enable async functions in trait objects, and that I suspect pointer-sized futures will be relatively rare, this is probably the right tradeoff to make.

That said, we might be able to get the best of both worlds by adding some extra smarts in the form of conversion traits. Conversion traits really need their own blog post, but if you can't wait we do have some rough design notes from an earlier meeting on the issue.

There's still some disagreement about whether either of these ideas would even work, and if they do, whether they make the right tradeoff. In future posts I hope to help answer these questions by working through them in more depth.

# Acknowledgments

Thanks to Wesley Wiser for helping me make sense of some of the assembly dumps in this post.

1

A secondary motivation for dyn* is to leave dyn better than we found it. We have an opportunity to remove some of the weirdness and make trait objects work more like other types in Rust.

2

I'm legit impressed by this message. I was hoping to be able to say something like "look how inscrutable this error message is, that's because trait objects are confusing" but I should have known better.

3

Somewhat surprisingly to me, std::mem::size_of_val can read this field and tell you the size of the underlying data your &dyn Trait object is referencing even if you don't know the concrete type. See this playground example. I assume std::mem::align_of_val works similarly.
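For example, this sketch checks both on stable Rust:

```rust
use std::fmt::Debug;
use std::mem::{align_of, align_of_val, size_of, size_of_val};

fn main() {
    let x: usize = 7;
    let d: &dyn Debug = &x;
    // Even though d's static type has erased usize, both calls recover
    // the concrete layout by reading the vtable's size and alignment
    // slots at runtime.
    assert_eq!(size_of_val(d), size_of::<usize>());
    assert_eq!(align_of_val(d), align_of::<usize>());
}
```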

4

Rust also makes use of the Deref and DerefMut traits in calling these methods, which I've completely omitted in this post.