By: Andrew Halle Repo: byo-libffi
Part of build-your-own
How does code written in one language (say, Python) call written in a different language (say, C). It's clear that this is necessary (e.g. for perforance concerns) but, if you're anything like me, it seems like black magic that this should be possible. How can this possibly work.
A Foreign Function Interface is the means why which code written in one language calls code written in another language.
libffi is a low-level library that implements the calling convention for a particular platform. In this post I'll re-implement a subset of the functionality provided by libffi and prove that it works by replacing the real libffi and showing that the Python interpreter still works.
How does a function actually get called? On x86, we have the call instruction,
which has the following definition:
Saves procedure linking information on the stack and branches to the called procedure specified using the target operand.
We can see this in action via a simple example (Compiler Explorer is a great tool for looking at compiler output).
| C | Assembly |
|---|---|
xint add(int x, int y) { return x + y;}int main(void) { int x = add(1, 2);}
xxxxxxxxxxadd: push rbp mov rbp, rsp mov DWORD PTR [rbp-4], edi mov DWORD PTR [rbp-8], esi mov edx, DWORD PTR [rbp-4] mov eax, DWORD PTR [rbp-8] add eax, edx pop rbp retmain: push rbp mov rbp, rsp sub rsp, 16 mov esi, 2 mov edi, 1 call add mov DWORD PTR [rbp-4], eax mov eax, 0 leave retThis simple example reveals a lot about what the compiler does for us. Let's look closer at this snippet in particular.
xxxxxxxxxxmov esi, 2mov edi, 1call addWhy is the compiler putting the arguments to our add function into registers? Why these
registers, in this order? This is because of the calling convention as previously mentioned.
The calling convention is the contract between compilers that allows separate compilation to
happen. The add function can assume that its first argument will be in edi and its second
argument will be in esi (more on this later). Moreover, the calling convention allows the calling
code (in this case, the main function) allows the calling code to assume that, after a
function is called, its return value will be in eax.
When a dynamic language like Python needs to call into some already compiled C code, how does it marshall arguments into the appropriate registers? This is the job of libffi, and what we'll implement in the following sections.
In order to call native code, the Python interpreter must load in a previously compiled
shared library (a .so file on Linux). This is accomplished via the functions dlopen and
dlsym. dlopen loads a shared object and returns a void* handle to that object. The
dlsym function takes a handle to a shared object and a string symbol to search for in
that shared object. As an example, to call an add function in a shared library libadd.so
you can write the following C code:
xxxxxxxxxxtypedef int (*add_func)(int, int);int main(void) { void *handle = dlopen("./libadd.so", RTLD_NOW); add_func add = (add_func) dlsym(handle, "add"); printf("1 + 2 = %d\n", add(1, 2));}This code opens the shared object (loading it immediately because of the flag RTLD_NOW),
grabs a pointer to the add function, casts that void pointer to a pointer to a function
taking two ints and returning an int (notice the typedef at the top of the file).
The typedef at the top of the file gives the compiler the information to generate code
to call this function. Our first goal will be to call this function without the typedef
(so we have to write the code to call this function, the compiler can't help us).
Let's start by defining a struct to hold some information for us, namely the address of the function to call, and the arguments we're going to pass into that function.
xxxxxxxxxxtypedef struct { int x; int y; void *func;} callable;If this were a real library, storing the arguments inside the callable would be a bad
idea, because we might want to call this function with different arguments. However, it
will suffice for now.
We'd like to be able to write code like this:
xxxxxxxxxxint main(void) { void *handle = dlopen("./libadd.so", RTLD_NOW); void *add = dlsym(handle, "add"); callable c = { 1, 2, add }; int retval; runtime_call(&c, &x); printf("1 + 2 = %d\n", retval);}What about this mysterious function runtime_call? runtime_call needs to do a few things:
rdirsicall the function pointed to by funceax) into &retvalSince we need direct register access (and the ability to issue a raw call) we'll need to
write this function in assembly (I use the NASM assembler).
Compiler Explorer can do most of the work for us here. This example:
| C | Assembly |
|---|---|
xxxxxxxxxxint add(int x, int y) { return x + y;}typedef int (*add_func)(int, int);typedef struct { int x; int y; void *func;} callable;void runtime_call(void* c, void* ret) { callable *c1 = (callable*) c; int *r1 = (int*) ret; add_func a = (add_func) c1->func; *r1 = a(c1->x, c1->y);}int main() { callable c = { 1, 2, add }; int x; runtime_call(&c, &x); printf("1 + 2 = %d\n", x);}
xxxxxxxxxxadd: push rbp mov rbp, rsp mov DWORD PTR [rbp-4], edi mov DWORD PTR [rbp-8], esi mov edx, DWORD PTR [rbp-4] mov eax, DWORD PTR [rbp-8] add eax, edx pop rbp retruntime_call: push rbp mov rbp, rsp sub rsp, 48 mov QWORD PTR [rbp-40], rdi mov QWORD PTR [rbp-48], rsi mov rax, QWORD PTR [rbp-40] mov QWORD PTR [rbp-8], rax mov rax, QWORD PTR [rbp-48] mov QWORD PTR [rbp-16], rax mov rax, QWORD PTR [rbp-8] mov rax, QWORD PTR [rax+8] mov QWORD PTR [rbp-24], rax mov rax, QWORD PTR [rbp-8] mov edx, DWORD PTR [rax+4] mov rax, QWORD PTR [rbp-8] mov eax, DWORD PTR [rax] mov rcx, QWORD PTR [rbp-24] mov esi, edx mov edi, eax call rcx mov rdx, QWORD PTR [rbp-16] mov DWORD PTR [rdx], eax nop leave ret.LC0: .string "1 + 2 = %d\n"main: push rbp mov rbp, rsp sub rsp, 32 mov DWORD PTR [rbp-16], 1 mov DWORD PTR [rbp-12], 2 mov QWORD PTR [rbp-8], OFFSET FLAT:add lea rdx, [rbp-20] lea rax, [rbp-16] mov rsi, rdx mov rdi, rax call runtime_call mov eax, DWORD PTR [rbp-20] mov esi, eax mov edi, OFFSET FLAT:.LC0 mov eax, 0 call printf mov eax, 0 leave retclearly parallels what we're trying to do. And indeed, if we copy the assembly output
into a file runtime-call.s, build it with nasm -f elf64 runtime-call.s and link the
resulting object file with our main.c we'll successfully call this function! (this state
of the code is given by commit ca946232a8545bdb7389be7159abf504d3f5a168
of the repo for this post).
Okay, the last section was kind of cheating. We only allowed for one possible signature of
the function we might call, so we could copy the compiler output to put the arguments in
the right place. We haven't done anything yet! In this section, we'll make our runtime_call
function generic enough to handle functions that take any number of ints as arguments, and
return a single int.
(the leap from here to supporting functions that take and return variables of different type is not trivial, but I think also not very instructive. In order to actually finish this post, I decided to stop here and restrict my libffi to only work with functions of this type. For more information on supporting functions that take arbitrary types of arguments, see one of the links in the Resources)
For this section, we'll actually need to figure out what the calling convention on our platform
is (I'm using Linux, and include a Vagrantfile for a Linux VM in the repo). We'll be
implementing the calling convention for 64bit linux, which is known officially as
"System V AMD64 ABI" (see this link for
more information).
We're interested in the following aspects of this ABI (the following points are quotes from the above link)
On x86-64 Linux, the first six function arguments are passed in registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9, respectively. The seventh and subsequent arguments are passed on the stack, about which more below. The return value is passed in register %rax.
That link uses AT&T syntax for registers, while Compiler Explorer and NASM default to using Intel syntax.
If we imagine we have the following high-level functions (Python syntax for clarity):
push_stack(value): pushes value onto the hardware stack, equivalent to the push
instructionoverwrite_register(register, value): puts value into register (where register
is the string name of the register), equivalent to the instruction mov reg, valuecall(func): jump to the function at address func, equivalent to the instruction
callregister_value(register): returns the value of register (where register is the
string name of the register)then we could implement this ABI in the following way
xxxxxxxxxxdef runtime_call(func, args): argument_registers = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9'] # push additional arguments onto the stack in reverse order while len(args) > 6: push_stack(args.pop()) # put first six arguments in the correct registers for i in range(len(args)): overwrite_register(argument_registers[i], args[i]) call(func) return register_value('rax')There's one complication to writing this function in assembly. The overwrite_register
function takes a string description of the register and loads a value into it. We don't
want to actually resort to string comparisons to determine the register we're loading into.
We'll instead include the instructions for loading all 6 arguments into registers in the
runtime_call function, and then jump over the ones we don't want to execute. Since assembly
executes instructions sequentially, this means we have to list out these instructions in reverse
order (since if we have less than 6 arguments, we'll have to jump over the instructions which
load the later arguments that we don't actually have).
The interesting part of the assembly is listed below (look in the repo for the full function):
xxxxxxxxxx.loop_start: cmp eax, 6 ; we want to do this for as long as the number of arguments jle .loop_done ; (currently in eax) is greater than 6 mov r10d, DWORD [rbx + 4 * (rax - 1)] ; rbx stores the pointer to the arguments push r10 dec eax jmp .loop_start.loop_done: cmp eax, 6 ; since 64bit NASM doesn't allow relative jumps (as je .six_args ; far as I could tell) we need to explicitly label cmp eax, 5 ; the lines to jump to, and figure out which one je .five_args ; we're jumping to with comparisons on n_args cmp eax, 4 je .four_args cmp eax, 3 je .three_args cmp eax, 2 je .two_args cmp eax, 1 je .one_arg cmp eax, 0 je .zero_args.six_args: mov r9d, DWORD [rbx + 4 * 5] ; order of registers is the calling convention.five_args: mov r8d, DWORD [rbx + 4 * 4].four_args: mov ecx, DWORD [rbx + 4 * 3].three_args: mov edx, DWORD [rbx + 4 * 2].two_args: mov esi, DWORD [rbx + 4 * 1].one_arg: mov edi, DWORD [rbx + 4 * 0].zero_args: mov rbx, QWORD [rbp-40] ; move addr of function into rbx mov rbx, QWORD [rbx] call rbx mov rbx, QWORD [rbp-48] ; move addr of retval into rbx mov DWORD [rbx], eax ; move retval (currently in eax) into retvalSome notes about this assembly:
eax refers to the 32bit version
of rax (the upper 32 bits will be zeroed) and r8d refers to the 32bit version of r8
(the d for "double word" indicates this is 4 bytes, as the x86 word size is 16 bits)DWORD because we've previously stated we only want to work with ints
which on my platform (not all platforms!) is 4 bytes.Our C code will also have to change to support an arbitrary number of arguments. We'll implement
this by changing the structure to include a pointer to the args, and a couple of functions to
initialize this structure with a malloc'd array and add args to it.
xxxxxxxxxxtypedef struct { void *func; int *args; int n_args;} callable;void init_callable(callable *c) { c->args = (int*) malloc(16 * sizeof(int)); c->n_args = 0;}void add_arg_callable(callable *c, int arg) { c->args[c->n_args] = arg; c->n_args++;}We can now write C code like the following:
xxxxxxxxxxint main(void) { void *handle = dlopen("./libadd.so", RTLD_NOW); void *add = dlsym(handle, "add"); callable c = { add }; init_callable(&c); add_arg_callable(&c, 1); add_arg_callable(&c, 2); add_arg_callable(&c, 3); add_arg_callable(&c, 4); add_arg_callable(&c, 5); add_arg_callable(&c, 6); add_arg_callable(&c, 7); add_arg_callable(&c, 8); add_arg_callable(&c, 9); add_arg_callable(&c, 10); int x; runtime_call(&c, &x); printf("Result: %d\n", x);}where the add function has been separately compiled to have the appropriate number of
arguments.
And that's it! We can now call dynamically loaded functions at runtime, without specifying their signatures at compile time. All that's left to do is make this compatible with the real libffi, and we should be able to use our library with Python.
First, let me prove that the Python interpreter does in fact use libffi to accomplish this task. Let us consider the test program
xxxxxxxxxxfrom ctypes import cdlllibadd = cdll.LoadLibrary('./libadd.so')result = libadd.add(1,2,3,4,5,6,7,8,9,10)print(result)where libadd.so is our previously compiled add function, in this case, with 10 arguments.
Running this in our VM (vagrant up; vagrant ssh; python test.py) produces the correct answer
of 55.
Now, let's look for libffi.so (we know from the libffi docs that we link with libffi using
-lffi so the file is probably called libffi.so per convention). If we run
xxxxxxxxxxldconfig -p | grep ffiwe'll see the output
xxxxxxxxxxlibffi.so.7 (libc6,x86-64) => /usr/lib/libffi.so.7libffi.so (libc6,x86-64) => /usr/lib/libffi.so
So if we compile our libffi.so and replace these files, we can inject our version of
libffi into the system.
NOTE: don't actually do this on your system. libffi is incredibily important for a lot of things (even LLVM links with libffi) so if you remove it on your actual computer, you won't be able to do a lot of important things. I've included a Vagrantfile for testing, you can do this on any virtual machine
I've restructured our implementation of libffi into ffi.h and ffi.c, which we can now
compile into libffi.so with
xxxxxxxxxxnasm -f elf64 runtime-call.sclang -shared -fPIC -o libffi.so ffi.c runtime-call.oIf we replace the system libffi (again, not on your actual system) with our compiled artifact, and try to run the test python program, we'll get the following error
xxxxxxxxxx$ python test.pyTraceback (most recent call last):File "test.py", line 1, in <module>from ctypes import cdllFile "/usr/lib/python3.8/ctypes/__init__.py", line 7, in <module>from _ctypes import Union, Structure, ArrayImportError: /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so: undefined symbol: ffi_closure_alloc, version LIBFFI_CLOSURE_7.0
Okay, now we're getting somewhere. This error is happening because Python is loading in our
libffi.so and looking for symbols that don't currently exist, but should exist according
to the API offered by libffi. Let's look at the man pages for libffi
xxxxxxxxxx$ man ffiFFI(3) BSD Library Functions Manual FFI(3)NAMEFFI — Foreign Function InterfaceLIBRARYlibffi, -lffiSYNOPSIS#include <ffi.h>ffi_statusffi_prep_cif(ffi_cif *cif, ffi_abi abi, unsigned int nargs, ffi_type *rtype, ffi_type **atypes);voidffi_prep_cif_var(ffi_cif *cif, ffi_abi abi, unsigned int nfixedargs, unsigned int ntotalargs,ffi_type *rtype, ffi_type **atypes);voidffi_call(ffi_cif *cif, void (*fn)(void), void *rvalue, void **avalue);DESCRIPTIONThe foreign function interface provides a mechanism by which a function can generate a call to an‐other function at runtime without requiring knowledge of the called function's interface at com‐pile time.SEE ALSOffi_prep_cif(3), ffi_prep_cif_var(3), ffi_call(3)February 15, 2008
Clearly we'll need a struct ffi_cif (which is the equivalent of our callable) and functions
ffi_prep_cif (the equivalent of our init function) and ffi_call (the equivalent of our
runtime_call). Note, I didn't learn all this just from the output above, I had to read the other
man pages for libffi (listed in the SEE ALSO).
One interesting thing to note, I don't see ffi_closure_alloc anywhere in this API. As far as I
can tell, these are optional symbols included by libffi for offering closure type functionality.
I think I got incredibly lucky in that they don't actually have to do anything for the simple case
we've already implemented, they just have to exist. So, we can define empty functions with these
names (going back and forth with python to see what it complains about missing) to move forward.
The resulting code is:
xxxxxxxxxxvoid ffi_closure_alloc(void) {}void ffi_closure_free(void) {}void ffi_prep_cif(void) {}void ffi_call(void) {}void ffi_prep_closure_loc(void) {}When we now run our test script, we get
xxxxxxxxxx$ python test.py$ python test.pyTraceback (most recent call last):File "test.py", line 4, in <module>result = libadd.add(1,2,3)RuntimeError: ffi_prep_cif failed
Okay, at this point, we have to actually look at the header file ffi.h provided by libffi
to figure out what value we're supposed to return from ffi_prep_cif to indicate success,
and what ffi_prep_cif and ffi_call are supposed to do.
ffi_prep_cif is supposed to set up the ffi_cif object with the types the function is
supposed to take. Since we've restricted ourselves to functions that only take integers and
return an integer, we're only interested in n_args.
ffi_call passes a pointer to an array of void* which are the arguments we're passing to
the function, and a void* indicating where to put the reuslt. At this point, you may notice
our callable struct is not exactly mappable to ffi_cif because the callable struct stores
its arguments, whereas ffi_cif does not. We can get around this in a hacky way (if we were
writing an actual library, I would actually fix this). We'll do the following: when ffi_prep_cif
is called, store the number of arguments we're expecting. When ffi_call is called, grab n_args
from the callable, then re-init it, and add the arguments using our previously defined function.
Then, call runtime_call to actually invoke the function.
The final code is below:
xxxxxxxxxx// ffi.htypedef enum { FFI_OK = 0,} ffi_status;// only one supportedtypedef enum { SYSVAMD64 = 0,} ffi_abi;typedef struct { void *func; int *args; int n_args;} callable;typedef callable ffi_cif;void init_callable(callable *c);void add_arg_callable(callable *c, int arg);void runtime_call(void *c, void *ret);// ffi.cvoid init_callable(callable *c) { c->args = (int*) malloc(16 * sizeof(int)); c->n_args = 0;}void add_arg_callable(callable *c, int arg) { c->args[c->n_args] = arg; c->n_args++;}void ffi_closure_alloc(void) {}void ffi_closure_free(void) {}ffi_status ffi_prep_cif( ffi_cif *cif, ffi_abi abi, unsigned int nargs, void *ignored1, void *ignored2) { cif->n_args = nargs; return FFI_OK;}void ffi_call(ffi_cif *cif, void *fn, void *retval, void **args) { unsigned int n_args = cif->n_args; cif->func = fn; init_callable(cif); // actually add the arguments to the callable for (int i = 0; i < n_args; i++) { add_arg_callable(cif, *((int*) args[i])); } runtime_call((void*) cif, retval);}void ffi_prep_closure_loc(void) {}}When we compile this, move it to the path normally occupied by libffi, and run our test script, it works! We've successfully written an implementation of libffi that's API-compatibile with the real libffi and supports a subset of it's functionality.
At this point, we're done. We've accomplished our goal of calling functions at runtime without specifying their type signatures at compile-time, we've made our clone API-compatible with the real libffi (through a pattern I quite like, we defined our own API that makes sense to us, then written a compatibility layer for backward compatibility), and we've (if you're at all like me) become extremely grateful that high quality implementations of this functionality already exist. Every new language that wants to offer a foreign function interface doesn't have to write their own implementation of this, and libffi is written with more platforms in mind than just x86, so this will work across many platforms.
I made use of the following links/resources while writing this post: