Friday Q&A 2012-11-09: dyld: Dynamic Linking On OS X by [Gwynne Raskind](http://blog.darkrainfall.org/)  

In the course of a recent job interview, I had an opportunity to study some of the internals of `dyld`, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.

* * *

**WARNING**  

_Because the precise details of how `dyld` works are quite complicated and change frequently, and because I don't yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you're curious about the particulars, I strongly recommend `dyld`'s source code, which is publicly available at [http://opensource.apple.com](http://opensource.apple.com)._

**Static linking**  

So, let's start by talking about static linking, generally referred to simply as 'linking'. This is the step that typically happens after compiling, where the object files (the machine language the compiler churned out from your source code) are 'linked' together into a single binary file.

Why does static linking matter to dynamic linking? Because the static linker, `ld` (and `ld64`), is responsible for transforming symbol references in your source code into indirect symbol lookups for `dyld` to use later. Here's a very simple example:

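    #include <stdio.h>

    // This is the actual full declaration of main() on OS X. The "apple"
    //  parameter is the path to the executable, i.e. _NSGetProgname().
    int main(int argc, char **argv, char **envp, char **apple)
    {
        // At -Os, clang turns this constant, newline-terminated printf() into
        //  a call to puts(), which is why _puts appears in the output below.
        printf("Hello, world!\n");
        return 0;
    }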


The (optimized) assembly for this, as generated by `clang -S test.c -o test.s -Os` and stripped of a bit of debug info, is:

            .section        __TEXT,__text,regular,pure_instructions
            .globl  _main
    _main:                                  ## @main
            pushq   %rbp
            movq    %rsp, %rbp
            leaq    L_str(%rip), %rdi
            callq   _puts
            xorl    %eax, %eax
            popq    %rbp
            ret
            .section        __TEXT,__cstring,cstring_literals
    L_str:                                  ## @str
            .asciz  "Hello, world!"


Seems straightforward enough. Let's compile it into an object file and dump the fully compiled version (`clang -c test.c -o test.o -Os`, `otool -tv test.o`):

    _main:
    0000000000000000        pushq   %rbp
    0000000000000001        movq    %rsp,%rbp
    0000000000000004        leaq    0x00000000(%rip),%rdi
    000000000000000b        callq   0x00000010
    0000000000000010        xorl    %eax,%eax
    0000000000000012        popq    %rbp
    0000000000000013        ret

Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the `leaq` instruction, the result is a load from the current value of `rip`. The `callq` instruction is a "signed offset" jump, which means that the offset of 0 calls the very next instruction in the code (address `0x10` in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (`otool -r test.o`):

    Relocation information (__TEXT,__text) 2 entries
    address  pcrel length extern type    scattered symbolnum/value
    0000000c 1     2      1      2       0         4
    00000007 1     2      1      1       0         0


The first entry says, "At offset `0xc` in the `__TEXT,__text` section, there is an unscattered, external, PC-relative `X86_64_RELOC_BRANCH` reference of length 'long word' to the symbol at index 4 in the symbol table." A peek at the symbol table (`nm -ap`) gives us:

    0000000000000014 s L_str
    0000000000000048 s EH_frame0
    0000000000000000 T _main
    0000000000000060 S _main.eh
                     U _puts

The symbol at index 4 (the fifth entry) is `_puts`. Similarly, the symbol at index 0 is `L_str`, which will be relocated at offset `0x7` of the `__TEXT,__text` section (three bytes into the `leaq` instruction). Finally, let's look at the result of linking this object into an executable (`clang test.c -o test -Os`, `otool -tv test`):

    _main:
    0000000100000f36        pushq   %rbp
    0000000100000f37        movq    %rsp,%rbp
    0000000100000f3a        leaq    0x00000029(%rip),%rdi
    0000000100000f41        callq   0x100000f4a
    0000000100000f46        xorl    %eax,%eax
    0000000100000f48        popq    %rbp
    0000000100000f49        ret

`ld` has:

1.  Located the `__TEXT` segment at the standard executable load address for `x86_64`, `0x0000000100000000`, and the `__TEXT,__text` section at `0xf36` after that. The first `0xf35` (actually, `0xa0f`, since the larger offset doesn't account for the file's Mach-O header) bytes of `__TEXT` are zeroed out. This aligns the `__TEXT` segment flush up against the `__DATA` segment. I don't know exactly why this is done, though I assume it has something to do with cache efficiency.
2.  Replaced `0` with the actual offset from the `leaq` instruction to the `L_str` symbol, which in this case is `0x29`. The offset is applied to the address of the following instruction, `0x100000f41`, so the resulting address is `0x100000f6a`, which a peek at the load commands (`otool -l test`) tells us is the exact beginning of the `__TEXT,__cstring` section.
3.  Replaced `0` with the address of the symbol _stub_ for `puts()`, which comes immediately after `main`. Another peek at the load commands puts this in the `__TEXT,__stubs` section, which we'll look at in detail later.

Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity's sake I won't go into it here.

**What does `dyld` do, anyway?**  

`dyld` is actually responsible for quite a bit of work, all told. It (in roughly this order):

1.  Bootstraps itself based on the very simple raw stack set up for the process by the kernel.
2.  Recursively and cachingly loads all dependent dynamic libraries the executable links to into the process' memory space, including any necessary perusal of search paths from both the environment and the executable's "runpaths".
3.  Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.
4.  Runs static initializers for the executable.
5.  Sets up the parameters to the executable's `main` function and calls it.
6.  During the process' execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the `dl*()` API), and provides hooks for `gdb` and other debuggers to get critical information.
7.  Runs static terminator routines after `main` returns.
8.  In some scenarios, makes the required call to `libSystem`'s `_exit` routine once `main` returns.

I'll examine each step roughly in order.

**Bootstrap**  

`dyld` is the very first code run in a new process. In particular, a symbol by the very descriptive name of `__dyld_start` is called. This happens due to a bit of magic in the kernel which notices the `LC_LOAD_DYLINKER` load command in the main executable and uses the given dynamic linker's entry symbol as the process' initial instruction pointer. `__dyld_start` performs the following pseudocode (the actual implementation is a compact bit of assembly code):

    noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
    {
        stack push 0 // debugger end of frames marker
        stack align 16 // SSE align stack
        uint64_t slide = __dyld_start - __dyld_start_static;
        void *glue = NULL;
        void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
        if (glue)
            push glue // pretend the return address is a glue routine in dyld
        else
            stack restore // undo stack stuff we did before
        goto *entry(argc, argv, envp, apple); // never returns
    }

In retrospect, I'm not sure that pseudocode is any more sensible than the assembly would have been, but let's walk through it quickly:

1.  Push a 0 onto the stack, and align the stack to SSE requirements.
2.  Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of `__dyld_start`.
3.  Run `dyld`'s actual bootstrap routine, which sets up some minimal state for `dyld` itself (such as pulling in certain functions from `libSystem` without actually linking to it and setting up Mach messaging) and then runs `dyld`'s real `main` routine, which does loading, linking, and initializers.
4.  If `dyld` detected that the main executable uses the `LC_MAIN` load command to set up its entry point, it returns the address of a glue routine which is responsible for calling `_exit` when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it's the routine's return address; the `ret` instruction at the end of that function will jump to that glue code.
5.  If, on the other hand, `dyld` detected the executable using the older `LC_UNIXTHREAD` load command, it simply restores the stack to its original state and jumps to that entry point, which will be the `start` routine from crt1.o, the C runtime. The C runtime basically redoes all the work that `__dyld_start` just did, minus the actual `dyld` startup, which is one of the reasons it was replaced with the `LC_MAIN` command.
6.  Jump to the entry point.

**Loading**  

Each time `dyld` has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, "linking" sure has a lot of different uses, doesn't it?)

Locating the correct binary on disk is _usually_ fairly simple. The `LC_LOAD_DYLIB` command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path contains a special marker that tells `dyld` to look somewhere else:

*   `@executable_path` - Up to OS X 10.3, this was the only marker `dyld` supported, and it had rather limited utility. `dyld` will replace this marker with the full path to the directory containing the main executable.
*   `@loader_path` - Added in 10.4, this marker is replaced with the full path to the directory containing the binary which loaded the binary that is currently being loaded. This is not always the main executable, and primarily enabled frameworks to themselves embed frameworks without resorting to the "umbrella framework" mechanism, which Apple never made entirely public and actively discouraged the use of.
*   `@rpath` - When this marker was added in 10.5, there was much rejoicing. This marker is replaced in sequence with each "run path" (an `LC_RPATH` load command) embedded in the chain of binaries that led to this one being loaded, enabling frameworks and dynamic libraries to finally be built only once and be used for both system-wide installation and embedding without changes to their install names, and allowing applications to provide alternate locations for a given library, or even override the location specified for a deeply embedded library.

There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.

**Linking**  

Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.

At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It's not complicated; a lazy symbol's binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.

Conceptually, binding a symbol is simple. In practice, it's rather interesting:

1.  Look up, in the binding information of the `__LINKEDIT` segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for `_puts` was at `0xf4a` (plus some, I'm shortening for simplicity's sake!). If we were to disassemble the machine code at that address, we would get:

        Contents of (__TEXT,__stubs) section
        0000000100000f4a        jmp     *0x000000c0(%rip)
        Contents of (__TEXT,__stub_helper) section
        0000000100000f50        leaq    0x000000b1(%rip),%r11
        0000000100000f57        pushq   %r11
        0000000100000f59        jmp     *0x000000a1(%rip)
        0000000100000f5f        nop
        0000000100000f60        pushq   $0x00000000
        0000000100000f65        jmp     0x100000f50

    Wow, a nice simple jump instruction! Unfortunately, it's not _quite_ as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is...

2.  Look up, also in the binding information, the address of the symbol pointer for `puts` in the `__DATA,__nl_symbol_ptr` section. If this is a lazy symbol, look it up in the `__DATA,__la_symbol_ptr` section instead. In our example executable, these sections look simply like this (using a hybrid of `otool`'s output):

        Contents of (__DATA,__nl_symbol_ptr) section
        0000000100001000        dq      0x0000000000000000
        0000000100001008        dq      0x0000000000000000
        Contents of (__DATA,__la_symbol_ptr) section
        0000000100001010        dq      0x0000000100000f60

    In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!

3.  Update the address of the symbol pointer in the appropriate `__DATA` section to the real address of the symbol in the loaded library. You're done!

So what, you may be asking, are all this crazy indirection and all these extra sections all about?

Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can't put writable data in the `__TEXT` segment, which holds executable code and is mapped read-only. This means you can't update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Secondly, you can't put executable code in the `__DATA` segment, which is writable data! So you can't just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.

All this is true of lazily-bound symbols as well, but with a few caveats. `dyld` does _not_ immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn't a simple 0, but rather points to the "stub helper". The stub helper is a bit of code embedded in the `__TEXT,__stub_helper` section (really? who'd've guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for `dyld`'s internal symbol binder. It doesn't show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to `dyld`. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.

**Static initializers, static terminators, and runtime services**  

Most of the interesting stuff has already happened at this point. `dyld` will run any static initializers in the executable (most often constructors for global C++ objects and `+load` methods for Objective-C classes, though there are also `__attribute__((constructor))` functions for plain C). A list of initializers is stored in a separate `__DATA,__mod_init_func` section in the binary, and is simply a set of addresses into the `__TEXT,__text` section which `dyld` calls in order of appearance. Initializer functions are passed the same arguments as `main`.

When the process exits, `dyld` will also run static terminators, which mostly means static destructors for C++ objects and `__attribute__((destructor))` functions. These are handled just like static initializers, except that they're stored in `__DATA,__mod_term_func` and take no parameters. Static terminators run in the same context as an `atexit()` function.

Finally, `dyld` provides runtime services to binaries it has loaded. The `dl*()` APIs are the preferred interface to `dyld`'s services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated):

*   `dlopen` - Performs the load stage of loading a dynamic library, can optionally partially or completely perform the bind stage.
*   `dlsym` - Look up a symbol in a dynamic library (or the entire process). At its simplest, this is no more than a "name to address" lookup.
*   `dladdr` - The inverse of `dlsym`, transforming an address into a set of symbol information.
*   `dlclose` - Unloads a dynamic library from the process, if no other handles to it are in use. Unloading invalidates all the symbols provided by the dynamic library and can be something of a touchy operation, particularly in an Objective-C environment.

**What's missing**  

While I've gone over quite a bit, I've also left out a _lot_ of information in this article:

*   Two-level namespaces, which prevent trivial symbol collisions in dynamic libraries
*   The `dyld` shared cache, which maintains a systemwide map of already-loaded dynamic libraries for fast binding
*   Rebasing
*   Code signing
*   Dynamic library linking
*   `dyld`'s expansive set of environment variables
*   "Restricted" binaries (particularly `setuid` binaries)
*   Most of the kernel's interaction with `dyld`
*   Compression and encryption in Mach-O binaries
*   How `dyld` itself is built
*   Symbol interposing
*   `dyld`'s operation on i386 and ARM, which is conceptually the same, but both architectures differ significantly in the details
*   Details of the Mach-O binary format
*   How "fat" binaries are handled

I've left these out for two reasons: One, I was a bit behind when writing this article and just didn't have time to put it all in, and two, there really isn't space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and `dyld` are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn't seem too interested in updating the documentation):

[Apple's Mach-O documentation](https://developer.apple.com/library/mac/#documentation/developertools/conceptual/MachOTopics/0-Introduction/introduction.html)  
 [Apple's Mach-O reference](https://developer.apple.com/library/mac/#documentation/developertools/conceptual/MachORuntime/Reference/reference.html)  
 [The Mach-O "loader" header, a very good reference (also look at other files in the `mach-o/` directory)](file:///usr/include/mach-o/loader.h)  
 [Apple's dyld Reference](https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html)  
 [The dlopen(3) manpage](http://developer.apple.com/library/Mac/#documentation/Darwin/Reference/ManPages/man3/dlopen.3.html)  
 [dyld's Release Notes](http://developer.apple.com/library/mac/#releasenotes/DeveloperTools/RN-dyld/_index.html)  
 [dyld's source code as of 10.8.2](http://opensource.apple.com/source/dyld/dyld-210.2.3/)  
 [Kernel source code as of 10.8.2 (look at `bsd/kern/kern_exec.c` and `bsd/kern/mach_loader.c` in particular)](http://opensource.apple.com/source/xnu/xnu-2050.18.24/)  

**Conclusion**  

`dyld` is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and `dyld` has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of `dyld`, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you'll have a better idea of where to look for the problem. Then again, maybe not; `ld` can be pretty obstructive.

That's all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!

