5. linux/arch/i386/boot/compressed/head.S

We are in bvmlinux now! With the help of misc.c:decompress_kernel(), we are going to decompress piggy.o to get the resident kernel image linux/vmlinux.

This file is of pure 32-bit startup code. Unlike previous two files, it has no ".code16" statement in the source file. Refer to Using as: Writing 16-bit Code for details.

5.1. Decompress Kernel

The segment base addresses in segment descriptors (which correspond to segment selector __KERNEL_CS and __KERNEL_DS) are equal to 0; therefore, the logical address offset (in segment:offset format) will be equal to its linear address if either of these segment selectors is used. For zImage, CS:EIP is at logical address 10:1000 (linear address 0x1000) now; for bzImage, 10:100000 (linear address 0x100000).

As paging is not enabled, linear address is identical to physical address. Check IA-32 Manual (Vol.1. Ch.3.3. Memory Organization, and Vol.3. Ch.3. Protected-Mode Memory Management) and Linux Device Drivers: Memory Management in Linux for address issue.

It comes from setup.S that BX=0 and ESI=INITSEG<<4.

.text
///////////////////////////////////////////////////////////////////////////////
startup_32()
{
        cld;
        cli;
        DS = ES = FS = GS = __KERNEL_DS;
        SS:ESP = *stack_start;  // end of user_stack[], defined in misc.c
        // all segment registers are reloaded after protected mode is enabled

        // check that A20 really IS enabled
        EAX = 0;
        do {
1:              DS:[0] = ++EAX;
        } while (DS:[0x100000]==EAX);

        EFLAGS = 0;
        clear BSS;                              // from _edata to _end

        struct moveparams mp;                   // subl $16,%esp
        if (!decompress_kernel(&mp, ESI)) {     // return value in AX
                restore ESI from stack;
                EBX = 0;
                goto __KERNEL_CS:100000;
                // see linux/arch/i386/kernel/head.S:startup_32
        }

        /*
         * We come here, if we were loaded high.
         * We need to move the move-in-place routine down to 0x1000
         * and then start it with the buffer addresses in registers,
         * which we got from the stack.
         */
3:      move move_rountine_start..move_routine_end to 0x1000;
        // move_routine_start & move_routine_end are defined below

        // prepare move_routine_start() parameters
        EBX = real mode pointer;        // ESI value passed from setup.S
        ESI = mp.low_buffer_start;
        ECX = mp.lcount;
        EDX = mp.high_buffer_star;
        EAX = mp.hcount;
        EDI = 0x100000;
        cli;                    // make sure we don't get interrupted
        goto __KERNEL_CS:1000;  // move_routine_start();
}

/* Routine (template) for moving the decompressed kernel in place,
 * if we were high loaded. This _must_ PIC-code ! */
///////////////////////////////////////////////////////////////////////////////
move_routine_start()
{
        move mp.low_buffer_start to 0x100000, mp.lcount bytes,
          in two steps: (lcount >> 2) words + (lcount & 3) bytes;
        move/append mp.high_buffer_start, ((mp.hcount + 3) >> 2) words
        // 1 word == 4 bytes, as I mean 32-bit code/data.

        ESI = EBX;              // real mode pointer, as that from setup.S
        EBX = 0;
        goto __KERNEL_CS:100000;
        // see linux/arch/i386/kernel/head.S:startup_32()
move_routine_end:
}

For the meaning of "je 1b" and "jnz 3f", refer to Using as: Local Symbol Names.

Didn't find _edata and _end definitions? No problem, they are defined in the "internal linker script". Without -T (--script=) option specified, ld uses this builtin script to link compressed/bvmlinux. Use "ld --verbose" to display this script, or check Appendix B. Internal Linker Script.

Refer to Using LD, the GNU linker: Command Line Options for -T (--script=), -L (--library-path=) and --verbose option description. "man ld" and "info ld" may help too.

piggy.o has been unzipped and control is passed to __KERNEL_CS:100000, i.e. linux/arch/i386/kernel/head.S:startup_32(). See Section 6, “linux/arch/i386/kernel/head.S”.

#define LOW_BUFFER_START      0x2000
#define LOW_BUFFER_MAX       0x90000
#define HEAP_SIZE             0x3000
///////////////////////////////////////////////////////////////////////////////
asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
|-- setup real_mode(=rmode), vidmem, vidport, lines and cols;
|-- if (is_zImage) setup_normal_output_buffer() {
|       output_data      = 0x100000;
|       free_mem_end_ptr = real_mode;
|   } else (is_bzImage) setup_output_buffer_if_we_run_high(mv) {
|       output_data      = LOW_BUFFER_START;
|       low_buffer_end   = MIN(real_mode, LOW_BUFFER_MAX) & ~0xfff;
|       low_buffer_size  = low_buffer_end - LOW_BUFFER_START;
|       free_mem_end_ptr = &end + HEAP_SIZE;
|       // get mv->low_buffer_start and mv->high_buffer_start
|       mv->low_buffer_start = LOW_BUFFER_START;
|       /* To make this program work, we must have
|        *   high_buffer_start > &end+HEAP_SIZE;
|        * As we will move low_buffer from LOW_BUFFER_START to 0x100000
|        *   (max low_buffer_size bytes) finally, we should have
|        *   high_buffer_start > 0x100000+low_buffer_size; */
|       mv->high_buffer_start = high_buffer_start
|           = MAX(&end+HEAP_SIZE, 0x100000+low_buffer_size);
|       mv->hcount =  0 if (0x100000+low_buffer_size >  &end+HEAP_SIZE);
|                  = -1 if (0x100000+low_buffer_size <= &end+HEAP_SIZE);
|       /* mv->hcount==0 : we need not move high_buffer later,
|        *   as it is already at 0x100000+low_buffer_size.
|        * Used by close_output_buffer_if_we_run_high() below. */
|   }
|-- makecrc();          // create crc_32_tab[]
|   puts("Uncompressing Linux... ");
|-- gunzip();
|   puts("Ok, booting the kernel.\n");
|-- if (is_bzImage) close_output_buffer_if_we_run_high(mv) {
|       // get mv->lcount and mv->hcount
|       if (bytes_out > low_buffer_size) {
|           mv->lcount = low_buffer_size;
|           if (mv->hcount)
|               mv->hcount = bytes_out - low_buffer_size;
|       } else {
|           mv->lcount = bytes_out;
|           mv->hcount = 0;
|       }
|   }
`-- return is_bzImage;  // return value in AX

end is defined in the "internal linker script" too.

decompress_kernel() has an "asmlinkage" modifer. In linux/include/linux/linkage.h:

#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif

#if defined __i386__
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
#elif defined __ia64__
#define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))
#else
#define asmlinkage CPP_ASMLINKAGE
#endif

Macro "asmlinkage" will force the compiler to pass all function arguments on the stack, in case some optimization method may try to change this convention. Check Using the GNU Compiler Collection (GCC): Declaring Attributes of Functions (regparm) and Kernelnewbies FAQ: What is asmlinkage for more details.

5.2. gunzip()

decompress_kernel() calls gunzip() -> inflate(), which are defined in linux/lib/inflate.c, to decompress resident kernel image to low buffer (pointed by output_data) and high buffer (pointed by high_buffer_start, for bzImage only).

The gzip file format is specified in RFC 1952.

Table 6. gzip file format

ComponentMeaningByteComment
ID1IDentification 1131 (0x1f, \037)
ID2IDentification 21139 (0x8b, \213) [a]
CMCompression Method18 - denotes the "deflate" compression method
FLGFLaGs10 for most cases
MTIMEModification TIME4modification time of the original file
XFLeXtra FLags12 - compressor used maximum compression, slowest algorithm [b]
OSOperating System13 - Unix
extra fields--variable length, field indicated by FLG [c]
compressed blocks--variable length
CRC32-4CRC value of the uncompressed data
ISIZEInput SIZE4the size of the uncompressed input data modulo 2^32

[a] ID2 value can be 158 (0x9e, \236) for gzip 0.5;

[b] XFL value 4 - compressor used fastest algorithm;

[c] FLG bit 0, FTEXT, does not indicate any "extra field".


We can use this file format knowledge to find out the beginning of gzipped linux/vmlinux.

[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | grep '1f 8b 08 00'
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 -s 0x4c40 -n 64
00004c40  00 80 0b 00 00 fc 21 00  68 00 00 00 1e 01 11 00  |......!.h.......|
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
00004c60  96 7f d5 a9 d0 1d 4d ac  56 93 35 ac 01 3a 9c 6a  |......M.V.5..:.j|
00004c70  4d 46 5c d3 7b f8 48 36  c9 6c 84 f0 25 88 20 9f  |MF\.{.H6.l..%. .|
00004c80
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | tail -n 4
00114d40  bd 77 66 da ce 6f 3d d6  33 5c 14 a2 9f 7e fa e9  |.wf..o=.3\...~..|
00114d50  a7 9f 7e fa ff 57 3f 00  00 00 00 00 d8 bc ab ea  |..~..W?.........|
00114d60  44 5d 76 d1 fd 03 33 58  c2 f0 00 51 27 00        |D]v...3X...Q'.|
00114d6e

We can see that the gzipped file begins at 0x4c50 in the above example. The four bytes before "1f 8b 08 00" is input_len (0x0011011e, in little endian), and 0x4c50+0x0011011e=0x114d6e equals to the size of bzImage (/boot/vmlinuz-2.4.20-28.9).

static uch *inbuf;           /* input buffer */
static unsigned insize = 0;  /* valid bytes in inbuf */
static unsigned inptr = 0;   /* index of next byte to be processed in inbuf */
///////////////////////////////////////////////////////////////////////////////
static int gunzip(void)
{
        Check input buffer for {ID1, ID2, CM}, must be
                {0x1f, 0x8b, 0x08} (normal case), or
                {0x1f, 0x9e, 0x08} (for gzip 0.5);
        Check FLG (flag byte), must not set bit 1, 5, 6 and 7;
        Ignore {MTIME, XFL, OS};
        Handle optional structures, which correspond to FLG bit 2, 3 and 4;
        inflate();              // handle compressed blocks
        Validate {CRC32, ISIZE};
}

When get_byte(), defined in linux/arch/i386/boot/compressed/misc.c, is called for the first time, it calls fill_inbuf() to setup input buffer inbuf=input_data and insize=input_len. Symbol input_data and input_len are defined in piggy.o linker script. See Section 2.5, “linux/arch/i386/boot/compressed/Makefile”.

5.3. inflate()

// some important definitions in misc.c
#define WSIZE 0x8000            /* Window size must be at least 32k,
                                 * and a power of two */
static uch window[WSIZE];       /* Sliding window buffer */
static unsigned outcnt = 0;     /* bytes in output buffer */

// linux/lib/inflate.c
#define wp outcnt
#define flush_output(w) (wp=(w),flush_window())
STATIC unsigned long bb;        /* bit buffer */
STATIC unsigned bk;             /* bits in bit buffer */
STATIC unsigned hufts;          /* track memory usage */
static long free_mem_ptr = (long)&end;
///////////////////////////////////////////////////////////////////////////////
STATIC int inflate()
{
        int e;                  /* last block flag */
        int r;                  /* result code */
        unsigned h;             /* maximum struct huft's malloc'ed */
        void *ptr;

        wp = bb = bk = 0;

        // inflate compressed blocks one by one
        do {
                hufts = 0;
                gzip_mark() { ptr = free_mem_ptr; };
                if ((r = inflate_block(&e)) != 0) {
                        gzip_release() { free_mem_ptr = ptr; };
                        return r;
                }
                gzip_release() { free_mem_ptr = ptr; };
                if (hufts > h)
                h = hufts;
        } while (!e);

        /* Undo too much lookahead. The next read will be byte aligned so we
         * can discard unused bits in the last meaningful byte. */
        while (bk >= 8) {
                bk -= 8;
                inptr--;
        }

        /* write the output window window[0..outcnt-1] to output_data,
         * update output_ptr/output_data, crc and bytes_out accordingly, and
         * reset outcnt to 0. */
        flush_output(wp);

        /* return success */
        return 0;
}

free_mem_ptr is used in misc.c:malloc() for dynamic memory allocation. Before inflating each compressed block, gzip_mark() saves the value of free_mem_ptr; After inflation, gzip_release() will restore this value. This is how it "free()" the memory allocated in inflate_block().

Gzip uses Lempel-Ziv coding (LZ77) to compress files. The compressed data format is specified in RFC 1951. inflate_block() will inflate compressed blocks, which can be treated as a bit sequence.

The data structure of each compressed block is outlined below:

BFINAL (1 bit)
    0  - not the last block
    1  - the last block
BTYPE  (2 bits)
    00 - no compression
        remaining bits until the byte boundary;
        LEN      (2 bytes);
        NLEN     (2 bytes, the one's complement of LEN);
        data     (LEN bytes);
    01 - compressed with fixed Huffman codes
        {
        literal  (7-9 bits, represent code 0..287, excluding 256);
                     // See RFC 1951, table in Paragraph 3.2.6.
        length   (0-5 bits if literal > 256, represent length 3..258);
                     // See RFC 1951, 1st alphabet table in Paragraph 3.2.5.
        data     (of literal bytes if literal < 256);
        distance (5 plus 0-13 extra bits if literal == 257..285, represent
                         distance 1..32768);
                     /* See RFC 1951, 2nd alphabet table in Paragraph 3.2.5,
                      *   but statement in Paragraph 3.2.6. */
                     /* Move backward "distance" bytes in the output stream,
                      * and copy "length" bytes */
        }*           // can be of multiple instances
        literal  (7 bits, all 0, literal == 256, means end of block);
    10 - compressed with dynamic Huffman codes
        HLIT     (5 bits, # of Literal/Length codes - 257, 257-286);
        HDIST    (5 bits, # of Distance codes - 1,         1-32);
        HCLEN    (4 bits, # of Code Length codes - 4,      4 - 19);
        Code Length sequence    ((HCLEN+4)*3 bits)
        /* The following two alphabet tables will be decoded using
         *   the Huffman decoding table which is generated from
         *   the preceeding Code Length sequence. */
        Literal/Length alphabet (HLIT+257 codes)
        Distance alphabet       (HDIST+1 codes)
        // Decoding tables will be built from these alphpabet tables.
        /* The following is similar to that of fixed Huffman codes portion,
         *   except that they use different decoding tables. */
        {
        literal/length
                 (variable length, depending on Literal/Length alphabet);
        data     (of literal bytes if literal < 256);
        distance (variable length if literal == 257..285, depending on
                         Distance alphabet);
        }*           // can be of multiple instances
        literal  (literal value 256, which means end of block);
    11 - reserved (error)

Note that data elements are packed into bytes starting from Least-Significant Bit (LSB) to Most-Significant Bit (MSB), while Huffman codes are packed starting with MSB. Also note that literal value 286-287 and distance codes 30-31 will never actually occur.

With the above data structure in mind and RFC 1951 by hand, it is not too hard to understand inflate_block(). Refer to related paragraphs in RFC 1951 for Huffman coding and alphabet table generation.

For more details, refer to linux/lib/inflate.c, gzip source code (many in-line comments) and related reference materials.

5.4. Reference