Click here to Skip to main content
Click here to Skip to main content

Tagged as

Go to top

Writing a Linux 4k intro

, 27 Jan 2014
Rate this:
Please Sign up or sign in to vote.
This summer my old friend motivated me to restart demoscene activity and I started coding a 4k intro for x86 Linux (which was supposed to be released at WeCan'2012, but we missed the deadline). This is an ongoing work, but I decided to share some experience that I consider interesting. 1. Even C is

This summer my old friend motivated me to restart demoscene activity and I started coding a 4k intro for x86 Linux (which was supposed to be released at WeCan'2012, but we missed the deadline).

This is an ongoing work, but I decided to share some experience that I consider interesting.

1. Even C is probably too heavy for the task.

Although a friend of mine has previously written a 4k intro in C, I am having problems getting compiler to generate the compact code I want. One of the reason for that is that even when optimizing the whole program at once, compilers (here and later on by "compilers" I mean ones I tried for the intro, namely clang & gcc) don't relax the ABI restrictions for internal (externally invisible) functions. Sure, you can inline a function altogether but sometimes it's not what you want.

To provide an example: let's say that ABI mandates that you pass arguments via stack. However, according to your particular optimization criteria (be it size or speed) you could generate better code for calling some (not all) functions if arguments were passed via registers. Currently, you can fine-tune that manually (by using 'regparm' function attribute on select functions), but ideally compiler would analyze all the callers of your function and decide that automatically.

Generally, it seems that even whole-program optimizing compilers are having problems crossing function borders, even if all the possible places where a function could be called from are visible to them.

E.g. I would like two (or more) functions to share a common stack frame - NOT inline one into another, as this would produce sub-optimal (for me), larger and worse to pack code, but essentially turn some of the callee functions into local ones. This can be done manually to some extent, but this looks like a task for compiler.

BTW, it turns out that Clang generates smaller code than GCC, although this is mostly (only?) true if you use SSE intrinsics for math: it looks like Clang cannot and will not generate any x87 code even if you pass -mfpmath=387 (at least that's the experience with current SVN trunk and also with version 3.0 shipped with Ubuntu 12.04) and scalar SSE opcodes are large.

2. Compilers are buggy Wink | ;-)

Nothing new here, of course. Just sharing a particular example of peephole optimizer silliness in Clang, which turns this code

    size_t iCnt = 6;
    float * pBuf = s_BufStart;
    do
    {
        SomeFunc( pBuf );
        SomeFunc( pBuf + 4 );
        SomeFunc( pBuf + 8 );
        SomeFunc( pBuf + 12 );
        pBuf += 16;
    }
    while( --iCnt > 0 );

into this (pay attention to instructions in bold)

804988e:       31 db                   xor    %ebx,%ebx
8049890:       83 ec 10                sub    $0x10,%esp
8049893:       8d 83 30 bf 04 08       lea    0x804bf30(%ebx),%eax
8049899:       89 04 24                mov    %eax,(%esp)
804989c:       ff 15 14 be 04 08       call   *0x804be14
80498a2:       83 c4 10                add    $0x10,%esp
80498a5:       83 ec 10                sub    $0x10,%esp
80498a8:       8d 83 40 bf 04 08       lea    0x804bf40(%ebx),%eax
80498ae:       89 04 24                mov    %eax,(%esp)
80498b1:       ff 15 14 be 04 08       call   *0x804be14
80498b7:       83 c4 10                add    $0x10,%esp
80498ba:       83 ec 10                sub    $0x10,%esp
80498bd:       8d 83 50 bf 04 08       lea    0x804bf50(%ebx),%eax
80498c3:       89 04 24                mov    %eax,(%esp)
80498c6:       ff 15 14 be 04 08       call   *0x804be14
80498cc:       83 c4 10                add    $0x10,%esp
80498cf:       83 ec 10                sub    $0x10,%esp
80498d2:       8d 83 60 bf 04 08       lea    0x804bf60(%ebx),%eax
80498d8:       89 04 24                mov    %eax,(%esp)
80498db:       ff 15 14 be 04 08       call   *0x804be14
80498e1:       83 c4 10                add    $0x10,%esp
80498e4:       83 c3 40                add    $0x40,%ebx
80498e7:       81 fb 80 01 00 00       cmp    $0x180,%ebx
80498ed:       75 a1                   jne    8049890

Apparently, clang is trying to set up and tear down an aligned stack parameter, but does not notice for some reason that the same set up and tear down happens four times in a row, thus missing the opportunity to optimize the three of them out.

I tried to come up with a smaller fragment of code that reproes that, but could not so far, that is why instead of filing a bug report I'm just mentioning this on the blog. Notably, gcc does not have any problems with this particular piece of code, but as I mentioned above, gcc's code is ultimately larger and less packer-friendly.

Also, it's possible to get an ICE with clang when doing weird stuff with SIMD intrinsics/registers (especially when aliased unions with floats are involved), but again, a simple isolated repro is hard to come by.

3. x86 code is damn tight (SSE one, anyway).

I experimented with some simple transforms on the resulting binary to improve compression ratio. Using tools from ELF kickers and an excellent distorm disassembler library, I wrote a code rotator to transpose x86 opcodes, i.e. instead of having them placed in memory like

    164a:       0f 28 ea                movaps %xmm2,%xmm5
    164d:       0f 59 ed                mulps  %xmm5,%xmm5
    1650:       0f 59 e9                mulps  %xmm1,%xmm5
    [...]
    165b:       0f 52 ed                rsqrtps %xmm5,%xmm5
    165e:       0f 59 d5                mulps  %xmm5,%xmm2
    1661:       0f 58 e2                addps  %xmm2,%xmm4

that is,

    0f 28 ea 0f 59 ed 0f 59 e9 0f 52 ed 0f 59 d5 0f 58 e2

the rotator places them roughly like

    0f 0f 0f 0f 0f 0f 28 59 59 52 59 58 ea ed e9 ed d5 e2

(it is actually more tricky as x86 opcodes have different sizes).

Guess what? Despite this layout looking more regular, it actually packs worse, at least with gzip and lzma (which are used as 4k code droppers). Trying to add a simple RLE pre-pass didn't help, so this requires a closer (and more thorough) look.

4. Linux needs better 4k tools.

Particularly, lack of crinkler is a major disadvantage. Specialized (and better) loaders than ld do exist, but nothing can make up for the lack of crinkler's excellent packer.

However, I would like to highlight GC masher, a great tool (with awful source code, but still) to conduct a brute-force search of compiler parameters or code layout (e.g. provide alternative paths in code switched by #ifs and use the tool to find the combination with the best compression ratio). With some creative use, the tool can be used to find not the smallest, but the fastest version of resulting binary (within the limits of a brute-force search of course).

5. Demomaking is fun and I missed it.

I have previously coded a 4k intro (MS-DOS, with GUS music), and a couple of demos, too, but I have been inactive for quite a while (crunch doesn't help creativity Wink | ;-) and then there were other reasons, too). Now that I returned to coding scene stuff, I feel myself a teenager again Smile | :)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

RCL_SPD

Russian Federation Russian Federation
No Biography provided

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web02 | 2.8.140916.1 | Last Updated 27 Jan 2014
Article Copyright 2014 by RCL_SPD
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid