Well, I wouldn't want to interfere with what you're doing, but here is some advice from hard-earned experience.
Assembly can express very low-level things:
you can access machine-dependent registers and I/O
you can control the exact code behavior in critical sections that might otherwise involve deadlock between multiple software threads or hardware devices
you can break the conventions of your usual compiler, which might allow some optimizations (like temporarily breaking rules about memory allocation, threading, calling conventions, etc)
you can build interfaces between code fragments using incompatible conventions (e.g. produced by different compilers, or separated by a low-level interface)
you can get access to unusual programming modes of your processor (e.g. 16-bit mode to interface with startup, firmware, or legacy code on Intel PCs)
you can produce reasonably fast code for tight loops to cope with a bad non-optimizing compiler (but then, there are free optimizing compilers available!)
you can produce hand-optimized code perfectly tuned for your particular hardware setup, though not to someone else's
you can write some code for your new language's optimizing compiler (that is something very few people will ever do, and even they not often)
i.e. you can be in complete control of your code
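As one small illustration of reaching something C cannot name directly, here is a minimal sketch that reads the processor's carry flag after an addition, using GCC extended inline assembly on x86. The function name add_carry_u32 is hypothetical (it is not from any library), and the plain C branch is only an assumed fallback for other compilers and architectures.

```c
#include <stdint.h>

/* Hypothetical helper (name invented for this sketch): stores the wrapped
   32-bit sum of a and b in *sum and returns the carry out of bit 31. */
static unsigned add_carry_u32(uint32_t a, uint32_t b, uint32_t *sum)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t result = a;
    unsigned char carry;
    /* addl sets the carry flag on unsigned overflow; setc copies that
       flag into a byte register. "cc" tells GCC the flags are clobbered. */
    __asm__ ("addl %2, %0\n\t"
             "setc %1"
             : "+r" (result), "=q" (carry)
             : "r" (b)
             : "cc");
    *sum = result;
    return carry;
#else
    /* Portable fallback: an unsigned sum that wrapped is smaller than a. */
    uint32_t result = a + b;
    *sum = result;
    return result < a;
#endif
}
```

Calling add_carry_u32(0xFFFFFFFFu, 1u, &s) should leave 0 in s and return 1 on either branch. Note that the C fallback expresses the same thing, and a good compiler may well turn it into the same two instructions.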
Assembly is a very low-level language (the lowest above hand-coding the binary instruction patterns). This means
it is long and tedious to write initially
it is quite bug-prone
your bugs can be very difficult to chase
your code can be fairly difficult to understand and modify, i.e. to maintain
the result is non-portable to other architectures, existing or upcoming
your code will be optimized only for a certain implementation of the same architecture: for instance, among Intel-compatible platforms each CPU design and its variations (relative latency, throughput, and capacity of processing units, caches, RAM, bus, disks; presence of FPU, MMX, 3DNOW, SIMD extensions, etc) implies potentially completely different optimization techniques. CPU designs already include: Intel 386, 486, Pentium, PPro, PII, PIII, PIV; Cyrix 5x86, 6x86, M2; AMD K5, K6 (K6-2, K6-III), K7 (Athlon, Duron). New designs keep popping up, so don't expect either this listing or your code to be up-to-date.
you spend more time on a few details and can't focus on small and large algorithmic design, which is known to bring the largest part of the speed-up (e.g. you might spend some time building very fast list/array manipulation primitives in assembly, when a hash table would have sped up your program much more; or, in another context, a binary tree; or some high-level structure distributed over a cluster of CPUs)
a small change in algorithmic design might completely invalidate all your existing assembly code. So either you're ready (and able) to rewrite it all, or you're tied to a particular algorithmic design
On code that ain't too far from what's in standard benchmarks, commercial optimizing compilers outperform hand-coded assembly (well, that's less true on the x86 architecture than on RISC architectures, and perhaps less true for widely available/free compilers; anyway, for typical C code, GCC is fairly good).
And in any case, as moderator John Levine says on comp.compilers,
"compilers make it a lot easier to use complex data structures,
and compilers don't get bored halfway through
and generate reliably pretty good code."
They will also correctly propagate code transformations throughout the whole (huge) program when optimizing code across procedure and module boundaries.
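To make the point about algorithmic design concrete, here is a minimal sketch in plain C that counts how many elements each lookup has to examine. It uses binary search over a sorted array as a simple stand-in for the hash table or binary tree mentioned above; the function names and the probe counter are assumptions made for illustration only.

```c
#include <stddef.h>

/* Linear scan: returns the index of key or -1; *probes counts the
   elements examined. This is the loop one might be tempted to hand-tune. */
static long linear_find(const int *a, size_t n, int key, size_t *probes)
{
    *probes = 0;
    for (size_t i = 0; i < n; i++) {
        (*probes)++;
        if (a[i] == key)
            return (long)i;
    }
    return -1;
}

/* Binary search over an array sorted in ascending order: same interface,
   but the search range [lo, hi) is halved after every probe. */
static long binary_find(const int *a, size_t n, int key, size_t *probes)
{
    size_t lo = 0, hi = n;
    *probes = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        (*probes)++;
        if (a[mid] == key)
            return (long)mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

On a sorted array of 1024 elements, looking up the last one costs the linear scan 1024 probes and the binary search about ten. No amount of instruction-level tuning of the first loop is likely to close a gap of that size.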
All in all, you might find that though using assembly is sometimes needed, and might even be useful in a few cases where it is not, you'll want to:
minimize use of assembly code
encapsulate this code in well-defined interfaces
have your assembly code automatically generated from patterns expressed in a higher-level language than assembly (e.g. GCC inline assembly macros)
have automatic tools translate these programs into assembly code
have this code be optimized if possible
All of the above, i.e. write (an extension to) an optimizing compiler back-end.
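The first few points can be sketched as follows: a single instruction is wrapped in GCC inline assembly behind an ordinary C function, so callers never see the assembly and the compiler still optimizes the code around it. The name highest_bit is hypothetical, the x86 bsr instruction is just one convenient example, and the C loop is an assumed fallback for other targets.

```c
#include <stdint.h>

/* Hypothetical interface (name invented for this sketch): index of the
   highest set bit of x, counting from 0. x must be nonzero. */
static inline unsigned highest_bit(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t index;
    /* bsrl scans from the most significant bit down and writes the index
       of the first set bit it finds; it modifies the flags ("cc"). */
    __asm__ ("bsrl %1, %0"
             : "=r" (index)
             : "rm" (x)
             : "cc");
    return index;
#else
    /* Portable fallback: shift right until nothing is left. */
    unsigned index = 0;
    while (x >>= 1)
        index++;
    return index;
#endif
}
```

For instance, highest_bit(0x80) gives 7 on either branch. Because the assembly is confined to one small function with a documented contract, porting to a new architecture means touching only this spot.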
Even when assembly is needed (e.g. OS development), you'll find that not so much of it is required, and that the above principles still hold.
See the Linux kernel sources concerning this: as little assembly as needed, resulting in a fast, reliable, portable, maintainable OS. Even a successful game like DOOM was almost entirely written in C, with only a tiny part written in assembly for speed.