Click here to Skip to main content
Click here to Skip to main content

Optimization: Your Worst Enemy

By , 12 Aug 2000
 

Introduction

Now there's a title to grab your attention! But I'm serious!

First, a little background on me: my PhD was one of the earliest on the automatic creation of optimizing compilers from formal machine descriptions ("Machine-Independent Generation of Optimal Local Code", CMU Computer Science Department, 1975). After my PhD, I spent three years at CMU as a senior researcher on the C.mmp multiprocessor computer system which used our home-grown Hydra operating system, a secure, capability-based operating system. I then went back to compiler research on the PQCC (Production Quality Compiler-Compiler) project, which ultimately led to the formation of Tartan Laboratories (later, just Tartan, and now absorbed by Texas Instruments), a compiler company, where I was part of the tooling group. I spent a decade-and-a-half writing and using performance measurement tooling.

This essay has several parts and represents much of my own experience. The stories I tell are true. The names have not been changed, although a couple are carefully omitted.

Optimization: When and What

A reasonably skilled programmer will not write a grossly inefficient program. At least not deliberately. Optimization is what you do when the performance is insufficient. Sometimes the optimizations are easy, sometimes they are hard. Sometimes they fly in the face of your original design, and sometimes they require that you grossly violate your beautiful abstractions in your class system. But always, and I repeat, always, my experience has been that no programmer has ever been able to predict or analyze where performance bottlenecks are without data. No matter where you think the time is going, you will be surprised to discover that it is going somewhere else.

You optimize because you have a problem in performance. Sometimes it is computational optimization: your bitmap manipulation is just too slow. Sometimes it is data access optimization: it just takes too long to get the data into the machine. And sometimes it is algorithmic optimization: you're doing it wrong. If you don't understand the difference between an n2 sort and an n log n sort, you're probably already in trouble, but that knowledge alone is not useful.

Some years ago, I was working on a complex program which had to perform a semantic cross-check between the "statements" of the program and the "declarations" (it was actually a 4GL constraint equation system, but the details don't matter). I discovered that the operation was of n3 complexity (well, actually m * n2, but m and n would be of comparable size most of the time). There are three paths that you can follow here:

  • The naive path. You don't even realize you've got an n3 problem. You're probably in trouble, because if it is the bottleneck, you didn't know it was there.
  • The formal academic path. You realize you've got an n3 problem, and know it is intrinsically evil, and rewrite your algorithms.
  • The engineering path. You realize you've got an n3 problem, but you instrument the system to discover its actual impact on the system.

The only valid path to optimization is the engineering path. I measured the performance, and on the largest "real" example we had, I discovered that n was almost always 1, sometimes 2, rarely 3, and had exactly one instance of 4. This was too small to matter. Sure, the algorithm was n3, but with n that small, there was no need to rewrite the code. Rewriting the code would have been incredibly complex, delayed the whole project for a couple weeks, and used up a couple pointers in each node of the tree in an already-tight minicomputer address space.

I also wrote the storage allocator that everyone used. I spent a lot of hours tweaking its performance to be the fastest allocator of its class. These adventures are detailed in the book IDL: The Language and its Implementation, now, alas, out of print. (Nestor, Newcomer, Gianinni and Stone, Prentice-Hall, 1990). One group that used it used a "PC sampling" performance tool on Unix. This is a performance tool that samples where the program counter (PC) is executing, and over a long execution derives a "density histogram" of where the program is spending its time. It clearly showed that a massive amount of time was being spent in the storage allocator. This made no sense to me, but fingers were being pointed in my direction.

So I wrote a little hook in the allocator that counted the number of times it was called. It turns out it was called over 4,000,000 times. No call took longer than the minimum measurement interval of 10µs (approximately ten instructions on our 1-MIPS machine), but 40,000,000 microseconds is 40 seconds. Actually, it was more than that because there were 4,000,000 free operations as well, which were even faster, but still the resulting time was more than 50% of the total execution time.

Why did this happen? Because, unknown to the programmers, a critical function they were calling in the inner loop of several algorithms would actually allocate a 5-to-10 byte working buffer, do its thing, and release it. When we changed this to be a 10-byte stack local, the time spent in the storage allocator dropped to about 3% of the total program time.

Without the data, we would not have known why the allocator accounted for so much time. PC-sampling performance tools are a very weak class of tools and their results are intrinsically suspect. See my article "Profiling for Performance", in Dr. Dobb's Journal (18,1) January, 1993, pp.80-87.

A classic blunder in optimization was committed some years ago by one of the major software vendors. We had their first interactive timesharing system, and it was a "learning experience" in a variety of ways. One such experience was that of the FORTRAN compiler group. Now any compiler writer knows that the larger the hash table you use for a symbol table, the better your performance on lookup will be. When you are writing a multipass compiler in a 32K mainframe, you end up using a relatively small symbol table, but you create a really, really good hash algorithm so that the probability of a hash collision is reduced (unlike a binary seach, which is log n, a good hash table has constant performance, up to a certain density, so as long as you keep the density below this threshold you might expect that you will typically have an order 1 or 2 cost to enter or lookup a symbol, on the average. A perfect hash table (which is usually precomputed for constant symbols), has a constant performance between 1.0 and 1.3 or thereabouts; if it gets to 1.5 you rework the hashing to get it lower).

So anyway, this compiler group discovers that they no longer have 32K, or 128K, or 512K. Instead, they now have a 4GB virtual address space. "Hey, let's use a really big hash table!" you can hear them saying, "Like, how about 1 MB table?" So they did. But they also had a very sophisticated compiler technology designed for small and relatively dense hash tables. So the result was that the symbols were fairly uniformly distributed over the 256 4K pages in that 1MB, which meant that every symbol access caused a page fault. The compiler was a dog. When they finally went back to a 64K symbol table, they found that although the algorithm had poorer "absolute" performance from a purely algorithmic viewpoint (taking many more instructions to look up a symbol), because it did not cause nearly as many page faults, it ran over an order of magnitude faster. So third-order effects do matter.

Also, beware of C. No, not the speed of light. When we talk about performance, the algorithmic performance for n is expressed as a function C × f(n). Thus an n2 algorithm is formally C × n2, meaning that the performance is a constant multiple of the square of the number of elements being considered. We shorten this to O(n2), meaning "order of n2", and in common parlance just drop the "order of" designation. But never forget the C is there. Some years ago, I was doing a project that produced summary set of listings, sorted in a variety of ways. In the first attempt (this was in the days before C and qsort) I just did an ordinary bubble sort, an O(n2) algorithm. After initial testing, I fed it some live data. Ten minutes later, after it had printed the console message "Starting reports", it had not yet produced any reports. A series of probes showed that most of the time was in the sort routine. OK, I was done in by my laziness. So I pulled out my trusty heapsort (n log n) sort and spent an hour modifying it to work in my application (remember, I said qsort did not yet exist). Having solved the problem, I started running it again. Seven minutes into the report phase, nothing had yet appeared. Some probes revealed something significant: it was spending most of its time in the equivalent of strcmp, comparing the strings. While I'd fixed the O issue, I had serious neglected the C issue. So what I did was do one sort of the composite symbol table, all of the names, and then assigning an integer to each symbol structure. Thereafter, when I had to sort a substructure, I just did an integer sort on its ordinal position. This reduced C to the point where less than 30 seconds were required to do the entire report phase. A second-order effect, but a significant one.

So algorithmic performance, particularly paging performance, can matter. Unfortunately, we have neither the proper tools for measuring paging hits nor for reorganizing code to minimize paging of code pages.

Some performance tools measure the total time spent in user space, and treat kernel time as free. This can mask the impact the application has on the kernel time. For example, a few years ago we were measuring the performance of a program whose performance was exceptionally poor. No "hotspot" showed up in terms of program time wasted. However, at one point I was looking at the trace data and noticed that the input routine was called about a million times, which is not surprising when you are reading a megabyte of data, but something seemed odd to me. I looked. Each time it was called, it called the kernel to read a single byte of the file! I changed it to read 8K bytes and unpack the buffer itself, and got a factor of 30 performance improvement! Note the lesson here: kernel time matters, and kernel transitions matter. It is not accidental that the GDI in NT 4.0 is no longer a user-level process but integrated into the kernel. The kernel transitions dominated the performance.

So what to optimize is easy: those parts of the program that are consuming the time. But local optimizations that ignore global performance issues are meaningless. And first-order effects (total time spent in the allocator, for example), may not be the dominant effects. Measure, measure, measure.

Living Well is the Best Revenge

(OK, this is really a sidebar, and a bit of an ego trip. You can skip to the next section if you don't want to read it. You've been warned).

Back in the early days of C, the C storage allocator was one of the worst-performing storage allocators in existence. It was "first fit", which meant that the way it worked was the allocator went traipsing down the free list, looking for a block at least as big as the one requested, and if it found one, it split it and returned the residue to the free list. This had the advantages of being as slow as possible and fragmenting memory as badly as possible. In fact, it was worse than you can imagine. It actually walked the list of all blocks of storage, free and allocated, and had to ignore the allocated blocks. So as you got more and more blocks, its performance degraded, and as the blocks got too small to be usable, they simply added to the overhead without adding to the value.

I was at CMU on a one-year research contract. My first remark upon using the Unix environment was to turn to one of the people and say "How can you live this way?" The software technology in 1990 was exactly what it had been a decade previous when I left CMU, except that in the modern case the compiler didn't work (it generated incorrect code for simple C constructs), the debugger didn't work, the backtrace (being entirely a list of hex addresses with no symbols) was useless, the linker didn't work, and there wasn't anything resembling a decent document production system. Other than these minor defects, I suppose it was an OK environment. Having used Microsoft C, with CodeView, and even the earliest Visual C environment, I had certain high standards of what I expected out of my tooling, and Unix (at least in that era) fell short. By miles. I was, of course, outspoken on this subject.

One day we were discussing an algorithm, and it required doing some storage allocation. I was assured that this was unacceptable because storage allocation was very expensive. I said something like "Well, of course, if you use the brain-dead Unix allocator, you're bound to have performance problems. A decent storage allocator makes this a non-issue". One of the people at the meeting immediately challenged me: "I'm sick of hearing you put down Unix. What do you know about storage allocators, anyway?". I said "hold that thought, I'll be right back". I went to my office, where I had a copy of the IDL book, brought it back, and held it open to the chapter labeled "Storage allocation". "See this?" "Yes." "What is the title of this chapter?" "Storage allocation." I closed the book and pointed to the cover. "Recognize this name?" "Yes, of course, that's you." "Fine. I wrote that chapter. It is on how to write a high-performance, minimum-fragmentation storage allocator. So you asked what I know about storage allocation. Well, I wrote the book on it."

I was never again challenged when I put down Unix.

Just as an aside, the allocators used in NT work very much like the algorithm I described in the IDL book, which is based on the "quickfit" work that Chuck Weinstock did for his PhD dissertation at CMU around 1974.

When not to optimize

Do not do clever optimizations that have no meaning. For example, people who try to "optimize" the GUI interface. Hardwired constants, distributed enabling, cute algorithms. The result is something that is hard to develop, difficult to debug, and absolutely impossible to maintain.

Optimization is meaningless here. For more details, you might want to read my essay on the only sensible way I've found to manage dialog controls. I'll summarize the key idea here.

Why doesn't efficiency matter when you're updating menus or controls? Look at the human factors. A mouse is held approximately 2 feet from the ear. Sound travels at approximately 1100 ft/sec. This means that it takes approximately 2ms for the sound of the mouse click or keystroke to reach the ear. The neural path from the fingertip to the brain of an adult is approximately 3 feet. Propagation of nerve impulses is approximately 300 ft/sec, meaning the sensation of the mouse click or keystroke takes approximately 10ms to reach the brain. Perceptual delay in the brain can add between 50 and 250ms more.

Now, how many Pentium instructions can you execute in 2ms, 10ms, or 100ms? In 2ms, on a 500MHz machine that's 1,000,000 clock cycles, so you can execute a lot of instructions in that time. Even on a now-clunky 120MHz Pentium there is no noticeable delay in handling the controls.

This did not stop Microsoft from totally violating the C++ object model on message handlers; if you call CWnd::OnWhatever(...), instead of actually calling DefWindowProc with the parameters you provide, they re-use the parameters of the last message to call ::DefWindowProc. The goal is to "reduce the size of the MFC runtime", as if a few hundred instructions more in a massive DLL would matter! Even I can figure out how to get CWnd::OnAnything to inline-expand to a call on DefWindowProc.

Optimization is your enemy

Some many years ago, as I indicated in the introduction, I worked on a large (16-processor) multiprocessor system. We were using custom-modified PDP-11 minicomputers, which were relatively slow. We were programming them in Bliss-11, which as far as I've been able to tell still holds the record for the world's tightest optimizing compiler (although I've seen some quite impressive optimizations in Microsoft C/C++). After doing some performance measurement, we determined that the paging algorithm was the outstanding performance bottleneck. So our first assumption was that the paging algorithm was faulty. We examined the code, and the paging algorithm maintainer rewrote it, taking our performance data into account, and had a new, and much faster, page management algorithm installed within a week.

Meanwhile, up at MIT, MULTICS was still running. They traced a serious performance problem to the paging system. Because it was written in a PL/1 like language, EPL, the assumption was that because it was written in a high-level language, the code was suboptimal, so they launched an effort to rewrite the page management algorithm in assembly code. A year later, the code was working, and was put into the production system. Performance dropped 5%. Upon inspection, it was found that the fundamental algorithm was at fault. They took the EPL code, rewrote the algorithm, and had the improved algorithm working and installed in a few weeks. The lesson: don't optimize something that is not the problem. Understand the problem first. Then, and only then, do the optimization. Otherwise, the optimization is a waste of time and may even make the performance worse.

In the Bliss compiler, the 'register' attribute on a variable said to the compiler "You will assign this variable to a register". In C, it means "I'd like you to assign this variable to a register". Many programmers decided that they should put certain variables in registers to get better code. But the Bliss compiler was very good; it had a very sophisticated register allocation system, and in the absence of direction from the programmer felt free to assign a variable to a register if that produced the best code. Explicitly assigning a variable to a register meant that the register was not available for common subexpressions, particularly those subexpressions implicit in data structure access. After a fair amount of experiments, we determined that almost invariably adding 'register' attributes to declarations produced significantly worse code than letting the compiler do the assignment. For inner-loop code, many hours of effort would usually result in a small improvement in performance, but it was generally understood by the programmers that unless you studied the generated machine code and did a number of calibrated experiments, any attempts at optimization would make the code worse.

If you've heard of the SPECmark performance, you may also be aware of how they are gamed. In particular, IBM wrote a program that took a FORTRAN program as input, such as the SPEC matrix-multiply benchmark, and produced another FORTRAN program as output, but one which was optimized for the cache architecture of the machine it would run on. A small number of parameters described all the cache strategies of each of the models of the RISC 6000 product line. The "optimized" original FORTRAN program performed on one machine at 45 SPECmarks. After being transformed, it performed on the same machine at over 900 SPECmarks. That's a factor of 20 performance improvement based solely on fourth-order effects, cache line hits. If you're doing image processing, particularly of large images, an awareness of cache issues (even relatively machine-independent) can buy you an order of magnitude performance.

A naive approach, optimizing at the code-line level, is not nearly as effective as higher-order optimizations. Paging optimizations, cache line optimizations, and memory allocation optimizations can often have vastly more significant effects than code-line optimization. Algorithmic optimizations are the next best bet, particularly if your problem is not amenable to paging/cache optimizations. Only after all these have been done, and, of course, you have measured everything in sight, does it make sense to start doing line-level optimizations. And if your problem domain demands it, it can even make sense to recode the inner loops, particularly of such algorithms as convolution algorithms and DSP algorithms, in assembly code, most often to take advantage of the instructions such as for MMX and streaming media.

Perhaps the best example of pure programmer stupidity in "optimizing" code occurred when I was porting a large library we used for our research project. Think of it as a 16-bit-to-32-bit port (it was an 18-bit-to-36-bit port, and the language wasn't C, but the details don't matter--you can write ghastly code in any language, and I've seen C programmers do things just as badly). The port mostly worked, but we had a really strange problem that showed up only under some rare conditions, but which crashed the program using the library. I started looking. The heap was damaged. When I found how the heap was being damaged, it was being damaged via a bad pointer which allowed a store into a random place in the heap. OK, how did that pointer get bad? Four levels of storage damage down, and after 12 solid hours of debugging, I found the real culprit. By why did it fail? Another 5 hours, and I found that the programmer who wrote the constructor code for the data structure had a struct-equivalent something like {char * p1; char * p2;} where the pointers had been 16-bit, and we now used 32-bit pointers. In looking at the initialization code, instead of seeing something like something->p1 = NULL; something->p2= NULL;, I found the moral equivalent of (*(DWORD*)&something.p1) = 0! When I confronted the programmer, he justified it by explaining that he was now able to zero out two pointers with only a single doubleword store instruction (it wasn't an x86 computer, but a mainframe), and wasn't that a clever optimization? Of course, when the pointers became 32-bit pointers, this optimization only zeroed one of the two pointers, leaving the other either NULL (most of the time), or, occasionally, pointing into the heap in an eventually destructive manner. I pointed out that this optimization happened once, at object creation time; the average application that used our library created perhaps six of these objects, and that according to the CPU data of the day before I'd spent not only 17 hours of my time but 6 hours of CPU time, and that if we fixed the bug and ran the formerly-failing program continuously, starting it up instantly after it finished, for fourteen years, the time saved by his clever hack would just about break even with the CPU time required to find and fix this piece of gratuitous nonsense. Several years later he was still doing tricks like this; some people never learn.

Summary

Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you'll probably optimize the wrong thing.

The result will be obscure, hard to write, hard to debug, and hard to maintain code that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and software maintenance costs, and (b) having no performance effect at all.

Hard to beat that combination! Now do you understand what I meant in the title?


The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.

Send mail to newcomer@flounder.com with questions or comments about this article.
Copyright © 1999 The Joseph M. Newcomer Co.
www.flounder.com/mvp_tips.htm

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Joseph M. Newcomer
United States United States
Member
No Biography provided

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralExcellent advicememberIAbstract13 Jun '11 - 1:54 
Experience is a great teacher. Your comment, "don't optimize something that is not the problem. Understand the problem first" reminds me of another quote: "The answer to your problem may not be the answer to your question."
GeneralMy vote of 4memberIAbstract13 Jun '11 - 1:52 
great information and very authoritative
Question"Profiling for Performance", in Dr. Dobb's Journalmember.dan.g.8 Nov '10 - 18:26 
Joseph
 
I'm unable to find a copy of your original article in DDJ and wondered if you could point me to a version.
 
Regards
 
Daniel
.dan.g.
 
AbstractSpoon Software
abstractspoon2_at_optusnet_dot_com_dot_au

AnswerRe: "Profiling for Performance", in Dr. Dobb's JournalmemberTCP_JM10 Nov '10 - 23:57 
http://www.drdobbs.com/184408927[^]
 
http://www.drdobbs.com/article/printableArticle.jhtml;jsessionid=ZGSL2Y22PC4YBQE1GHPSKHWATMY32JVN?articleId=184408927&dept_url=/[^]
GeneralGreat articlememberHemant kulkarni17 Jun '09 - 5:46 
Thanks for such a great article !
 
I really enjoyed reading your many articles but this one is special due to its time. We just finished a backup product. It is working smoothly in many places(it is still in beta) but is having problems regarding the backup speed. It is running the speed less than 50% of the speeds available in other products. We all developrs agreed that we need a optimization of code. The real point is what optimization? (We are talking about 1TB+ data to be backdup on LAN or WAN) . The another problem we are facing is statictics information. We need a data about using named pipes is fater than sockets by what factor etc?. Looking at the factor(50% slow) it is some what clear to me that the problem is not in the way code is written(we use simple Win32 API on Win2K or later platform) but is in general architecture of application(how data is read transferred and write).
After some analysis I found that we are doing extra read and write of data over pipe. Where can I get some matrix about performance of named pipe and socket? Also what other optimization u can suggest?
GeneralGood articlememberjri_smbc6 May '09 - 4:53 
I enjoyed reading this article and agree with its main arguments.
Thanks for sharing
GeneralNames! We need names!memberSandork2 Oct '08 - 9:06 
"Several years later he was still doing tricks like this; some people never learn."
 
Names! We need names!
Smile | :)
GeneralRe: Names! We need names!memberJoseph M. Newcomer4 Oct '08 - 4:02 
Since the person has not been a programmer in many, many years, is in fact an administrator now, and approaching retirement, I see no reason to release his name.
GeneralI completely agreememberBill Hallahan15 Sep '08 - 12:41 
I'm an electrical engineer who ended up writing lots of Digital Signal Processing code in my career, some of it in assembly language.
 
I agree with what's in the article.
 
I had an experience where I had optimized a floating point version of a routine in assembly language for a compressor. That ran faster. The, I did the same thing for the similar, but opposite, routine in the decompressor, and the assembly code ran slower! The reverse routine was integer code, and so it needed to shift the data after the multiplications. The additional shift instructions made the code exceed the processor's 4K instruction cache, and cache thrashing ensued.
 
This is why inline code can sometimes be slower than just calling a routine. If a small routine is cached in the primary processor cache, it might run faster than if it's in a routine that loops that exceeds the cache size. The processor will keep things that run often in cache.
 
There are rare exceptions though. Sometimes, it is obvious that a particular piece of code should be written in assembly code. An example would be a video decoder that takes advantage of SIMD instructions. Any 'C' implementation will definitely run much, much, slower. I suppose I only know this because of experience, however, the ability to do four multiples and two adds in one cycle, when hundreds of millions of multiple adds must be done, and where the instructions are designed to solve that type of problem, makes deciding to use them a no-brainer.
 
However, I'd still profile the code, even for the video/assembly code example mentioned above, for the reason mentioned in the article. There could be something unexpected slowing any program down. I only mentioned the processor cache affecting speed. There are many other issues beyond the algorithm complexity, such as device latencies, bus speeds, and traps, and exceptions. On one processor, I found out that one floating point conversion instruction was done using a trap instruction, and the conversion was done in code. (This was an early chip made by the company I worked at then, and this was fixed in later processors... the code was faster on other machines, but horribly slow on this one machine. I recode it to work differently - I could not have found this without profiling the code).
 
Great article. You clearly know what you're talking about.
 
Bill
 
Design Patterns

GeneralSpace optimizations in limited-size appsmembersupercat912 Jul '08 - 9:05 
I agree that timing optimizations should generally be limited to situations where a simple implementation is known to be too slow (either through profiling, or via calculations of actual time requirements). Space optimizations may be another story, however.
 
If an application will have a hard limit on code size (as distinct from having a general desire to minimize code size) and if it is anticipated that the limit is likely to pose difficulties, I would think it better to watch one's space utilization throughout the project, rather than being stuck at the end when things don't fit. If a particular space optimization might take some time to implement, it may be worthwhile to mark the code saying, in essence, "If ~100 more bytes are needed, they could be had here for a few hours' coding" but if a particular optimization is fast and easy and is known to save a few bytes, I would see no reason not to implement it at the coding phase. For example, as a matter of habit, replace
  for (i=0; i<8; i++){
}
with either
  i=8;
  do { .. loop
  } while(--i);
if the index is not needed in the loop or
  i=0;
  do {..loop
  } while(!((++i) & 8));
in situations where the upper bound is extremely unlikely to change (e.g. number of bits in a byte). (note on the latter optimization: much of my coding is done on architectures which are optimized for such bit tests).
 
If I know that I'm likely to be tight on code space, is there any reason not to save an instruction in places where I can do so easily (note that I typically code on machines with room for about 4,096 instructions, so an instruction here and an instruction there can really add up)? As a matter of style, which loop form do you like better? Aesthetically, I like the do{}while(--i) better form a bit better, since it does not rely upon the number of repetitions being a power of two, it's less ugly, and it executes a little faster, but they both tend to produce the same amount of code, e.g.
; i=8; do while(--i) version--four instructions; two cycles setup plus three per loop
  movlw 8
  movwf i
loop:
  ...
  decfsz i
    goto loop
 
; i=0; do while(!((++i) & 8)) version--one cycle setup plus four per loop
  clrf i
loop:
  incf i
  btfss i,3
    goto loop
Do you have any aesthetic preference?
 
One major difference between space optimizations and timing optimizations is that when optimizing for timing, code which is likely to execute once per second is only 1/1000 as important as code which executes 1,000 times per second; when optimizing for space, code which executes once per decade (or even code which is unlikely to execute ever!) is just as important as code which executes 10,000 times/second.
GeneralRe: Space optimizations in limited-size appsmemberJoseph M. Newcomer12 Jul '08 - 11:43 
I don't recognize the machine code involved, but it looks like a fixed-instruction-size machine, so the discussion is irrelevant to most modern architectures (e.g., the x86). On the whole, the issue of code size is irrelevant nearly all the time. Also, "cycle" is a pretty vague concept in a pipelined superscalar architecture with caches, so largely it is impossible to predict how long any instruction takes. A good rule of thumb for the x86 is for a 2.8GHz machine the average instruction time is 175ps (that's *pico* seconds), which means instruction times for loop increment and test are essentially free, no matter how you measure them.
 
Data size is what kills you; reducing the data size by a factor of 10 can produce reductions in time by factors of 1000 without much effort. Localizing data to a small number of pages is better than having data structures spread out everywhere. As I tell my students, "Code size is irrelevant. Data size will kill you. Third-order effects will dominate your performance"
 
My own preference is that if I have a counted loop I will use a for-loop because it conveys what I intend; since the loop increment and branch are free, counting cycles just doesn't matter.
GeneralRe: Space optimizations in limited-size appsmembersupercat912 Jul '08 - 13:31 
I don't recognize the machine code involved, but it looks like a fixed-instruction-size machine, so the discussion is irrelevant to most modern architectures (e.g., the x86).
 
It is irrelevant to most modern architectures used for general-purpose computing. It is very much relevant in modern architectures used in embedded devices (exercise machines, microwave ovens, etc.) If a company is going to be making 10,000 units of a particular product and a chip with 4K of code space is $0.25 cheaper than one with 8K, taking the time to fit the code into 4K instead of 8K will represent a cost savings of $2,500. If there are going to be 100,000 units of the device made, the savings would be $25,000.
 
I guess another way to look at it would be to say that if a piece of code is going into 10,000 of those little micros, the extra code space going from a 4K part to an 8K part costs $0.62/byte. If it's going into 100,000 units, the extra code costs $6.20/byte. Quite a different ball game from general-purpose computers where a failure to optimize code that's used by a million users would probably impose a total extra cost on all those users of less than $0.0001/byte.
GeneralRe: Space optimizations in limited-size appsmemberJoseph M. Newcomer12 Jul '08 - 16:02 
Yes, but the embedded/realtime world is a world of its own. My essays relate generally to programming applications in general-purpose operating systems. I've done embedded code, with fixed-size EPROMs, but the rules and culture of realtime programming do not apply to general-purpose application programming; it is rarely cost-effective to act as if this were so.
GeneralRe: Space optimizations in limited-size appsmembersupercat912 Jul '08 - 19:54 
I do a mixture of PC and embedded programming, though until recently the vast majority of my work was with embedded systems. I've found that while there are some principles that are only applicable to one style or the other, there are many principles that apply to both.
 
Your comment about data size is very well taken, though it applies IMHO not only to the size of dynamic data, but also to the size of data that is bundled with the code. In many situations, code is dwarfed by data that is bundled with it. Even if one figures that for a widely-used program the total cost to all users of each byte of the program is less than $0.0001/byte, the real-world cost can still be many dollars per megabyte. If a program uses 100 megabytes of data to accomplish something that could be done just as well in 50, the total cost to users will likely not be intolerable, but it won't be trivial either.
 
Incidentally, in my current project I actually run a large amount of code on a PIC (a small microcontroller) and on the PC. I have some routines like "load LCD data" which have separate implementations on the two platforms, but most of the code is identical. The PC versions of such routines feed data to a vb.net program which simulates the hardware platform. The net effect is that I can use the Visual Studio debugging features to get new features working on the PC, and then run that same code on the target system. I've used the approach before with DOS-based tools; this is my first time using VS. Also, btw, it's the first time I've done a lot of PIC code development using an editor other than the venerable PC-Write 3.01 which I've been using for over 20 years.
 
It's interesting--the microprocessor was originally designed to be a replacement for logic rather than a computing device, but the roles of the computer and the roles of the microprocessor converged. Over the last decade or so, however, they've become increasingly divergent. The processors used in PCs are grossly unsuitable for many embedded tasks and vice versa. Things have sure changed since the days of the venerable 6502 (for which I still enjoy coding, btw) eh?
GeneralRe: Space optimizations in limited-size appsmemberJoseph M. Newcomer13 Jul '08 - 6:48 
Actually, 100MB is trivial amounts of data. Data does not *really* become interesting until you start into GB. This was not always so. I remember developing programs that would do data compression of a database, creating a STRINGTABLE representation that could be put into a DLL; the resulting compression was 4:1 which made the difference between fitting the app on a laptop of the era or not being able to do it at all.
 
I developed a lot of embedded code (including dynamic loading of LCD data) on MS-DOS before we moved the code to an embedded platform (back in 1990).
 
I still have my 6502 Ohio Microsystems board and I'm sure it is still functional; it hasn't had power applied in perhaps 20 years...
GeneralRe: Space optimizations in limited-size appsmembersupercat913 Jul '08 - 10:11 
Actually, 100MB is trivial amounts of data.

 
That depends who's using it. If an application is going to be distributed over the Internet, adding 100MB to the download size will significantly increase download time for many users (and monstrously increase it for many of those in rural areas). If the program is a game which runs off DVD, adding 100MB to the load size will noticeably increase the load time every time a user plays the game. While one extra blob of 100MB will not generally cause too much annoyance, if there are very many they'll add up.
 
I still have my 6502 Ohio Microsystems board and I'm sure it is still functional; it hasn't had power applied in perhaps 20 years...

 
I still write games for the Atari 2600 Video Computer System. It has 128 bytes of RAM, and carts are typically 4K-32K. My last project was a menu for a 64K multi-game cart. The menu was about 20K, and featured four-voice wavetable synthesized music along with some pretty scrolling graphics and a small scrolltext. Not bad for a 6502 at 1.19MHz. The music generation took 46 cycles every 76--some rather nasty optimizations there! The project before that was an 8K cart which included a rather nice game of mine plus a 1.25K "bonus game" that someone else wrote.


GeneralRe: Space optimizations in limited-size appsmemberStepanus David Kurniawan6 Jan '10 - 15:37 
Interesting discussions: one from PC point of view, and other from embedded device.
I guess we go back to the Summary of this article: "Optimization matters only when it matters"
We should optimize speed, size, or both only when it matters. Smile | :)
GeneralOutstanding [modified]memberJahmani29 Apr '08 - 4:32 
I've read this article and I'm glad I didn't have to grasp the points made in it the proverbial hard way. However one point you make make (but don't push hard enough enough in my opinion Big Grin | :-D ) is the fact that writing software costs money and that this expenditure can only be justified if the perfomance of an application meets and outweighs the benefits realized.
 
modified on Wednesday, April 30, 2008 3:12 AM

GeneralExcellent article! But I disagree with the conclusion.membervaldok7 Jan '08 - 8:05 
You've mentioned the performance bottlenecks caused by the following bizzare things:
 
1. Kernel mode calls (ReadFile for 1 byte per call).
2. Heap allocations of compile-time predictable, very small sized blocks.
3. CPU cache misses and page faults due to the memory overuse.
4. Naive low-level code add-ons that disable some global compiler optimizations.
5. Inefficient algorithm in general.
 
I agree that diagnosing the problem is essential in order to solve it.
But why not avoid the very basic and obvious performance pitfalls ? I mean (1) and (2): barbaric kernel mode calls and heap allocations.
 
I agree that one should not try to optimize everything at any cost, but IMHO - those two things should the ABC of any code writing, regardless to wether your project already suffers from inefficiency or not.
 
Once I was asked to optimize a server. After a coupple of days I managed to make it tens times faster by just fixing those things.
 
Vladik.
GeneralRe: Excellent article! But I disagree with the conclusion.memberJoseph M. Newcomer7 Jan '08 - 9:19 
Yes, of course they should be avoided, but in fact they were not, and the result was really bad performance. Key here is identify the reason for the performance failure and fix it, and that it is usually someplace unanticipated, due to some real blunder. Yet people will happily spend weeks tuning some irrelevant piece of code in the mistaken belief they are "optimizing" their code, when the real culprit is simpler and more obvious.
Generalwrong wrong wrongmemberCodeVirtue5 Oct '07 - 20:08 
Anybody who survives long enough to get down to here please take it from a top notch real-time systems programmer with 25+ years experience: Any nitwit can pick up the "premature optimization is the root of all evil" argument. This article loses its own argument in the first line: "A reasonably skilled programmer will not write a grossly inefficient program". FALSE. Reasonably skilled programmers who subscribe to this line of thinking absolutely do write inefficient programs. Optimization is poorly understood and routinely receives far less attention than it should, even in AAA commercial software packages. Garbage like this is the reason why.

GeneralRe: wrong wrong wrongmemberDQNOK11 Oct '07 - 4:05 
Ha! Very funny.
 
Do you know any "top notch real-time systems programmer"s?
GeneralRe: wrong wrong wrongmemberCodeVirtue12 Oct '07 - 16:05 
of course

GeneralRe: wrong wrong wrongmemberAndrew Selivanov6 Dec '07 - 3:05 
Ok, what is optimization in your opinion?
GeneralRe: wrong wrong wrongmemberJoseph M. Newcomer2 Feb '08 - 15:23 
Hmmm...take it from a top-notch systems programmer with 45+ years experience that the nasty reality is that young programmers will spend weeks (literally), optimizing a piece of code that doesn't matter at all. Perhaps you have never had to work 90-hour weeks because you've been called in to save a project because the programmer decided he had a "faster" algorithm, put the whole project a month behind schedule, and wrote a very small and very fast version OF THE WRONG ALGORITHM, and his smaller, faster algorithm improved overall performance by < 1%. Sorry, I was the guy called in to rescue the project. The Unix philosophy all over again: it doesn't matter if it is right as long as it is small and fast.
 
This article is based on the fact that I spent over 15 years doing performance analysis on programs, and discovered that if you ask any programmer where the time is going in their code, you will get the WRONG answer. The parts where they think the time is going are not the parts where the actual time is being spent, and the outcome is that they spend weeks "tuning" irrelevant code. Without peformance data, you have no idea where to spend the effort. Anyone who tries to optimize code without performance data is a fool.
 
You don't do optimization until you know what needs to be optimized. And lines of code rarely matter. Architectural changes buy orders of magnitude performance improvement; lines-of-code changes typically buy small single-digit percentages. That's reality.
GeneralUseful ArticlememberSreekanth Muralidharan29 Nov '05 - 0:52 
Hello
That was a great job!! I found it really interesting to read through. Perhaps this really matters when a part or whole of a project is handled by different members of the team and the when members change quite often.
 
Cheers,
 
Sreekanth Muralidharan,
Corporate Systems Consultant [Embedded Systems],
INDIA
GeneralCrapsussAnonymous24 Jun '04 - 9:40 
I'll read this article again when its been redone, without all the annoying little detours and your showing off.
Just my comment.You won't find a better one.
GeneralRe: CrapmemberJoseph M. Newcomer26 Jun '04 - 6:32 
Life is hard. I have no intention of redoing it.
GeneralRe: CrapmemberPaulC197226 Nov '06 - 17:27 
Joseph M. Newcomer wrote:
Life is hard. I have no intention of redoing it.


Laugh | :laugh:
 
Very nice article Smile | :)
 


If you try to write that in English, I might be able to understand more than a fraction of it. - Guffa

GeneralRe: CrapmemberDavid Nash11 May '05 - 5:04 
Our anonymous writer here seems to be quite an expert on crap. Smile | :)
 
Seriously though...
 
It goes without saying that this article is quite opinionated, but hey - what's wrong with that! Personally, I found the article to be an enjoyable read, and also informative.
 
Unfortunately there are a few that can't handle other people's opinions (especially if expressed strongly). My advice to them is...
Don't read articles like this if you can't handle it. Spend your time doing something you actually enjoy.
 
Regards,
David
GeneralRe: Crapmemberpeterchen18 Nov '05 - 12:50 
1 for your post. Need I explain?
 

Pandoras Gift #44: Hope. The one that keeps you on suffering.
aber.. "Wie gesagt, der Scheiss is' Therapie"
boost your code || Fold With Us! || sighist | doxygen

GeneralWell putsussEloff30 Dec '03 - 22:24 
It is very true that optimization without clearly understanding where the problem lies is worse than a waste of time.
 
The trouble is though, that it goes against the grain of a programmer to write what he knows to be slow code even if the execution speed harldy matters. If the programmer can think of a better way, he/she will generally favor that, even if it means more work for naught in the end. The trick, I suppose, is to find a balance where you don't waste too much time on optimization, but don't use horribly ineffective code either.
 
I stumbled across your article while searching for some benchmark code I could plugin to quickly test if my compiler makes an unescessary copy on return and therefore if I should use coding trick to eliminate that. In reality, although the code may see a fair amount of use in the program, it will doubtless pale by comparison with the true performance hog, the core text parsing algorithim of the application. In the end the copying of a 12 byte structure will be like a bucket of water added to a river. It won't be noticed. It still bothers me greatly that I may not be writing the most effecient code, even if it is more manageable than the hack of using a private contructor for each operator and distinguishing them with a thrid dummy parameter.
 
Taking a trip down memory lane...
 
I remember once back when I was really naive, I thought I'd write this database application into C# instead of PHP. So I started learning a new language because all I knew was PHP, and began implementing my code. I thankfully was eager to test how much faster this compiled C like language would be and set up a benchmark where I could compare the performance of C# to PHP when selecting records in a large table.
 
As you might guess, I was shocked to discover that C# was considerably slower. The reason being, as I understand it now, that I was using a database driver that hides behind several layers of abstraction in C#. In PHP you have a highly optimized C++ API produced and maintained by the database company itself. In hindsight learning a new programming language helped me, and provided the bridge to C++, but otherwise was a waste of time.
 
Daniel Eloff
GeneralRe: Well putmemberJoseph M. Newcomer5 May '04 - 9:22 
I agree that the trouble is that writing slow code "goes against the grain" of most programmers, proving conclusively that such programmers were not well-trained, and in fact are mere technicians, not engineers.
 
An engineer who specified gold-palladium contacts for a child's toy would be fired. One who failed to do so for an aerospace device would likewise be fired.
 
Engineering is the art of building cost-effective solutions within specified constraints. These include ultimate performance, development costs, maintenance costs, etc. Overengineering a solution merely wastes time and intellectual effort that could be better spent figuring out what the real costs are.
 
It is possible to write a bad FORTRAN program in any language.
 
In a good compiler, nine levels of abstraction will generate half an instruction. C# has not come even close to this yet.
 
You can lose an order of magnitude performance in a bad interface. Or not. Until you measure, you have no idea.
GeneralRe: Well putsussPhilGan7 May '04 - 3:28 
The material used in a device doesn't seem to be a relevant example in this case; I'm sure most programming 'technicians' wouldn't specify cutting edge technology to run a basic program. Not anymore than an 'engineer' would use inferior material to build a high technology device.
 
The issue would seem to be in taking the time to make improvements in areas that are not part of the program's rate determining step. Considering that many programmers place an emphasis on code reuse I would argue that this is not the crime that Joseph makes it out to be. I may write code that could be used in several programs and, since I don't always know in advance every possible use it may have, wouldn't it make sense to make it as efficient as I can in the first instance?
GeneralRe: Well putmemberJoseph M. Newcomer7 May '04 - 4:18 
Rarely. Very, very rarely. The chances that a subroutine used in one program will become part of the critical performance of another is so low that it is not worth worrying about. Only when it becomes a performance issue (with data to prove it!) is it worth optimizing. So if you write a subroutine which is reused many times, the chances that it has a performance issue is very, very slight. If it does, then, and only then, is it worth revisiting it to see if there is a meaningful improvement.
 
Going back to my analogy, this is like an engineer saying "I'm designing a battery contact for a toy. It might be reused in other toys, and we might get complaints about poor battery contact, so I'll naturually specify this should be gold-palladium coated. The chances that the same battery contact would be reused in an aerospace application is vanishingly small, so optimizing its performance for some unforeseen purpose, based on some unlikely failure mode, is pointless.
 
Of course, if you reuse a design, it is always important to evaluate its suitability. Using an O(n**2) sort when you had less than 10 items is sensible; using it when you have 100,000 items would be a poor reuse. But writing the world's best whatever-subroutine as incredibly fast because it MIGHT be used someday in an unlikely context where it was the performance bottleneck makes no sense.
GeneralRe: Well putmemberBadut29 Sep '04 - 15:40 
Joseph M. Newcomer wrote:
In a good compiler, nine levels of abstraction will generate half an instruction. C# has not come even close to this yet.
 
I don't understand what this means but I would like to. Could Joseph or someone else please explain?

GeneralRe: Well putmemberJoseph M. Newcomer2 Oct '04 - 5:38 
Imagine a class which has a method. This method is implemented in terms of a method of another class, which, when we examine it, is implemented in terms of another method, and so on. Some of these methods may have switch statements or if-statements in them that say "if the second parameter is thus-and-such, call method A of class B, otherwise, call method C of class D" or something like that. Now suppose at the topmost call, the user provides a compile-time constant value. Ultimately, in the lowest-level class called, this turns into an access of a class member. The only thing that was important through all these levels of abstraction was getting ultimately to that class member. But the class member access only requires an addressing mode, e.g., [EBX+12]. That's half an instruction. I actually worked on such compilers in the early 1980s, and they were very good indeed. Most of this actually came out of the peephole optimizer that runs on the generated machine code.
GeneralOptimize this!memberTerry O'Nolley28 Sep '03 - 13:34 
In an operating systems class several years ago, we got into talking about optimization. THe entire hour was spent discussing "real world" situations where optimizations helped the OS run better.
 
One of the examples the instructor gave us had to do with a team of developers who used the wrong tool when benchmarking their new version of an OS.
 
They noticed that a very large percentage of the execution time was being spent in a small block of code. So they began reqriting that block in assembler but no matter what they did, the execution time did not shrink.
 
It was finally discovered that they were optimizing the idle loop.
 







GeneralRe: Optimize this!memberplaudeman4 May '04 - 13:14 
LMAO! Too funny...
GeneralRe: Optimize this!membergenome69200614 Nov '07 - 4:49 
Hahaha!Laugh | :laugh:
GeneralRe: Optimize this!membersupercat912 Jul '08 - 9:17 
Terry O'Nolley wrote:
It was finally discovered that they were optimizing the idle loop.

 
I remember a time when the 'idle loop' turned out to be a problem, but the required change was essentially an un-optimization: replace a sleep instruction with a branch-to-self. It turned out that the processor's reduced current consumption in sleep caused the system voltage level to vary enough to produce a noticeable "click" from the audio circuitry. Normally minimizing current consumption would have been a good thing, but here reducing current consumption 100x/second caused a faint but annoying buzz. Replacing the sleep instruction with a branch-to-self solved the buzzing.
 
BTW, while the anecdote is funny with regard to the idle loop, it's important to recognize that busy-wait loops of any kind may yield similar results but many of them do need optimizing. An application should only be spending time in a busy-wait loop if there's nothing else for it to be doing, but I've seen plenty of code which says:
  while (!character_is_ready())
    ;
Such code hogs CPU time, and generally needs fixing. Optimizing the character_is_ready() function will help only if the "optimization" includes having it yield CPU time when called excessively.
GeneralRe: Optimize this!memberJoseph M. Newcomer12 Jul '08 - 11:51 
This is one of the most common failure modes I see with new thread-programmers; they will spin on some boolean instead of waiting.
 
Spin locks, on the other hand, are quite effective. Consider that if I do
 
acquire lock
twiddle four to six pointers
release lock
 
and I have a conflict, I will spin for under 5ns, possibly under 3ns. But if I WaitForSingleObject, I have to enter the kernel, have my parameters validated, look up the handle in the handle map, verify it is a waitable object, test it, and if I have to wait, call the scheduler to deschedule the thread, which among other things, involves two context swaps (swap out current thread context, start some other thread). Then, when the wait is finished, I have to have call the OS to release the lock (parameter validation, handle table lookup, etc.) and then call the scheduler to schedule the thread, then go through two context swaps (save current, load next) to get that thread running. So I'll execute several hundred thousand, perhaps a million, instructions to save a conflict of ten instructions. Not a good tradeoff.
 
That's what CRITICAL_SECTION is good for: it is a timed spin lock. If it can acquire in a short period, no kernel call is involved; only if the delay seems long will it call the kernel to officially deschedule the thread.
 
Busy-wait does not scale well, but for small sequences, it is highly effective. If you follow the adage to "lock the smallest amount of data for the shortest possible time" then conflicts are low, and when they occur, are not terribly expensive if you use a spin lock. But in general, a spin lock (unlike true love) really *is* forever, so the compromise with the "timeout" (actually a count-down of a simple counter) of CRITICAL_SECTION balances the need for performance against the desire to not hog the CPU by polling interminably.
joe

GeneralRe: Optimize this! (spin locks)membersupercat912 Jul '08 - 13:20 
One should only loop on a spin lock if there is some reason to believe that the necessary condition either (1) is likely to come true the next time the loop executes, or (2) is likely to be made true by the next execution of the loop. On a single-processor system, if one doesn't yield one's time slice during a loop, there's no reason to expect any other threads to do anything; if the loop can only succeed if other threads to do something, it should yield its timeslice.
 
Having a loop yield its timeslice if it needs something else to happen before it can do useful work is a often a rather nasty kludge, at least in a system which has semaphores and other task-handling features. On the other hand, I've implemented a couple systems which do use CPU-yielding busywaits for even the simplest of conditions because the non-preemptive multi-tasking "OS" was too simple for anything else. On an 8x51 system, for example, my task switch code is
  MOV A,SP
  XCH A,AltSp
  MOV SP,A
  RET
That's it. A big whopping seven cycles to do a task switch. In a situation like that, code to determine whether it would be worthwhile to run a task would almost certainly be slower than letting the task itself do something like:
WaitForData:
  CALL TaskSwitch
  JNB  Data_Ready,WaitForData
In that situation, if one task calls TaskSwitch while the other task is waiting for data, control will return to the first task in 16 cycles.
GeneralRe: Optimize this! (spin locks)memberJoseph M. Newcomer12 Jul '08 - 16:00 
I agree that on a uniprocessor, a spin lock in a preemptive timesliced environment is certainly a waste of time. What's a uniprocessor?
 
Seriously, when multiprocessors are being sold to 12-year-olds (and for that matter, I gave my mother a multiprocessor when she was 86), most of the concerns about wasting timeslices tends to be less critical. And generally a spin lock can't release on the next cycle, because it is probably locking a short sequence of instructions, so it will loop perhaps a dozen times or more.
 
They are used extensively in the kernel for multiprocessor interlocks. In a general-purpose OS, a task switch is a massively expensive operation, tens of thousands of instructions. It usually disrupts the cache and the Translation Lookaside Buffer (TLB) as well.
 
A dedicated realtime system such as the OCCAM system can do a task switch on an 80x86 in < 20ns, and it is not uncommon to hear these users talking about running 300,000 threads.
GeneralRe: Optimize this! (spin locks)membersupercat912 Jul '08 - 19:58 
They are used extensively in the kernel for multiprocessor interlocks. In a general-purpose OS, a task switch is a massively expensive operation, tens of thousands of instructions. It usually disrupts the cache and the Translation Lookaside Buffer (TLB) as well.
 
I wonder whether any OS's provide a means by which a task that's waiting on a mutex can tell if the thread that holds it is currently executing? If not, I would think that would be a simple feature to add. Simply provide that each thread is assigned a global variable which will be zero when the task is running and non-zero when it isn't. A task which acquires a mutex could set a pointer to its 'active' flag; a flag that wants the mutex could dereference the pointer to determine whether it should spinlock or yield its timeslice. If the target of the pointer is zero, the task should spinlock; if it's non-zero, it should yield.
 
Do any systems do anything like that?
GeneralRe: Optimize this! (spin locks)memberJoseph M. Newcomer13 Jul '08 - 6:51 
Actually, it isn't a simple feature; furthermore, "running" is a problematic evaluation. If the thread has blocked on a page fault, is it "running"? If it blocked on a second synchronization primitive, is it "running"? And what is a "global variable"? Certainly for interprocess synchronization the concept is meaningless, and there is no way to implement it in Windows, Unix, linux, Solaris, Mac OS-X even within a single process.
 
The reason spin-locks are used in the kernel is that there is no scheduler running at that level, so the concept of descheduling a thread cannot exist.
GeneralRe: Optimize this! (spin locks)membersupercat913 Jul '08 - 10:03 
If one is waiting on a lock which is going to be held for less time than it takes to do a context switch, it makes sense to wait without yielding the CPU. If it's going to be held for more time than it takes t1o do two context switches, one should yield. If the time will be between those values, either action is probably reasonable.
 
While it wouldn't make sense to spend a huge amount of time deciding whether a lock was going to be held for a long or a short time, I would think that having the OS make available a rough heuristic might be a useful feature. As for the questions regarding page faults and nested locks, I would define the flag as being set if a context switch would be necessary for a thread to actually execute any instructions; the context-switch handler would be responsible for maintaining the flags.
 
Since a page fault would force a context switch, it should mark the thread as "not running". If a thread is waiting on a lock which is expected to be released soon, it would be considered "running"; if it's waiting on a lock which is not expected to be released soon, it would yield the CPU and thus become "not running". Not a perfect heuristic, but I would think that in many cases it could be implemented cheaply, at least with regard to multiple threads in the same address space.
GeneralHelp for Image ProcessingsussPrit14 Apr '03 - 7:43 

hello
i am working on a Image Processing project, i need algo/code on morphology,finding skeleton of a video camera image,Regional descriptors....
can anybody plz guid me
my email id is
prit_me@hotmail.com
 

 
VC++
GeneralRe: Help for Image ProcessingmemberChris Losinger14 Apr '03 - 8:11 
are you really going to post this on every article you can find?
 


Image tools: ThumbNailer, Bobber, TIFFAssembler

GeneralRe: Help for Image ProcessingsussPrit15 Apr '03 - 8:29 
Ya i am requesting for guid line on every Image Processing related artical i find.
At least one will help me....hay may be you can !!
i badly need it.
Prit
 
VC++

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web01 | 2.6.130516.1 | Last Updated 13 Aug 2000
Article Copyright 2000 by Joseph M. Newcomer
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid