|
|
Comments and Discussions
|
|
 |

|
Experience is a great teacher. Your comment, "don't optimize something that is not the problem. Understand the problem first" reminds me of another quote: "The answer to your problem may not be the answer to your question."
|
|
|
|

|
great information and very authoritative
|
|
|
|

|
Joseph
I'm unable to find a copy of your original article in DDJ and wondered if you could point me to a version.
Regards
Daniel
|
|
|
|
|

|
Thanks for such a great article !
I really enjoyed reading your many articles but this one is special due to its time. We just finished a backup product. It is working smoothly in many places(it is still in beta) but is having problems regarding the backup speed. It is running the speed less than 50% of the speeds available in other products. We all developrs agreed that we need a optimization of code. The real point is what optimization? (We are talking about 1TB+ data to be backdup on LAN or WAN) . The another problem we are facing is statictics information. We need a data about using named pipes is fater than sockets by what factor etc?. Looking at the factor(50% slow) it is some what clear to me that the problem is not in the way code is written(we use simple Win32 API on Win2K or later platform) but is in general architecture of application(how data is read transferred and write).
After some analysis I found that we are doing extra read and write of data over pipe. Where can I get some matrix about performance of named pipe and socket? Also what other optimization u can suggest?
|
|
|
|

|
I enjoyed reading this article and agree with its main arguments.
Thanks for sharing
|
|
|
|

|
"Several years later he was still doing tricks like this; some people never learn."
Names! We need names!
|
|
|
|

|
Since the person has not been a programmer in many, many years, is in fact an administrator now, and approaching retirement, I see no reason to release his name.
|
|
|
|

|
I'm an electrical engineer who ended up writing lots of Digital Signal Processing code in my career, some of it in assembly language.
I agree with what's in the article.
I had an experience where I had optimized a floating point version of a routine in assembly language for a compressor. That ran faster. The, I did the same thing for the similar, but opposite, routine in the decompressor, and the assembly code ran slower! The reverse routine was integer code, and so it needed to shift the data after the multiplications. The additional shift instructions made the code exceed the processor's 4K instruction cache, and cache thrashing ensued.
This is why inline code can sometimes be slower than just calling a routine. If a small routine is cached in the primary processor cache, it might run faster than if it's in a routine that loops that exceeds the cache size. The processor will keep things that run often in cache.
There are rare exceptions though. Sometimes, it is obvious that a particular piece of code should be written in assembly code. An example would be a video decoder that takes advantage of SIMD instructions. Any 'C' implementation will definitely run much, much, slower. I suppose I only know this because of experience, however, the ability to do four multiples and two adds in one cycle, when hundreds of millions of multiple adds must be done, and where the instructions are designed to solve that type of problem, makes deciding to use them a no-brainer.
However, I'd still profile the code, even for the video/assembly code example mentioned above, for the reason mentioned in the article. There could be something unexpected slowing any program down. I only mentioned the processor cache affecting speed. There are many other issues beyond the algorithm complexity, such as device latencies, bus speeds, and traps, and exceptions. On one processor, I found out that one floating point conversion instruction was done using a trap instruction, and the conversion was done in code. (This was an early chip made by the company I worked at then, and this was fixed in later processors... the code was faster on other machines, but horribly slow on this one machine. I recode it to work differently - I could not have found this without profiling the code).
Great article. You clearly know what you're talking about.
Bill
Design Patterns
|
|
|
|

|
I agree that timing optimizations should generally be limited to situations where a simple implementation is known to be too slow (either through profiling, or via calculations of actual time requirements). Space optimizations may be another story, however.
If an application will have a hard limit on code size (as distinct from having a general desire to minimize code size) and if it is anticipated that the limit is likely to pose difficulties, I would think it better to watch one's space utilization throughout the project, rather than being stuck at the end when things don't fit. If a particular space optimization might take some time to implement, it may be worthwhile to mark the code saying, in essence, "If ~100 more bytes are needed, they could be had here for a few hours' coding" but if a particular optimization is fast and easy and is known to save a few bytes, I would see no reason not to implement it at the coding phase. For example, as a matter of habit, replace for (i=0; i<8; i++){
} with either i=8;
do { .. loop
} while(--i);if the index is not needed in the loop or i=0;
do {..loop
} while(!((++i) & 8));in situations where the upper bound is extremely unlikely to change (e.g. number of bits in a byte). (note on the latter optimization: much of my coding is done on architectures which are optimized for such bit tests).
If I know that I'm likely to be tight on code space, is there any reason not to save an instruction in places where I can do so easily (note that I typically code on machines with room for about 4,096 instructions, so an instruction here and an instruction there can really add up)? As a matter of style, which loop form do you like better? Aesthetically, I like the do{}while(--i) better form a bit better, since it does not rely upon the number of repetitions being a power of two, it's less ugly, and it executes a little faster, but they both tend to produce the same amount of code, e.g.; i=8; do while(--i) version--four instructions; two cycles setup plus three per loop
movlw 8
movwf i
loop:
...
decfsz i
goto loop
; i=0; do while(!((++i) & 8)) version--one cycle setup plus four per loop
clrf i
loop:
incf i
btfss i,3
goto loopDo you have any aesthetic preference?
One major difference between space optimizations and timing optimizations is that when optimizing for timing, code which is likely to execute once per second is only 1/1000 as important as code which executes 1,000 times per second; when optimizing for space, code which executes once per decade (or even code which is unlikely to execute ever!) is just as important as code which executes 10,000 times/second.
|
|
|
|

|
I don't recognize the machine code involved, but it looks like a fixed-instruction-size machine, so the discussion is irrelevant to most modern architectures (e.g., the x86). On the whole, the issue of code size is irrelevant nearly all the time. Also, "cycle" is a pretty vague concept in a pipelined superscalar architecture with caches, so largely it is impossible to predict how long any instruction takes. A good rule of thumb for the x86 is for a 2.8GHz machine the average instruction time is 175ps (that's *pico* seconds), which means instruction times for loop increment and test are essentially free, no matter how you measure them.
Data size is what kills you; reducing the data size by a factor of 10 can produce reductions in time by factors of 1000 without much effort. Localizing data to a small number of pages is better than having data structures spread out everywhere. As I tell my students, "Code size is irrelevant. Data size will kill you. Third-order effects will dominate your performance"
My own preference is that if I have a counted loop I will use a for-loop because it conveys what I intend; since the loop increment and branch are free, counting cycles just doesn't matter.
|
|
|
|

|
I don't recognize the machine code involved, but it looks like a fixed-instruction-size machine, so the discussion is irrelevant to most modern architectures (e.g., the x86).
It is irrelevant to most modern architectures used for general-purpose computing. It is very much relevant in modern architectures used in embedded devices (exercise machines, microwave ovens, etc.) If a company is going to be making 10,000 units of a particular product and a chip with 4K of code space is $0.25 cheaper than one with 8K, taking the time to fit the code into 4K instead of 8K will represent a cost savings of $2,500. If there are going to be 100,000 units of the device made, the savings would be $25,000.
I guess another way to look at it would be to say that if a piece of code is going into 10,000 of those little micros, the extra code space going from a 4K part to an 8K part costs $0.62/byte. If it's going into 100,000 units, the extra code costs $6.20/byte. Quite a different ball game from general-purpose computers where a failure to optimize code that's used by a million users would probably impose a total extra cost on all those users of less than $0.0001/byte.
|
|
|
|

|
Yes, but the embedded/realtime world is a world of its own. My essays relate generally to programming applications in general-purpose operating systems. I've done embedded code, with fixed-size EPROMs, but the rules and culture of realtime programming do not apply to general-purpose application programming; it is rarely cost-effective to act as if this were so.
|
|
|
|

|
I do a mixture of PC and embedded programming, though until recently the vast majority of my work was with embedded systems. I've found that while there are some principles that are only applicable to one style or the other, there are many principles that apply to both.
Your comment about data size is very well taken, though it applies IMHO not only to the size of dynamic data, but also to the size of data that is bundled with the code. In many situations, code is dwarfed by data that is bundled with it. Even if one figures that for a widely-used program the total cost to all users of each byte of the program is less than $0.0001/byte, the real-world cost can still be many dollars per megabyte. If a program uses 100 megabytes of data to accomplish something that could be done just as well in 50, the total cost to users will likely not be intolerable, but it won't be trivial either.
Incidentally, in my current project I actually run a large amount of code on a PIC (a small microcontroller) and on the PC. I have some routines like "load LCD data" which have separate implementations on the two platforms, but most of the code is identical. The PC versions of such routines feed data to a vb.net program which simulates the hardware platform. The net effect is that I can use the Visual Studio debugging features to get new features working on the PC, and then run that same code on the target system. I've used the approach before with DOS-based tools; this is my first time using VS. Also, btw, it's the first time I've done a lot of PIC code development using an editor other than the venerable PC-Write 3.01 which I've been using for over 20 years.
It's interesting--the microprocessor was originally designed to be a replacement for logic rather than a computing device, but the roles of the computer and the roles of the microprocessor converged. Over the last decade or so, however, they've become increasingly divergent. The processors used in PCs are grossly unsuitable for many embedded tasks and vice versa. Things have sure changed since the days of the venerable 6502 (for which I still enjoy coding, btw) eh?
|
|
|
|

|
Actually, 100MB is trivial amounts of data. Data does not *really* become interesting until you start into GB. This was not always so. I remember developing programs that would do data compression of a database, creating a STRINGTABLE representation that could be put into a DLL; the resulting compression was 4:1 which made the difference between fitting the app on a laptop of the era or not being able to do it at all.
I developed a lot of embedded code (including dynamic loading of LCD data) on MS-DOS before we moved the code to an embedded platform (back in 1990).
I still have my 6502 Ohio Microsystems board and I'm sure it is still functional; it hasn't had power applied in perhaps 20 years...
|
|
|
|

|
| Actually, 100MB is trivial amounts of data.
That depends who's using it. If an application is going to be distributed over the Internet, adding 100MB to the download size will significantly increase download time for many users (and monstrously increase it for many of those in rural areas). If the program is a game which runs off DVD, adding 100MB to the load size will noticeably increase the load time every time a user plays the game. While one extra blob of 100MB will not generally cause too much annoyance, if there are very many they'll add up.
I still have my 6502 Ohio Microsystems board and I'm sure it is still functional; it hasn't had power applied in perhaps 20 years...
I still write games for the Atari 2600 Video Computer System. It has 128 bytes of RAM, and carts are typically 4K-32K. My last project was a menu for a 64K multi-game cart. The menu was about 20K, and featured four-voice wavetable synthesized music along with some pretty scrolling graphics and a small scrolltext. Not bad for a 6502 at 1.19MHz. The music generation took 46 cycles every 76--some rather nasty optimizations there! The project before that was an 8K cart which included a rather nice game of mine plus a 1.25K "bonus game" that someone else wrote.
|
|
|
|

|
Interesting discussions: one from PC point of view, and other from embedded device.
I guess we go back to the Summary of this article: "Optimization matters only when it matters"
We should optimize speed, size, or both only when it matters.
|
|
|
|

|
I've read this article and I'm glad I didn't have to grasp the points made in it the proverbial hard way. However one point you make make (but don't push hard enough enough in my opinion ) is the fact that writing software costs money and that this expenditure can only be justified if the perfomance of an application meets and outweighs the benefits realized.
modified on Wednesday, April 30, 2008 3:12 AM
|
|
|
|

|
You've mentioned the performance bottlenecks caused by the following bizzare things:
1. Kernel mode calls (ReadFile for 1 byte per call).
2. Heap allocations of compile-time predictable, very small sized blocks.
3. CPU cache misses and page faults due to the memory overuse.
4. Naive low-level code add-ons that disable some global compiler optimizations.
5. Inefficient algorithm in general.
I agree that diagnosing the problem is essential in order to solve it.
But why not avoid the very basic and obvious performance pitfalls ? I mean (1) and (2): barbaric kernel mode calls and heap allocations.
I agree that one should not try to optimize everything at any cost, but IMHO - those two things should the ABC of any code writing, regardless to wether your project already suffers from inefficiency or not.
Once I was asked to optimize a server. After a coupple of days I managed to make it tens times faster by just fixing those things.
Vladik.
|
|
|
|

|
Yes, of course they should be avoided, but in fact they were not, and the result was really bad performance. Key here is identify the reason for the performance failure and fix it, and that it is usually someplace unanticipated, due to some real blunder. Yet people will happily spend weeks tuning some irrelevant piece of code in the mistaken belief they are "optimizing" their code, when the real culprit is simpler and more obvious.
|
|
|
|

|
Anybody who survives long enough to get down to here please take it from a top notch real-time systems programmer with 25+ years experience: Any nitwit can pick up the "premature optimization is the root of all evil" argument. This article loses its own argument in the first line: "A reasonably skilled programmer will not write a grossly inefficient program". FALSE. Reasonably skilled programmers who subscribe to this line of thinking absolutely do write inefficient programs. Optimization is poorly understood and routinely receives far less attention than it should, even in AAA commercial software packages. Garbage like this is the reason why.
|
|
|
|

|
Ha! Very funny.
Do you know any "top notch real-time systems programmer"s?
|
|
|
|
|

|
Ok, what is optimization in your opinion?
|
|
|
|

|
Hmmm...take it from a top-notch systems programmer with 45+ years experience that the nasty reality is that young programmers will spend weeks (literally), optimizing a piece of code that doesn't matter at all. Perhaps you have never had to work 90-hour weeks because you've been called in to save a project because the programmer decided he had a "faster" algorithm, put the whole project a month behind schedule, and wrote a very small and very fast version OF THE WRONG ALGORITHM, and his smaller, faster algorithm improved overall performance by < 1%. Sorry, I was the guy called in to rescue the project. The Unix philosophy all over again: it doesn't matter if it is right as long as it is small and fast.
This article is based on the fact that I spent over 15 years doing performance analysis on programs, and discovered that if you ask any programmer where the time is going in their code, you will get the WRONG answer. The parts where they think the time is going are not the parts where the actual time is being spent, and the outcome is that they spend weeks "tuning" irrelevant code. Without peformance data, you have no idea where to spend the effort. Anyone who tries to optimize code without performance data is a fool.
You don't do optimization until you know what needs to be optimized. And lines of code rarely matter. Architectural changes buy orders of magnitude performance improvement; lines-of-code changes typically buy small single-digit percentages. That's reality.
|
|
|
|

|
Hello
That was a great job!! I found it really interesting to read through. Perhaps this really matters when a part or whole of a project is handled by different members of the team and the when members change quite often.
Cheers,
Sreekanth Muralidharan,
Corporate Systems Consultant [Embedded Systems],
INDIA
|
|
|
|

|
I'll read this article again when its been redone, without all the annoying little detours and your showing off.
Just my comment.You won't find a better one.
|
|
|
|

|
Life is hard. I have no intention of redoing it.
|
|
|
|

|
Joseph M. Newcomer wrote: Life is hard. I have no intention of redoing it.
Very nice article
If you try to write that in English, I might be able to understand more than a fraction of it. - Guffa
|
|
|
|

|
Our anonymous writer here seems to be quite an expert on crap.
Seriously though...
It goes without saying that this article is quite opinionated, but hey - what's wrong with that! Personally, I found the article to be an enjoyable read, and also informative.
Unfortunately there are a few that can't handle other people's opinions (especially if expressed strongly). My advice to them is...
Don't read articles like this if you can't handle it. Spend your time doing something you actually enjoy.
Regards,
David
|
|
|
|
|

|
It is very true that optimization without clearly understanding where the problem lies is worse than a waste of time. The trouble is though, that it goes against the grain of a programmer to write what he knows to be slow code even if the execution speed harldy matters. If the programmer can think of a better way, he/she will generally favor that, even if it means more work for naught in the end. The trick, I suppose, is to find a balance where you don't waste too much time on optimization, but don't use horribly ineffective code either. I stumbled across your article while searching for some benchmark code I could plugin to quickly test if my compiler makes an unescessary copy on return and therefore if I should use coding trick to eliminate that. In reality, although the code may see a fair amount of use in the program, it will doubtless pale by comparison with the true performance hog, the core text parsing algorithim of the application. In the end the copying of a 12 byte structure will be like a bucket of water added to a river. It won't be noticed. It still bothers me greatly that I may not be writing the most effecient code, even if it is more manageable than the hack of using a private contructor for each operator and distinguishing them with a thrid dummy parameter. Taking a trip down memory lane... I remember once back when I was really naive, I thought I'd write this database application into C# instead of PHP. So I started learning a new language because all I knew was PHP, and began implementing my code. I thankfully was eager to test how much faster this compiled C like language would be and set up a benchmark where I could compare the performance of C# to PHP when selecting records in a large table. As you might guess, I was shocked to discover that C# was considerably slower. The reason being, as I understand it now, that I was using a database driver that hides behind several layers of abstraction in C#. In PHP you have a highly optimized C++ API produced and maintained by the database company itself. In hindsight learning a new programming language helped me, and provided the bridge to C++, but otherwise was a waste of time. Daniel Eloff
|
|
|
|

|
I agree that the trouble is that writing slow code "goes against the grain" of most programmers, proving conclusively that such programmers were not well-trained, and in fact are mere technicians, not engineers.
An engineer who specified gold-palladium contacts for a child's toy would be fired. One who failed to do so for an aerospace device would likewise be fired.
Engineering is the art of building cost-effective solutions within specified constraints. These include ultimate performance, development costs, maintenance costs, etc. Overengineering a solution merely wastes time and intellectual effort that could be better spent figuring out what the real costs are.
It is possible to write a bad FORTRAN program in any language.
In a good compiler, nine levels of abstraction will generate half an instruction. C# has not come even close to this yet.
You can lose an order of magnitude performance in a bad interface. Or not. Until you measure, you have no idea.
|
|
|
|

|
The material used in a device doesn't seem to be a relevant example in this case; I'm sure most programming 'technicians' wouldn't specify cutting edge technology to run a basic program. Not anymore than an 'engineer' would use inferior material to build a high technology device.
The issue would seem to be in taking the time to make improvements in areas that are not part of the program's rate determining step. Considering that many programmers place an emphasis on code reuse I would argue that this is not the crime that Joseph makes it out to be. I may write code that could be used in several programs and, since I don't always know in advance every possible use it may have, wouldn't it make sense to make it as efficient as I can in the first instance?
|
|
|
|

|
Rarely. Very, very rarely. The chances that a subroutine used in one program will become part of the critical performance of another is so low that it is not worth worrying about. Only when it becomes a performance issue (with data to prove it!) is it worth optimizing. So if you write a subroutine which is reused many times, the chances that it has a performance issue is very, very slight. If it does, then, and only then, is it worth revisiting it to see if there is a meaningful improvement.
Going back to my analogy, this is like an engineer saying "I'm designing a battery contact for a toy. It might be reused in other toys, and we might get complaints about poor battery contact, so I'll naturually specify this should be gold-palladium coated. The chances that the same battery contact would be reused in an aerospace application is vanishingly small, so optimizing its performance for some unforeseen purpose, based on some unlikely failure mode, is pointless.
Of course, if you reuse a design, it is always important to evaluate its suitability. Using an O(n**2) sort when you had less than 10 items is sensible; using it when you have 100,000 items would be a poor reuse. But writing the world's best whatever-subroutine as incredibly fast because it MIGHT be used someday in an unlikely context where it was the performance bottleneck makes no sense.
|
|
|
|

|
Joseph M. Newcomer wrote:
In a good compiler, nine levels of abstraction will generate half an instruction. C# has not come even close to this yet.
I don't understand what this means but I would like to. Could Joseph or someone else please explain?
|
|
|
|

|
Imagine a class which has a method. This method is implemented in terms of a method of another class, which, when we examine it, is implemented in terms of another method, and so on. Some of these methods may have switch statements or if-statements in them that say "if the second parameter is thus-and-such, call method A of class B, otherwise, call method C of class D" or something like that. Now suppose at the topmost call, the user provides a compile-time constant value. Ultimately, in the lowest-level class called, this turns into an access of a class member. The only thing that was important through all these levels of abstraction was getting ultimately to that class member. But the class member access only requires an addressing mode, e.g., [EBX+12]. That's half an instruction. I actually worked on such compilers in the early 1980s, and they were very good indeed. Most of this actually came out of the peephole optimizer that runs on the generated machine code.
|
|
|
|

|
In an operating systems class several years ago, we got into talking about optimization. THe entire hour was spent discussing "real world" situations where optimizations helped the OS run better.
One of the examples the instructor gave us had to do with a team of developers who used the wrong tool when benchmarking their new version of an OS.
They noticed that a very large percentage of the execution time was being spent in a small block of code. So they began reqriting that block in assembler but no matter what they did, the execution time did not shrink.
It was finally discovered that they were optimizing the idle loop.
|
|
|
|
|

|
Hahaha!
|
|
|
|

|
Terry O'Nolley wrote: It was finally discovered that they were optimizing the idle loop.
I remember a time when the 'idle loop' turned out to be a problem, but the required change was essentially an un-optimization: replace a sleep instruction with a branch-to-self. It turned out that the processor's reduced current consumption in sleep caused the system voltage level to vary enough to produce a noticeable "click" from the audio circuitry. Normally minimizing current consumption would have been a good thing, but here reducing current consumption 100x/second caused a faint but annoying buzz. Replacing the sleep instruction with a branch-to-self solved the buzzing.
BTW, while the anecdote is funny with regard to the idle loop, it's important to recognize that busy-wait loops of any kind may yield similar results but many of them do need optimizing. An application should only be spending time in a busy-wait loop if there's nothing else for it to be doing, but I've seen plenty of code which says: while (!character_is_ready())
;
Such code hogs CPU time, and generally needs fixing. Optimizing the character_is_ready() function will help only if the "optimization" includes having it yield CPU time when called excessively.
|
|
|
|

|
This is one of the most common failure modes I see with new thread-programmers; they will spin on some boolean instead of waiting.
Spin locks, on the other hand, are quite effective. Consider that if I do
acquire lock
twiddle four to six pointers
release lock
and I have a conflict, I will spin for under 5ns, possibly under 3ns. But if I WaitForSingleObject, I have to enter the kernel, have my parameters validated, look up the handle in the handle map, verify it is a waitable object, test it, and if I have to wait, call the scheduler to deschedule the thread, which among other things, involves two context swaps (swap out current thread context, start some other thread). Then, when the wait is finished, I have to have call the OS to release the lock (parameter validation, handle table lookup, etc.) and then call the scheduler to schedule the thread, then go through two context swaps (save current, load next) to get that thread running. So I'll execute several hundred thousand, perhaps a million, instructions to save a conflict of ten instructions. Not a good tradeoff.
That's what CRITICAL_SECTION is good for: it is a timed spin lock. If it can acquire in a short period, no kernel call is involved; only if the delay seems long will it call the kernel to officially deschedule the thread.
Busy-wait does not scale well, but for small sequences, it is highly effective. If you follow the adage to "lock the smallest amount of data for the shortest possible time" then conflicts are low, and when they occur, are not terribly expensive if you use a spin lock. But in general, a spin lock (unlike true love) really *is* forever, so the compromise with the "timeout" (actually a count-down of a simple counter) of CRITICAL_SECTION balances the need for performance against the desire to not hog the CPU by polling interminably.
joe
|
|
|
|

|
One should only loop on a spin lock if there is some reason to believe that the necessary condition either (1) is likely to come true the next time the loop executes, or (2) is likely to be made true by the next execution of the loop. On a single-processor system, if one doesn't yield one's time slice during a loop, there's no reason to expect any other threads to do anything; if the loop can only succeed if other threads to do something, it should yield its timeslice.
Having a loop yield its timeslice if it needs something else to happen before it can do useful work is a often a rather nasty kludge, at least in a system which has semaphores and other task-handling features. On the other hand, I've implemented a couple systems which do use CPU-yielding busywaits for even the simplest of conditions because the non-preemptive multi-tasking "OS" was too simple for anything else. On an 8x51 system, for example, my task switch code is
MOV A,SP
XCH A,AltSp
MOV SP,A
RET That's it. A big whopping seven cycles to do a task switch. In a situation like that, code to determine whether it would be worthwhile to run a task would almost certainly be slower than letting the task itself do something like:
WaitForData:
CALL TaskSwitch
JNB Data_Ready,WaitForData
In that situation, if one task calls TaskSwitch while the other task is waiting for data, control will return to the first task in 16 cycles.
|
|
|
|

|
I agree that on a uniprocessor, a spin lock in a preemptive timesliced environment is certainly a waste of time. What's a uniprocessor?
Seriously, when multiprocessors are being sold to 12-year-olds (and for that matter, I gave my mother a multiprocessor when she was 86), most of the concerns about wasting timeslices tends to be less critical. And generally a spin lock can't release on the next cycle, because it is probably locking a short sequence of instructions, so it will loop perhaps a dozen times or more.
They are used extensively in the kernel for multiprocessor interlocks. In a general-purpose OS, a task switch is a massively expensive operation, tens of thousands of instructions. It usually disrupts the cache and the Translation Lookaside Buffer (TLB) as well.
A dedicated realtime system such as the OCCAM system can do a task switch on an 80x86 in < 20ns, and it is not uncommon to hear these users talking about running 300,000 threads.
|
|
|
|

|
They are used extensively in the kernel for multiprocessor interlocks. In a general-purpose OS, a task switch is a massively expensive operation, tens of thousands of instructions. It usually disrupts the cache and the Translation Lookaside Buffer (TLB) as well.
I wonder whether any OS's provide a means by which a task that's waiting on a mutex can tell if the thread that holds it is currently executing? If not, I would think that would be a simple feature to add. Simply provide that each thread is assigned a global variable which will be zero when the task is running and non-zero when it isn't. A task which acquires a mutex could set a pointer to its 'active' flag; a flag that wants the mutex could dereference the pointer to determine whether it should spinlock or yield its timeslice. If the target of the pointer is zero, the task should spinlock; if it's non-zero, it should yield.
Do any systems do anything like that?
|
|
|
|

|
Actually, it isn't a simple feature; furthermore, "running" is a problematic evaluation. If the thread has blocked on a page fault, is it "running"? If it blocked on a second synchronization primitive, is it "running"? And what is a "global variable"? Certainly for interprocess synchronization the concept is meaningless, and there is no way to implement it in Windows, Unix, linux, Solaris, Mac OS-X even within a single process.
The reason spin-locks are used in the kernel is that there is no scheduler running at that level, so the concept of descheduling a thread cannot exist.
|
|
|
|

|
If one is waiting on a lock which is going to be held for less time than it takes to do a context switch, it makes sense to wait without yielding the CPU. If it's going to be held for more time than it takes t1o do two context switches, one should yield. If the time will be between those values, either action is probably reasonable.
While it wouldn't make sense to spend a huge amount of time deciding whether a lock was going to be held for a long or a short time, I would think that having the OS make available a rough heuristic might be a useful feature. As for the questions regarding page faults and nested locks, I would define the flag as being set if a context switch would be necessary for a thread to actually execute any instructions; the context-switch handler would be responsible for maintaining the flags.
Since a page fault would force a context switch, it should mark the thread as "not running". If a thread is waiting on a lock which is expected to be released soon, it would be considered "running"; if it's waiting on a lock which is not expected to be released soon, it would yield the CPU and thus become "not running". Not a perfect heuristic, but I would think that in many cases it could be implemented cheaply, at least with regard to multiple threads in the same address space.
|
|
|
|

|
hello
i am working on a Image Processing project, i need algo/code on morphology,finding skeleton of a video camera image,Regional descriptors....
can anybody plz guid me
my email id is
prit_me@hotmail.com
VC++
|
|
|
|
|

|
Ya i am requesting for guid line on every Image Processing related artical i find.
At least one will help me....hay may be you can !!
i badly need it.
Prit
VC++
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
|
Learn about the potential pitfalls of code optimization.
| Type | Article |
| Licence | |
| First Posted | 16 May 2000 |
| Views | 248,036 |
| Bookmarked | 126 times |
|
|