Archive for the ‘Algorithms’ Category

Idling along (or what to do in the idle task)

Sunday, April 14th, 2013 Nigel Jones

If you are using an RTOS in your latest design then no doubt you have an idle task. (Most of the time the idle task is explicit, and is the user task with the lowest priority; sometimes it’s built into the RTOS.) It’s been my experience that the idle task is an interesting beast. On the one hand it is what gets executed when there’s nothing else to do, and thus inherently contains nothing directly related to the product. On the other hand it’s a wonderful resource that you can exploit to do all sorts of interesting things to improve your product, without having to worry too much about impacting its normal operation.

With that being said, here are some suggestions for what your idle task can do, starting with one thing it shouldn’t do.

Do Nothing

If your idle task consists of a do-nothing loop, then you are almost certainly missing an opportunity. Hopefully the suggestions below will get your creative juices flowing.

Watchdog

In any RTOS-based design, the idle task should play a role in the overall watchdog supervisor. Without going into a treatise on watchdog design, suffice it to say that if the idle task isn’t being run frequently then you’ve got a problem. Thus if the idle task doesn’t feature in your watchdog supervisor then you are doing something wrong. Note that if the idle task is featured in your watchdog, it doesn’t necessarily mean you are doing it right either!
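
By way of illustration, here is a minimal sketch of the kind of arrangement I have in mind. All of the names are hypothetical, and a real implementation needs atomic access to the shared flags, but the shape is the important thing: every task reports in, and the idle task only kicks the hardware watchdog when all of them have done so.

#include <stdint.h>

#define TASK_COUNT       4U
#define ALL_TASKS_ALIVE  ((1UL << TASK_COUNT) - 1UL)

extern void kick_hardware_watchdog(void);   /* Hypothetical BSP function */

static volatile uint32_t task_alive_flags;

/* Each task calls this from its main loop with its own ID (0..TASK_COUNT-1). */
void watchdog_checkin(uint8_t task_id)
{
    task_alive_flags |= (1UL << task_id);   /* Must be atomic in practice */
}

/* Called from the idle task. The hardware watchdog is kicked only when
 * every task has checked in since the last kick; if any task has stalled,
 * the watchdog is allowed to bite and reset the system.
 */
void watchdog_idle_service(void)
{
    if (task_alive_flags == ALL_TASKS_ALIVE)
    {
        task_alive_flags = 0;
        kick_hardware_watchdog();
    }
}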

Power Save

For power-constrained systems, the idle task is usually the place to put the microprocessor to sleep and / or indulge in other power saving strategies. I often find that my idle tasks consist of some of the features described here, plus power save. In other words, the idle task takes care of some housekeeping and then takes a nap.
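
In outline the pattern is trivial. The sketch below assumes a Cortex-M target, where the CMSIS intrinsic __WFI() puts the core to sleep until the next interrupt; do_idle_housekeeping() is a hypothetical stand-in for whichever of the other activities described in this post you adopt.

extern void do_idle_housekeeping(void);   /* Hypothetical: watchdog, checks, etc. */

/* Idle task body: take care of housekeeping, then take a nap. __WFI() is
 * the CMSIS "wait for interrupt" intrinsic; the core sleeps until the
 * next interrupt (typically the RTOS tick) wakes it.
 */
void idle_task(void)
{
    for (;;)
    {
        do_idle_housekeeping();
        __WFI();
    }
}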

Load calculation

Used in conjunction with a hardware timer and the task-switch hook function, it’s normally possible to construct a system that gives a decent indication of both overall CPU load and the CPU utilization of each task. The idle task isn’t a bad place to do all the calculations. I find this a very useful diagnostic aid as I’m developing a system. Once you are done using it as a development aid, with a bit more work it can be modified to be part of your overall watchdog strategy, in that it can provide useful information about how tasks are (mis)behaving.
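
The simplest variant needs no hook at all and gives overall load only; per-task utilization needs the task-switch hook and timer as described. As a hedged sketch (all names hypothetical): calibrate how fast the idle task can spin when the system is otherwise unloaded, then compare against that figure at run time.

#include <stdint.h>

static volatile uint32_t idle_counter;          /* Incremented by the idle task */
static uint32_t idle_counts_per_sec_unloaded;   /* Calibrated at start-up; assumed non-zero */

/* The idle task spins here, counting how often it gets to run. */
void idle_task_body(void)
{
    ++idle_counter;
}

/* Called once per second (e.g. from a timer task). Compares the idle
 * count over the last second against the calibrated no-load figure.
 */
uint8_t cpu_load_percent(void)
{
    uint32_t counts = idle_counter;

    idle_counter = 0;
    if (counts > idle_counts_per_sec_unloaded)
    {
        counts = idle_counts_per_sec_unloaded;   /* Guard against jitter */
    }
    return (uint8_t)(100UL - ((100UL * counts) / idle_counts_per_sec_unloaded));
}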

Flash Check

Just about every embedded system I’ve looked at in the last decade or two performs a CRC check on program memory at start-up. However, if you are designing a system that is safety critical and / or expected to run a long time between power cycles, then you should seriously consider running a Flash CRC check in the idle task. Because no writing of memory is involved and one is instead reading memory that is supposed to be constant, there are no real race conditions to worry about and thus there’s no need to be entering critical sections to perform the reads. Of course if you are using an MMU or MPU then things might get a little more challenging. Naturally such a challenge is nothing for a reader of your ability! [As an aside, one of my electrical engineering professors used to say to me, "Nigel, this is nothing for a man of your ability!" One is of course simultaneously flattered, irritated and motivated. I've never forgotten it.]
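
A sketch of the idea follows. Everything here is an assumption to be adapted: the flash geometry, the CRC routine, and where the reference CRC lives (which must itself be excluded from the checked region). The key point is that only a small chunk is checked per idle invocation, so the check never monopolizes the CPU.

#include <stdint.h>

#define FLASH_BASE      ((uint8_t const *)0x08000000)   /* Hypothetical */
#define FLASH_CRC_SIZE  (256UL * 1024UL)                /* Region covered by the CRC */
#define BYTES_PER_PASS  (64U)

extern uint16_t crc16_update(uint16_t crc, uint8_t byte);  /* Hypothetical CRC routine */
extern uint16_t const reference_crc;                       /* Recorded at build time */

/* Called from the idle task. Checks BYTES_PER_PASS bytes per call and
 * returns non-zero once a complete pass has failed to match.
 */
uint8_t idle_flash_check(void)
{
    static uint32_t offset = 0;
    static uint16_t crc = 0xFFFFU;
    uint8_t  failed = 0;
    uint32_t i;

    for (i = 0; (i < BYTES_PER_PASS) && (offset < FLASH_CRC_SIZE); ++i)
    {
        crc = crc16_update(crc, FLASH_BASE[offset]);
        ++offset;
    }

    if (offset >= FLASH_CRC_SIZE)
    {
        failed = (uint8_t)(crc != reference_crc);
        offset = 0;              /* Start the next pass */
        crc = 0xFFFFU;
    }
    return failed;
}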

RAM Check

This is of course the evil twin to the Flash check. However, this time you need to perform both reads and writes on locations that are being used by higher priority tasks and interrupts. You can only do this safely by executing a suitable lock / unlock procedure on each RAM location. Doing this in the idle task could seriously change your system’s response time, so you need to think very seriously about how to structure such a test. A good starting point is to do just one locked read / write per idle task invocation. Of course if that results in a 10 year RAM test on your system, then you’ll need to rethink the strategy.
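
Here is a sketch of the one-location-per-invocation approach (the linker symbols and lock / unlock mechanism are hypothetical). A test this simple mostly catches stuck-at faults, the pointer must be volatile so the compiler doesn’t optimize the test away, and the test’s own variables (and the stack) must live outside the region under test.

#include <stdint.h>

extern uint8_t __ram_test_start[];   /* Hypothetical linker symbols bounding */
extern uint8_t __ram_test_end[];     /* the region under test                */

extern void enter_critical(void);    /* Hypothetical lock / unlock */
extern void exit_critical(void);

/* Called from the idle task: tests exactly one byte per invocation.
 * The byte is saved, pattern-tested and restored inside a critical
 * section so that higher priority code never sees a corrupted value.
 */
uint8_t idle_ram_check_step(void)
{
    static uint8_t volatile *addr = __ram_test_start;
    uint8_t saved;
    uint8_t failed = 0;

    enter_critical();
    saved = *addr;
    *addr = 0x55U;
    failed |= (uint8_t)(*addr != 0x55U);
    *addr = 0xAAU;
    failed |= (uint8_t)(*addr != 0xAAU);
    *addr = saved;
    exit_critical();

    if (++addr >= __ram_test_end)
    {
        addr = __ram_test_start;     /* Wrap and start the next pass */
    }
    return failed;
}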

If you have other good ideas for idle task work then please leave them in the comments.

Efficient C Tip #13 – use the modulus (%) operator with caution

Tuesday, February 8th, 2011 Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems. As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator. Why is this? Well, a little thought shows that C = A % B is equivalent to C = A - B * (A / B). In other words, the modulus operator is functionally equivalent to three operations. As a result it’s hardly surprising that code that uses the modulus operator can take a long time to execute. Now in some cases you absolutely have to use the modulus operator. However in many cases it’s possible to restructure the code such that the modulus operator is not needed. To demonstrate what I mean, some background information is in order as to how this blog posting came about.

Converting seconds to days, hours, minutes and seconds

In embedded systems design there is an increasing need for some form of real-time clock. Typically the designer implements the time as a 32-bit variable containing the number of seconds since a particular date. Once this is done, it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. Well, I found myself in just such a situation recently. As a result, I thought a quick internet search was in order to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and as usual was highly PC-centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 seconds = time % 60UL;
 time /= 60UL;
 minutes = time % 60UL;
 time /= 60UL;
 hours = time % 24UL;
 time /= 24UL;
 days = time;  
}

This approach has a nice looking symmetry to it. However, it contains three divisions and three modulus operations. I was thus rather concerned about its performance, and so I measured its speed for three different architectures – AVR (8 bit), MSP430 (16 bit), and ARM Cortex (32 bit). In all three cases I used an IAR compiler with full speed optimization. The number of cycles quoted is for 10 invocations of the test code and includes the test harness overhead:

AVR:  29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No, that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures – but not all. If you are using the modulus operator on an ARM processor then it’s probably not worth worrying about. However if you are working on smaller processors then clearly something needs to be done – and so I investigated some alternatives.
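
For anyone wanting to reproduce this sort of measurement, here is a minimal sketch of a cycle-counting harness for a Cortex-M3 using the CMSIS DWT cycle counter. The original harness isn’t shown here, so treat the details (including the test values) as assumptions.

#include <stdint.h>
#include "core_cm3.h"    /* CMSIS core header (or your device header) */

extern void compute_time(uint32_t time);

/* Returns the number of CPU cycles consumed by 10 invocations. */
uint32_t benchmark_compute_time(void)
{
    uint32_t i;

    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* Enable the DWT unit */
    DWT->CYCCNT = 0;                                  /* Reset cycle counter */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              /* Start cycle counter */

    for (i = 0; i < 10U; ++i)
    {
        compute_time(123456789UL + i);                /* Arbitrary test value */
    }
    return DWT->CYCCNT;
}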

Attempt #2 – Replace the modulus operator

As mentioned in the introduction, C = A % B is equivalent to C = A - B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A / B) computed as part of the modulus operation is in fact needed in the next line of code. This suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 days = time / (24UL * 3600UL);    
 time -= days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = time / 3600UL;
 time -= (hours * 3600UL);
 /* time now contains the number of seconds in the last hour */
 minutes = time / 60U;
 seconds = time - minutes * 60U;
}

In this case I have replaced three modulus operations with three subtractions and three multiplications. Although I have replaced a single operator (%) with two operations (- and *), I still expected an increase in speed, because the modulus operator is effectively three operators in one (-, * and /). In effect I have eliminated three divisions, and so I expected a significant improvement in speed. The results, however, were a little surprising:

AVR:  18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique yielded roughly a factor of two improvement for the AVR and MSP430 processors, it had essentially no impact on the ARM code. Presumably this is because the ARM Cortex has hardware support for division, so the modulus operation was already cheap. Notwithstanding the ARM results, it’s clear that, at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.

I could of course just stop at this point. However, examination of attempt 2 shows that further optimizations are possible, by observing that if time is a 32 bit count of seconds, then days needs at most a 16 bit variable (2^32 seconds is about 49,710 days, which comfortably fits in 16 bits). Furthermore, hours, minutes and seconds are inherently limited to an 8 bit range. I thus recoded attempt 2 to use smaller data types.

Attempt #3 – Data type size reduction

My naive implementation of the code looked like this:

void compute_time(uint32_t time)
{
 uint16_t    days;
 uint8_t     hours, minutes, seconds;
 uint16_t    stime;

 days = (uint16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types and add casts where appropriate. The results were interesting:

AVR:  14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it resulted in a significant worsening for the ARM. Clearly the ARM doesn’t like working with non-32-bit variables. This suggested an improvement that would make the code a lot more portable – namely to use the C99 fast types. Doing so gives the following code:

Attempt #4 – Using the C99 fast data types

void compute_time(uint32_t time)
{
 uint_fast16_t    days;
 uint_fast8_t    hours, minutes, seconds;
 uint_fast16_t    stime;

 days = (uint_fast16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint_fast8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR:  14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM stayed at their fastest speeds. Thus attempt #4 is both fast and portable.

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility for further optimizations. As a result with the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to display integers on a display. For example, if you are using a custom LCD panel with, say, a 3-digit numeric field, then the problem arises as to how to determine the value of each digit. The obvious way, using the modulus operator, is as follows:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999)
 {
  value = 999;
 }

 lsd = value % 10;
 value /= 10;
 nsd = value % 10;
 value /= 10;
 msd = value;

 /* Now display the digits */
}

However, using the technique espoused above, we can rewrite this much more efficiently as:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999U)
 {
  value = 999U;
 }

 msd = value / 100U;
 value -= msd * 100U;

 nsd = value / 10U;
 value -= nsd * 10U;

 lsd = value;

 /* Now display the digits */
}

If you benchmark this you should find it considerably faster than the modulus-based approach.


Median Filter Performance Results

Tuesday, November 9th, 2010 Nigel Jones

In my earlier post on median filtering I made the claim that for filter sizes of 3, 5 or 7, using a simple insertion sort is ‘better’ than using Phil Ekstrom’s technique. It occurred to me that this claim was based upon my testing with 8-bit processors quite a few years ago, and that the results might not be valid for 32-bit processors with their superior pointer manipulation. Accordingly, I ran some benchmarks comparing an insertion sort based approach with Ekstrom’s method.

The procedure was as follows:

  1. I generated an array of random integers on the interval 900 to 1000. The idea is that these would represent data from a typical 10-bit ADC found on many microcontrollers.
  2. I then put together a baseline project which performed all the basic housekeeping functions, but without performing any filtering. The idea was to get a feel for the non-algorithm-specific overhead.
  3. I then put together a project which median filtered using an insertion sort, for sizes 3, 5, 7, 9, 11, and 13. Note that I elected to take a copy of the data prior to sorting. See this comment thread for a discussion of whether this is necessary or not.
  4. I put together another project which median filtered using Ekstrom’s method.
  5. I compiled the above for an ARM Cortex M3 target using an IAR compiler with full speed optimization.

The results were a clear win for Ekstrom. His code size was 132 bytes versus 224. His code was 5%, 32%, 61%, 89%, 113% and 146% faster than the insertion sort for filter sizes of 3, 5, 7, 9, 11 and 13 respectively. To be fair to the insertion sort technique, I have made no effort to optimize it. Notwithstanding this, I think I can say that for 32-bit targets, you may as well just use Ekstrom’s approach for all filter sizes.

I’ll endeavor to update this post with results for a 16-bit target (MSP430) in the next few days.

Well, I finally got around to running the tests on an MSP430 target. In this case Ekstrom’s method produced a larger code size (186 bytes versus 160). Much to my surprise, Ekstrom’s method was again dramatically superior to the insertion sort approach, with speeds of 69% faster for a filter size of 3, going up to a whopping 250% faster with a filter size of 13. The bottom line: I think my original claim is bunk. Use Ekstrom’s method by default!

Median filtering

Saturday, October 2nd, 2010 Nigel Jones

NOTE: I have heavily edited this blog post since it was originally published, based on some recent testing

If your engineering education was anything like mine, then I’m sure that you learned all about different types of linear filters whose essential objective is to pass signals within a certain frequency band and to reject as far as possible all others. These filters are of course indispensable for many types of ‘noise’. However, in the real world of embedded systems it doesn’t take long to realize that these classical linear filters are useless against burst noise. This kind of noise typically arises from a quasi-random event. For example, a 2-way radio may be keyed next to your product, or an ESD event may occur close to your signal. Whenever this happens your input signal may transiently go to a ridiculous value. For example, I have often seen A2D readings that look something like this: 385, 389, 388, 388, 912, 388, 387. The 912 value is presumably anomalous and as such should be rejected. If you try to use a classical linear filter then you will almost certainly find that the 912 reading ends up having a significant impact on the output. The ‘obvious’ answer in this case is to use a median filter. Despite the supposed obviousness of this, it’s my experience that median filters are used remarkably infrequently in embedded systems. I don’t know why this is, but my guess is that it’s a combination of a lack of knowledge of their existence, coupled with the difficulty of implementation. Hopefully this post will go some way to rectifying both issues.

As its name suggests, a median filter is one which takes the middle of a group of readings. It’s normal for the group to have an odd number of members such that there is no ambiguity about the middle value.  Thus the general idea is that one buffers a certain number of readings and takes the middle reading.

Until recently I recognized three classes of median filter, based purely on the size of the filter. They were:

  • Filter size of 3 (i.e. the smallest possible).
  • Filter size of 5, 7 or 9 (the most common).
  • Filter size of 11 or more.

However, I now espouse a simple dichotomy:

  • Filter size of 3
  • Filter size > 3

Filter size of 3

The filter size of three is of course the smallest possible filter. It’s possible to find the middle value simply via a few if statements. The code below is based on an algorithm described here. Clearly this is small and fast code.

uint16_t middle_of_3(uint16_t a, uint16_t b, uint16_t c)
{
 uint16_t middle;

 if ((a <= b) && (a <= c))
 {
   middle = (b <= c) ? b : c;
 }
 else if ((b <= a) && (b <= c))
 {
   middle = (a <= c) ? a : c;
 }
 else
 {
   middle = (a <= b) ? a : b;
 }
 return middle;
}
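
As a usage sketch, here is one way to drive middle_of_3() from a stream of readings, keeping a rolling three-sample history (the names are illustrative; the history starts zero-filled, so the output is only meaningful once three samples have arrived):

/* Feed in each new ADC reading; returns the median of the last three. */
uint16_t median3_filter(uint16_t new_sample)
{
 static uint16_t history[3];
 static uint8_t  idx = 0;

 history[idx] = new_sample;
 if (++idx >= 3U)
 {
   idx = 0;                    /* Wrap the circular history buffer */
 }
 return middle_of_3(history[0], history[1], history[2]);
}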

Filter size > 3

For filter sizes greater than 3, I suggest you turn to an algorithm described by Phil Ekstrom in the November 2000 edition of Embedded Systems Programming magazine. With the recent hatchet job on embedded.com I can’t find the original article. However there is a copy here. Ekstrom’s approach is to keep the samples in a linked list that is maintained in sorted order. It rests on the observation that once the list is sorted, removing the oldest value and inserting the newest doesn’t leave it significantly unsorted. As a result his approach works well – particularly for large filter sizes.

Be warned that there are some bugs in the originally published code (which Ekstrom corrected). Given the difficulty of finding anything on embedded.com nowadays, I have opted to publish my implementation of his code. Note that the code below was originally written in Dynamic C and has been ported to standard C for this blog posting. It is believed to work; however it would behoove you to check it thoroughly before use!

#define STOPPER             0                          /* Smaller than any datum */
#define MEDIAN_FILTER_SIZE  (13)

uint16_t median_filter(uint16_t datum)
{
 struct pair
 {
   struct pair   *point;                              /* Pointers forming list linked in sorted order */
   uint16_t  value;                                   /* Values to sort */
 };
 static struct pair buffer[MEDIAN_FILTER_SIZE] = {0}; /* Buffer of MEDIAN_FILTER_SIZE pairs */
 static struct pair *datpoint = buffer;               /* Pointer into circular buffer of data */
 static struct pair small = {NULL, STOPPER};          /* Chain stopper */
 static struct pair big = {&small, 0};                /* Pointer to head (largest) of linked list.*/

 struct pair *successor;                              /* Pointer to successor of replaced data item */
 struct pair *scan;                                   /* Pointer used to scan down the sorted list */
 struct pair *scanold;                                /* Previous value of scan */
 struct pair *median;                                 /* Pointer to median */
 uint16_t i;

 if (datum == STOPPER)
 {
   datum = STOPPER + 1;                             /* No stoppers allowed. */
 }

 if ( (++datpoint - buffer) >= MEDIAN_FILTER_SIZE)
 {
   datpoint = buffer;                               /* Increment and wrap data in pointer.*/
 }

 datpoint->value = datum;                           /* Copy in new datum */
 successor = datpoint->point;                       /* Save pointer to old value's successor */
 median = &big;                                     /* Median initially to first in chain */
 scanold = NULL;                                    /* Scanold initially null. */
 scan = &big;                                       /* Points to pointer to first (largest) datum in chain */

 /* Handle chain-out of first item in chain as special case */
 if (scan->point == datpoint)
 {
   scan->point = successor;
 }
 scanold = scan;                                     /* Save this pointer and   */
 scan = scan->point ;                                /* step down chain */

 /* Loop through the chain, normal loop exit via break. */
 for (i = 0 ; i < MEDIAN_FILTER_SIZE; ++i)
 {
   /* Handle odd-numbered item in chain  */
   if (scan->point == datpoint)
   {
     scan->point = successor;                      /* Chain out the old datum.*/
   }

   if (scan->value < datum)                        /* If datum is larger than scanned value,*/
   {
     datpoint->point = scanold->point;             /* Chain it in here.  */
     scanold->point = datpoint;                    /* Mark it chained in. */
     datum = STOPPER;
    }

   /* Step median pointer down chain after doing odd-numbered element */
   median = median->point;                       /* Step median pointer.  */
   if (scan == &small)
   {
     break;                                      /* Break at end of chain  */
   }
   scanold = scan;                               /* Save this pointer and   */
   scan = scan->point;                           /* step down chain */

   /* Handle even-numbered item in chain.  */
   if (scan->point == datpoint)
   {
     scan->point = successor;
   }

   if (scan->value < datum)
   {
     datpoint->point = scanold->point;
     scanold->point = datpoint;
     datum = STOPPER;
   }

   if (scan == &small)
   {
     break;
   }

   scanold = scan;
   scan = scan->point;
 }
 return median->value;
}

To use this code, simply call the function every time you have a new input value. It will return the median of the last MEDIAN_FILTER_SIZE readings. This approach can consume a fair amount of RAM as one has to store both the values and the pointers. However if this isn’t a problem for you then it really is a nice algorithm that deserves to be in your toolbox, as it is dramatically faster than algorithms based upon sorting.
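
In other words, usage looks something like this (read_adc() and process_reading() are hypothetical stand-ins for your data source and consumer):

extern uint16_t read_adc(void);              /* Hypothetical data source */
extern void process_reading(uint16_t cnts);  /* Hypothetical consumer    */

/* Call once per new sample, e.g. from a sampling task. */
void sample_and_filter(void)
{
 uint16_t filtered = median_filter(read_adc());

 process_reading(filtered);   /* Median of the last MEDIAN_FILTER_SIZE samples */
}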

Median filtering based on sorting

In the original version of this article I espoused using a sorting-based approach to median filtering when the filter size was 5, 7 or 9. I no longer subscribe to this belief. However, for those of you who want to do it, here’s the basic outline:

 if (ADC_Buffer_Full)
 {
   uint_fast16_t adc_copy[MEDIAN_FILTER_SIZE];
   uint_fast16_t filtered_cnts;

   /* Copy the data */
   memcpy(adc_copy, ADC_Counts, sizeof(adc_copy));
   /* Sort it */
   shell_sort(adc_copy, MEDIAN_FILTER_SIZE);
   /* Take the middle value */
   filtered_cnts = adc_copy[(MEDIAN_FILTER_SIZE - 1U) / 2U];
   /* Convert to engineering units */
   ...
 }

Final Thoughts

Like most things in embedded systems, median filters have certain costs associated with them. Clearly median filters introduce a delay in responding to a step change in value, which can be problematic at times. In addition, median filters can completely clobber frequency information in the signal. Of course if you are only interested in DC values then this is not a problem. With these caveats, I strongly recommend that you consider incorporating median filters in your next embedded design.

Sorting (in) embedded systems

Sunday, March 15th, 2009 Nigel Jones

Although countless PhDs have been awarded on sorting algorithms, it’s not a topic that seems to come up much in embedded systems (or at least the kind of embedded systems that I work on). Thus it was with some surprise recently that I found myself needing to sort an array of integers. The array wasn’t very large (about twenty entries) and I was eager to move on to the real problem at hand, and so I just dropped in a call to the standard C routine qsort(). I didn’t give it a great deal of thought because I ‘knew’ that a Quick Sort algorithm is in general fast and well behaved, and that with so few entries to sort I wasn’t too concerned about it being ‘optimal’. Anyway, with the main task at hand solved, on a whim I decided to take another look at qsort(), just to make sure that I wasn’t being too cavalier in my approach. Boy, did I get a shock! My call to qsort() was increasing my code size by 1500 bytes and it wasn’t giving very good sort times either. For those of you programming big systems, this may seem acceptable. In my case, the target processor had 16K of memory, and so 1500 bytes was a huge hit.

Surely there had to be a better solution? Well there’s always a better solution, but in my case in particular, and for embedded systems in general, what is the optimal sorting algorithm?

Well, after thinking about it for a while, I think the optimal sorting algorithm for embedded systems has these characteristics:

  1. It must sort in place.
  2. The algorithm must not be recursive.
  3. Its best, average and worst case running times should be of similar magnitude.
  4. Its code size should be commensurate with the problem.
  5. Its running time should increase linearly or logarithmically with the number of elements to be sorted.
  6. Its implementation must be ‘clean’ – i.e. free of breaks and returns in the middle of a loop.

Sort In Place

This is an important criterion not just because it saves memory, but most importantly because it obviates the need for dynamic memory allocation. In general dynamic memory allocation should be avoided in embedded systems because of problems with heap fragmentation and allocation performance. If you aren’t aware of this issue, then read this series of articles by Dan Saks.

Recursion

Recursion is beautiful and solves certain problems amazingly elegantly. However, it’s not fast and it can easily lead to problems of stack overflow. As a result, it should never be used in embedded systems.

Running Time Variability

Even the softest of real time systems have some time constraints that need to be met. As a result a function whose execution time varies enormously with the input data can often be problematic. Thus I prefer code whose execution time is nicely bounded.

Code Size

This is often a concern. Suffice to say that the code size should be reasonable for the target system.

Data Size Dependence

Sorting algorithms are usually classified using ‘Big O notation’ to denote how sensitive they are to the amount of data to be sorted. If N is the number of elements to be sorted, then an algorithm whose running time is N log N is usually preferred to one whose running time is N². However, as you shall see, for small N the advantage of the more sophisticated algorithms can be lost in the overhead of the sophistication.

Clean Implementation

I’m a great proponent of ‘clean’ code. Thus code where one exits from the middle of a loop isn’t as acceptable as code where everything proceeds in an orderly fashion. Although this is a personal preference of mine, it is also codified in, for example, the MISRA C requirements, to which many embedded systems are built.

Anyway, to determine the optimal sorting algorithm, I went to the Wikipedia page on sorting algorithms and initially selected the following for comparison to the built-in qsort(): Comb, Gnome, Selection, Insertion, Shell & Heap sorts. All of these are sort-in-place algorithms. I originally eschewed the Bubble & Cocktail sorts as they really have nothing to commend them. However, several people posted comments asking that I include them – so I did. As predicted, they have nothing to commend them. In all cases, I used the Wikipedia code pretty much as is, optimized for maximum speed. (I recognize that the implementations in Wikipedia may not be optimal – but they are the best I have.) For each algorithm, I sorted arrays of 8, 32 & 128 signed integers. In every case I sorted the same random array, together with a sorted array and an inverse sorted array. First, the code sizes in bytes:

qsort()      1538
Gnome()        76
Selection()   130
Insertion()   104
Shell()       242
Comb()        190
Heap()        200
Bubble()      104
Cocktail()    140

Clearly, anything is a lot better than the built-in qsort(). However, this isn’t a like-for-like comparison, because qsort() is a general purpose routine, whereas the others are designed explicitly to sort integers. Leaving aside qsort(), the Gnome, Insertion and Bubble sorts are clearly the code size leaders. Having said that, in most embedded systems 100 bytes here or there is irrelevant, and so we are free to choose based upon other criteria.

Execution times for the 8 element array

Name        Random  Sorted  Inverse Sorted
qsort()     3004     832    2765
Gnome()     1191     220    2047
Selection() 1120    1120    1120
Insertion() 544      287    756
Shell()     1233    1029    1425
Comb()      2460    1975    2480
Heap()      1265    1324    1153
Bubble()     875     208    1032
Cocktail()  1682     927    2056

In this case, the Insertion sort is the clear winner. Not only is it dramatically faster in almost all cases, it also has reasonable variability and almost the smallest code size. Notice that the Bubble sort, for all its vaunted simplicity, consumes as much code as the Insertion sort and runs considerably slower. Notice also that the Selection sort’s running time is completely consistent – and not too bad when compared to other methods.

Execution times for the 32 element array

Name        Random  Sorted  Inverse Sorted
qsort()     23004    3088   19853
Gnome()     17389     892   35395
Selection() 14392   14392   14392
Insertion()  5588    1179   10324
Shell()      6589    4675    6115
Comb()      10217    8638   10047
Heap()       8449    8607    7413
Bubble()    13664     784   16368
Cocktail()  17657    3807   27634

In this case, the winner isn’t so clear cut. Although the Insertion sort still performed well, it now shows a very large variation in running time. By contrast the Shell sort posts decent times with small variability. The Gnome, Bubble and Cocktail sorts show huge variability in execution times (with very bad worst cases), while the Selection sort shows consistent execution time. On balance, I’d go with the Shell sort in most cases.

Execution times for the 128 element array

Name         Random  Sorted  Inverse Sorted
qsort()      120772   28411   77896
Gnome()      316550    3580  577747
Selection()  217420  217420  217420
Insertion()   88475    4731  158020
Shell()       41661   25611   34707
Comb()        50858   43523   48568
Heap()        46959   49215   43314
Bubble()     231294    3088  262032
Cocktail()   271821   15327  422266

In this case the winner is either the Shell sort or the Heap sort, depending on how you weigh raw performance against performance variability. The Gnome, Bubble and Cocktail sorts are hopelessly outclassed.

So what to make of all this? Well, in any comparison like this there are a myriad of variables that one should take into account, and so I don’t believe these data should be treated as gospel. What is clear to me is that:

  1. Being a general purpose routine, qsort() is unlikely to be the optimal solution for an embedded system.
  2. For many embedded applications, a shell sort has a lot to commend it – decent code size, fast running time, well behaved and a clean implementation. Thus if you don’t want to bother with this sort of investigation every time you need to sort an array, then a shell sort should be your starting point. It will be for me henceforth. (A representative implementation appears below.)
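
For reference, here is a representative in-place shell sort for integers, similar in structure to the Wikipedia-style code used above. The exact gap sequence benchmarked above may have differed, so treat this as a sketch rather than the code that produced the numbers. Note that it meets the criteria discussed earlier: it sorts in place, isn’t recursive, and has no exits from the middle of a loop.

#include <stddef.h>
#include <stdint.h>

/* In-place shell sort using the classic gap-halving sequence. */
void shell_sort(int16_t a[], size_t n)
{
    size_t gap;

    for (gap = n / 2U; gap > 0U; gap /= 2U)
    {
        size_t i;

        for (i = gap; i < n; ++i)
        {
            int16_t tmp = a[i];      /* Element being gap-inserted */
            size_t  j;

            for (j = i; (j >= gap) && (a[j - gap] > tmp); j -= gap)
            {
                a[j] = a[j - gap];   /* Shift larger elements up */
            }
            a[j] = tmp;
        }
    }
}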
