👤ashishgandhi🕑13y🔼590🗨️119

(Replying to PARENT post)

Modern architectures really are quite complex, and it's important to keep your internal machine model up to date with what's happening. Examples: it used to be that on a 486, even correctly predicted taken branches carried a two-cycle penalty, and it mattered whether you arranged your code so the most commonly taken code was on the fall-through path or not. Today, correctly predicted branches are free: the processor stitches together the current path and the new path so there are no bubbles in the pipeline. Virtual function calls also used to be more expensive back in the day, because CPUs didn't try to predict the target address of the branch. Today, CPUs have a buffer that maps a program counter to a target address, so if your virtual function call predictably hits the same target each time, the only overhead is the extra memory accesses to indirect through the v-table.
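To see the v-table point concretely, here is a rough micro-benchmark sketch (my own illustration with made-up types, not anything from the thread): when the indirect call keeps hitting one target the branch target buffer predicts it, while a randomly mixed array defeats the prediction.

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Shape  { virtual long area() const = 0; virtual ~Shape() {} };
    struct Square : Shape { long area() const override { return 4; } };
    struct Circle : Shape { long area() const override { return 3; } };

    // One indirect call through the v-table per iteration.
    static long sum_areas(const std::vector<Shape*>& v) {
        long s = 0;
        for (const Shape* p : v) s += p->area();
        return s;
    }

    int main() {
        const int N = 10000000;
        std::vector<Shape*> same, mixed;
        std::mt19937 rng(42);
        for (int i = 0; i < N; ++i) {
            same.push_back(new Square);                  // one target: predictable
            if (rng() & 1) mixed.push_back(new Square);  // two targets in random
            else           mixed.push_back(new Circle);  // order: mispredicts often
        }
        for (const char* label : {"same", "mixed"}) {
            const auto& v = (label[0] == 's') ? same : mixed;
            auto t0 = std::chrono::steady_clock::now();
            long s = sum_areas(v);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: sum=%ld, %lld ms\n", label, s, (long long)
                std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        }
        // Objects are leaked deliberately to keep the sketch short.
    }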

At the same time, things that are expensive today may not be so in the near future. Haswell is supposed to make uncontended synchronization operations almost free. That will make a lot of algorithms, particularly lock-free algorithms, much more practical than they are today. For example, C++ smart pointers are slow in multi-threaded situations because the reference-count manipulation has to be synchronized with locked (atomic) instructions. But that synchronization is almost never contended (except when it is). Cheap synchronization should make it much more practical to use reference counting pervasively in C++ programs.
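As a minimal sketch of that reference-counting pattern (hypothetical names, an intrusive count rather than std::shared_ptr): each copy or destruction costs one locked instruction, which is exactly the kind of uncontended synchronization that cheaper hardware support speeds up.

    #include <atomic>
    #include <utility>

    // Intrusive reference count: no mutex, just atomic increments and
    // decrements, which are cheap when uncontended.
    struct RefCounted {
        std::atomic<long> refs{1};
        virtual ~RefCounted() {}
    };

    template <typename T>
    class Ref {
        T* p_;
    public:
        explicit Ref(T* p) : p_(p) {}                          // takes ownership
        Ref(const Ref& o) : p_(o.p_) {
            p_->refs.fetch_add(1, std::memory_order_relaxed);  // one locked add
        }
        Ref& operator=(Ref o) { std::swap(p_, o.p_); return *this; }
        ~Ref() {
            // The last owner sees the count drop to zero and frees the object.
            if (p_ && p_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete p_;
        }
        T* operator->() const { return p_; }
        T& operator*()  const { return *p_; }
    };

    // Usage: struct Node : RefCounted { int x = 0; };
    //        Ref<Node> a(new Node);  // refs == 1
    //        Ref<Node> b = a;        // one relaxed atomic add, refs == 2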

👤rayiner🕑13y🔼0🗨️0

(Replying to PARENT post)

Ah, this... the highest-voted question on SO. Not again, please - http://www.hnsearch.com/search#request/all&q=Why+is+proc... (an interesting case of the same page under different URLs bypassing the HN duplicate-URL check, though)

Go to http://stackoverflow.com/questions, sort by votes, and read from the top if you like.

👤lazydon🕑13y🔼0🗨️0

(Replying to PARENT post)

Several months old, but still worth a read - especially the end of Mystical's comment, where he discusses the different optimizations performed by different compilers. Particularly interesting is the Intel compiler, which effectively optimizes out the benchmark itself: something to keep in mind when testing your own code, if you want your results to make sense.
👤Bakkot🕑13y🔼0🗨️0

(Replying to PARENT post)

One takeaway from this for non-C coders is that higher-level languages will always be significantly slower than C, because it's far easier to make these kinds of fiddly pipeline optimizations when you don't have another abstraction layer in between mucking things up. There are few Java programmers, for instance, who intuitively understand how both the machine architecture and the JVM's compiler will affect the code they write; there are too many quirks. While in theory it's possible to write comparably fast Java code (assuming you're willing to do sketchy things for memory-manipulation-related tasks), in practice it's significantly more difficult to get all the juice out with higher-level languages.
👤soup10🕑13y🔼0🗨️0

(Replying to PARENT post)

Mystical mentions that Intel's compiler uses loop interchange to gain an extraordinary speedup. There's actually a great answer further down the page that explains it in detail; it was great for me, as I wasn't aware of how it worked: http://stackoverflow.com/a/11303693/300908
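Roughly, the transformation looks like this (my own sketch, not the code from the linked answer): interchanging the two loops turns a strided walk over the matrix into a sequential one without changing the result.

    #include <vector>

    using Matrix = std::vector<std::vector<int>>;

    // Column-major walk: consecutive iterations land a whole row apart,
    // so on a large matrix nearly every access misses the cache.
    long sum_slow(const Matrix& m) {
        long s = 0;
        for (size_t j = 0; j < m[0].size(); ++j)
            for (size_t i = 0; i < m.size(); ++i)
                s += m[i][j];
        return s;
    }

    // Interchanged loops: the same sum, but memory is read in the order
    // it is laid out, so the cache and prefetcher work as intended.
    long sum_fast(const Matrix& m) {
        long s = 0;
        for (size_t i = 0; i < m.size(); ++i)
            for (size_t j = 0; j < m[i].size(); ++j)
                s += m[i][j];
        return s;
    }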
👤Permit🕑13y🔼0🗨️0

(Replying to PARENT post)

Isn't HN amazing? Here is the same article submitted on no fewer than four previous occasions:

* http://news.ycombinator.com/item?id=4185226

* http://news.ycombinator.com/item?id=4170972

* http://news.ycombinator.com/item?id=4355548

* http://news.ycombinator.com/item?id=4167834

Total upvotes: 29.

Total discussion: 0.

No wonder people try to game the system and find optimal times to submit things.

No wonder people think HN is broken.

👤ColinWright🕑13y🔼0🗨️0

(Replying to PARENT post)

"You are the victim of branch prediction fail."

"Failure" is a perfectly serviceable word and the cutesy "fail" is just annoying in a technical discussion like this.

👤Lagged2Death🕑13y🔼0🗨️0

(Replying to PARENT post)

If you're looking for a good online course on how computer systems work, from the AND-gate scale up to data-center size, the iTunesU course "Computer Science 61C" (UC Berkeley, with Daniel Garcia) is really informative and fun to watch.

http://itunes.apple.com/de/itunes-u/computer-science-61c-001...

👤cvursache🕑13y🔼0🗨️0

(Replying to PARENT post)

Not to hate on the great top-voted answer, but isn't it a bit strange that it was posted a mere five minutes after the question?
👤mrich🕑13y🔼0🗨️0

(Replying to PARENT post)

The fact that someone would take the time to help someone in such depth restores my faith in humanity. With a picture to boot.
👤juddlyon🕑13y🔼0🗨️0

(Replying to PARENT post)

The answer focuses on branch misprediction, but there is also another factor: CPU data prefetch.

In order to offset increasing CAS latencies (largely due to the increasing number of channels), modern CPUs are greedy (in a good sense) and fetch more pages per memory-controller read request than are necessary.

If your data is laid out contiguously (in this case, sorted), the chances of the pages you need next already being primed via prefetch are greatly increased, and further read requests to the memory controller may be reduced or avoided.

I benchmarked the effect of data prefetching on modern processors with large data structures (usually limited to L1d cache lines):

https://github.com/johnj/llds#l1-data-prefetch-misses-200000...

A plausible analogy for CPU prefetching would be "readahead" mechanisms implemented by modern filesystems and operating systems.
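A rough way to see the effect at home (my own sketch, unrelated to the llds benchmark): perform the same number of loads sequentially and with a large stride, and compare timings.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sequential order lets the hardware prefetcher stream cache lines in
    // ahead of use; a large stride defeats it, so most loads pay the full
    // round trip to memory. Both passes perform exactly N additions.
    int main() {
        const size_t N = 1u << 26;            // 256 MB of ints, well past cache
        std::vector<int> v(N, 1);
        long sink = 0;

        for (size_t stride : {size_t(1), size_t(4096)}) {
            auto t0 = std::chrono::steady_clock::now();
            for (size_t start = 0; start < stride; ++start)
                for (size_t i = start; i < N; i += stride)
                    sink += v[i];
            auto t1 = std::chrono::steady_clock::now();
            std::printf("stride %zu: %lld ms\n", stride, (long long)
                std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        }
        return (int)(sink & 1);               // keep the loads live
    }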

👤zippie🕑13y🔼0🗨️0

(Replying to PARENT post)

Alexander J. Yee, who answered the question, is a student who calculated Pi to 5 trillion digits:

http://www.geekosystem.com/pi-five-trillion-digits-alexander...

"The main challenge for a computation of such a size is that both software and hardware are pushed beyond their limits. For such a long computation and with so much hardware, failure is not just a probability. It is a given."

👤dennisgorelik🕑13y🔼0🗨️0

(Replying to PARENT post)

Sorted arrays make it easy for a simple branch predictor to guess correctly, and if the branch predictor guesses correctly you avoid a pipeline stall.

Assuming the median value in the array is 128, then in the sorted case the first half of the array will have one cache miss and the second half will have one cache miss.

If the OP optimized their code, sorted/filtered the array, and then summed only those values greater than 128, it would be obvious why sorting is faster.
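Concretely, that version might look like this (a sketch with my own names, not the OP's code): after sorting, everything >= 128 is one contiguous suffix, so a single binary search replaces the per-element branch and the remaining loop is branch-free.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    long sum_ge_128(std::vector<int>& data) {
        std::sort(data.begin(), data.end());
        // First element >= 128; everything after it qualifies.
        auto first = std::lower_bound(data.begin(), data.end(), 128);
        return std::accumulate(first, data.end(), 0L);   // no branches left
    }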

👤fleitz🕑13y🔼0🗨️0

(Replying to PARENT post)

What about the time taken to do the sorting? Is there some rule of thumb that hints at whether taking the time to sort the data is going to be a net win?
👤grannyg00se🕑13y🔼0🗨️0

(Replying to PARENT post)

Interesting test, but it's not news if you know something about code optimization.

Caching of the data may also play a part in similar cases (though not this one - or rather, the effect is negligible).

Here's a little test to try: make the test result (inside the if) alternate between true and false (for example, sum only if the index is even) and see how long it takes.

Spoiler: modern branch predictors can detect and predict cyclic conditions.
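A sketch of that test (assuming you compile with optimization low enough that the compiler leaves the branch in rather than making it branch-free; check the assembly):

    #include <chrono>
    #include <cstdio>

    // The branch is taken on exactly every other iteration, a period-2
    // pattern that modern predictors lock onto almost immediately, so this
    // should run close to the speed of an always-taken branch.
    int main() {
        const long long N = 1000000000LL;
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (long long i = 0; i < N; ++i)
            if (i % 2 == 0)                   // perfectly cyclic condition
                sum += i;
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%lld in %lld ms\n", sum, (long long)
            std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    }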

👤raverbashing🕑13y🔼0🗨️0

(Replying to PARENT post)

TL;DR: Wouldn't it be great if branch prediction could switch itself on and off automatically (with every new process)? This would be difficult to trace or even implement, but I'm just guessing: the CPU stops using branch prediction if a lot of its predictions are failing, but starts again after some time.
👤hm8🕑13y🔼0🗨️0

(Replying to PARENT post)

Branch prediction is not the only reason this is faster and probably not even the biggest reason.

Since the array is sorted,

    if (data[c] >= 128)
will evaluate to true in one long consecutive run. When the cache requests something from memory, it requests blocks containing multiple words at a time. Since every data[c] that needs to be added to sum is in a contiguous piece of memory, the code minimizes the number of times a block is transferred from memory to the cache. This is the concept of spatial locality[1].

[1]http://en.wikipedia.org/wiki/Locality_of_reference#Locality_...
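A small illustration of the block-transfer point (my own sketch, assuming 4-byte ints and 64-byte cache lines): touching one int per line does a sixteenth of the arithmetic of the dense loop but fetches just as many blocks, so it is nowhere near sixteen times faster.

    #include <vector>

    // One transferred 64-byte block serves sixteen consecutive ints.
    long sum_dense(const std::vector<int>& v) {
        long s = 0;
        for (size_t i = 0; i < v.size(); ++i)
            s += v[i];                 // 16 additions per cache-line fill
        return s;
    }

    long sum_sparse(const std::vector<int>& v) {
        long s = 0;
        for (size_t i = 0; i < v.size(); i += 16)
            s += v[i];                 // 1 addition per cache-line fill
        return s;
    }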

👤prophetjohn🕑13y🔼0🗨️0

(Replying to PARENT post)

A very broad intuition: less entropy in the input means more predictability when executing algorithms on that input, and more regularities to exploit.
👤keefe🕑13y🔼0🗨️0

(Replying to PARENT post)

Where can I learn more about this?
👤marblar🕑13y🔼0🗨️0