(Replying to PARENT post)
Go to http://stackoverflow.com/questions, sort by votes, and read from the top if you like.
(Replying to PARENT post)
* http://news.ycombinator.com/item?id=4185226
* http://news.ycombinator.com/item?id=4170972
* http://news.ycombinator.com/item?id=4355548
* http://news.ycombinator.com/item?id=4167834
Total upvotes: 29.
Total discussion: 0.
No wonder people try to game the system and find optimal times to submit things.
No wonder people think HN is broken.
(Replying to PARENT post)
"Failure" is a perfectly serviceable word and the cutesy "fail" is just annoying in a technical discussion like this.
(Replying to PARENT post)
http://itunes.apple.com/de/itunes-u/computer-science-61c-001...
(Replying to PARENT post)
(Replying to PARENT post)
To offset increasing CAS latencies (largely due to the increasing number of channels), modern CPUs are greedy (in a good sense) and fetch more pages per memory-controller read request than are strictly necessary.
If your data is laid out contiguously (in this case, sorted), the chances that the pages you need next have already been primed by the prefetcher are greatly increased, and further read requests to the memory controller may be reduced or avoided.
I benchmarked the effect of data prefetching on modern processors with large data structures (usually limited to L1d cache lines):
https://github.com/johnj/llds#l1-data-prefetch-misses-200000...
A plausible analogy for CPU prefetching would be "readahead" mechanisms implemented by modern filesystems and operating systems.
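Here is a minimal sketch of that effect (sizes and names are mine; exact timings vary by machine): walk the same buffer once through a sequential index order and once through a shuffled one. The bytes touched are identical; only the access pattern changes.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;        // ~16M ints, larger than any cache
        std::vector<int> data(n, 1);

        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), std::size_t{0});

        auto walk = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (std::size_t i : idx) sum += data[i];
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
            std::printf("%s sum=%lld time=%lld ms\n", label, sum,
                        static_cast<long long>(ms.count()));
        };

        walk("sequential:");                  // prefetcher stays ahead of the loads

        std::shuffle(idx.begin(), idx.end(), std::mt19937{42});
        walk("shuffled:  ");                  // prefetcher is defeated
    }

The sequential walk is typically several times faster, even though both loops do the same arithmetic on the same data.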
(Replying to PARENT post)
http://www.geekosystem.com/pi-five-trillion-digits-alexander...

The main challenge for a computation of such a size is that both software and hardware are pushed beyond their limits. For such a long computation and with so much hardware, failure is not just a probability. It is a given.
(Replying to PARENT post)
Assuming the median value in the array is 128, then in the sorted case the first half of the array will have one cache miss and the second half will have one cache miss.
If the OP had optimized the code, sorting/filtering the array and then summing only the values greater than 128, it would be obvious why sorting is faster.
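A minimal sketch of that optimized version (my code, not the OP's; it assumes the condition from the question, data[c] >= 128): after sorting, the qualifying values form one contiguous tail, so the per-element branch disappears entirely.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Hypothetical rewrite, not the OP's code: sort, binary-search for the
    // first element >= 128, then sum the contiguous tail. No per-element
    // branch, and the summed values sit in one unbroken run of memory.
    long long sum_ge_128(std::vector<int> data) {
        std::sort(data.begin(), data.end());
        auto first = std::lower_bound(data.begin(), data.end(), 128);
        return std::accumulate(first, data.end(), 0LL);
    }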
(Replying to PARENT post)
Caching of the data may also play a part in similar cases (though not this one; or rather, the effect here is negligible).
Here's a little test you can try: make the test result (inside the if) alternate between true and false (for example, sum only if the index is even) and see how long it takes.
Spoiler: modern branch predictors can detect and predict cyclic conditions.
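A sketch of that experiment (function name and types are mine):

    #include <cstddef>
    #include <vector>

    // The branch alternates taken/not-taken on every iteration -- a perfectly
    // periodic pattern that a modern branch predictor learns almost at once.
    long long sum_even_indices(const std::vector<int>& data) {
        long long sum = 0;
        for (std::size_t i = 0; i < data.size(); ++i)
            if (i % 2 == 0)          // strictly alternating condition
                sum += data[i];
        return sum;
    }

On a simple two-bit predictor you would expect this to be slow; on anything recent it should run close to the always-taken case.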
(Replying to PARENT post)
Since the array is sorted,
if (data[c] >= 128)
will evaluate to true consecutively. When the cache requests something from memory, it will request blocks containing multiple words at a time. Since every data[c] that needs to be added to sum is in a contiguous piece of memory, the code is minimizing the number of times a block is transferred from memory to the cache. This is the concept of spatial locality [1].

[1] http://en.wikipedia.org/wiki/Locality_of_reference#Locality_...
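For reference, the loop being discussed is essentially the following (my reconstruction of the code from the question, simplified):

    #include <cstddef>
    #include <vector>

    // Reconstruction of the loop under discussion, not the exact original.
    long long sum_large(const std::vector<int>& data) {
        long long sum = 0;
        for (std::size_t c = 0; c < data.size(); ++c) {
            if (data[c] >= 128)    // with sorted input: false for a prefix,
                sum += data[c];    // then true for the whole remaining tail
        }
        return sum;
    }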
(Replying to PARENT post)
At the same time, things that are expensive today may not be so in the near future. Haswell is supposed to make uncontended synchronization operations almost free. It'll make a lot of algorithms, particularly lock-free algorithms, much more practical than they are today. For example, C++ smart pointers are slow in multi-threaded situations because the reference count manipulation needs to be done under a lock. But the lock is almost never needed (except when it is). Cheap synchronization should make it much more practical to use reference counting more pervasively in C++ programs.
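For concreteness, here is the kind of operation that gets cheaper (my own illustration, using atomic read-modify-writes rather than an explicit lock; not Haswell-specific code):

    #include <atomic>

    // Illustrative intrusive reference count, not any particular
    // smart-pointer implementation. Each retain/release is an atomic
    // read-modify-write -- a locked instruction on current x86, which is
    // the per-operation cost at issue.
    struct RefCounted {
        std::atomic<long> refs{1};

        void retain() { refs.fetch_add(1, std::memory_order_relaxed); }

        void release() {
            // acq_rel so the last owner observes all writes before deleting
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete this;
        }

        virtual ~RefCounted() = default;
    };

If the uncontended case of such operations becomes nearly free, pervasive reference counting of this sort stops being something to design around.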