πŸ‘€aloukissasπŸ•‘3yπŸ”Ό264πŸ—¨οΈ226

(Replying to PARENT post)

A potential lesson here (i.e. I am applying confirmation bias to retroactively view this article as justification for a strongly held opinion, lol):

Unless you are gonna benchmark something, for details like this you should pretty much always just trust the damn compiler and write the code in the most maintainable way.

This comes up in code review a LOT at my work:

- "you can write this simpler with XYZ"

- "but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access/some other combination of assumptions about what the compiler will generate and what is slow on a CPU"

I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.
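
To make "prove it" concrete: the usual tool is a Criterion micro-benchmark. A minimal sketch, assuming a made-up Vec3 and two add functions (none of this is the article's actual code):

  // benches/add.rs -- `criterion` added as a dev-dependency (illustrative setup)
  use criterion::{black_box, criterion_group, criterion_main, Criterion};

  #[derive(Copy, Clone)]
  struct Vec3 { x: f32, y: f32, z: f32 }

  fn add_by_copy(a: Vec3, b: Vec3) -> Vec3 {
      Vec3 { x: a.x + b.x, y: a.y + b.y, z: a.z + b.z }
  }

  fn add_by_borrow(a: &Vec3, b: &Vec3) -> Vec3 {
      Vec3 { x: a.x + b.x, y: a.y + b.y, z: a.z + b.z }
  }

  fn bench_add(c: &mut Criterion) {
      let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
      let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
      // black_box keeps the optimizer from folding the whole thing away.
      c.bench_function("add by copy", |bench| {
          bench.iter(|| add_by_copy(black_box(a), black_box(b)))
      });
      c.bench_function("add by borrow", |bench| {
          bench.iter(|| add_by_borrow(black_box(&a), black_box(&b)))
      });
  }

  criterion_group!(benches, bench_add);
  criterion_main!(benches);

If the two numbers come out within noise of each other, that's the answer: write whichever reads better.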

πŸ‘€bjackmanπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I don’t feel like this gave a satisfactory answer to the question. Since everything was inlined, the argument-passing convention made no difference in the micro-benchmarks. But what happens when it does not inline? Then you would actually be testing by-borrow vs. by-copy instead of how good Rust is at optimizing.
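
One way to poke at that question (a sketch of an approach, not what the article did): mark the functions #[inline(never)] so the call and its argument passing actually survive into the benchmark or the assembly. The Vec3 and dot functions here are illustrative:

  // Forcing an out-of-line call so the ABI difference is actually exercised.
  #[derive(Copy, Clone)]
  pub struct Vec3 { pub x: f32, pub y: f32, pub z: f32 }

  #[inline(never)]
  pub fn dot_by_copy(a: Vec3, b: Vec3) -> f32 {
      a.x * b.x + a.y * b.y + a.z * b.z
  }

  #[inline(never)]
  pub fn dot_by_borrow(a: &Vec3, b: &Vec3) -> f32 {
      a.x * b.x + a.y * b.y + a.z * b.z
  }

With that you are measuring the calls themselves, which, as noted, is a different question from how well rustc optimizes the inlined case.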
πŸ‘€celeritasceleryπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out". The compiler can then decide how to best implement it for that specific size and architecture.
πŸ‘€dwheelerπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is one of those questions where you really, honestly, do need to look at a very low level.

Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using *dst++ = *src++ loops rather than the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.

The simple loop was something like X cycles/byte, while the library version was P + (Q cycles/byte) but the difference was such that the crossover point was about 8 bytes. So, scraping out the simple memcpy implementations from the code was about a weekly thing for me.

At this point, we discovered that our C compiler would pass structs by value (this was the early-ish days of ANSI C and came as a surprise to some of my older coworkers), so we benchmarked that.

And discovered that its copy code was worse than the simple *dst++ = *src++ loops. By about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)

If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.

πŸ‘€mcguireπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I always prefer by-borrow. That's because in the future this struct may become non-Copy, and that would mean some unnecessary refactoring. My thinking is a bit like "don't take ownership if not needed", and the "not needed" part is the most important thing. Don't require things that are not needed.
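
A hypothetical illustration of that point: if the type later stops being Copy, a by-value API forces every call site to clone or restructure, while a by-borrow API keeps compiling unchanged.

  #[derive(Clone)]
  struct Config {
      retries: u32,
      endpoint: String, // added later; this field makes Config non-Copy
  }

  // By-borrow: callers are unaffected by whether Config is Copy.
  fn connect(cfg: &Config) -> u32 {
      cfg.retries
  }

  // By-value: once Config stops being Copy, every caller must clone or give up ownership.
  fn connect_owned(cfg: Config) -> u32 {
      cfg.retries
  }

  fn main() {
      let cfg = Config { retries: 3, endpoint: "example.invalid".into() };
      connect(&cfg);
      connect(&cfg);              // fine: borrowed twice
      connect_owned(cfg.clone()); // now needs a clone
      connect_owned(cfg);         // or a move; cfg is unusable afterwards
  }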
πŸ‘€lukaszwojtowπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

> Blech! Having to explicitly borrow temporary values is super gross.

I don’t think you ever have to write code like this. Implement your math traits in terms of both value and reference types, like the standard library does.

Go down to Trait Implementations for scalar types, for instance i32 [1]

  impl Add<&i32> for &i32
  impl Add<&i32> for i32
  impl Add<i32> for &i32
  impl Add<i32> for i32

Once you do that, your ergonomics should be exactly the same as with built-in scalar types.

[1] https://doc.rust-lang.org/std/primitive.i32.html
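
A sketch of what that looks like for a hypothetical Vec3; the standard library does the same thing for i32 behind macros:

  use std::ops::Add;

  #[derive(Copy, Clone)]
  struct Vec3 { x: f32, y: f32, z: f32 }

  impl Add for Vec3 {
      type Output = Vec3;
      fn add(self, rhs: Vec3) -> Vec3 {
          Vec3 { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
      }
  }

  // The three reference combinations just forward to the by-value impl.
  impl Add<&Vec3> for Vec3 {
      type Output = Vec3;
      fn add(self, rhs: &Vec3) -> Vec3 { self + *rhs }
  }

  impl Add<Vec3> for &Vec3 {
      type Output = Vec3;
      fn add(self, rhs: Vec3) -> Vec3 { *self + rhs }
  }

  impl Add<&Vec3> for &Vec3 {
      type Output = Vec3;
      fn add(self, rhs: &Vec3) -> Vec3 { *self + *rhs }
  }

  fn main() {
      let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
      let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
      // All four spellings now work:
      let _ = a + b;
      let _ = a + &b;
      let _ = &a + b;
      let _ = &a + &b;
  }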

πŸ‘€arcticbullπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Oh neat, that’s my blog. My old posts don’t resurface on HN that often.

Lots of criticism of my methodology in the comments here. That’s fine. That post was more of a self nerd snipe that went way deeper than I expected.

I hoped that my post would lead to a more definitive answer from some actual experts in the field. Unfortunately that never happened, afaik. Bummer.

πŸ‘€forrestthewoodsπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

It's compiled, so, without any investigation at all, I would have been disappointed if there were any significant difference in the code emitted in these cases. I would expect the compiler to do the efficient thing based on usage rather than the particular syntax. I may have too much faith in the compiler.
πŸ‘€ergonaughtπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I'd be interested to know what the benchmarks of the two Rust solutions look like when inlining is disabled, so we can get an idea of the performance characteristics of each function call, even if it's not a very realistic scenario.

The other question I have is which style you should use when writing a library. It's obviously not possible to benchmark all the software that will call your library, but you still want to consider readability and performance, as well as other factors such as common convention.

πŸ‘€spuzπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I would go with the version that gives the clean user interface (that is, by-copy in this case). If it turns out that the other version is significantly more performant, and this additional performance is critical for the end users, consider adding the by-borrow option.

The clarity of the code using a particular library is such a big (but often under-appreciated) benefit that I would heavily lean in this direction when considering interface options. My 2c.

πŸ‘€pteroπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I just went through all of this when building a raytracer.

* Sprinkling & around everything in math expressions does make them ugly. Maybe Rust needs an asBorrow or similar?

* If you inline everything then the speed is the same.

* Link time optimizations are also an easy win.

https://github.com/mcallahan/lightray
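
For reference, the LTO part is just a release-profile setting in Cargo.toml; these keys are standard, and the exact values are a matter of taste (a sketch):

  [profile.release]
  lto = true          # or "thin" for a cheaper variant
  codegen-units = 1   # optional: fewer, larger codegen units; slower builds, sometimes faster code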

πŸ‘€RustwerksπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

The benchmarks lack the standard deviation, so the results may well be equivalent. Don't roll your own micro-benchmark runners.

References may get optimized to copies where possible and sound (i.e. blittable and const); a common heuristic involves the size of a cache line (64b on most modern ISAs, including x86_64).

Using a Vector4 would have pushed the structure size beyond the 64b heuristic. You would also need to disable inlining for the measured methods.

πŸ‘€zamalekπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Note that this is from 2019, so it's probably worth re-benchmarking to see if anything has changed in the interim. Can we get the year added to the title?
πŸ‘€kibwenπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

For this code, the compiler inlined the call, so there should be no difference between pass-by-copy and pass-by-reference, which is what was measured. Where it could matter is when the code isn't inlined. But with small structs it might not matter all that much.

It does sometimes matter, though. One optimization I've seen in a few places is to box the error type, so that a Result doesn't copy the (usually empty) error by value on the stack. That actually makes a small performance difference, on the order of 5-10%.
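
The boxed-error trick looks roughly like this; ParseError here is made up, and the point is just that boxing shrinks the Err variant to a pointer so the Result stays small:

  use std::mem::size_of;

  #[derive(Debug)]
  struct ParseError {
      line: u64,
      column: u64,
      message: [u8; 64], // deliberately chunky
  }

  // Unboxed: the Result is at least as big as the error, even on the Ok path.
  fn parse_unboxed(input: &str) -> Result<u32, ParseError> {
      input.parse().map_err(|_| ParseError { line: 0, column: 0, message: [0; 64] })
  }

  // Boxed: the Err variant is one pointer wide.
  fn parse_boxed(input: &str) -> Result<u32, Box<ParseError>> {
      input.parse().map_err(|_| Box::new(ParseError { line: 0, column: 0, message: [0; 64] }))
  }

  fn main() {
      println!("unboxed: {} bytes", size_of::<Result<u32, ParseError>>());
      println!("boxed:   {} bytes", size_of::<Result<u32, Box<ParseError>>>());
      let _ = parse_unboxed("42");
      let _ = parse_boxed("42");
  }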

πŸ‘€eloffπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Folks, processors continue to give smaller and smaller gains every year. Something has to give. If you have critical path code that absolutely must max out the core, then this type of analysis (as pedantic as it is) is useful in the long run.
πŸ‘€BooneJSπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

It's not like you can do arithmetic with references, so maybe the ergonomics of by-value vs. by-reference shouldn't really be that different.

The cost of by-value lies in memory copies, while the cost of by-reference lies in dereferencing pointers where the values are needed, which might mean many more memory reads are needed than with by-value (depends on what you're doing). So it's just hard to tell which will do better in general -- there's no answer to that.

For a library, maybe providing both by-value and by-reference interfaces would be good (except that will bloat the library). For everything else, just use by-value, as it has the best ergonomics.

πŸ‘€cryptonectorπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I did the test on my computer:

Rust - By-Copy: 14124, By-Borrow: 8150

C++ - By-Copy: 12160, By-Ref: 11423

P.S. Just built it using LLVM under CLion IDE and the results are:

  G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
  Totals:
    Overlaps: 220384338
    By-Copy: 4397
    By-Ref: 4396  Delta: -0.0227428%

  Process finished with exit code 0
πŸ‘€FpUserπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This actually bothers me. I think the Rust performance here is praiseworthy. What bothers me is that we piled complexity on top of complexity at the hardware and compiler levels, and ended up in a situation where you have no way to get a reasonable understanding of how low-level code will perform. Nowadays the main reason to program in a "low level" language is that you know that, on average, the compiler will be able to do a better job because the language doesn't have abstractions that map poorly to the hardware model. But for much of it you can forget about "I know what the hardware is going to do".
πŸ‘€francassoπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I'm late to this discussion, sorry.

But at the risk of loss of respect, I'll wait for Rust2ShinyNewLanguage to solve this.

All I know is I hope I'm smart enough to understand ShinyNewLanguage's compiler. Or maybe even build it.

I've got several projects that could use some additional Boxes of structures, and borrow instead of move, and maybe a few more complex reference counting mechanics.

Rust forced me to understand what that meant. That's good for building a better engineer.

But it's not fun to work with.

I hope the next experience is better. Sorry Rustaceans.

πŸ‘€jgerrishπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

There is no single answer to this question because it's going to depend completely on call patterns further up. Especially in regards to how much of the rest of the running program's data fits in L1 cache, and most especially in regards to what's going on in terms of concurrency.

The benchmark made here could completely fall apart once more threads are added.

Modern computer architectures are non-uniform in terms of any kind of memory accesses. The same logical operations can have extremely varied costs depending on how the whole program flow goes.

πŸ‘€cmrdporcupineπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

My first thought was "now what is the calling convention for float parameters again? they are passed in registers right? the compiler can probably arrange so they don't have to actually be copied" and then I realized it will probably even inline it.

Anyway, assuming it's not inlined I would guess pass-by-copy, maybe with an occasional exception in code with heavy register pressure.

Edit: Actually, since it's a structure, the calling convention is to allocate it in memory and pass a pointer, doh. So it should actually compile to the same thing.
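
If you'd rather check than guess, one low-tech sketch: rustc's --emit=asm flag is real; the struct and function below are illustrative.

  // src/lib.rs -- no_mangle + inline(never) keep the symbol easy to find and
  // keep the call from being optimized away.
  #[repr(C)]
  #[derive(Copy, Clone)]
  pub struct Vec3 {
      pub x: f32,
      pub y: f32,
      pub z: f32,
  }

  #[no_mangle]
  #[inline(never)]
  pub extern "C" fn dot_by_copy(a: Vec3, b: Vec3) -> f32 {
      a.x * b.x + a.y * b.y + a.z * b.z
  }

  // Then, for example:
  //   rustc --crate-type=lib -C opt-level=3 --emit=asm src/lib.rs
  // and look at whether the struct arguments arrive in registers or in memory.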

πŸ‘€im3w1lπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Anyone know why seemingly knowledgeable people (like the person who wrote this article) don't use micro-benchmarking frameworks when they run these tests?

Also, whenever you do one of these, please post the full source with it. There's no reason to leave your readers in the dark, wondering what could be going on, which is exactly what I'm doing now, because there's almost no excuse for C++ to be slower than Rust at a task; it's just a matter of how much work you need to put in to get there.

πŸ‘€kolbeπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

The general usability impact matters slightly less than it looks here, in part because the `do_math` with references in the article has two extra &s, and in part because methods autoreference when called like x.f().

Performance-wise, if you're likely to touch every element in a type anyway, err on the side of copies. They are going to have to end up in registers eventually anyway, so you might as well let the caller find out the best way to put them there.
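
Concretely, the autoref point: a method that takes &self never needs an explicit & at the call site (hypothetical Vec3 again):

  #[derive(Copy, Clone)]
  struct Vec3 { x: f32, y: f32, z: f32 }

  impl Vec3 {
      // Takes self by reference...
      fn length(&self) -> f32 {
          (self.x * self.x + self.y * self.y + self.z * self.z).sqrt()
      }
  }

  fn main() {
      let v = Vec3 { x: 3.0, y: 4.0, z: 0.0 };
      // ...but the call site autoreferences; no `(&v).length()` needed.
      println!("{}", v.length());
  }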

πŸ‘€VeedracπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

The Rust test implements the Add, Sub, and Mul traits by value. This makes the few remaining references less important in the total test. The ergonomics argument is motivated by using these traits; otherwise, references would have had the same ergonomics.

But also, the struct is 3x32 bits and derives Copy. It is barely larger than a u64, which is the size of the reference.

But life is only simpler when Copy and Clone can be derived.

πŸ‘€yobboπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I covet Ada’s feature where you just specify whether a parameter is in, out, or in out; the compiler figures out whether to copy or pass a pointer.
πŸ‘€YesThatTom2πŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I haven't benchmarked that, but in Rust `ScalarPair`s (i.e., structs that have up to two scalars) are passed in two registers, while bigger structs are always passed by pointer. Therefore, passing bigger structs by move will require the compiler to copy them, while with references it is not required to, so references may be faster in that case.
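
I haven't verified the exact rules either, but the sizes at least are cheap to check; how each shape is actually passed is an unspecified ABI detail the compiler may change. A sketch with made-up types:

  use std::mem::size_of;

  #[allow(dead_code)]
  struct TwoScalars { a: u64, b: u64 }           // the "ScalarPair"-shaped case
  #[allow(dead_code)]
  struct ThreeScalars { a: u64, b: u64, c: u64 }

  fn main() {
      println!("TwoScalars:    {} bytes", size_of::<TwoScalars>());
      println!("ThreeScalars:  {} bytes", size_of::<ThreeScalars>());
      println!("&ThreeScalars: {} bytes", size_of::<&ThreeScalars>());
  }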
πŸ‘€afdbcreidπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I understand that this is an example for the purposes of answering the given question, but when actually doing things with 3D vertices one should be thinking in terms of structures of arrays. As someone said here already: good generals worry about strategy and great generals worry about logistics.
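
For context, the structure-of-arrays idea in Rust terms; the types are made up, and which layout wins depends entirely on the access pattern:

  // Array-of-structs: each vertex's fields sit together in memory.
  #[allow(dead_code)]
  struct VertexAoS { x: f32, y: f32, z: f32 }
  #[allow(dead_code)]
  struct MeshAoS { vertices: Vec<VertexAoS> }

  // Struct-of-arrays: each field gets its own contiguous buffer, which tends
  // to be friendlier to cache and SIMD when a pass only touches some fields.
  struct MeshSoA {
      xs: Vec<f32>,
      ys: Vec<f32>,
      zs: Vec<f32>,
  }

  impl MeshSoA {
      // A pass that only reads x and y never pulls the z buffer into cache.
      fn sum_xy(&self) -> f32 {
          self.xs.iter().zip(&self.ys).map(|(x, y)| x + y).sum()
      }
  }

  fn main() {
      let mesh = MeshSoA { xs: vec![1.0, 2.0], ys: vec![3.0, 4.0], zs: vec![0.0; 2] };
      println!("{}", mesh.sum_xy());
  }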
πŸ‘€lowbloodsugarπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I'm surprised he tested MSVC and Clang, and not GCC, which usually generates faster code than those two.
πŸ‘€redox99πŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

You are comparing two completely different compilers; I wouldn't worry all that much about the difference between Rust and C++. If you do want to compare them directly, why not use LLVM for C++ as well? That would highlight any language-specific differences.
πŸ‘€datafulmanπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Discussed at the time:

Should small Rust structs be passed by-copy or by-borrow? - https://news.ycombinator.com/item?id=20798033 - Aug 2019 (107 comments)

πŸ‘€dangπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is not really surprising in such a case. The Rust compiler is pretty good at optimizing out unneeded copies. Here it does see that the copied value is not used after the function call, so it should simply not emit the copies in the final assembly.
πŸ‘€tuetuopayπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Minor nit: many of the differences in the article aren't really specific to Rust vs. C++, but rather differences between LLVM and whatever compiler backend MSVC uses.
πŸ‘€ardel95πŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Interesting! Of note, my `Vec3` and `Quaternion` types (f32 and f64) have `Copy` APIs, but I've wondered about this since their inception.
πŸ‘€the__alchemistπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I don't know much about this, but by-copy sounds nice. I'd rather own stuff and be happy.
πŸ‘€TEP_Kim_Il_SungπŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Should have also tried pass-by-move.
πŸ‘€throwawaybycopyπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

A more direct comparison would have been an r-value reference.
πŸ‘€29athrowawayπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

So C++ is less complicated than Rust in some cases?
πŸ‘€cuteboy19πŸ•‘2yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

This is one of the problems I have with writing rust code. You have to think about so many mundane details that you barely have time left to think about more important and more interesting things.
πŸ‘€ameliusπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I guess by-copy bc I’m cool
πŸ‘€birdyroosterπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

It is a problem of statistics and depends on the internals of the underlying operating system. I'm not sure you really need that sort of optimisation.
πŸ‘€m00dyπŸ•‘3yπŸ”Ό0πŸ—¨οΈ0