(Replying to PARENT post)

For those curious about Rust optimization, the “optimized” version makes three changes compared to the “idiomatic” version:

* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

* It uses a faster hash algorithm. This isn't the first time it has come up in a benchmark article. Rust's decision to use a DoS-safe hash by default (and not provide a fast algorithm in the std, like other languages do) really seems to hurt it in this kind of microbenchmark.

* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.

👤codeflo🕑3y🔼0🗨️0

(Replying to PARENT post)

The problem as specified declares that the words we're counting are ASCII:

> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation

UTF-8 (quite deliberately) is a superset of ASCII. So a UTF-8 solution is correct for ASCII input, and a bytes-as-ASCII solution also works fine in Rust if you only need ASCII.

This is why Rust provides ASCII variants of a lot of functions on strings, and the same functions are available on byte slices [u8] where ASCII could be what you have (whereas their Unicode cousins are not available on byte slices).

👤tialaramex🕑3y🔼0🗨️0

(Replying to PARENT post)

> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

Sure, but is it changing the problem to something easier than what the other languages are already doing, or to something more similar? I'd imagine the C code is basically just using byte arrays as well, for instance.

👤saghm🕑3y🔼0🗨️0

(Replying to PARENT post)

> * It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

To be fair, that's what the C version does as well.

👤olalonde🕑3y🔼0🗨️0

(Replying to PARENT post)

> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

No, that's very wrong. ripgrep has rich Unicode support for example, but represents file contents as byte strings. UTF-8 strings vs byte strings is an implementation detail.

I think you might benefit from reading the "bonus" Rust submission: https://github.com/benhoyt/countwords/blob/8553c8f600c40a462...

IMO, Ben kind of glossed over the bonus submission. But I personally think it was the biggest point in favor of Rust for a real world version of this task.

👤burntsushi🕑3y🔼0🗨️0

(Replying to PARENT post)

The question as posed only cares about ASCII.
👤SAI_Peregrinus🕑3y🔼0🗨️0

(Replying to PARENT post)

>It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

For tons of questions, both can be correct.

👤coldtea🕑3y🔼0🗨️0

(Replying to PARENT post)

Ad UTF-8: it would change the problem only if the definition of word breaks changed. E.g., if a word break is defined by whitespace, then it probably doesn't.
👤stewbrew🕑3y🔼0🗨️0