(Replying to PARENT post)
> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation
UTF-8 is, quite deliberately, a superset of ASCII, so a UTF-8 solution is also correct for ASCII input. But if ASCII is all you need, a bytes-as-ASCII solution works fine in Rust.
This is why Rust provides ASCII variants of many string functions, such as to_ascii_lowercase and eq_ignore_ascii_case. The same functions are available on byte slices ([u8]), where ASCII may be what you have, whereas their Unicode cousins (to_lowercase, for example) are not available on byte slices.
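To make that concrete, here is a minimal sketch of ASCII-only whitespace splitting and lowercasing done directly on a byte slice. The function name ascii_words_lowercased is made up for illustration; the slice methods it calls (split, to_ascii_lowercase) and u8::is_ascii_whitespace are from the standard library.

```rust
/// Hypothetical helper: split on ASCII whitespace and lowercase ASCII letters,
/// working on raw bytes with no UTF-8 validation.
/// Bytes >= 0x80 (e.g. the pieces of a multi-byte UTF-8 sequence) pass through unchanged.
fn ascii_words_lowercased(input: &[u8]) -> Vec<Vec<u8>> {
    input
        // Split wherever a byte is ASCII whitespace (space, tab, LF, FF, CR).
        .split(|b| b.is_ascii_whitespace())
        // Runs of whitespace produce empty pieces; drop them.
        .filter(|w| !w.is_empty())
        // Only A-Z is mapped to a-z; every other byte is copied as-is.
        .map(|w| w.to_ascii_lowercase())
        .collect()
}
```

Called on "Émile Zola".as_bytes(), this should yield the words "Émile" and "zola": the two bytes that encode É are not ASCII, so they are left alone rather than corrupted.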
(Replying to PARENT post)
Sure, but is it changing the problem to something easier than what the other languages are already doing, or to something more similar? I'd imagine the C code is basically just using byte arrays as well, for instance.
(Replying to PARENT post)
To be fair, that's what the C version does as well.
(Replying to PARENT post)
No, that's very wrong. ripgrep, for example, has rich Unicode support but represents file contents as byte strings. UTF-8 strings vs byte strings is an implementation detail.
I think you might benefit from reading the "bonus" Rust submission: https://github.com/benhoyt/countwords/blob/8553c8f600c40a462...
IMO, Ben kind of glossed over the bonus submission. But I personally think it was the biggest point in favor of Rust for a real world version of this task.
(Replying to PARENT post)
For tons of questions, both can be correct.
(Replying to PARENT post)
* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.
* It uses a faster hash algorithm. It’s not the first time this has come up in a benchmark article. Rust’s decision to use a DoS-resistant hash by default (and not provide a fast algorithm in std, like other languages do) really seems to hurt it in that kind of microbenchmark.
* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.
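For anyone who hasn't run into the get_mut+insert pattern from the last point, here is a rough sketch of it, not the code from the submission. The function name count_words and the choice of Vec<u8> keys are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical word counter keyed by owned byte strings.
fn count_words(words: &[&[u8]]) -> HashMap<Vec<u8>, u64> {
    let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();
    for &word in words {
        // Look up with the borrowed slice: no allocation when the word is already present.
        if let Some(n) = counts.get_mut(word) {
            *n += 1;
            continue;
        }
        // First occurrence only: allocate an owned key and insert it.
        counts.insert(word.to_vec(), 1);
    }
    counts
}
```

The more convenient spelling, *counts.entry(word.to_vec()).or_insert(0) += 1, has to build the owned key before it knows whether the word is already in the map, so it allocates on every word. The price of the version above is a second hash lookup the first time each word is seen.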