πŸ‘€evilpieπŸ•‘11yπŸ”Ό191πŸ—¨οΈ66

(Replying to PARENT post)

Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick employed by many string classes.

It's even a bit surprising that a codebase as popular and performance-critical as SpiderMonkey hadn't picked up such a simple, high-yield optimization several years ago.

By the way, other implementations are even lazier: the string is kept in its original encoding (UTF-8 or 7-bit ASCII) until someone calls a method requiring indexed access to characters, such as the subscript operator. At that point, you convert to UTF-16 for O(1) random access.
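Sketched out, that lazy scheme looks something like this (a minimal, hypothetical C++ illustration, not code from any particular engine; the UTF-8 decoder is deliberately simplified and unvalidated):

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // A string that keeps its original UTF-8 bytes and only inflates to
    // UTF-16 the first time someone asks for indexed access.
    class LazyString {
    public:
        explicit LazyString(std::string utf8) : utf8_(std::move(utf8)) {}

        // Sequential operations (concatenation, output, ...) can stay in UTF-8.
        const std::string& utf8() const { return utf8_; }

        // Indexed access forces a one-time inflation to UTF-16 code units;
        // after that, operator[] is O(1).
        char16_t operator[](size_t i) {
            if (utf16_.empty() && !utf8_.empty()) inflate();
            return utf16_[i];
        }

    private:
        void inflate() {
            // Simplified decoder: 1- to 3-byte sequences only, no validation,
            // and no surrogate pairs for code points above U+FFFF.
            for (size_t i = 0; i < utf8_.size();) {
                unsigned char b = utf8_[i];
                uint32_t cp;
                size_t len;
                if (b < 0x80)      { cp = b;        len = 1; }
                else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
                else               { cp = b & 0x0F; len = 3; }
                for (size_t k = 1; k < len; ++k)
                    cp = (cp << 6) | (utf8_[i + k] & 0x3F);
                utf16_.push_back(static_cast<char16_t>(cp));
                i += len;
            }
        }

        std::string utf8_;       // original encoding, kept as-is
        std::u16string utf16_;   // filled lazily on first indexed access
    };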

Indexing characters in a localized string is rarely useful to applications and often denotes a bug (did they want the N-th grapheme, glyph or code-point?). It's best to use higher-level primitives for collating, concatenating and splitting localized text.

Granted, a JavaScript interpreter must remain bug-by-bug compatible with existing code, thus precluding some of the most aggressive optimizations.

πŸ‘€codewizπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.
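For concreteness, the flag-based scheme being described might look roughly like this (a hypothetical sketch, not SpiderMonkey's actual code; names are made up, BMP only, no validation):

    #include <cstddef>
    #include <string>

    struct Utf8String {
        std::string bytes;   // UTF-8 storage
        bool allAscii;       // computed once when the string is created

        // charAt over UTF-16 code units: O(1) only when every byte is ASCII.
        char16_t charAt(size_t i) const {
            if (allAscii)
                return static_cast<char16_t>(bytes[i]);  // byte == code unit
            return charAtSlow(i);  // otherwise scan (or inflate to TwoByte)
        }

        // Slow path: walk the UTF-8 lead bytes to find the i-th character.
        char16_t charAtSlow(size_t i) const {
            size_t pos = 0;
            while (i--) {  // skip i whole characters
                unsigned char b = bytes[pos];
                pos += (b < 0x80) ? 1 : (b < 0xE0) ? 2 : 3;
            }
            unsigned char b = bytes[pos];
            if (b < 0x80) return b;
            if (b < 0xE0) return char16_t(((b & 0x1F) << 6) |
                                          (bytes[pos + 1] & 0x3F));
            return char16_t(((b & 0x0F) << 12) |
                            ((bytes[pos + 1] & 0x3F) << 6) |
                            (bytes[pos + 2] & 0x3F));
        }
    };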

Perhaps I'm missing something (quite likely, as I am certainly no expert when it comes to Unicode), but I was under the impression that this would already have to be the case, since UTF-16 is also variable length.

πŸ‘€tolmaskyπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

My personal ideal representation of strings.

A modified rope-like tree, where each node stores a flat array of characters, with the caveat that all characters within a node must share one encoding, which the node records. Yes, this means a single string can contain multiple encodings. The internal "default" representation is a flat array in a (modified) UTF-8 encoding with a fixed number of bytes per character (stored as part of the encoding), using overlong encodings where necessary. So if you change a character in a node of two-byte characters into one that only needs a single byte, you don't have to regenerate the node if doing so isn't worthwhile.
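A sketch of that node layout might look like this (hypothetical names, C++ only for illustration; the interesting part is that writing a narrower character into a wider node just re-encodes it overlong in place):

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    struct RopeNode {
        // Leaf payload: raw bytes plus how wide each stored character is.
        std::vector<uint8_t> bytes;
        uint8_t bytesPerChar = 1;   // 1, 2 or 3 ((modified) UTF-8 widths)

        // Interior nodes concatenate children; leaves keep this empty.
        std::vector<std::unique_ptr<RopeNode>> children;

        size_t length() const {
            if (children.empty())
                return bytes.size() / bytesPerChar;
            size_t n = 0;
            for (const auto& c : children) n += c->length();
            return n;
        }

        // Overwrite character i with code point cp, re-encoded at this node's
        // fixed width (overlong if cp would normally need fewer bytes), so the
        // node never has to be regenerated.
        void setChar(size_t i, uint32_t cp) {
            uint8_t* p = &bytes[i * bytesPerChar];
            switch (bytesPerChar) {
                case 1: p[0] = uint8_t(cp); break;
                case 2: p[0] = uint8_t(0xC0 | (cp >> 6));
                        p[1] = uint8_t(0x80 | (cp & 0x3F)); break;
                case 3: p[0] = uint8_t(0xE0 | (cp >> 12));
                        p[1] = uint8_t(0x80 | ((cp >> 6) & 0x3F));
                        p[2] = uint8_t(0x80 | (cp & 0x3F)); break;
            }
        }
    };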

πŸ‘€TheLoneWolflingπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

For every JS string we allocate a small, fixed-size structure (JSString) on the gc-heap. Short strings can store their characters inline (see the Inline strings section below), longer strings contain a pointer to characters stored on the malloc-heap.

I wonder what the reason is for this roundabout way of doing it - couldn't the whole string be stored as a variable-length block (a header with the length, then the content bytes), all on one heap?

Incidentally, this is also one of the things I think is broken about the malloc() interface: there is no portable way to get the size of an allocated block given only a pointer to it, despite that information being available somewhere - free() has to know, after all. So to do it the "correct, portable" way you end up essentially duplicating that length somewhere else. The fact that people are being told it's not something they need to know (e.g. http://stackoverflow.com/questions/5451104/how-to-get-memory... ) doesn't help either.

I've written a "nonportable" (in reality, all that would be needed is to change the function that gets the length from the block header) string implementation that uses this technique, and it definitely works well.
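For illustration, the single-allocation layout looks something like this (a sketch of the idea only, not that implementation; error handling omitted):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // One block: a header carrying the length, followed directly by the bytes.
    struct StrHeader {
        size_t length;
        // character data follows immediately after the header
    };

    char* str_new(const char* src, size_t len) {
        StrHeader* h = static_cast<StrHeader*>(
            std::malloc(sizeof(StrHeader) + len + 1));
        h->length = len;
        char* data = reinterpret_cast<char*>(h + 1);
        std::memcpy(data, src, len);
        data[len] = '\0';
        return data;   // callers see a plain char*, as with malloc/free
    }

    size_t str_length(const char* s) {
        // In a malloc-based variant this is the one "nonportable" call
        // (asking the allocator for the block size); with our own header
        // we simply read it back.
        return reinterpret_cast<const StrHeader*>(s)[-1].length;
    }

    void str_free(char* s) {
        std::free(reinterpret_cast<StrHeader*>(s) - 1);
    }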

Some operations like eval currently inflate Latin1 to a temporary TwoByte buffer, because the parser still works on TwoByte strings and making it work on Latin1 would be a pretty big change.

I haven't looked at their code, but if the parser expects the whole string to be available and accesses it randomly, it would certainly be a big rewrite; otherwise, if it's more like getchar(), it wouldn't be so hard to have a function expand each character from the source string as the parser consumes it.
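Something like this getchar()-style cursor, say (hypothetical names, just a sketch of the idea): the parser pulls one TwoByte code unit at a time while the source stays Latin1 underneath, since each Latin1 byte maps to exactly one UTF-16 code unit.

    #include <cstddef>
    #include <string>

    class SourceCursor {
    public:
        explicit SourceCursor(const std::string& latin1) : src_(latin1) {}

        static constexpr char16_t kEOF = 0xFFFF;  // sentinel; not a Latin1 value

        // getchar()-like: return the next code unit, inflated on the fly,
        // so no temporary TwoByte copy of the whole source is needed.
        char16_t next() {
            if (pos_ >= src_.size()) return kEOF;
            return static_cast<char16_t>(
                static_cast<unsigned char>(src_[pos_++]));
        }

    private:
        const std::string& src_;
        size_t pos_ = 0;
    };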

The main goal was saving memory, but Latin1 strings also improved performance on several benchmarks.

With modern processors having multilevel cache hierarchies and increasing memory latencies, smaller is almost always faster - it's well worth spending a few extra cycles in the core to avoid the few hundred cycles (or more) of a cache miss.

πŸ‘€userbinatorπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Is this similar to the Flexible String Representation[0] in Python 3.3?

[0] http://legacy.python.org/dev/peps/pep-0393/

πŸ‘€ch0wnπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

I know this will probably get downvoted into oblivion, but is string space really an issue in the browser?

For example, this page at the time of this post has 68k of HTML, so 68k of text. Let's assume it's all ASCII but we store it at 32 bits per character, so it's 4x in memory, or about 270k. Checking Chrome's task manager, this page is using 42 MB of RAM. Do I really care about 68k vs 270k when something else is using 42 MB of RAM for this page? Firefox is at a similar level.

Why optimize this? It seems like the wrong thing to optimize, especially given the complexity it adds to the code.

πŸ‘€greggmanπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Why not compare with other browsers like Chrome, etc.?
πŸ‘€hexleoπŸ•‘11yπŸ”Ό0πŸ—¨οΈ0

(Replying to PARENT post)

Latin1? I hoped it would die some day.
πŸ‘€thomersch_πŸ•‘11yπŸ”Ό0πŸ—¨οΈ0