This involved traversing the HTML DOM tree, very much like the high-performance approach described in "Java: Iterating over XML DOM Nodes", then using the .textContent || .innerHTML properties to further analyze - you guessed it - the text content.
These properties are a bit of an issue themselves. .textContent is "standards-compliant", defined by the W3C. As Microsoft admits on their MSDN page, .innerText has "no public standard that applies to this property".
To make matters worse, many developers seem to find obscure ways of deciding which property to use, often through browser detection or similarly flawed methods such as checking for other browser properties (e.g. document.all) that aren't even related to the issue.
Using ".textContent || .innerText" is the best fix. If .textContent returns a value that can be evaluated to true, it will be used as the result. Otherwise, .innerText will be returned. 0, -0, null, false, NaN, undefined, or the empty string ("") all evaluate to false. In Internet Explorer, where .textContent is undefined, .innerText is used.
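As a minimal sketch of that fallback: the helper name getText is hypothetical, and the trailing "" is an added assumption so that callers always get a string back even when neither property yields any text.

```javascript
// Hypothetical helper (not a built-in): return a node's text using whichever
// property the browser provides.
function getText(node) {
  // `||` yields its left operand when that operand is truthy, otherwise the
  // right one. An empty-string textContent is falsy, so it also falls through
  // to innerText; the final "" (an assumption of this sketch) keeps the
  // result a string when both properties are missing or empty.
  return node.textContent || node.innerText || "";
}
```

No browser detection is involved: the expression only looks at what the node itself exposes.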
An almost equal performance issue with both .innerText and .innerHTML quickly became apparent under Internet Explorer. Basic testing showed that iterative use of these properties took on the order of 100x longer than the .textContent / .innerHTML properties under Firefox. (Firefox seems to work almost instantaneously, so increasing the length of the iteration quickly widens the difference. Additionally, these properties seem to be internally cached by Internet Explorer, such that calling the property repeatedly on the same node returns almost immediately after the initial call - important for testing this.)
My test page generates the specified number of <span> elements within a <div>, then times itself as it calls the selected property on each node.
On my machine, Firefox consistently completes any of the function selections in less than 500 ms (1/2 second) for 100,000 items. Internet Explorer consistently takes more than 30,000 ms (30 seconds), except for the "null" selection (a dummy function that simply returns null), which performs as fast as Firefox.
(As much as I'd like to include this as an attached page, Blogger doesn't currently support non-image file attachments.)
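For anyone wanting to reproduce the measurement, something along these lines should behave like the test described above. It is a reconstruction under assumptions, not the original page: the names buildSpans, timeAccess and accessors are hypothetical, and the document object is passed in as a parameter so the sketch doesn't depend on any particular page.

```javascript
// Build a <div> holding `count` <span> elements, each with a short text node.
// `doc` is any object exposing createElement/createTextNode (normally `document`).
function buildSpans(doc, count) {
  var div = doc.createElement("div");
  for (var i = 0; i < count; i++) {
    var span = doc.createElement("span");
    span.appendChild(doc.createTextNode("item " + i));
    div.appendChild(span);
  }
  return div;
}

// Call `accessor` once on every direct child of `container` and return the
// elapsed time in milliseconds.
function timeAccess(container, accessor) {
  var start = new Date().getTime();
  for (var node = container.firstChild; node; node = node.nextSibling) {
    accessor(node);
  }
  return new Date().getTime() - start;
}

// The selectable functions; "none" mirrors the dummy that simply returns null.
var accessors = {
  textContent: function (node) { return node.textContent; },
  innerText: function (node) { return node.innerText; },
  innerHTML: function (node) { return node.innerHTML; },
  none: function (node) { return null; }
};
```

In a browser, timeAccess(buildSpans(document, 100000), accessors.innerText) would give the elapsed milliseconds for one selection. Because of the caching noted above, it is probably worth building a fresh container for each run, and appending it to document.body first may be closer to the original setup.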
The only work-around I know of at the moment is to "write your own" method to retrieve the text content. Reading .nodeValue may work for text nodes. The issues are making sure a text node is actually being referenced and, if needed, concatenating the values from all the children (and their children...) to match the functionality of the above built-in properties.
Fortunately, the controlled data I'm working with doesn't need any of that, so a simple call to .nodeValue is working... for now.
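If the data ever stops being that simple, a "write your own" method could look roughly like the following. It assumes only the standard nodeType, nodeValue, firstChild and nextSibling members; the name collectText is hypothetical, and whether it actually beats the built-in properties in Internet Explorer would need to be measured.

```javascript
var TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM

// Hypothetical helper: concatenate the nodeValue of every text node found
// under `node`, in document order.
function collectText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  var text = "";
  // Walk the children (and, through recursion, their children...).
  for (var child = node.firstChild; child; child = child.nextSibling) {
    text += collectText(child);
  }
  return text;
}
```

Non-text nodes without children, such as comments, contribute nothing, so only text-node values end up in the result.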