by Marcus Tucker
Preface
On the forums I participate in (and at work) I often see ASP code making extensive
use of string concatenation, and although I want to point out why it's bad and how
to improve it, I usually bite my tongue and say nothing, because covering the topic
in sufficient depth would take more time than I want to spend! It is disappointing
that so many developers are unaware of the performance implications of doing so.
In other cases where inefficient techniques are used I would normally provide a link
to a relevant article, but I haven't yet come across one that does this particular
subject justice, which is why I thought I'd write one myself!
The Problem
Let’s start with some simple (but handy) code to manipulate delimited lists:
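(The routine below is a minimal sketch of the kind of code in question – the exact signature and delimiter handling are assumptions for illustration; the key point is the native concatenation used to append each value.)

' Appends a value to a delimited list held in sList.
' Signature and delimiter handling are illustrative assumptions.
Sub AddToDelimitedList(ByRef sList, ByVal sValue, ByVal sDelimiter)
    If Len(sList) > 0 Then sList = sList & sDelimiter
    sList = sList & sValue
End Sub

' Typical usage – called once per item, often inside a loop:
Dim sCSV
sCSV = ""
AddToDelimitedList sCSV, "apples", ","
AddToDelimitedList sCSV, "oranges", ","
' sCSV now contains "apples,oranges"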
Now, as you can see, the AddToDelimitedList() sub uses VBScript string concatenation
to tack each value onto the end of the delimited list, and in practice it would be
called many times to add items to the list.
During development (and perhaps testing) this seems to perform fine, but that's
because there are only a few items in the list – only test data is being used. Once
the code is used in a production environment, it can easily cause a major performance
problem, because VBScript's string concatenation has an N-squared cost:
the time taken to append each additional substring is proportional to the length of
the entire string built so far, so the total time taken to perform N concatenations
grows quadratically and quickly comes to dwarf the execution time for the rest of
the code in the script. More on this later.
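To put rough numbers on the N-squared claim: if each append has to copy the string built so far, then appending N substrings of length L copies approximately

    L × (1 + 2 + ... + N) = L × N(N + 1) / 2

characters in total. For example, 2,000 appends of 100-character substrings copy on the order of 200 million characters, even though the finished string is only 200,000 characters long.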
Of course, the delimited string manipulation code above is merely one example
of where you might need to concatenate strings, but it’s an interesting example
because it's all too easy to call code like this from within a loop without realising
that it could become a major bottleneck in your application’s performance once
the usage and/or volume of data increases significantly. Testing during development
rarely accurately simulates real-world usage, so performance hits like this may only become apparent
once the system is in place… and then it can often be too late to fix!
Therefore, you *must* keep an eye
out for hidden gotchas like this throughout the development process. So what
can you do about it? Read on…
Solutions
Years ago when I first became aware of the problem I rolled my own VBScript
class which used arrays and the Join()
function to provide huge efficiency improvements over the native concatenation
performance – a technique I had first seen in a tutorial somewhere. It was a great
improvement, but it had a problem of its own: the need to continually ReDim the
class's internal array, which became the new bottleneck. I later refined the
technique by doubling the size of the array with each reallocation, which makes
reallocations progressively rarer (at the expense of greater memory use, but
that's not usually a problem on a web server) and therefore improves performance
considerably. This has since been my preferred technique
for concatenating strings together, and the one that I have recommended
to others in my capacity as
Sitepoint’s
"ASP Guru", a position I have held for the last couple of years.
However, with the advent of ADO v2.5, some new and interesting objects became
available, most significantly the Stream object, which
replaces the antiquated (and extremely inefficient) file reading and writing
functionality provided by the FSO (FileSystemObject), and adds native binary
support (as well as the ability to convert between different data types and
character sets).
I use the Stream object all the time for manipulation of bulk binary data
and text, since it can easily be used as an in-memory
buffer (in the same way that a disconnected Recordset object can be used
as a simple in-memory flat file database), but I hadn’t thought to compare
it to my array-based string concatenation class to see if it offered any
performance improvements. I therefore wrote a new class to encapsulate the functionality, and
a benchmark script to compare the three approaches – native VBScript concatenation, array joining,
and the stream object. The results of these tests form the basis for the
rest of this article. I should also mention that there are other solutions in the form
of 3rd party freeware COM+/ActiveX components
(such as Michael Balloni’s StrCat.Catter),
but as we all know, getting components installed on shared
hosting space is nigh on impossible (for good reasons, but that’s another
story), so it’s probably only a viable option for dedicated servers that are yours to do
with as you please. And besides, from a portability perspective, it’s usually
wise to leverage the standard toolset (ADO in this case) as much as possible rather
than to use non-standard/custom solutions.
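For those unfamiliar with it, here is a minimal sketch of using ADODB.Stream as an in-memory text buffer – the essential calls that any such wrapper class would build on (error handling omitted):

Dim oStream, sResult
Set oStream = Server.CreateObject("ADODB.Stream")
oStream.Type = 2               ' adTypeText – treat the stream as text rather than binary
oStream.Open                   ' no source argument, so it is purely an in-memory buffer

' Append substrings; the stream manages its own internal buffer
oStream.WriteText "first chunk"
oStream.WriteText "second chunk"

' Rewind and read the whole lot back as a single string
oStream.Position = 0
sResult = oStream.ReadText(-1) ' -1 = adReadAll
oStream.Close
Set oStream = Nothing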
Results
All testing was performed on a spare PC dedicated to running IIS on Windows
2000 Server, connected to my desktop PC via 100Mb Ethernet. Further details of the
machines are irrelevant to these tests, since it is the relative performance within
each set of results that matters, but be assured that the machine acting as the web
server was not modified in any way during the time it took to perform all the tests.
To start with, here are the results of the benchmarks with a substring length
of 100 characters, going up to a maximum of 2000 iterations (i.e. for every
iteration, a 100-character string was appended to the end of the string being
built):
Table A – Substring length: 100 (all times in ms)
Iterations | Native | Array | ADO.Stream
200 | 0.0 | 0.0 | 0.0
400 | 15.6 | 0.0 | 0.0
600 | 15.6 | 0.0 | 0.0
800 | 78.1 | 0.0 | 15.6
1000 | 218.8 | 0.0 | 15.6
1200 | 406.3 | 0.0 | 31.3
1400 | 468.8 | 15.6 | 31.3
1600 | 625.0 | 15.6 | 31.3
1800 | 921.9 | 15.6 | 31.3
2000 | 1,265.6 | 15.6 | 31.3
Note that the resolution of the PC clock is approx. +/- 15.6 milliseconds, so timings
of the same order of magnitude have a high inaccuracy (i.e. accuracy is poor for
times below approx. 150ms).
The results in Table A clearly demonstrate that native VBScript concatenation is
extremely inefficient, taking 1.3 seconds to do what the array method can do in
approximately 16 milliseconds! The stream method takes a similar amount of time to
the array method, but given the inaccuracy of the measurements at such short timings,
it would be unwise to infer any more from these results.
Table B – Substring length: 100 (all times in ms)
Iterations | Native | Array | ADO.Stream
2000 | 1,437.5 | 15.6 | 46.9
4000 | 6,796.9 | 31.3 | 62.5
8000 | – | 93.8 | 125.0
10000 | 43,515.6 | 109.4 | 171.9
20000 | – | 250.0 | 328.1
30000 | – | 359.4 | 468.8
40000 | – | 531.3 | 640.6
50000 | – | 625.0 | 796.9
100000 | – | 1,171.9 | 1,562.5
Note that because of the long execution times that native VBScript concatenation
incurs, my benchmark script detects step execution times over 2 seconds and skips
subsequent timings for that method. Note also that the table displays a couple of
timings over 2 seconds and has odd increments of iterations – this is because I
have appended results from another run with a different iteration range (but with
the substring size kept the same so that the comparison remains fair).
By increasing the iterations, the picture becomes clearer – Table B shows that the
native method takes a ludicrous 43.5 seconds for 10,000 iterations, while the array
method is roughly 400x quicker, taking a mere 110ms to do the same, with the stream
method close behind at 172ms.
However, there is another important factor to consider – the size of the substring
that is being appended each time. So far all the tests have used a substring
100 characters long – what happens if it’s smaller?
Table C – Substring length: 10 (all times in ms)
Iterations | Native | Array | ADO.Stream
1000 | 0.0 | 0.0 | 15.6
2000 | 15.6 | 15.6 | 31.3
3000 | 31.3 | 15.6 | 62.5
4000 | 46.9 | 31.3 | 62.5
5000 | 93.8 | 46.9 | 93.8
6000 | 109.4 | 46.9 | 109.4
7000 | 312.5 | 62.5 | 109.4
8000 | 812.5 | 78.1 | 125.0
9000 | 1,281.3 | 93.8 | 156.3
10000 | 1,312.5 | 93.8 | 171.9
Table C shows that the native method performs better with smaller substrings –
10,000 iterations in this test took 1.3 seconds, compared to 43.5 seconds with a
substring of 100 characters (see the previous table). This is because the main cause
of the method's performance hit is the overhead of copying around large blocks of
memory, so the smaller the substring, the less memory is used and copied, and the
faster the run. The array method is still faster than the stream method – in fact,
it is now almost twice as fast. This is most likely because the mere act of calling
a method on a COM object incurs overhead, and at small substring sizes that overhead
becomes proportionally significant compared to the cost of actually storing the
10-character substring once the COM method has been invoked.
Table D – Substring length: 2000 (all times in ms)
Iterations | Native | Array | ADO.Stream
1000 | 8,687.5 | 46.9 | 15.6
2000 | – | 93.8 | 31.3
3000 | – | 140.6 | 46.9
4000 | – | 171.9 | 62.5
5000 | – | 218.8 | 78.1
6000 | – | 265.6 | 93.8
7000 | – | 296.9 | 109.4
8000 | – | 343.8 | 125.0
9000 | – | 406.3 | 140.6
10000 | – | 437.5 | 156.3
When the substring length is increased, the tables are turned – Table D shows that
the stream-based method takes roughly a third of the time of the array method to
build the string. And lest we forget, the array method is approx. 180x faster than
native concatenation, and the stream method approx. 540x faster! Some quick math
shows that the final string is 20 million characters long – which, to be fair, is
quite a large volume of data to manipulate – and while it's safe to say that it must
occupy at least 19MB of memory (one byte per character), each method stores it in a
different way, so it's impossible to say *exactly* how much memory is actually used
(without extensive further investigation – Unicode further complicates matters, for
example).
Regardless, what *is* clear is that with large substrings the continual ReDim’ing
of the array becomes more expensive than simply writing more characters
to the end of the in-memory stream. I suspect that the stream object uses
a linked list internally to store data across non-contiguous blocks of
dynamically allocated memory instead of reallocating memory as a single
block each time, since the latter case is of course the Achilles’ heel of the native
concatenation.
Conclusions
So, we’ve compared the three approaches, but what have we learned? Well,
if nothing else, it should be crystal clear that VBScript string concatenation
should be avoided at all costs! That said, it is safe to use native string concatenation
when you *know* that only a fixed number of substrings
will be concatenated together, such as when you're constructing a dynamic SQL
statement or assembling a line of HTML to be written directly to an output (file,
response, etc.). However, as a general rule of thumb, in all other cases –
particularly when the volume of source data (in size or number of items) is likely
to grow over time, when the number of records processed to generate the output is
under the end user's control (e.g. a selectable search results limit), or when you
are incrementally building output that isn't written straight to its destination
(Response object, disk, etc.) – you must give due consideration to this issue and
use a scalable string-building technique.
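For example, this sort of fixed, small-scale concatenation is perfectly fine with the native operator (illustrative snippet; lngCustomerID is assumed to hold a numeric ID):

' Fine: a known, small number of substrings joined once
Dim sSQL
sSQL = "SELECT OrderID, OrderDate FROM Orders " & _
       "WHERE CustomerID = " & lngCustomerID & " " & _
       "ORDER BY OrderDate DESC"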
And which technique should you use? As the results demonstrate, it all depends
on exactly *what* you are appending. For short substrings, the array-based
method has the edge, but as you increase the length of the
substrings, the stream-based method becomes the winner. Therefore, you should
choose your method depending on the nature and volume of the data that you are using.
To generalise, since most applications are unlikely to require the frequent
appending of thousands of substrings which are each thousands of characters
in length, the array method will probably do you just fine in 99% of cases.
Incidentally, if you are already using an array-based VBScript concatenation
class, check the code to make sure that it is NOT ReDimming the array each time a new substring
is added – that approach is significantly more expensive (as mentioned earlier),
so edit it to use the lazy length-doubling
method, or simply switch to using my code (see end).
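The anti-pattern to look for is a ReDim Preserve on every single append, along these lines (illustrative, using the same member names as the sketch earlier):

' Inefficient: grows the array by a single slot on every call, so the
' whole array gets copied every time a substring is appended
Public Sub Append(sText)
    ReDim Preserve m_items(m_count)
    m_items(m_count) = sText
    m_count = m_count + 1
End Sub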
As well as ensuring that you are now fully aware of the performance issues
associated with using native string concatenation, hopefully this article has
also made you realise that you should try to test your code with real-world
data, and to examine alleged performance-enhancing techniques carefully, since
whether or not you get a significant gain (or perhaps even a decrease in performance)
all depends on the precise circumstances in which you use a given technique – in particular,
what data you are passing through it, and how often it gets called.
Finally, the code for the benchmark and concatenation classes is available here:
StringConcatenation.zip (2 KB),
and you are free to use the code in your own work. I ask only that you leave
my credits in place. I hope you've found this an interesting read; it's always nice
to know whether advice has paid off for people, so let me have your feedback,
particularly if you've implemented changes as a result which have made a significant
difference to the performance of your scripts / web applications.
Additional Information
An example of poorly performing code using the native method – CodingForums
The internal workings of VB/VBScript native string concatenation – Microsoft MSDN