More Recursion–Removing Multiple Spaces

My last post about recursive CTEs reminded me of a recursive SQL Server function I wrote a while back for removing multiple spaces from strings.  By removing multiple spaces, I mean replacing all consecutive repeating spaces with a single space.

This function was originally created as part of a cleansing process for incoming customer data, especially address related fields. It worked very well, and I thought it performed well at the time, but as you’ll see, it was actually pretty slow compared to other options.

To solve this problem, the less experienced SQL developer might ask, why not just use the T-SQL REPLACE function? After all, the REPLACE function replaces all of the instances of once set of characters with another set of characters within a given string, right?

If you are only replacing double-spaces with single-spaces, a single REPLACE works just fine. However, if there are longer sequences of spaces in the string, REPLACE doesn’t finish the job unless you apply it multiple times. For example, a sequence of 4 spaces would be replaced with 2 single-spaces next to each other, which is a new double-space. In order to remove all the the multiple spaces, REPLACE needs to be run multiple times, which is how the function below was born.

CREATE FUNCTION dbo.RemoveMultipleSpaces (@str NVARCHAR(MAX))
	RETURNS NVARCHAR(MAX)
AS
BEGIN
	IF CHARINDEX('  ', @str) = 0
		RETURN @str;

	RETURN dbo.RemoveMultipleSpaces(REPLACE(@str, '  ', ' '));
END
GO 

Here’s an example of calling REPLACE on an address string vs. calling the above function:

DECLARE @s VARCHAR(MAX) = '221B    BAKER       STREET';
SELECT REPLACE(RTRIM(@s), '  ', ' ') AS StillSpaced;
SELECT dbo.RemoveMultipleSpaces(@s) AS AllGood;

And the results:

An iterative version of this same function is straight-forward as well, though not quite as much fun to talk about…

CREATE FUNCTION dbo.RemoveMultipleSpacesItr(@str NVARCHAR(MAX))
	RETURNS NVARCHAR(MAX)
AS
BEGIN
	WHILE CHARINDEX('  ', @str) > 0
		SET @str = REPLACE(@str, '  ', ' ');

	RETURN @str;
END
GO

 

A Third Option

Lastly, I ran across a very interesting post by Jeff Moden on SQL Server Central from a couple of months ago. It describes a set based solution to this problem. Basically, it involves a series of replaces, first with an “unlikely character” then with a single space. Check it out the link for the details; a snippet of the code is below:

DECLARE @s VARCHAR(MAX) = '221B    BAKER       STREET';
SELECT REPLACE(
        REPLACE(
            REPLACE(
                LTRIM(RTRIM(@s))
            ,'  ',' '+CHAR(7))
        ,CHAR(7)+' ','')
    ,CHAR(7),'') AS CleanString ;

That code can be used as is, or it could be put in a scalar or inline table valued function, like so:

CREATE FUNCTION dbo.RemoveMultipleSpaces3Scalar (@str NVARCHAR(MAX))
	RETURNS VARCHAR(MAX)
AS
BEGIN
	DECLARE @str1 VARCHAR(MAX);

	SELECT @str1 = REPLACE(
				REPLACE(
					REPLACE(
						LTRIM(RTRIM(@str))
					,'  ',' '+CHAR(7))
				,CHAR(7)+' ','')
			,CHAR(7),'') ;

	RETURN @str1;
END
GO

CREATE FUNCTION RemoveMultipleSpaces3TVF(@str NVARCHAR(MAX))
RETURNS TABLE
AS
  RETURN (
   SELECT REPLACE(
            REPLACE(
                REPLACE(
                    LTRIM(RTRIM(@str))
                ,'  ',' '+CHAR(7))
            ,CHAR(7)+' ','')
        ,CHAR(7),'') AS CleanString
  );
GO

The scalar function would be called like the others, but the table valued function can be called by using a CROSS APPLY, for example:

SELECT *
FROM Customer a
CROSS APPLY dbo.RemoveMultipleSpaces3TVF(a.Address1)

 

A Word About Performance

Even though each of these pieces of code accomplishes the same task, the performance is very different. To test, I took the address record we’ve been using (’221B     BAKER       STREET’) and inserted it into a table 1,000,000 times. Then I tested each of the functions by calling them against that table and inserting the results into a second table. Each was run a few times to be sure there weren’t any delays due to warming the cache or compilation, etc. The results are shown below (listed in the order discussed in this post):

Not too surprisingly, the non-set based solutions are the worst performers, with the recursive one bringing up the rear. The set-based solution outside of a function, is smokin’ fast relatively speaking – it’s not much slower than the control.

The performace difference looks like a lot, and it is, but that doesn’t always matter. Like I mentioned earlier, in the customer data loading process I was working on, the recursive function was used and worked fine. Looking back, that’s probably because the loads did many other things as well, so slower performance at this step wasn’t noticable in the big picture.

If I were to repeat that project today though, I would take a hard look at the set based option. Because every little bit counts.

And because I’m a geek :)

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Notify me of followup comments via e-mail. You can also subscribe without commenting.