SQL and Database: Splitting a column into individual words?

I’m pretty sure I’ve seen them used for analysis of a single column. Like an export of an event log - how many events occurred on each date - or a list of search hits (where the original search term contained wildcards) - how many hits of each exact permutation. That’s why it occurred to me as an approach for what I thought you were looking for. But like I said, I’m not an Excel expert.

On second thought, though, if you’re going to be running the search or whatever repeatedly over time, recreating the table would probably not be the best solution.

You can use the code button in the edit toolbar to make code blocks, or you can just bracket code blocks with three backticks above and below:

```
like this - code goes here
blah blah blah
```

I finally had some time to think about it this afternoon and came up with some stuff I had squirreled away in my toolbox. I can’t take credit for this other than putting it together, since I stole the components off the internet. The only one I’m sure I got things from is Jeff Moden, who posts lots of stuff about tally tables. The actual origins of the string split stuff are lost.

You’ll have to do some replacement on your own. Anywhere it says “< table >”, replace that with your table name. Anywhere it says “< column >”, replace that with your column name.

```
set rowcount 12000
set nocount on

select identity (int, 1, 1) bob
  into #tally
  from master.sys.all_columns c
 cross join master.sys.all_columns c1

select substring(< column >, bob, charindex(' ', < column > + ' ', bob) - bob) as word
  into #temptable
  from #tally, < table >
 where bob <= convert(int, len(< column >))
   and substring(' ' + < column >, bob, 1) = ' '
```
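To make the substitution concrete: with a hypothetical table called EventLog and a column called Notes, the second query would read:

```
select substring(Notes, bob, charindex(' ', Notes + ' ', bob) - bob) as word
  into #temptable
  from #tally, EventLog
 where bob <= convert(int, len(Notes))
   and substring(' ' + Notes, bob, 1) = ' '
```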

It seems to work pretty quickly and gave me the same results as the function in the dummy table I set up.

Thanks! I’m going to have to try that tomorrow if they let me get away from QA testing. I’d much rather be playing with reports and analysis, but the QA work has to get done. :frowning:

Brazil, your code worked like a dream. I just added a column so I could join with another table, and voila! I have my analysis by frequency of word count.

Thanks again!

Edit: Oh, and I remembered to drop the temp tables at the end. :wink:

You probably already know this, but when you create a table that is prefixed with a hash (#) as in “#temptable”, SQL understands that it’s temporary and will automatically drop it when you leave your session. IMHO, whether it’s an actual temp table or a real table that you’ve created just for messing around, it’s simply good practice to do your housekeeping and drop it before you quit, but sometimes people forget, which is the reason we usually encourage using a #temp table.

Also, if you create a table prefixed with two hashes, ##temptable, it’s a global temp table and can be shared between sessions.
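A quick illustration, with made-up table and column names:

```
create table #scratch (word varchar(50))    -- local temp: only this session sees it,
                                            -- and SQL drops it when the session ends
create table ##scratch (word varchar(50))   -- global temp: every session can see it

drop table #scratch     -- still good housekeeping to drop them yourself
drop table ##scratch
```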

Glad the script worked. It’s gratifying to learn that I haven’t been saving stuff for nothing.

Well, I got chewed out because I couldn’t explain the query and write it off the top of my head, even though I was upfront about where the query came from, and I’ve only been doing this for a couple of months.

I’ve been looking the stuff up, but since I’m still pretty new at this, I wanted to ask about the cross join on master to itself. I understand that the master database holds the system-level information, and that the select identity portion is creating an automatically incrementing number, but I’m not sure what the cross join is doing.

bob is just the name that the creator gave to the incrementing number, right?

I’m late to the party here, but red alert! red alert!

Just because it’s “all in SQL” doesn’t mean that you can’t/won’t bring the instance to its knees with a bad query (I, uh, have a friend who might have done this). SQL Server Dev Edition costs $60 if bought direct from MS; you can get it cheaper elsewhere. Hell, depending on what features you’re using, even Express (totally free) might be enough. There’s no reason why there shouldn’t be a non-production instance somewhere for you to experiment & test against, with the same schema you have in production - even if the data on it is a month or two out of date.

Well, the query that I used ended up taking about a minute and a half to run. The original query limited the row count, so it didn’t take too long to run. I do have access to a non-production instance now that I’ll use from now on for my experiments.

Sorry you got chewed on. Welcome to the wonderful world of the DBA!

A cross join is an unconditional match between each row of one table to each row of another table - in this case, the same table. Tell them it returns a Cartesian product. That’ll impress the hell out of them. As we’re using it here, we don’t even care what the data is; we just want a table big enough to give us 12000 matches so we can populate the tally table, and the identity does that for us.
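If you want to see a Cartesian product in miniature, try something like this (the values are just made up):

```
-- 3 rows cross joined to 2 rows = 3 * 2 = 6 rows:
-- every row of one set paired with every row of the other
select a.x, b.y
  from (values (1), (2), (3)) a(x)
 cross join (values (10), (20)) b(y)
```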

There are lots of ways to generate things like this and I picked the one I use most. To be honest, 12000 is undoubtedly overkill, but since I didn’t know how long the strings you were splitting would be, I just wanted to make sure it would be plenty big. If your strings are never longer than 128 characters, you would probably be safe with a tally table of 128. The name “bob” is indeed just a name; I named it after a co-worker. I was going to call it “florian”, but I was afraid it would identify me as a PBS viewer.

If it makes them queasy knowing that you suspect there’s an object named “all_columns”, or even that there’s a master database, you could generate the tally table this way:

```
with things(florian) as
(select 1 as florian
 union all
 select florian + 1 from things where florian < 12000
)
select florian into #tally from things option(maxrecursion 12000)
```

That’s one alternate way among many to do it. It uses a CTE (Cretaceous Transcendental Effulgence or something like that). CTEs are fun.

I won’t bore you with all the horror stories of things people do to lock up our development servers or the nightmares I have thinking that they might try to do the same things on production if we gave them the chance. I’m glad they gave you somewhere else to do your learning.

CTEs (Common Table Expressions) can also cause (or do nothing to relieve) major performance problems (just like damn near everything else in SQL) if they’re referenced multiple times in the same query. They’re convenient to use as aliases to simplify queries, but I’ve had a couple situations recently where they did absolutely nothing to improve my query’s execution (I had a correlated subquery referenced twice in a single query, and tried extracting it to a CTE).

OTOH, taking the resultset of that subquery, dumping it to a temp table, and joining to that temp table in the following query resulted in a massive improvement (we’re talking 8 minutes down to 90 seconds, and a huge reduction in I/O counts & a much improved execution plan).
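Roughly what that looked like, with made-up table and column names:

```
-- run the expensive aggregation once and materialize it...
select key_col, sum(amount) as total
  into #subtotals
  from detail_table
 group by key_col

-- ...then join to the temp table instead of re-evaluating the
-- subquery (or CTE) for every row of the outer query
select h.order_id, h.key_col, s.total
  from header_table h
  join #subtotals s
    on s.key_col = h.key_col
```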

This is possibly the most useful thread we’ve ever had in the history of the UI subforum :smiley: I don’t even speak SQL and it’s fascinating!

FYI, the 12000 rowcount was not nearly big enough, since the table I was pulling from had many thousands of records with probably an average of four words per record. I ended up setting the row count to 0 in order to get everything. I’m learning!

Hmmm. I have some explaining to do here. The term “rowcount” is misleading you. All we’re using it for is to populate the tally table. Now, forget about rows because it’s pretty meaningless from this point on…

The 12000 tally values are used to keep track of the positions in the string where each word begins. You didn’t say anything about how long the strings were going to be, and I picked 12000 because I couldn’t imagine that you would be checking a string longer than 12000 characters and figured that would be overkill. I totally nailed it (yay me) if all you have in a string is four words, unless you have some excruciatingly long words.

When you set rowcount = 0, you basically told it to populate the tally table with everything it could find. So if “master.sys.all_columns” has 6313 rows, it’s going to use a cross join to populate the tally table with 6313 * 6313 or 39,853,969 values. That’s the Cartesian product I was talking about, and I thought 12000 was overkill.
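If you’re curious, you can count that Cartesian product for yourself with something like this (it may take a moment):

```
-- count every pairing of all_columns with itself
select count_big(*) as cartesian_size
  from master.sys.all_columns c
 cross join master.sys.all_columns c1
```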

So, let’s throw away the first way I sent for building the tally table and use the other one so we don’t have the word “rowcount” throwing us a curve:

```
-- We're going to use this variable to tell how long the string
-- data is (previously, it was 12000)
declare @str_len int = 128;

with things(bob) as
(select 1 as bob
 union all
 select bob + 1 from things where bob < @str_len
)
select bob into #tally from things
-- maxrecursion has to be a literal rather than a variable, so keep
-- this number in step with @str_len by hand
option(maxrecursion 128)


-- That has populated the tally table with 128 rows and it is plenty big enough
-- unless we have a string that is longer than that. Now, we will use it to split
-- the string data. Do not forget to keep forgetting about rows.
-- Note that we are looking here at the "substring" of the string data. "bob" is
-- telling substring the character positions of each blank space in the string.
select substring(< column >, bob, charindex(' ', < column > + ' ', bob) - bob) as word
  into #temptable
  from #tally, < table >
 where bob <= convert(int, len(< column >))
   and substring(' ' + < column >, bob, 1) = ' '
```

It takes the string data of the entire table as a whole and splits each string. As dakboy said, “the query resulted in a massive improvement (we’re talking 8 minutes down to 90 seconds, and a huge reduction in I/O counts & a much improved execution plan).” That’s because it’s processing the data as a whole rather than one row at a time, and it doesn’t care how many rows are there. 1, 2, 12000, 15M. It doesn’t make any difference.

I hope that clears things up a bit better. From now on when I post some code, I’ll try to remember to explain it fully. If there’s anything that you’re not sure about, sing out.

Thanks for the explanation! I’m looking for a SQL class to take, but this really helps.

I did look up rowcount, and the definition I found was the number of rows at which the query stops returning results. When I did a count of the words that were returned, it equalled 12,000. When I set the rowcount to 0, it then got all of the words, which was more like 175,000. If that wasn’t the row count, then what else would have limited the returned data?

I started to reply that I didn’t know without having more details about your data, but then I had an idea, and I learned something from it. Turns out when you set rowcount, it persists through the rest of your session unless you change it. Why I didn’t already know that is the mystery.

If we had built the tally table and then reset rowcount to 0 before running the string search, it would have worked. I guess that’s it, the last thing I had to learn about SQL Server.
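In other words, one extra line in the original script would have saved us:

```
set rowcount 12000    -- limit the identity insert that builds #tally

select identity (int, 1, 1) bob
  into #tally
  from master.sys.all_columns c
 cross join master.sys.all_columns c1

set rowcount 0        -- reset it, or every later query in this session
                      -- silently stops after 12000 rows
```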

The same is true of other options like set statistics io and set statistics time.
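For instance (the query here is just a made-up example), these stick for the rest of the session the same way:

```
set statistics io on     -- stays on for the whole session

select count(*) from sometable   -- hypothetical query; IO stats now get reported

set statistics io off    -- turn it back off when you're done measuring
```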

Yes, it was a facepalm moment.

I’m glad there’s a reason that your explanation was correct and mine also worked. Nothing like a know-nothing newbie asking dumb questions to teach you stuff, huh? :wink:

It’s just irritating. There are a bunch of properties you can set for configuring your session. Rowcount, nocount, the statistics that Dak mentioned, identity_insert, a bunch of ANSI settings, … They go on and on, and I fumbled that one.

Coincidentally (or maybe the cosmos is trying to tell us something), I was alerted this morning to a post at simple-talk.com about Searching for Strings in SQL Server Databases. There’s some pretty good stuff there.