ABSTRACT
This article shows that adding a TOP clause to the SQL SELECT query may greatly increase its execution time. It also shows very effective solution to fix this problem.
INTRODUCTION
Lately I was conducting performance tests for a new project. I built sample database, filled it with data and began testing. The database schema is not important here. All you should know is that I had about 3M of records in main table [Articles] and about 11M of records in tables which I joined. I used this naive query at first:
SELECT
[Articles]
.Id
,
CountryCategories.Name
,
CityCategories
.Name
FROM[Articles]
INNER JOIN CategoryCountry2Articles
ON[Articles]
.Id = CategoryCountry2Articles
.IdArticle
INNER JOIN CountryCategories
ON CountryCategories.Id =
CategoryCountry2Articles
.IdCountry
INNER JOIN CategoryCity2Articles
ON[Articles]
.Id = CategoryCity2Articles
.IdArticle
INNER JOIN CityCategories
ON CityCategories.Id =
CategoryCity2Articles
.IdCity
WHERE CountryCategories.Name = 'country1'
AND CityCategories.Name =
'city4'
This query took 00:04 (seconds) to execute. It returned far too much records, so I thought it would be a good idea to use TOP N clause to limit them. After all, limiting the number of records to be returned should decrease execution time. And certainly it should not increase it. Query optimizer however is odd enough to refute my assumptions. By adding a TOP clause to aforementioned query:
SELECT TOP(5)
[Articles]
.Id,
(…)
query execution time jumped to 01:14! How is that possible?
EXPLANATION
Execution plans for both queries may help to explain it. For base query:
http://imageshack.us/a/img195/5361/sqltop1.png
and for query with TOP clause:
http://imageshack.us/a/img521/9523/sqltop2.png
The main difference between those two execution plans is that in first plan optimizer uses Hash Matches, while in second it uses Nested Loops. Both symbols generally describe the same operation (join). However Hash Matches are efficient with large sets, while Nested Loops only with small sets. This is why the latter query performs so badly. It is the query optimizer that makes wrong decision.
But let’s assume that for some reason you do not want to abandon the idea of using TOP clause here. There are two ways to induce query optimizer to take right decision.
SOLUTIONS
First solution is to use explicit hash joins in your query:
SELECT TOP 5
[Articles]
.Id
,
CountryCategories.Name
,
CityCategories
.Name
FROM[Articles]
INNER HASH JOIN CategoryCountry2Articles
ON[Articles]
.Id = CategoryCountry2Articles
.IdArticle
INNER HASH JOIN CountryCategories
ON CountryCategories.Id =
CategoryCountry2Articles
.IdCountry
INNER HASH JOIN CategoryCity2Articles
ON[Articles]
.Id = CategoryCity2Articles
.IdArticle
INNER HASH JOIN CityCategories
ON CityCategories.Id =
CategoryCity2Articles
.IdCity
WHERE CountryCategories.Name = 'country1'
AND CityCategories.Name =
'city4'
Execution time: 00:07. Execution plan:
http://imageshack.us/a/img812/2150/sqltop3.png
As you can see, this time Hash Matches are used.
Second and most effective solution is to use SQL variable:
DECLARE @topCount INT
SET @topCount = 5
SELECT TOP(@topCount)
(...)
Execution time: 00:00.2 (0,2 sec).
Quick look at the execution plan:
http://imageshack.us/a/img802/6742/sqltop4.png
It shows that query optimizer did not only choose to use Hash Matches, but it also shows high degree of parallelism in the query. This is why it is so incredibly efficient.
I am not able to explain why using variable pushes query optimizer towards those decisions. But in case you are struggling with the same problem as I did, try this solution. I hope it helps.
- Share This Post:
- Short Url: http://wblo.gs/dXS
posted @ Wednesday, January 30, 2013 10:27 PM