Code Highlighting

Friday, December 28, 2012

SELECTing the comedy horror genre

Whenever we get a new intern, I like to poke and prod him*, to see what he's made of (bones and gooey bits, usually). To test SQL proficiency, I use the following problem. I like it because it's something that comes up occasionally in projects, and is surprisingly complex, though it seems like it shouldn't be:

Suppose you've got a movie database. You've got a table with movies T_MOVIES:

CREATE TABLE [dbo].[T_MOVIES](
 [movie_key] [int] NOT NULL,
 [movie_name] [nvarchar](500) NOT NULL,
 CONSTRAINT [PK_T_MOVIES] PRIMARY KEY CLUSTERED 
(
 [movie_key] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

and there's also a table for movie genres T_GENRES:

CREATE TABLE [dbo].[T_GENRES](
 [genre_key] [int] NOT NULL,
 [genre_name] [nvarchar](500) NOT NULL,
 CONSTRAINT [PK_T_GENRES] PRIMARY KEY CLUSTERED 
(
 [genre_key] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

To link both tables there is a third table T_MOVIE_GENRES:

CREATE TABLE [dbo].[T_MOVIE_GENRES](
 [movie_key] [int] NOT NULL,
 [genre_key] [int] NOT NULL,
 CONSTRAINT [PK_T_MOVIE_GENRES] PRIMARY KEY CLUSTERED 
(
 [movie_key] ASC,
 [genre_key] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

Assume the proper foreign key constraints have been applied, and you have a pretty sensible lay-out. To select all movies within the horror genre you only have to join T_MOVIES to T_MOVIE_GENRES and filter down your genre key in the WHERE clause to whatever the horror genre is, suppose it's 5:

SELECT
 T_MOVIES.movie_key,
 T_MOVIES.movie_name
FROM
 T_MOVIES INNER JOIN
 T_MOVIE_GENRES ON
 T_MOVIES.movie_key = T_MOVIE_GENRES.movie_key
WHERE
 T_MOVIE_GENRES.genre_key = 5

The question now is: What if I want to filter by two genres? How do I find movies that are not only horror, but also comedy? What query will yield "Shaun of the Dead"?
More generally, how do we find those records in a table that have more than one matching record in another table, where those matching records have a field set to a set of specific values?

Obviously WHERE T_MOVIE_GENRES.genre_key IN (1,5) is going to yield movies that belong to either genre.
At this point usually the intern scratches his chin, gives it a bit of thought, and comes up with this:

SELECT
 T_MOVIES.movie_key,
 T_MOVIES.movie_name
FROM
 T_MOVIES INNER JOIN
 (
 SELECT  T_MOVIE_GENRES.movie_key
 FROM T_MOVIE_GENRES
 WHERE T_MOVIE_GENRES.genre_key = 5
 ) AS Q_HORROR
 ON T_MOVIES.movie_key = Q_HORROR.movie_key
  INNER JOIN
 (
 SELECT  T_MOVIE_GENRES.movie_key
 FROM T_MOVIE_GENRES
 WHERE T_MOVIE_GENRES.genre_key = 1
 ) AS Q_COMEDY
 ON T_MOVIES.movie_key = Q_COMEDY.movie_key

(Or some other solution involving a subquery)
Okay, fine, you found "Shaun of the Dead". But what if I'm actually in the mood for Zombieland, a comedy / horror / road movie? In fact, I want a stored procedure that can take an arbitrary number of genres, and filter by it. You can go the dynamic sql route, and build a string with an arbitrary number of subqueries. It'd be ugly and hard to maintain, but it would work.
The following shows what I think is the best solution though:

SELECT
 T_MOVIES.movie_key,
 T_MOVIES.movie_name
FROM
 T_MOVIES INNER JOIN
 T_MOVIE_GENRES
 ON T_MOVIES.movie_key = T_MOVIE_GENRES.movie_key
WHERE
 T_MOVIE_GENRES.genre_key in (1,5)
GROUP BY
 T_MOVIES.movie_key,
 T_MOVIES.movie_name
HAVING
 COUNT(*) = 2

To refactor this into forementioned stored procedure you need to add a table-typed variable, and use a little bit of dynamic sql - because SQL does not have anything like arrays. Something like this:

CREATE PROCEDURE GetMoviesByGenres
 @genre_keys nvarchar(2000)
AS
BEGIN
 DECLARE @genre_keys_table table(genre_key int)
 DECLARE @genre_count int
 
 INSERT INTO @genre_keys_table
   EXEC(N'SELECT DISTINCT genre_key FROM T_GENRES WHERE genre_key IN (' + @genre_keys + N')')
 
 SELECT @genre_count = COUNT(*) FROM @genre_keys_table
 
 SELECT
  T_MOVIES.movie_key,
  T_MOVIES.movie_name
 FROM
  T_MOVIES INNER JOIN
  T_MOVIE_GENRES
  ON T_MOVIES.movie_key = T_MOVIE_GENRES.movie_key
 WHERE
  T_MOVIE_GENRES.genre_key IN (
   SELECT genre_key FROM @genre_keys_table
  )
 GROUP BY
  T_MOVIES.movie_key,
  T_MOVIES.movie_name
 HAVING
  COUNT(*) = @genre_count
END

If it wasn't clear from the code; @genre_keys takes a comma-delimited list of keys. That leaves an obvious injection vulnerability, so I would advise to have your calling code take an array of int, and not a string.
So that pretty much concludes that. If any of you ever end up as an intern at Tabeoka, you can now impress me with your mad SQL skillz.

Menno




* "him" because we've only ever gotten male interns. I'm prety sure I could get in trouble for poking and prodding a girl intern anyway. Both with the law, and my wife.

No comments:

Post a Comment