We can all recognise good-quality database code: It doesn't break with every change in the server's configuration, or on upgrade. It isn't affected by concurrent usage, or high workload. In an extract from his forthcoming book, Alex explains just how to go about producing resilient T-SQL code that works, and carries on working.
The goal of defensive database programming is to produce resilient database code; in other words, code that does not contain bugs and is not susceptible to being broken by unexpected use cases, small modifications to the underlying database schema, changes in SQL Server settings, and so on.
If you fail to program defensively, then code that runs as expected on a given standalone server, with a specific configuration, may run very differently in a different environment, under different SQL Server settings, against different data, or under conditions of concurrent access. When this happens, you will be susceptible to erratic behavior in your applications, performance problems, data integrity issues and unhappy users.
The process of reducing the number of vulnerabilities in your code, and so increasing its resilience, is one of constantly questioning the assumptions on which your implementation depends, ensuring they are always enforced if they are valid, and removing them if not. It is a process of constantly testing your code, breaking it, and then refining it based on what you have learned.
The best way to get a feel for this process, and for how to expose vulnerabilities in your code and fix them using defensive programming techniques, is to take a look at a few common areas where I see that code is routinely broken by unintended use cases or erroneous assumptions:
- Unreliable search patterns
- Reliance on specific SQL Server environment settings
- Mistakes and ambiguity during data modifications
In each case, we'll identify the assumptions that lead to code vulnerability, and show how to fix them. All the examples in this article are as simple as possible in that there is no concurrency and the underlying database schema is fixed.
My forthcoming book on this subject introduces many of the additional dangers that can arise when exposing the code to changes in the database schema, running it under high concurrency, and so on.
Reducing Code Vulnerability
There are four key elements to defensive database programming that, when applied, will allow you to eliminate bugs and make your code less vulnerable to being subsequently broken by cases of unintended use:
- Define and understand your assumptions
- Test as many use cases as possible
- Lay out your code in short, fully testable, and fully tested modules
- Reuse your code whenever feasible, so that the code to solve a given problem is implemented in one place only
While I will occasionally make brief mention of the sort of checks and tests that ought to be included in your unit tests (steps 2 and 3), this article is focused on defensive programming, and so on the rigorous application of the first two principles.
Define your Assumptions
One of the most damaging mistakes made during the development of SQL code, and any other code, is a failure to explicitly define the assumptions that have been made regarding how the code should operate, and how it should respond to various inputs. Specifically, we must:
- Explicitly list the assumptions that have been made
- Ensure that these assumptions always hold
- Systematically remove assumptions that are not essential, or are incorrect
When identifying these assumptions, there can be one of three possible outcomes. Firstly, if an assumption is deemed essential, it must be documented, and then tested rigorously to ensure it always holds; I prefer to use unit tests to document such assumptions. Failure to do so will mean that when the code makes it into production it will inevitably be broken as a result of usage that conflicts with the assumption.
Secondly, if the assumption is deemed non-essential, it should, if possible, be removed. Finally, in the worst case, the code may contain assumptions that are simply wrong, and can threaten the integrity of any data that the code modifies. Such assumptions must be eliminated from the code.
Rigorous Testing
As we develop code, we must use all our imagination to come up with cases of unintended use, trying to break our modules. We should incorporate these cases into our testing suites.
As we test, we will find out how different changes affect code execution and learn how to develop code that does not break when "something", for example a language setting or the value of ROWCOUNT, changes.
Having identified a setting that breaks one of our code modules, we should fix it and then identify and fix all other similar problems in our code. We should not stop at that. The defensive programmer must investigate all other database settings that may affect the way the code runs and then review and amend the code again and again, fixing potential problems before they occur. This process usually takes a lot of iterations, but every time we end up with better, more robust code and we will save a lot of potential wasted time in troubleshooting problems, as well as expensive retesting and redeployment, when the code is deployed to production.
Throughout the rest of this article, we'll discuss how this basic defensive coding philosophy is applied in practice, by way of some simple practical examples.
Defending Against Cases of Unintended Use
All-too-often, we consider our code to be finished as soon as it passes a few simple tests. We do not take enough time to identify and test all possible, reasonable use cases for our code. When the inevitable happens, and our code is used in a way we failed to consider, it does not work as expected.
To demonstrate these points, we'll consider an example that shows how (and how not) to use string patterns in searching. We'll analyze a seemingly working stored procedure that searches a Messages table, construct cases of unintended use, and identify an implicit assumption on which the implementation of this procedure relies. We will then need to decide whether to eliminate the assumption or to guarantee that it always holds. Either way, we will end up with a more robust procedure.
Listing 1 contains the code needed to create a sample Messages table, which holds the subject and body of various text messages, and load it with two sample messages. It then creates the stored procedure, SelectMessagesBySubjectBeginning, which will search the messages using a search pattern based on the LIKE keyword. The stored procedure takes one parameter, SubjectBeginning, and is supposed to return every message whose subject starts with the specified text.
CREATE TABLE dbo.Messages
(
MessageID INT IDENTITY(1,1) NOT NULL
PRIMARY KEY,
Subject VARCHAR(30) NOT NULL ,
Body VARCHAR(100) NOT NULL
) ;
GO
INSERT INTO dbo.Messages
( Subject ,
Body
)
SELECT 'Next release delayed' ,
'Still fixing bugs'
UNION ALL
SELECT 'New printer arrived' ,
'By the kitchen area' ;
GO
CREATE PROCEDURE dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning VARCHAR(30)
AS
SET NOCOUNT ON ;
SELECT Subject ,
Body
FROM dbo.Messages
WHERE Subject LIKE @SubjectBeginning + '%' ;
Some preliminary testing against this small set of test data, as shown in Listing 2, does not reveal any problems.
-- must return one row
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning='Next';
Subject Body
------------------------------ -------------------
Next release delayed Still fixing bugs
-- must return one row
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning='New';
Subject Body
------------------------------ -------------------
New printer arrived By the kitchen area
-- must return two rows
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning='Ne';
Subject Body
------------------------------ -------------------
Next release delayed Still fixing bugs
New printer arrived By the kitchen area
-- must return nothing
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning='No Such Subject';
Subject Body
------------------------------ -------------------
Handling Special Characters in Searching
In defensive database programming, it is essential to construct cases of unintended use with which to break our code. The test data in Listing 1 and the stored procedure calls in Listing 2 demonstrate the cases of intended use, and clearly the procedure works, when it is used as intended.
However, have we considered all the possible cases? Will the procedure continue to work as expected in cases of unintended use? Can we find any hidden bugs in this procedure? In fact, it is embarrassingly easy to break this stored procedure, simply by adding a few "off topic" messages to our table, as shown in Listing 3.
INSERT INTO dbo.Messages
( Subject ,
Body
)
SELECT '[OT] Great vacation in Norway!' ,
'Pictures already uploaded'
UNION ALL
SELECT '[OT] Great new camera' ,
'Used it on my vacation' ;
GO
-- must return two rows
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning = '[OT]' ;
Subject Body
------------------------------ -------------------
Our procedure fails to return the expected messages. In fact, by loading one more message, as shown in Listing 4, we can demonstrate that this procedure can also return incorrect data.
INSERT INTO dbo.Messages
( Subject ,
Body
)
SELECT 'Ordered new water cooler' ,
'Ordered new water cooler' ;
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning = '[OT]' ;
Subject Body
------------------------------ -------------------
Ordered new water cooler Ordered new water cooler
When using the LIKE keyword, square brackets ("[" and "]") are treated as wildcard characters, denoting a single character within a given range or set. As a result, while the search was intended to be one for off-topic posts, it in fact searched for "any messages whose subject starts with O or T". Therefore Listing 3 returns no rows, since no such messages existed at that point, whereas Listing 4 "unexpectedly" returns the message starting with "O", rather than the off-topic messages.
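To see this wildcard behavior in isolation, consider the following quick ad hoc check (it is not one of the numbered listings): the first string matches the pattern because it begins with "O", a character in the set [OT], while the second does not match at all, because its first character is a literal opening bracket.
SELECT CASE WHEN 'Ordered new water cooler' LIKE '[OT]%'
            THEN 'matches' ELSE 'no match'
       END AS StartsWithOOrT ,
       CASE WHEN '[OT] Great new camera' LIKE '[OT]%'
            THEN 'matches' ELSE 'no match'
       END AS LiteralBracketSubject ;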
In a similar vein, we can also prove that the procedure fails for messages with the % sign in subject lines, as shown in Listing 5.
INSERT INTO dbo.Messages
( Subject ,
Body
)
SELECT '50% bugs fixed for V2' ,
'Congrats to the developers!'
UNION ALL
SELECT '500 new customers in Q1' ,
'Congrats to all sales!' ;
GO
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning = '50%' ;
Subject Body
------------------------------ ----------------
50% bugs fixed for V2 Congrats to the developers!
500 new customers in Q1 Congrats to all sales!
The problem is basically the same: the % sign is a wildcard character denoting "any string of zero or more characters". Therefore, the search returns the "500 new customers…" row in addition to the desired "50% bugs fixed…" row.
Our testing has revealed an implicit assumption that underpins the implementation of the SelectMessagesBySubjectBeginning stored procedure: the author of this stored procedure did not anticipate or expect that message subject lines could contain special characters, such as square brackets and percent signs. As a result, the search only works if the specified SubjectBeginning does not contain special characters.
Having identified this assumption, we have a choice: we can either change our stored procedure so that it does not rely on this assumption, or we can enforce it.
Enforcing or Eliminating the Special Characters Assumption
Our first option is to fix our data by enforcing the assumption that messages will not contain special characters in their subject line. We can delete all the rows with special characters in their subject line, and then add a CHECK constraint that forbids their future use, as shown in Listing 6. The patterns used in the DELETE command and in the CHECK constraint are advanced and need some explanation. The first pattern, %[[]%, means the following:
- Both percent signs denote "any string of zero or more characters"
- [[] in this case denotes "opening square bracket, ["
- The whole pattern means "any string of zero or more characters, followed by an opening square bracket, followed by another string of zero or more characters", which is equivalent to "any string containing at least one opening square bracket"
Similarly, the second pattern, %[%]%, means "any string containing at least one percent sign".
BEGIN TRAN ;
DELETE FROM dbo.Messages
WHERE Subject LIKE '%[[]%'
OR Subject LIKE '%[%]%' ;
ALTER TABLE dbo.Messages
ADD CONSTRAINT Messages_NoSpecialsInSubject
CHECK(Subject NOT LIKE '%[[]%'
AND Subject NOT LIKE '%[%]%') ;
ROLLBACK TRAN ;
Although enforcing the assumption is easy, does it make practical sense? It depends. I would say that under most circumstances special characters in subject lines should be allowed, so let's consider a second, better option: eliminating the assumption. Note that Listing 6 rolls back the transaction, so that our changes are not persisted in the database.
Listing 7 shows how to alter the stored procedure so that it can handle special characters. To better demonstrate how the procedure escapes special characters, I included some debugging output. Always remember to remove such debugging code before handing over the code for QA and deployment!
ALTER PROCEDURE dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning VARCHAR(50)
AS
SET NOCOUNT ON ;
DECLARE @ModifiedSubjectBeginning VARCHAR(150) ;
SET @ModifiedSubjectBeginning =
REPLACE(REPLACE(@SubjectBeginning,
'[',
'[[]'),
'%',
'[%]') ;
SELECT @SubjectBeginning AS [@SubjectBeginning] ,
@ModifiedSubjectBeginning AS
[@ModifiedSubjectBeginning] ;
SELECT Subject ,
Body
FROM dbo.Messages
WHERE Subject LIKE @ModifiedSubjectBeginning + '%' ;
GO
Listing 8 demonstrates that our stored procedure now correctly handles special characters. Of course, in a real world situation, all previous test cases have to be rerun to check that we didn't break them in the process of fixing the bug.
-- must return two rows
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning = '[OT]' ;
@SubjectBeginning @ModifiedSubjectBeginning
----------------------------------------------------------
[OT] [[]OT]
Subject Body
------------------------------ ----------------------------
[OT] Great vacation in Norway! Pictures already uploaded
[OT] Great new camera Used it on my vacation
-- must return one row
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning='50%';
@SubjectBeginning @ModifiedSubjectBeginning
------------------------------------------------------------------
50% 50[%]
Subject Body
------------------------------ ----------------------
50% bugs fixed for V2 Congrats to the developers!
Whether we ultimately decide to enforce or eliminate the assumption, we have created a more robust search procedure as a result.
Defending Against Changes in SQL Server Settings
A common mistake made by developers is to develop SQL code on a given SQL Server, with a defined set of properties and settings, and then fail to consider how their code will respond when executed on instances with different settings, or when users change settings at the session level.
Let's examine a few simple cases of how hidden assumptions with regard to server settings can result in vulnerable code.
How SET ROWCOUNT can break a Trigger
Traditionally, developers have relied on the SET ROWCOUNT command to limit the number of rows returned to a client for a given query, or to limit the number of rows on which a data modification statement (UPDATE, DELETE, MERGE or INSERT) acts. In either case, SET ROWCOUNT works by instructing SQL Server to stop processing after a specified number of rows.
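If you have not used the setting before, the following quick sketch (reusing the Messages table from Listing 1) shows the effect: the first SELECT returns at most two rows, however many messages exist, and resetting ROWCOUNT to its default of 0 removes the limit for the second SELECT.
SET ROWCOUNT 2 ;
SELECT Subject FROM dbo.Messages ;
-- 0 means "no limit", which is the default
SET ROWCOUNT 0 ;
SELECT Subject FROM dbo.Messages ;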
However, use of SET ROWCOUNT can have some unexpected consequences for the unwary developer. Consider a very simple table, Objects, which stores basic size and weight information about objects, as shown in Listing 9.
CREATE TABLE dbo.Objects
(
ObjectID INT NOT NULL PRIMARY KEY ,
SizeInInches FLOAT NOT NULL ,
WeightInPounds FLOAT NOT NULL
) ;
GO
INSERT INTO dbo.Objects
( ObjectID ,
SizeInInches ,
WeightInPounds
)
SELECT 1 ,
10 ,
10
UNION ALL
SELECT 2 ,
12 ,
12
UNION ALL
SELECT 3 ,
20 ,
22 ;
GO
We are required to start logging all updates of existing rows in this table, so we create a second table, ObjectsChangeLog, in which to record the changes made, and a trigger that will fire whenever data in the Objects table is updated, record details of the changes made, and insert them into ObjectsChangeLog.
CREATE TABLE dbo.ObjectsChangeLog
(
ObjectsChangeLogID INT NOT NULL
IDENTITY ,
ObjectID INT NOT NULL ,
ChangedColumnName VARCHAR(20) NOT NULL ,
ChangedAt DATETIME NOT NULL ,
OldValue FLOAT NOT NULL ,
CONSTRAINT PK_ObjectsChangeLog PRIMARY KEY
( ObjectsChangeLogID )
) ;
GO
CREATE TRIGGER Objects_UpdTrigger ON dbo.Objects
FOR UPDATE
AS
BEGIN;
INSERT INTO dbo.ObjectsChangeLog
( ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
)
SELECT i.ObjectID ,
'SizeInInches' ,
CURRENT_TIMESTAMP ,
d.SizeInInches
FROM inserted AS i
INNER JOIN deleted AS d ON
i.ObjectID = d.ObjectID
WHERE i.SizeInInches <> d.SizeInInches
UNION ALL
SELECT i.ObjectID ,
'WeightInPounds' ,
CURRENT_TIMESTAMP ,
d.WeightInPounds
FROM inserted AS i
INNER JOIN deleted AS d ON
i.ObjectID = d.ObjectID
WHERE i.WeightInPounds <> d.WeightInPounds ;
END ;
Please note that my approach to all examples in this book is to keep them as simple as they can be while still providing a realistic demonstration of the point, which here is the effect of SET ROWCOUNT. So, in this case, I have omitted:
- A "real" key on the ObjectsChangeLog table, enforced by a UNIQUE constraint (ObjectID,ChangedColumnName, ChangedAt), in addition to the surrogate key on ObjectsChangeLogID
- The equivalent insert and delete triggers to log INSERT and DELETE modifications, as well as UPDATEs
Likewise, there are several ways of logging changes and the one I chose here may not be the best approach; again my goal was to keep the example focused and simple. Listing 11 shows the code that tests how our trigger logs changes against the Objects table.
BEGIN TRAN ;
-- TRUNCATE TABLE can also be used here
DELETE FROM dbo.ObjectsChangeLog ;
UPDATE dbo.Objects
SET SizeInInches = 12 ,
WeightInPounds = 14
WHERE ObjectID = 1 ;
-- we are selecting just enough columns
-- to demonstrate that the trigger works
SELECT ObjectID ,
ChangedColumnName ,
OldValue
FROM dbo.ObjectsChangeLog ;
-- we do not want to change the data,
-- only to demonstrate how the trigger works
ROLLBACK ;
-- the data has not been modified by this script
ObjectID ChangedColumnName OldValue
----------- -------------------- ------
1 SizeInInches 10
1 WeightInPounds 10
Apparently, our trigger works as expected! However, with a little further testing, we can prove that the trigger will sometimes fail to log UPDATEs made to the Objects table, due to an underlying assumption in the trigger code, of which the developer may not even have been aware!
The ROWCOUNT Assumption
Let's consider what might happen if, within a given session, a user changed the default value for ROWCOUNT and then updated the Objects table, without resetting ROWCOUNT, as shown in Listing 12.
DELETE FROM dbo.ObjectsChangeLog ;
SET ROWCOUNT 1 ;
-- do some other operation(s)
-- for which we needed to set rowcount to 1
-- do not restore ROWCOUNT setting
-- to its default value
BEGIN TRAN ;
UPDATE dbo.Objects
SET SizeInInches = 12 ,
WeightInPounds = 14
WHERE ObjectID = 1 ;
-- make sure to restore ROWCOUNT setting
-- to its default value so that it does not affect the
-- following SELECT
SET ROWCOUNT 0 ;
SELECT ObjectID ,
ChangedColumnName ,
OldValue
FROM dbo.ObjectsChangeLog ;
ROLLBACK ;
ObjectID ChangedColumnName OldValue
----------- -------------------- ---------
1 SizeInInches 10
As a result of the change to the ROWCOUNT value, our trigger processes the query that logs changes to the SizeInInches column, returns one row, and then ceases processing. This means that it fails to log the change to the WeightInPounds column. Of course, there is no guarantee that the trigger will log the change to the SizeInInches column. On your server, the trigger may log only the change of WeightInPounds but fail to log the change in SizeInInches. Which column will be logged depends on the execution plan chosen by the optimizer, and we cannot assume that the optimizer will always choose one and the same plan for a query.
Although the developer of the trigger may not have realized it, the implied assumption regarding its implementation is that ROWCOUNT is set to its default value. Listing 12 proves that when this assumption is not true, the trigger will not work as expected.
Enforcing and Eliminating the ROWCOUNT Assumption
Once we understand the problem, we can fix the trigger very easily, by resetting ROWCOUNT to its default value at the very beginning of the body of the trigger, as shown in Listing 13.
ALTER TRIGGER dbo.Objects_UpdTrigger ON dbo.Objects
FOR UPDATE
AS
BEGIN;
-- the scope of this setting is the body of the trigger
SET ROWCOUNT 0 ;
INSERT INTO dbo.ObjectsChangeLog
( ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
)
SELECT i.ObjectID ,
'SizeInInches' ,
CURRENT_TIMESTAMP ,
d.SizeInInches
FROM inserted AS i
INNER JOIN deleted AS d ON
i.ObjectID = d.ObjectID
WHERE i.SizeInInches <> d.SizeInInches
UNION ALL
SELECT i.ObjectID ,
'WeightInPounds' ,
CURRENT_TIMESTAMP ,
d.WeightInPounds
FROM inserted AS i
INNER JOIN deleted AS d ON
i.ObjectID = d.ObjectID
WHERE i.WeightInPounds <>
d.WeightInPounds ;
END ;
-- after the body of the trigger completes,
-- the original value of ROWCOUNT is restored
-- by the database engine
We can rerun the test from Listing 12 and this time the trigger will work as required, logging both changes. Note that the scope of our SET ROWCOUNT is the trigger, so our change will not affect the ROWCOUNT value in effect in the session that fired the trigger.
SET ROWCOUNT is deprecated in SQL Server 2008...
...and eventually, in some future version, will have no effect on INSERT, UPDATE or DELETE statements. Microsoft advises rewriting any such statements that rely on ROWCOUNT to use TOP instead. As such, this example may be somewhat less relevant for future versions of SQL Server; the trigger might be less vulnerable to being broken, although still not immune. However, at the time of writing, this example is very relevant.
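To illustrate the recommended rewrite, here is a hypothetical sketch; the table, filter and row limit are purely illustrative, and the transaction is rolled back so no data is actually removed. A statement that caps its own row count with TOP does not depend on any session setting.
BEGIN TRAN ;
-- relying on the session setting...
SET ROWCOUNT 10 ;
DELETE FROM dbo.ObjectsChangeLog
WHERE ChangedAt < '20090101' ;
SET ROWCOUNT 0 ;
-- ...versus stating the limit in the statement itself
DELETE TOP ( 10 )
FROM dbo.ObjectsChangeLog
WHERE ChangedAt < '20090101' ;
ROLLBACK ;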
In this case, one simple step both enforces the underlying assumption, by ensuring that it is always valid, and eliminates it, by ensuring that the code continues to work in cases where ROWCOUNT is not at its default value.
Proactively Fixing SET ROWCOUNT Vulnerabilities
We have fixed the ROWCOUNT vulnerability in our trigger, but our job is not done. What about other modules in our system? Might they not have the same vulnerability?
Having learned of the potential side effects of SET ROWCOUNT, we can now analyze all the other modules in our system, determine if they have the same problem, and fix them if they do. For example, our stored procedure SelectMessagesBySubjectBeginning (Listing 1) has the same vulnerability, as demonstrated by the test in Listing 14.
SET ROWCOUNT 1 ;
-- must return two rows
EXEC dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning = 'Ne' ;
…(Snip)…
Subject Body
------------------------------ -------------------
Next release delayed Still fixing bugs
We can apply the same fix, adding SET ROWCOUNT 0; to the very beginning of this stored procedure. Similarly, we should apply this fix to all other modules that need it.
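For reference, the amended procedure might look like the following sketch; the body is simply Listing 7 with the debugging output removed and the defensive SET added.
ALTER PROCEDURE dbo.SelectMessagesBySubjectBeginning
@SubjectBeginning VARCHAR(50)
AS
SET NOCOUNT ON ;
-- defend against a non-default ROWCOUNT inherited from the caller
SET ROWCOUNT 0 ;
DECLARE @ModifiedSubjectBeginning VARCHAR(150) ;
SET @ModifiedSubjectBeginning =
REPLACE(REPLACE(@SubjectBeginning, '[', '[[]'),
'%', '[%]') ;
SELECT Subject ,
Body
FROM dbo.Messages
WHERE Subject LIKE @ModifiedSubjectBeginning + '%' ;
GO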
If your code is supposed to exist for a considerable time, then it makes perfect sense to fix problems proactively. It is usually faster and easier to do so than to wait until the problem occurs, spend considerable time troubleshooting, and then eventually implement the same fix.
How SET LANGUAGE can break a Query
Just as the value of ROWCOUNT can be changed at the session level, so can other settings, such as the default language. Many developers test their code only under the default language setting of their server and do not test how their code will respond if executed on a server with a different language setting, or to a change in the setting at the session level.
This practice is perfectly correct as long as our code always runs under the same settings as those under which we develop and test it. However, if or when the code runs under different settings, this practice will often result in code that is vulnerable to errors, especially when dealing with dates.
Consider the case of a stored procedure that is supposed to retrieve from our ObjectsChangeLog table (Listing 10) a listing of all changes made to the Objects table over a given date range. According to the requirements, only the beginning of the range is required; the end of the range is an optional parameter. If an upper bound for the date range is not provided, we are required to use a date far in the future, December 31st, 2099, as the end of our range.
CREATE PROCEDURE dbo.SelectObjectsChangeLogForDateRange
@DateFrom DATETIME ,
@DateTo DATETIME = NULL
AS
SET ROWCOUNT 0 ;
SELECT ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
FROM dbo.ObjectsChangeLog
WHERE ChangedAt BETWEEN @DateFrom
AND COALESCE(@DateTo, '12/31/2099') ;
GO
Note that this stored procedure uses a string literal, 12/31/2099, to denote December 31st, 2099. Although 12/31/2099 does represent December 31st, 2099 in many languages, such as US English, in many other cultures, such as Norwegian, this string does not represent a valid date. This means that the author of this stored procedure has made an implicit assumption: the code will always run under language settings where 12/31/2099 represents December 31st, 2099.
When we convert string literals to DATETIME values, we do not have to make assumptions about language settings. Instead, we can explicitly specify the DATETIME format from which we are converting.
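Two quick ad hoc illustrations of this (they are not part of the numbered listings): the unseparated YYYYMMDD format is interpreted the same way under any language setting, and CONVERT with an explicit style number states the incoming format outright.
-- interpreted identically regardless of SET LANGUAGE or SET DATEFORMAT
SELECT CAST('20991231' AS DATETIME) AS SafeLiteral ;
-- style 101 explicitly declares the incoming string as mm/dd/yyyy
SELECT CONVERT(DATETIME, '12/31/2099', 101) AS ExplicitStyle ;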
The following scripts demonstrate both the safe way to convert character strings to DATETIME values, and the vulnerability of our stored procedure to changes in language settings. The script shown in Listing 16 populates the ObjectsChangeLog table and calls the SelectObjectsChangeLogForDateRange stored procedure under two different language settings, US English and Norwegian.
-- we can populate this table via our trigger, but
-- I used INSERTs, to keep the example simple
INSERT INTO dbo.ObjectsChangeLog
( ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
)
SELECT 1 ,
'SizeInInches' ,
-- the safe way to provide July 7th, 2009
'20090707',
12.34 ;
GO
SET LANGUAGE 'us_english' ;
-- this conversion always works in the same way,
-- regardless of the language settings,
-- because the format is explicitly specified
EXEC dbo.SelectObjectsChangeLogForDateRange
@DateFrom = '20090101';
SET LANGUAGE 'Norsk' ;
EXEC dbo.SelectObjectsChangeLogForDateRange
@DateFrom = '20090101';
-- your actual error message may be different from mine,
-- depending on the version of SQL Server
Changed language setting to us_english.
(successful output skipped)
Changed language setting to Norsk.
ObjectID ChangedColumnName ChangedAt OldValue
----------- -------------------- ----------------------- --------------
Msg 242, Level 16, State 3, Procedure SelectObjectsChangeLogForDateRange, Line 6
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.
Under the Norwegian language settings we receive an error at the point where the procedure attempts to convert 12/31/2099 into a DATETIME value.
Note that we are, in fact, quite fortunate to receive an error message right away. Should we, in some other script or procedure, convert '10/12/2008' to DATETIME, SQL Server would silently convert this constant to a wrong value and we'd get incorrect results. Listing 17 shows how our stored procedure can return unexpected results without raising errors; such silent bugs may be very difficult to troubleshoot.
INSERT INTO dbo.ObjectsChangeLog
( ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
)
SELECT 1 ,
'SizeInInches' ,
-- this means June 15th, 2009
'20090615',
12.3
UNION ALL
SELECT 1 ,
'SizeInInches' ,
-- this means September 15th, 2009
'20090915',
12.5
SET LANGUAGE 'us_english' ;
-- this call returns rows from Jul 6th to Sep 10th, 2009
-- one log entry meets the criteria
EXEC SelectObjectsChangeLogForDateRange
@DateFrom = '07/06/2009',
@DateTo = '09/10/2009' ;
SET LANGUAGE 'Norsk' ;
-- this call returns rows from Jun 7th to Oct 9th, 2009
-- three log entries meet the criteria
EXEC SelectObjectsChangeLogForDateRange
@DateFrom = '07/06/2009',
@DateTo = '09/10/2009' ;
Changed language setting to us_english.
ObjectID ChangedColumnName ChangedAt OldValue
----------- -------------------- -----------------------
1 SizeInInches 2009-07-07 12.34
-- because the stored procedure does not have an ORDER BY
-- clause, your results may show up in a different
-- order
Changed language setting to Norsk.
ObjectID ChangedColumnName ChangedAt OldValue
----------- -------------------- -----------------------
1 SizeInInches 2009-07-07 12.34
1 SizeInInches 2009-06-15 12.3
1 SizeInInches 2009-09-15 12.5
To fix the stored procedure, as shown in Listing 18, we need to explicitly specify the format from which we convert the VARCHAR values provided when the stored procedure is executed.
ALTER PROCEDURE dbo.SelectObjectsChangeLogForDateRange
@DateFrom DATETIME ,
@DateTo DATETIME = NULL
AS
SET ROWCOUNT 0 ;
SELECT ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
FROM dbo.ObjectsChangeLog
WHERE ChangedAt BETWEEN @DateFrom
AND COALESCE(@DateTo,
'20991231') ;
The stored procedure will now run correctly, regardless of the language settings. In this case, we chose to fix the problem by eliminating the assumption. Alternatively, in some cases, we might choose to enforce it by setting the language at the beginning of the stored procedure, just as we did with the ROWCOUNT setting.
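A sketch of that alternative is shown below; as with SET ROWCOUNT, a SET statement issued inside a stored procedure reverts when the procedure completes, so the caller's session is not affected. The original 12/31/2099 literal is kept here, on the now-enforced assumption that it is always interpreted under us_english.
ALTER PROCEDURE dbo.SelectObjectsChangeLogForDateRange
@DateFrom DATETIME ,
@DateTo DATETIME = NULL
AS
SET ROWCOUNT 0 ;
-- enforce the assumption instead of removing it
SET LANGUAGE 'us_english' ;
SELECT ObjectID ,
ChangedColumnName ,
ChangedAt ,
OldValue
FROM dbo.ObjectsChangeLog
WHERE ChangedAt BETWEEN @DateFrom
AND COALESCE(@DateTo, '12/31/2099') ;
GO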
Of course, there are situations when our code will always run under one and the same settings, in which case there is no need to do anything. For example, if a module implements business rules specific to the state of Minnesota, it is reasonable to assume that it will always run under the same language settings.
Defensive Data Modification
Data modification is, in general, an area in which I see developers getting into trouble time and again. We'll start with a case that demonstrates how data can be erroneously updated as a result of a false assumption in the stored procedure that modifies it. It is a simple example, but the underlying problem is a very common one: using search criteria that affect more rows than intended.
We'll then discuss a second, somewhat more complex case, where an UPDATE can go wrong because it fails to unambiguously identify the row(s) to be modified, perhaps falsely assuming that the underlying data structures will ensure that no such ambiguity exists.
Updating more rows than intended
Listing 19 creates a simple Employee table, and a SetEmployeeManager stored procedure that assigns a manager to a given employee.
CREATE TABLE dbo.Employee
(
EmployeeID INT NOT NULL ,
ManagerID INT NULL ,
FirstName VARCHAR(50) NULL ,
LastName VARCHAR(50) NULL ,
CONSTRAINT PK_Employee_EmployeeID
PRIMARY KEY CLUSTERED ( EmployeeID ASC ) ,
CONSTRAINT FK_Employee_EmployeeID_ManagerID
FOREIGN KEY ( ManagerID )
REFERENCES dbo.Employee ( EmployeeID )
) ;
GO
CREATE PROCEDURE dbo.SetEmployeeManager
@FirstName VARCHAR(50) ,
@LastName VARCHAR(50) ,
@ManagerID INT
AS
SET NOCOUNT ON ;
UPDATE dbo.Employee
SET ManagerID = @ManagerID
WHERE FirstName = @FirstName
AND LastName = @LastName ;
Clearly, the person who developed the stored procedure assumed that, at most, one employee may have the provided first and last name. If there happen to be two people in the organization with the same name, then this stored procedure will assign them both to the same manager.
Again, having uncovered the assumption, we need to decide whether to enforce it or eliminate it. We could enforce it simply by placing a UNIQUE constraint on the FirstName and LastName columns. However, in this case, it seems much more reasonable to assume that there may well be more than one employee with the same first and last name, and that these namesake employees may report to different managers. Therefore, we need to eliminate the incorrect assumption. There are many ways to do this, the simplest being to ensure that the parameter supplied to the stored procedure, and used in the search criteria, identifies a unique row, as shown in Listing 20.
ALTER PROCEDURE dbo.SetEmployeeManager
@EmployeeID INT ,
@ManagerID INT
AS
SET NOCOUNT ON ;
UPDATE dbo.Employee
SET ManagerID = @ManagerID
WHERE EmployeeID = @EmployeeID ;
As long as EmployeeID is the primary key on the dbo.Employee table, this procedure will work correctly.
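For comparison, had the business rules said that duplicate names really are impossible, enforcing the original assumption would have amounted to a single constraint along these lines (the constraint name is illustrative):
ALTER TABLE dbo.Employee
ADD CONSTRAINT UNQ_Employee_FirstName_LastName
UNIQUE ( FirstName, LastName ) ;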
The Problem of Ambiguous Updates
The results of data modifications may be unpredictable in the hands of the careless programmer. Let's consider a very common requirement: populating a permanent table from a staging table. First of all, let's create our permanent table, Codes, and a staging table, CodesStaging, as shown in Listing 21. Note that CodesStaging does not have a primary key. This is very common for staging tables, because data is often loaded into such tables before detecting duplicates and other data integrity violations.
CREATE TABLE dbo.Codes
(
Code VARCHAR(5) NOT NULL ,
Description VARCHAR(40) NOT NULL ,
CONSTRAINT PK_Codes PRIMARY KEY ( Code )
) ;
GO
CREATE TABLE dbo.CodesStaging
(
Code VARCHAR(10) NOT NULL ,
Description VARCHAR(40) NOT NULL
) ;
GO
Now, let's populate each table with some sample data, as shown in Listing 22.
DELETE FROM dbo.Codes ;
INSERT INTO dbo.Codes
( Code ,
Description
)
SELECT 'AR' ,
'Old description for Arkansas'
UNION ALL
SELECT 'IN' ,
'Old description for Indiana' ;
DELETE FROM dbo.CodesStaging ;
INSERT INTO dbo.CodesStaging
( Code ,
Description
)
SELECT 'AR' ,
'description for Argentina'
UNION ALL
SELECT 'AR' ,
'new description for Arkansas'
UNION ALL
SELECT 'IN' ,
'new description for Indiana ' ;
Now, we'll examine two different ways of updating data in the permanent table, based on data in the staging table, both of which are subject to ambiguities if care is not taken:
- Using UPDATE…FROM
- Updating an inline view
We'll then discuss strategies for avoiding such ambiguities.
Using UPDATE…FROM
Notice in Listing 22 that the incoming data in our staging table has a duplicate: the code AR occurs twice, with different descriptions. Suppose that we have not detected or resolved this duplicate, and that we are updating our Codes table from the staging table.
UPDATE dbo.Codes
SET Description = s.Description
FROM dbo.Codes AS c INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code ;
SELECT Code ,
Description
FROM dbo.Codes ;
Code Description
---------- ----------------------------------------
AR description for Argentina
IN new description for Indiana
(2 row(s) affected)
Although two descriptions were provided for the AR code, the UPDATE…FROM command did not raise an error; it just silently updated the corresponding row in the Codes table with one of the two provided values. In this case, the 'old description for Arkansas' has been overwritten with the 'description for Argentina'.
Updating Inline Views
When we update inline views, we may encounter exactly the same problem. First, repopulate each of the tables with the original data, using the code from Listing 22. Next, create an inline view, and then use it to implement exactly the same functionality as the previous UPDATE…FROM commands, as shown in Listing 24.
WITH c AS ( SELECT c.Code ,
c.Description ,
s.Description AS NewDescription
FROM dbo.Codes AS c
INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code
)
UPDATE c
SET Description = NewDescription ;
SELECT Code ,
Description
FROM dbo.Codes ;
Code Description
---------- ----------------------------------------
AR description for Argentina
IN new description for Indiana
Note that neither in this example nor in the previous UPDATE…FROM example can we predict which of these two values will end up in the target table – that, as usual, depends on the execution plan and as such is completely unpredictable. It is by pure chance that, in my examples, Argentina was chosen over Arkansas in both cases. I was able to get different results, with the description of Arkansas rather than Argentina inserted into Codes, just by changing the order in which the rows are inserted into CodesStaging. However, again, there is no guarantee that you will get the same results on your box. Also, bear in mind that if we ever did add an index to the staging table, this would almost certainly affect the result as well.
How to Avoid Ambiguous Updates
In both previous examples, the developer has written the UPDATE command apparently under the assumption that there can be no duplicate data in the CodesStaging table – which cannot be guaranteed in the absence of a UNIQUE or PRIMARY KEY constraint on the Code column – or that any duplicate data should have been removed before updating the permanent table.
Generally, performing this sort of ambiguous update is unacceptable. In some cases, we might want to refine the query to make sure it never yields ambiguous results. However, typically we want either to raise an error when an ambiguity is detected, or to update only what is unambiguous.
In SQL Server 2008, we can circumvent such problems with UPDATE…FROM or CTE-based updates, by use of the MERGE command. However, prior to SQL Server 2008, we have to detect these ambiguities ourselves.
Using MERGE to Detect Ambiguity (SQL Server 2008 only)
If you are working with SQL Server 2008, then easily the best option is to use the MERGE command. In Listing 25, we use the MERGE command to update our primary table from our staging table and immediately encounter the expected error.
MERGE INTO dbo.Codes AS c
USING dbo.CodesStaging AS s
ON c.Code = s.Code
WHEN MATCHED
THEN UPDATE
SET c.Description = s.Description ;
Msg 8672, Level 16, State 1, Line 1
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
An ANSI-standard Method
Pre-SQL Server 2008, we are forced to seek alternative ways to raise an error whenever there is an ambiguity. The code in Listing 26 is ANSI-standard SQL and accomplishes that goal.
-- rerun the code from Listing 22
-- before executing this code
UPDATE dbo.Codes
SET Description =
( SELECT Description
FROM dbo.CodesStaging
WHERE Codes.Code = CodesStaging.Code
)
WHERE EXISTS ( SELECT *
FROM dbo.CodesStaging AS s
WHERE Codes.Code = s.Code
) ;
Msg 512, Level 16, State 1, Line 3
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
The statement has been terminated.
Note that in order to update just one column we had to use two almost identical subqueries in this command. This is definitely not a good practice. Should we need to update ten columns, we would have to repeat almost the same code eleven times! If, at some later time, we need to modify the subquery, we will have to make one and the same change in eleven places, which is very prone to errors.
Defensive Inline View Updates
Fortunately, there are several ways to improve the robustness of inline view updates, as well as UPDATE…FROM updates (covered in the next section), which work in SQL Server 2005.
In the previous two examples, an error was raised when ambiguity was detected. This is usually preferable but, if your business rules allow you to ignore ambiguities, and only update that which is unambiguous, then the solution shown in Listing 27 will work.
-- rerun the code from Listing 22
-- before executing this code
BEGIN TRAN ;
WITH c AS ( SELECT c.Code ,
c.Description ,
s.Description AS NewDescription
FROM dbo.Codes AS c
INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code
AND ( SELECT COUNT(*)
FROM dbo.CodesStaging AS s1
WHERE c.Code = s1.Code
) = 1
)
UPDATE c
SET Description = NewDescription ;
ROLLBACK ;
This time, only the description of Indiana is updated. In a similar fashion, we could filter out (i.e. ignore) ambiguities with the help of an analytical function, as shown in Listing 28.
-- rerun the code from Listing 22
-- before executing this code
BEGIN TRAN ;
WITH c AS ( SELECT c.Code ,
c.Description ,
s.Description AS NewDescription ,
COUNT(*) OVER ( PARTITION BY s.Code )
AS NumVersions
FROM dbo.Codes AS c
INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code
)
UPDATE c
SET Description = NewDescription
WHERE NumVersions = 1 ;
ROLLBACK ;
In some cases, the approach of only performing unambiguous updates, and silently ignoring ambiguous ones, is unacceptable. In the absence of built-in methods, we can use tricky workarounds to reuse the code as much as possible and still raise an error if there is an ambiguity. Consider the example shown in Listing 29, in which a divide by zero occurs if there is an ambiguity.
-- rerun the code from Listing 22
-- before executing this code
DECLARE @ambiguityDetector INT ;
WITH c AS ( SELECT c.Code ,
c.Description ,
s.Description AS NewDescription ,
COUNT(*) OVER ( PARTITION BY s.Code )
AS NumVersions
FROM dbo.Codes AS c
INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code
)
UPDATE c
SET Description = NewDescription ,
@ambiguityDetector = CASE WHEN NumVersions = 1
THEN 1
-- if we have ambiguities, the following branch executes
-- and raises the following error:
-- Divide by zero error encountered.
ELSE 1 / 0
END ;
Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.
The statement has been terminated.
Of course, the error message raised by this code (divide by zero) is misleading, so we should only use this approach when any of the previous alternatives are not viable.
Defensive UPDATE…FROM
Some of the approaches just outlined for improving the robustness of inline view updates apply equally well to the UPDATE…FROM command.
For example, we can use a sub-query to ignore ambiguities, as shown in Listing 30.
-- rerun the code from Listing 22
-- before executing this code
BEGIN TRAN ;
UPDATE dbo.Codes
SET Description = 'Old Description' ;
UPDATE dbo.Codes
SET Description = s.Description
FROM dbo.Codes AS c
INNER JOIN dbo.CodesStaging AS s
ON c.Code = s.Code
AND ( SELECT COUNT(*)
FROM dbo.CodesStaging AS s1
WHERE s.Code = s1.Code
) = 1 ;
SELECT Code ,
Description
FROM dbo.Codes ;
ROLLBACK ;
Likewise, we can use an analytical function for detecting and ignoring ambiguities, as shown in Listing 31.
-- rerun the code from Listing 22
-- before executing this code
BEGIN TRAN ;
UPDATE dbo.Codes
SET Description = 'Old Description' ;
UPDATE dbo.Codes
SET Description = s.Description
FROM dbo.Codes AS c
INNER JOIN ( SELECT Code ,
Description ,
COUNT(*) OVER ( PARTITION BY Code )
AS NumValues
FROM dbo.CodesStaging
) AS s
ON c.Code = s.Code
AND NumValues = 1 ;
SELECT Code ,
Description
FROM dbo.Codes ;
ROLLBACK ;
Summary
The goal of this article was to introduce, by way of some simple examples, some of the basic ideas that underpin defensive database programming. It is vital that you understand and document the assumptions that underpin your implementation, test them to ensure their validity, and eliminate them if they are not. It is also vital that you consider as many use cases as possible for your code, and ensure it behaves consistently in each case. Where inconsistencies or incorrect behavior are found, the defensive programmer will not only fix the offending module, but also test all other modules that might suffer from a similar problem and proactively safeguard against it.
Along the way, I hope you've learned the following specific lessons in defensive programming:
- How to use complex patterns to improve the robustness of LIKE searches
- How to avoid potential difficulties with SET ROWCOUNT
- The importance of safe date formats and of explicitly specifying the required format when converting dates
- How to avoid dangerous ambiguity when performing updates by, for example:
- Using MERGE, in SQL Server 2008
- Using subqueries, pre-SQL Server 2008
- How to use subqueries or the COUNT(*) OVER analytic function to improve the robustness of modifications when using UPDATE...FROM, or updating inline views, so that ambiguous updates are ignored.