Archive

Archive for the ‘Uncategorized’ Category

What are Candidate, alternate and composite keys?

November 9, 2006 9 comments

A candidate key is one that can identify each row of a table uniquely. Generally a candidate key becomes the primary key of the table. If the table has more than one candidate key, one of them will become the primary key, and the rest are called alternate keys.

A key formed by combining at least two or more columns is called composite key.

Normalization Basics

November 8, 2006 1 comment

Normalization
(1) In relational database design, the process of organizing data to minimize redundancy. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.There are three main normal forms, each with increasing levels of normalization:

  • First Normal Form (1NF): Each field in a table contains different information. For example, in an employee list, each table would contain only one birthdate field.
  • Second Normal Form (2NF): Each field in a table that is not a determiner of the contents of another field must itself be a function of the other fields in the table.
  • Third Normal Form (3NF): No duplicate information is permitted. So, for example, if two tables both require a birthdate field, the birthdate information would be separated into a separate table, and the two other tables would then access the birthdate information via an index field in the birthdate table. Any change to a birthdate would automatically be reflect in all tables that link to the birthdate table.

There are additional normalization levels, such as Boyce Codd Normal Form (BCNF), fourth normal form (4NF) and fifth normal form (5NF). While normalization makes databases more efficient to maintain, they can also make them more complex because data is separated into so many different tables.

(2) In data processing, a process applied to all data in a set that produces a specific statistical property. For example, each expenditure for a month can be divided by the total of all expenditures to produce a percentage.

(3) In programming, changing the format of a floating-point number so the left-most digit in the mantissa is not a zero.

Locks in SQL

November 8, 2006 12 comments

Introduction
Relational databases like Microsoft’s SQL Server use locks to prevent multiple users from making conflicting modifications to a set of data: when a set of data is locked by a user, no other users can modify that same set of data until the first user finishes modifying the data and relinquishes the lock. There are exceptions, but let’s not go there.

Some databases – SQL Server included – use locks to prevent users from seeing uncommitted modifications. In these systems, if UserA is modifying some set of data, UserB and all the rest of the users must wait until UserA is done modifying that data before UserB can get a shot at even reading the data, let alone modifying it.

Databases place locks at all levels of their physical hierarchies: rows, pages (typically a few KB of rows), extents (typically a few pages), entire tables, and entire databases. Some databases (Oracle, others?) only use fine-grained row locks, others don’t do row locks at all and only allow rough-grained page, extent, table, and database locks. Most databases – SQL Server included – support row locking, but often use rough-grained locks. This is because lock management is a royal pain. Locks aren’t small or simple entities, so if you only do row-level locking, you can get yourself into a world of pain: a million-row update can easily swamp memory and be a bear to manage.

Databases that don’t do just row-level locking often use a technique called lock escalation to achieve better performance. Unless its clear from the outset that a whole table will be modified, these databases start off using row locks, and they make plans to trade these locks in for rough-grained locks later if too many rows are modified.

Unfortunately, lock escalation introduces and amplifies a whole new problem: deadlock. If two users try to modify semantically-unrelated but physically-near data in two separate tables in reverse order, both users will start off with row locks, then try to upgrade them to page locks, and the situation will be that each user wants something the other user has, so they’re stuck. This is deadlock.

For example:

* UserA modifies some rows in TableA, causing a page lock affecting not just the rows UserA modified, but many others
* UserB modifies some rows in TableB, causing a page lock affecting not just the rows UserA modified, but many others
* UserA wants to modify some rows that UserB has locked (but not modified) in TableB
* UserB wants to modify – or maybe just access – some rows that UserA has locked (but not modified) in TableA.

Something’s gotta give. To deal with this problem, the database occasionally looks for deadlocks, and kills off one of the transactions so the other can finish. It usually kills the one that’s made the least modifications so that it minimizes the cost of rolling back changes. Databases that use only row-level locking almost never have this problem because two users rarely want to modify the exact same row, and even more rarely do they attain locks in the perfectly poor order needed to cause deadlock.

Also, databases like this use lock timeouts to prevent users from waiting too long for a lock. Query timeouts also factor in here. You can write code to retry queries that timeout, but this only automates database congestion. Any timeout that is often reached will only serve to worsen the user experience. Things simply should not take that long.

In practice and under high load, SQL Server’s locking system – which is based on lock escalation – does not perform well. Why? Lock contention. Lock contention is the problems of deadlock and waiting for locks. In a system in which many users are modifying the database at once, and many more users are trying to access the database concurrently, the locks are flying, users spend a lot of time waiting to attain locks, deadlocks are frequent, and users are far from happy.

Granted, if you’ve only got a few occasional users you won’t have much trouble with SQL Server’s out-of-the-box behavior. You’ll be hard pressed to see these problems with simple in-the-office tests or deployments involving just a few users. But throw a couple hundred concurrent users at your database and a constant stream of INSERTS and UPDATES with quite a few DELETEs sprinkled in, and you’ll start reading Oracle literature and eyeing your war chest. However, I’ve got a solution for you that will only cost you a code review, a few minor tweaks, and a system test. You do have a system test procedure in place, right?

Lock Contention Solved

If you used Streamload.com at all during June, July, and August, you probably got a “You were the deadlock loser” error, or a “Lock timeout” error, or an “Object required” error. These were all caused by lock contention. After scouring the documentation and talking to a few people, I learned what I summarized above and will say again here:

* SQL Server starts with row-level locks, but often escalates these to page and table locks, causing deadlocks
* SQL Server requires locks for reading from the database (SELECTs), so even folks not trying to modify the database are affected by the lock system.

Fortunately, I stumbled across some obscure keywords from the SQL Server lexicon: NOLOCK and ROWLOCK. They are used like this:

SELECT COUNT(UserID) FROM Users WITH (NOLOCK) WHERE Username LIKE 'foobar'

and

UPDATE Users WITH (ROWLOCK) SET Username = 'fred' WHERE Username = 'foobar'

NOLOCK and ROWLOCK

NOLOCK

Using NOLOCK politely asks SQL Server to ignore locks and read directly from the tables. This means you completely circumvent the lock system, which is a major performance and scalability improvement. However, you also completely circumvent the lock system, which means your code is living dangerously. You might read the not-necessarily-valid uncommitted modifications of a running transaction. This is a calculated risk.

ROWLOCK

Using ROWLOCK politely asks SQL Server to only use row-level locks. You can use this in SELECT, UPDATE, and DELETE statements, but I only use it in UPDATE and DELETE statements. You’d think that an UPDATE in which you specify the primary key would always cause a row lock, but when SQL Server gets a batch with a bunch of these, and some of them happen to be in the same page (depending on this situation, this can be quite likely, e.g. updating all files in a folder, files which were created at pretty much the same time), you’ll see page locks, and bad things will happen. And if you don’t specify a primary key for an UPDATE or DELETE, there’s no reason the database wouldn’t assume that a lot won’t be affected, so it probably goes right to page locks, and bad things happen.

By specifically requesting row-level locks, these problems are avoided. However, be aware that if you are wrong and lots of rows are affected, either the database will take the initiative and escalate to page locks, or you’ll have a whole army of row locks filling your server’s memory and bogging down processing. One thing to be particularly aware of is the “Management/Current Activity” folder with Enterprise Manager. It takes a long time to load information about a lot of locks. The information is valuable, and this technique is very helpful, but don’t be surprised if you see hundreds of locks in the “Locks/Processes” folder after employing this technique. Just be glad you don’t have lock timeouts or deadlocks.

Notes:
I get the sense that SQL Server honors NOLOCK requests religiously, but is more discretional with ROWLOCK requests. You can only use NOLOCK in SELECT statements. This includes inner queries, and the SELECT clause of the INSERT statement. You can and should use NOLOCK in joins:

SELECT COUNT(Users.UserID) FROM Users WITH (NOLOCK)
JOIN UsersInUserGroups WITH (NOLOCK) ON Users.UserID = UsersInUserGroups.UserID

Conclusion

Results
Use a low isolation level (also a good idea: NOLOCK takes this to an extreme), and use bound connections to allow processes to (share locks and) cooperate (sounds like a very complicated bad idea).

Big Disclaimer
Use these techniques with caution and discretion. The way I approached it was to look at all my stored procedures and ad hoc queries, and based on my understanding of where and how they were used, I decided whether it would be acceptable for the caller or user to get possibly incorrect results for NOLOCK, and whether it was likely that more than a few dozen rows would be locked with ROWLOCK. In almost all cases it was fine, but maybe for your code you should be more careful. You might need to produce separate procedures based on whether or to lock, and how to lock. There are other incantations (PAGLOCK, TABLOCK) which you might want to use when you know the UPDATE or DELETE query will affect many rows.

Hello world!

November 7, 2006 1 comment

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!

Categories: Uncategorized