Archive

Archive for the ‘Database’ Category

Stored Procedures are EVIL

October 1, 2009 4 comments

I found this interesting article and thought I’d share it.

Stored Procedures are EVIL – by Tony Marston

A lot of developers are taught to use database stored procedures, triggers and database constraints at every possible opportunity, and they cannot understand why an old dinosaur like me should choose to take an opposite view. The reason can be summed up quite simply:

You only know what you have been taught, whereas I know what I have learned.

I was weaned on file systems and databases which did not have any facilities for stored procedures and triggers, so I learned how to build applications without them. When such facilities became available my colleagues and I still never used them for practical reasons:

  • It meant learning a new language, and we didn’t have the time.
  • It meant taking longer to implement and maintain, therefore cost more to develop. This is an important consideration for a software house which can only win business by providing cost-effective solutions.
  • There was no advantage in doing so, so why bother?

Our golden rule was:

Use stored procedures and triggers only when absolutely necessary.

This is in total conflict with the attitude of today’s wet-behind-the-ears tenderfoot greenhorn who seems to think:

Use stored procedures and triggers at every possible opportunity simply because you can.


Amongst the arguments in favour of stored procedures are:

Stored procedures are not as brittle as dynamic SQL

Some people argue that putting ad-hoc SQL in your business layer (BL) code is not a good idea. Agreed, but who said that the only alternative is stored procedures? Why not have a DAL that generates the SQL query at runtime based on information passed to it by the BL? It is correct to say that small changes to the database can have severe impacts on the application. However, changes to a relational model will always have an impact on the application that targets that model: add a non-nullable column to a table and you will see what I mean.

Whether you use stored procedures or ad-hoc queries, you have to change the calling code to make sure that column gets a value when a new row is inserted. For ad-hoc queries, you change the query and you’re set. For stored procedures, you have to change the signature of the stored procedure, since the INSERT/UPDATE procs have to receive a value for the new column. This can break other code targeting the stored procedure as well, which is a severe maintenance issue. A component which generates the SQL on the fly at runtime doesn’t suffer from this: it will, for example, receive an entity which has to be saved to the database, that entity contains the new field, the SQL is generated and the entity is saved. No maintenance problems. With a stored procedure this wouldn’t be possible.
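A minimal T-SQL sketch of that scenario (the customer table, its columns and the procedure are hypothetical, purely for illustration):

CREATE TABLE dbo.customer (customer_id INT IDENTITY(1,1) PRIMARY KEY, name VARCHAR(50) NOT NULL)
GO

CREATE PROCEDURE dbo.insert_customer @name VARCHAR(50)
AS
INSERT INTO dbo.customer (name) VALUES (@name)
GO

-- Add a non-nullable column (a default keeps existing rows valid).
ALTER TABLE dbo.customer ADD email VARCHAR(100) NOT NULL DEFAULT ''
GO

-- Dynamically generated SQL simply picks up the new column from the entity.
-- The stored procedure, however, needs a new signature, and every caller must change with it:
ALTER PROCEDURE dbo.insert_customer @name VARCHAR(50), @email VARCHAR(100)
AS
INSERT INTO dbo.customer (name, email) VALUES (@name, @email)
GO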

Stored procedures are more secure

This is a common argument that many people echo without realising that it became defunct when role-based security was made available. A good DBA defines user roles in the database; users are added to those roles, and rights are defined per role, not per user. This way it is easy to control which users can insert or update, and which users can, for example, select, delete, or access views.
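For illustration, a hedged sketch of role-based rights in T-SQL (the role, table and user names are invented):

CREATE ROLE order_entry
GO
GRANT SELECT, INSERT, UPDATE ON dbo.orders TO order_entry
DENY DELETE ON dbo.orders TO order_entry
GO
-- every user added to the role gets exactly these rights
EXEC sp_addrolemember 'order_entry', 'some_user'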

With a view it is possible to control which data is accessed on a column or row basis. This means that if you want user U to select only two or so columns from a table, you can give that user access to a view, not the underlying table. The same goes for rows in one or more tables: create a view which shows those rows, filtering out the others, and give access rights to the view, not the table, obviously using user roles. This way you can limit access to sensitive data without having to compromise your programming model by moving to stored procedures.
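A small sketch of the view approach, again with invented names:

-- expose only two columns and only active rows; grant rights on the view, not the table
CREATE VIEW dbo.customer_contact
AS
SELECT customer_id, name
FROM dbo.customer
WHERE is_active = 1
GO

GRANT SELECT ON dbo.customer_contact TO support_role
DENY SELECT ON dbo.customer TO support_role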

It is also said that stored procedures are more secure because they prevent SQL injection attacks. This argument is false for the simple reason that it is possible to have a stored procedure which concatenates strings together and therefore opens itself up to SQL injection attacks (generally seen in systems which use procedures and have to offer some sort of general search routine). The use of parameterized queries, on the other hand, removes this vulnerability, as no value can end up as part of the actual query string.
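A sketch that makes the point concrete (procedure and table names are invented): the first procedure is injectable even though it is a stored procedure, while the parameterized call is not:

-- a stored procedure that concatenates its input is still open to injection
CREATE PROCEDURE dbo.search_customers @name VARCHAR(100)
AS
DECLARE @sql NVARCHAR(MAX)
SET @sql = N'SELECT * FROM dbo.customer WHERE name = ''' + @name + ''''
EXEC (@sql)   -- passing @name = ''' OR 1=1 -- ' returns every row
GO

-- a parameterized query: the value can never become part of the query text
EXEC sp_executesql
    N'SELECT * FROM dbo.customer WHERE name = @name',
    N'@name VARCHAR(100)',
    @name = 'O''Brien'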

Stored procedures are more efficient

The execution of SQL statements in stored procedures may have been faster than with dynamic SQL in the early days of database systems, but that advantage has all but disappeared in the current versions. In some cases a stored procedure may even be slower than dynamic SQL, so this argument is as dead as a Dodo.

Performance should not be the first question. My belief is that most of the time you should focus on writing maintainable code. Then use a profiler to identify hot spots and then replace only those hot spots with faster but less clear code. The main reason to do this is because in most systems only a very small proportion of the code is actually performance critical, and it’s much easier to improve the performance of well factored maintainable code.

While stored procedures may run faster, they take longer to build, test, debug and maintain, therefore this extra speed comes at a price. If the same function can be performed inside application code at an acceptable speed, what is the advantage of spending more money to make it run faster at a more-than-acceptable speed? It is OK to use stored procedures when you absolutely need a performance gain, but until then they’re nothing but premature optimization.

The company has paid for them, so why not use them?

A similar argument is that by not using what the company has paid for, you are effectively wasting the company’s money. I’m sorry, but using something because it’s there is just not good enough. If I can achieve something inside my application with application code, then I must be given a very good reason to move it out of my application and into the database. Believe it or not there are costs involved in moving logic from one place to another, and those costs must be offset by measurable benefits.

Application code or database code – it’s still code, isn’t it?

No it’s not. Application code is built using a programming language whereas SQL is nothing more than a data manipulation language, and is therefore very limited in its scope. There is absolutely nothing that can be done in a stored procedure that cannot also be done in application code, but the converse is not true.


Amongst the arguments against stored procedures are:

It mangles the 3 Tier structure

Instead of having a structure which separates concerns in a tried and trusted way – GUI, business logic and storage – you now have logic intermingling with storage, and logic on multiple tiers within the architecture. This causes potential headaches down the road if that logic has to change.

Stored procedures are a maintenance problem

The reason for this is that stored procedures form an API by themselves. Changing an API is risky, as it will break a lot of code in some situations. Adding new functionality or new procedures is the “best” way to extend an existing API, and a set of stored procedures is no different. This means that when a table changes, or the behaviour of a stored procedure changes and it requires a new parameter, a new stored procedure has to be added.

This might sound like a minor problem, but it isn’t, especially when your system is already large and has run for some time. Every system developed runs the risk of becoming a legacy system that has to be maintained for several years. This takes a lot of time, because the communication between the developer(s) who maintain/write the stored procedures and the developer(s) who write the DAL/BL code has to be intense: a new stored procedure will be saved fine, but it will not be called correctly until the DAL code is altered. When you have dynamic SQL in your BL at your hands, this is not a problem: you change the code there, create a different filter, whatever you like and whatever fits the functionality to implement.

Microsoft also believes stored procedures are over: its next-generation business framework MBF is based on ObjectSpaces, which generates SQL on the fly.

Stored procedures take longer to test

Business logic in stored procedures is more work to test than the corresponding logic in the application. Referential integrity will often force you to set up a lot of other data just to be able to insert the data you need for a test (unless you’re working in a legacy database without any foreign key constraints). Stored procedures are inherently procedural in nature, which makes isolated tests harder to create and makes the code prone to duplication. Another consideration, and this matters a great deal in a sizable application, is that any automated test that hits the database is slower than a test that runs inside the application. Slow tests lead to longer feedback cycles.

BL in stored procedures does not scale

If all the business logic is held in the database instead of the application then the database becomes the bottleneck. Once the load starts increasing the performance starts dropping. With business logic in the application it is easy to scale up simply by adding another processor or two, but that option is not readily available if all that logic is held in the database.

If you have a system with hundreds of distributed databases, it is far more difficult to keep all those stored procedures and triggers synchronized than it is to keep the application code synchronized.

Stored procedures are not customisable

This is a big issue if you want an application where the customer can insert their own business logic, or where different logic is required by different customers. Achieving this with application code is a piece of cake, but with database logic it is a can of worms.

Database triggers are hidden from the application

A big problem with database triggers is that the application does not know that they exist, therefore does not know whether they have run or not. This became a serious issue in one application (not written by me) which I was maintaining. A new DBA who was not aware of the existence of all these triggers did something which deactivated every trigger on the main database. The triggers were still there, they had not been deleted, but they had been turned off so did not fire and do what they were supposed to do. This mistake took several hours to spot and several days to fix.
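For reference, this is how little it takes; a sketch assuming a hypothetical dbo.orders table:

-- switches off every DML trigger on the table without dropping anything
DISABLE TRIGGER ALL ON dbo.orders
-- ... and later, to switch them back on
ENABLE TRIGGER ALL ON dbo.orders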

Version Control

It is easy to control all changes to application code by running it through a proper version control system, but those facilities do not exist for stored procedures and triggers. How much damage could be caused if a stored procedure were to get out of sync with the application code? How easy is it to check that the application is running with the correct versions? How much more difficult would it be if the application you were supporting was running on a remote site with nothing more than a dial-up connection?

This is a reason why some teams avoid stored procedures like the plague – it eliminates an area of potentially disastrous screw-ups.

Vendor lock-in

You may think that this is not a problem if you build and maintain the applications for a single company where a change in database vendor is highly unlikely, but what happens should the company decide that their DBMS is no longer flavour of the month and they want to change to a different DBMS? This may be due to several factors, such as spiraling costs or poor performance, but when it happens you will find that a lot of code will have to be rewritten. Porting the data will be one exercise, but porting the stored procedures and triggers will be something else entirely. Now, if all that logic were held inside the application, how much simpler would it be?

Believe it or not there are people out there who write applications which are database-independent for the simple reason that the applications may be used by many different companies, and those many companies may not all use the same DBMS. Those that do use the same DBMS may not be using the same version, and stored procedures written for one version may not be compatible with another.


As far as I am concerned the use of stored procedures, database triggers and foreign key constraints is OPTIONAL, not MANDATORY, therefore I am free to exercise my option not to use them. That is my choice, and the software that I produce does not suffer in any way, therefore it cannot be defined as the wrong choice.

The web application framework that I have built using PHP does not use stored procedures, database triggers or foreign key constraints, yet it does not suffer from any lack of functionality. This is possible simply because I can do everything I want inside my application where it is instantly accessible and customisable. To those of you who instantly jump to the (wrong) conclusion that this must mean that I have to write a huge amount of duplicated SQL statements my answer is simple – I don’t write any SQL statements at all, they are all generated dynamically at runtime. This is all due to the framework being built using the 3 Tier Architecture which has a clear separation of concerns:

  • There is a separate object in the Business Layer for each database table. This is where all business rules are applied as data passes from the Presentation Layer (UI), through the Business Layer to the Data Access Layer, and back again. The Business Layer does not have any direct communication with the database – this is all handled by the Data Access Layer.
  • There is a single object in the Data Access Layer known as the Data Access Object (DAO). The DAO receives a request from the Business Layer and dynamically constructs and executes the SQL query string to satisfy that request. This implementation means that I can easily switch to another DBMS simply by switching to another DAO, and without having to change a single line of code in any Business Layer object.
  • Referential integrity is also handled by standard code within the framework and requires no additional coding from any developer whatsoever. It uses information which is exported from the Data Dictionary which tells it what to do with every relationship, and the standard code in the framework simply performs the relevant processing. The advantage of this approach is that it is easy to amend or even turn off any of these rules at runtime, which makes the application infinitely more flexible.
  • All changes made to the database can be logged without using a single database trigger. How? By adding extra code into the DAO to write all relevant details out to the AUDIT database. This functionality is totally transparent to all the objects in the Business Layer, and they do not need any extra code to make it work.

How do I estimate database growth?

December 2, 2006 5 comments

Unfortunately, SQL Server does not come with any built-in tools to estimate database size.
If you don’t have a third-party estimating tool, you have to do it the hard way, which is by calculating the space taken up by each table, and then adding the total space needed for each table to get a grand total for the database.

Here are some guidelines for estimating the space needed for a database.

  1. For each table, find out how many bytes each row will use up on average. It is easy to calculate the size of fixed length columns, but calculating the space used by variable length fields is much more difficult. About the only way to do this is to get some of the actual data that will be stored in each variable length column, and then based on the data you have, estimate the average byte length of each variable length column. Once you know the typical sizes of each column, you can then calculate the size of each typical row in your table.
  2. The above step is a good start, but it usually significantly underestimates the amount of space you need. Besides the actual data, you must also estimate the type and number of indexes you will use for each table. Indexes can use a huge amount of space, and you must estimate how much space you think they will take. This is a function of the type of index (clustered or non-clustered), the number of indexes, and the width of the indexes.
  3. Besides estimating the size of the indexes, you also must take into consideration the Fillfactor and Pad Index used when the indexes are created. Both of these affect how much empty space is left in an index, and this empty space must be included in your estimate.
  4. And one more factor affecting how much space it takes to store data in a table is how many rows fit onto one SQL Server 8K data page. Depending on the size of each row, it is likely that not all of the space in each data page is fully used. This must also be accounted for when estimating the total space required.
  5. While tables, and their associated indexes take up most of the physical space in most databases, keep in mind that every object in SQL Server takes up space, and must be accounted for.

As you can see, without a tool to help out, manually estimating database size is not a fun task, and it is, at best, only a rough estimate.
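To make the arithmetic concrete, here is a back-of-the-envelope example for a hypothetical table of 1,000,000 rows (all sizes below are assumptions, not measurements):

  Average row size: 60 bytes (fixed columns) + ~140 bytes (variable columns, estimated) + ~10 bytes row overhead ≈ 210 bytes
  Rows per page: ~8,060 usable bytes on an 8K data page / 210 bytes ≈ 38 rows
  Space needed: 1,000,000 rows / 38 rows per page ≈ 26,316 pages × 8KB ≈ 206 MB

...and that is before indexes, fillfactor free space and other objects are added.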

Another option you might consider, assuming that you already have an existing database with data, is to extrapolate the current size of your database on a table-by-table basis. For example, if you know that a particular table has 100,000 rows and is 1MB in size, then, assuming that neither the indexing nor the fillfactor changes, when the table gets to 200,000 rows it should be about 2MB in size. If you do this for every table, you can get a fairly good idea of how much disk space you will need in the future. To find out how much space a particular table uses, run this command:

sp_spaceused 'table_name'

Categories: Database, SQL Server

What happens when my integer IDENTITY runs out of scope?

December 1, 2006 2 comments

Before we actually look at the answer, let’s recall some basics of the IDENTITY property and SQL Server’s numerical data types.

You can define the IDENTITY property on columns of the integer data types (TINYINT, SMALLINT, INT, and BIGINT) and on DECIMAL or NUMERIC columns with a scale of 0. This gives you a range of:

TINYINT: 0 to 255

SMALLINT: -32,768 to 32,767

INT: -2,147,483,648 to 2,147,483,647

BIGINT: -2^63 to 2^63-1 (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

When you decide to use the DECIMAL data type with a scale of 0, you have a potential range from -(10^38-1) to 10^38-1.

So, keeping this in mind, we’re now ready to answer the original question here. What happens when an INTEGER IDENTITY value is about to run out of scope?

CREATE TABLE id_overflow ( col1 INT IDENTITY(2147483647,1) )

GO

INSERT INTO id_overflow DEFAULT VALUES
INSERT INTO id_overflow DEFAULT VALUES
SELECT * FROM id_overflow

DROP TABLE id_overflow

(1 row(s) affected)
Server: Msg 8115, Level 16, State 1, Line 2
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

This script creates a simple table with just one column of type INT and defines the IDENTITY property on that column. But instead of adding more than 2 billion rows to the table, we simply set the seed value to the positive maximum of an INTEGER. The first row inserted is assigned that value, and nothing unusual happens. The second insert, however, fails with the above error. Apparently SQL Server does not start all over again or try to fill any existing gaps in the sequence. Actually, SQL Server does nothing automatically here; you have to handle it yourself. But what can you do in such a case?

Probably the easiest solution is to alter the data type of the column to BIGINT, or maybe right on to DECIMAL(38,0) like so:

CREATE TABLE id_overflow ( col1 INT IDENTITY(2147483647,1) )

GO

INSERT INTO id_overflow DEFAULT VALUES
ALTER TABLE id_overflow ALTER COLUMN col1 BIGINT
INSERT INTO id_overflow DEFAULT VALUES

SELECT * FROM id_overflow

DROP TABLE id_overflow

col1

——————–

2147483647
2147483648

(2 row(s) affected)

If you know in advance that your table needs to keep that many rows, you can do:

CREATE TABLE bigint_t ( col1 BIGINT IDENTITY(-9223372036854775808, 1) )

GO

INSERT INTO bigint_t DEFAULT VALUES

SELECT * FROM bigint_t

DROP TABLE bigint_t

col1

——————–

-9223372036854775808

(1 row(s) affected)

Or the DECIMAL(38,0) variation:

CREATE TABLE decimal_t ( col1 DECIMAL(38,0) IDENTITY(-99999999999999999999999999999999999999, 1) )

GO

INSERT INTO decimal_t DEFAULT VALUES

SELECT * FROM decimal_t

DROP TABLE decimal_t

col1

—————————————-

-99999999999999999999999999999999999999

(1 row(s) affected)

One might be distressed in one’s aesthetic taste by those negative numbers, but it is a fact that you now shouldn’t have to worry about running out of values for quite some time.
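Before it comes to that, it can be worth checking how much headroom an existing INT IDENTITY column still has; a minimal sketch, assuming a hypothetical dbo.orders table:

SELECT IDENT_CURRENT('dbo.orders') AS last_identity_value,
       2147483647 - IDENT_CURRENT('dbo.orders') AS values_remaining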

Categories: Database, SQL Server

DELETE and TRUNCATE explained.

November 24, 2006 2 comments

DELETE logs the data for each row affected by the statement in the transaction log and physically removes the row from the file, one row at a time. The recording of each affected row can cause your transaction log to grow massively if you are deleting huge numbers of rows. However, when you run your databases in full recovery mode, detailed logging is necessary for SQL Server to be able to recover the database to the most recent state, should a problem arise. The fact that each row is logged explains why DELETE statements can be slow.

TRUNCATE is faster than DELETE due to the way TRUNCATE “removes” rows. Actually, TRUNCATE does not remove data, but rather deallocates whole data pages and removes pointers to indexes. The data still exists until it is overwritten or the database is shrunk. This action does not require a lot of resources and is therefore very fast. It is a common mistake to think that TRUNCATE is not logged. This is wrong. The deallocation of the data pages is recorded in the log file. Therefore, BOL refers to TRUNCATE operations as “minimally logged” operations. You can use TRUNCATE within a transaction, and when this transaction is rolled-back, the data pages are reallocated again and the database is again in its original, consistent state.

Some limitations do exist for using TRUNCATE.

  • You need to be db_owner, ddl_admin, or the owner of the table to be able to issue a TRUNCATE statement.
  • TRUNCATE will not work on tables that are referenced by one or more FOREIGN KEY constraints.

So if TRUNCATE is so much faster than DELETE, should one use DELETE at all? Well, TRUNCATE is an all-or-nothing approach. You can’t truncate just those rows that match certain criteria. It’s either all rows or none.

You can, however, use a workaround here. Suppose you want to delete more rows from a table than will remain. In this case you can export the rows that you want to keep to a temporary table, run the TRUNCATE statement, and finally reimport the remaining rows from the temporary table. If your table contains a column with the IDENTITY property defined on it, and you want to keep the original IDENTITY values, be sure to enable IDENTITY_INSERT on the table before you reimport from the temporary table. Chances are good that this workaround is still faster than a DELETE operation.
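A sketch of that workaround, assuming a hypothetical dbo.orders table with an IDENTITY column order_id, a customer_id column and a filter on order_date:

-- keep only the rows we still need
SELECT * INTO #keep FROM dbo.orders WHERE order_date >= '20061101'

TRUNCATE TABLE dbo.orders

-- reimport, preserving the original IDENTITY values
SET IDENTITY_INSERT dbo.orders ON
INSERT INTO dbo.orders (order_id, customer_id, order_date)
SELECT order_id, customer_id, order_date FROM #keep
SET IDENTITY_INSERT dbo.orders OFF

DROP TABLE #keep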

You can also set the recovery model to “Simple” before you start this workaround, and then back to “Full” once it is done. However, keep in mind that in this case you might only be able to recover to the last full backup.

Categories: Database, SQL Server

Understanding NULLS

November 23, 2006 Leave a comment

Some SQL programmers and developers tend to think of NULL as zero or blank. In fact, NULL is neither of these. NULL literally means that the value is unknown or indeterminate.

One side effect of the indeterminate nature of a NULL value is that it cannot be used in a calculation or a comparison.
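A quick script makes this behaviour visible (the temporary table and its values are just for illustration):

CREATE TABLE #t (val INT NULL)
INSERT INTO #t VALUES (10)
INSERT INTO #t VALUES (20)
INSERT INTO #t VALUES (NULL)

-- NULL never equals NULL, so this returns 'not comparable'
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not comparable' END AS comparison

-- COUNT(*) counts 3 rows; COUNT(val) and AVG(val) ignore the NULL (AVG is 15, not 10)
SELECT COUNT(*) AS all_rows, COUNT(val) AS non_null_vals, AVG(val) AS avg_val FROM #t

-- the only way to find the NULL row is IS NULL
SELECT * FROM #t WHERE val IS NULL

DROP TABLE #t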

Listed below are a few important rules to remember about the behaviour of NULL values.

  • A NULL value cannot be inserted into a column defined as NOT NULL.
  • NULL values are not equal to each other. It is a frequent mistake to compare two columns that contain NULL and expect the NULL values to match. (A NULL column can be identified in a WHERE clause or in a boolean expression using phrases such as ‘value is NULL’ and ‘value is NOT NULL’.)
  • A column containing a NULL value is ignored in the calculation of aggregate values such as AVG, SUM or MAX.
  • When columns that contain NULL values are listed in the GROUP BY clause of a query, the query output contains a row for those NULL values.
  • Joins between tables in which one join column contains values and the other contains NULLs are governed by the rules for “outer joins”.
Categories: Database, SQL Server

Difference between UNIQUE constraint and PRIMARY key

November 22, 2006 52 comments

A UNIQUE constraint is similar to PRIMARY key, but you can have more than one UNIQUE constraint per table.

When you declare a UNIQUE constraint, SQL Server creates a UNIQUE index to speed up the process of searching for duplicates. In this case the index defaults to NONCLUSTERED index, because you can have only one CLUSTERED index per table.

* The number of UNIQUE constraints per table is limited by the number of indexes on the table, i.e. 249 NONCLUSTERED indexes and one possible CLUSTERED index.

Unlike a PRIMARY KEY, a UNIQUE constraint can accept NULL, but just once. If the constraint is defined on a combination of columns, every column can accept NULL or hold a value, as long as the combination of values is unique.
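A quick illustration using a temporary table:

CREATE TABLE #u (email VARCHAR(100) NULL UNIQUE)
INSERT INTO #u VALUES ('a@example.com')   -- ok
INSERT INTO #u VALUES (NULL)              -- ok: the one allowed NULL
INSERT INTO #u VALUES (NULL)              -- fails: violation of the UNIQUE constraint
DROP TABLE #u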

Categories: Database, SQL Server

How do you find the Second highest Salary?

November 14, 2006 158 comments

This is the most common question asked in interviews.

The EMPLOYEE table has the fields EMP_ID and SALARY. How do you find the second highest salary?

Answer:

We can write a sub-query to achieve the result:

SELECT MAX(SALARY) FROM EMPLOYEE WHERE SALARY NOT IN (SELECT MAX(SALARY) FROM EMPLOYEE)

The sub-query in the WHERE clause returns the maximum SALARY in the table; the outer query then selects the maximum SALARY from the rows that do not contain that value, which is the second highest salary.
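Two alternative ways of writing this, shown as a sketch (the second form generalises to the Nth highest salary by changing the TOP value; DISTINCT guards against ties):

SELECT MAX(SALARY) FROM EMPLOYEE
WHERE SALARY < (SELECT MAX(SALARY) FROM EMPLOYEE)

SELECT MIN(SALARY) FROM (SELECT DISTINCT TOP 2 SALARY FROM EMPLOYEE ORDER BY SALARY DESC) AS t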

Categories: Database, SQL Server