Microsoft Articles

Data Quality Services (DQS)–SQL Server 2012

Business Intelligence is all about taking better decisions and decisions can only be as accurate as the data we have at the time of decision making. Data quality is mostly about ensuring there are no typos in data. E.g. if province name is “British Columbia” then it should be consistent everywhere and not use “BC”. At times data can be inconsistent e.g.

  1. Receiving data in an txt / csv / excel / xml where the operator entered data by hand.
  2. Data entry application has lots of open-ended text boxes without field validation in place.
  3. Data is received from 3rd party company and they have different conventions.

DQS makes life a lot easier for both data stewards and developers. Data steward is a person who’s responsible for manually fixing the typos. Developer is a person who can use bunch of tools to automate data cleansing. DQS provides tools caters to both the data stewards and developers.

DQS comes with various components:

  1. Server side knowledgebase.
  2. Client side tool for creating adding data to the knowledgebase
  3. SSIS Data Cleansing Transformation that can be used by SSIS developers for automating data cleansing

For working with DQS there are various steps involved:

Step 1: Create Knowledgebase

Knowledgebase is a repository for domains. Domain is basically a column or a field where you define the possible values for that domain. To create the knowledgebase we can use the DQS Client tool which looks like this:

2013-11-21 5-29-58 PM

Step 2: Add Domains to the Knowledgebase

Add at least one domain to the knowledgebase. Here I have added two domains.

2013-11-21 5-32-55 PM - Copy

Knowledgebase resides in SQL Server databases that are creating by DQS. Here’s a list of databases that’s usually created by DQS:

dqs

Step 3: Populate Domains

After creating the domains we have to add data to the domains either manually or we can import it from Excel spread sheet or SQL server. In case if you want to import data from Oracle or some other source then you want to convert it to Excel / SQL first by using SSIS or any ETL tool. Here I have populated the domain by hand:

2013-11-21 5-32-55 PM

If you want to import data from Excel / SQL then we can use the following option:

2013-11-21 5-33-05 PM

Step 4: Perform Data Cleansing

Once we have the knowledgebase in place we can perform data cleansing. We can either do it manually or we can automate it by using SSIS. If you want to clean it manually then we can use the same DQS client tool that we used for creating the knowledgebase. In DQS client tool we can create a project and specify Excel / SQL database as the source. When we run the project it will let us output both the original column value and the fixed column value. Then from here we can choose to save the fixed data to Excel / SQL database.

2013-11-21 5-33-32 PM

In case if you want to automate data cleansing as part of your ETL operation then we can use DQS Cleansing Transformation which is new in SSIS 2012. Here’s how it looks like:

2013-11-21 5-32-44 PM

DQS cleansing transformation makes use of the knowledgebase that we defined earlier on. It’s never 100% automation since we do have to update the knowledgebase whenever new typos are discovered. It’s on going activity and over period of time the knowledgebase eventually becomes mature enough to fix most of the typos.

DQS Pattern Matching

On top of fixing the typos DQS can also be used for finding patterns e.g. we could have a mailing address that’s combination of various fields like street number, avenue, postal code etc. I will leave the details for some other time.

Summary

SQL Server 2012 received the first iteration of DQS and it’s really cool. It makes fixing typos a lot easier both for data stewards and developers. It will be exciting to see further improvements in the next version.

, ,

No Comments

Transparent Data Encryption in SQL Server 2008, 2008R2, 2012 (Part 2)

If you haven’t read Part 1 of this series then you can read it here. Data encryption allows us to encrypt data at field level. It’s done by the developers either in stored procedures or in in-line T-SQL code.

Transparent Data Encryption (TDE) allows DBAs to encrypt the whole database at SQL instance level. Once TDE is enabled at database level data is transparently encrypted and there’s no noticeable performance overhead. Developers don’t have to write any code on their end to encrypt / decrypt data. TDE encrypts both the database data files and the backups. TDE sort of binds the database to the SQL instance, not to the physical machine, where TDE is enabled.  Which means if someone steals the mdf / ndf / ldf files or the .bak files then it’s pointless since no one can attach / restore these files on any other instance of SQL server.

Before we discuss the technical details for enabling TDE it’s very important to know that both data encryption and TDE secure the data at the physical layer level i.e. hard drive. Data is not encrypted when it’s transmitting over the network. For encrypting the communication / network layer we have to encrypt the connections to the database engine which is primarily done by using configuration manager and certificates. I will leave network / communication level encryption for some other day. For now we are going to encrypt data at the physical layer level so that if the physical files (mdf / ndf / ldf / bak) files are stolen or “misplaced” then they shouldn’t be usable on any other instance / machine.

The overall process for with TDE requires creation of master key, certificates and enabling various TDE related options. Here I will guide you through the whole process. If you are following it then please ensure you perform them on a test / dev machine. DISCLAIMER: If these steps fry your machine, make you bald headed or gets your dog abducted by aliens then I won’t be held responsible.

Step 1:

To work with TDE your first step is to create a master key at the SQL instance level i.e. in the master database. Master key is used for creating certificates at the SQL level.

USE master

GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = ‘Pa$$w0rd’

GO

Step 2:

Create certificate that will be used for enabling TDE. Note: initially certificate is created in the master database and then assigned to a custom database

CREATE CERTIFICATE TestCert WITH SUBJECT = ‘TestCert’

GO

Step 3:

Now before we proceed with the actual encryption we should take the backup of the certificate and the encryption key right away otherwise if we lose the TDE enabled database then it will be nothing more than virtual paper weight, meaning we can’t decrypt the encrypted database without the right certificate. Notice we are taking the backup of the key and encrypting the backup with a password so that if the key gets stolen then it should not be recoverable without the password.

BACKUP CERTIFICATE TestCert TO FILE = ‘D:\backup\cert.bak’

WITH PRIVATE KEY ( FILE = ‘D:\backup\key.bak’ , ENCRYPTION BY PASSWORD = ‘KeyPa$$w0rd’ )

Step 4:

Now we are going to create the encryption key in our custom database (SalesDB) by using the certificate that was generated at the instance level. Since certificates rely on asymmetric cryptography so it means that one key remains in the instance and second key is in our database. This is what that sort of binds the database to the instance.

USE SalesDB

GO

CREATE DATABASE ENCRYPTION KEY

WITH ALGORITHM = AES_128

ENCRYPTION BY SERVER CERTIFICATE TestCert

GO

Step 5:

Time to flip the switch and turn on TDE at the database level.

ALTER DATABASE SalesDB

SET ENCRYPTION ON

GO

Houston we have a lift off. TDE has been enabled. All the applications will be able to access data from the database seamlessly without requiring manual encryption / decryption. To test TDE try these:

1. Try taking the database offline, then copy the mdf / ldf files to some other instance and attach the files. It will tell you that it’s not possible to attach it to some other instance since certificate is not found on that instance.

2. Try taking the backup of the TDE enabled database then restore it on some other instance. Again you will be out of luck and won’t be able to restore since the second instance doesn’t have the certificate that was used for encrypting the data.

 

Restoring certificate / legitimately copying TDE enabled database to another instance

Say that there was a disaster and the whole SQL instance, where TDE enabled database was residing, is lost. You will end up installing SQL server and then you will want to restore your TDE enabled database. Or, say that you want to copy the TDE enabled database, legitimately, to some other SQL server instance. How do we accomplish that? Well for that we have to ensure that the certificate that was used to encrypt the database exists at the instance level. Here are the steps that will be required for successful restoration / copying of database on some other instance.

 

Step 1:

We have to restore the certificate that was backed up previously.

USE master

GO

CREATE CERTIFICATE TestCert FROM FILE = ‘D:\backup\cert.bak’ WITH PRIVATE KEY ( FILE = ‘D:\backup\key.bak’, DECRYPTION BY PASSWORD = ‘KeyPa$$w0rd’ )

 

And that should do. Now you should be able attach your TDE enabled mdf / ldf or restore your .bak file

 

TDE is not for me. Get me out of here

If you don’t like TDE or if you want to disable TDE at the database level then run the following script:

ALTER DATABASE SalesDB

SET ENCRYPTION OFF

GO

 

TDE Pros & Cons

Pros:

  1. Encrypts the whole database and sort of binds it to the instance.
  2. There’s almost no performance overhead.
  3. Encryption / decryption is seamlessly done by SQL server. No manual coding is required.

Cons:

  1. It doesn’t prevent “man in the middle” attacks. Data is not encrypted when it’s getting transmitted over the network.
  2. Requires Enterprise edition
  3. Requires discipline on behalf of DBA. Certificate backup is mandatory otherwise without it there’s no way to recover the TDE enabled database.
  4. Older hardware might notice some performance overhead since encryption / decryption is taking place in the background.

, , ,

2 Comments

Data Encryption in SQL Server 2008, 2008R2, 2012 (Part 1)

Data is very important asset. If it’s stolen or “misplaced” then that can cost a fortune to the company. SQL Server provides encryption that can avoid the embarrassment and legal issues if data is stolen. Here we will discuss two important levels at which SQL Server supports encryption. In this part 1 of the blog post we will explore column level encryption. In Part 2 we will explorer encryption at the entire database level.

1. Data Encryption (column level encryption)

You can decide and select the columns that are of utmost importance and encrypt the data for those columns. Benefit is that if you have low end server then encrypting and decrypting hand picked columns won’t hurt the performance. Example of candidate columns for encryption would be password, credit card number, SIN number (Canadian, eh Smile), account number, PIN number etc.

Let’s create a sample database and table and apply encryption at column level

  1: CREATE DATABASE Bank
  2: GO
  3: USE Bank
  4: GO
  5: CREATE TABLE Account
  6: (
  7:   AccountId int NOT NULL IDENTITY(1,1) PRIMARY KEY,
  8:   AccountNumber varchar(50),
  9:   EncryptedAccountNumber varbinary(128)
 10: )
 11: GO
 12: 

AccountNumber will contain the data without any encryption where as EncryptedAccountNumber will contain the encrypted version of AccountNumber.

Now our next step is to create master key. Master key will be used for creating digital certificate and certificate will be used for generating the symmetric key that will finally be used for encrypting data in individual columns.

Create master key:

  1: CREATE MASTER KEY ENCRYPTION BY 
  2: PASSWORD = 'Pa$$w0rd'

 

Create digital certificate

  1: CREATE CERTIFICATE BankCert
  2:    WITH SUBJECT = 'Account Numbers';
  3: GO

Create symmetric key for encrypting data

  1: CREATE SYMMETRIC KEY BankAccountKey
  2:     WITH ALGORITHM = AES_256
  3:     ENCRYPTION BY CERTIFICATE BankCert;
  4: GO

For encrypting the data we use the EncryptByKey function and in order to use this function we have to first open the symmetric key that we created in one of the previous steps. Let’s insert some records:

  1: OPEN SYMMETRIC KEY BankAccountKey
  2:    DECRYPTION BY CERTIFICATE BankCert;
  3: 
  4: --insert original and encrypted values
  5: INSERT INTO Account 
  6: VALUES 
  7: ('123456789', EncryptByKey(Key_GUID('BankAccountKey'), '123456789')),
  8: ('987654321', EncryptByKey(Key_GUID('BankAccountKey'), '987654321'))
 
Let’s query the table and find out how the encrypted data looks like:
image

In the 3rd column we can see that Account Number is encrypted. Original data occupies 9 bytes but encrypted data is occupying 68 bytes. There’s overhead involved due to encryption.

Now how do we decrypt and read the data? Good question. Actually it’s quite similar to encryption. We use DecryptByKey function and before that we have to open the symmetric key that was used for encrypting the data. Let’s do it

  1: OPEN SYMMETRIC KEY BankAccountKey
  2:    DECRYPTION BY CERTIFICATE BankCert;
  3: GO
  4: 
  5: --list original and decrypted values
  6: SELECT AccountNumber, EncryptedAccountNumber,len(EncryptedAccountNumber) AS Size, CONVERT(varchar, DecryptByKey(EncryptedAccountNumber)) AS DecryptedAccountNumber FROM Account

And that’s about it.

Conclusion:

Column level security is useful for securing hand picked columns. DBA’s will have to make sure that they encrypt the certificates and keys so that we can recover the encrypted data in case of a disaster. Developers are responsible for writing the code that encrypts and decrypts the data. Actually it’s pretty straight forward to create stored procedures that encrypts and decrypts the data.

9 Comments

SQL – Common Table Expressions (CTE) and HierarchyId

SQL Server 2008/R2/2012 offers hierarchyid data type that can be used for handling hierarchical data. In this blog I will try to take you down the rabbit hole and explore common table expressions and hierarchyid.

Quite often we have  to deal with hierarchical or recursive data. e.g.

Product Category (Entertainment Device)

     Product Sub-category (TV)

         Product Sub-sub-category (LCD)

              Product Sub-sub-sub-category (HD)

                      …..

I guess you get the idea.

 

Another example would be Employee – Manager relationship where every manager is an employee (but not every employee is a manager). Here’s a sample diagram:

image

So, how do we store this data in our table and then how do we query it? We have two options:

1. Common Table Expressions – CTE

This works in older versions of SQL Server 2005 and also the newer versions like 2008/R2, 2012. We use the traditional data types like int etc. to handle relationship.

Let’s first create the table so we can store the records and then we will query it.

   1:  CREATE TABLE dbo.Employee
   2:   ( 
   3:     EmployeeId int NOT NULL IDENTITY(1,1) PRIMARY KEY, 
   4:     Name varchar(50) NOT NULL, 
   5:     ManagerId int 
   6:  ) 
   7:  GO

 

Next step is to populate the table with the data (note: I am using same data that I have in my diagram)

   1:  INSERT INTO dbo.Employee (Name, ManagerId) 
   2:  VALUES 
   3:  ('CEO', NULL), 
   4:  ('Sales Manager', 1), 
   5:  ('Accounts Manager', 1), 
   6:  ('Senior Sales Person 1', 2), 
   7:  ('Senior Sales Person 2', 2), 
   8:  ('Senior Accounts Person 1', 3), 
   9:  ('Senior Accounts Person 2', 3), 
  10:  ('Junior Sales Person 1', 4), 
  11:  ('Junior Sales Person 2', 5), 
  12:  ('Junior Accounts Person 1', 6), 
  13:  ('Junior Accounts Person 2', 7) 

Now we are ready to use common table expressions to query the table.

   1:  --Finding descendants
   2:  WITH EmployeeDescendants(EmployeeId, Name, ManagerId) 
   3:  AS 
   4:  ( 
   5:  select EmployeeId, Name, ManagerId from dbo.Employee 
   6:  WHERE EmployeeId = 1 -- this is parameter
   7:  union all 
   8:  select m.EmployeeId, m.Name, m.ManagerId 
   9:  from EmployeeDescendants CTE,dbo.Employee m 
  10:  WHERE CTE.EmployeeId = m.ManagerId 
  11:  ) 
  12:  SELECT EmployeeId, Name, ManagerId 
  13:  FROM EmployeeDescendants 
  14:  ORDER BY EmployeeId
  15:   
  16:  --Finding ancestors
  17:   
  18:  WITH EmployeeAncestor(EmployeeId, Name, ManagerId) 
  19:  AS 
  20:  ( 
  21:  select EmployeeId, Name, ManagerId from dbo.Employee 
  22:  WHERE EmployeeId = 11 -- this is parameter
  23:  union all 
  24:  select m.EmployeeId, m.Name, m.ManagerId 
  25:  from EmployeeAncestor CTE,dbo.Employee m 
  26:  WHERE CTE.ManagerId = m.EmployeeId 
  27:  ) 
  28:  SELECT EmployeeId, Name, ManagerId 
  29:  FROM EmployeeAncestor 
  30:  ORDER BY EmployeeId ASC
  31:   

Common Table Expressions aren’t too bad, eh. You could easily convert the above script into a stored procedure and pass arguments to parameters.

Query Execution Plan for finding descendant nodes looks like this (Note: Plan for finding ancestors is similar):

image

Ouch, lots is going on in the query execution plan.

2. HierarchyId

This won’t work in SQL Server 2005. You must have newer versions like 2008/R2, 2012. We use the newer managed data type called “hierarchyid” to handle relationship.

Let’s first create the table so we can store the records and then we will query it.

   1:  CREATE TABLE [dbo].[EmployeeNew]
   2:  ( 
   3:      [EmployeeId] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY, 
   4:      [Name] [varchar](50) NOT NULL, 
   5:      [ManagerId] [int] NULL, 
   6:      [OrgNode] [hierarchyid] NULL
   7:  )
   8:   
   9:  GO 

Next step is to populate the table with sample records

   1:  INSERT INTO dbo.EmployeeNew(Name, ManagerId, OrgNode) 
   2:  VALUES 
   3:  ('CEO', NULL, hierarchyid::Parse('/')), 
   4:  ('Sales Manager', 1, hierarchyid::Parse('/1/')), 
   5:  ('Accounts Manager', 1, hierarchyid::Parse('/2/')), 
   6:  ('Senior Sales Person 1', 2, hierarchyid::Parse('/1/1/')), 
   7:  ('Senior Sales Person 2', 2, hierarchyid::Parse('/1/2/')), 
   8:  ('Senior Accounts Person 1', 3, hierarchyid::Parse('/2/1/')), 
   9:  ('Senior Accounts Person 2', 3, hierarchyid::Parse('/2/2/')), 
  10:  ('Junior Sales Person 1', 4, hierarchyid::Parse('/1/1/1/')), 
  11:  ('Junior Sales Person 2', 5, hierarchyid::Parse('/1/2/1/')), 
  12:  ('Junior Accounts Person 1', 6, hierarchyid::Parse('/2/1/1/')), 
  13:  ('Junior Accounts Person 2', 7, hierarchyid::Parse('/2/2/1/')) 
     

Time to test the table:

   1:  --display all records 
   2:  select orgnode.ToString() AS OrgNodeString,orgnode.GetLevel()  Level,* 
   3:  from dbo.EmployeeNew 
   4:   
   5:   
   6:  --getting root 
   7:  DECLARE @root hierarchyid = hierarchyid::GetRoot(); 
   8:  SELECT @root, @root.ToString()
   9:   
  10:  --employee we are interested in 
  11:  DECLARE @node hierarchyid 
  12:  SELECT @node = OrgNode FROM dbo.EmployeeNew 
  13:  WHERE Name = 'Sales Manager'; 
  14:  SELECT @node, @node.ToString()
  15:   
  16:  --get ancestors 
  17:  SELECT * FROM dbo.EmployeeNew WHERE @node.IsDescendantOf(orgnode) = 1;
  18:   
  19:  --get descendants 
  20:  SELECT * FROM dbo.EmployeeNew WHERE OrgNode.IsDescendantOf(@node) = 1;
  21:   
  22:  --get by level (junior employees) 
  23:  SELECT * FROM dbo.EmployeeNew 
  24:  WHERE OrgNode.GetLevel() = 3;

And the query execution plan for finding descendants looks like this (Note: query execution plan for finding ancestors is similar):

image

That query execution plan is sweet and simple, eh. But, don’t let it mislead you. Under the hood there’s still lots going on. Bulk of the work is getting performed by the managed functions like IsDescendantOf, GetLevel etc.

HierarchyId definitely reduces the lines of code required to query the hierarchical or recursive data.

 

Conclusion

Use hierarchyid if you have newer SQL Server like 2008 and onwards. It’s designed just to tackle hierarchical scenarios.

Use CTE if you have older SQL Server like 2005. However, CTE is a generic query pattern which means it can be used even when we don’t have hierarchical data.

1 Comment

Power View – Highlight, Filter and Slicer

SQL Server 2012 brings Power View (aka. Project Crescent) which supports highlight filter and slicers. They are pretty much similar to PowerPivot. Let’s discuss when these features should be used

Let’s take the following scenario where we have a table as well a chart on the same Power View report:

image

Highlight:

It let’s you focus on the interested data. If you highlight a table it won’t do much except for highlighting the cell in the table.

image

But, if you highlight a bar in the chart then it filters the table control automatically

image

Filter:

Filter can be utilized at individual control level e.g. table, matrix, card, chart etc. It cuts down on the data that’s being displayed on the Power View canvas. Filter at control level doesn’t effect other controls on the canvas.

Here’s the filter applied on table (Note: applying filter on chart will have similar effect):

image

One benefit of filter is that it has it’s own section (the right hand side window which can be closed or opened) where it gets displayed so it doesn’t consume real estate on the screen, well at least not all the time. You can choose to display filter pane whenever you require it.

Slicer:

Slicer is a filter but it’s scope is broader than a filter. Slicer filters data at the View level. One drawback of slicer is that it’s always displayed in the Power View canvas which means it reduces the real estate on the screen for controls that we can use.

Here we have a slicer on the right hand side and it filters all the controls on the canvas:

image

 

Summary:

  • Use highlight for showing areas of interest as well for filtering (especially chart control)
  • Use filter when you have to filter individual controls without effecting other controls
  • Use slicer when you have to filter everything on the canvas

No Comments

Power View – Power to the User

Power Builder, PowerPivot. Let me guess what word is common in these technologies ,“POWER”.  Quite a few products and technologies have used the word “Power” to give a sense of empowerment to the users. And, I guess they deserve to use the name as they have delivered the results. Now there’s another one to join the family and that’s Power View (aka. Project Crescent) in SQL Server 2012.

 

So far Power View seems pretty good. It does empower the end users. Here’s the requirements for using Power View:

  • SharePoint 2010 Enterprise edition with SP1
  • SQL Server 2012 PowerPivot (SharePoint must be configured as well)
  • Silverlight 5 (What?? No HTML 5 or AJAX?? Sad smile)

 

Yes, you read it right. It’s a Silverlight 5 based application and it won’t run on iOS. That’s where Microsoft should have offered AJAX alternative as well so those who don’t have Silverlight plug-in could use it as well. If you could just come over this one itsy bitsy drawback Power View is definitely an excellent self-servicing BI feature for business users.

As far as data source goes, Power View requires one of the following:

  • Tabular Model (New in SSAS 2012)
  • PowerPivot based Excel report (this could use OLTP, OLAP or Tabular Model)

Here are some of the graphical visualizations that we can create in Power View:

  • Charts (Horizontal and Vertical)

image

  • Scatter Chart (with Play Axis. Simply love the animation for showing breakdown of the data over period of time)

image

  • Charts (with Tiles)

image

 

  • Views (OK, this is not really a visualization since it’s available out-of-the-box but I just got to mention it since it’s soooooo good. Just like Excel has sheets concept, PowerPoint has slides concept the Power View has views concept. It lets you add multiple views of the data in the same report. You could say that it’s Power View’s equivalent to Sub-report in SSRS)

image

Here are the pros, cons and the personal verdict (it’s by no means an exhaustive list of pros and cons)

Pros:

  • Easy to create reports in a matter of minutes
  • No programming / SQL knowledge is required
  • Out-of-the-Box animations and visualizations (Play Axis stole my heart away)
  • Easy to filter reports at control level or at the report level

Cons:

  • Uses Silverlight plug-in
  • Yet to see any support for creating custom extensions or plug-ins to develop custom visualizations and animations

 

Personal Verdict:

Excellent reporting tool for business users and delivers well considering it’s the first iteration of Power View. Definitely looking forward to it’s future iterations.

2 Comments