Data Quality Services
The Data Quality Services component of SQL Server can be used to establish a knowledge base to use in order to clean and standardize data coming into a system.
MSDN DQS
Installation
Need Master Data Services installed first (MDS requires IIS). I have a post on Installing MDS
From Start Menu, run the ‘Data Quality Server Installer’. This starts a command window.
You’ll be prompted to enter a password for the ‘Database Master Key’. The requirements for the password are listed in the window (at least 8 characters, etc).
Once an acceptable password has been entered, the installer will proceed with installing the DQS Client and creating a set of databases (I only had one 2012 instance available, so the databases were installed there).
Three databases will be installed; DQS_Main, DQS_Projects and DQS_Staging_Data.
Going back to the Start Menu, start the ‘Data Quality Client’. You’ll be prompted for a server name (supply the instance name where the DQS databases were installed).
Knowledge Base
The first step is to create a Knowledge Base (KB). For the Domain that you specify, the Knowledge Base is a list of valid values for that Domain. You can also specify invalid values, or values that map to another value. For example, if you are putting together a list of acceptable football teams to root for, ‘Falcons’ would be a valid entry, ‘Saints’ would be an invalid entry to be rejected, and ‘Dirty Birds’ would be corrected to ‘Falcons’.
Ways to create a Knowledge Base:
New Knowledge Base:
Domain Management – Create a list of values from manual entry.
Knowledge Discovery – Use existing records (in Excel or SQL Server) to create the Knowledge Base.
Matching Policy – Similar to Knowledge Discovery, with the creation of rules on matching duplicate records.
Domain Management:
Here you can import data to create a domain, or create your own. Click ‘Create a Domain’ to create a new domain.
You’ll name the domain, with an optional description, and select a data type for the entries (Date, Decimal, Integer or String). Save the Domain.
You’ll then see your Domain Properties table with the properties you just selected. The other tabs are:
‘Reference Data’ – To download data from an external source, like the Azure Datamarket
‘Domain Rules’ – Here you can specify rules such as min/max lengths for values, patterns that the value must or must not match, etc.
‘Domain Values’ – Here you can specify the valid values for your Domain, invalid values, or values that can be corrected to the correct value.
‘Term-Based Relations’ – Here you can specify a correction to part of a Domain Value, such as correcting ‘Main St’ to ‘Main Street’
Once you click ‘Finish’, you’ll be prompted to publish the knowledge base.
On the dashboard, there is an option to ‘Open Knowledge Base’. You can select ‘DQS Data’ – ‘Samples’ to see some example Domains, like ‘Country/Region’, ‘US – State’, etc.
New Data Quality Project
The second part of DQS is to use the Knowledge Base to clean data.
Create a name for the project and specify a KB and a Domain from the KB to use.
You can specify ‘Cleansing’ or ‘Matching’ (Available if the KN has matching policies set up) for the activity.
Then you specify the data to clean (database table or Excel), mapping domains to database columns.
After running the cleansing, you can review the results. You can see any corrected values, or any new values. You can then Approve or Reject the decisions the process made. You can then export the results to a database table, a CSV file or an Excel file. In the export, you can have just the clean data exported, or the data along with the decisions made during hg the process
Datamarket
Microsoft also hosts a Datamarket where third partied can make their knowledge bases available to others. A few are free, but most are pay services.
Datamarket