I’m visiting Dallas this week to speak at the North Texas SQL Server User Group this Thursday. I’ll be speaking about keys: primary keys, surrogate keys, clustered keys, GUIDs, SEQUENCEs, alternate keys…well, there’s a lot to cover for such a simple topic. The reason I put this presentation together is that I see a lot of confusion about these topics. Some of it’s about terminology (“I can’t find anything about alternate keys in SQL Server…what the heck is that, anyway?”), some of it is misunderstanding (“what do you mean IDENTITIES aren’t unique! Of course they are…they are primary keys!”), and some of it is just newness (“Why the heck would anyone want to use a SEQUENCE?”).
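If that IDENTITY one surprises you, here’s a minimal T-SQL sketch (the table and names are mine, purely for illustration) showing that an IDENTITY column by itself guarantees nothing about uniqueness; only a key constraint does:

```sql
-- IDENTITY generates values, but nothing prevents duplicates unless a
-- PRIMARY KEY or UNIQUE constraint is also declared.
CREATE TABLE dbo.Widget
(
    WidgetID int IDENTITY(1,1),   -- note: no PRIMARY KEY here
    Name     varchar(50) NOT NULL
);

INSERT dbo.Widget (Name) VALUES ('first');    -- WidgetID = 1

-- Reseed back to 0, so the next insert generates 1 again
DBCC CHECKIDENT ('dbo.Widget', RESEED, 0);

INSERT dbo.Widget (Name) VALUES ('second');   -- WidgetID = 1 again!

SELECT WidgetID, Name FROM dbo.Widget;        -- two rows, same WidgetID
```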
We’ll be chatting about all these questions and more on Thursday, 17 March at the Microsoft venue in Irving, Texas starting at 6PM.
Attendance is free, but you need to register at http://northtexas.sqlpass.org/ to help organizers plan for the event.
Don’t worry if you don’t know about SQL Server or don’t use it: this presentation will focus on some SQL Server-specific features, but the discussion is completely portable to other DBMSs.
So many of us have learned database design approaches from working with one database or data technology. We may have used only one data modeling or development tool. That means our vocabularies around identifiers and keys tend to be product specific. Do you know the difference between a unique index and a unique key? What about the differences among RI, FK and AK? These concepts span data activities, and it’s important that your team members understand each other and know where they, their tools, and their approaches need to support these features. We’ll look at the generic and proprietary terms for these concepts, as well as where they fit in the database design process. We’ll also look at implementation options in SQL Server and other DBMSs.
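As a small taste of the terminology issue, here’s a hedged sketch (the table and constraint names are hypothetical) of how SQL Server treats the two: a unique key is a declared constraint, while a unique index is a physical structure that also enforces uniqueness but can carry index-only options:

```sql
-- A UNIQUE constraint (an alternate key) is a logical, declared rule
-- that shows up in the table's key metadata:
ALTER TABLE dbo.Customer
    ADD CONSTRAINT AK_Customer_Email UNIQUE (EmailAddress);

-- A unique index enforces uniqueness physically, but it is an index,
-- not a documented key, and it supports options a constraint cannot,
-- such as a filter that lets many NULLs coexist:
CREATE UNIQUE NONCLUSTERED INDEX IX_Customer_Phone
    ON dbo.Customer (PhoneNumber)
    WHERE PhoneNumber IS NOT NULL;
```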
Hope to see you there!
One of the most clichéd blogging tricks is to declare something popular as dead. These desperate, click-bait posts are popular among click-focused bloggers, but they aren’t my style. Yet here I am, writing an “is dead” post. Today it’s about sharing my responses to ongoing social media posts. They go something like this:
OP: No one loves my data models any more.
Responses: Data modeling is dead. Or…data models aren’t agile. Or…data models died with the waterfalls. Or…only I know how to do data models and all of you are doing it wrong, which is why they just look dead.
I bet I’ve read that sort of conversation at least a hundred times, first on mailing lists, then on forums, now on social media. It has been an ongoing battle for modelers since data models and dirt were discovered…invented…developed.
I think our issue around the love for data modeling, and logical data models specifically, is that we try to make these different types of models into different tasks. They aren’t. In fact, there are many types, many goals, and many points of view about data modeling. So as good modelers, we should first seek to understand what everyone in the discussion means by that term. And what do you know, even that is contentious. More on that in another post.
I do logical data modeling when I’m physical modeling. I don’t draw a whole lot of attention to it – it’s just how modeling is done on my projects.
Data Modeling is Dead Discussion
One current example of this discussion is taking place right now over on LinkedIn. Abhilash Gandhi posted:
During one of my project, when I raised some red flags for not having Logical Data Model, I was bombarded with comments – “Why do we need LDM?” “Are you kidding?” “What a waste of time!” The project was Data Warehouse with number of subject areas; possibility of number of data marts.
I have put myself into trouble by trying to enforce best practices for Data Modeling, Data Definitions, Naming Standards, etc. My question, am I asking or trying to do what may be obsolete or not necessary? Appreciate your comments.
There are responses that primarily back up the original poster’s feelings of being unneeded on modern development projects. Then I added another viewpoint:
I’ll play Devil’s advocate here and say that we Data Architects have also lost touch with the primary way the products of our data modeling efforts will be used. There are indeed all kinds of uses, but producing physical models is the next step in most of them. And many of us have lost the skills to work on the physical side. Because we let this happen, we have also failed to make physical models useful for the teams who need them.
We just keep telling the builders how much they should love our logical models, but have failed to make the results of logical modeling useful to them.
I’ve talked about this in many of my presentations, webinars (sorry about the autoplay, it’s a sin, I know) and data modeling blog posts. It’s difficult to keep up with what’s happening in the modern data platform world. So most of us just haven’t. It’s not that we need to be DBAs or developers. We should, though, have a working literacy in the features and approaches for implementing our data models for production use. Why? I addressed that as well. Below is an edited version of my response:
We Don’t All Have to Love Logical Data Modeling
First of all, the majority of IT professionals do not need to love an LDM. They don’t even need to need them. The focus of the LDM is the business steward/owner (and if I had my way, the customer, too). But we’ve screwed up how we think of data models as artefacts that are “something done on an IT project”. Sure, that’s how almost all funding gets done for modeling, and it’s broken. But it’s also a fact of life for the relatively immature world of data modeling.
We literally beat developers and project managers with our logical data modeling, then ask them “why don’t you want us to produce data models?” We use extortion to get our beautiful logical data models done, then sit back and wonder why everyone sits at another lunch table.
I don’t waste time or resources trying to get devs, DBAs or network admins to love the LDMs. When was the last time you loved the enterprise-wide AD architecture? The network topology? The data centre blueprints and HVAC diagrams?
Data Models form the infrastructure of the data architecture, as do conceptual models and all the models that would fill the upper rows of the Zachman Framework. We don’t force the HVAC guys to wait to plan out their systems until a single IT application project comes along to fund that work. We do it when we need a full plan for a data centre. Or a network. Or a security framework.
But here we are, trying to whip together an application with no models. So we tell everyone to stop everything while we build an LDM. That’s what’s killing us. Yes, we need to do it. But we don’t have to do it in a complete waterfall method. I tell people I’m doing a data model. Then I work on both the LDM and the PDM at the same time. The LDM I use to drive data requirements from business owners, the PDM to start making it actually work in the target infrastructure. Yes, I LDM more at first, but I’m still doing both at the same time. Yes, the PDM looks an awful lot like the LDM at first.
Stop Yelling at the Clouds
The real risk we take is sounding like old men yelling at clouds when we insist on working and talking like it’s 1980 all over again. I do iterative data modeling. I’m agile. I know it’s more work for me. I’d love the luxury of spending six months embedded with the end users, coming up with a perfect and lovely logical data model. But that’s not the project I’ve been assigned to. It’s not the team I’m on. Working against the team invites a demand that no data modeling be done and that database and data integration work be done by non-data professionals. You can stand on your side of the cubicle wall, screaming about how LDMs are more important, or you can use the data modeling skills you have to make it work.
When I’m modeling, I’m working with the business team, drawing out their business rules and requirements with more clarity. I am on #TeamData and #TeamBusiness. When the business sees you representing their interests, often to a hostile third-party implementer, they will move mountains for you. This is the secret to getting CDMs, LDMs, and PDMs done on modern development projects. Just do them as part of your toolkit. I would prefer to data model completely separately from everyone else. I don’t see that happening on most projects.
The #TeamData Sweet Spot
My sweet spot is to get to the point where the DBAs, Devs, QA analysts and Project Managers are saying “hey, do you have those database printouts ready to go with the DDL we just delivered? And do you have the user ones, as well?” I don’t care what they call them. I just want them to call for them. At that point, I know I’m also on #TeamIT.
The key to getting people to at least appreciate logical data models is to just do them as part of whatever modeling effort you are working on. Don’t say “stop”. Just model on. Demonstrate, don’t tell: show your teams where the business requirements are written down and where they live. Then demonstrate how that leads to beautiful physical models as well.
Logical Data Modeling isn’t dead. But we modelers need to stop treating it like it’s a weapon. Long Live Logical!
My friend Joey D’Antoni (@jdanton | blog) and I will be giving a workshop at NoSQLNow! about new database and datastore technologies like Hadoop, Neo4j, Cassandra, Vertica, DocumentDB, and others. This will be a fast-paced, demo-heavy, practical session for data professionals. We’ll talk about where a modern data architecture would best use these technologies and why it’s not an either/or question for relational solutions in a successful enterprise. And, as always, our goal is to make the time we spend fun and interactive. This session will be a great starting point for some other sessions on Monday that go into data modeling for NoSQL, as well as for all the other in-depth, database-specific talks the rest of the week.
Sunday, April 17, 2016
We’ve been busy keeping relational data consistent, high quality, and available. But over the last few years, new database and datastore technologies have come to the enterprise with different data stories. Do we need all our data to be consistent everywhere? What does data quality mean for analytics? Will we still need relational databases?
Learn how traditional and new database technologies fit in a modern data architecture. We will talk about the underlying concepts and terminology such as CAP, ACID and BASE and how they form the basis of evaluating each of the categories of databases. Learn about graph, Hadoop, relational, key value, document, columnar, and column family databases and how and when they should be considered. We’ll show you demos of each.
Finally, we will wrap up with 7+ tips for working with new hybrid data architectures: tools, techniques and standards.
Use code “DATACHICK” to save:
$100 off for Tutorials Only + Seminar Only Registration and $200 off for Full Event, Conference+Tutorials, Conference +Seminar, and Conference Only Registration.
Super early registration ends 29 January, so take advantage of both discounts now (yes, they stack!).
It’s a new year and I’ve given Thomas LaRock (@sqlrockstar | blog) a few months to recover and ramp up his training since our last Throwdown. The trophies from all my wins are really cluttering my office, and I feel bad that Tom has not yet had a chance to claim victory. So we will be battling again in just a few days.
I’ll be dishing out the knowledge, along with a handkerchief for Tom to wipe up his tears, at SQL Saturday #461 Austin, TX on 30 January 2016. This full-day, community-driven event features real database professionals giving free presentations on SQL Server and Data Platform topics. All you need to do is register (again, it’s free) before all the tickets are gone.
Database Design Throwdown
Duration: 60 minutes
Track: Application & Database Development
Everyone agrees that great database performance starts with a great database design. Unfortunately, not everyone agrees which design options are best. Data architects and DBAs have debated database design best practices for decades. Systems built to handle current workloads are unable to maintain performance as workloads increase. Attend this new and improved session and join the debate about the pros and cons of database design decisions. The debate includes topics such as logical design, data types, primary keys, indexes, refactoring, code-first generators, and even the cloud. Learn about the contentious issues that most affect your end users and how to avoid them.
One of the other great benefits of attending these events is that you get to network with other data professionals who are working on projects just like yours…or ones you will likely work on at some point.
Join us and other data pros to talk about data, databases and projects. And make sure you give a #datahug to Tom after the Throwdown. He’s gonna need it.
I posted a tongue-in-cheek Twitter poll about some database design features that tend to polarize team members.
Of course, every design decision comes down to cost, benefit and risk. So there aren’t any evil choices.
How would you vote?
12 April 2011 was Yuri’s Night — the night we space fans celebrate Yuri Gagarin’s history-making 1961 flight into space. In 2011 we were celebrating 50 years of manned spaceflight. On that same day in 2011, we reached the end of mainstream support for SQL Server 2005 SP4. On 12 April 2016 we will reach the end of extended support for SQL Server 2005. That means no more hotfixes, no help from Microsoft and no love for your data still living in SQL Server 2005 databases.
I’m hoping your organization is already on its way to upgrading and migrating data and applications to newer versions of SQL Server. SQL Server 2016 is already being used in production by early access customers. No matter which version you will be migrating to, I want to share some of the features and perks you’ll have available now that you are moving away from a dead version. Of course, there are hundreds of enhancements that have happened since 2005, but today I’m focusing on those that a data architect would want to use in a database design for enhanced performance, security and data quality.
Columnstore Indexes

If you are designing for data warehouse-type solutions, columnstore indexes are the closest thing we have to a “turbo switch” for SQL Server. They achieve high compression rates since they store columns together instead of storing rows together. They also support much faster query performance for batch and aggregation queries, typically achieving 10x performance increases, sometimes even more. This feature was introduced in SQL Server 2012, but you’ll want the advances to this feature that came with SQL Server 2014.
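Here’s a hedged sketch of what that can look like (the table and index names are mine; a real design would be sized and partitioned to the workload):

```sql
-- A clustered columnstore index stores the fact table column-wise,
-- compressing heavily and speeding up scans and aggregations.
CREATE TABLE dbo.FactSales
(
    SaleDate  date          NOT NULL,
    ProductID int           NOT NULL,
    StoreID   int           NOT NULL,
    Quantity  int           NOT NULL,
    Amount    decimal(19,4) NOT NULL
);

-- SQL Server 2014 added the clustered version, which keeps the table
-- updatable (the 2012 nonclustered version made the table read-only).
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Aggregation queries like this are where the ~10x gains show up:
SELECT StoreID, SUM(Amount) AS TotalSales
FROM dbo.FactSales
GROUP BY StoreID;
```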
SEQUENCEs

SEQUENCEs have been around in other DBMSs for a while, but were introduced to SQL Server in SQL Server 2012. These special objects work much like IDENTITY columns, but offer more flexibility and more use cases. The main feature is that you can grab a sequence value (or a bunch of them) before you insert a row. Many developers use GUIDs for similar reasons, but GUIDs are much longer and therefore have performance downsides. SEQUENCEs are integer types.
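A quick sketch of that flexibility (the object names are made up):

```sql
-- A SEQUENCE lives outside any table, so values can be fetched
-- before the INSERT happens.
CREATE SEQUENCE dbo.OrderNumberSeq
    AS int
    START WITH 1000
    INCREMENT BY 1;

-- Grab a value ahead of the insert, something IDENTITY can't do:
DECLARE @NextOrder int = NEXT VALUE FOR dbo.OrderNumberSeq;

-- Or reserve a whole range in one call for batch loading:
DECLARE @FirstValue sql_variant;
EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.OrderNumberSeq',
     @range_size        = 100,
     @range_first_value = @FirstValue OUTPUT;
```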
New Data Types
So many new data types have been introduced since SQL Server 2005, but the ones that really stand out for me are DATE, TIME, DATETIMEOFFSET, the geospatial types, and the deprecation of TIMESTAMP.
It wasn’t until SQL Server 2008 that we had access to data types that comprise only the date or time portion of a point in time. Before that, we had to do all kinds of conversions just to strip out unwanted data (00:00:00). We also had to make room to store that unwanted precision. Storing millions of rows of unneeded zeros hurts performance, both operationally and for backup and recovery.
SQL Server 2008 also introduced DATETIMEOFFSET, which allows us to track data in context of its time zone. If you remember the days when meeting invites did not include this important piece of information, you’ll know why this is important.
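Here’s a small sketch of those types together (the table and values are invented for illustration):

```sql
CREATE TABLE dbo.Appointment
(
    AppointmentDate date              NOT NULL,  -- 3 bytes, no 00:00:00 baggage
    StartTime       time(0)           NOT NULL,  -- time only, at a chosen precision
    ScheduledAt     datetimeoffset(0) NOT NULL   -- point in time plus time zone offset
);

INSERT dbo.Appointment (AppointmentDate, StartTime, ScheduledAt)
VALUES ('2016-04-12', '18:00', '2016-04-12 18:00:00 -05:00');

-- No more CONVERT gymnastics just to strip the time portion:
SELECT AppointmentDate
FROM dbo.Appointment
WHERE AppointmentDate = '2016-04-12';
```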
The spatial data types GEOGRAPHY and GEOMETRY have added a new and feature-rich way of tracking places and their geometry, plus special features that make it much easier to answer questions like “which is the closest?” or “is this address located in this neighbourhood?”. If your data story includes location points, you’ll want to use these.
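A small sketch of a “which is the closest?” question (the names and coordinates are mine):

```sql
CREATE TABLE dbo.Store
(
    StoreID  int PRIMARY KEY,
    Location geography NOT NULL
);

INSERT dbo.Store VALUES
    (1, geography::Point(32.7767, -96.7970, 4326)),  -- Dallas (lat, long, SRID)
    (2, geography::Point(30.2672, -97.7431, 4326));  -- Austin

-- Distance in meters from a caller's location to each store:
DECLARE @Me geography = geography::Point(32.8998, -97.0403, 4326);
SELECT StoreID, Location.STDistance(@Me) AS Meters
FROM dbo.Store
ORDER BY Meters;
```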
SQL Server was always an oddball when it came to the data type TIMESTAMP. In other DBMSs, this data type includes date and time, at a very high precision. In SQL Server, TIMESTAMP is a type of row version identifier that has nothing to do with time. So data architects migrating from other DBMSs were often bitten when they used the wrong data type. Microsoft announced in 2008 that it was deprecating TIMESTAMP and recommending the use of ROWVERSION, which is similar (but not the same) in functionality.
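For those migrating, here’s a hedged sketch of what ROWVERSION is actually for: optimistic concurrency checks, not telling time (the names are hypothetical):

```sql
-- ROWVERSION is an automatic, database-unique binary(8) value that
-- changes on every update to the row.
CREATE TABLE dbo.Account
(
    AccountID int           PRIMARY KEY,
    Balance   decimal(19,4) NOT NULL,
    RowVer    rowversion    NOT NULL   -- maintained by SQL Server
);

-- Read the row and remember its version...
DECLARE @Seen binary(8);
SELECT @Seen = RowVer FROM dbo.Account WHERE AccountID = 1;

-- ...then update only if nobody changed it since we read it:
UPDATE dbo.Account
SET Balance = Balance - 100
WHERE AccountID = 1 AND RowVer = @Seen;

IF @@ROWCOUNT = 0
    PRINT 'Row changed since it was read: retry or report a conflict.';
```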
Always Encrypted

SQL Server 2016 includes support for Always Encrypted, a feature that does just that: it supports the encryption of data from application to database and back, so the data is better protected than with solutions that encrypt it only once it is written to the database. I’m always reminding you that keeping your CIO out of jail is part of your job description, right?
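The DDL side looks roughly like this. This is a heavily hedged sketch: the column master key and the column encryption key (CEK_Patient below) have to be provisioned first, typically through SSMS or PowerShell, and that setup is omitted here:

```sql
CREATE TABLE dbo.Patient
(
    PatientID int IDENTITY PRIMARY KEY,
    SSN char(11)
        COLLATE Latin1_General_BIN2        -- deterministic encryption needs a BIN2 collation
        ENCRYPTED WITH
        (
            COLUMN_ENCRYPTION_KEY = CEK_Patient,   -- assumed to already exist
            ENCRYPTION_TYPE = DETERMINISTIC,       -- allows equality lookups
            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
        ) NOT NULL
);
-- The engine only ever sees ciphertext; the client driver (connecting
-- with Column Encryption Setting=Enabled) encrypts and decrypts.
```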
Stretch Database

As our data gets bigger and bigger, the size of our databases is growing as well, and performance takes a hit. Developers want us to take shortcuts on data quality to improve performance, because size matters. One of the ways to help manage data volumes is to move “cold” data to other storage locations. Starting in SQL Server 2016, we can stretch a database to Azure, which means that data that isn’t accessed as often can be stored in the cloud and retrieved when needed. This allows our hot data to stay local and fast while the cooler data is more economical to store, yet still there, and your application doesn’t even have to manage the difference.
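A heavily hedged sketch of the moving parts (the table name is hypothetical, and the Azure server, credential and database-level enablement steps are omitted):

```sql
-- Allow the instance to use Stretch at all:
EXEC sys.sp_configure N'remote data archive', 1;
RECONFIGURE;

-- Once the database itself is stretch-enabled against an Azure server,
-- a table can begin migrating its rows to the cloud:
ALTER TABLE dbo.OrderHistory
    SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));
```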
JSON Support

In SQL Server 2016 we are getting support for JSON processing. This isn’t the same as a JSON data type like we have with XML, but rather a set of import and export features for providing relational data as JSON documents and bringing JSON data into SQL Server. Now you won’t have to manage all those curly brackets on your own.
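A small sketch of both directions (the sample data is made up):

```sql
-- Relational out, JSON back:
SELECT CustomerID, Name
FROM (VALUES (1, N'Karen'), (2, N'Tom')) AS c(CustomerID, Name)
FOR JSON PATH;
-- -> [{"CustomerID":1,"Name":"Karen"},{"CustomerID":2,"Name":"Tom"}]

-- JSON in, rows out:
DECLARE @doc nvarchar(max) = N'[{"CustomerID":1,"Name":"Karen"}]';
SELECT CustomerID, Name
FROM OPENJSON(@doc)
     WITH (CustomerID int          '$.CustomerID',
           Name       nvarchar(50) '$.Name');
```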
One Last Thing…
As vendors withdraw support for their products, third party tool makers do so as well. If you are supporting older, out of support versions of databases, it’s likely that your data modeling, data quality and data integration tools are also dropping support for these solutions. You’ll be left supporting database systems without vendor support and without professional enterprise class modeling and design tools. I know how hard it is to keep track of databases that my tools can’t connect with. Don’t let sticking with an old version be the end of data modeling support for that data.
If you like geeking out about space and data types, you might want to check out my 1 April 2014 post on a new data type.
Show Your Data Some Love
These are just a few of the features that will be available to you when you upgrade to modern versions of SQL Server. The advance of security, data quality and performance features is leaving your old solutions behind, putting your data at risk and leaving your customer data feeling unloved. There’s a data space race going on. Don’t leave your company using old technology to manage data. Go fix that!
I have a couple of presentations where I describe how generalized data modeling can offer both benefits and unacceptable costs. In my Data Modeling Contentious Issues presentation, the one where we vote via sticky notes, we debate the trade-offs of generalization in a data model and database design. In 5 Classic Data Modeling Mistakes, I talk about over-generalization.
Over the last 20-some years (and there’s more “some” there than ever before), I’ve noticed a trend towards more generalized data models. This means that instead of having a box for almost every noun in our business, we have concepts that have categories. Drawing examples from the ARTS Data Model, instead of having entities for:
- Purchase Order
- Shipping Notice
…we have one entity for InventoryControlDocument that has a DocumentType instance of Purchase Order, Shipping Notice, Receipt, Invoice, etc.
See what we did there? We took metadata that was on the diagram as separate boxes and turned them into rows in a table in the database. This is brilliant, in some ways, because it means that when the business comes up with a new type of document, we don’t have to create a new entity and a new table to represent that new concept. We just add a row to the DocumentType table and we’re done. Well, not exactly…we probably still have to update code to process that new type…and maybe add a new user interface for that…and determine which attributes of InventoryControlDocument apply to that document type so that the code can enforce the business rules.
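Here’s a hedged sketch of that trade-off in DDL (loosely ARTS-inspired, but the names and rules are mine): the generalized structure, plus one way to keep at least some of the type-specific integrity with the data instead of scattering all of it through code:

```sql
CREATE TABLE dbo.DocumentType
(
    DocumentTypeCode varchar(20) PRIMARY KEY   -- 'PO', 'ASN', 'RECEIPT', ...
);

CREATE TABLE dbo.InventoryControlDocument
(
    DocumentID       int IDENTITY PRIMARY KEY,
    DocumentTypeCode varchar(20) NOT NULL
        REFERENCES dbo.DocumentType (DocumentTypeCode),
    SupplierID       int  NULL,   -- required for POs, meaningless for some other types
    ExpectedDate     date NULL
);

-- A new document type is now just a row, not a new table:
INSERT dbo.DocumentType VALUES ('PO'), ('ASN');

-- ...but the type-specific rules didn't disappear. One way to keep a
-- rule with the data rather than only in application code:
ALTER TABLE dbo.InventoryControlDocument
    ADD CONSTRAINT CK_PO_NeedsSupplier
    CHECK (DocumentTypeCode <> 'PO' OR SupplierID IS NOT NULL);
```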
Ah! See what we did there this time? We moved responsibility for managing data integrity from the data architect to the coders. Sometimes that’s great and sometimes, well, it just doesn’t happen.
So my primary reason to raise generalization as an issue is that sometimes data architects apply these patterns but don’t bother to apply the governance of those rules to the resulting systems. Just because you engineered a requirement from a table to a row does not mean it is no longer your responsibility. I’ve even seen architects become so enamoured with moving the work from their plate to another’s that they have generalized the heck out of everything while leaving the data quality responsibility up to someone else. That someone else typically is not measured or compensated for data integrity, either.
Alec Sharp has written a few blog posts on generalization. These posts have some great examples of his 5 Ways to Go Wrong with Generalisation. I especially like his use of the term literalism, since I never seem to get the word specificity out when I’m speaking. I recommend you check out his five ways, since I agree with all of them.
1 – Failure to generalize, a.k.a. literalism
2 – Generalizing too much
3 – Generalizing too soon
4 – Confusing subtypes with roles, states, or other multi-valued characteristics
5 – Applying subtyping to the wrong entity
By the way, Len Silverston and Paul Agnew talk about levels of generalization in their book The Data Model Resource Book, Vol. 3: Universal Patterns for Data Modeling (affiliate link). Generalization isn’t just a yes/no position. Every data model structure you architect has a level of generalization.
I’m wondering how many of you have used a higher level of generalization and what you’ve done to ensure that the metadata you transformed into data still has integrity.
Leave your recommendations in the comments.
Update: I updated the link to Alec’s blog post. Do head over there to read his points on generalization.