Building a Scalable
Data Warehouse with
Data Vault 2.0
Daniel Linstedt
Michael Olschimke
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an Imprint of Elsevier
Publisher: Todd Green
Editorial Project Manager: Amy Invernizzi
Project Manager: Paul Prasad Chandramohan
Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2016 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing
from the publisher. Details on how to seek permission, further information about the Publisher’s permissions poli-
cies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing
Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other
than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they
should be mindful of their own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-12-802510-9
For information on all Morgan Kaufmann publications
visit our website at www.mkp.com/
Authors’ Biographies
DANIEL LINSTEDT
Daniel has more than 25 years of experience in the Data Warehousing and Business Intelligence field
and is internationally known for inventing the Data Vault 1.0 model and the Data Vault 2.0 System of
Business Intelligence. He helps business and government organizations around the world to achieve
BI excellence by applying his proven knowledge in Big Data, unstructured information management,
agile methodologies and product development. He has held training classes and presented at TDWI,
Teradata Partners, DAMA, Informatica, Oracle user groups, and the Data Modeling Zone conference. He
has a background in SEI/CMMI Level 5, has contributed architecture efforts to petabyte-scale data
warehouses, and offers high-quality online training and consulting services for Data Vault.
MICHAEL OLSCHIMKE
Michael has more than 15 years of experience in IT and has been working on business intelligence
topics for the past eight years. He has consulted for a number of clients in the automotive and
insurance industries and for nonprofits. In addition, he has consulted for government organizations in
Germany on business intelligence topics. Michael is responsible for the Data Vault training program
at Dörffler + Partner GmbH, a German consulting firm specializing in data warehousing and business
intelligence. He is also a lecturer at the University of Applied Sciences and Arts in Hannover, Germany.
In addition, he maintains DataVault.guru, a community site on Data Vault topics.
Foreword
I first met Daniel Linstedt during a speech at Lockheed Martin in the early 1990s. At
the time, he was an employee of the company, working on government projects. He approached me
because he wanted my opinion about a concept that he had invented at the Department of Defense in
order to store large amounts of data. Back then, the term Big Data had not yet been invented. But from what
Daniel explained to me, the concept for dealing with such huge amounts of data was born.
Back then, the end user cried, “Give me my data!” But over time the end user became
more sophisticated. The end user learned that it was not enough to get one’s data. What a person
needed was the RIGHT data. And then the sophisticated end user cried, “Give me my accurate and
correct data!”
The data warehouse represented the architectural solution to the issue of needing a single version of
the truth. The primary reason for the existence of the data warehouse was the corporate need for integ-
rity and believability of data. As such the data warehouse became the major architectural evolutionary
leap beyond the early application systems.
But the data warehouse was not the end of architecture. Indeed, the data warehouse was only one
stepping stone – architecturally speaking – in the progression of the evolution of architecture. It was
Daniel’s idea that followed the data warehouse. In many ways the data warehouse set the stage for him.
Daniel used the term common foundational modeling architecture to describe a model based on
three simple entities, focusing on business keys, their relationships and descriptive information for
both. By doing so, the model closely followed the way the business was using the data in the source
systems. It made it possible to source all kinds of data, regardless of structure, in a fully auditable manner. This
was a core requirement of government agencies at the time. And due to Enron and a host of other
corporate failures, along with Basel and SOX compliance, auditability was pushed to the forefront of the industry.
Not only that, the model was able to evolve with changing data structures. It was also easy to extend
by adding more and more source systems. Daniel later called it the “Data Vault Model,” and it was
groundbreaking.
The data vault became the next architectural extension of the data warehouse. But the data vault
concept – like all evolutions – continued to evolve. He asked me what to do about it and, as a professional
author, I advised him to “publish the heck out of it.” But Daniel decided to take the long road.
Over multiple years, he improved the Data Vault and evolved it into Data Vault 2.0. Today, this System of
Business Intelligence includes not only a more sophisticated model, but an agile methodology, a refer-
ence architecture for enterprise data warehouse systems, and best practices for implementation.
The Data Vault 2.0 System of Business Intelligence is groundbreaking, again. It incorporates
concepts from massively parallel architectures, Big Data, real-time and unstructured data. And after all this
time, I’m glad that he followed my advice and has started to publish more on the topic.
This book represents the latest, most current step in the larger evolution of the Data Vault.
It has been carefully and thoughtfully prepared by leaders in the thought and
implementation of the Data Vault.
Bill Inmon
June 29, 2015
Preface
When I was asked by the Department of Defense to build a scalable data warehouse, I was confronted
with a problem. Back then, before the term Big Data was invented, there was no approach for building
such systems – systems that could accommodate large data sets, delivered at high frequencies, and in
multiple structures.
I started intensive research to come up with a viable solution for this challenge. The analysis was
based on patterns from nature, because I expected that a partial solution would already exist some-
where. Over more than 10 years, from 1990 to early 2000, I tested the applicability of these natural pat-
terns in data warehousing. By doing so, I reduced the initial list of 50 potential entities down to three.
These remaining entity types were based on a hub-and-spoke architecture that scaled well and was
easy to extend. This model is known today as Data Vault modeling. The three entities are: hubs, which
provide a unique list of business keys from business processes; links, which integrate the business keys
within and over source system boundaries; and satellites, which provide descriptive data.
This model enabled my clients to build the most sophisticated systems and complete their assigned
tasks. When I left the government context, the system was storing and processing more than 15
petabytes of data, and it is still growing today.
However, over the years, Data Vault modeling evolved. It became one of the pillars of the Data Vault
2.0 Standard. The Data Vault 2.0 Architecture and the Data Vault 2.0 Methodology are the other pillars,
in conjunction with the Data Vault 2.0 Implementation best practices. Without these other pillars, a
Data Vault 2.0 model is just a model. The pillars together provide a set of best practices, standards, and
techniques that organizations rely on to build scalable data warehouse systems by using agile practices.
Data Vault 2.0 enables data warehouse teams around the world to exploit the Data Vault as a system
of business intelligence. This is what I teach: how to take advantage of the Data Vault 2.0 Standard, in
rapid, small steps; and it is what this book is all about.
Daniel Linstedt
Inventor of Data Vault modeling and the Data Vault 2.0
System of Business Intelligence
St. Albans, Vermont, USA
This book is the result of my own evolution regarding the Data Vault. When I heard of the concept for
the very first time in 2011 or 2012 from Oliver Cramer, I remained very skeptical. This was due to the
fact that, at that time, Data Vault was seen primarily as a model. It was different, but the model by itself
was not enough for me to become convinced of the value of it.
But Christian Haedrich, CEO of Dörffler, wanted to find out what’s behind Data Vault and decided
to go for a training in 2013 with the inventor, Daniel Linstedt, in Vermont. To be honest, my first
thought was: “what a waste of time.” I was not very happy to board a plane for six or more hours, head
over to Vermont, sit in a training class for four days, and spend another six hours on the return trip.
And because I hate to waste time, I decided to take advantage of it. My goal became not to waste
my time during the flight or in Vermont. Instead, I wanted to seriously understand what the Data Vault
is, but certainly not to use it in business. Instead, I wanted to rule it out with confidence. That’s not a
lot of value, honestly, but at least you lose the uncertainty that you might miss some great technology
because you don’t understand it.
That was the plan, and I failed miserably at it. In fact, Daniel convinced me that the Data Vault
was the technology you don’t want to miss if you’re building data warehouse solutions. Most people
in the industry are unaware that he had further developed the concept and integrated best practices for
implementation and methodology, as well as a reference architecture. These were the pieces that were
missing for me. This now explained to me why the model is as it is, along with all the background in-
formation that described why some designs are fundamentally different in Data Vault.
Since then, I have asked Daniel many questions, because I wanted to fully understand the Data
Vault, the concepts behind it and what his intentions are behind his design decisions. Our discussions
back then started a work relationship and learning experience that I have truly enjoyed. This book is
the outcome of this time spent.
I might have failed when I tried to rule out Data Vault as a viable solution for business intelligence
projects. But I always try to make mistakes only once in life. I’m glad that I changed my mind. Since
that time, the Data Vault has become part of daily work and success in the industry.
My personal wish is that this book becomes part of your success, too.
The file name of the source code file is provided on the companion site; please refer to the site for more
details: http://booksite.elsevier.com/9780128025109
Michael Olschimke
Hannover, Germany
Acknowledgments
DANIEL LINSTEDT
I would like to acknowledge my wife and family for granting me the support and love I needed to finish
this book. I would also like to acknowledge my co-author Michael Olschimke for working extremely
hard at trying to understand my writing, and spending countless hours on Skype calls with me in order
to discuss my ideas. Furthermore, I would like to personally thank Scott Ambler for all his contribu-
tions over time (especially to my last book); many of these ideas have made it into the foundations of
Disciplined Agile Delivery embedded in the Data Vault 2.0 methodology. I am also pleased to thank
Bill Inmon (the father of the data warehouse) for not only writing the foreword but also creating the
industry I earn a living in. Without the “Data Warehouse” I would not have been able to create the Data
Vault 2.0 System of Business Intelligence.
In addition, I would like to thank Roelant Vos for kick-starting the Australian Data Vault market,
as well as my partners, Dörffler + Partner and Analytics8, who assist me with training in the Data
Vault 2.0 space. I also would like to thank AnalytixDS, for their brilliant work on Automation of
Data Vault 2.0 templates through their incredible product, Mapping Manager. Without their assis-
tance, we could not generate much of the work that goes into Data Vault 2.0 systems worldwide.
In addition, there are some customers I would like to thank for trying out the Data Vault 2.0 ideas
as I refined them over the past several years. This includes Commonwealth Bank in Australia, QSuper
in Australia, Intact Financial in Canada, and Microsoft – not only for creating the wonderful technol-
ogy we have applied in this book, but also for utilizing Data Vault Modeling in-house for their own
solutions.
MICHAEL OLSCHIMKE
My acknowledgements go to Dörffler + Partner, who financed my contributions to this book
and gave me a safe harbor to be able to focus on writing. This certainly includes the management
team around Werner Dörffler, Christian Hädrich and Siegfried Heger, but also the current and former
employees of the firm, especially Timo Cirkel, Dominik Kroner, and Jens Lehmann. I would also like
to thank our customers, especially the team of Gabriela Goldner at SwissLife and the team of Marcus
Jacob at DEVK for giving me some valuable opportunities and feedback.
Furthermore, I’d like to thank all those who have helped me become what I am today. This includes
my parents Barbara and Paul Olschimke, for obvious reasons; Udo Bornschier, who encouraged me to
pursue an academic career; Prof. Cornelius Wille (Bingen), who promoted my scientific interest and
encouraged me to continue my academic career; Dr. Betty Robbins (OU), who taught me how to write,
with the help of large amounts of red ink, which I deserved; Dr. Albert Schwarzkopf (OU), who helped
me to discover my interest in data warehousing; Udo Apel, who supervised my bachelor’s thesis at
Borland and gave me some valuable advice when I started my graduate studies at Santa Clara
University; Prof. Manoochehr Ghiassi (SCU), who taught me how to organize a research team, among other
valuable things (such as data mining and the value of taking notes); Oliver Cramer, who introduced me to the
Data Vault; and Daniel Linstedt, for explaining it to me. The faculty at Santa Clara University
deserves credit for helping me to understand the value of the Data Vault and see the glory in the service
to others.
But the most life-changing person, and the one who enabled me to make my contribution to this
book, is Christina Woitzik, my partner for the last ten years. We strayed through darkness and went all
the way through hell. But in the early light of dawn, our love is still there.
By the time this book is published, she should be my lovely wife.