How to Design a Clinical Data Warehouse
by Philip Puls and Dr. Niels Buch Leander
The implementation of a Clinical Data Warehouse
(CDW) is, first and foremost, a drive towards
standardisation in order for a company to reap the
following benefits:
• Better use of internal resources
• Reduction in the critical time path for statistical reporting
• Standard exchange of data with CROs, partners and regulatory agencies
• Cross-trial analysis and leveraged use of historic data
• Globalisation and knowledge sharing
• Compliance with regulations
The implementation of a CDW is, however,
complex and critical as it may threaten to restrain
an organisation that is already struggling to get
its products to market as quickly as possible.
Therefore, this article describes how to design
a CDW that can facilitate its implementation
significantly by thinking through the entire
standardisation process, right from the start.
Getting the Building Blocks of a CDW right
The ‘foundation’ clinical data warehouse consists
of data load programmes for most common data
sources, e.g. CDMS, EDC, SDTM/ODM, IVRS, Safety
and CTMS, company specific code-lists as well as
transformation and enrichment of the data into a
single standardised data model. The data load
programmes must include a load of the study
metadata so that, at a later stage, it will be possible
to utilise features such as code-lists, dictionary
versions and trial designs, including trial arms
and visit schedules. The CDW also includes a
number of ‘data marts’, special collections of
data organised for a specific purpose, such as
an SDTM data mart that can be exported or
reported and a signal detection data mart.
Figure 1: Elements of the in-process Clinical Data Warehouse (analysis platform; CDW operations applications with standard programs and data sets; study and source metadata management; audit trail; global access; administration and data transfer; and a data repository holding clinical data and trial metadata fed from clinical applications and other data sources)
The most obvious way to design the underlying
Clinical Data Repository (CDR) is to look to the
Janus Model which is a normalised data model that
allows for cross-trial analysis and the creation of
SDTM and ADaM data sets. This is an industry-wide data model that comes with a full SDTM
mapping description. The benefit of using an
enhanced Janus data model is that it provides
maximum automation capabilities, an ‘FDA-view’
of the data and exploratory analysis possibilities
across studies, projects and compounds.
Furthermore, the data model is easily adapted
if company-specific needs are not covered, for
example, in relation to study metadata.
However, since the CDW enables in-process data
review, the clinical data repository data model
cannot be a mere ‘copy’ of the Janus data model.
Instead, it must be enhanced to include the
necessary traceability back to the source as well
as a data quality status that allows users and
programmes to filter on ‘approved’ data. The
conversion of the study-data stream to the
normalised CDR does introduce some latency
which must be taken into account, especially if,
for example, titration and safety are reviewed
on the basis of data availability in the CDW.
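The enhanced record structure described above can be sketched as follows. This is a minimal illustration, not a real CDR schema: the field names (source_system, source_record_id, quality_status) and the status values are assumptions chosen to show the traceability and filter-on-'approved' idea.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CdrRecord:
    """One record in the enhanced CDR: clinical value plus traceability fields."""
    study_id: str
    domain: str            # e.g. "AE", "VS"
    value: str
    source_system: str     # assumption: identifies the source the record came from
    source_record_id: str  # assumption: key back to the record in that source
    quality_status: str    # assumption: e.g. "raw", "queried", "approved"

def approved_only(records: List[CdrRecord]) -> List[CdrRecord]:
    """Filter the data stream so in-process review runs only on 'approved' data."""
    return [r for r in records if r.quality_status == "approved"]
```

A user or programme reviewing titration or safety data would apply such a filter first, accepting the latency the text mentions between source entry and availability of approved records.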
The ‘foundation’ CDW includes a set of programmes that read SAS transport files (xpt) in
SDTM standard and Define.xml files and load
the data into the Clinical Data Repository (CDR).
The programmes should handle the current
productive version of SDTM (version 3.1.2) and be
ready for future versions. The programmes should
use define.xml and the company-approved code-list/value-level metadata to verify that the data set
can be loaded correctly and comply with approved
company standards. This functionality is similar to
what is found in the tool that the FDA uses when
checking and loading applicant data files.
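The define.xml check can be sketched as below. The XML fragment is heavily simplified for illustration (real Define-XML files use the CDISC ODM namespace and carry far richer metadata), and the OID naming scheme is an assumption; the point is only that the load programme compares the incoming data set's variables against what define.xml declares before loading.

```python
import xml.etree.ElementTree as ET

# Toy define.xml fragment; real files use the CDISC ODM namespace.
DEFINE_XML = """<ODM><Study><MetaDataVersion>
  <ItemGroupDef OID="IG.DM" Name="DM">
    <ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes"/>
    <ItemRef ItemOID="IT.DM.AGE" Mandatory="No"/>
  </ItemGroupDef>
</MetaDataVersion></Study></ODM>"""

def expected_variables(define_xml: str, domain: str) -> set:
    """Collect the variable names define.xml declares for one domain."""
    root = ET.fromstring(define_xml)
    for ig in root.iter("ItemGroupDef"):
        if ig.get("Name") == domain:
            # Variable name is the last OID component in this toy naming scheme.
            return {ref.get("ItemOID").split(".")[-1] for ref in ig.iter("ItemRef")}
    raise KeyError(f"domain {domain!r} not described in define.xml")

def check_load(columns: set, define_xml: str, domain: str) -> set:
    """Return variables present in the data set but missing from define.xml."""
    return columns - expected_variables(define_xml, domain)
```

A non-empty result from check_load would block the load or raise a finding, mirroring the verification step the FDA's own loading tool performs on applicant files.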
The SDTM load programmes handle new or
proprietary domains by storing all events in
one EVENTS table. All findings are stored in one
FINDINGS table, and all interventions are stored
in one INTERV table. The load programme stores
all additional supplemental qualifiers in a single
SUPPQUAL table. The define.xml metadata
are also loaded into the data model in order to
correctly store historic values such as dictionary
versions as well as trial design definitions and to
specify the link between code variables and
codes. Finally, the ‘foundation’ CDW should
include at least an SDTM data mart that, in a
standardised format, makes data available to
all users and connecting systems, such as a
business intelligence tool.
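The routing of domains into the four generic tables can be sketched as a simple lookup keyed on SDTM's general observation classes. The domain-to-class map below covers only a few standard domains, and the fallback for unlisted (new or proprietary) domains is an assumption; a real load programme would read the class from the study metadata instead.

```python
# Illustrative mapping of SDTM domains to their general observation class,
# which determines the generic CDR table each data set is loaded into.
DOMAIN_CLASS = {
    "AE": "EVENTS", "MH": "EVENTS", "DS": "EVENTS",
    "LB": "FINDINGS", "VS": "FINDINGS", "EG": "FINDINGS",
    "CM": "INTERV", "EX": "INTERV",
}

def target_table(domain: str, is_suppqual: bool = False) -> str:
    """Pick the CDR table an incoming SDTM data set is loaded into."""
    if is_suppqual:
        return "SUPPQUAL"  # all supplemental qualifiers share one table
    # Assumption: unlisted domains default to FINDINGS; in practice the
    # class would come from the loaded define.xml metadata.
    return DOMAIN_CLASS.get(domain, "FINDINGS")
```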
The Value of Dealing with Metadata in the CDW
These outlined features of the CDW are often not
enough to satisfy the average user or justify the
investment to the company as they only maintain
current standards and do very little in terms of
automation. Furthermore, the ‘foundation’ CDW
system only stores the trial metadata as they looked
at the time of collection. It will be necessary
to have different versions available online for
reporting or submission, for example, and perhaps
for exploratory purposes. Without the ability to
dynamically shift clinical and study metadata
during the trial life cycle, there is an enormous risk
that the data will grow stale, thereby reducing the
data warehouse to a storage facility with reduced
value to the users and to the company.
Therefore, it is vital to expand the functionality
of the ‘foundation’ CDW to include a metadata
repository that can organise metadata for clinical
study reporting in order to facilitate creation of
standard programme libraries and study design
and to drive data source mapping.
The metadata facilitate re-usability of programming code, integration of data into a standardised data structure, optimisation of data preparation and reporting, and frontloading of work. As such, metadata play a pivotal role in the drive towards standardisation.
In CDW, there are two types of metadata: clinical
metadata and operational metadata. The former
are defined as all data related to the subject in the
trial and are thereby independent of the trial; the
latter are defined as data that describe the trial
and are therefore specific to a single trial. For
design purposes, it is important to keep in mind
that not all metadata are standards, whereas all standards are metadata. The consequence of this asymmetry is that, besides the metadata covered by SDTM and ADaM, it is necessary to include process metadata for transformation, transport, presentation, QC, study and submission, and business process and control.
To Be Serious about the Drive towards Standardisation
In order to maintain a high level of standardisation and actively pursue frontloading of resources, maintenance of metadata and the preparation of metadata for new studies should be done as early as possible in the trial design process. Preferably, all new study protocols should be based on the metadata library, and any changes necessary to accommodate new trial designs in the metadata library should be made and approved along with the internal approval of the study protocol. This process ensures, first, that all activities that can be front-loaded will be performed and, second, that analysis and reporting can be executed automatically once the study data are loaded.
The Clinical Data Warehouse Operations application includes three modules:
1. A Metadata Module (MMA)
2. A Source Data Mapping Module (SDM)
3. An Administration Module
The first of these modules, the Metadata Module,
maintains the following metadata: clinical
metadata, study metadata, study design, visit
structure, study flow chart, clinical metadata
versions and cross-study metadata.
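The version handling the Metadata Module must provide can be sketched as below. The class and method names are hypothetical; the sketch only shows the core requirement from the text: every approved version of a code-list stays available online, so historic trials can be reported with the version in force when their data were collected.

```python
from typing import Dict, Optional

class CodeListRepository:
    """Minimal sketch of version-aware code-list storage (names are illustrative)."""

    def __init__(self) -> None:
        # code-list name -> version number -> {code: decoded value}
        self._versions: Dict[str, Dict[int, Dict[str, str]]] = {}

    def publish(self, name: str, version: int, codes: Dict[str, str]) -> None:
        """Store an approved version; earlier versions remain retrievable."""
        self._versions.setdefault(name, {})[version] = dict(codes)

    def resolve(self, name: str, code: str, version: Optional[int] = None) -> str:
        """Decode a value; defaults to the latest version when none is given."""
        versions = self._versions[name]
        chosen = max(versions) if version is None else version
        return versions[chosen][code]
```

A reporting programme for a legacy study would pass the dictionary or code-list version recorded in that study's metadata, while cross-study exploration could default to the latest.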
Figure 2: Process and Data Standardisation (metadata governance, study metadata library, database)
Figure 2 shows where study metadata are applied during the clinical study process. It also illustrates that throughout the study life cycle, it is necessary to establish a metadata governance process and define responsibilities clearly in order to preserve the integrity of the clinical data repository. Furthermore, the figure highlights that decisions made at the level of ‘Protocol Design’ impact the downstream task of ‘Statistical Analysis’.
This is why the metadata repository implementation
must be coordinated with a standardisation of
protocol and CRF design. Ideally, the protocol
authoring tool pulls its protocol components from
the Metadata Repository as this will guarantee
consistency between the way data are collected
and the way data are stored and reported.
The Source Data Mapping (SDM) module includes
one mapping design (ETL) for each source from
which the CDW is loading, thereby replacing
several of the ‘foundation’ CDW features described above. How static or dynamic the ETL
needs to be depends on the level of flexibility and
variation in the data source. If data are sourced
from a Clinical Data Management (CDM) system,
there will be variations in how the trials are
defined and structured. In this case, one should
consider making the source data mapping dynamic and letting the SDM handle any necessary
data conversion.
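A dynamic source data mapping can be sketched as a mapping specification that drives the ETL, rather than hard-coded per-study programs. The source field names and spec format below are illustrative assumptions, not a real CDM system's schema; the point is that accommodating a new trial's structure means editing the spec, not writing a new ETL.

```python
# Hypothetical mapping spec: source field -> target SDTM variable + conversion.
MAPPING_SPEC = {
    "SUBJ_ID": {"target": "USUBJID"},
    "VISIT_NO": {"target": "VISITNUM", "convert": int},
}

def apply_mapping(source_row: dict, spec: dict) -> dict:
    """Rename and convert one source record according to the mapping spec."""
    out = {}
    for src_field, rule in spec.items():
        if src_field in source_row:
            value = source_row[src_field]
            convert = rule.get("convert")
            out[rule["target"]] = convert(value) if convert else value
    return out
```

Fields in the source record that the spec does not mention are simply dropped here; a production SDM would more likely flag them for review against the metadata repository.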
The amount of effort spent on the SDM is related to the strategy for migrating legacy data. If it has been decided to migrate some or all historic clinical study data, the company may, in the worst-case scenario, end up having to design one ETL for each study.
A smartly designed SDM will, however, reduce this migration effort. The cost and benefit of the SDM should therefore also be held up against the cost of migrating legacy data and the benefit of having legacy data available in the CDW.
Philip Puls is a Senior Project Manager at NNIT in Zürich, Switzerland. Dr. Niels Buch Leander is a Business Consultant at NNIT in Copenhagen, Denmark. They both specialise in designing and implementing IT solutions for pharmaceutical companies.
In addition to the two modules described above,
the CDW operational application also includes
an administration module that maintains users,
the security system and a centre for managing
load processes.
In this way, the Metadata Management Application
can become the company’s global repository for
clinical trial handling and reporting.
By thus comprehending the full extent of the
drive towards standardisation and the critical
role of metadata, the CDW can be designed in
such a way that it will support the company’s
business aims with a minimum of disruption
during its implementation.
Please contact Frederico Braga, Key Account Manager [email protected] or on +41 794 395 865
to learn more about our services.
NNIT A/S Lottenborgvej 24 DK-2800 Lyngby tel: +45 4442 4242
NNIT Switzerland Bandliweg 20 CH - 8048 Zurich tel: +41 44 405 9090
NNIT Czech Republic Lazecka 568/53A CZ-77900 Olomouc tel: +420 585 204 821
NNIT China 358 Nanjing Rd. CN-Tianjin 300100 tel: +86 (22) 5885 6666
NNIT Philippines 24/F 88 Corporate Center 141 Valero St. Makati City 1227 tel: +63 2 889 0999