NeoMark: how to predict oral cancer recurrence through multiscale data analysis
Marco Picone, Sebastian Steger, Konstantinos Exarchos, Marco De Fazio, Gianfranco Chiari, Diego Ardigò,
Elena Martinelli
Early prediction of cancer recurrence is a challenge for oncologists and surgeons. In the EU project
NeoMark, scientists from different medical and biological research fields joined efforts with information
technology experts to identify methods and algorithms that can predict, at an early stage, the risk of
recurrence of oral squamous cell carcinoma (OSCC).
The main challenge is to design and develop algorithms able to identify a “signature” or bio-profile of the
disease by integrating multiscale and multivariate data from medical images, genomic profiles from tissue and
circulating-cell RNA, and other medical parameters collected from patients before and after treatment. A limited
number of relevant biomarkers will be identified and used in a real-time PCR device for early detection of
disease recurrence.
The idea behind NeoMark is that, by analyzing a sufficient set of different types of data (clinical, biomedical,
genomic, histological, from digital imaging, from surgery evidence, etc.) of patients affected by OSCC before
treatment and at the time of remission, a set of relevant biomarkers appearing only in presence of the disease
might be identified. The recurrence of the same biomarker phenotype during post-remission follow-up may
precede the clinical manifestation of the relapse, thus allowing earlier intervention.
Figure 1 - NeoMark System overview
The varied user requirements, and especially the integration of heterogeneous input data, required a careful
design of the NeoMark system. Our goal was to integrate as much functionality as possible in a single unified
service-oriented system that is both flexible and usable. These properties increase user acceptance
and may decrease human error. The basic scheme of the implemented system is shown in Fig. 1. The proposed
architecture is a service-oriented architecture (SOA) that uses Web services to ensure
interoperability between different systems. Individual applications work as modules of the
system and give users a single starting point from which they can add new information,
review and edit available data, and run analyses on the stored information. The main module of this
architecture is the data repository located on the NeoMark Server. Several tools interact with this central
component: Data Entry, Genomic Analyses, Image Processing, Data Mining
and Security. Some of these tools have a web-based access point; the others, because of computational
constraints, run on the client’s machine but always interact with the central unit. The
NeoMark System is scalable: new hospitals or centers can easily be added and, after a short
initialization procedure (Sensitive Data database and standalone application), can immediately start storing
and analyzing data. The central repository is called the Integrated Health Record Repository (IHRR);
its purpose is to store the many types of heterogeneous data coming from the different modules and layers of the
system.
In general, any system that handles patient data needs a concept for handling sensitive data in order
to protect the patient’s right to privacy and to prevent data abuse. All data that can reveal the identity of
a patient, by itself or in combination with other data, is considered sensitive. To address this issue,
the NeoMark architecture has a central database for the clinical data (the IHRR) and separate local databases that
store the sensitive data according to each hospital's constraints and restrictions. Each local database is located in
the network of its hospital, so that it is accessible only to local doctors who are authorized to see patients'
personal information and who are connected from within that hospital's network. The interaction between sensitive
information and the NeoMark repository is managed by a dedicated standalone application, which hides this
separation from end users and lets them create, edit and manage sensitive information together with the clinical
data stored in the centralized repository.
Figure 2 - Sensitive Data Management Architecture
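The separation between the hospital-local sensitive data and the central clinical repository can be illustrated with a minimal Python sketch. It is not the NeoMark implementation: the class names, the use of a random pseudonym as the linking key, and the sample values are all assumptions made for illustration, since the text does not specify how the two stores are linked.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LocalSensitiveStore:
    """Hypothetical stand-in for a hospital-local sensitive data database."""
    hospital: str
    _identities: dict = field(default_factory=dict)

    def register(self, name: str, birth_date: str) -> str:
        # Assumption: a random pseudonym is the only value that leaves the hospital.
        pseudonym = uuid.uuid4().hex
        self._identities[pseudonym] = {"name": name, "birth_date": birth_date}
        return pseudonym

    def identity(self, pseudonym: str) -> dict:
        return self._identities[pseudonym]


@dataclass
class CentralRepository:
    """Hypothetical stand-in for the IHRR: clinical data keyed by pseudonym only."""
    _records: dict = field(default_factory=dict)

    def store(self, pseudonym: str, clinical: dict) -> None:
        self._records.setdefault(pseudonym, {}).update(clinical)

    def fetch(self, pseudonym: str) -> dict:
        return dict(self._records.get(pseudonym, {}))


def patient_view(local: LocalSensitiveStore, central: CentralRepository,
                 pseudonym: str) -> dict:
    """Join identity and clinical data on the client side only."""
    return {**local.identity(pseudonym), **central.fetch(pseudonym)}


# Made-up example values, for illustration only.
local = LocalSensitiveStore("Hospital A")
central = CentralRepository()
pid = local.register("Jane Doe", "1960-01-01")
central.store(pid, {"tumor_site": "tongue", "stage": "T2"})
view = patient_view(local, central, pid)
```

In this sketch the central repository never receives the name or birth date; the combined view exists only where both stores are reachable, which mirrors the role the text assigns to the standalone application.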
Most of the user interaction is done via the web interface. The physician can manage patients, enter clinical data,
view all features and the NeoMark results. The clinician can upload genomic data and researchers can view
anonymous statistics, which could serve as a basis for future research on oral cancer. However, there are three
exceptions to this architecture:
The NeoMark Image Processing Tool. This standalone Win64 application is installed on the
radiologist's workstation. It is used to semi-automatically extract relevant features from medical images.
Because of the large volume of imaging data and the computational cost of the image
processing and analysis algorithms, it was not feasible to integrate this functionality into the rest of the
system. However, the tool is connected to the NeoMark system via a network connection. The task of
the feature extraction module is to extract from the imaging data meaningful numeric features
of tumors and suspicious lymph nodes that appear to be important for recurrence prediction.
Whether a feature is really important will eventually be determined by correlation analysis of
each feature with the NeoMark result for a given training set. All images are acquired before treatment
and then every 6 months during follow-up. The high-resolution (1 mm slice thickness) CT images cover
the entire head and neck region, whereas the MR images cover only the tumors and significant lymph
nodes. Before the images can be loaded by the tool, they need to be anonymized to comply with privacy
regulations regarding the handling of sensitive data.
Figure 3 - Image Processing Tool
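The correlation analysis used to judge the importance of each imaging feature can be sketched as follows. This is an illustrative example only: the text does not say which correlation measure is used, so the Pearson coefficient is assumed here, and the feature names and values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(features, outcome):
    """Sort features by absolute correlation with the outcome, strongest first."""
    scores = {name: pearson(values, outcome) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical training set: one value per patient (made-up numbers).
features = {
    "tumor_volume_ml": [2.0, 4.0, 6.0, 8.0],
    "node_count": [1.0, 1.0, 1.0, 1.0],
}
outcome = [0.1, 0.2, 0.3, 0.4]  # stand-in for the NeoMark result per patient
ranking = rank_features(features, outcome)
```

Features at the top of the ranking would be candidates to keep; a constant feature such as the second one above carries no information and falls to the bottom.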
The PCR Chip Upload Tool. This tool downloads genomic features from a PCR chip reader device and
submits them to the NeoMark system. Because it needs direct access to external hardware, this tool could not
be integrated into the web interface; it is a standalone application installed on the clinician's workstation. The
qRT-PCR platform is under development at STMicroelectronics and is designed to provide quantitative
information about the PCR amplification of the targeted genes. It is a portable, real-time, integrated
analytical system based on qRT-PCR performed in an array of silicon micro-chambers. The small size
of the components and the low power requirements make this system a good candidate for
further miniaturization into a hand-held, point-of-care device. The qRT-PCR lab-on-chip is disposable
and relatively inexpensive, which makes this method of analysis economically viable. The excellent
thermal conductivity of silicon makes it ideal in applications requiring rapid cycles of heating and cooling.
Figure 4 - PCR Tool
The Genomic Data Cleaning and Filtering tool. This tool analyzes gene expression
data coming from Feature Extraction (FE) files. The analysis is based on three steps: handling of control and
duplicate features, filtering of genes with low data quality, and filtering of genes with a high number of
missing values. The relevant fields stored in the database are Feature Name,
Probe Name, Gene Name, Systematic Name, Description and Log2-ratio. The application generates as
output a small cleaned file that contains only these relevant fields and that can
be uploaded into the database from a dedicated page of the NeoMark web application.
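The three cleaning steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the NeoMark tool: the row layout, the `is_control` and `quality_ok` flags, the averaging of duplicate features, and the 50% missing-value threshold are all hypothetical, because the text names the steps but not their parameters or the FE file format.

```python
from collections import defaultdict
from statistics import mean

# Descriptive fields kept in the cleaned output (names taken from the text).
KEPT_FIELDS = ("FeatureName", "ProbeName", "GeneName", "SystematicName", "Description")


def clean_features(rows, max_missing_fraction=0.5):
    """Drop controls, merge duplicate probes, and filter poor or missing data."""
    groups = defaultdict(list)
    for row in rows:
        if row["is_control"]:          # step 1a: discard control features
            continue
        groups[row["ProbeName"]].append(row)   # step 1b: group duplicates

    cleaned = []
    for replicates in groups.values():
        # steps 2 and 3: ignore low-quality and missing measurements
        values = [r["Log2Ratio"] for r in replicates
                  if r["quality_ok"] and r["Log2Ratio"] is not None]
        missing_fraction = 1 - len(values) / len(replicates)
        if not values or missing_fraction > max_missing_fraction:
            continue
        merged = {key: replicates[0][key] for key in KEPT_FIELDS}
        merged["Log2Ratio"] = mean(values)     # assumption: duplicates are averaged
        cleaned.append(merged)
    return cleaned


def make_row(probe, log2, control=False, ok=True):
    """Build a made-up input row for the example."""
    return {"FeatureName": f"feat_{probe}", "ProbeName": probe, "GeneName": "GENE",
            "SystematicName": "SYS", "Description": "example", "Log2Ratio": log2,
            "is_control": control, "quality_ok": ok}


rows = [
    make_row("A_1", 1.0), make_row("A_1", 3.0),             # duplicates, averaged
    make_row("CTRL", 0.0, control=True),                    # control, dropped
    make_row("A_2", None),                                  # missing, dropped
    make_row("A_3", 0.5), make_row("A_3", 9.9, ok=False),   # one bad replicate
]
cleaned = clean_features(rows)
```

The output rows contain only the six fields listed in the text, which keeps the cleaned file small.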
The analysis of the heterogeneous data is the cornerstone of the NeoMark artificial intelligence
component. The aim of this component is twofold: i) to assess the risk of recurrence in the very early stages of
treatment, i.e. as soon as the patient reaches remission, and ii) to model the disease
evolution during the whole follow-up period based on a multitude of heterogeneous data, thus monitoring the
patient’s therapeutic progression. As described in the clinical scenario of the NeoMark project, a wide range of
heterogeneous data is collected and analyzed for each patient diagnosed with oral cancer. Specifically,
because of the complex nature of the disease, a holistic approach is taken that integrates a large
body of clinical, imaging and genomic data in order to “frame” every possible aspect related to the onset
and progression of oral cancer. In the present study we employ Dynamic Bayesian Networks (DBNs) to identify
potential relapses of the disease early during follow-up. A snapshot of the patient’s medical
condition is acquired at every predefined follow-up visit with the doctor. By exploiting the history of these
snapshots we aim to model the future progression of the disease. The proposed prognostic model is based
on DBNs, which are temporal extensions of Bayesian Networks (BNs).
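To illustrate how a temporal Bayesian model updates a relapse estimate from successive follow-up snapshots, the following sketch uses the simplest possible DBN: a two-state chain with one binary observation per visit. The structure, the single "biomarker" observation, and every probability below are invented for illustration; the actual NeoMark network, its variables and its parameters are not given in the text.

```python
STATES = ("remission", "relapse")

# All numbers below are made-up illustration values, not NeoMark parameters.
PRIOR = {"remission": 0.9, "relapse": 0.1}
TRANSITION = {  # P(state at next follow-up | state now)
    "remission": {"remission": 0.85, "relapse": 0.15},
    "relapse": {"remission": 0.0, "relapse": 1.0},
}
EMISSION = {  # P(biomarker reading | state)
    "remission": {"negative": 0.8, "positive": 0.2},
    "relapse": {"negative": 0.3, "positive": 0.7},
}


def normalize(weights):
    total = sum(weights.values())
    return {state: value / total for state, value in weights.items()}


def predict(belief):
    """Advance the belief by one follow-up interval (no new evidence)."""
    return {nxt: sum(belief[cur] * TRANSITION[cur][nxt] for cur in STATES)
            for nxt in STATES}


def update(belief, observation):
    """Condition the belief on the snapshot observed at this follow-up."""
    return normalize({s: belief[s] * EMISSION[s][observation] for s in STATES})


def relapse_risk(observations):
    """Filtered P(relapse) after each follow-up snapshot."""
    belief = update(PRIOR, observations[0])
    risks = [belief["relapse"]]
    for observation in observations[1:]:
        belief = update(predict(belief), observation)
        risks.append(belief["relapse"])
    return risks


def forecast(belief, steps):
    """P(relapse) a given number of follow-ups ahead, without new data."""
    for _ in range(steps):
        belief = predict(belief)
    return belief["relapse"]


risks = relapse_risk(["negative", "positive", "positive"])
```

Each new snapshot revises the estimate, and `forecast` gives the relapse probability at a chosen future follow-up, which corresponds to the time-resolved prediction the data analysis component is described as providing.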
We have presented a novel ICT-enabled cancer recurrence prediction method and have described the system
implementing this idea. In addition to the innovation of collecting and jointly interpreting such a
large amount of heterogeneous data, the development of the NeoMark system led to further innovations:
• The data analysis component predicts not only the overall probability of a relapse, but also the
probability at a given time. All predictions are updated when follow-up input data are received.
• For the first time, genomic data obtained from a PCR chip will eventually replace the expensive and
complex laboratory-based genomic data extraction.
• The innovative semi-automatic multimodal image feature extraction algorithms extract imaging
features of tumors and lymph nodes that are well suited for further processing by the data analysis
component because they are numeric and robust.