Business Intelligence Integration
Joel Da Costa, Takudzwa Mabande, Richard Migwalla
Antoine Bagula, Joseph Balikuddembe
Project Description
Business Intelligence (BI) is the practice of using computer software to aid data analysis and
decision making in businesses. It represents a set of processes, tools and technologies which
improve the productivity, sales and service of an enterprise, and so its profitability in general. BI works
primarily by collecting, organizing and analyzing corporate data and then creating useful
knowledge out of this analysis (reporting). BI as a whole incorporates a wide spectrum of software
functions including ad-hoc querying, on-line analytical processing (OLAP), dashboards,
scorecards, search, visualization and more.
BI differentiates itself through its interdepartmental focus and general overview which is geared
towards total business performance. The implementation of BI gives knowledge and understanding
to departmental groups which previously may not have had access to or understanding of the data.
Increased analytics and ad hoc reporting allow organisations to better understand trends within
their business and apply a variety of different measures and attributes to understanding these
trends. Once the BI system has been implemented, a company will typically find it has more ideas
for new initiatives, more efficient and precise data collection processes, more effective marketing
techniques, a better understanding of its customers' needs and characteristics, and a better
understanding of the state of the market. This improved business agility and efficiency through BI
yields a long-term performance gain which can result in significant profit increases.
The BI system itself is typically segmented into several key areas. The first is Business Modelling,
which creates the framework of the system and establishes how information needs to flow. Data
warehouses serve as a centralized repository for all the data gathered, and are maintained through
the 'Extraction, Transformation and Loading' (ETL) processes. OLAP is a technique by which the data
sourced from the data warehouse is summarized and visualized across multiple dimensions in order to
quickly answer multi-dimensional queries. Essentially, OLAP tells a business what has happened, while
data mining explains why it happened and what is likely to happen in the future based on past patterns.
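To make the multi-dimensional summarisation concrete, the following is a minimal Java sketch of an OLAP-style roll-up. It assumes a hypothetical fact row with two dimensions (region, product) and one measure (amount); the class names, fields and sample values are illustrative assumptions and are not taken from the Sanlam data or from any particular OLAP product.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical fact row: one sale described by two dimensions and one measure.
final class SaleFact {
    final String region;   // dimension (assumed for illustration)
    final String product;  // dimension (assumed for illustration)
    final double amount;   // measure (assumed for illustration)

    SaleFact(String region, String product, double amount) {
        this.region = region;
        this.product = product;
        this.amount = amount;
    }
}

final class OlapRollUp {
    // Sum the measure along one dimension; the key function selects the dimension.
    static Map<String, Double> rollUp(List<SaleFact> facts,
                                      Function<SaleFact, String> dimension) {
        return facts.stream().collect(
            Collectors.groupingBy(dimension, TreeMap::new,
                                  Collectors.summingDouble(f -> f.amount)));
    }

    public static void main(String[] args) {
        // Made-up sample rows, used only to exercise the roll-up.
        List<SaleFact> facts = Arrays.asList(
            new SaleFact("North", "Life", 100.0),
            new SaleFact("North", "Funeral", 50.0),
            new SaleFact("South", "Life", 200.0));
        System.out.println(rollUp(facts, f -> f.region));
        System.out.println(rollUp(facts, f -> f.product));
    }
}
```

Rolling the same facts up by a different dimension only requires a different key function, which is the sense in which OLAP offers several perspectives on one data set.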
Problem Statement
The project is going to focus on the underlying technologies which enable Business Intelligence
(BI) and their application to two key scenarios. Previously, various technologies have been
developed and implemented from a “one size fits all” approach, but this approach is likely to result
in less effective and accurate analysis. Different areas and analyses require more adaptability
rather than such a singular approach. Our aim then is to evaluate which technologies would be the
most effective for the particular cases. The technologies being evaluated are Bayesian Belief
Networks, Neural Networks and Artificial Immune Systems, which are expanded on later.
The project will be done in cooperation with Sanlam, who will provide the necessary data.
The first case is to analyse customer data in order to build customer profiles, so that customers
may be targeted with the correct marketing techniques. Doing so would allow an increase in sales as
well as a decrease in the cost of marketing. Using the data provided, the three technologies
will be applied to try to obtain the most accurate customer profiles.
The second case is 'Predictive Sales Forecasting'. Using historical and current data, the same three
technologies will be applied to try to create an accurate forecast of future trends. Forecasting
allows better business decisions to be made and mitigating action to be taken in order to improve the
likely outcome.
While the cases operate individually, both are being implemented with the same aim. What we
want to ascertain from the project results is how each approach's results vary when
measured against the same data and benchmarked against the known sales figures. This will help
define the strengths and weaknesses of the particular technologies in developing BI functions.
Procedures and Methods
This project is primarily designed as a research venture, with the main objective being the
synthesis of usable research results, as per the Problem Statement. The scope does however
extend beyond that, as shall be illustrated through the following breakdown of procedures and
methods.
Implementation will take the form of a Java application. It will make use of three different intelligent
systems to analyse historical data provided by Sanlam, in order to predict the required output.
Rationale for chosen approach
Before considering the actual approach that will be used in addressing the Problem Statement, it is
first necessary to mention the reasoning behind the choice of algorithms. As the research
conducted indicated, industry tends to favour these three algorithms, showing a
distinct preference for Bayesian Belief Networks. Furthermore, as per the initial meeting with the Sanlam
representative, these three algorithms are of particular interest to Sanlam. More detail on industry's
use of these algorithms can be found in the Related Work section.
Thus, the next step is to elaborate on the chosen approach, i.e. the choice to address the problem
in the form of an application. The following points summarize the motivation behind this:
- Application development allows further extensibility: by choosing to develop this project in
  the form of an application, there is more room to generalize and adapt the application,
  making it useful in other spheres of business.
- Extensibility also allows room for improvement. Thus, developing this application allows
  room for continuation, evolution and progress.
- Providing Sanlam with a concrete showing of the results obtained, as well as how they were
  obtained, is also a reason for developing this application.
- Recreating the results of this research experiment will also be made easier given the
  platform of an application.
Development process
Because of the collaborative nature of this project, it is key that the primary stakeholder, i.e.
Sanlam, submits a clear description of its requirements and expectations. For this reason, the
project will involve a Sanlam delegate, as well as the project team. Meetings will be held with the
delegate in order to generate a specific set of user requirements from which the solution can be
developed.
Once the requirements have been finalised, the next phase will be implemented. This will
consist of developing the application, which will model abstractions of the selected intelligent
systems. The application will take various forms of client information (provided by Sanlam) as
input. This information will include elements such as incomes, premiums and purchasing
history, to name a few. Based on this input, the application will then use the embedded intelligent
systems to generate output, offering the user the choice of which algorithm is applied in the
simulation. This functionality will thus allow for comparison of results. The output will be displayed
in a format that is relevant to business users, and a graphical user interface will be implemented as
part of the application. By hiding a significant portion of the underlying technicalities and displaying
only what is relevant to shareholders and other business analysts, the interface will serve its
purpose (more detail on this is provided later).
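One way the user's choice of algorithm could be supported is through a common interface that each technique implements. The sketch below is only an illustration under that assumption: the names IntelligentSystem, MajorityBaseline and TechniqueRegistry are hypothetical, and the baseline class is a placeholder standing in for a real Bayesian Belief Network, Neural Network or Artificial Immune System.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical common contract for the three intelligent systems.
interface IntelligentSystem {
    void train(List<double[]> inputs, List<String> labels);
    String predict(double[] input);
}

// Placeholder technique: always predicts the most frequent training label.
// It only stands in for a real technique so that the sketch can run.
final class MajorityBaseline implements IntelligentSystem {
    private String majority = "unknown";

    @Override
    public void train(List<double[]> inputs, List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int best = 0;
        for (String label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > best) {
                best = count;
                majority = label;
            }
        }
    }

    @Override
    public String predict(double[] input) {
        return majority;
    }
}

final class TechniqueRegistry {
    private final Map<String, IntelligentSystem> techniques = new LinkedHashMap<>();

    void register(String name, IntelligentSystem system) {
        techniques.put(name, system);
    }

    // Look up the technique that the user selected in the interface.
    IntelligentSystem select(String name) {
        IntelligentSystem system = techniques.get(name);
        if (system == null) {
            throw new IllegalArgumentException("Unknown technique: " + name);
        }
        return system;
    }

    public static void main(String[] args) {
        TechniqueRegistry registry = new TechniqueRegistry();
        registry.register("baseline", new MajorityBaseline());
        IntelligentSystem chosen = registry.select("baseline");
        // Made-up training rows and labels.
        chosen.train(Arrays.asList(new double[]{1.0}, new double[]{2.0}, new double[]{3.0}),
                     Arrays.asList("high", "low", "high"));
        System.out.println(chosen.predict(new double[]{4.0}));
    }
}
```

Because every technique is called through the same two methods, the same input can be run through each of them in turn and the outputs compared directly.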
Ethical, Professional and Legal Issues
Ethical Issues
We will be using Sanlam sales and customer data, which is to remain confidential. It may not be
redistributed to any external parties and no personal information may be extracted for use outside
the project. For demonstration purposes, the software may not display personal information that
could lead to the identification of particular individuals. This information may be used in the
generation of results and forecasts, but names will be replaced with IDs where necessary.
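The replacement of names with IDs could be sketched as follows. This is a minimal illustration, assuming an in-memory lookup table and a made-up ID format ("C" followed by five digits); the class name Pseudonymiser is hypothetical. The lookup table itself would still be confidential and would have to remain within the project.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class Pseudonymiser {
    // Maps each real name to its opaque ID; this table must itself stay private.
    private final Map<String, String> idsByName = new HashMap<>();

    // Return a stable ID for a customer name: the same name always yields the same ID.
    String idFor(String customerName) {
        String id = idsByName.get(customerName);
        if (id == null) {
            id = String.format(Locale.ROOT, "C%05d", idsByName.size() + 1);
            idsByName.put(customerName, id);
        }
        return id;
    }

    public static void main(String[] args) {
        Pseudonymiser pseudonymiser = new Pseudonymiser();
        // Made-up names, used only to exercise the mapping.
        System.out.println(pseudonymiser.idFor("Alice Example"));
        System.out.println(pseudonymiser.idFor("Bob Example"));
        System.out.println(pseudonymiser.idFor("Alice Example"));
    }
}
```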
Legal Issues
All sales and customer data from Sanlam must be kept private within the confines of the project. Any
copies of the database must be deleted once testing is completed and may not be archived outside
of Sanlam. No copies of the database may be created for use outside the project for any purpose.
Related Work
Customer Profiling
Sebastiani et al. used Bayesian Networks to profile customers in order to predict profits. They used
two networks: the first to describe the probability of response from customers, and the second to
model price factors. The results were reasonable, and by capturing the characteristics of
customers, the models can help to increase profits [1].
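The basic inference step behind a response-probability model of this kind is Bayes' rule. The sketch below is not the model of Sebastiani et al.; it is a two-outcome illustration with hypothetical parameter names, showing how a prior probability of response is updated once a customer characteristic is observed.

```java
final class BayesRule {
    // P(H | E) = P(E | H) * P(H) / P(E), where P(E) is expanded
    // with the law of total probability over H and not-H.
    static double posterior(double prior, double likelihoodGivenH, double likelihoodGivenNotH) {
        double evidence = likelihoodGivenH * prior + likelihoodGivenNotH * (1.0 - prior);
        if (evidence == 0.0) {
            throw new IllegalArgumentException("The observed evidence has zero probability");
        }
        return likelihoodGivenH * prior / evidence;
    }

    public static void main(String[] args) {
        // Made-up numbers: 10% of customers respond; the observed characteristic is seen
        // in 80% of responders and in 20% of non-responders.
        System.out.println(posterior(0.10, 0.80, 0.20));
    }
}
```

A Bayesian network chains many such updates together along the edges of a graph of variables, so this single step should be read as the smallest building block rather than as a full network.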
Similar work has been done by Elalfi et al., who combined neural networks with genetic
algorithms. An algorithm was used to extract accurate and comprehensible rules from a database
using trained artificial neural networks, which in turn were trained by genetic algorithms to find the
optimal values for the model. These rules were then used to define customer profiles in order to
make e-business more profitable [2].
Customer life cycles
Baesens et al. introduce a measure of a customer's future spending evolution that might improve
relationship marketing decision making. The suggested method predicts whether a customer will
increase or decrease spending based on their initial purchase information. It achieved 75% classification
accuracy in predicting the customer lifecycle using purchase volume and purchase category [3].
Repeat Purchase Modeling
Baesens et al. focus on the need for companies such as mail-order companies to identify which
customers are most likely to purchase before they send out costly catalogues. This involves
profiling customers according to several parameters and calculating the probability of repurchase.
A Bayesian Neural Network was used and had a correct classification result of 71% given the data
set used [4].
Modelling Customer Attitudes
Ishigaki et al. use Bayesian networks to model customer attitudes based on questionnaire data.
The model can then be used to gauge customers' feelings towards a product, and how they should
be marketed to. The model was fairly successful with a 73.5% success rate on testing [5].
Sales Forecasting
Recently, Chang et al. extended the idea of sales forecasting by including clustering in the
model. The K-means technique is used to cluster the data, which is then fed to a fuzzy neural
network that, once trained, can generate sales forecasts. The model proved very effective in
providing accurate forecasts, and was more accurate than a series of other models it was tested
against [6].
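The clustering step can be illustrated with a small K-means sketch. It is not the implementation used by Chang et al.: it handles one-dimensional data only, takes the initial centroids from the caller so that the result is deterministic, and omits the fuzzy neural network entirely. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class KMeans1D {
    // Lloyd's algorithm on one-dimensional data, starting from caller-supplied centroids.
    static double[] cluster(double[] values, double[] initialCentroids, int maxIterations) {
        double[] centroids = Arrays.copyOf(initialCentroids, initialCentroids.length);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assignment step: each value joins its nearest centroid.
            for (double value : values) {
                int k = nearest(centroids, value);
                sums[k] += value;
                counts[k]++;
            }
            // Update step: each centroid moves to the mean of its members.
            boolean changed = false;
            for (int k = 0; k < centroids.length; k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double mean = sums[k] / counts[k];
                if (mean != centroids[k]) {
                    centroids[k] = mean;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        return centroids;
    }

    // Index of the centroid closest to the value (the first one wins on ties).
    static int nearest(double[] centroids, double value) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (Math.abs(value - centroids[k]) < Math.abs(value - centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up values forming two obvious groups.
        double[] values = {1, 2, 3, 10, 11, 12};
        System.out.println(Arrays.toString(cluster(values, new double[]{0, 5}, 10)));
    }
}
```

In a pipeline of the kind described above, the cluster index assigned to each record would then become an additional input to the forecasting model.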
Anticipated Outcomes
We will create a package that will read in data from the Sanlam database, use different machine
learning techniques to profile customers and compare the accuracy of the different techniques
using actual data.
The software will be composed of:
- An interface to the database that will read in relevant data.
- The core of the program that will contain three different Intelligent System techniques that a
  user can utilize.
- The front-end interface that will give the user results of the classification comparing actual
  data to inferred information.
The major component will be the implementation of the different techniques. However, it will still be
important to have good interfaces with the database and the user. The user interface will need to
display interpretable information on the performance of each technique, which will entail
aggregating the results in a way that a user can quickly and easily understand. It will also need to
allow parameters to be changed so that each technique can be optimised for particular data sets.
Expected Impact
We expect to identify the best machine learning technique to use for customer profiling and sales
forecasting for Sanlam in particular. From our initial investigation it seems that Bayesian networks
are very good classifiers (useful in customer profiling) and neural networks are very good
forecasters. The performance of each technique, however, is highly dependent on the task, data
and results required. This may mean that the performance results in Sanlam's case will not
necessarily match the results for other organisations or companies.
Key Success Factors
The results of the simulations will need to be compared to existing data of what the simulations are
trying to predict. The comparisons will be used to rank each technique according to accuracy of its
results. All simulations will be expected to complete within an acceptable time frame (performance
and scalability are out of scope for this project, but each implementation will need to run within a
predetermined acceptable time, so that performance plays no part in determining the best technique
to use).
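The comparison and ranking described above could be computed along the following lines. This is a sketch under our own assumptions: classification accuracy is used for the customer profiling case and mean absolute percentage error (MAPE) for the forecasting case; neither metric has been fixed by the project, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TechniqueEvaluation {
    // Fraction of predicted profile labels that match the known labels.
    static double accuracy(String[] predicted, String[] actual) {
        requireSameLength(predicted.length, actual.length);
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i].equals(actual[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    // Mean absolute percentage error of forecasts against known sales figures.
    // Lower is better; every actual value is assumed to be non-zero.
    static double mape(double[] forecast, double[] actual) {
        requireSameLength(forecast.length, actual.length);
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs((actual[i] - forecast[i]) / actual[i]);
        }
        return 100.0 * total / actual.length;
    }

    // Technique names ordered from the highest score to the lowest.
    static List<String> rankDescending(Map<String, Double> scores) {
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort(Comparator.comparing((String name) -> scores.get(name)).reversed());
        return names;
    }

    private static void requireSameLength(int a, int b) {
        if (a != b || a == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Made-up accuracy scores, used only to show the ranking step.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Bayesian Belief Network", 0.80);
        scores.put("Neural Network", 0.70);
        scores.put("Artificial Immune System", 0.75);
        System.out.println(rankDescending(scores));
    }
}
```

For an error measure such as MAPE, where lower is better, the ranking would be reversed or the scores negated before ranking.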
Project Plan
Risk Management
The risks that follow are to be evaluated based on the following risk matrix.
The following table gives a breakdown of the predicted risks associated with this project, paying
special attention to their impact and probability. It also highlights two courses of action: avoidance,
which is an ongoing process, and mitigation, should the risk materialise.
Risk: Loss of a project team member. (This would occur if one or more members abandoned the
Honours Programme for any number of reasons.)
Rating: D. Serious
Avoidance: Pressure to stay on the project, as failure to do so means not graduating.
Mitigation: Have sufficiently independent deliverable modules for each team member.

Risk: Delay in delivery of test data. (Dependent on Sanlam for data; external factor.)
Avoidance: Pressure Sanlam to provide data as soon as possible.
Mitigation: Create random test data or use alternative available data.

Risk: Scope creep. (Plan too many tasks; cannot complete tasks in time.)
Rating: E. Marginal
Avoidance: Project planned in detail with supervisor and department approval.
Mitigation: Start with fundamental features first and leave other things to the end.

Risk: Data loss due to hardware failure. (External factor.)
Rating: C. Serious
Avoidance: Frequent backups of all progress on different machines or storage devices.
Mitigation: Roll back to last backup.

Risk: Missing project deadlines.
Rating: C. Serious
Avoidance: Constant reference to the project timeline and clear communication between project members.
Mitigation: Review and reassess deadlines, readjusting where necessary, as cost-effectively as possible.

Risk: Misunderstanding user requirements. (Resultant of ambiguity in user-team communication.)
Rating: D. Serious
Avoidance: Constant communication with Sanlam to maintain correct direction. Also, providing Sanlam
with project plan and design in order to detect flaws.
Mitigation: Iterations through development so that inconsistencies can be detected early.
Timeline & Gantt Chart
Resources Required
The resources required to complete the project are fairly standard, with the software and
equipment in the Honours Lab sufficing for development. Beyond this, Joseph
Balikuddembe is needed as a representative of Sanlam and as co-supervisor for the project.
Furthermore, the data regarding customers and sales that Sanlam will provide is crucial to the
project development.
Necessary Resources:
Sanlam Database Access
Java Development Platform
The following table gives a detailed list of the deliverables necessary for the completion of this
project.
Final Project Proposal: Final copy of proposal for evaluation.
Project Proposal Presentation: Presentation to supervisor and class.
Project Web Presence: Online availability of proposal and project timeline.
Project Poster: Poster representation of project.
Project Web Page: Open availability of project webpage.
Project Report: A report on the results of the research.
Project Application: The actual project.
Further detail on dates is given in the Milestones.
Literature Synthesis
Project Proposal
Project Proposal Presentations
Finalized Project Proposal
Project Web Presence
Background/Theory Chapter
Design Chapter
Database interface setup
Customer Profiling BI Techniques
Sales Forecasting BI Techniques
First Implementation
Final Prototype
Chapters on Implementation and Testing
Outline of complete report
Final Complete Draft of Report
Web Page
Reflection Paper
Project Demonstrations
Final Project Presentations
Mon 3-May
Wed 12-May
Mon 17-May
Mon 31-May
Tue 1-Jun
Fri 4-Jun
Fri 22-Jun
Mon 6-Jul
Tue 20-Jun
Tue 24-Aug
Tue 14-Sep
Fri 17-Sep
Mon 20-Sep
Wed 29-Sep
Mon 4-Oct
Mon 11-Oct
Mon 25-Oct
Thu 04-Nov
Mon 8-Nov
Fri 12-Nov
Wed 03-Nov
Thu 18-Nov
Work Allocation
Joel Da Costa will implement the Bayesian Belief Networks algorithm for the two cases.
Additionally, he will handle the necessary implementation of connecting to, or drawing data from
the Sanlam database.
Takudzwa Mabande will implement the Neural Networks algorithm for the two cases. Additionally,
he will handle the usage of data for the Sales Forecasting case, as well as the output visualization
for both cases.
Richard Migwalla will implement the Artificial Immune Systems algorithm for the two cases.
Additionally, he will handle the usage of data for the Customer Profiling case, as well as the
general GUI implementation.
References
[1] Sebastiani, P., Ramoni, M. and Crea, A. Profiling your Customers using Bayesian Networks. SIGKDD
Explorations 1(2), 91-97.
[2] Elalfi, A., Haque, R. and Elalami, M. Extracting rules from trained neural network using GA for
managing E-business. Applied Soft Computing 4, 65-77.
[3] Baesens, B., Verstraeten, G., Van Den Poel, D., Egmont-Petersen, M., Van Kenhove, P. and
Vanthienen, J. 2004. Bayesian network classifiers for identifying the slope of the customer lifecycle
of long-life customers. European Journal of Operational Research 156, 508-523.
[4] Baesens, B., Viaene, S., Van Den Poel, D., Vanthienen, J. and Dedene, G. 2002. Bayesian
neural network learning for repeat purchase modelling in direct marketing. European Journal of
Operational Research 138, 191-211.
[5] Ishigaki, T., Motomura, Y., Dohi, M., Kouchi, M. and Mochimaru, M. Knowledge Extraction by
Probabilistic Cognitive Structure Modeling Using a Bayesian Network for Use by a Retail Service.
MEDES, October 2009, 141-149.
[6] Chang, P., Liu, C. and Fan, C. Data clustering and fuzzy neural network for sales forecasting: A case
study in printed circuit board industry. Knowledge-Based Systems 22, 344-355.