Data classification (business intelligence)


In business intelligence, data classification is closely related to data clustering, but where clustering is descriptive, classification is predictive.[1][2] In essence, data classification uses variables with known values to predict the unknown or future values of other variables. It can be applied in, for example, direct marketing, insurance fraud detection, or medical diagnosis.[2]

The first step in data classification is to cluster the data set used for category training, so as to create the desired number of categories. An algorithm, called the classifier, is then applied to the categories, creating a descriptive model for each. These models can then be used to assign new items to the categories of the resulting classification system.[1]
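The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not name a clustering method or a classifier, so k-means and a nearest-centroid classifier are stand-ins chosen here, and the customer records (age, annual spend in thousands) and all function names are hypothetical.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Cluster points into k categories and return one label per point."""
    # Assumption: deterministic start, using the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def train_classifier(points, labels):
    """Build a descriptive model (here simply the mean) for each category."""
    models = {}
    for category in sorted(set(labels)):
        members = [p for p, label in zip(points, labels) if label == category]
        models[category] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return models

def classify(models, item):
    """Assign a new item to the category whose model is closest."""
    return min(models, key=lambda category: dist(item, models[category]))

# Hypothetical training data: (age, annual spend in thousands).
training = [(25, 2), (27, 3), (30, 2.5), (55, 9), (60, 10), (58, 8.5)]
labels = kmeans(training, k=2)
models = train_classifier(training, labels)
new_customer_category = classify(models, (28, 2.8))
```

In this sketch the "descriptive model" of a category is nothing more than its mean; a real classifier would normally build a richer model, such as a decision tree or a set of rules, from the clustered training data.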

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:[1]

  • Predictive accuracy: How well does it predict the categories for new observations?
  • Speed: What is the computational cost of using the classifier?
  • Robustness: How well do the models created perform if data quality is low?
  • Scalability: Does the classifier function efficiently with large amounts of data?
  • Interpretability: Are the results understandable to users?
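The first two measures lend themselves to a direct calculation. The sketch below assumes a held-out set of observations whose true categories are known; the threshold classifier, the data, and the function names are hypothetical and serve only to show the arithmetic.

```python
import time

def predictive_accuracy(classify, observations, true_categories):
    """Fraction of observations whose predicted category matches the true one."""
    correct = sum(
        1 for obs, truth in zip(observations, true_categories)
        if classify(obs) == truth
    )
    return correct / len(observations)

def seconds_per_prediction(classify, observations):
    """Average wall-clock cost of classifying one observation."""
    start = time.perf_counter()
    for obs in observations:
        classify(obs)
    return (time.perf_counter() - start) / len(observations)

# Hypothetical classifier: splits customers on annual spend (in thousands).
def toy_classifier(annual_spend):
    return "high" if annual_spend >= 5.0 else "low"

holdout = [2.0, 3.5, 9.0, 10.0, 4.9]
truth = ["low", "low", "high", "high", "high"]

accuracy = predictive_accuracy(toy_classifier, holdout, truth)  # 4 of 5 correct
cost = seconds_per_prediction(toy_classifier, holdout)
```

Robustness and scalability can be probed with the same two functions by rerunning them on deliberately degraded data or on progressively larger samples; interpretability has no comparable numeric shortcut.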

Typical input variables for data classification include demographics, lifestyle information, and economic behaviour.

Challenges for data classification

There are several challenges in working with data classification. One in particular is that anyone who applies categories to, for example, customers or clients must do the modeling as an iterative process. Otherwise, changes in the characteristics of customer groups can go unnoticed, leaving the existing categories outdated and obsolete.

This can be especially important for insurance and banking companies, where fraud detection is highly relevant. New fraud patterns may go unnoticed unless methods are developed and implemented to monitor these changes and raise an alert when categories change, disappear, or newly emerge.
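One simple way to watch for such changes is to compare how observations are distributed over the categories in a baseline period and in a recent period. The sketch below is an assumption-laden illustration, not a method taken from the cited sources: the category labels, the 10-percentage-point threshold, and the function names are all hypothetical, and the labels are assumed to be comparable between the two periods.

```python
from collections import Counter

def category_shares(labels):
    """Share of observations falling into each category."""
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.items()}

def drift_alerts(baseline_labels, recent_labels, threshold=0.10):
    """Report categories that changed, disappeared, or newly emerged."""
    base = category_shares(baseline_labels)
    recent = category_shares(recent_labels)
    alerts = []
    for category in sorted(set(base) | set(recent)):
        if category not in recent:
            alerts.append((category, "disappeared"))
        elif category not in base:
            alerts.append((category, "new"))
        elif abs(recent[category] - base[category]) > threshold:
            # Assumed rule: alert when the share moves by more than the threshold.
            alerts.append((category, "changed"))
    return alerts

# Hypothetical category assignments for two periods.
baseline = ["A"] * 50 + ["B"] * 40 + ["C"] * 10
recent = ["A"] * 30 + ["B"] * 42 + ["D"] * 28

alerts = drift_alerts(baseline, recent)
# A's share fell from 50% to 30%, C vanished, and D appeared.
```

Running a check of this kind every time the categories are rebuilt is one way to make the iterative process described above concrete.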

References

  1. Golfarelli, M. & Rizzi, S. (2009). Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill Osborne. ISBN 0071610391
  2. Kimball, R. et al. (2008). The Data Warehouse Lifecycle Toolkit (2nd ed.). Wiley. ISBN 0471255475

Wikimedia Foundation. 2010.
