BIG DATA TECHNOLOGIES
There are a growing number of technologies used to aggregate, manipulate, manage and analyze big data. We have detailed some of the more prominent technologies but this list is not exhaustive, especially as more technologies continue to be developed to support big data techniques.
Big Table: Proprietary distributed database system built on the Google File System. Inspiration for HBase.
Business intelligence (BI): A type of application software designed to report, analyze and present data. BI tools are often used to read data that have been stored in a data warehouse or data mart. BI tools can also be used to create standard reports that are generated on a periodic basis, or to display information on real-time management dashboards, i.e. integrated displays of metrics that measure the performance of a system.
Cassandra: An open source (free) database management system designed to handle huge amounts of data on a distributed system. This system was originally developed at Facebook and is now managed as a project of the Apache Software Foundation.
Cloud computing: A computing paradigm in which highly scalable computing resources, often configured as a distributed system, are provided as a service through a network.
Data mart: Subset of a data warehouse, used to provide data to users usually through business intelligence tools.
Data warehouse: Specialized database optimized for reporting, often used for storing large amounts of structured data. Data is uploaded using ETL (extract, transform and load) tools from operational data stores, and reports are often generated using business intelligence tools.
Distributed system: Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Benefits of a distributed system include higher performance at a lower cost (because a cluster of lower-end computers can be less expensive than a single higher-end computer), higher reliability (because there is no single point of failure) and greater scalability (because the power of a distributed system can be increased by simply adding more nodes rather than completely replacing a central computer).
Dynamo: Proprietary distributed data storage system developed by Amazon.
Extract, transform, and load (ETL): Software tools used to extract data from outside sources, transform them to fit operational needs, and load them into a database or data warehouse.
Google File System: Proprietary distributed file system developed by Google; part of the inspiration for Hadoop.
Hadoop: An open source (free) software framework for processing huge datasets on certain kinds of problems on a distributed system. Its development was inspired by Google’s MapReduce and Google File System. It was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation.
HBase: An open source (free), distributed, non-relational database modelled on Google’s Big Table. It was originally developed by Powerset and is now managed as a project of the Apache Software Foundation as part of Hadoop.
MapReduce: A software framework introduced by Google for processing huge datasets on certain kinds of problems on a distributed system. Also implemented in Hadoop.
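The programming model behind MapReduce can be illustrated with a small word count. This is only a single-process sketch of the map, shuffle and reduce steps, not Google's or Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real distributed framework, the map and reduce calls would run in parallel on different nodes; here they run one after another purely to show the data flow.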
Mashup: An application that uses and combines data presentation or functionality from two or more sources to create new services. These applications are often made available on the Web, and frequently use data accessed through open application programming interfaces or from open data sources.
Metadata: Data that describe the content and context of data files (e.g. means of creation, purpose, time and date of creation, and author).
Non-relational database: A database that does not store data in tables (rows and columns). Contrast with relational database.
R: An open source (free) programming language and software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for developing statistical software and is widely used for statistical software development and data analysis. R is part of the GNU Project, a collaboration that supports open source projects.
Relational database: A database made up of a collection of tables (relations), i.e. data is stored in rows and columns. Relational database management systems (RDBMS) store a type of structured data. SQL is the most widely used language for managing relational databases.
Semi-structured data: Data that do not conform to fixed fields but contain tags and other markers to separate data elements. Examples of semi-structured data include XML or HTML-tagged text. Contrast with structured data and unstructured data.
SQL: Originally an acronym for structured query language, SQL is a computer language designed for managing data in relational databases. The language includes the ability to insert, query, update and delete data, as well as to manage data schemas (database structures) and control access to data in the database.
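The basic SQL operations can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `customers` table, its columns and the `demo_sql` function are hypothetical and exist only for this illustration. SQLite does not implement user-level access control, so that part of SQL is not shown.

```python
import sqlite3


def demo_sql():
    """Create a schema, then insert, update, delete and query rows."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Schema management: define the table structure.
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # Insert two rows using parameter placeholders.
    conn.executemany(
        "INSERT INTO customers (name, city) VALUES (?, ?)",
        [("Ada", "London"), ("Grace", "New York")],
    )
    # Update one row and delete another.
    conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Paris", "Ada"))
    conn.execute("DELETE FROM customers WHERE name = ?", ("Grace",))
    # Query what is left.
    rows = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
    conn.close()
    return rows
```

Calling `demo_sql()` returns the single remaining row, `[("Ada", "Paris")]`.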
Stream processing: Technologies designed to process large real-time streams of event data. Stream processing enables applications such as algorithmic trading in financial services, RFID event processing applications, fraud detection, process monitoring and location-based services in telecommunications. Also known as event stream processing.
Structured data: Data that reside in fixed fields. Examples of structured data include relational databases or data in spreadsheets. Contrast with semi-structured data and unstructured data.
Unstructured data: Data that do not reside in fixed fields. Examples include free-form text (e.g. books, articles, body of e-mail messages), untagged audio, image and video data. Contrast with structured data and semi-structured data.
Visualization: Technologies used for creating images, diagrams, or animations to communicate a message; they are often used to synthesize the results of big data analyses (see the Big Data Visualization section for examples).
TECHNIQUES FOR ANALYZING BIG DATA
There are many techniques that draw on disciplines such as statistics and computer science (particularly machine learning) that can be used to analyse datasets. We provide an illustrative list of some categories of techniques applicable across a range of industries. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need to analyse new combinations of data. We note that not all of these techniques strictly require the use of big data; some of them can be applied effectively to smaller datasets (e.g. A/B testing, regression analysis). However, all of the techniques we list here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.
A/B testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e. changes) will improve a given objective variable (e.g. marketing response rate). This technique is also known as split testing or bucket testing. An example application is determining what copy text, layouts, images or colours will improve conversion rates on an e-commerce Web site. Big data enables huge numbers of tests to be executed and analysed, ensuring that groups are of sufficient size to detect meaningful (i.e. statistically significant) differences between the control and treatment groups (see statistics). When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modelling, is often called “A/B/N” testing.
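One common way to check whether a control and a treatment group differ significantly is a two-proportion z-test on their conversion rates. The sketch below assumes large samples and a simple pooled-variance test; the function name and arguments are hypothetical, and other tests may suit a given experiment better.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns the z statistic and the two-sided p-value under the
    normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control conversion rate with a 15% treatment rate; a small p-value suggests the difference is unlikely to be due to chance alone.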
Association rule learning: A set of techniques for discovering interesting relationships, i.e. “association rules”, among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. One application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer). Used for data mining.
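Two measures commonly used to test a candidate rule are its support (how often the items appear together) and its confidence (how often the consequent appears given the antecedent). The sketch below computes both for a list of baskets; the function names and the toy data in the usage note are hypothetical, and it does not implement a full rule-generation algorithm.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is untestable
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base
```

With the baskets `[{"diapers", "beer"}, {"diapers", "beer", "milk"}, {"diapers", "milk"}, {"milk"}]`, the rule "diapers implies beer" has support 0.5 and confidence of about 0.67.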
Classification: A set of techniques to identify the categories in which new data points belong, based on a training set containing data points that have already been categorized. One application is the prediction of segment-specific customer behaviour (e.g. buying decisions, churn rate, consumption rate) where there is a clear hypothesis or objective outcome. These techniques are often described as supervised learning because of the existence of a training set; they stand in contrast to cluster analysis, a type of unsupervised learning. Used for data mining.
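As one illustration of classification, a k-nearest-neighbours classifier assigns a new point the most common label among its closest training points. This is a minimal sketch of one of many classification methods; `knn_classify` and its parameters are hypothetical names, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, point, k=3):
    """Label `point` by majority vote among its k nearest training points.

    `training` is a list of (features, label) pairs, where features is a
    tuple of numbers with the same length as `point`.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The training set plays the role described above: every point in it has already been categorized, and new points are categorized by comparison with it.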
Cluster analysis: A statistical method for classifying objects that splits a diverse group into smaller groups of similar objects, whose characteristics of similarity are not known in advance. An example of cluster analysis is segmenting consumers into self-similar groups for targeted marketing. This is a type of unsupervised learning because training data are not used. This technique is in contrast to classification, a type of supervised learning. Used for data mining.
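One widely used clustering method is k-means, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below is a bare-bones version under simplifying assumptions: it naively uses the first k points as starting centroids and runs a fixed number of iterations. The function name is hypothetical.

```python
import math


def k_means(points, k, iterations=20):
    """Group `points` (tuples of numbers) into k clusters.

    Returns (centroids, clusters). Initialisation is naive: the first
    k points are used as the starting centroids.
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters
```

No labels are supplied at any point, which is what makes this an unsupervised technique; results can depend heavily on the starting centroids and on the choice of k.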
Crowdsourcing: A technique for collecting data submitted by a large group of people or community (i.e. the “crowd”) through an open call, usually through networked media such as the Web. This is a type of mass collaboration and an instance of using Web 2.0.
Data fusion and data integration: A set of techniques that integrate and analyse data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analysing a single source of data. Signal processing techniques can be used to implement some types of data fusion. One example of an application is sensor data from the Internet of Things being combined to develop an integrated perspective on the performance of a complex distributed system such as an oil refinery. Data from social media, analysed by natural language processing, can be combined with real-time sales data, in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behaviour.
Data mining: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include association rule learning, cluster analysis, classification and regression. Applications include mining customer data to determine segments more likely to respond to an offer, mining human resources data to identify characteristics of most successful employees, or market basket analysis to model the purchase behaviour of customers.
Ensemble learning: Using multiple predictive models (each developed using statistics and/or machine learning) to obtain better predictive performance than could be obtained from any of the constituent models. This is a type of supervised learning.
Genetic algorithms: A technique used for optimisation that is inspired by the process of natural evolution or “survival of the fittest.” In this technique, potential solutions are encoded as “chromosomes” that can combine and mutate. These individual chromosomes are selected for survival within a modelled “environment” that determines the fitness or performance of each individual in the population. Often described as a type of “evolutionary algorithm”, these algorithms are well-suited for solving non-linear problems. Examples of applications include improving job scheduling in manufacturing and optimizing the performance of an investment portfolio.
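The selection, combination and mutation steps can be sketched on a toy problem: evolving a bit string towards all ones. The fitness function, parameter values and function names here are hypothetical choices for illustration, and real applications would encode a domain-specific solution and objective instead.

```python
import random


def fitness(chromosome):
    """Toy 'environment': fitness is simply the number of 1 bits."""
    return sum(chromosome)


def evolve(length=20, population_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Evolve bit-string chromosomes and return the fittest one found."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population survives unchanged.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)        # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.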
Machine learning: A subspecialty of computer science (within a field historically called “artificial intelligence”) concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. Natural language processing is an example of machine learning.
Natural language processing (NLP): A set of techniques from a subspecialty of computer science that uses computer algorithms to analyse human (natural) language. Many NLP techniques are types of machine learning. One application of NLP is using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.
Neural networks: Computational models, inspired by the structure and workings of biological neural networks (i.e. the cells and connections within a brain), that find patterns in data. Neural networks are well-suited for finding non-linear patterns. They can be used for pattern recognition and optimisation. Some neural network applications involve supervised learning and others involve unsupervised learning. Examples of applications include identifying high-value customers that are at risk of leaving a particular company and identifying fraudulent insurance claims.
Network analysis: A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a company or organization are analysed (e.g. how information travels, or who has the most influence over whom). Examples of applications include identifying key opinion leaders to target for marketing, and identifying bottlenecks in enterprise information flows.
Optimization: A portfolio of numerical techniques used to redesign complex systems and processes to improve their performance according to one or more objective measures (e.g. cost, speed or reliability). Examples of applications include improving operational processes such as scheduling, routing and floor layout, and making strategic decisions such as product range strategy, linked investment analysis and R&D portfolio strategy. Genetic algorithms are an example of an optimization technique.
Pattern recognition: A set of machine learning techniques that assign some sort of output value (or label) to a given input value (or instance) according to a specific algorithm. Classification techniques are an example.
Predictive modelling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome. An example of an application in customer relationship management is the use of predictive models to estimate the likelihood that a customer will “churn” (i.e. change providers) or the likelihood that a customer can be cross-sold another product. Regression is one example of the many predictive modelling techniques.
Regression: A set of statistical techniques to determine how the value of the dependent variable changes when one or more independent variables are modified. Often used for forecasting or prediction. Examples of applications include forecasting sales volumes based on various market and economic variables or determining what measurable manufacturing parameters most influence customer satisfaction. Used for data mining.
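The simplest case, ordinary least squares with a single independent variable, can be sketched as follows. The function name is hypothetical, and the sketch assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The slope estimates how much the dependent variable changes per unit change in the independent variable, and the fitted line can then be used to forecast values for new inputs.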
Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material. Key aspects of these analyses include identifying the feature, aspect, or product about which a sentiment is being expressed, and determining the type, “polarity” (i.e. positive, negative, or neutral) and the degree and strength of the sentiment. Examples of applications include companies applying sentiment analysis to analyse social media (e.g. blogs, microblogs, and social networks) to determine how different customer segments and stakeholders are reacting to their products and actions.
Signal processing: A set of techniques from electrical engineering and applied mathematics originally developed to analyse discrete and continuous signals, i.e. representations of analog physical quantities (even if represented digitally) such as radio signals, sounds and images. This category includes techniques from signal detection theory, which quantifies the ability to discern between signal and noise. Sample applications include modelling for time-series analysis or implementing data fusion to determine a more precise reading by combining data from a set of less precise data sources (i.e. extracting the signal from the noise).
Spatial analysis: A set of techniques, some applied from statistics, which analyse the topological, geometric or geographic properties encoded in a data set. Often the data for spatial analysis come from geographic information systems (GIS) that capture data including location information (e.g. addresses or latitude/longitude coordinates). Examples of applications include the incorporation of spatial data into spatial regressions (e.g. how is consumer willingness to purchase a product correlated with location?) or simulations (e.g. how would a manufacturing supply chain network perform with sites in different locations?).
Supervised learning: The set of machine learning techniques that infer a function or relationship from a set of training data. Examples include classification and support vector machines. This is different from unsupervised learning.
Simulation: Modelling the behaviour of complex systems, often used for forecasting, predicting and scenario planning. Monte Carlo simulations, for example, are a class of algorithms that rely on repeated random sampling (i.e. running thousands of simulations, each based on different assumptions). The result is a histogram that gives a probability distribution of outcomes. One application is assessing the likelihood of meeting financial targets given uncertainties about the success of various initiatives.
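The financial-target application can be sketched as a small Monte Carlo simulation. The model is deliberately simplistic and entirely hypothetical: each initiative either succeeds with a given probability and delivers a fixed payoff, or delivers nothing, and initiatives are assumed to be independent. Instead of a full histogram, this version reports only the share of runs that met the target.

```python
import random


def probability_of_meeting_target(initiatives, target, trials=10_000, seed=0):
    """Estimate the chance that total payoff reaches `target`.

    `initiatives` is a list of (success_probability, payoff) pairs.
    """
    rng = random.Random(seed)  # seeded so that results are repeatable
    hits = 0
    for _ in range(trials):
        # One simulated scenario: sample success or failure per initiative.
        total = sum(payoff for prob, payoff in initiatives
                    if rng.random() < prob)
        if total >= target:
            hits += 1
    return hits / trials
```

Increasing `trials` narrows the sampling error of the estimate at the cost of more computation.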
Time series analysis: Set of techniques from both statistics and signal processing for analysing sequences of data points, representing values at successive times, to extract meaningful characteristics from the data. Examples of time series include the hourly value of a stock market index or the number of patients diagnosed with a given condition every day. Time series forecasting is the use of a model to predict future values of a time series based on known past values of the same or other series. Some of these techniques, e.g. structural modelling, decompose a series into trend, seasonal, and residual components, which can be useful for identifying cyclical patterns in the data. Examples of applications include forecasting sales figures, or predicting the number of people who will be diagnosed with an infectious disease.
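A very simple way to estimate the trend component of a series is a moving average, which smooths out short-term fluctuations. This is only a sketch of one basic smoothing step, not structural modelling itself; the function name is hypothetical.

```python
def moving_average(series, window):
    """Return the mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]
```

For example, `moving_average([1, 2, 3, 4, 5], 3)` returns `[2.0, 3.0, 4.0]`. Subtracting such a trend estimate from the original series leaves the seasonal and residual parts for further analysis.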
Unsupervised learning: A set of machine learning techniques that find hidden structure in unlabelled data. Cluster analysis is an example of unsupervised learning (in contrast to supervised learning).
Visualization: Techniques used for creating images, diagrams, or animations to communicate, understand and improve the results of big data analyses.
BIG DATA VISUALIZATION
Clustergram: A visualization technique used for cluster analysis that displays how individual members of a dataset are assigned to clusters as the number of clusters increases. The choice of the number of clusters is an important parameter in cluster analysis. This technique enables the analyst to reach a better understanding of how the results of clustering vary with different numbers of clusters.
History flow: A visualization technique that charts the evolution of a document as it is edited by multiple contributing authors.
Spatial information flow: A visualization technique that depicts spatial information flows.
Tag Cloud: A weighted visual list, in which words that appear most frequently are larger and words that appear less frequently are smaller. This type of visualization helps the reader to quickly perceive the most salient concepts in a large body of text.
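The weighting behind a tag cloud can be sketched by counting words and scaling each count linearly to a font size. The function name, the size range and the whitespace-based tokenisation are hypothetical simplifications; a real tag cloud would usually also strip punctuation and common stop words.

```python
from collections import Counter


def tag_cloud_sizes(text, min_size=10, max_size=40):
    """Map each word to a font size proportional to its frequency."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {word: min_size + (count - low) * (max_size - min_size) / span
            for word, count in counts.items()}
```

For the text `"data data data big big table"`, the most frequent word ("data") gets the largest size and the least frequent ("table") the smallest.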