# Text Analytics Glossary

**Text analytics** is the process of deriving information from text sources. It is used for several purposes, such as: summarization (trying to find the key content across a larger body of information or a single document), sentiment analysis (what is the nature of commentary on an issue), explicative (what is driving that commentary), investigative (what are the particular cases of a specific issue) and classification (what subject or what key content pieces does the text talk about).

## Text Classification & Machine Learning with Support Vector Machines

KMX uses a new class of algorithms called support vector machines, SVM, discovered by professor Vladimir Vapnik. We combine the algorithmic power of SVM with the computational power of large scale / cloud computing to calculate the relationships between many models, allowing the KMX software to reveal insight in the context of many patterns in structured and unstructured information. Hereby we provide a novel approach to obtain insight in large complex data sets without the human expertise or apriori assumptions.

More information in this presentation.

## Text Classification Techniques

Text classification techniques can be divided into *rule based oriented approaches*, *machine learning techniques* and *techniques based on natural language processing*. The first group will only work on simple well defined cases but in general there is the risk of not being able to generalize well for real world data. The last approach requires very detailed knowledge of natural languages which is still a research area on its own and will require detailed input knowledge for each text classification case.

*Machine learning techniques* have proven to be able to produce very good results and in many cases at least equally good or much better then other techniques. It is also know that machine learning techniques generalize very well for many different cases with different requirements. These features are even truer for support vector machines (svm), which are a subset of machine learning techniques. The support vector machine algorithms are discovered by Vladimir Vapnik in 1975 and from 1990-1995 till now there it has become a fast growing research field. Support vector machines algorithms basically focus on descriptive problems such *classification, clustering* and *segmentation* and *predictive problems* such as regression. The algorithms work for multivariate data, such as structured data, table data in databases, and also for text data which is in most cases considered unstructured.

The machine learning task for text classification has a set of requirements which can be characterized by the following properties.

### Large input space

The input space consists of the documents and the words within those documents whereby the text data is representation using the vector space model. For each document a vector representation is used which contains the frequency distribution of all words within the document. The dimensionality of the data is then easy larger than 30.000 words even with strong simplifications.

### Little training data

For most learning algorithms, the required number of training examples to produce accurate classifications scales with the dimensionality of the input space. For accurate text classifiers you normally has fewer training examples then dimensions in the feature spaces and therefore it is important to experiment to determine the optimal text classifier given certain classification requirements.

### Noise

Most documents contain spelling mistakes, words which do not contribute to the information contained in the document. This is considered noise to the machine learning algorithms. It is therefore important to improve the signal to noise ratio of the feature vectors in the pre-processing phase.

### Complex learning tasks

Classification of text is generally based the semantic understanding of natural language by humans whereby the interpretation of the text and the context of the words is very important. Text classifiers must be able to approximate such complex tasks whereby they have to be accurate and robust.

### Computational efficiency

Training text classifiers in a high dimensional input space can be computational difficult and for practical approaches it is important to be able to handle large number of features efficiently.

Support vector machines are currently the only approach which is computational efficient and for which there is a well-defined theory that describes the mechanisms with respect to accuracy and robustness.

More information in this presentation:

## Text mining: Under fitting vs Over fitting the data

Many data mining techniques often suffer from under fitting or over fitting the data. *Under fitting* occurs when the algorithm used has not the capacity to express the variability in the data. *Over fitting* is the opposite case, the algorithm has too much capacity and therefore it also fits noise present in the data. The cause of under and over fitting is *the complexity which the model represents* which determines how much variability in the data the model will express. If too much complexity is allowed, the variability due to noise in the data is modelled as well which causes a model which over fits patterns in the data. If the complexity is too low the models will fail to account for the true variability in the data and therefore under fit the data. You want to use models which represent the optimal complexity and you wants to be able to control it to obtain good generalisation properties. Good generalisation means the model’s ability to predict unseen data based on the known learning data sets. This is the focus of machine learning algorithms, being able to control the accuracy and robustness of the models which are generated by the machine learning algorithms, which represents patterns in the data.

Where classical statistics deal with large sample size problems, statistical learning theory is the first theory which is able to address also small sample learning problems. The complexity of models generated with machine learning algorithms using empirical data is dependent on the sample size. By taking the sampling size into account you can obtain better results than by applying asymptotic results from classical statistics.

Vladimir Vapnik developed in the mid 1980’s the statistical learning theory (SLT), from which a new kind of learning machine, called Support Vector Machine (SVM) was introduced. The basic idea of support vector machines is to determine a classifier or regression machine which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error).

In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error. The primary advantage of SVM as adaptive models for binary classification and regression is that they provide a classifier with minimal VC dimension which implies low expected probability of generalization errors. SVM can be used to classify linearly separable data and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high dimensional feature space. In this high dimensional feature space, linear classification can be performed.

## Benefits from Automated Classification using Visualization

- Representative selection of training documents data by information professional
- A clean training set is used (no noise/outliers) for building a Classifier
- Visual feedback if the Classifier results in a separated section within the frame of reference of the whole dataset.
- Insight into the performance of a Classifier

More information in this presentation: