Knowledge base (FAQ)
Q: How does the Clustering in KMX work?
The main concept behind the KMX clustering capabilities is that, after the text pre-processing phase, the KMX software:
- makes a vector representation of each document (based on weighted feature extraction, which is language independent), and
- calculates a similarity metric between all documents. This is still in a high-dimensional space (the dimension equals the number of features and thus the vector length). The similarity in high-dimensional space is then mapped onto a 2D plane and visualized.
These two steps are essential for the high quality of the clustering and thus for finding (by brushing or selecting visually) good training documents for building good classifiers.
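The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not KMX code: the function names, the toy documents, and the choice of plain TF-IDF weighting with cosine similarity are assumptions made for the example. The projection onto the 2D plane is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise similarity between all documents (still high dimensional)."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Hypothetical, already tokenized toy documents.
docs = [
    "solar cell panel efficiency".split(),
    "solar panel coating efficiency".split(),
    "gene sequence protein expression".split(),
]
vocab, vectors = tfidf_vectors(docs)
sim = similarity_matrix(vectors)
```

In this toy set the two solar documents share terms and therefore end up more similar to each other than to the biotech document, which is the kind of structure a 2D projection would then make visible.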
For a deeper understanding you can download the following research papers in our Download Center:
- Least Square Projection: a fast high precision multidimensional projection technique and its application to document mapping
- Content-based visualization of web search results
Q: What are the key KMX functions/libraries that could be embedded in other solutions?
Most of the existing KMX libraries are clustered around four main functions. Under each function, we have listed examples of existing KMX implementations. Each module can potentially be integrated into the workflow of another solution or system.
- Proprietary algorithms for handling large collections of text documents: determining document similarity and projecting the high-dimensional document vectors onto 2D for cluster visualization (landscaping)
- Binary classification, where the classifier assigns each document a score that determines whether the document belongs to the class represented by the classifier
- Multiclass classification, where the classifier determines a score for each class, from which one can determine the probability that a document belongs to one or more of the classes represented by the multiclass classifier
- Compound classification where the user can combine multiple classifiers for more complex classification tasks
- A suggestion system that finds the most relevant documents to label next, for assisted tuning of classifier performance. The important advantage is that the user can build an initial classifier from a small number of labelled documents, and the suggestion system then makes the optimization process efficient: the system asks the user to label only 10 extra documents, which it has determined will improve the classifier significantly
- Calculation of precision, recall and F1 measures using cross-validation. The system repeatedly samples a subset of the labelled documents to build multiple classifiers and calculates P, R and F1 based on all the labelled documents. It also calculates an error margin for these performance indicators. Additionally, the KMX system can calculate and plot ROC (Receiver Operating Characteristic) curves, which give good quantitative information on performance even when the labelled data is unevenly distributed. Lastly, the system provides histogram plots of the number of documents having a certain classification score (0-100) for binary and multiclass classifiers
- We also have technology to perform Batch Classification (server based) supporting large scale parallel processing
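For readers who want to see how the performance indicators mentioned in the list above are defined, the following sketch computes precision, recall and F1 from true and predicted labels. The function name and the toy labels are assumptions for illustration. The repeated sampling and error margins that KMX adds on top are not reproduced here.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for binary (True/False) labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for six documents.
true_labels = [True, True, True, False, False, False]
predicted_labels = [True, True, False, True, False, False]
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall and F1 all equal 2/3.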
Visualization, Interaction and Reporting
- landscaping visualization with automatic annotation and multi resolution support
- visualization of the classification score for a multi class classifier using parallel coordinates
- user interaction to select one or multiple documents and make them part of a subset of the data (brushes)
- user interaction in both visualizations (interaction across multiple coupled views) to select one or more documents and place them in a brush
- support for showing additional information when a single document is selected (ranging from more annotation terms to the full document)
- reporting of precision, recall and F1 (also mentioned under cross-validation)
- plots showing the frequency distribution of documents and their classification scores
Management, Monitoring & Integration
- import and export of data and labels
- support for managing the data (documents) and the metadata, for which KMX creates workspaces
- support for multiple users
- support for multiple data sets over multiple directories that all users can see and use which is valuable in collaborative work
- parallel processing framework
- specialised data importers (for patent and non-patent data)
- support for document retrieval (based on Solr and Lucene)
- text preprocessing that performs stemming and tokenization for multiple languages
- multi-language stop word lists
Q: Optimize precision & recall of documents: How much training data do you need?
What are the requirements for building KMX Classifiers that can be optimized well (for precision ‘P’, for recall ‘R’, or both)? Based on extensive testing of Classifier optimization in KMX and in-depth analysis of research papers on machine learning, the fundamentals are:
- Start with a small, balanced training set: roughly 10 to 30 positive documents and a similar number of negative ones. So 18 positives and 22 negatives is fine.
- Use active learning to find more training documents that will improve the Classifier. KMX uses an active learning algorithm that samples the maximum margin around the hyperplane for those documents (either positive or negative) that, when labelled, will improve the Classifier. We call this the ‘suggestion system’. It works on binary Classifiers, since these can be optimized more easily. Combining binary Classifiers in a compound Classifier is a good way to handle cases where a Classifier should represent multiple perspectives.
- In Treparel’s experience, the best way to improve a Classifier is to provide more training documents while keeping the training data (positives and negatives) balanced. Before doing so, select the number of words (feature_vector_length) and test a few variations until you have reached the optimum; the standard value of 750 is a good starting (default) feature_vector_length. Only after testing and storing this setting should one look for the number of training documents (positives and negatives) that gives the optimal P and R.
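The idea behind the suggestion system can be illustrated with a common margin-based active learning heuristic: propose the unlabelled documents that lie closest to the SVM hyperplane, because the classifier is least certain about them. This is a sketch under that assumption, not the actual KMX algorithm. The function name, the document ids and the scores are hypothetical.

```python
def suggest_documents(decision_scores, k=10):
    """Return the k document ids whose score is closest to the hyperplane.

    decision_scores maps a document id to its signed distance from the
    hyperplane (positive side = class, negative side = not class).
    """
    ranked = sorted(decision_scores, key=lambda doc_id: abs(decision_scores[doc_id]))
    return ranked[:k]


# Hypothetical signed distances for five unlabelled documents.
scores = {"doc_a": 1.8, "doc_b": -0.1, "doc_c": 0.4, "doc_d": -2.3, "doc_e": 0.05}
suggestions = suggest_documents(scores, k=3)
```

In this example the suggestions are doc_e, doc_b and doc_c, in that order. The user would label them, after which the classifier is retrained and the next batch is suggested.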
One of our customers asked us about ‘the ratio of training data versus test data versus evaluation data’. In data mining, a rule of thumb for training versus test versus evaluation is usually 20% : 20% : 60%. However, Treparel has seen other ratios as well. Since KMX uses SVMs (Support Vector Machines), it has the advantage that it can be trained on a small sample (20 positive and 20 negative documents, even from a large set). KMX just requires (very) good training documents. This is one of the key reasons why KMX supports the explorative process through visualization: it helps the analyst or data scientist to find and label the documents that make the best training documents.
Q: What is the ideal workflow for working with KMX?
At every point in the analysis, the KMX user can select an operation on the set of documents (using functions such as filter, brush, classify, etc.). The order in which the available functions (operators on the set of documents) are used determines the results. KMX provides a lot of freedom (no fixed order of processing steps is required), which users appreciate because they can solve their specific tasks in what they see as the optimal approach.
However, we understand that a novice user welcomes advice on how certain analyses are typically done.
For a certain set of searches the task is to:
- find all relevant documents, preferably in a ranked list (the highest-ranked document is the most important result). This can be done by classification, preferably binary classification, since this can easily be optimized to high precision with a relatively small number of training documents.
- The user can thereafter apply a filter to select all documents with a score above, for instance, 0.8 and thus avoid having to look further down the list.
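The rank-then-filter step described in the list above amounts to the following. The function name, document ids and scores are made up for the example; in KMX these operations are performed through the user interface.

```python
def rank_and_filter(scores, threshold=0.8):
    """Rank documents by classification score and keep those above threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > threshold]


# Hypothetical binary classification scores per document.
scores = {"doc_1": 0.95, "doc_2": 0.42, "doc_3": 0.81, "doc_4": 0.80}
shortlist = rank_and_filter(scores)
```

The shortlist contains doc_1 and doc_3 only: doc_4 scores exactly 0.8 and is therefore not "above" the threshold.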
The preceding steps are all about building a good Classifier (a special KMX algorithm for classifying large quantities of text), and selecting good training documents (positive and negative) is the most important part. How the user does this depends partly on their knowledge of the technology field. For a biotech user, for instance, a gene sequence will probably be part of the selection of training documents, which is an example where one can use search/filter in a specific way.
Q: Term Annotation on Visual Clusters
The cluster visualization of KMX comes with term annotation that provides very descriptive terms for each cluster and automatically adapts when one zooms in or out. The term annotation is very valuable for visually discovering what a large data set is about. Does KMX consider single words as terms, or does it consider phrases? And what are the key differences between the two approaches?
ANSWER: KMX uses every word as a term during the ‘feature extraction’ step in the analysis workflow. After tokenization and stemming, followed by TFIDF* weighting of the individual words, KMX calculates the term distribution over all documents.
KMX internally calculates the term distribution per document and manages a large term-document matrix in memory. During clustering, KMX also calculates a distance matrix using a distance metric based on the vector inner product of all pairs of document vectors.
KMX can statistically sample the set of relevant descriptive terms from each document and generate a vector representation that also captures words that are next to each other. This approach includes all the phrases that are relevant. Phrases that appear only a few times are not relevant as features in the vector representation for SVM-based classification. KMX therefore does not take these types of phrases into account for the term annotation on the cluster.
Phrases may look great as annotation in the visualization, but the annotation should first give a few of the most descriptive terms and thus give the user a quick way to explore the full data landscape. KMX uses three terms in the annotation, since more terms cause placement problems (clutter) and require smaller fonts. Users can edit the ranked list of annotation terms for each cluster separately.
If the user wants to edit a term and make it a phrase, this is possible, but it does not change the underlying text mining approach.
To summarize: phrases are often well recognized when a user is reading a text, but they are not necessarily the statistically most important features (terms) for building good Classifiers.
* TFIDF means Term Frequency multiplied by Inverse Document Frequency. It is a weighting approach that is often used in information retrieval and text mining to calculate the overall importance of all the terms (as a statistical measure). The weight increases in proportion to the number of times a word appears in the document and is offset by the frequency of the word in the whole corpus. A term that appears very frequently in only a few documents, but not across a large set of documents, therefore receives a relatively high weight in those documents.
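As an illustration of the footnote above, one common textbook variant of the weighting is shown below. The exact variant used inside KMX is not stated here, so the formula and its symbols are an assumption: w is the weight of term t in document d, tf is the number of times t occurs in d, N is the number of documents in the corpus, and df is the number of documents that contain t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

For example, a term that occurs 5 times in a document and appears in 10 of 1000 documents gets the weight 5 · ln(100) ≈ 23.0, whereas a term with the same count that appears in 900 of the 1000 documents gets only 5 · ln(1000/900) ≈ 0.53.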
Q: Overfit vs Underfit in relevant terms when training for large data sets
The advantage of KMX is that it focuses on capturing a large enough set of relevant terms (the features) to calculate accurate Classifiers that also work very well on other data sets. This means that the Classifier is robust and accurate (it can provide high precision and recall).
With this approach KMX avoids underfitting (not enough terms in the vector representation to build SVM-based Classifiers that deliver high precision and recall). It also avoids overfitting (too many terms). The default setting in KMX is 750 terms for the vector representation, which has been extensively tested and has proven to be a good default. The user is free to change this setting to test whether more (or fewer) terms deliver a better result.
Example use case for this feature: in a novelty search (when analyzing patents), a user may prefer to give the words in the descriptive part a higher weight and to test with more words (750->1000). This can yield better results, on the argument that a patent attorney is a bit more open in word usage at the end of the descriptive part. KMX provides the freedom to experiment with this.
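To make the effect of the vector length setting concrete, the sketch below keeps only the highest-weighted terms of a corpus and represents each document on that reduced vocabulary. Ranking terms by their summed weight is an assumption for illustration, as are the function names and toy weights. How KMX actually ranks candidate terms is not described here.

```python
from collections import Counter


def select_features(doc_term_weights, feature_vector_length=750):
    """Return the terms with the highest total weight over all documents."""
    totals = Counter()
    for weights in doc_term_weights:
        totals.update(weights)  # adds the weight of each term to its total
    return [term for term, _ in totals.most_common(feature_vector_length)]


def to_vector(weights, features):
    """Represent one document as a vector over the selected features."""
    return [weights.get(term, 0.0) for term in features]


# Hypothetical per-document term weights.
doc_term_weights = [
    {"solar": 0.5, "panel": 0.3, "the": 0.01},
    {"solar": 0.4, "coating": 0.6, "the": 0.01},
]
features = select_features(doc_term_weights, feature_vector_length=2)
vector = to_vector(doc_term_weights[0], features)
```

With a length of 2, only "solar" and "coating" survive, and the first document becomes the vector [0.5, 0.0]. Raising the length admits more, lower-weighted terms, which is the trade-off between underfitting and overfitting discussed above.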
Q: Binary versus Compound Classifier versus Multiclass (free) Classifiers; which approach for what situation?
Generally speaking, the three types of Classifiers should first be assessed on (1) how one can get the best results in terms of precision (P) and recall (R), and (2) how easy (or difficult) it is for a user to build such a Classifier, and how the user knows when to stop providing more training documents.
We know (from research and client feedback) that a binary Classifier can generally be optimized with fewer documents than are needed for a multiclass (or compound) Classifier to obtain the same P and R. The binary Classifier is therefore advised for most tasks. It is also easier for a user to select good training documents in the binary case. During optimization, adding training documents while keeping the numbers of positive and negative training documents balanced will strongly improve the precision and recall of a binary Classifier. The user is advised to use the suggestion system here (based on an active learning algorithm) to optimize the Classifier with a minimum amount of work (labelling 10 documents) for an easy improvement.
A binary Classifier determines whether a document belongs to one class. In multiclass classification the classes are considered non-overlapping, so the multiclass classification determines to which class a document most likely belongs. This is expressed in a score (from 0 to 100, like a probability), and the class with the highest score is the class to which the document belongs.
In a compound Classifier one can express multiple (technology) aspects (like overlapping classes) to be used in the classification. It allows for ranking on different aspects, which are expressed by combining the Classifiers.
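The difference between the three Classifier types can be summarized in a small sketch that works on scores in the 0-100 range. The threshold of 50 and the weighted average used to combine binary scores are assumptions chosen for illustration. The source does not specify how KMX combines Classifiers, and all names and scores are hypothetical.

```python
def binary_decision(score, threshold=50):
    """Binary case: does the document belong to the class?"""
    return score >= threshold


def multiclass_decision(class_scores):
    """Multiclass case: the class with the highest score wins."""
    return max(class_scores, key=class_scores.get)


def compound_score(binary_scores, weights=None):
    """Compound case: combine several binary scores (assumed weighted mean)."""
    if weights is None:
        weights = {name: 1.0 for name in binary_scores}
    total_weight = sum(weights[name] for name in binary_scores)
    weighted_sum = sum(binary_scores[name] * weights[name] for name in binary_scores)
    return weighted_sum / total_weight


# Hypothetical scores for a single document.
is_member = binary_decision(72)
best_class = multiclass_decision({"batteries": 72, "solar": 15, "wind": 13})
combined = compound_score({"chemistry": 80, "manufacturing": 60})
```

Ranking documents by the combined score then orders them by how well they match all aspects together, while each underlying binary Classifier can still be tuned separately.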
The built-in suggestion system only works for binary classification and cannot be used for multiclass classification. In that case a user can use the landscaping view to select more positive training documents from the center of a cluster, and more negative documents on the boundary of adjacent clusters, where documents do not describe the same technology but are close to it. Since the compound Classifier is a collection of binary Classifiers, the suggestion system can be used for it.
In the multiclass case you must tag/label a document with exactly one label, and making this judgment is sometimes difficult when there are overlapping aspects (captured by the labels). Recommendation: start with a binary classification and then select the most relevant documents (a small set with a high binary classification score).
Q: Determining imprecise terminology and ambiguity
Can KMX be used to analyze documents and text stored in a database (system, functional and nonfunctional requirements) and determine how well they are written by looking for imprecise terminology and ambiguity?
The KMX technology is used to analyze large, complex document sets that are managed in a database. KMX is used to analyze very precise, complete and complex documents such as patents. The language of patents is complex, since they need to claim an invention as broadly as possible and at the same time disclose as little as possible. KMX uses advanced machine learning algorithms combined with strong visualization to support large companies in the classification and landscaping of their patents. KMX also supports analysis of non-patent literature.
The technology does not need a thesaurus, taxonomy or ontology and is language independent. It has a database abstraction layer to be able to work with different databases.
Resolving ambiguity is a use case that KMX supports by using the statistical term distribution of a document collection; in this way KMX can spot imprecise terms.