ClassifiR simplifies the task of labeling and categorizing entire documents based on predefined taxonomies, industry classifications, or customized label sets. It works seamlessly with StructR, which identifies text segment properties, providing a comprehensive data analysis solution.
With a user-friendly graphical interface, ClassifiR facilitates the creation and exploration of custom ontologies through clustering, labeling, and querying. Its interactive capabilities empower users to effortlessly develop and train personalized classifiers that automate the classification process. The results are conveniently accessible through a unified endpoint, regardless of the chosen classification method.
Document classification, a fundamental task in NLP, involves categorizing text documents into predefined classes or categories based on their content. The goal is to automatically assign the most appropriate category or label to each document, enabling efficient organization, retrieval, and analysis of large text corpora. The types of document classification most relevant in our context are topic classification, news categorization, sentiment analysis, emotion classification, and intent prediction.
Document classification can be divided into the following:
- Multiclass Classification: In multiclass classification, each document is assigned to one and only one class or category. The goal is to accurately assign each document to a single predefined class label from a set of multiple mutually exclusive classes. For example, if there are three classes (A, B, C), a multiclass classifier would assign each document to one of these three classes.
- Multilabel Classification: In multilabel classification, each document can be assigned to multiple class labels simultaneously. Instead of being limited to a single class label, a document may belong to multiple categories or have multiple attributes. The classifier assigns a binary label to each class, indicating whether the document belongs to that class or not. This allows for more flexibility and captures the possibility of documents having multiple topics or attributes. For instance, a document might be labeled as belonging to both “Sports” and “Entertainment” categories.
- Hierarchical Classification: Hierarchical classification involves organizing classes or categories in a hierarchical or tree-like structure. Instead of directly assigning documents to specific classes, the classifier operates in a hierarchical manner, making decisions at different levels of the hierarchy. Each class is organized into parent and child relationships, where the child classes represent specific subcategories or attributes of the parent classes. This approach allows for a more structured and granular classification scheme. For example, in a hierarchical classification system for news articles, the top-level classes could be “Sports,” “Politics,” and “Entertainment,” with further subcategories such as “Football,” “Basketball,” “Elections,” “Legislation,” “Movies,” and so on.
In summary, multiclass classification assigns each document to a single class label, multilabel classification allows for multiple class labels per document, and hierarchical classification organizes classes in a hierarchical structure to provide a more structured and granular classification scheme.
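The difference between the three schemes is easiest to see in the shape of the output. The following is a minimal sketch, not part of ClassifiR: it assumes some classifier has already produced a score between 0 and 1 for each class, and the function names, the 0.5 threshold, the `ROOT` node, and the example scores are all hypothetical.

```python
from typing import Dict, List


def multiclass(scores: Dict[str, float]) -> str:
    """Return exactly one label: the class with the highest score."""
    return max(scores, key=scores.get)


def multilabel(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return zero or more labels: one independent yes/no decision per class."""
    return sorted(label for label, score in scores.items() if score >= threshold)


def hierarchical(scores: Dict[str, float], tree: Dict[str, List[str]],
                 root: str = "ROOT") -> List[str]:
    """Walk the class tree top-down, picking the best-scoring child at each level."""
    path: List[str] = []
    node = root
    while tree.get(node):  # stop when the current node has no children
        node = max(tree[node], key=lambda child: scores.get(child, 0.0))
        path.append(node)
    return path


# Hypothetical scores for one news article.
top_level = {"Sports": 0.81, "Politics": 0.07, "Entertainment": 0.62}
sub_level = {"Football": 0.70, "Basketball": 0.20}
tree = {
    "ROOT": ["Sports", "Politics", "Entertainment"],
    "Sports": ["Football", "Basketball"],
    "Politics": ["Elections", "Legislation"],
    "Entertainment": ["Movies"],
}

print(multiclass(top_level))                            # Sports
print(multilabel(top_level))                            # ['Entertainment', 'Sports']
print(hierarchical({**top_level, **sub_level}, tree))   # ['Sports', 'Football']
```

The same scores yield a single label, a set of labels, or a path through the hierarchy, depending on the scheme.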
Two tools from the ClassifiR collection are provided by the enRichMyData consortium: “Document Classification”, provided by Expert AI, and “InfoMiner”, provided by the team at the Jožef Stefan Institute (JSI).
Expert AI Platform Document Classification
Document Classification by Expert AI analyzes text to identify and label media topics, emotional traits, geographical references, and more.
Document classification determines what a text is about in terms of categories of a taxonomy.
Available taxonomies are:
Taxonomy | English | Spanish | French | German | Italian |
iptc | ✔ | ✔ | ✔ | ✔ | ✔ |
geotax | ✔ | ✔ | ✔ | ✔ | ✔ |
emotional-traits | ✔ | ✔ | | | |
behavioral-traits | ✔ | ✔ | | | |
In the Natural Language API terminology, taxonomy “x” is both a specific set of categories and the name of the API resources capable of classifying a text according to that set.
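Because the taxonomy name doubles as the resource name, the address of a classification request can be derived from the taxonomy and the language alone. The sketch below only builds that address and a request body; it sends nothing over the network. The base URL, the `categorize/{taxonomy}/{language}` path pattern, the two-letter language codes, and the body layout are assumptions that should be checked against the Expert AI Natural Language API documentation, and the helper names are hypothetical.

```python
from urllib.parse import quote

# Assumption: base address of the Natural Language API.
BASE_URL = "https://nlapi.expert.ai/v2"

# Assumption: two-letter codes for the languages listed in the table above.
LANGUAGE_CODES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "German": "de",
    "Italian": "it",
}


def categorize_url(taxonomy: str, language: str) -> str:
    """Build the resource URL for a taxonomy, e.g. 'iptc' or 'geotax'."""
    code = LANGUAGE_CODES[language]  # raises KeyError for an unlisted language
    return f"{BASE_URL}/categorize/{quote(taxonomy, safe='')}/{code}"


def request_body(text: str) -> dict:
    """Assumed JSON body: the text to classify wrapped in a 'document' object."""
    return {"document": {"text": text}}


print(categorize_url("iptc", "English"))
print(request_body("The striker scored twice in the final."))
```

An actual call would also need an authorization token, which is omitted here.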
InfoMiner (JSI)
InfoMiner, an offshoot of the original Ontogen, provides a modern web user interface with visualizations backed by data analysis and machine learning algorithms. Its objective is the rapid construction of labeled datasets, their taxonomies, and classifiers.
- Data Grouping features allow the user to quickly identify similar documents using user-defined metrics. This is supported by automatic methods such as clustering.
- Smart Visualization techniques allow the user to understand the data quickly. InfoMiner uses centroid-based methods to summarize each cluster. It also automatically creates visualizations such as word clouds, treemaps, and timelines.
- Data Filtering allows the user to query the data by its properties, the most important of which are typically the textual content and the metadata.
- Taxonomy Creation allows the user to combine all of the previously mentioned methods (analysis, grouping, filtering) to create and navigate a taxonomy.
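To give a feel for what a centroid-based cluster summary can look like, the sketch below averages simple word-count vectors per cluster, lists the highest-weighted terms (the kind of weights a word cloud could display), and assigns a new document to the nearest centroid by cosine similarity. This is an illustration under simplifying assumptions, not InfoMiner's implementation: the clusters are given by hand rather than computed, no stop words are removed, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def centroid(docs: List[str]) -> Dict[str, float]:
    """Average term vector of all documents in one cluster."""
    total: Counter = Counter()
    for doc in docs:
        total.update(term_vector(doc))
    return {term: count / len(docs) for term, count in total.items()}


def top_terms(center: Dict[str, float], k: int = 3) -> List[str]:
    """Highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(center.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(text: str, centroids: Dict[str, Dict[str, float]]) -> str:
    """Label a new document with the name of its most similar centroid."""
    vec = term_vector(text)
    return max(centroids, key=lambda name: cosine(vec, centroids[name]))


# Hand-made clusters standing in for the output of a clustering step.
clusters = {
    "sports": ["football match goal", "basketball match score", "football goal keeper"],
    "politics": ["election vote parliament", "parliament vote law"],
}
centroids = {name: centroid(docs) for name, docs in clusters.items()}

print(top_terms(centroids["sports"]))                # ['football', 'goal', 'match']
print(top_terms(centroids["politics"]))              # ['parliament', 'vote', 'election']
print(assign("the vote on the new law", centroids))  # politics
```

Once clusters have been named, the same centroids can serve as a very simple classifier for unseen documents, which mirrors the path from grouping to taxonomy to classifier described in the list above.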