The JSI Categorizer is a powerful tool designed to classify textual data into specific knowledge fields, treating this task as a multi-label problem. At the heart of this tool is the KnowMap Taxonomy, a hierarchical knowledge structure that aligns with the widely recognized Cooperative Patent Classification (CPC) schema.
Why KnowMap?
KnowMap enhances the classification process by:
- Merging several class entities within the CPC schema based on the scope and size of each knowledge field.
- Allowing for more efficient categorization into fine-grained classes.
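One way to picture size-based merging is the minimal sketch below. It is not the KnowMap construction procedure, which is not described here: the function name `merge_small_classes`, the `min_docs` threshold, the roll-up to the four-character CPC subclass, and the document counts are all assumptions made for illustration. It only models size, not scope.

```python
def merge_small_classes(doc_counts, min_docs, parent_len=4):
    """Map each CPC code to itself, or to its parent subclass if it is too small.

    Hypothetical rule: a class with fewer than `min_docs` documents is folded
    into the subclass given by its first `parent_len` characters (e.g. "H01M").
    """
    mapping = {}
    for code, count in doc_counts.items():
        mapping[code] = code if count >= min_docs else code[:parent_len]
    return mapping


# Illustrative document counts per CPC code (made-up numbers).
doc_counts = {"H01M10/052": 5000, "H01M10/0525": 40, "H01M4/13": 12}
mapping = merge_small_classes(doc_counts, min_docs=100)
```

Under these assumptions the large class keeps its own label, while the two small classes share the coarser `H01M` label, so no target class is left with too few training examples.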
Methods and Materials
The Categorizer combines data preparation techniques (deduplication, random sampling, and conditional random sampling) with pre-trained language models. It has been rigorously evaluated on patent data from the Google Patents Public Datasets (via BigQuery). Fine-tuned pre-trained transformer models power this innovative classification approach.
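The data preparation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the Categorizer's actual pipeline: "conditional random sampling" is interpreted here as random sampling conditioned on the label (capping how many documents are drawn per label), and the helper names `deduplicate` and `sample_per_label`, the `cap` parameter, and the example records are hypothetical.

```python
import random
from collections import defaultdict


def deduplicate(docs):
    """Keep the first document for each case- and whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def sample_per_label(docs, cap, seed=0):
    """Randomly draw at most `cap` documents for each label (assumed meaning
    of label-conditioned sampling), then return the union of the draws."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, doc in enumerate(docs):
        for label in doc["labels"]:
            by_label[label].append(i)
    keep = set()
    for label in sorted(by_label):
        indices = by_label[label]
        keep.update(rng.sample(indices, min(cap, len(indices))))
    return [docs[i] for i in sorted(keep)]


# Illustrative records; the CPC labels are examples only.
docs = [
    {"text": "A battery cell", "labels": ["H01M"]},
    {"text": "a  battery CELL", "labels": ["H01M"]},  # duplicate after normalization
    {"text": "Wind turbine blade", "labels": ["F03D"]},
    {"text": "Battery charger for vehicles", "labels": ["H01M", "B60L"]},
    {"text": "Lithium anode", "labels": ["H01M"]},
]
unique = deduplicate(docs)
balanced = sample_per_label(unique, cap=1)
```

Because a document can carry several labels, a document drawn for one label also counts toward its other labels, so a frequent label can end up slightly above the cap. This is one reason balancing multi-label datasets is harder than balancing single-label ones.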
Key Features and Capabilities
Performance
The Categorizer achieves exceptional classification quality with an F1 score exceeding 0.8. This reflects the tool’s ability to maintain accuracy even when handling complex multi-label classification problems.
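For readers unfamiliar with F1 in a multi-label setting, the sketch below computes a micro-averaged F1 over sets of labels. The averaging scheme is an assumption: the reported figure may use a different one (for example macro averaging), and the function name `micro_f1` and the example labels are illustrative only.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool true/false positives and false negatives
    over all documents and labels, then compute a single F1 score."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not true
        fn += len(true - pred)   # true labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative gold and predicted label sets for three documents.
true_sets = [{"H01M"}, {"F03D"}, {"H01M", "B60L"}]
pred_sets = [{"H01M"}, {"F03D", "H01M"}, {"H01M"}]
score = micro_f1(true_sets, pred_sets)  # 3 TP, 1 FP, 1 FN -> 0.75
```

A document counts as partly correct when only some of its labels are predicted, which is why set-based counts are used instead of exact-match accuracy.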
Scalability
Designed with scalability in mind, the Categorizer can process and classify over 100,000 documents, making it suitable for large-scale projects such as patent analysis or corporate research tasks.
Tool Openness
The model is open-source and available for download, promoting transparency and collaboration among researchers and developers.
Applications and Use Cases
The Categorizer aligns perfectly with business cases like InnoGraph in the enRichMyData project, where categorization tools are applied to analyze and map innovation trends. By addressing challenges such as dataset balancing for multi-label problems and fine-tuning performance metrics, the Categorizer offers value across a wide range of applications.
Useful Resources
KnowMap Taxonomy: Explore the hierarchical knowledge structure that powers the Categorizer.