Introducing the Categorizer: A Step Forward in Knowledge Field Classification  

The JSI Categorizer is a powerful tool designed to classify textual data into specific knowledge fields, treating this task as a multi-label problem. At the heart of this tool is the KnowMap Taxonomy, a hierarchical knowledge structure that aligns with the widely recognized Cooperative Patent Classification (CPC) schema.

Why KnowMap?

KnowMap enhances the classification process by:

  • Merging several class entities within the CPC schema based on the scope and size of each knowledge field.
  • Allowing for more efficient categorization into fine-grained classes.
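
To illustrate the merging idea, here is a minimal sketch in which one KnowMap class covers several CPC subclasses and a document's CPC codes are translated into KnowMap labels. The KnowMap class names, the mapping itself, and the helper functions are hypothetical and chosen only for illustration; the actual taxonomy defines its own classes and merge rules.

```python
# Hypothetical mapping: each KnowMap class covers one or more CPC subclasses.
# The class names "km_computing" and "km_telecom" are invented for this sketch.
KNOWMAP_TO_CPC = {
    "km_computing": ["G06F", "G06N"],
    "km_telecom": ["H04L", "H04W"],
}


def invert(mapping):
    """Build a CPC-subclass -> KnowMap-class lookup from the mapping above."""
    cpc_to_km = {}
    for km_class, cpc_subclasses in mapping.items():
        for subclass in cpc_subclasses:
            cpc_to_km[subclass] = km_class
    return cpc_to_km


def to_knowmap_labels(cpc_codes, cpc_to_km):
    """Translate full CPC codes (e.g. "G06N 3/08") into KnowMap labels.

    Only the four-character subclass prefix is used for the lookup;
    codes without a mapping are ignored.
    """
    labels = {cpc_to_km[code[:4]] for code in cpc_codes if code[:4] in cpc_to_km}
    return sorted(labels)


lookup = invert(KNOWMAP_TO_CPC)
labels = to_knowmap_labels(["G06N 3/08", "H04L 9/40", "G06F 16/00"], lookup)
```

Because several CPC codes can collapse into the same KnowMap class, a document ends up with fewer, broader labels than it had CPC codes, while still carrying more than one label when it spans several fields.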

Methods and Materials

The Categorizer combines deduplication, random sampling, conditional random sampling, and fine-tuned pre-trained transformer language models. It has been evaluated on patent data from the Google Patents Public Datasets (accessed via BigQuery).
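
The data preparation steps named above can be sketched as follows. This is one possible reading, not the Categorizer's actual pipeline: deduplication is done here by hashing whitespace- and case-normalized text, and "conditional random sampling" is interpreted as a random pass that keeps a document only while at least one of its labels is still under a per-label cap. The function names, the document layout (`text`, `labels`), and the cap parameter are assumptions made for this example.

```python
import hashlib
import random
from collections import Counter


def deduplicate(docs):
    """Drop documents whose normalized text has already been seen."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def capped_sample(docs, max_per_label, seed=0):
    """Randomly sample documents, conditioned on per-label counts.

    A document is kept if at least one of its labels has fewer than
    max_per_label kept documents so far. Frequent labels can still exceed
    the cap when they co-occur with rare ones, so this balances the
    dataset only approximately.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    counts = Counter()
    kept = []
    for doc in shuffled:
        if any(counts[label] < max_per_label for label in doc["labels"]):
            kept.append(doc)
            counts.update(doc["labels"])
    return kept
```

A fixed seed keeps the sample reproducible between runs, which makes it easier to compare models trained on the same subset.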

The figure displays an example of a branch extension in KnowMap from the root to the lowest level, showing the association of KnowMap classes with corresponding CPC classes at each level.


Key Features and Capabilities

Performance

The Categorizer achieves an F1 score exceeding 0.8, which indicates that it balances precision and recall well even on complex multi-label classification problems.
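
For readers unfamiliar with F1 in the multi-label setting, the sketch below computes a micro-averaged F1 score over sets of labels. The text does not state which averaging (micro, macro, or per-sample) the reported figure uses, so micro-averaging is an assumption here, and the function name is hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over documents that each carry a set of labels.

    True positives, false positives, and false negatives are pooled across
    all documents and labels before computing F1 = 2TP / (2TP + FP + FN).
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two documents: one label missed in the first, one extra label in the second.
score = micro_f1([["a", "b"], ["c"]], [["a"], ["c", "d"]])
```

In this small example there are two true positives, one false positive, and one false negative, so the score is 4/6, roughly 0.67.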

Scalability

Designed with scalability in mind, the Categorizer can process and classify over 100,000 documents, making it suitable for large-scale projects such as patent analysis or corporate research tasks.

Tool Openness

The model is open-source and available for download, promoting transparency and collaboration among researchers and developers.

Applications and Use Cases

The Categorizer fits business cases such as InnoGraph in the enRichMyData project, where categorization tools are applied to analyze and map innovation trends. By addressing challenges such as dataset balancing for multi-label problems and tuning for performance metrics, the Categorizer offers value across a wide range of applications.

Useful Resources

KnowMap Taxonomy: Explore the hierarchical knowledge structure that powers the Categorizer.

 
