Spark NLP 2.4 sets new accuracy records for common natural language processing tasks including NER, OCR & Matching

February 14 11:04 2020

The AI industry’s most widely used NLP library delivers a major new release of its software & models – improving accuracy and scale for named entity recognition, optical character recognition, matching, and entity resolution.

February 14th, 2020 – Delaware – The John Snow Labs team is pleased to announce the immediate availability of Spark NLP 2.4. This is the library’s biggest release ever, with major accuracy & scalability improvements across the open source, enterprise, and healthcare editions.

The changes include improvements to the core architecture of the library, retraining of all pre-trained models from scratch, and a suite of new pre-trained models & deep-learning networks that leverage new academic research results from 2019. In most cases, this release is the first production-grade, scalable, and trainable implementation of this new research made available to the AI community.

Named entity recognition: Spark NLP 2.4 still makes half as many mistakes as spaCy 2.2

Named entity recognition (NER), which extracts entities such as people, places, drugs, and genes from free text, is one of the most widely used NLP tasks. Transformers such as BERT, ELMo, and others have improved the achievable accuracy on NER over the past two years – and Spark NLP 2.4 now comes with several out-of-the-box pipelines that make the most of these innovations:

Out-of-the-box Spark NLP models deliver an F1 score of 95.9% using BERT-large on the standard en_core_web benchmark – versus 88.3% delivered by the spaCy 2.2 BERT model.
This means that Spark NLP models are expected to make between one half and one third of the mistakes that the spaCy model makes.
Spark NLP includes five pre-trained NER models – enabling users to trade off accuracy for speed or memory. The least accurate Spark NLP model is still more accurate than all spaCy models, including the largest one.
In addition, Spark NLP NER models are trainable – so that users can train & tune even more accurate models for their own domain-specific applications.
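
The "one half to one third" comparison above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that 1 minus the F1 score is a reasonable proxy for the rate of mistakes (F1 is not strictly an error rate), and the function name is hypothetical rather than part of any library.

```python
def error_ratio(f1_a: float, f1_b: float) -> float:
    """Ratio of (1 - F1) for model A to (1 - F1) for model B.

    Assumption: 1 - F1 is used as a rough stand-in for the mistake rate.
    """
    return (1.0 - f1_a) / (1.0 - f1_b)


# F1 scores quoted in this release.
spark_nlp_f1 = 0.959
spacy_f1 = 0.883

ratio = error_ratio(spark_nlp_f1, spacy_f1)
print(f"Spark NLP makes about {ratio:.2f}x as many mistakes as spaCy")
```

Under that assumption, the ratio lands between one third and one half, consistent with the claim above.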

Optical Character Recognition (OCR): Automated image enhancement & scalable pipelines

Spark OCR is now a separate library from Spark NLP – enabling users to configure optical character recognition pipelines that improve accuracy for specific document types.

Spark OCR is now in production in various large-scale, high-compliance use cases to read clinical records, faxes, invoices, books, and other document types. This new release has enabled customers to reach and surpass the accuracy previously achieved by OCR industry leaders such as ABBYY, AWS, and Google Cloud – by implementing image processing algorithms, automating their selection and use, and enabling users to tune OCR pipelines for domain-specific document types.

Spark OCR is unique in its ability to scale OCR processing on any Spark cluster, unify image processing with downstream information extraction from text (using NLP techniques), and run on a customer’s infrastructure without requiring sharing or sending documents to a cloud provider.

Context-Based Text Matching: Accurately extract facts from large documents

A common NLP use case is extracting structured data from large documents. Financial statements, medical records, and legal documents can often be hundreds of pages long. In such cases, finding a specific fact – like a date, a monetary value, or a name – can be challenging since a document can include hundreds of such values to choose from.

Spark NLP 2.4 includes a context-based text matcher which enables users to specify the context inside a document in which a match should be searched for. The algorithm first finds the relevant context and then performs a deeper search for the requested fact within it.
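
The two-stage idea described above (locate the context first, then search for the fact inside it) can be illustrated with a minimal sketch in plain Python. This is not the Spark NLP API: the function `find_in_context`, its parameters, the fixed character window, and the sample document are all hypothetical, and regular expressions stand in for whatever matching the library actually performs.

```python
import re
from typing import Optional


def find_in_context(
    text: str,
    context_pattern: str,
    fact_pattern: str,
    window: int = 200,
) -> Optional[str]:
    """Return the first fact found shortly after a matching context.

    Hypothetical helper: stage 1 finds each occurrence of the context,
    stage 2 searches for the fact only in the `window` characters after it.
    """
    for ctx in re.finditer(context_pattern, text, flags=re.IGNORECASE):
        region = text[ctx.end(): ctx.end() + window]
        fact = re.search(fact_pattern, region)
        if fact:
            return fact.group(0)
    return None


# Made-up document containing several dates; only one is the wanted fact.
document = (
    "Invoice date: 2020-01-03. Shipping is expected by 2020-01-20. "
    "Total amount due: $1,250.00. Payment due date: 2020-02-14."
)

date_pattern = r"\d{4}-\d{2}-\d{2}"
due_date = find_in_context(document, r"payment due date", date_pattern)
print(due_date)
```

Restricting the second search to the region near the context is what keeps the other dates in the document from being returned by mistake.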

Clinical entity resolution: Accurately map entities to large, hierarchical ontologies

Spark NLP for Healthcare already had the ability to map clinical entities to medical terminologies – such as drugs to RxNorm codes, procedures to ICD-10-PCS or CPT codes, and others. This release brings new pre-trained models with better accuracy:

All entity resolution models have been re-trained from scratch for improved accuracy based on newer algorithms & deep learning networks
Mapping clinical terms within a specific category (like cancer staging or body part) can now be tuned to be more accurate for specialty-specific use cases
Models for larger terminologies, like SNOMED-CT, are now faster and require a smaller memory footprint to run
All models have been re-trained to reflect the most recent medical terminologies

More new functionality

The Spark NLP 2.4 Release Notes list the entire set of new features, upgrades, and bug fixes within this major release. Major new features include:

Document classification – supporting this common NLP task directly within Spark NLP
Shared in-memory storage – efficient loading & reusing of large models & embeddings
Recursive pipelines – enabling better support for multi-lingual & hierarchical pipelines
Lazy annotators – enabling a small memory footprint when running very large pipelines

“This release continues our years-long commitment to provide our customers and the AI community the world’s most accurate, fast, and scalable NLP library”, said Saif Addin-Ellafi, lead Spark NLP developer at John Snow Labs.

About John Snow Labs

John Snow Labs Inc. is an award-winning healthcare AI & NLP company, accelerating progress in data science with state-of-the-art platforms, models and data. A third of the team have a PhD or MD degree and 75% of team members have at least a Master’s, coming from multiple disciplines covering data science, medicine, data engineering, pharma, security, and DataOps. A Delaware Corporation, John Snow Labs runs as a global virtual team located in 20 countries around the globe. We believe in being great partners, in making customers wildly successful, and in using data science to make the world a better place.

About John Snow Labs’ Spark NLP

Spark NLP is an open-source library for natural language processing in Python, Java, and Scala. Based on the most recent O’Reilly AI Adoption in the Enterprise survey of 1,300 practitioners, Spark NLP is the most widely used NLP library in the enterprise today. It provides the AI industry with state-of-the-art, production-grade natural language processing capabilities – based on the most recent research results in deep learning, transformers, and distributed systems. The Spark NLP community has enjoyed new releases of the library every two weeks on average since the beginning of 2018. John Snow Labs continues to grow its team and support of the library, and to license commercial Enterprise and Healthcare NLP software products that extend it.

Media Contact
Company Name: John Snow Labs
Contact Person: Ida Lucente
Phone: +1 (302) 786-5227
Address: 16192 Coastal Highway
City: Lewes
State: Delaware 19958
Country: United States
Website: www.johnsnowlabs.com
