How to Extract Biomedical Text with ScispaCy

About ScispaCy
Installation of Stand-Alone REST API Wrapper for SciSpacy Python-based NER
How to Install Miniconda
How to Install and Activate SciSpacy
Installation Verification and Continuous Operation
Running and Maintaining Your SciSpacy Instance
Configure the SciSpacy NER Models
How to Add the SciSpacy Component to a Pipeline
SciSpacy Pipeline Configuration
Pipeline Order

About ScispaCy

ScispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
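
As a rough illustration of what such a model provides, the sketch below collects the named entities from a processed document. It relies only on the standard spaCy `Doc.ents`, `Span.text`, and `Span.label_` attributes. The helper name `extract_entities` and the model name in the commented usage (`en_core_sci_sm`) are choices made for this example and are not part of the wrapper described later.

```python
from typing import Callable, List, Tuple


def extract_entities(nlp: Callable, text: str) -> List[Tuple[str, str]]:
    """Return (entity text, entity label) pairs for one piece of text.

    `nlp` is any spaCy-style pipeline: calling it returns an object whose
    `.ents` items expose `.text` and `.label_`.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage with a real model (requires scispacy and the model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_sci_sm")
#   print(extract_entities(nlp, "Oxalate levels were measured in patients."))
```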

Installation of Stand-Alone REST API Wrapper for SciSpacy Python-based NER

Prerequisites

Software

Anaconda Console (Miniconda)
- This automatically installs Python and PIP
Administrator rights on the local machine (to install the program as an Administrator)

Hardware

Server Requirements

Your server must have sufficient physical memory.
The Python application for the NER model consumes approximately 9 GB of RAM (with the NER model loaded in memory for efficiency).
Optional:
- For best performance, install the SpacyWrapper on a separate server in your intranet.
- To install the SciSpacy wrapper on a separate server, copy the PythonRestAPIWrapper folder (see below) to that server and continue with the installation steps there.

How to Install Miniconda

Miniconda is a free minimal installer for Conda.

It is later used to install and activate ScispaCy.

Install Miniconda for separate virtual Python environments:
- Available for download from https://docs.conda.io/en/latest/Miniconda.html
Run the downloaded .exe file as an Administrator.
Install the application using the wizard.
Check the settings:
1. "Add Miniconda to my PATH environment variable"
2. "Register Miniconda as my default Python"
3. Info: You can choose not to add the PATH environment variables or register default Python, but scripted installation may fail and will require manual changes to your environment.
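
If you are unsure whether the PATH setting took effect, a quick check such as the following may help. This is a minimal sketch that uses only the Python standard library. The helper name `find_conda` is made up for this example.

```python
import shutil
from typing import Optional


def find_conda() -> Optional[str]:
    """Return the full path of the conda executable found on PATH, or None."""
    return shutil.which("conda")


if __name__ == "__main__":
    location = find_conda()
    if location is None:
        print("conda was not found on PATH; scripted installation may fail.")
    else:
        print(f"conda found at: {location}")
```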

How to Install and Activate SciSpacy

Locate the PythonRestAPIWrapper folder in the BA Insight <AutoClassifier_engine_install_root>\Admin Site directory.

The files inside this folder are used during the following installation process.

Open Windows PowerShell as an Administrator
Navigate to the PythonRestAPIWrapper folder (run command: cd " your path to PythonRestAPIWrapper")
Next, run the following command:

powershell -ExecutionPolicy ByPass -File RegisterScheduledTaskForSciSpacyNLP.ps1

When prompted, specify the user and password for a local administrator and press Enter.
1. This user is used to run Spacy NLP install commands and must have permissions to create scheduled tasks and run PowerShell scripts on your system.
When prompted to proceed, enter 'y' for Yes.

Installation Verification and Continuous Operation

The installation process is complete when you see a message such as:

"Serving on http://yourcomputername.domain:8080"

See the screenshot below.

Check that the scheduled task that automatically starts the REST API wrapper was created successfully:

Open Windows Task Scheduler
Check for the task:
BAIRunSciSpacyNLPOnStartup

Note: This task automatically starts the REST API server at system startup.
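
In addition to the checks above, you can confirm that something is listening on the wrapper's port. The following is a minimal sketch that only tests whether a TCP connection can be opened. It does not verify that the listener is the SciSpacy wrapper. The function name `is_listening` is an assumption of this example, and the host and port in the commented usage are the placeholder values from the message shown earlier.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage (replace the placeholder host name with your own):
#   print(is_listening("yourcomputername.domain", 8080))
```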

Running and Maintaining Your SciSpacy Instance

Your SciSpacy instance runs from the Anaconda (Miniconda3) command shell.

NOTE: This shell MUST remain active for your SciSpacy instance to run.

If you close your Anaconda (Miniconda3) shell, or if it is closed automatically, your SciSpacy instance stops and is non-operational. Common causes include:

Server maintenance
Power outage
User error
Application error

Restart Your SciSpacy Instance

In the event you must restart your SciSpacy instance, perform the steps below.

Note: If you close this shell or shut down the machine running the shell, you must do one of the following:

Restart your system to automatically trigger the scheduled task that starts the service
OR
Manually run the created scheduled task
OR
Take the steps below:

Run Windows PowerShell as an Administrator.
Navigate to the PythonRestAPIWrapper folder:
cd " your path to PythonRestAPIWrapper"
Activate your spaCy environment:
powershell -ExecutionPolicy ByPass -File InstallSciSpacy.ps1
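
For the option of manually running the scheduled task, the task can also be started from a script. The sketch below builds a `schtasks /Run /TN` command for the task name given in the verification section. It assumes the task sits in the root of the Task Scheduler library and that the script runs with sufficient rights. The function names are hypothetical.

```python
import subprocess
import sys
from typing import List

# Task name as listed in the installation verification section.
TASK_NAME = "BAIRunSciSpacyNLPOnStartup"


def build_run_command(task_name: str = TASK_NAME) -> List[str]:
    """Build the schtasks command line that starts a scheduled task on demand."""
    return ["schtasks", "/Run", "/TN", task_name]


def run_task(task_name: str = TASK_NAME) -> int:
    """Start the task and return the schtasks exit code (Windows only)."""
    if not sys.platform.startswith("win"):
        raise RuntimeError("schtasks is only available on Windows")
    completed = subprocess.run(build_run_command(task_name), check=False)
    return completed.returncode
```

An exit code of 0 from `run_task()` indicates that the request to start the task was accepted, not that the REST API server has finished loading its models.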

Configure the SciSpacy NER Models

To configure (add/remove) the SciSpacy NER models used by your SciSpacy Instance:

Navigate to the SciSpacy models page and review the models available.
Decide on the new models you want to add.
Download all new models.
Create a new SciSpacyRequirements.txt.
- Example: SciSpacyRequirements0.txt
Add the download links of the models you want to use on separate lines as in the example below:
If the SciSpacy Python wrapper is running, close the Anaconda Console to stop the service.
Open the Anaconda Console as an Administrator and run the following commands:
```
conda activate scispacy
pip install -r SciSpacyRequirements0.txt --no-cache-dir --user
```
After the requirements are downloaded successfully:
1. Navigate to the PythonRestAPIWrapper folder
2. Edit the file "SciSpacyWrapper.py" as in the example below, depending on which models you want to use:
Add or remove lines similar to line 19 in the example above, each containing the name of a model you want to use (and downloaded via the SciSpacyRequirements.txt file).
To start the SciSpacy Python instance using the new model configuration, run the following command in the Anaconda console you have opened:
python.exe SciSpacyWrapper.py
The process is complete when you see a message displayed such as:

"Serving on http://yourcomputername.domain:8080"

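The requirements file from the steps above is a plain text file with one model download link per line. The sketch below shows one way to generate such a file and reject entries that are not links. The helper name `write_requirements` is hypothetical, and the URLs are placeholders only. Copy the real download links from the SciSpacy models page.

```python
from pathlib import Path
from typing import Iterable
from urllib.parse import urlparse


def write_requirements(path: Path, model_urls: Iterable[str]) -> int:
    """Write one model download link per line and return the number written."""
    lines = []
    for url in model_urls:
        url = url.strip()
        parsed = urlparse(url)
        # Nothing is written if any entry is not an http(s) link.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a download link: {url!r}")
        lines.append(url)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Placeholder links for illustration; these are not real model locations.
EXAMPLE_URLS = [
    "https://example.org/models/model_a.tar.gz",
    "https://example.org/models/model_b.tar.gz",
]

# Usage:
#   write_requirements(Path("SciSpacyRequirements0.txt"), EXAMPLE_URLS)
```
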
How to Add the SciSpacy Component to a Pipeline

To add the SciSpacy component to a pipeline in AutoClassifier:

See the generic component installation instructions here: How to Add Components to Pipelines.
Select "SciSpacy NER" (see the graphic below) as your component when you add your component in the "How to Add a Component" section.
Configure the SciSpacy NER component as specified in the topics below.

SciSpacy Pipeline Configuration

Within BA Insight AutoClassifier, navigate to the Pipelines section.
Open the SciSpacy pipeline component you added to your pipeline earlier.
Configure your component with the following information:

Api Endpoint:
- Specify the API endpoint: http://<servername>:<port>/ent.
- See the API Endpoint field in the graphic below as an example.
Input Property:
- Specify the input metadata to be used for processing.
Entity Hitcount Threshold:
- Number of times an entity must be found (or "hit") in a document in order for the document to be returned.
- Accepted range of values: 0-1.
Maximum No. of Requested Entities:
- The number of top entities to be returned.
- For example, a value of "10" indicates the top 10 entities should be returned.
UMLS Links Confidence Threshold:
- A scoring threshold.
- Accepted range of values: 0-1.
- Example: 0.85.
Send raw response as metadata:
- Output from Python API in JSON format.
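
The value ranges listed above can be checked before they are entered in the configuration form. The following is a minimal sketch, not the product's own validation. The class `SciSpacyStageSettings` and its field names are hypothetical, the host name in the usage is a placeholder, and the rule that the maximum number of entities must be at least 1 is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SciSpacyStageSettings:
    """Hypothetical container mirroring the fields of the configuration form."""
    api_endpoint: str
    input_property: str
    entity_hitcount_threshold: float
    max_requested_entities: int
    umls_confidence_threshold: float
    send_raw_response: bool = False

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means no issue found."""
        found = []
        if not self.api_endpoint.startswith(("http://", "https://")):
            found.append("Api Endpoint must start with http:// or https://")
        if not self.api_endpoint.rstrip("/").endswith("/ent"):
            found.append("Api Endpoint should end with /ent")
        if not self.input_property.strip():
            found.append("Input Property must not be empty")
        for label, value in (
            ("Entity Hitcount Threshold", self.entity_hitcount_threshold),
            ("UMLS Links Confidence Threshold", self.umls_confidence_threshold),
        ):
            if not 0 <= value <= 1:
                found.append(f"{label} must be between 0 and 1")
        if self.max_requested_entities < 1:  # assumption, see lead-in
            found.append("Maximum No. of Requested Entities must be at least 1")
        return found


# Usage:
#   settings = SciSpacyStageSettings("http://server:8080/ent", "body", 0.5, 10, 0.85)
#   print(settings.problems())
```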

Pipeline Order

The order of your AutoClassifier pipeline stages impacts the operation of the pipeline.

Note the following general guidelines:

Pipeline Stage	Order	Description
Tika	1st stage	When AutoClassifier must extract `body` from RawData received from any caller, the Tika stage must be the first stage. Tika extracts the body stored in JSON files (in a JSON entry).
GSK JSON	2nd stage	The next stage after Tika. This stage overwrites the body metadata with the text extracted from JSON files, as if it belonged to the original document.
SciSpacy NER, Mesh Tags, MS Text Analytics	Any	These stages cannot be 1st or 2nd, but can be any stage after that.
DocumentSummary	After MS Text Analytics	Must come after MS Text Analytics because it uses the output from the MS Text Analytics stage.

Example of Pipeline Stage Order

The following is an example of one possible valid order of pipeline stages:

Tika
SciSpacy NER
Document Summary Generator
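
The ordering guidelines in the table above can be expressed as a small check. The sketch below is one possible reading of those guidelines, not the product's own logic. It assumes stage names are unique and match the names in the table. It treats the analysis stages as valid anywhere after Tika and GSK JSON when those stages are present, so an order such as Tika followed by SciSpacy NER passes. It only checks DocumentSummary when MS Text Analytics is also in the pipeline. The function name `order_problems` is hypothetical.

```python
from typing import List

# Stages that the table allows at "any" position after the first stages.
ANALYSIS_STAGES = ("SciSpacy NER", "Mesh Tags", "MS Text Analytics")


def order_problems(stages: List[str]) -> List[str]:
    """Return a list of ordering problems; an empty list means none were found."""
    problems = []
    position = {name: index for index, name in enumerate(stages)}

    if "Tika" in position and position["Tika"] != 0:
        problems.append("Tika must be the first stage")

    if "GSK JSON" in position and position.get("Tika") != position["GSK JSON"] - 1:
        problems.append("GSK JSON must come directly after Tika")

    for name in ANALYSIS_STAGES:
        if name not in position:
            continue
        for earlier in ("Tika", "GSK JSON"):
            if earlier in position and position[name] < position[earlier]:
                problems.append(f"{name} must come after {earlier}")

    if "DocumentSummary" in position and "MS Text Analytics" in position:
        if position["DocumentSummary"] < position["MS Text Analytics"]:
            problems.append("DocumentSummary must come after MS Text Analytics")

    return problems
```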

Extracted Metadata

The following metadata is extracted by SciSpacy when your pipeline is run:

Detected entity classes in documents and their values:
- Gene-protein; diseases; drug-chemical
- Examples: rWGS, NiFeO, oxalate, and so on
Predominant entity type for documents:
- Indicates whether the document contains, for example, more protein entities or more drug entities among its detected entities.
- This makes it possible to tag the document with a document type based on what the document is most likely about.
List of UMLS CUI IDs of the concepts detected in the input property.
List of detailed information about the UMLS concepts including the concept name and concept definition.
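
To make the idea of a predominant entity type concrete, the sketch below counts entity labels and returns the most frequent one. How the component itself breaks ties or weights entities is not documented here, so this is an illustration only. The function name and the labels assigned to the sample values in the usage are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def dominant_entity_type(entities: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the label that occurs most often, or None if there are no entities."""
    counts = Counter(label for _text, label in entities)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Usage (labels are illustrative only):
#   sample = [("rWGS", "GENE_OR_GENE_PRODUCT"),
#             ("oxalate", "SIMPLE_CHEMICAL"),
#             ("NiFeO", "SIMPLE_CHEMICAL")]
#   print(dominant_entity_type(sample))
```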

Input Properties

The metadata property specified in the stage configuration in the Input Property field.

Input Property Example

Output Properties

For the NER result, the component returns a separate output property for each detected entity type, depending on the NER trained model.
Output Property Name Format: "SciSpacy_ENTITY_TYPE" as in the examples below.

Property	Type	Description
SciSpacy_ORGAN	Text – Multi	Example: 'cardiovascular', 'abdominal'
SciSpacy_TISSUE		Example: 'subcutaneous', 'adipose tissue'
SciSpacy_SIMPLE_CHEMICAL		Example: 'cardiovascular', 'abdominal'
SciSpacy_GENE_OR_GENE_PRODUCT
SciSpacy_CANCER
SciSpacy_MULTI-TISSUE_STRUCTURE		Example: 'oral glucose', 'visceral'
SciSpacy_ORGAN_ORGANISM_SUBSTANCE
SciSpacy_ORGANISM		Example: 'patients'
SciSpacy_CUIDDetails
SciSpacyCUIs
SciSpacyUMLSAliases		Example: 'babies', 'Infant (person)', 'Infant [Disease/Finding]'
SciSpacy_DominantEntityType	Text – Single
SciSpacySerializedEntitiesJson	Text – Single	Serialized value of the most important entities. Note that this is useful for summary generation.
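
The "SciSpacy_ENTITY_TYPE" naming pattern for the per-entity-type properties can be illustrated with a short sketch that groups entity values by label. This is not the component's implementation. The function name is hypothetical, and keeping each value only once is an assumption of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def to_output_properties(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts under property names of the form SciSpacy_<ENTITY_TYPE>."""
    grouped = defaultdict(list)
    for text, label in entities:
        values = grouped[f"SciSpacy_{label}"]
        if text not in values:  # keep each value once, in first-seen order
            values.append(text)
    return dict(grouped)


# Usage:
#   to_output_properties([("cardiovascular", "ORGAN"), ("patients", "ORGANISM")])
```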