How to Extract Biomedical Text with ScispaCy

About ScispaCy

  • ScispaCy is a Python package containing spaCy models for processing biomedical, scientific, or clinical text.

Installation of Stand-Alone REST API Wrapper for SciSpacy Python-based NER

Prerequisites

Software

  • Anaconda Console (Miniconda)
    • This automatically installs Python and pip
  • Administrator rights on the local machine (to install the program as an Administrator)

Hardware

Server Requirements

  • Your server must have sufficient physical memory.
  • The Python application for the NER model consumes approximately 9 GB of RAM (with the NER model loaded in memory for efficiency)
  • Optional:
    • For best performance, install the SciSpacy wrapper on a separate server in your intranet.
    • To install the SciSpacy wrapper on a separate server, copy the PythonRestAPIWrapper folder (see below) to that server and continue with the installation steps there.

How to Install Miniconda

Miniconda is a free minimal installer for Conda. It is later used to install and activate ScispaCy. Use the following instructions to download and install Miniconda:

  1. Install Miniconda3 to manage separate virtual Python environments. The SciSpacy component only supports Miniconda with Python 3.8 and 3.9.
    • You can download Miniconda3 from the Miniconda repo. You must download a version that includes "py39" or "py38" in the download link. For example, Miniconda3-py39_24.7.1-0-Windows-x86_64.exe
  2. Run the downloaded .exe file as an Administrator.
  3. Install the application using the wizard. Check the settings:
    1. "Add Miniconda to my PATH environment variable"
    2. "Register Miniconda as my default Python"
    3. Info: You can choose not to add the PATH environment variables or register default Python, but scripted installation may fail and will require manual changes to your environment.
  4. Open Control Panel > System and Security > System, then, in the right pane, select Advanced system settings. (You can also reach this window by searching for "environment variables" in the search bar next to the Start menu.)
  5. Click Environment Variables, select Path in the System Variables section, and click Edit.
  6. On the Edit environment variable page, click New and add the highlighted paths. Click OK > OK to close the Environment Variables window.

  7. It is recommended that you restart your machine after updating the environment variables.

  8. To verify that Miniconda is installed successfully, open a Windows command prompt and run the command conda list. If it returns a list of packages, conda was installed successfully. If the command returns "conda is not recognized as internal or external command", make sure you have added all the environment variables specified in step 6.
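
The version and PATH checks described above can also be scripted. The following is a minimal sketch, not part of the product: it assumes you run it with the Python interpreter that Miniconda installed, and the function names are hypothetical.

```python
import shutil
import sys
from typing import Optional, Tuple

# Python versions the steps above say the SciSpacy component supports.
SUPPORTED_PYTHON = {(3, 8), (3, 9)}


def is_supported_python(version: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is one of the supported versions."""
    return tuple(version[:2]) in SUPPORTED_PYTHON


def find_conda() -> Optional[str]:
    """Return the full path of the conda executable, or None if it is not on PATH."""
    return shutil.which("conda")


if __name__ == "__main__":
    current = sys.version_info[:2]
    print(f"Python {current[0]}.{current[1]} supported: {is_supported_python(current)}")
    conda_path = find_conda()
    print(f"conda found at: {conda_path}" if conda_path else "conda is not on PATH")
```

If the script reports that conda is not on PATH, revisit the environment variables added in step 6.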

How to Install and Activate SciSpacy

Locate the PythonRestAPIWrapper folder in the BA Insight <AutoClassifier_engine_install_root>\Admin Site directory.

The files inside this folder are used during the following installation process.

  1. Open Windows PowerShell as an Administrator
  2. Navigate to the PythonRestAPIWrapper folder (run command: cd "your path to PythonRestAPIWrapper")
  3. Next, run the following command:

    powershell -ExecutionPolicy ByPass -File RegisterScheduledTaskForSciSpacy.ps1
  4. When prompted, specify the user and password for a local administrator and press Enter.
    1. This user is used to run Spacy NLP install commands and must have permissions to create scheduled tasks and run PowerShell scripts on your system.
  5. When prompted to proceed, enter 'y' for Yes.

Installation Verification and Continuous Operation

The installation process is complete when you see a message, similar to the following, displayed in a new PowerShell console:

"Serving on http://yourcomputername.domain:8080"

Check that the scheduled task that automatically starts the REST API wrapper was created successfully:

  1. Open Windows Task Scheduler
  2. Check for the task:
      BAIRunSciSpacyNLPOnStartup

Note: This task automatically starts the REST API server at system startup.
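
In addition to watching for the console message, you can check that something is accepting connections on the wrapper's port. This is a minimal sketch using only the Python standard library. The host "localhost" and port 8080 are assumptions taken from the sample message; substitute the values shown in your own "Serving on ..." line. A successful connection only shows that the port is open, not that the NER models have finished loading.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Assumed values; replace with the host and port from your console message.
    print("wrapper reachable:", is_listening("localhost", 8080))
```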

Running and Maintaining Your SciSpacy Instance

Your SciSpacy instance runs from the Anaconda (Miniconda3) command shell.

NOTE: This shell MUST remain active for your SciSpacy instance to run.

If you close your Anaconda (Miniconda3) shell, or if it is closed automatically for reasons such as:

  • Server maintenance
  • Power outage
  • User error
  • Application error

Your SciSpacy instance is then closed and non-operational.

Restart Your SciSpacy Instance

In the event you must restart your SciSpacy instance, perform the steps below.

Note: If you close this shell or shut down the machine running the shell, you must do one of the following:

  • Restart your system to automatically trigger the scheduled task that starts the service
    OR
  • Manually run the created scheduled task
    OR
  • Take the steps below:
    1. Run Windows PowerShell as an Administrator.
    2. Navigate to the PythonRestAPIWrapper folder:
       cd "your path to PythonRestAPIWrapper"
    3. Activate your spaCy environment:
       powershell -ExecutionPolicy ByPass -File InstallSciSpacy.ps1
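
The option of manually running the scheduled task can also be scripted. The sketch below calls the Windows schtasks utility from Python. It assumes the task is registered under the name BAIRunSciSpacyNLPOnStartup at the root of the Task Scheduler library, and that the script runs from an elevated prompt. The function names are hypothetical.

```python
import subprocess
from typing import List

# Task name as listed in Task Scheduler in the verification section.
TASK_NAME = "BAIRunSciSpacyNLPOnStartup"


def build_run_command(task_name: str = TASK_NAME) -> List[str]:
    """Build the schtasks command line that starts a scheduled task on demand."""
    return ["schtasks", "/Run", "/TN", task_name]


def run_task(task_name: str = TASK_NAME) -> int:
    """Start the task and return the schtasks exit code (0 means success)."""
    completed = subprocess.run(build_run_command(task_name), check=False)
    return completed.returncode
```

Calling run_task() on the server returns 0 when Task Scheduler accepts the request. The wrapper itself may still need some time to load its models afterwards.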

Configure the SciSpacy NER Models

To configure (add/remove) the SciSpacy NER models used by your SciSpacy Instance:

  1. Navigate to the SciSpacy models page and review the models available.
  2. Decide on the new models you want to add.
  3. Download all new models.
  4. Create a new SciSpacyRequirements.txt file.
    • Example: SciSpacyRequirements0.txt
  5. Add the download links of the models you want to use on separate lines as in the example below:


  6. If the SciSpacy Python wrapper is running, close the Anaconda Console to stop the service.
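  7. Open the Anaconda Console as an Administrator and run the following commands (if the requirements file you created in step 4 has a different name, use that name in the pip command):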

    conda activate scispacy
    pip install -r requirements0.txt --no-cache-dir --user
  8. After the requirements are downloaded successfully:
    1. Navigate to the PythonRestAPIWrapper folder
    2. Edit the file "SciSpacyWrapper.py" as in the example below, depending on which models you want to use:


  9. Add or remove lines similar to line 19 in the example above, each containing the name of a model you want to use (and have downloaded via the SciSpacyRequirements.txt file).
  10. To start the SciSpacy Python instance using the new model configuration, run the following command in the Anaconda console you have opened:
       python.exe SciSpacyWrapper.py
  11. The process is complete when you see a message displayed such as:

"Serving on http://yourcomputername.domain:8080"

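Before editing SciSpacyWrapper.py, it can help to confirm that every model you plan to reference was actually installed by pip. ScispaCy models install as regular Python packages, so the standard library can look them up. The sketch below is an illustration only: the model names in it are a hypothetical selection, and it must be run inside the activated scispacy environment.

```python
import importlib.util
from typing import Iterable, List


def missing_models(model_names: Iterable[str]) -> List[str]:
    """Return the model package names that cannot be found in this environment."""
    return [name for name in model_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical selection; replace with the models listed in your requirements file.
    wanted = ["en_core_sci_sm", "en_ner_bionlp13cg_md"]
    missing = missing_models(wanted)
    print("All models installed" if not missing else f"Missing models: {missing}")
```

Any name reported as missing should be installed again before it is added to the wrapper configuration.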
How to Add the SciSpacy Component to a Pipeline

To add the SciSpacy component to a pipeline in AutoClassifier:

  1. See the generic component installation instructions here: How to Add Components to Pipelines.
  2. Select SciSpacy NER as your component when you add your component in the "How to Add a Component" section. 
  3. Configure the SciSpacy NER component as specified in the topics below.

SciSpacy Pipeline Configuration

  1. Within BA Insight AutoClassifier, navigate to the Pipelines section.
  2. Open the SciSpacy pipeline component you added to your pipeline earlier and complete the following fields:
    1. Api Endpoint: Specify the API endpoint: http://<servername>:<port>/ent.
    2. Input Property: Specify the input metadata to be used for processing. 
    3. Entity Hitcount Threshold: Specify the number of times an entity must be found (or "hit") in a document in order for the document to be returned. The accepted range of values is 0-1.
    4. Maximum No. of Requested Entities: Specify the number of top entities to be returned. For example, a value of "10" indicates the top 10 entities should be returned. 
    5. UMLS Links Confidence Threshold: Specify a scoring threshold. The accepted range of values is 0-1. For example, 0.85.
    6. Send raw response as metadata: Enable this field to output the raw response from the Python API as metadata in JSON format.
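
The endpoint format and the 0-1 ranges above can be checked before you enter the values in the component. This is a minimal sketch: the helper names and the server name "nlp01" are hypothetical, and the port range check is a general TCP constraint rather than something the component documents.

```python
from typing import Dict, List


def build_endpoint(server: str, port: int) -> str:
    """Format the Api Endpoint value as http://<servername>:<port>/ent."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"http://{server}:{port}/ent"


def threshold_problems(thresholds: Dict[str, float]) -> List[str]:
    """Return a message for every threshold outside the accepted 0-1 range."""
    return [
        f"{name} must be between 0 and 1 (got {value})"
        for name, value in thresholds.items()
        if not 0 <= value <= 1
    ]


if __name__ == "__main__":
    print(build_endpoint("nlp01", 8080))
    print(threshold_problems({"UMLS Links Confidence Threshold": 0.85}))
```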

Pipeline Order

The order of your AutoClassifier pipeline stages impacts the operation of the pipeline. Note the following general guidelines:

| Pipeline Stage | Order | Description |
|---|---|---|
| Tika | 1st stage | When AutoClassifier must extract the body from RawData received from any caller, the Tika stage must be the first stage. Tika extracts the body stored in JSON files (in a JSON entry). |
| GSK JSON | 2nd stage | Next stage after Tika. The purpose of this stage is to overwrite the body metadata with the text extracted from JSON files, as if it belonged to the original document. |
| SciSpacy NER, Mesh Tags, MS Text Analytics | Any | These stages cannot be 1st or 2nd, but can be any stage after that. |
| DocumentSummary | After MS Text Analytics | Must be after MS Text Analytics as it uses the output from the MS Text Analytics stage. |

Example of Pipeline Stage Order

The following is an example of one possible valid order of pipeline stages:

  • Tika
  • SciSpacy NER
  • Document Summary Generator
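
The ordering guidelines can be expressed as a small check. The sketch below is one interpretation, not the product's own validation. It only tests relative order: Tika first when present, GSK JSON directly after Tika when both are present, and DocumentSummary after MS Text Analytics when both are present. The stage names are taken from the guidelines above and are assumed to match the names in your pipeline.

```python
from typing import List


def order_problems(stages: List[str]) -> List[str]:
    """Return a list of guideline violations for the given stage order."""
    problems = []
    # Map each stage name to its position in the pipeline.
    position = {name: index for index, name in enumerate(stages)}

    if "Tika" in position and position["Tika"] != 0:
        problems.append("Tika must be the first stage")
    if "GSK JSON" in position and "Tika" in position:
        if position["GSK JSON"] != position["Tika"] + 1:
            problems.append("GSK JSON must directly follow Tika")
    if "DocumentSummary" in position and "MS Text Analytics" in position:
        if position["DocumentSummary"] < position["MS Text Analytics"]:
            problems.append("DocumentSummary must come after MS Text Analytics")
    return problems


if __name__ == "__main__":
    print(order_problems(["Tika", "SciSpacy NER", "DocumentSummary"]))
```

An empty list means none of the checked guidelines is violated.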

Extracted Metadata

The following metadata is extracted by SciSpacy when your pipeline is run:

  • Detected entity classes in documents and their values:
    • Gene-protein; diseases; drug-chemical
    • Examples: rWGS, NiFeO, oxalate, and so on
  • Predominant entity type for documents:
    • Whether the document contains more protein entities or more drug entities among the detected entities.
    • This makes it possible to tag the document with a document type based on what the document is most likely about.
  • List of UMLS CUI IDs of the concepts detected in the input property.
  • List of detailed information about the UMLS concepts including the concept name and concept definition. 
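
To illustrate the idea of a predominant entity type, the sketch below counts entity labels and returns the most frequent one. How the component actually determines the predominant type is not documented here, so the simple majority count, the function name, and the sample entities are all assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def dominant_entity_type(entities: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the most frequent entity label, or None when there are no entities."""
    counts = Counter(label for _text, label in entities)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical (text, label) pairs, as an NER model might produce them.
    sample = [
        ("BRCA1", "GENE_OR_GENE_PRODUCT"),
        ("oxalate", "SIMPLE_CHEMICAL"),
        ("TP53", "GENE_OR_GENE_PRODUCT"),
    ]
    print(dominant_entity_type(sample))
```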

Input Properties

  • The metadata property specified in the stage configuration in the Input Property field. 

Input Property Example

Output Properties

  • For the NER result, the component returns a separate output property for each detected entity type, depending on the NER trained model.
  • Output Property Name Format: "SciSpacy_ENTITY_TYPE" as in the examples below. 

| Property | Type | Description |
|---|---|---|
| SciSpacy_ORGAN | Text – Multi | Example: 'cardiovascular', 'abdominal' |
| SciSpacy_TISSUE | Text – Multi | Example: 'subcutaneous', 'adipose tissue' |
| SciSpacy_SIMPLE_CHEMICAL | Text – Multi | Example: 'cardiovascular', 'abdominal' |
| SciSpacy_GENE_OR_GENE_PRODUCT | Text – Multi | |
| SciSpacy_CANCER | Text – Multi | |
| SciSpacy_MULTI-TISSUE_STRUCTURE | Text – Multi | Example: 'oral glucose', 'visceral' |
| SciSpacy_ORGAN_ORGANISM_SUBSTANCE | Text – Multi | |
| SciSpacy_ORGANISM | Text – Multi | Example: 'patients' |
| SciSpacy_CUIDDetails | Text – Multi | |
| SciSpacyCUIs | Text – Multi | |
| SciSpacyUMLSAliases | Text – Multi | Example: 'babies', 'Infant (person)', 'Infant [Disease/Finding]' |
| SciSpacy_DominantEntityType | Text – Single | |
| SciSpacySerializedEntitiesJson | Text – Single | |
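
The naming format for the per-entity-type properties can be illustrated with a short sketch that groups detected entities under "SciSpacy_<ENTITY_TYPE>" names. This is not the component's implementation: the function name is hypothetical, and dropping repeated values is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def to_output_properties(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts under "SciSpacy_<ENTITY_TYPE>" property names."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in entities:
        name = f"SciSpacy_{label}"
        if text not in grouped[name]:  # keep each value once, in first-seen order
            grouped[name].append(text)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical (text, label) pairs using values from the examples above.
    detected = [
        ("cardiovascular", "ORGAN"),
        ("adipose tissue", "TISSUE"),
        ("abdominal", "ORGAN"),
    ]
    print(to_output_properties(detected))
```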

Troubleshooting

Note the following troubleshooting guidelines if you run into errors during setup.

Unspecified environment variables

  • If you see an error similar to the following, make sure you have added the paths specified in How to Install Miniconda to your environment variables.

Invalid Microsoft C++ build tools

  • If you see an error similar to the following, you must download the Microsoft C++ build tools. For more information, see How to install build tools.