How to Extract Biomedical Text with ScispaCy
About ScispaCy
ScispaCy is a Python package containing spaCy models for processing biomedical, scientific, or clinical text.
Installation of Stand-Alone REST API Wrapper for SciSpacy Python-based NER
Prerequisites
Software
- Anaconda Console (Miniconda)
- This automatically installs Python and pip
- Administrator rights on the local machine (to install the program as an Administrator)
Hardware
Server Requirements
- Your server must have sufficient physical memory.
- The Python application for the NER model consumes approximately 9 GB of RAM (with the NER model loaded in memory for efficiency)
- Optional:
- For best performance, install the SciSpacy wrapper on a separate server in your intranet.
- To install the SciSpacy wrapper on a separate server, copy the PythonRestAPIWrapper folder (see below) to that server and continue with the installation steps there.
How to Install Miniconda
Miniconda is a free minimal installer for Conda. It is later used to install and activate ScispaCy. Use the following instructions to download and install Miniconda:
- Install Miniconda3 for separate virtual Python environments. The SciSpacy component only supports Miniconda with Python 3.8 and 3.9.
- You can download Miniconda3 from the Miniconda repo. You must download a version that includes "py39" or "py38" in the download link. For example, Miniconda3-py39_24.7.1-0-Windows-x86_64.exe
- Run the downloaded .exe file as an Administrator.
- Install the application using the wizard. Check the settings:
- "Add Miniconda to my PATH environment variable"
- "Register Miniconda as my default Python"
- Info: You can choose not to add the PATH environment variables or register default Python, but scripted installation may fail and will require manual changes to your environment.
- Open Control Panel > System and Security > System, then, in the right pane, select Advanced system settings. (You can also open this window by searching for "environment variables" in the search bar next to the Start menu.)
- Click Environment Variables, select Path in the System Variables section, and click Edit.
- On the Edit environment variable page, click New and add the following highlighted paths. Click OK > OK to close the Environment Variables window.
- It is recommended that you restart your machine after updating the environment variables.
- To verify that Miniconda is installed successfully, open a Windows command prompt and run the command:
conda list
If it returns a list of packages, conda was installed successfully. If the command returns "'conda' is not recognized as an internal or external command", make sure you have added all the environment variable paths specified in step 6 (adding the paths to the Path variable).
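The PATH check can also be scripted. The following is a minimal Python sketch and not part of the product; the function name find_conda is hypothetical. It uses the standard library call shutil.which to report whether a conda executable is visible on the current PATH, which is the condition the conda list command depends on.

```python
import shutil


def find_conda():
    """Return the full path of the conda executable, or None if PATH lacks it.

    find_conda is a hypothetical helper name used only for this illustration.
    """
    return shutil.which("conda")


conda_path = find_conda()
if conda_path is None:
    print("conda was not found on PATH; review the environment variables.")
else:
    print("conda found at:", conda_path)
```

If the script reports that conda was not found, open a new command prompt after updating the environment variables, because a prompt that was already open keeps the old PATH.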
How to Install and Activate SciSpacy
Locate the PythonRestAPIWrapper folder in the BA Insight <AutoClassifier_engine_install_root>\Admin Site directory.
The files inside this folder are used during the following installation process.
- Open Windows PowerShell as an Administrator.
- Navigate to the PythonRestAPIWrapper folder (run command: cd "your path to PythonRestAPIWrapper").
- Next, run the following command:
powershell -ExecutionPolicy ByPass -File RegisterScheduledTaskForSciSpacy.ps1
- When prompted, specify the user and password for a local administrator and press Enter.
- This user is used to run Spacy NLP install commands and must have permissions to create scheduled tasks and run PowerShell scripts on your system.
- When prompted to proceed, enter 'y' for Yes.
Installation Verification and Continuous Operation
The installation process is complete when you see a message similar to the following displayed in a new PowerShell console:
"Serving on http://yourcomputername.domain:8080"
Check that the scheduled task that automatically starts the REST API wrapper was created successfully:
- Open Windows Task Scheduler
- Check for the task:
BAIRunSciSpacyNLPOnStartup
Note: This task automatically starts the REST API server at system startup.
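Besides watching for the "Serving on" message, you can confirm that something is listening on the wrapper's port. The following is a minimal Python sketch and not part of the product; the function name is_port_open is hypothetical, and the host and port in the usage comment are the placeholders from the message above. It only tests that a TCP connection can be opened and does not call the API itself.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    is_port_open is a hypothetical helper name used only for this illustration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and host names that do not resolve.
        return False


# Example usage (placeholder host; replace with your server name):
# is_port_open("yourcomputername.domain", 8080)
```

A result of False after a restart may mean the scheduled task has not finished starting the server yet, so it can be worth retrying after a short wait.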
Running and Maintaining Your SciSpacy Instance
Your SciSpacy instance runs from the Anaconda (Miniconda3) command shell.
If you close your Anaconda (Miniconda3) shell, or if it is closed automatically for reasons such as the following, your SciSpacy instance stops and is non-operational:
- Server maintenance
- Power outage
- User error
- Application error
Restart Your SciSpacy Instance
In the event you must restart your SciSpacy instance, perform the steps below.
Note: If you close this shell or shut down the machine running the shell, you must do one of the following:
- Restart your system to automatically trigger the scheduled task that starts the service
- OR manually run the created scheduled task
- OR take the steps below:
- Run Windows PowerShell as an Administrator.
- Navigate to the PythonRestAPIWrapper folder:
cd "your path to PythonRestAPIWrapper"
- Activate your spaCy environment:
powershell -ExecutionPolicy ByPass -File InstallSciSpacy.ps1
Configure the SciSpacy NER Models
To configure (add/remove) the SciSpacy NER models used by your SciSpacy Instance:
- Navigate to the SciSpacy models page and review the available models.
- Decide on the new models you want to add.
- Download all new models.
- Create a new SciSpacyRequirements.txt.
- Example: SciSpacyRequirements0.txt
- Add the download links of the models you want to use on separate lines as in the example below:
- If the SciSpacy Python wrapper is running, close the Anaconda Console to stop the service.
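- Open the Anaconda Console as an Administrator and run the following commands, replacing requirements0.txt with the name of the requirements file you created (for example, SciSpacyRequirements0.txt):
conda activate scispacy
pip install -r requirements0.txt --no-cache-dir --user
- After the requirements are downloaded successfully: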
- Navigate to the PythonRestAPIWrapper folder
- Edit the file "SciSpacyWrapper.py" as in the example below, depending on which models you want to use:
- Add or remove lines similar to line 19 in the example above, each containing the name of a model you want to use (and that you downloaded via the SciSpacyRequirements.txt file).
- To start the SciSpacy Python instance using the new model configuration, run the following command in the Anaconda Console you opened:
python.exe SciSpacyWrapper.py
- The process is complete when you see a message such as:
"Serving on http://yourcomputername.domain:8080"
How to Add the SciSpacy Component to a Pipeline
To add the SciSpacy component to a pipeline in AutoClassifier:
- See the generic component installation instructions here: How to Add Components to Pipelines.
- Select SciSpacy NER as your component when you add your component in the "How to Add a Component" section.
- Configure the SciSpacy NER component as specified in the topics below.
SciSpacy Pipeline Configuration
- Within BA Insight AutoClassifier, navigate to the Pipelines section.
- Open the SciSpacy pipeline component you added to your pipeline earlier and complete the following fields:
- Api Endpoint: Specify the API endpoint: http://<servername>:<port>/ent.
- Input Property: Specify the input metadata to be used for processing.
- Entity Hitcount Threshold: Specify the number of times an entity must be found (or "hit") in a document in order for the document to be returned. The accepted range of values is 0-1.
- Maximum No. of Requested Entities: Specify the number of top entities to be returned. For example, a value of "10" indicates the top 10 entities should be returned.
- UMLS Links Confidence Threshold: Specify a scoring threshold. The accepted range of values is 0-1. For example, 0.85.
- Send raw response as metadata: Enable this field to output the raw response from the Python API as metadata in JSON format.
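To illustrate the field formats above, here is a minimal Python sketch and not the component's own code. The helper names build_endpoint and check_threshold and the dictionary keys are hypothetical, and the server name is a placeholder. It builds an endpoint of the form http://<servername>:<port>/ent and rejects a threshold outside the documented 0-1 range.

```python
def build_endpoint(server_name, port):
    """Build the API endpoint in the form http://<servername>:<port>/ent."""
    return f"http://{server_name}:{port}/ent"


def check_threshold(name, value):
    """Return value if it lies in the documented 0-1 range, else raise ValueError."""
    if not 0 <= value <= 1:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value


# Example values only; the keys are hypothetical and the host is a placeholder.
settings = {
    "api_endpoint": build_endpoint("yourcomputername.domain", 8080),
    "umls_links_confidence_threshold": check_threshold(
        "UMLS Links Confidence Threshold", 0.85
    ),
    "max_requested_entities": 10,
}
print(settings["api_endpoint"])  # http://yourcomputername.domain:8080/ent
```

Validating the values before entering them in the pipeline configuration can catch a mistyped port or an out-of-range threshold early.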
Pipeline Order
The order of your AutoClassifier pipeline stages impacts the operation of the pipeline. Note the following general guidelines:
Pipeline Stage | Order | Description |
---|---|---|
1st stage | | |
GSK JSON | 2nd stage | |
SciSpacy NER | Any | These stages cannot be 1st or 2nd, but can be any stage after that |
 | after MS Text Analytics | Must be after MS Text Analytics as it uses the output from the MS Text Analytics stage. |
Example of Pipeline Stage Order
The following is an example of one possible valid order of pipeline stages:
- Tika
- SciSpacy NER
- Document Summary Generator
Extracted Metadata
The following metadata is extracted by SciSpacy when your pipeline is run:
- Detected entity classes in documents and their values:
- Gene-protein; diseases; drug-chemical
- Examples: rWGS, NiFeO, oxalate, and so on
- Predominant entity type for documents:
- Whether the document contains more protein entities or more drug entities among the detected entities
- This makes it possible to tag the document with a document type based on what the document is most likely about
- List of UMLS CUI IDs of the concepts detected in the input property.
- List of detailed information about the UMLS concepts including the concept name and concept definition.
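The predominant entity type described above can be pictured as a simple frequency count. The following is a minimal Python sketch under stated assumptions, not the component's actual logic: the function name dominant_entity_type, the (text, label) input shape, and the labels assigned to the sample values are all illustrative.

```python
from collections import Counter


def dominant_entity_type(entities):
    """Return the most frequent entity label, or None if nothing was detected.

    `entities` is a list of (text, label) pairs; this input shape is an
    assumption made for the illustration.
    """
    counts = Counter(label for _text, label in entities)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Sample values from the list above; the labels are illustrative only.
sample = [
    ("oxalate", "SIMPLE_CHEMICAL"),
    ("NiFeO", "SIMPLE_CHEMICAL"),
    ("rWGS", "GENE_OR_GENE_PRODUCT"),
]
print(dominant_entity_type(sample))  # SIMPLE_CHEMICAL
```

In this sketch a tie is resolved in favor of the label that was seen first; how the component resolves ties is not stated in this topic.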
Input Properties
- The metadata property specified in the stage configuration in the Input Property field.
Input Property Example
Output Properties
- For the NER result, the component returns a separate output property for each detected entity type, depending on the NER trained model.
- Output Property Name Format: "SciSpacy_ENTITY_TYPE" as in the examples below.
Property | Type | Description |
---|---|---|
SciSpacy_ORGAN | Text – Multi | Example: 'cardiovascular', 'abdominal' |
SciSpacy_TISSUE | | Example: 'subcutaneous', 'adipose tissue' |
SciSpacy_SIMPLE_CHEMICAL | | Example: 'cardiovascular', 'abdominal' |
SciSpacy_GENE_OR_GENE_PRODUCT | | |
SciSpacy_CANCER | | |
SciSpacy_MULTI-TISSUE_STRUCTURE | | Example: 'oral glucose', 'visceral' |
SciSpacy_ORGAN_ORGANISM_SUBSTANCE | | |
SciSpacy_ORGANISM | | Example: 'patients' |
SciSpacy_CUIDDetails | | |
SciSpacyCUIs | | |
SciSpacyUMLSAliases | | Example: 'babies', 'Infant (person)', 'Infant [Disease/Finding]' |
SciSpacy_DominantEntityType | Text – Single | |
SciSpacySerializedEntitiesJson | Text – Single | |
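The "SciSpacy_ENTITY_TYPE" naming format can be sketched as a grouping step. The following minimal Python example is an illustration under stated assumptions, not the component's implementation: the function name to_output_properties and the (text, label) input shape are hypothetical, and dropping repeated values is an assumption.

```python
from collections import defaultdict


def to_output_properties(entities, prefix="SciSpacy_"):
    """Group entity texts under property names of the form SciSpacy_<ENTITY_TYPE>.

    `entities` is a list of (text, label) pairs; this shape is an assumption.
    """
    properties = defaultdict(list)
    for text, label in entities:
        values = properties[prefix + label]
        if text not in values:  # assumption: keep each value once, first-seen order
            values.append(text)
    return dict(properties)


sample = [
    ("cardiovascular", "ORGAN"),
    ("abdominal", "ORGAN"),
    ("adipose tissue", "TISSUE"),
    ("abdominal", "ORGAN"),
]
print(to_output_properties(sample))
# {'SciSpacy_ORGAN': ['cardiovascular', 'abdominal'], 'SciSpacy_TISSUE': ['adipose tissue']}
```

Each key corresponds to one multi-value output property, so the set of properties you see depends on which entity types the loaded NER models can detect.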
Troubleshooting
Note the following troubleshooting guidelines if you run into errors during setup.
Unspecified environment variables
- If you see an error similar to the following, make sure you have added the specified paths as described in How to Install Miniconda.
Invalid Microsoft C++ build tools
- If you see an error similar to the following, you must download the Microsoft C++ build tools. For more information, see How to install build tools.