Machine Learning Training Data

Training data can be used to create and train Machine Learning classification algorithms.

  • Training data uses publicly available data from NCBI (National Center for Biotechnology Information)

  • Available data sets:

    • PubMed
    • PMC (PubMed Central)
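
How the product retrieves these data sets is not documented here. As a rough sketch only, assuming retrieval goes through NCBI's public E-utilities ESearch endpoint, a search request for either data set could be built as follows. The helper name build_search_url and the DATABASES mapping are hypothetical and exist only for illustration; the endpoint and the db, term, retmax, retmode, and api_key parameters are part of E-utilities.

```python
from typing import Optional
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# E-utilities database names for the two data sets listed above.
DATABASES = {"PubMed": "pubmed", "PMC": "pmc"}


def build_search_url(data_set: str, term: str, max_results: int = 20,
                     api_key: Optional[str] = None) -> str:
    """Build an ESearch URL for one data set (hypothetical helper)."""
    params = {
        "db": DATABASES[data_set],
        "term": term,
        "retmax": max_results,   # upper bound on the number of IDs returned
        "retmode": "json",
    }
    # The API key is optional; it is only sent when one was supplied.
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("PubMed", "asthma", max_results=5))
```

The sketch only builds the URL and sends nothing, so request throttling would still have to be handled by the caller.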

To use training data:

  1. Open an existing Machine Learning Provider or create a new one
  2. Expand the "Available Training Data" section
  3. Select an available Training Set:



  4. Configure the "Available Training Data" options:



  5. APIKey (Optional):
    1. NCBI lets you register for an API key, which allows more concurrent requests when retrieving data.
    2. The current code throttles requests to the allowable limit of 3 concurrent requests per minute.
  6. Train Leaf Nodes Only:
    1. When retrieving training data, you can train all nodes in the Taxonomy Tree or only the leaf nodes (the lowest nodes in the hierarchy).
    2. Select this option to train only the leaf nodes rather than the whole hierarchy.
  7. # Documents Per Node:
    1. Specify the number of documents to use per node when training.
    2. Using more documents typically produces more accurate results but requires longer training time.
  8. # Nodes to Train:
    1. When testing, you can limit the number of nodes that training data is attached to by specifying a number.
    2. Example: If Train Leaf Nodes Only is selected and # Nodes to Train is set to 20, then only the first 20 leaf nodes will have training documents attached.
  9. Select a Creation/Training Option:
    1. Create and Train Taxonomy: Creates a new taxonomy and trains its nodes per the configuration options
    2. Create Taxonomy Only: Creates a new taxonomy but does not train any nodes
    3. Training Existing Taxonomy: If a taxonomy already exists, trains that taxonomy



  10. Nodes to Create: Optionally limit the number of taxonomy nodes to create when testing
  11. Click Load Training Data
    1. Depending on the options selected, this can be a time-consuming process.
    2. When completed, a taxonomy is attached to the Machine Learning Provider, with training data included if the configuration called for it.
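
The interaction between Train Leaf Nodes Only, # Documents Per Node, and # Nodes to Train can be pictured with a small sketch. This is not the product's implementation: the Node class and the walk, select_nodes, and plan_training functions are hypothetical, and the sketch assumes that "the first N nodes" means the first N in depth-first order, which the steps above do not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical taxonomy node; not the product's real class."""
    name: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> List[Node]:
    """Return the node and all of its descendants, depth-first."""
    nodes = [node]
    for child in node.children:
        nodes.extend(walk(child))
    return nodes


def select_nodes(root: Node, leaf_only: bool,
                 max_nodes: Optional[int] = None) -> List[Node]:
    """Apply 'Train Leaf Nodes Only' and '# Nodes to Train'."""
    nodes = walk(root)
    if leaf_only:
        # A leaf is a node with no children.
        nodes = [n for n in nodes if not n.children]
    # Assumption: the limit keeps the first max_nodes in traversal order.
    return nodes if max_nodes is None else nodes[:max_nodes]


def plan_training(root: Node, leaf_only: bool, docs_per_node: int,
                  max_nodes: Optional[int] = None) -> Dict[str, int]:
    """Map each selected node name to the number of documents to attach."""
    selected = select_nodes(root, leaf_only, max_nodes)
    return {n.name: docs_per_node for n in selected}


if __name__ == "__main__":
    tree = Node("Diseases", [
        Node("Respiratory", [Node("Asthma"), Node("Pneumonia")]),
        Node("Cardiac"),
    ])
    # Leaf nodes only, 50 documents each, limited to the first 2 leaves.
    print(plan_training(tree, leaf_only=True, docs_per_node=50, max_nodes=2))
    # prints {'Asthma': 50, 'Pneumonia': 50}
```

With the leaf-only option off, the same call would also include the parent nodes, so the node limit would be reached higher up the hierarchy.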