Using Python script capabilities

The Python Scripting component lets you run a Python script of your choosing against your documents. This topic shows you how to configure the component and provides use cases that you can incorporate into your AutoClassifier instance.

Prerequisites

You must complete the following prerequisites before setting up the Python Script component:

  • You must have Python 3.9.0 or later installed on your AutoClassifier server.

  • You must have selected the Install AutoClassifier Python Script Runner REST API option when you installed your AutoClassifier engine.

  • You must avoid using scripts that contain infinite loops. These will cause the Python service to hang and waste resources.

  • You must not run scripts containing malicious or harmful code.

Limitations

Note the following limitations:

  • If you reset your Python environment and do not reinstall the necessary packages, your script will not run successfully. However, as long as your Python script contains no syntax errors, clicking the Compile button will not report this problem. As a result, Upland BA Insight highly recommends that you test your component configuration before adding your component and crawling your data with Connectivity Hub.

  • If your Python environment is reset or a new pipeline is imported, you must recompile the Python script and reinstall the necessary packages. This applies to the following situations:

    • The existing Python environment is deleted from the Upland BA Insight\AutoClassifier\Python Script Runner folder.

    • A new AutoClassifier pipeline is imported that includes the Python Scripting Stage. In this case, do the following:

      1. If you have an existing Python environment, you must delete it.

      2. Recompile the Python script to ensure the latest version is validated and prepared.

      3. Reinstall all of the required Python packages to rebuild a clean environment.

Configure the component

To configure your component, do the following:

  1. In the AutoClassifier administration portal, add a new component to a new or existing pipeline.

  2. When adding your component, select Python Script from the New Component list and provide a Name for your component.

  3. In the Python Version drop-down menu, select the version of Python you want to use for your script. If you have multiple Python versions installed on your server, they will all be displayed in the list. You can click Reset if you wish to clear your Python version environment.

  4. In the Enter packages in Python requirements.txt format field, enter the packages that are necessary for your Python script to run. For the expected format, see the example after these steps.

  5. Click Install Packages to install the specified packages to your Python environment.

  6. In the Enter your Python script field, enter your Python script according to your specifications.

  7. Click Compile. If there are any errors with your Python script syntax, they will display below the Compile button. If there are no errors, a "Script was successfully compiled" message displays.

  8. Click Save.
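
The Enter packages in Python requirements.txt format field (step 4) accepts standard requirements.txt syntax: one package per line, with an optional version pin. The package names and version below are only illustrative; list the packages your own script actually imports.

langchain_text_splitters
bs4
requests==2.31.0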

Use cases and examples

Use LangChain text splitters

LangChain text splitters allow you to take a document and split it into chunks that can be used for retrieval.

  1. In the Enter packages in Python requirements.txt format field, enter the following packages:

    langchain_text_splitters
    bs4
    langchain
  2. In the Enter your Python script field, enter the following script:

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    # Split the item's body into chunks of at most 1000 characters with no overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(item.Get('body'))
    # Store each chunk back on the item as chunk_0, chunk_1, ...
    for index, chunk in enumerate(texts):
        item.Set(f'chunk_{index}', chunk)
  3. You can update the chunk_size and chunk_overlap settings to your specifications (see the example after this list):

    • chunk_size: Specifies the number of characters included in each chunk.

    • chunk_overlap: Specifies the number of characters that will carry over into subsequent chunks.
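
For example, to produce chunks of at most 500 characters with a 50-character overlap between consecutive chunks, you could change the splitter line as follows (the values shown are illustrative):

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)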

Use LangChain for HTML splitting

LangChain text splitters allow you to split an HTML page into chunks of a configured size that can be used for retrieval.

  1. In the Enter packages in Python requirements.txt format field, enter the following packages:

    langchain_text_splitters
    bs4
    langchain
  2. In the Enter your Python script field, enter the following script:

    import base64
    import os
    from bs4 import BeautifulSoup, Comment
    from langchain_text_splitters.html import HTMLSemanticPreservingSplitter
     
    # ==============================
    # CONFIG
    # ==============================
    MAX_CHUNK_SIZE = 10000
    CHUNK_OVERLAP = 200
    MIN_CHUNK_SIZE = 1000  # Minimum characters per chunk (adjust as needed)
     
    # ==============================
    # HELPER FUNCTION: Metadata Key
    # ==============================
    def get_metadata_key(chunk, idx=None):
        """Generate a consistent metadata key for a chunk."""
        if chunk.metadata:
            return " | ".join(f"{k}:{v}" for k, v in chunk.metadata.items() if v is not None)
        return f"unlabeled_{idx}" if idx is not None else "unlabeled"
     
    def merge_and_split_chunks(chunks, min_size=MIN_CHUNK_SIZE, max_size=MAX_CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
        """
        Merge small chunks and split large chunks, while preserving Document objects and metadata.
        Returns a list of Document objects.
        """
        # Step 1: Merge small chunks
        merged_chunks = []
        buffer = None
        separator = "<br><br><br>"
     
        for i, chunk in enumerate(chunks, start=1):
            metadata_key = get_metadata_key(chunk, i)
            chunk_text_with_marker = f"<strong>[{metadata_key}]\n{chunk.page_content}</strong>"
     
            if buffer is None:
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
            else:
                if len(buffer.page_content) < min_size:
                    buffer.page_content += separator + chunk_text_with_marker
                else:
                    merged_chunks.append(buffer)
                    buffer = type(chunk)(
                        page_content=chunk_text_with_marker,
                        metadata=dict(chunk.metadata or {})
                    )
     
        if buffer:
            merged_chunks.append(buffer)
     
        # Step 2: Split large chunks
        final_chunks = []
        for chunk in merged_chunks:
            text = chunk.page_content
            if len(text) > max_size:
                parts = hard_split_text(
                    text,
                    max_size=max_size,
                    chunk_overlap=chunk_overlap
                )
                for j, part in enumerate(parts, start=1):
                    new_md = dict(chunk.metadata or {})
                    new_md["split_index"] = j
                    new_md["split_total"] = len(parts)
                    final_chunks.append(type(chunk)(page_content=part, metadata=new_md))
            else:
                final_chunks.append(chunk)
     
        return final_chunks
     
     
    # ==============================
    # HELPER FUNCTION: Merge Small Chunks
    # ==============================
    def merge_small_chunks(chunks, min_size=MIN_CHUNK_SIZE):
        """
        Merge small chunks with the next chunk until they meet min_size.
        Prefixes each chunk with its metadata marker and joins merged chunks with an HTML line-break separator.
        """
        merged_chunks = []
        buffer = None
        separator = "<br><br><br>"  # three newlines
     
        for i, chunk in enumerate(chunks, start=1):
            metadata_key = get_metadata_key(chunk, i)
            chunk_text_with_marker = f"[{metadata_key}]\n{chunk.page_content}"
     
            if buffer is None:
                # Start a new buffer with a fresh chunk object
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
            else:
                if len(buffer.page_content) < min_size:
                    # Append with separator
                    buffer.page_content += separator + chunk_text_with_marker
                else:
                    merged_chunks.append(buffer)
                    buffer = type(chunk)(
                        page_content=chunk_text_with_marker,
                        metadata=dict(chunk.metadata or {})
                    )
     
        if buffer:
            merged_chunks.append(buffer)
     
        return merged_chunks
     
     
    # ==============================
    # HELPER FUNCTION: Hard Split Large Chunks
    # ==============================
    def hard_split_text(
        text,
        max_size=MAX_CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        preferred_seps=None,
        tolerance=0.25  # how far (25%) back from max_size we search for a good split
    ):
        """
        Split text into chunks close to max_size while trying to respect natural boundaries.
        """
        if max_size <= 0:
            raise ValueError("max_size must be > 0")
     
        if chunk_overlap < 0:
            chunk_overlap = 0
        if chunk_overlap >= max_size:
            chunk_overlap = max(1, max_size // 4)
        if preferred_seps is None:
            preferred_seps = ["\n\n", "\n", ". ", "! ", "? ", ", "]
        n = len(text)
        if n <= max_size:
            return [text]
        chunks = []
        i = 0
        while i < n:
            end = min(i + max_size, n)
     
            # Try to find a natural cut within a window around max_size
            search_start = max(i, end - int(max_size * tolerance))
            slice_ = text[search_start:end]
     
            cut = None
            for sep in preferred_seps:
                pos = slice_.rfind(sep)
                if pos != -1:
                    cut = search_start + pos + len(sep)
                    break
     
            if cut is None or cut <= i:
                cut = end  # fallback to hard cut
     
            chunks.append(text[i:cut])
     
            if cut >= n:
                break
     
            next_i = cut - chunk_overlap
            if next_i <= i:
                next_i = cut
            i = next_i
        return chunks
     
     
    def merge_large_chunks(chunks, max_chunk_size, chunk_overlap=CHUNK_OVERLAP):
        """Split any chunk exceeding max_chunk_size into <= max_chunk_size pieces."""
        final_chunks = []
        for chunk in chunks:
            text = chunk.page_content
            if len(text) > max_chunk_size:
                parts = hard_split_text(
                    text,
                    max_size=max_chunk_size,
                    chunk_overlap=chunk_overlap,
                )
                # Optionally annotate sub-parts for consistent metadata keys
                for j, part in enumerate(parts, start=1):
                    new_md = dict(chunk.metadata or {})
                    new_md["split_index"] = j
                    new_md["split_total"] = len(parts)
                    new_chunk = type(chunk)(page_content=part, metadata=new_md)
                    final_chunks.append(new_chunk)
            else:
                final_chunks.append(chunk)
        return final_chunks
     
     
    # ==============================
    # MAIN SCRIPT
    # ==============================
     
    # Step 1: Decode Base64 HTML
    decoded_bytes = base64.b64decode(item.RawData)
    html_data = decoded_bytes.decode("utf-8", errors="replace")
    item.Set("htmlData", html_data)
     
    # Step 2: Parse HTML
    soup = BeautifulSoup(html_data, "html.parser")
     
    # --- Focus only on main content ---
    main_content = soup.select_one(
        "main, #main-content, .ak-renderer-document, .wiki-content, article"
    )
    if main_content:
        soup = main_content
    elif soup.body:
        soup = soup.body
     
    # --- Remove UI noise ---
    noise_selectors = [
        "script", "style", "noscript", "iframe", "svg", "link", "meta",
        "footer", "header", "nav",
        "[role='banner']", "[role='navigation']", "[role='complementary']",
        ".ia-fixed-sidebar", ".ia-secondary-sidebar", ".ia-top-bar",
        ".sidebar", ".navigation", ".ak-side-navigation", ".ak-navigation",
        ".ak-main-navigation", ".ak-app-navigation", ".ak-top-nav",
        ".toolbar", ".menu-section", ".breadcrumbs",
        ".page-metadata", ".page-metadata-modified", ".page-metadata-container",
        ".content-by-label", ".ak-renderer-page-toolbar", ".ak-renderer-table-number-column",
        ".page-header", ".ak-renderer-header", ".ak-renderer-title",
        ".ak-renderer-root", ".ak-renderer-content-wrap", ".ak-renderer-panel",
        ".ak-renderer-metadata", ".ak-renderer-action-bar", ".ak-renderer-feedback",
        ".ak-renderer-topbar", ".ak-renderer-annotation", ".ak-renderer-comment"
    ]
    for tag in soup.select(",".join(noise_selectors)):
        tag.decompose()
     
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
       
    for img in soup.find_all("img", src=True):
        if img["src"].startswith("blob:"):
            img.decompose()
     
    # --- Extract full tables and replace with placeholders ---
    table_map = {}
    for idx, table in enumerate(soup.find_all("table")):
        # Only replace if the table is **not inside another table**
        if table.find_parent("table") is None:
            placeholder = f"<<TABLE_{idx}>>"
            table_map[placeholder] = str(table)
            table.replace_with(placeholder)
     
    cleaned_html = str(soup)
    item.Set("cleanedHtml", cleaned_html)
     
    # Step 3: Configure splitter
    splitter = HTMLSemanticPreservingSplitter(
        headers_to_split_on=[("h1", "Chapter 1"), ("h2", "Chapter 2"), ("h3", "Chapter 3")],
        max_chunk_size=MAX_CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators = ["\n\n", "\n", ". ", "! ", "? ", ", "],
        elements_to_preserve=["ul", "ol", "code", "pre","table"],
        preserve_links=True,
        preserve_images=True,
        normalize_text=False,
        stopword_removal=False,
        keep_separator="start"
    )
     
    # Step 4: Perform chunking
    chunks = splitter.split_text(cleaned_html)
    item.Set("debug_chunks_after_split", [c.page_content for c in chunks])
     
    # Step 5: Restore full tables into chunks
    # After chunking, replace each table placeholder with the original table HTML
    for chunk in chunks:
        content = chunk.page_content
        for placeholder, table_html in table_map.items():
            content = content.replace(placeholder, table_html)
        chunk.page_content = content
     
    # Step 6: Merge small chunks (alternative merge/split strategies are commented out below)
    chunks = merge_small_chunks(chunks, MIN_CHUNK_SIZE)
    #chunks = merge_large_chunks(chunks, MAX_CHUNK_SIZE, CHUNK_OVERLAP)
    #chunks = merge_and_split_chunks(chunks, MIN_CHUNK_SIZE, MAX_CHUNK_SIZE, CHUNK_OVERLAP)
    # Step 7: Store cleaned text
    cleaned_text = "\n\n".join([chunk.page_content for chunk in chunks])
    item.Set("filteredText", cleaned_text)
     
    # Step 8: Build final HTML
    html_chunks = [
        '<html><head><title>Chunked Output</title>'
        '<style>'
        'body{font-family:Arial,sans-serif;}'
        'div.chunk{margin-bottom:30px;}'
        'table{border-collapse:collapse;width:100%;}td,th{border:1px solid #ccc;padding:5px;}'
        '</style>'
        '</head><body>'
    ]
     
    for index, chunk in enumerate(chunks, start=1):
        section_name = get_metadata_key(chunk, index) or f"Section {index}"
        chunk_content = chunk.page_content
     
        # Save each chunk in the item
        item.Set(f'chunk_{index}', chunk_content)
     
        # Build HTML for this chunk
        html_chunks.append("<div class='chunk'>")
        html_chunks.append(f"<h3>{section_name}</h3>")
        html_chunks.append(chunk_content)
        html_chunks.append("</div>")
     
    html_chunks.append('</body></html>')
     
    # Step 9: Save HTML
    final_html = "\n".join(html_chunks)
    item.Set("chunkedHtmlFileContent", final_html)
     
    output_filename = "chunked_output_semantic.html"
    output_path = os.path.join(os.getcwd(), output_filename)
    Log.debug(f"Output path: {output_path}")
     
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(final_html)
  3. At the top of the script, you can edit the MAX_CHUNK_SIZE, CHUNK_OVERLAP, and MIN_CHUNK_SIZE settings to your specifications:

    • MAX_CHUNK_SIZE: Specifies the maximum number of characters included in each chunk.

    • CHUNK_OVERLAP: Specifies the number of characters that will carry over into subsequent chunks.

    • MIN_CHUNK_SIZE: Specifies the minimum number of characters included in each chunk.

Get an item property

To get an item property, use the generic Get method on the item object: propertyValue = item.Get("propertyName")

For example, to get the value of the body property:

body = item.Get("body")

Set an item property

To set an item property, use the generic Set method on the item object: item.Set("propertyName", propertyValue)

For example, to set the value of the author property:

item.Set("Author", "AI Agent")

Access raw data

Access the raw data field that may be sent to the service:

content = item.RawData
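
Depending on the content source, RawData may contain Base64-encoded content, as in the HTML splitting example above. The following minimal sketch decodes it as UTF-8 text; the decodedText property name is only an example:

import base64

# Decode the Base64 payload and interpret it as UTF-8 text
decoded_bytes = base64.b64decode(item.RawData)
text = decoded_bytes.decode("utf-8", errors="replace")
# Store the decoded text on the item ("decodedText" is an example property name)
item.Set("decodedText", text)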

Child item use cases

Add a child item

child = ProcessedChildItem(ItemProperties=[AbstractProperty(Name="New child", Value="propertydata")])
item.AddChild(child)

Print all child items

Log.debug("All Child Items:")
for child in item.ChildItems:
    Log.debug(f"Child UniqueId: {child.UniqueId}, Properties: {[f'{p.Name}={p.Value}' for p in child.ItemProperties]}")

Remove a child item

item.RemoveChild(child.UniqueId)

Retrieve metadata

Include body and outputs in JSON format

all_metadata: str = item.get_all_metadata(True, "json")
Log.debug(f"Body included and format is json: {all_metadata}")

Exclude body and outputs in JSON format

all_metadata_without_body: str = item.get_all_metadata(False, "json")
Log.debug(f"Body excluded and format is json: {all_metadata_without_body}")

Include body and outputs in XML format

all_metadata_as_xml: str = item.get_all_metadata(True, "xml")
Log.debug(f"Body included and format is xml: {all_metadata_as_xml}")

Exclude body and outputs in XML

all_metadata_without_body_as_xml: str = item.get_all_metadata(False, "xml")
Log.debug(f"Body excluded and format is xml: {all_metadata_without_body_as_xml}")

Default behavior for metadata retrieval

all_metadata_default: str = item.get_all_metadata(True)
Log.debug(f"Body included and default format is json: {all_metadata_default}")

Configure logging

You can configure logging for your Python script runner service.

Configure the logging settings

  1. In the AutoClassifier installation folder, navigate to AutoClassifier\Python Script Runner Service\app\logging_config.json

  2. Open the logging_config.json file in a text editor.

  3. Locate the log_level setting.

  4. Adjust the log level to your specifications. Supported values include DEBUG, INFO, WARNING, and ERROR.
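
For example, setting the log level to DEBUG could look like the following. The exact structure of your logging_config.json file may differ; change only the log_level value:

{
    "log_level": "DEBUG"
}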

View the log files

You can view the log files in the following locations:

  • The logs of the Python Script Runner REST API are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\pythonscriptrunnerrestapi.log

  • The logs of the Script execution are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\runner.log

If you make any changes to the logging configuration, you must restart the Windows service for the Python script runner for the changes to take effect.

Log usage

  • Log.debug("Debug message")

  • Log.info("Info message")

  • Log.warning("Warning message")

  • Log.error("Error message")

Using debug logging

If your Python script contains debug log statements (e.g., Log.debug("Debug message")), do the following:

  1. Set the log level to Debug in the logging_config.json file. By default, this file is located in the Upland BA Insight\AutoClassifier\Python Script Runner Service\app folder.
  2. Restart the Python Scripting Service for the log change to take effect.

If you do not make this change, debug log lines will not appear in the logs.