Using Python script capabilities

The Python Scripting component lets you run a Python script of your choosing against your documents. This topic shows you how to configure the component and provides use cases that you can incorporate into your AutoClassifier instance.

Prerequisites

You must complete the following prerequisites before setting up the Python Script component:

  • You must have Python 3.9.0 or later installed on your AutoClassifier server.

  • You must have selected the Install AutoClassifier Python Script Runner REST API option when you installed your AutoClassifier engine.

  • You must avoid using scripts that contain infinite loops. These will cause the Python service to hang and waste resources.

  • You must not run scripts containing malicious or harmful code.

Limitations

Note the following limitations:

  • If you reset your Python environment and do not reinstall the necessary packages, your script will not run successfully. However, as long as your Python script contains no syntax errors, clicking the Compile button will not report this problem. As a result, Upland BA Insight highly recommends that you test your component configuration before adding your component and crawling your data with Connectivity Hub.

  • If your Python environment is reset or a new pipeline is imported, you must recompile the Python script and reinstall the necessary packages. This applies to the following situations:

    • The existing Python environment is deleted from the Upland BA Insight\AutoClassifier\Python Script Runner folder.

    • A new AutoClassifier pipeline is imported that includes the Python Scripting Stage. In this case, do the following:

      1. If you have an existing Python environment, you must delete it.

      2. Recompile the Python script to ensure the latest version is validated and prepared.

      3. Reinstall all of the required Python packages to rebuild a clean environment.

Configure the component

To configure your component, do the following:

  1. In the AutoClassifier administration portal, add a new component to a new or existing pipeline.

  2. When adding your component, select Python Script from the New Component list and provide a Name for your component.

  3. In the Python Version drop-down menu, select the version of Python you want to use for your script. If you have multiple Python versions installed on your server, they will all be displayed in the list. You can click Reset if you wish to clear your Python version environment.

  4. In the Enter packages in Python requirements.txt format field, enter the packages that are necessary for your Python script to run. For the expected format, see the example after these steps.

  5. Click Install Packages to install the specified packages to your Python environment.

  6. In the Enter your Python script field, enter your Python script according to your specifications.

  7. Click Compile. If there are any errors with your Python script syntax, they will display below the Compile button. If there are no errors, a "Script was successfully compiled" message displays.

  8. Click Save.
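
The Enter packages in Python requirements.txt format field (step 4) accepts standard requirements.txt syntax: one package per line, with an optional version pin. The package names and version below are only illustrative; list the packages your own script actually imports.

langchain_text_splitters
bs4
requests==2.31.0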

Use cases and examples

Use LangChain text splitters

LangChain text splitters allow you to take a document and split it into chunks that can be used for retrieval.

  1. In the Enter packages in Python requirements.txt format field, enter the following packages:

    langchain_text_splitters
    bs4
    langchain
  2. In the Enter your Python script field, enter the following script:

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    # Split the item's body into chunks of at most 1000 characters with no overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(item.Get('body'))
    # Store each chunk back on the item as chunk_0, chunk_1, ...
    for index, chunk in enumerate(texts):
        item.Set(f'chunk_{index}', chunk)
  3. You can update the chunk_size and chunk_overlap settings to your specifications (see the example after this list):

    • chunk_size: Specifies the number of characters included in each chunk.

    • chunk_overlap: Specifies the number of characters that will carry over into subsequent chunks.
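
For example, to produce chunks of at most 500 characters with a 50-character overlap between consecutive chunks, you could change the splitter line as follows (the values shown are illustrative):

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)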

Use LangChain for HTML splitting

LangChain text splitters allow you to split an HTML page into chunks of a configured size that can be used for retrieval.

  1. In the Enter packages in Python requirements.txt format field, enter the following packages:

    langchain_text_splitters
    bs4
    langchain
  2. In the Enter your Python script field, enter the following script:

    import base64
    import os
    from bs4 import BeautifulSoup, Comment
    from langchain_text_splitters.html import HTMLSemanticPreservingSplitter
     
    # ==============================
    # CONFIG
    # ==============================
    MAX_CHUNK_SIZE = 10000
    CHUNK_OVERLAP = 200
    MIN_CHUNK_SIZE = 1000  # Minimum characters per chunk (adjust as needed)
     
    # ==============================
    # HELPER FUNCTION: Metadata Key
    # ==============================
    def get_metadata_key(chunk, idx=None):
        """Generate a consistent metadata key for a chunk."""
        if chunk.metadata:
            return " | ".join(f"{k}:{v}" for k, v in chunk.metadata.items() if v is not None)
        return f"unlabeled_{idx}" if idx is not None else "unlabeled"
     
    def merge_and_split_chunks(chunks, min_size=MIN_CHUNK_SIZE, max_size=MAX_CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
        """
        Merge small chunks and split large chunks, while preserving Document objects and metadata.
        Returns a list of Document objects.
        """
        # Step 1: Merge small chunks
        merged_chunks = []
        buffer = None
        separator = "<br><br><br>"
     
        for i, chunk in enumerate(chunks, start=1):
            metadata_key = get_metadata_key(chunk, i)
            chunk_text_with_marker = f"<strong>[{metadata_key}]\n{chunk.page_content}</strong>"
     
            if buffer is None:
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
            else:
                if len(buffer.page_content) < min_size:
                    buffer.page_content += separator + chunk_text_with_marker
                else:
                    merged_chunks.append(buffer)
                    buffer = type(chunk)(
                        page_content=chunk_text_with_marker,
                        metadata=dict(chunk.metadata or {})
                    )
     
        if buffer:
            merged_chunks.append(buffer)
     
        # Step 2: Split large chunks
        final_chunks = []
        for chunk in merged_chunks:
            text = chunk.page_content
            if len(text) > max_size:
                parts = hard_split_text(
                    text,
                    max_size=max_size,
                    chunk_overlap=chunk_overlap
                )
                for j, part in enumerate(parts, start=1):
                    new_md = dict(chunk.metadata or {})
                    new_md["split_index"] = j
                    new_md["split_total"] = len(parts)
                    final_chunks.append(type(chunk)(page_content=part, metadata=new_md))
            else:
                final_chunks.append(chunk)
     
        return final_chunks
     
     
    # ==============================
    # HELPER FUNCTION: Merge Small Chunks
    # ==============================
    def merge_small_chunks(chunks, min_size=MIN_CHUNK_SIZE):
        """
        Merge small chunks with the next chunk until they meet min_size.
        Prefixes each chunk with its metadata marker and joins merged chunks with an HTML line-break separator.
        """
        merged_chunks = []
        buffer = None
        separator = "<br><br><br>"  # three newlines
     
        for i, chunk in enumerate(chunks, start=1):
            metadata_key = get_metadata_key(chunk, i)
            chunk_text_with_marker = f"[{metadata_key}]\n{chunk.page_content}"
     
            if buffer is None:
                # Start a new buffer with a fresh chunk object
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
            else:
                if len(buffer.page_content) < min_size:
                    # Append with separator
                    buffer.page_content += separator + chunk_text_with_marker
                else:
                    merged_chunks.append(buffer)
                    buffer = type(chunk)(
                        page_content=chunk_text_with_marker,
                        metadata=dict(chunk.metadata or {})
                    )
     
        if buffer:
            merged_chunks.append(buffer)
     
        return merged_chunks
     
     
    # ==============================
    # HELPER FUNCTION: Hard Split Large Chunks
    # ==============================
    def hard_split_text(
        text,
        max_size=MAX_CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        preferred_seps=None,
        tolerance=0.25  # how far (25%) back from max_size we search for a good split
    ):
        """
        Split text into chunks close to max_size while trying to respect natural boundaries.
        """
        if max_size <= 0:
            raise ValueError("max_size must be > 0")
     
        if chunk_overlap < 0:
            chunk_overlap = 0
        if chunk_overlap >= max_size:
            chunk_overlap = max(1, max_size // 4)
        if preferred_seps is None:
            preferred_seps = ["\n\n", "\n", ". ", "! ", "? ", ", "]
        n = len(text)
        if n <= max_size:
            return [text]
        chunks = []
        i = 0
        while i < n:
            end = min(i + max_size, n)
     
            # Try to find a natural cut within a window around max_size
            search_start = max(i, end - int(max_size * tolerance))
            slice_ = text[search_start:end]
     
            cut = None
            for sep in preferred_seps:
                pos = slice_.rfind(sep)
                if pos != -1:
                    cut = search_start + pos + len(sep)
                    break
     
            if cut is None or cut <= i:
                cut = end  # fallback to hard cut
     
            chunks.append(text[i:cut])
     
            if cut >= n:
                break
     
            next_i = cut - chunk_overlap
            if next_i <= i:
                next_i = cut
            i = next_i
        return chunks
     
     
    def merge_large_chunks(chunks, max_chunk_size, chunk_overlap=CHUNK_OVERLAP):
        """Split any chunk exceeding max_chunk_size into <= max_chunk_size pieces."""
        final_chunks = []
        for chunk in chunks:
            text = chunk.page_content
            if len(text) > max_chunk_size:
                parts = hard_split_text(
                    text,
                    max_size=max_chunk_size,
                    chunk_overlap=chunk_overlap,
                )
                # Optionally annotate sub-parts for consistent metadata keys
                for j, part in enumerate(parts, start=1):
                    new_md = dict(chunk.metadata or {})
                    new_md["split_index"] = j
                    new_md["split_total"] = len(parts)
                    new_chunk = type(chunk)(page_content=part, metadata=new_md)
                    final_chunks.append(new_chunk)
            else:
                final_chunks.append(chunk)
        return final_chunks
     
     
    # ==============================
    # MAIN SCRIPT
    # ==============================
     
    # Step 1: Decode Base64 HTML
    decoded_bytes = base64.b64decode(item.RawData)
    html_data = decoded_bytes.decode("utf-8", errors="replace")
    item.Set("htmlData", html_data)
     
    # Step 2: Parse HTML
    soup = BeautifulSoup(html_data, "html.parser")
     
    # --- Focus only on main content ---
    main_content = soup.select_one(
        "main, #main-content, .ak-renderer-document, .wiki-content, article"
    )
    if main_content:
        soup = main_content
    elif soup.body:
        soup = soup.body
     
    # --- Remove UI noise ---
    noise_selectors = [
        "script", "style", "noscript", "iframe", "svg", "link", "meta",
        "footer", "header", "nav",
        "[role='banner']", "[role='navigation']", "[role='complementary']",
        ".ia-fixed-sidebar", ".ia-secondary-sidebar", ".ia-top-bar",
        ".sidebar", ".navigation", ".ak-side-navigation", ".ak-navigation",
        ".ak-main-navigation", ".ak-app-navigation", ".ak-top-nav",
        ".toolbar", ".menu-section", ".breadcrumbs",
        ".page-metadata", ".page-metadata-modified", ".page-metadata-container",
        ".content-by-label", ".ak-renderer-page-toolbar", ".ak-renderer-table-number-column",
        ".page-header", ".ak-renderer-header", ".ak-renderer-title",
        ".ak-renderer-root", ".ak-renderer-content-wrap", ".ak-renderer-panel",
        ".ak-renderer-metadata", ".ak-renderer-action-bar", ".ak-renderer-feedback",
        ".ak-renderer-topbar", ".ak-renderer-annotation", ".ak-renderer-comment"
    ]
    for tag in soup.select(",".join(noise_selectors)):
        tag.decompose()
     
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
       
    for img in soup.find_all("img", src=True):
        if img["src"].startswith("blob:"):
            img.decompose()
     
    # --- Extract full tables and replace with placeholders ---
    table_map = {}
    for idx, table in enumerate(soup.find_all("table")):
        # Only replace if the table is **not inside another table**
        if table.find_parent("table") is None:
            placeholder = f"<<TABLE_{idx}>>"
            table_map[placeholder] = str(table)
            table.replace_with(placeholder)
     
    cleaned_html = str(soup)
    item.Set("cleanedHtml", cleaned_html)
     
    # Step 3: Configure splitter
    splitter = HTMLSemanticPreservingSplitter(
        headers_to_split_on=[("h1", "Chapter 1"), ("h2", "Chapter 2"), ("h3", "Chapter 3")],
        max_chunk_size=MAX_CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators = ["\n\n", "\n", ". ", "! ", "? ", ", "],
        elements_to_preserve=["ul", "ol", "code", "pre","table"],
        preserve_links=True,
        preserve_images=True,
        normalize_text=False,
        stopword_removal=False,
        keep_separator="start"
    )
     
    # Step 4: Perform chunking
    chunks = splitter.split_text(cleaned_html)
    item.Set("debug_chunks_after_split", [c.page_content for c in chunks])
     
    # Step 5: Restore full tables into chunks
    # After chunking, replace each table placeholder with the original table HTML
    for chunk in chunks:
        content = chunk.page_content
        for placeholder, table_html in table_map.items():
            content = content.replace(placeholder, table_html)
        chunk.page_content = content
     
    # Step 6: Merge small chunks (alternative merge/split strategies are commented out below)
    chunks = merge_small_chunks(chunks, MIN_CHUNK_SIZE)
    #chunks = merge_large_chunks(chunks, MAX_CHUNK_SIZE, CHUNK_OVERLAP)
    #chunks = merge_and_split_chunks(chunks, MIN_CHUNK_SIZE, MAX_CHUNK_SIZE, CHUNK_OVERLAP)
    # Step 7: Store cleaned text
    cleaned_text = "\n\n".join([chunk.page_content for chunk in chunks])
    item.Set("filteredText", cleaned_text)
     
    # Step 8: Build final HTML
    html_chunks = [
        '<html><head><title>Chunked Output</title>'
        '<style>'
        'body{font-family:Arial,sans-serif;}'
        'div.chunk{margin-bottom:30px;}'
        'table{border-collapse:collapse;width:100%;}td,th{border:1px solid #ccc;padding:5px;}'
        '</style>'
        '</head><body>'
    ]
     
    for index, chunk in enumerate(chunks, start=1):
        section_name = get_metadata_key(chunk, index) or f"Section {index}"
        chunk_content = chunk.page_content
     
        # Save each chunk in the item
        item.Set(f'chunk_{index}', chunk_content)
     
        # Build HTML for this chunk
        html_chunks.append("<div class='chunk'>")
        html_chunks.append(f"<h3>{section_name}</h3>")
        html_chunks.append(chunk_content)
        html_chunks.append("</div>")
     
    html_chunks.append('</body></html>')
     
    # Step 9: Save HTML
    final_html = "\n".join(html_chunks)
    item.Set("chunkedHtmlFileContent", final_html)
     
    output_filename = "chunked_output_semantic.html"
    output_path = os.path.join(os.getcwd(), output_filename)
    Log.debug(f"Output path: {output_path}")
     
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(final_html)
  3. At the top of the script, you can edit the MAX_CHUNK_SIZE, CHUNK_OVERLAP, and MIN_CHUNK_SIZE settings to your specifications:

    • MAX_CHUNK_SIZE: Specifies the maximum number of characters included in each chunk.

    • CHUNK_OVERLAP: Specifies the number of characters that will carry over into subsequent chunks.

    • MIN_CHUNK_SIZE: Specifies the minimum number of characters included in each chunk.

Get an item property

To get an item property, use the generic Get method on the item object: propertyValue = item.Get("propertyName")

For example, to get the value of the body property:

body = item.Get("body")

Set an item property

To set an item property, use the generic Set method on the item object: item.Set("propertyName", propertyValue)

For example, to set the value of the author property:

item.Set("Author", "AI Agent")

Access raw data

Access the raw data field that may be sent to the service:

content = item.RawData
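
Depending on the content source, RawData may contain Base64-encoded content, as in the HTML splitting example above. The following minimal sketch decodes it as UTF-8 text; the decodedText property name is only an example:

import base64

# Decode the Base64 payload and interpret it as UTF-8 text
decoded_bytes = base64.b64decode(item.RawData)
text = decoded_bytes.decode("utf-8", errors="replace")
# Store the decoded text on the item ("decodedText" is an example property name)
item.Set("decodedText", text)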

Child item use cases

Add a child item

child = ProcessedChildItem(ItemProperties=[AbstractProperty(Name="New child", Value="propertydata")])
item.AddChild(child)

Print all child items

Log.debug("All Child Items:")
for child in item.ChildItems:
    Log.debug(f"Child UniqueId: {child.UniqueId}, Properties: {[f'{p.Name}={p.Value}' for p in child.ItemProperties]}")

Remove a child item

item.RemoveChild(child.UniqueId)

Retrieve metadata

Include body and outputs in JSON format

all_metadata: str = item.get_all_metadata(True, "json")
Log.debug(f"Body included and format is json: {all_metadata}")

Exclude body and outputs in JSON format

all_metadata_without_body: str = item.get_all_metadata(False, "json")
Log.debug(f"Body excluded and format is json: {all_metadata_without_body}")

Include body and outputs in XML format

all_metadata_as_xml: str = item.get_all_metadata(True, "xml")
Log.debug(f"Body included and format is xml: {all_metadata_as_xml}")

Exclude body and outputs in XML

all_metadata_without_body_as_xml: str = item.get_all_metadata(False, "xml")
Log.debug(f"Body excluded and format is xml: {all_metadata_without_body_as_xml}")

Default behavior for metadata retrieval

all_metadata_default: str = item.get_all_metadata(True)
Log.debug(f"Body included and default format is json: {all_metadata_default}")

Configure logging

You can configure logging for your Python script runner service.

Configure the logging settings

  1. In the AutoClassifier installation folder, navigate to AutoClassifier\Python Script Runner Service\app\logging_config.json

  2. Open the logging_config.json file in a text editor.

  3. Locate the log_level setting.

  4. Adjust the log level to your specifications. Supported values include DEBUG, INFO, WARNING, and ERROR.
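
For example, setting the log level to DEBUG could look like the following. The exact structure of your logging_config.json file may differ; change only the log_level value:

{
    "log_level": "DEBUG"
}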

View the log files

You can view the log files in the following locations:

  • The logs of the Python Script Runner REST API are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\pythonscriptrunnerrestapi.log

  • The logs of the Script execution are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\runner.log

If you make any changes to the logging configuration, you must restart the Windows service for the Python script runner for the changes to take effect.

Log usage

  • Log.debug("Debug message")

  • Log.info("Info message")

  • Log.warning("Warning message")

  • Log.error("Error message")

Using debug logging

If your Python script contains debug log statements (e.g., Log.debug("Debug message")), do the following:

  1. Set the log level to Debug in the logging_config.json file. By default, this file is located in the Upland BA Insight\AutoClassifier\Python Script Runner Service\app folder.
  2. Restart the Python Scripting Service for the log change to take effect.

If you do not make this change, debug log lines will not appear in the logs.