Using Python script capabilities
You can use the Python Script component to configure and apply a Python script of your choosing to your documents. This topic shows you how to configure the component and provides use cases that you can incorporate into your AutoClassifier instance.
Prerequisites
You must complete the following prerequisites before setting up the Python Script component:
- You must have Python 3.9.0 or later installed on your AutoClassifier server.
- You must have selected the Install AutoClassifier Python Script Runner Rest API option when you installed your AutoClassifier engine.
- You must avoid using scripts that contain infinite loops. These will cause the Python service to hang and waste resources.
- You must not run scripts containing malicious or harmful code.
Limitations
Note the following limitations:
- If you reset your Python environment and did not reinstall the necessary packages, your script will not run successfully. However, if your Python script does not contain any syntax errors, clicking the Compile button will not report this problem. As a result, Upland BA Insight highly recommends that you test your component configuration before adding your component and crawling your data with Connectivity Hub.
- If your Python environment is reset or a new pipeline is imported, you must recompile the Python script and reinstall the necessary packages. This applies to the following situations:
  - The existing Python environment is deleted from the Upland BA Insight\AutoClassifier\Python Script Runner folder.
  - A new AutoClassifier pipeline is imported that includes the Python Scripting Stage. In this case, do the following:
    - If you have an existing Python environment, delete it.
    - Recompile the Python script to ensure the latest version is validated and prepared.
    - Reinstall all of the required Python packages to rebuild a clean environment.
Configure the component
To configure your component, do the following:
- In the AutoClassifier administration portal, add a new component to a new or existing pipeline.
- When adding your component, select Python Script from the New Component list and provide a Name for your component.
- In the Python Version drop-down menu, select the version of Python you want to use for your script. If you have multiple Python versions installed on your server, they are all displayed in the list. You can click Reset if you wish to clear your Python version environment.
- In the Enter packages in Python requirements.txt format field, enter the packages that are necessary for your Python script to run.
- Click Install Packages to install the specified packages to your Python environment.
- In the Enter your Python script field, enter your Python script according to your specifications (a minimal example follows these steps).
- Click Compile. If there are any errors in your Python script syntax, they display below the Compile button. If there are no errors, a "Script was successfully compiled" message displays.
- Click Save.
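To verify the component end to end, you can start with a minimal script such as the following in the Enter your Python script field. It uses only the item.Get, item.Set, and Log.debug calls described later in this topic; body and bodyLength are placeholder property names you would adapt to your pipeline:
# Minimal sanity-check script: read one property, write another.
body = item.Get("body")
length = 0 if body is None else len(body)
Log.debug(f"Body length: {length}")
item.Set("bodyLength", str(length))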
Use cases and examples
Use LangChain text splitters
LangChain text splitters allow you to take a document and split it into chunks that can be used for retrieval.
- In the Enter packages in Python requirements.txt format field, enter the following packages:
langchain_text_splitters
bs4
langchain
- In the Enter your Python script field, enter the following script:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(item.Get('body'))
for index, chunk in enumerate(texts):
    item.Set(f'chunk_{index}', chunk)
- You can update the chunk_size and chunk_overlap settings to your specifications (a standalone sketch follows this list):
  - chunk_size: Specifies the maximum number of characters included in each chunk.
  - chunk_overlap: Specifies the number of characters that carry over into subsequent chunks.
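If you want to see how chunk_size and chunk_overlap interact before wiring the script into a pipeline, you can run a standalone sketch like the following in any Python environment with langchain_text_splitters installed (the sample text is arbitrary):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Arbitrary sample text, repeated so it exceeds chunk_size.
sample = "AutoClassifier enriches documents as they are crawled. " * 10

# With a 100-character overlap, the tail of each chunk is repeated at the
# head of the next one, which helps retrieval across chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=100)
for index, chunk in enumerate(splitter.split_text(sample)):
    print(f"chunk_{index} ({len(chunk)} chars): {chunk[:60]}...")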
Use LangChain for HTML splitting
LangChain text splitters also allow you to split an HTML page into chunks of a configured size that can be used for retrieval.
- In the Enter packages in Python requirements.txt format field, enter the following packages:
langchain_text_splitters
bs4
langchain
- In the Enter your Python script field, enter the following script:
import base64
import os

from bs4 import BeautifulSoup, Comment
from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

# ==============================
# CONFIG
# ==============================
MAX_CHUNK_SIZE = 10000
CHUNK_OVERLAP = 200
MIN_CHUNK_SIZE = 1000  # Minimum characters per chunk (adjust as needed)

# ==============================
# HELPER FUNCTION: Metadata Key
# ==============================
def get_metadata_key(chunk, idx=None):
    """Generate a consistent metadata key for a chunk."""
    if chunk.metadata:
        return " | ".join(f"{k}:{v}" for k, v in chunk.metadata.items() if v is not None)
    return f"unlabeled_{idx}" if idx is not None else "unlabeled"

def merge_and_split_chunks(chunks, min_size=MIN_CHUNK_SIZE, max_size=MAX_CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    """
    Merge small chunks and split large chunks, while preserving Document objects and metadata.
    Returns a list of Document objects.
    """
    # Step 1: Merge small chunks
    merged_chunks = []
    buffer = None
    separator = "<br><br><br>"
    for i, chunk in enumerate(chunks, start=1):
        metadata_key = get_metadata_key(chunk, i)
        chunk_text_with_marker = f"<strong>[{metadata_key}]\n{chunk.page_content}</strong>"
        if buffer is None:
            buffer = type(chunk)(
                page_content=chunk_text_with_marker,
                metadata=dict(chunk.metadata or {})
            )
        else:
            if len(buffer.page_content) < min_size:
                buffer.page_content += separator + chunk_text_with_marker
            else:
                merged_chunks.append(buffer)
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
    if buffer:
        merged_chunks.append(buffer)
    # Step 2: Split large chunks
    final_chunks = []
    for chunk in merged_chunks:
        text = chunk.page_content
        if len(text) > max_size:
            parts = hard_split_text(
                text,
                max_size=max_size,
                chunk_overlap=chunk_overlap
            )
            for j, part in enumerate(parts, start=1):
                new_md = dict(chunk.metadata or {})
                new_md["split_index"] = j
                new_md["split_total"] = len(parts)
                final_chunks.append(type(chunk)(page_content=part, metadata=new_md))
        else:
            final_chunks.append(chunk)
    return final_chunks

# ==============================
# HELPER FUNCTION: Merge Small Chunks
# ==============================
def merge_small_chunks(chunks, min_size=MIN_CHUNK_SIZE):
    """
    Merge small chunks with the next chunk until they meet min_size.
    Uses a plain metadata marker (no <strong> wrapper) and an HTML line-break separator.
    """
    merged_chunks = []
    buffer = None
    separator = "<br><br><br>"  # HTML line breaks between merged chunks
    for i, chunk in enumerate(chunks, start=1):
        metadata_key = get_metadata_key(chunk, i)
        chunk_text_with_marker = f"[{metadata_key}]\n{chunk.page_content}"
        if buffer is None:
            # Start a new buffer with a fresh chunk object
            buffer = type(chunk)(
                page_content=chunk_text_with_marker,
                metadata=dict(chunk.metadata or {})
            )
        else:
            if len(buffer.page_content) < min_size:
                # Append with separator
                buffer.page_content += separator + chunk_text_with_marker
            else:
                merged_chunks.append(buffer)
                buffer = type(chunk)(
                    page_content=chunk_text_with_marker,
                    metadata=dict(chunk.metadata or {})
                )
    if buffer:
        merged_chunks.append(buffer)
    return merged_chunks

# ==============================
# HELPER FUNCTION: Hard Split Large Chunks
# ==============================
def hard_split_text(
    text,
    max_size=MAX_CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    preferred_seps=None,
    tolerance=0.25  # how far (25%) we search back from max_size for a good split
):
    """
    Split text into chunks close to max_size while trying to respect natural boundaries.
    """
    if max_size <= 0:
        raise ValueError("max_size must be > 0")
    if chunk_overlap < 0:
        chunk_overlap = 0
    if chunk_overlap >= max_size:
        chunk_overlap = max(1, max_size // 4)
    if preferred_seps is None:
        preferred_seps = ["\n\n", "\n", ". ", "! ", "? ", ", "]
    n = len(text)
    if n <= max_size:
        return [text]
    chunks = []
    i = 0
    while i < n:
        end = min(i + max_size, n)
        # Try to find a natural cut within a window around max_size
        search_start = max(i, end - int(max_size * tolerance))
        slice_ = text[search_start:end]
        cut = None
        for sep in preferred_seps:
            pos = slice_.rfind(sep)
            if pos != -1:
                cut = search_start + pos + len(sep)
                break
        if cut is None or cut <= i:
            cut = end  # fallback to hard cut
        chunks.append(text[i:cut])
        if cut >= n:
            break
        next_i = cut - chunk_overlap
        if next_i <= i:
            next_i = cut
        i = next_i
    return chunks

def merge_large_chunks(chunks, max_chunk_size, chunk_overlap=CHUNK_OVERLAP):
    """Split any chunk exceeding max_chunk_size into <= max_chunk_size pieces."""
    final_chunks = []
    for chunk in chunks:
        text = chunk.page_content
        if len(text) > max_chunk_size:
            parts = hard_split_text(
                text,
                max_size=max_chunk_size,
                chunk_overlap=chunk_overlap,
            )
            # Optionally annotate sub-parts for consistent metadata keys
            for j, part in enumerate(parts, start=1):
                new_md = dict(chunk.metadata or {})
                new_md["split_index"] = j
                new_md["split_total"] = len(parts)
                new_chunk = type(chunk)(page_content=part, metadata=new_md)
                final_chunks.append(new_chunk)
        else:
            final_chunks.append(chunk)
    return final_chunks

# ==============================
# MAIN SCRIPT
# ==============================
# Step 1: Decode Base64 HTML
decoded_bytes = base64.b64decode(item.RawData)
html_data = decoded_bytes.decode("utf-8", errors="replace")
item.Set("htmlData", html_data)

# Step 2: Parse HTML
soup = BeautifulSoup(html_data, "html.parser")

# --- Focus only on main content ---
main_content = soup.select_one(
    "main, #main-content, .ak-renderer-document, .wiki-content, article"
)
if main_content:
    soup = main_content
elif soup.body:
    soup = soup.body

# --- Remove UI noise ---
noise_selectors = [
    "script", "style", "noscript", "iframe", "svg", "link", "meta",
    "footer", "header", "nav",
    "[role='banner']", "[role='navigation']", "[role='complementary']",
    ".ia-fixed-sidebar", ".ia-secondary-sidebar", ".ia-top-bar",
    ".sidebar", ".navigation", ".ak-side-navigation", ".ak-navigation",
    ".ak-main-navigation", ".ak-app-navigation", ".ak-top-nav",
    ".toolbar", ".menu-section", ".breadcrumbs",
    ".page-metadata", ".page-metadata-modified", ".page-metadata-container",
    ".content-by-label", ".ak-renderer-page-toolbar", ".ak-renderer-table-number-column",
    ".page-header", ".ak-renderer-header", ".ak-renderer-title",
    ".ak-renderer-root", ".ak-renderer-content-wrap", ".ak-renderer-panel",
    ".ak-renderer-metadata", ".ak-renderer-action-bar", ".ak-renderer-feedback",
    ".ak-renderer-topbar", ".ak-renderer-annotation", ".ak-renderer-comment"
]
for tag in soup.select(",".join(noise_selectors)):
    tag.decompose()

# Remove HTML comments
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
for img in soup.find_all("img", src=True):
    if img["src"].startswith("blob:"):
        img.decompose()

# --- Extract full tables and replace with placeholders ---
table_map = {}
for idx, table in enumerate(soup.find_all("table")):
    # Only replace if the table is not inside another table
    if table.find_parent("table") is None:
        placeholder = f"<<TABLE_{idx}>>"
        table_map[placeholder] = str(table)
        table.replace_with(placeholder)
cleaned_html = str(soup)
item.Set("cleanedHtml", cleaned_html)

# Step 3: Configure splitter
splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Chapter 1"), ("h2", "Chapter 2"), ("h3", "Chapter 3")],
    max_chunk_size=MAX_CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", "! ", "? ", ", "],
    elements_to_preserve=["ul", "ol", "code", "pre", "table"],
    preserve_links=True,
    preserve_images=True,
    normalize_text=False,
    stopword_removal=False,
    keep_separator="start"
)

# Step 4: Perform chunking
chunks = splitter.split_text(cleaned_html)
item.Set("debug_chunks_after_split", [c.page_content for c in chunks])

# Step 5: Restore full tables into chunks
# After chunking, replace each placeholder with its original table markup:
for chunk in chunks:
    content = chunk.page_content
    for placeholder, table_html in table_map.items():
        content = content.replace(placeholder, table_html)
    chunk.page_content = content

# Step 6: Merge small chunks (alternative strategies are commented out below)
chunks = merge_small_chunks(chunks, MIN_CHUNK_SIZE)
#chunks = merge_large_chunks(chunks, MAX_CHUNK_SIZE, CHUNK_OVERLAP)
#chunks = merge_and_split_chunks(chunks, MIN_CHUNK_SIZE, MAX_CHUNK_SIZE, CHUNK_OVERLAP)

# Step 7: Store cleaned text
cleaned_text = "\n\n".join([chunk.page_content for chunk in chunks])
item.Set("filteredText", cleaned_text)

# Step 8: Build final HTML
html_chunks = [
    '<html><head><title>Chunked Output</title>'
    '<style>'
    'body{font-family:Arial,sans-serif;}'
    'div.chunk{margin-bottom:30px;}'
    'table{border-collapse:collapse;width:100%;}td,th{border:1px solid #ccc;padding:5px;}'
    '</style>'
    '</head><body>'
]
for index, chunk in enumerate(chunks, start=1):
    section_name = get_metadata_key(chunk, index) or f"Section {index}"
    chunk_content = chunk.page_content
    # Save each chunk in the item
    item.Set(f'chunk_{index}', chunk_content)
    # Build HTML for this chunk
    html_chunks.append("<div class='chunk'>")
    html_chunks.append(f"<h3>{section_name}</h3>")
    html_chunks.append(chunk_content)
    html_chunks.append("</div>")
html_chunks.append('</body></html>')

# Step 9: Save HTML
final_html = "\n".join(html_chunks)
item.Set("chunkedHtmlFileContent", final_html)
output_filename = "chunked_output_semantic.html"
output_path = os.path.join(os.getcwd(), output_filename)
Log.debug(f"Output path: {output_path}")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(final_html)
- At the top of the script, you can edit the MAX_CHUNK_SIZE, CHUNK_OVERLAP, and MIN_CHUNK_SIZE settings to your specifications (a sanity-check sketch follows this list):
  - MAX_CHUNK_SIZE: Specifies the maximum number of characters included in each chunk.
  - CHUNK_OVERLAP: Specifies the number of characters that carry over into subsequent chunks.
  - MIN_CHUNK_SIZE: Specifies the minimum number of characters included in each chunk.
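After adjusting these constants, one way to check their effect is to log the size of each chunk at the end of the script. A minimal sketch using the Log.debug call described later in this topic:
# Optional sanity check: append at the end of the script to see whether
# MIN_CHUNK_SIZE and MAX_CHUNK_SIZE produced the chunk sizes you expect.
for index, chunk in enumerate(chunks, start=1):
    Log.debug(f"chunk_{index}: {len(chunk.page_content)} characters")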
Get an item property
To get an item property, use the generic Get method on the item object: propertyValue = item.Get("propertyName")
For example, to get the value of the body property:
body = item.Get("body")
Set an item property
To set an item property, use the generic Set method on the item object: item.Set("propertyName", propertyValue)
For example, to set the value of the author property:
item.Set("Author", "AI Agent")
Access raw data
Access the raw data field that may be sent to the service:
content = item.RawData
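As the HTML splitting example above shows, the raw data arrives Base64-encoded, so a common pattern is to decode it before use. A minimal sketch, assuming the payload is UTF-8 text (decodedContent is an illustrative property name):
import base64

# Decode the Base64 payload; errors="replace" tolerates malformed bytes.
decoded = base64.b64decode(item.RawData).decode("utf-8", errors="replace")
item.Set("decodedContent", decoded)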
Child item use cases
Add a child item
child = ProcessedChildItem(ItemProperties=[AbstractProperty(Name="New child", Value="propertydata")])
item.AddChild(child)
Print all child items
Log.debug("All Child Items:")
for child in item.ChildItems:
Log.debug(f"Child UniqueId: {child.UniqueId}, Properties: {[f'{p.Name}={p.Value}' for p in child.ItemProperties]}")
Remove a child item
item.RemoveChild(child.UniqueId)
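Putting these calls together, a script can fan a document out into one child item per chunk. This is a minimal sketch; the chunkText and chunkIndex property names are illustrative, not built-in fields:
# In practice, texts would come from a splitter such as the ones above.
texts = ["first chunk", "second chunk"]
for index, text in enumerate(texts):
    child = ProcessedChildItem(ItemProperties=[
        AbstractProperty(Name="chunkText", Value=text),
        AbstractProperty(Name="chunkIndex", Value=str(index)),
    ])
    item.AddChild(child)

# Confirm what was created.
for child in item.ChildItems:
    Log.debug(f"Created child: {child.UniqueId}")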
Retrieve metadata
Include body and outputs in JSON format
all_metadata: str = item.get_all_metadata(True, "json")
Log.debug(f"Body included and format is json: {all_metadata}")
Exclude body and outputs in JSON format
all_metadata_without_body: str = item.get_all_metadata(False, "json")
Log.debug(f"Body excluded and format is json: {all_metadata_without_body}")
Include body and outputs in XML format
all_metadata_as_xml: str = item.get_all_metadata(True, "xml")
Log.debug(f"Body included and format is xml: {all_metadata_as_xml}")
Exclude body and outputs in XML
all_metadata_without_body_as_xml: str = item.get_all_metadata(False, "xml")
Log.debug(f"Body excluded and format is xml: {all_metadata_without_body_as_xml}")
Default behavior for metadata retrieval
all_metadata_default: str = item.get_all_metadata(True)
Log.debug(f"Body included and default format is json: {all_metadata_default}")
Configure logging
You can configure logging for the Python Script Runner service.
Configure the logging settings
- In the AutoClassifier installation folder, navigate to AutoClassifier\Python Script Runner Service\app\logging_config.json.
- Open the logging_config.json file in a text editor.
- Locate the log_level setting.
- Adjust the log level to your specifications. Supported values include DEBUG, INFO, WARNING, and ERROR. A scripted alternative is sketched after these steps.
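As a sketch only, the same change can be made programmatically, assuming logging_config.json is a JSON object with a top-level log_level key (any other keys in the file are preserved):
import json

# Assumption: logging_config.json is a JSON object with a "log_level" key.
config_path = r"C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logging_config.json"
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)
config["log_level"] = "DEBUG"  # or INFO, WARNING, ERROR
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
Remember that the service must be restarted before the new level takes effect (see Using debug logging below).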
View the log files
You can view the log files in the following locations:
- The logs of the Python Script Runner REST API are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\pythonscriptrunnerrestapi.log
- The logs of the script execution are located in C:\Program Files\Upland BA Insight\AutoClassifier\Python Script Runner Service\app\logs\runner.log
Log usage
- Log.debug("Debug message")
- Log.info("Info message")
- Log.warning("Warning message")
- Log.error("Error message")
Using debug logging
If your Python script contains debug log statements (for example, Log.debug("Debug message")), do the following:
- Set the log level to DEBUG in the logging_config.json file. By default, this file is located in the Upland BA Insight\AutoClassifier\Python Script Runner Service\app folder.
- Restart the Python Script Runner service for the log change to take effect.
If you do not make this change, debug log lines will not appear in the logs.