GSP-Py: A Python-powered library to mine sequential patterns in large datasets, based on the Generalized Sequential Pattern (GSP) algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
Important
GSP-Py is compatible with Python 3.11 and later versions!
- 🔍 What is GSP?
- 🔧 Requirements
- 🚀 Installation
- 🛠️ Developer Installation
- 📖 Documentation
- 💡 Usage
- ⌨️ Typing
- 🌟 Planned Features
- 🤝 Contributing
- 📝 License
- 📖 Citation
The Generalized Sequential Pattern (GSP) algorithm is a sequential pattern mining technique based on Apriori principles. Using support thresholds, GSP identifies frequent sequences of items in transaction datasets.
- Ordered (non-contiguous) matching: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern ('A', 'C') is found in the sequence ['A', 'B', 'C'].
- Support-based pruning: Only retains sequences that meet the minimum support threshold.
- Candidate generation: Iteratively generates candidate sequences of increasing length.
- Temporal constraints: Support for time-constrained pattern mining with mingap, maxgap, and maxspan parameters to find patterns within specific time windows.
- General-purpose: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
For example:
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next" - even if other items appear between bread and milk.
- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.
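To make the matching and pruning ideas above concrete, here is a minimal pure-Python sketch of level-wise mining with ordered, non-contiguous matching. It is an illustration only, not GSP-Py's implementation: the helper names (is_subsequence, count_support, mine) are hypothetical, candidates are generated naively by appending each frequent item to each frequent pattern, and a pattern is assumed to be kept when its count is at least min_support times the number of transactions.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items can only match further to the right.
    return all(item in remaining for item in pattern)


def count_support(pattern, transactions):
    """Number of transactions that contain `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, t) for t in transactions)


def mine(transactions, min_support):
    """Naive level-wise miner (hypothetical helper, for illustration only)."""
    min_count = min_support * len(transactions)  # assumed threshold rule
    items = sorted({item for t in transactions for item in t})

    # Level 1: frequent single items.
    level = {(i,): count_support((i,), transactions) for i in items}
    level = {p: c for p, c in level.items() if c >= min_count}
    frequent_items = [p[0] for p in level]

    levels = []
    while level:
        levels.append(level)
        # Naive candidate generation: extend each frequent pattern by one item.
        candidates = [p + (i,) for p in level for i in frequent_items]
        counted = {c: count_support(c, transactions) for c in candidates}
        # Support-based pruning: drop candidates below the threshold.
        level = {p: c for p, c in counted.items() if c >= min_count}
    return levels


transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
levels = mine(transactions, 0.3)
for k, patterns in enumerate(levels, start=1):
    print(k, patterns)
```

On these five shopping transactions the sketch yields three levels (5 single items, 8 pairs, 4 triples). A real GSP implementation generates far fewer candidates by joining frequent patterns instead of extending them blindly.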
You will need Python installed on your system. On most Linux systems, you can install Python with:
sudo apt install python3
GSP-Py's package dependencies are installed automatically when you install it with pip.
GSP-Py can be easily installed from either the repository or PyPI.
To manually clone the repository and set up the environment:
git clone https://github.com/jacksonpradolima/gsp-py.git
cd gsp-py
Refer to the Developer Installation section and run the setup with uv.
Alternatively, install GSP-Py from PyPI with:
pip install gsppy
This project now uses uv for dependency management and virtual environments.
curl -Ls https://astral.sh/uv/install.sh | bash
Make sure uv is on your PATH (for most Linux setups):
export PATH="$HOME/.local/bin:$PATH"
Create a local virtual environment and install dependencies from uv.lock (single source of truth):
uv venv .venv
uv sync --frozen --extra dev # uses uv.lock
uv pip install -e .
Rust acceleration is optional and provides faster support counting using a PyO3 extension. Python fallback remains available.
Build the extension locally:
make rust-build
Select backend at runtime (auto tries Rust, then falls back to Python):
export GSPPY_BACKEND=rust # or python, or unset for auto
Run benchmarks (adjust to your machine):
make bench-small
make bench-big # may use significant memory/CPU
# or customize:
GSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \
python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup
GPU acceleration is experimental and currently optimizes singleton (k=1) support counting using CuPy. Non-singleton candidates fall back to the Rust/Python backend.
Install the optional extra (choose a CuPy build that matches your CUDA/ROCm setup if needed):
uv run pip install -e .[gpu]
Select the GPU backend at runtime:
export GSPPY_BACKEND=gpu
If a GPU isn't available, an error will be raised when GSPPY_BACKEND=gpu is set. Otherwise, the default "auto" uses CPU.
After the environment is ready, activate it and run tasks with standard tools:
source .venv/bin/activate
pytest -n auto
ruff check .
pyright
If you prefer, you can also prefix commands with uv run without activating:
uv run pytest -n auto
uv run ruff check .
uv run pyright
You can use the Makefile to automate common tasks:
make setup # create .venv with uv and pin Python
make install # sync deps (from uv.lock) + install project (-e .)
make test # pytest -n auto
make lint # ruff check .
make format # ruff --fix
make typecheck # pyright + ty
make pre-commit-install # install the pre-commit hook
make pre-commit-run # run pre-commit on all files
# Rust-specific shortcuts
make rust-setup # install rustup toolchain
make rust-build # build PyO3 extension with maturin
make bench-small # run small benchmark
make bench-big # run large benchmark
Note
Tox in this project uses the "tox-uv" plugin. When running make tox or tox, missing Python interpreters can be provisioned automatically via uv (no need to pre-install all versions). This makes local setup faster.
Every GitHub release bundles artifacts to help you validate what you download:
- Built wheels and source distributions produced by the automated publish workflow.
- sbom.json (CycloneDX) generated with Syft.
- Sigstore-generated .sig and .pem files for each artifact, created using GitHub OIDC identity.
To verify a downloaded artifact from a release:
python -m pip install sigstore # installs the CLI
sigstore verify identity \
--certificate gsppy-<version>-py3-none-any.whl.pem \
--signature gsppy-<version>-py3-none-any.whl.sig \
--cert-identity "https://github.com/jacksonpradolima/gsp-py/.github/workflows/publish.yml@refs/tags/v<version>" \
--cert-oidc-issuer https://token.actions.githubusercontent.com \
gsppy-<version>-py3-none-any.whl
Replace <version> with the numeric package version (for example, 3.1.1) in the filenames; in --cert-identity, this becomes v<version> (for example, v3.1.1). Adjust the filenames for the sdist (.tar.gz) if preferred. The same release page also hosts sbom.json for supply-chain inspection.
- Live site: https://jacksonpradolima.github.io/gsp-py/
- Build locally:
uv venv .venv
uv sync --extra docs
uv run mkdocs serve
The docs use MkDocs with the Material theme and mkdocstrings to render the Python API directly from docstrings.
The library is designed to be easy to use and integrate with your own projects. You can use GSP-Py either programmatically (Python API) or directly from the command line (CLI).
GSP-Py provides a command-line interface (CLI) for running the Generalized Sequential Pattern algorithm on transactional data. This allows you to mine frequent sequential patterns from JSON or CSV files without writing any code.
First, install GSP-Py (if not already installed):
pip install gsppy
This will make the gsppy CLI command available in your environment.
Your input file should be one of:
- JSON: A list of transactions, where each transaction is a list of items. Example:
[
  ["Bread", "Milk"],
  ["Bread", "Diaper", "Beer", "Eggs"],
  ["Milk", "Diaper", "Beer", "Coke"],
  ["Bread", "Milk", "Diaper", "Beer"],
  ["Bread", "Milk", "Diaper", "Coke"]
]
- CSV: Each row is a transaction, with items separated by commas. Example:
Bread,Milk
Bread,Diaper,Beer,Eggs
Milk,Diaper,Beer,Coke
Bread,Milk,Diaper,Beer
Bread,Milk,Diaper,Coke
- SPM/GSP Format: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
  - -1: Marks the end of an element (itemset)
  - -2: Marks the end of a sequence (transaction)
Example:
1 2 -1 3 -1 -2
4 -1 5 6 -1 -2
1 -1 2 3 -1 -2
The above represents:
  - Transaction 1: [[1, 2], [3]] → flattened to [1, 2, 3]
  - Transaction 2: [[4], [5, 6]] → flattened to [4, 5, 6]
  - Transaction 3: [[1], [2, 3]] → flattened to [1, 2, 3]
String tokens are also supported:
A B -1 C -1 -2
D -1 E F -1 -2
- Parquet/Arrow Files: Modern columnar data formats (requires 'gsppy[dataframe]')
pip install 'gsppy[dataframe]'
This installs optional dependencies: polars, pandas, and pyarrow for DataFrame support.
Use the following command to run GSP-Py on your data:
gsppy --file path/to/transactions.json --min_support 0.3 --backend auto
Or for CSV files:
gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
For SPM/GSP format files, use the --format spm option:
gsppy --file path/to/data.txt --format spm --min_support 0.3
- --file: Path to your input file (JSON, CSV, SPM, Parquet, or Arrow). Required.
- --format: File format to use for input. Options: auto (default, auto-detect from extension), json, csv, spm, parquet, arrow.
- --min_support: Minimum support threshold as a fraction (e.g., 0.3 for 30%). Default is 0.2.
- --backend: Backend to use for support counting. One of auto (default), python, rust, or gpu.
- --output: Path to save mining results to a file. If not specified, results are printed to the console.
- --output-format: Output format for mining results. Options: auto (default, detect from extension), parquet, arrow, csv, json. Requires --output to be specified.
- --verbose: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
- --mingap, --maxgap, --maxspan: Temporal constraints for time-aware pattern mining (requires timestamped transactions).
For debugging or to track execution in CI/CD pipelines, use the --verbose flag:
gsppy --file transactions.json --min_support 0.3 --verbose
This produces structured logging output with timestamps, log levels, and process information:
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Pre-processing transactions...
YYYY-MM-DDTHH:MM:SS | DEBUG | PID:4179 | gsppy.gsp | Unique candidates: [('Bread',), ('Milk',), ...]
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Starting GSP algorithm with min_support=0.3...
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Run 1: 6 candidates filtered to 5.
...
For complete logging documentation, see docs/logging.md.
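If you want your own pipeline logs to line up with this layout, a formatter along the following lines should produce similar records. This is a sketch using Python's standard logging module; the logger name my_pipeline.mining is hypothetical, and the format string is inferred from the sample output above rather than taken from GSP-Py's source.

```python
import io
import logging

# Format inferred from the sample output: timestamp | level | PID | logger | message.
formatter = logging.Formatter(
    fmt="%(asctime)s | %(levelname)s | PID:%(process)d | %(name)s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)

# Log into an in-memory buffer so the result can be inspected;
# use logging.StreamHandler() without arguments to write to stderr instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(formatter)

logger = logging.getLogger("my_pipeline.mining")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate records via the root logger

logger.info("Pre-processing transactions...")
line = stream.getvalue().strip()
print(line)
```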
Suppose you have a file transactions.json as shown above. To find patterns with at least 30% support:
gsppy --file transactions.json --min_support 0.3
Sample output:
Pre-processing transactions...
Starting GSP algorithm with min_support=0.3...
Run 1: 6 candidates filtered to 5.
Run 2: 20 candidates filtered to 3.
Run 3: 2 candidates filtered to 2.
Run 4: 1 candidates filtered to 0.
GSP algorithm completed.
Frequent Patterns Found:
1-Sequence Patterns:
Pattern: ('Bread',), Support: 4
Pattern: ('Milk',), Support: 4
Pattern: ('Diaper',), Support: 4
Pattern: ('Beer',), Support: 3
Pattern: ('Coke',), Support: 2
2-Sequence Patterns:
Pattern: ('Bread', 'Milk'), Support: 3
Pattern: ('Milk', 'Diaper'), Support: 3
Pattern: ('Diaper', 'Beer'), Support: 3
3-Sequence Patterns:
Pattern: ('Bread', 'Milk', 'Diaper'), Support: 2
Pattern: ('Milk', 'Diaper', 'Beer'), Support: 2
GSP-Py supports exporting mining results to various formats for further analysis or integration with data pipelines:
Export to Parquet (efficient columnar format for large datasets):
gsppy --file transactions.json --min_support 0.3 --output results.parquet
Export to CSV:
gsppy --file transactions.json --min_support 0.3 --output results.csv
Export to JSON:
gsppy --file transactions.json --min_support 0.3 --output results.json
Specify format explicitly:
gsppy --file transactions.json --min_support 0.3 --output results.data --output-format parquet
The exported files contain three columns:
- pattern: The sequential pattern (e.g., ('Bread', 'Milk'))
- support: Number of transactions containing the pattern
- level: Length of the pattern sequence
Export formats are particularly useful for:
- Parquet/Arrow: Integration with big data tools (Spark, Polars, Pandas), data lakes, and cloud analytics
- CSV: Easy viewing in spreadsheets and compatibility with traditional tools
- JSON: Structured data for web applications and APIs
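For readers post-processing results in their own code, the three-column layout described above can be reproduced from the list-of-dicts result with the standard library alone. The sketch below is an illustration, not the library's exporter: to_rows is a hypothetical helper, the sample data is a shortened excerpt, and rendering the pattern column with str() is an assumption about how a tuple might be serialized.

```python
import csv
import io


def to_rows(levels):
    """Flatten [{pattern: support}, ...] into pattern/support/level rows."""
    rows = []
    for level, patterns in enumerate(levels, start=1):
        for pattern, support in patterns.items():
            rows.append({"pattern": str(pattern), "support": support, "level": level})
    return rows


# Shortened example result: one dict per pattern length.
levels = [
    {("Bread",): 4, ("Milk",): 4},
    {("Bread", "Milk"): 3},
]
rows = to_rows(levels)

# Write to an in-memory buffer; pass an open file object to write to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pattern", "support", "level"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```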
- If the file does not exist or is in an unsupported format, a clear error message will be shown.
- The min_support value must be between 0.0 and 1.0 (exclusive of 0.0, inclusive of 1.0).
To see detailed logs for debugging, add the --verbose flag:
gsppy --file transactions.json --min_support 0.3 --verbose
The following example shows how to use GSP-Py programmatically in Python:
The input to the algorithm is a sequence of transactions, where each transaction contains a sequence of items:
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]
Import the GSP class from the gsppy package and call the search method to find frequent patterns with a support threshold (e.g., 0.3):
from gsppy.gsp import GSP
# Example transactions: customer purchases
transactions = [
['Bread', 'Milk'], # Transaction 1
['Bread', 'Diaper', 'Beer', 'Eggs'], # Transaction 2
['Milk', 'Diaper', 'Beer', 'Coke'], # Transaction 3
['Bread', 'Milk', 'Diaper', 'Beer'], # Transaction 4
['Bread', 'Milk', 'Diaper', 'Coke'] # Transaction 5
]
# Set minimum support threshold (30%)
min_support = 0.3
# Find frequent patterns
result = GSP(transactions).search(min_support)
# Output the results
print(result)
Enable detailed logging to track algorithm progress and debug issues:
from gsppy.gsp import GSP
# Enable verbose logging for the entire instance
gsp = GSP(transactions, verbose=True)
result = gsp.search(min_support=0.3)
# Or enable verbose for a specific search only
gsp = GSP(transactions)
result = gsp.search(min_support=0.3, verbose=True)
Verbose mode provides:
- Detailed progress information during execution
- Candidate generation and filtering statistics
- Preprocessing and validation details
- Useful for debugging, research, and CI/CD integration
For complete documentation on logging, see docs/logging.md.
GSP-Py 4.0+ introduces a Sequence abstraction class that provides a richer, more maintainable way to work with sequential patterns. The Sequence class encapsulates pattern items, support counts, and optional metadata in an immutable, hashable object.
from gsppy import GSP
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke']
]
gsp = GSP(transactions)
result = gsp.search(min_support=0.3)
# Returns one dict per pattern length, e.g. [{('Bread',): 2, ('Milk',): 2, ...}, {...}, ...]
for level_patterns in result:
    for pattern, support in level_patterns.items():
        print(f"Pattern: {pattern}, Support: {support}")
With return_sequences=True, search returns Sequence objects instead of dictionaries:
from gsppy import GSP
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke']
]
gsp = GSP(transactions)
result = gsp.search(min_support=0.3, return_sequences=True)
# Returns one list of Sequence objects per pattern length, e.g. [[Sequence(('Bread',), support=2), ...], ...]
for level_patterns in result:
    for seq in level_patterns:
        print(f"Pattern: {seq.items}, Support: {seq.support}, Length: {seq.length}")
        # Access sequence properties
        print(f" First item: {seq.first_item}, Last item: {seq.last_item}")
        # Check if item is in sequence
        if "Milk" in seq:
            print(" Contains Milk!")
- Rich API: Access pattern properties like length, first_item, last_item
- Type Safety: IDE autocomplete and better type hints
- Immutable & Hashable: Can be used as dictionary keys
- Extensible: Add metadata for confidence, lift, or custom properties
- Backward Compatible: Convert to/from dict format as needed
from gsppy import Sequence, sequences_to_dict, dict_to_sequences
# Create custom sequences
seq = Sequence.from_tuple(("A", "B", "C"), support=5)
# Extend sequences
extended = seq.extend("D") # Creates Sequence(("A", "B", "C", "D"))
# Add metadata
seq_with_meta = seq.with_metadata(confidence=0.85, lift=1.5)
# Convert between formats for compatibility
seq_result = gsp.search(min_support=0.3, return_sequences=True)
dict_format = sequences_to_dict(seq_result[0]) # Convert to dict
For a complete example, see examples/sequence_example.py.
GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
- -1 to mark the end of an element (itemset)
- -2 to mark the end of a sequence (transaction)
from gsppy.utils import read_transactions_from_spm
from gsppy import GSP
# Load SPM format file
transactions = read_transactions_from_spm('data.txt')
# Run GSP algorithm
gsp = GSP(transactions)
result = gsp.search(min_support=0.3)
Simple sequence file (data.txt):
1 2 -1 3 -1 -2
4 -1 5 6 -1 -2
1 -1 2 3 -1 -2
This represents:
- Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
- Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
- Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
String tokens are also supported:
A B -1 C -1 -2
D -1 E F -1 -2
For workflows requiring conversion between string tokens and integer IDs, use the TokenMapper:
from gsppy.utils import read_transactions_from_spm
from gsppy import TokenMapper
# Load with mappings
transactions, str_to_int, int_to_str = read_transactions_from_spm(
'data.txt',
return_mappings=True
)
print("String to Int:", str_to_int)
# Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
print("Int to String:", int_to_str)
# Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
# Use the TokenMapper class directly
mapper = TokenMapper()
id_a = mapper.add_token("A")
id_b = mapper.add_token("B")
print(f"A -> {id_a}, B -> {id_b}")
# Output: A -> 0, B -> 1
The SPM loader gracefully handles:
- Empty lines (skipped)
- Missing -2 delimiter at end of line
- Extra or consecutive delimiters
- Mixed-length elements in sequences
- Both integer and string tokens
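To show what these delimiter rules amount to, here is a small stand-alone parser sketch. It is not the library's loader: parse_spm is a hypothetical helper, it assumes one sequence per line, and it keeps every token as a string.

```python
def parse_spm(text, preserve_itemsets=False):
    """Parse SPM-style text (hypothetical helper, one sequence per line)."""
    sequences = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        elements, current = [], []
        for token in tokens:
            if token == "-2":  # end of sequence
                break
            if token == "-1":  # end of element; ignore empty elements
                if current:
                    elements.append(current)
                    current = []
            else:
                current.append(token)
        if current:  # element left open because -1/-2 was missing
            elements.append(current)
        if not elements:
            continue
        if preserve_itemsets:
            sequences.append(elements)
        else:
            sequences.append([item for element in elements for item in element])
    return sequences


# Includes an empty line and a last line without the closing -2.
text = "1 2 -1 3 -1 -2\n\n4 -1 5 6 -1 -2\n1 -1 2 3 -1"
flat = parse_spm(text)
nested = parse_spm(text, preserve_itemsets=True)
print(flat)
print(nested)
```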
The algorithm will return a list of patterns with their corresponding support.
Sample Output:
[
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
{('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
{('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
]
- The first dictionary contains single-item sequences with their frequencies (e.g., ('Bread',): 4 means "Bread" appears in 4 transactions).
- The second dictionary contains 2-item sequential patterns (e.g., ('Bread', 'Milk'): 3 means the sequence "Bread → Milk" appears in 3 transactions). Note that patterns like ('Bread', 'Beer') are detected even when the items are not adjacent in a transaction; they only need to appear in order.
- The third dictionary contains 3-item sequential patterns (e.g., ('Bread', 'Milk', 'Diaper'): 2 means the sequence "Bread → Milk → Diaper" appears in 2 transactions).
Note
The support of a sequence is the fraction of transactions containing the sequence in order (not necessarily contiguously). The result dictionaries report the absolute count; dividing it by the number of transactions gives the fraction, e.g.,
('Bread', 'Milk') appears in 3 out of 5 transactions → Support = 3 / 5 = 0.6 (60%).
This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
behavior.
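As a small worked example of the arithmetic in the note above, the counts from the sample output can be turned into fractions as follows. The variable names are illustrative, and the dictionary is an excerpt of the sample output rather than a full result.

```python
# Excerpt of the 2-item patterns from the sample output (absolute counts).
counts = {
    ("Bread", "Milk"): 3,
    ("Bread", "Beer"): 2,
    ("Diaper", "Beer"): 3,
}
n_transactions = 5

# Relative support = count / number of transactions.
relative = {pattern: count / n_transactions for pattern, count in counts.items()}

for pattern, fraction in relative.items():
    print(f"{pattern}: {fraction:.0%}")
```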
Important
Non-contiguous (ordered) matching: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern ('Bread', 'Beer') matches the transaction ['Bread', 'Milk', 'Diaper', 'Beer'] because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
GSP-Py follows the standard GSP algorithm semantics by detecting ordered (non-contiguous) subsequences. This means:
- ✅ Order matters: Items must appear in the specified sequence order
- ✅ Gaps allowed: Items don't need to be adjacent
- ❌ Wrong order rejected: Items appearing in different order won't match
Example:
from gsppy.gsp import GSP
sequences = [
['a', 'b', 'c'], # Contains: (a,b), (a,c), (b,c), (a,b,c)
['a', 'c'], # Contains: (a,c)
['b', 'c', 'a'], # Contains: (b,c), (b,a), (c,a)
['a', 'b', 'c', 'd'], # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
]
gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
# Pattern ('a', 'c') is found with support=3 because:
# - It appears in ['a', 'b', 'c'] (with 'b' in between)
# - It appears in ['a', 'c'] (adjacent)
# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
# Total: 3 out of 4 sequences = 75% support ✅
Tip
For more complex examples, find example scripts in the gsppy/tests folder.
GSP-Py supports Polars and Pandas DataFrames as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.
Install GSP-Py with DataFrame support:
pip install 'gsppy[dataframe]'
This installs the optional dependencies: polars, pandas, and pyarrow.
GSP-Py supports two DataFrame formats:
Grouped format: use when your data has separate rows for each item in a transaction:
import polars as pl
from gsppy import GSP
# Polars DataFrame with transaction_id and item columns
df = pl.DataFrame({
"transaction_id": [1, 1, 2, 2, 2, 3, 3],
"item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
})
# Run GSP directly on the DataFrame
gsp = GSP(df, transaction_col="transaction_id", item_col="item")
patterns = gsp.search(min_support=0.3)
for level, freq_patterns in enumerate(patterns, start=1):
    print(f"\n{level}-Sequence Patterns:")
    for pattern, support in freq_patterns.items():
        print(f" {pattern}: {support}")
Sequence format: use when each row contains a complete transaction as a list:
import pandas as pd
from gsppy import GSP
# Pandas DataFrame with sequences as lists
df = pd.DataFrame({
"transaction": [
["Bread", "Milk"],
["Bread", "Diaper", "Beer"],
["Milk", "Coke"],
]
})
gsp = GSP(df, sequence_col="transaction")
patterns = gsp.search(min_support=0.3)
DataFrames support temporal constraints for time-aware pattern mining:
import polars as pl
from gsppy import GSP
# Grouped format with timestamps
df = pl.DataFrame({
"transaction_id": [1, 1, 1, 2, 2, 2],
"item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
"timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
})
# Find patterns where consecutive events occur within 10 seconds
gsp = GSP(
df,
transaction_col="transaction_id",
item_col="item",
timestamp_col="timestamp",
maxgap=10
)
patterns = gsp.search(min_support=0.5)
For sequence format with timestamps:
import pandas as pd
from gsppy import GSP
df = pd.DataFrame({
"sequence": [["A", "B", "C"], ["A", "D"]],
"timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
})
gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
patterns = gsp.search(min_support=0.5)
DataFrames enable seamless integration with columnar storage formats:
import polars as pl
from gsppy import GSP
# Read directly from Parquet
df = pl.read_parquet("transactions.parquet")
# Run GSP with automatic schema detection
gsp = GSP(df, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
# Or use Pandas with Arrow backend
import pandas as pd
df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
DataFrames offer performance benefits for large datasets:
- Polars: Leverages Arrow for zero-copy operations and parallel processing
- Pandas: Compatible with Arrow backend for efficient memory usage
- Parquet/Arrow: Columnar storage enables efficient filtering and reading
- Schema validation: Errors are caught early with clear messages
Grouped Format:
- transaction_col: Column containing transaction/sequence IDs (any type)
- item_col: Column containing items (any type, converted to strings)
- timestamp_col (optional): Column containing timestamps (numeric)
Sequence Format:
- sequence_col: Column containing lists of items
- timestamp_col (optional): Column containing lists of timestamps (must match sequence lengths)
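Conceptually, the grouped format is the sequence format with one row per item. The sketch below shows that relationship in plain Python, without Polars or Pandas. It is only an illustration: group_rows is a hypothetical helper, and ordering the items of each transaction by timestamp is an assumption, not a statement about how GSP-Py's adapters order rows.

```python
from collections import defaultdict


def group_rows(rows):
    """Turn (transaction_id, item, timestamp) rows into timestamped sequences."""
    grouped = defaultdict(list)
    for txn_id, item, timestamp in rows:
        grouped[txn_id].append((timestamp, item))
    # One sequence per transaction id, items ordered by timestamp (assumed).
    return [
        [(item, timestamp) for timestamp, item in sorted(events)]
        for _, events in sorted(grouped.items())
    ]


rows = [
    (1, "Login", 0),
    (1, "Purchase", 5),
    (1, "Browse", 2),
    (2, "Login", 0),
    (2, "Browse", 1),
]
sequences = group_rows(rows)
print(sequences)
```

The output has the same (item, timestamp) shape that the temporal-constraint examples in this README pass to GSP as list input.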
GSP-Py provides clear error messages for schema issues:
import polars as pl
from gsppy import GSP
df = pl.DataFrame({
"txn_id": [1, 2],
"product": ["A", "B"],
})
# ❌ Missing required column
try:
    gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
except ValueError as e:
    print(f"Error: {e}") # "Column 'item' not found in DataFrame"
# ❌ Invalid format specification
try:
    gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
except ValueError as e:
    print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
Traditional list-based input continues to work:
from gsppy import GSP
# Lists still work as before
transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
DataFrame parameters cannot be mixed with list input:
transactions = [["A", "B"], ["C", "D"]]
# ❌ This raises an error
gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
For complete examples and edge cases, see:
- tests/test_dataframe.py - Comprehensive test suite
- DataFrame adapter documentation in gsppy/dataframe_adapters.py
GSP-Py supports itemsets within sequence elements, enabling you to capture co-occurrence of multiple items at the same time step. This is crucial for applications where items occur together rather than in strict sequential order.
- Flat sequences: ['A', 'B', 'C'] - each item occurs at a separate time step
- Itemset sequences: [['A', 'B'], ['C']] - items A and B occur together at the first time step, then C occurs later
Itemsets are essential when temporal co-occurrence matters in your domain:
- Market basket analysis: Customers buy multiple items in a single shopping trip, then return for more items later
- Web analytics: Users open multiple pages in parallel tabs before moving to the next set of pages
- Event logs: Multiple events can occur simultaneously in complex systems
- Purchase patterns: Items bought together vs. items bought in sequence
from gsppy import GSP
# Itemset format: nested lists where inner lists are items that occur together
transactions = [
[['Bread', 'Milk'], ['Eggs']], # Bought Bread & Milk together, then Eggs later
[['Bread', 'Milk', 'Butter']], # Bought all three items together
[['Bread', 'Milk'], ['Eggs']], # Same pattern as customer 1
]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
# Pattern ('Bread',) will match any itemset containing Bread
# Pattern ('Bread', 'Eggs') will match sequences where Bread appears before Eggs
# (even if they're in different itemsets)
GSP-Py automatically normalizes flat sequences to itemsets internally, ensuring full backward compatibility:
from gsppy import GSP
# These are equivalent after normalization:
flat_transactions = [['A', 'B', 'C']] # Flat format
itemset_transactions = [[['A'], ['B'], ['C']]] # Equivalent itemset format
# Both produce the same results
gsp1 = GSP(flat_transactions)
gsp2 = GSP(itemset_transactions)
# Patterns are identical
patterns1 = gsp1.search(min_support=0.5)
patterns2 = gsp2.search(min_support=0.5)
Pattern matching with itemsets uses subset semantics:
- A pattern element matches a sequence element if all items in the pattern element are present in the sequence element
- Example: Pattern [['A', 'B']] matches sequence element ['A', 'B', 'C'] because {A, B} ⊆ {A, B, C}
- Pattern elements must appear in order across the sequence
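The subset rule above can be written down in a few lines. The following is a sketch of the matching idea only, not GSP-Py's internal matcher: matches is a hypothetical helper, and patterns are given here as lists of elements so that multi-item elements can be expressed.

```python
def matches(pattern, sequence):
    """True if each pattern element is a subset of a later and later sequence element."""
    position = 0
    for element in pattern:
        needed = set(element)
        # Advance to the first remaining sequence element that contains `needed`.
        while position < len(sequence) and not needed <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match strictly later
    return True


sequence = [["A", "B", "D"], ["E"], ["C", "F"]]

print(matches([["A", "B"]], sequence))    # A and B occur together in the first element
print(matches([["A"], ["C"]], sequence))  # A, then C in a later element
print(matches([["C"], ["A"]], sequence))  # wrong order
print(matches([["A", "C"]], sequence))    # A and C never occur in the same element
```

Taking the earliest possible match for each element is sufficient here because no timing constraints are involved; with mingap or maxgap a later match may be needed, which calls for a search instead.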
from gsppy import GSP
transactions = [
[['A', 'B', 'D'], ['E'], ['C', 'F']], # A,B,D together, then E, then C,F together
]
gsp = GSP(transactions)
# Pattern ('A', 'C') will match because:
# - 'A' is in first itemset ['A', 'B', 'D'] ✓
# - 'C' appears later in third itemset ['C', 'F'] ✓
# - Order is preserved ✓
The SPM/GSP format supports itemsets using delimiters:
- -1: End of itemset
- -2: End of sequence
from gsppy.utils import read_transactions_from_spm
# SPM file content:
# 1 2 -1 3 -1 -2
# 1 -1 3 4 -1 -2
# Read with itemsets preserved
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=True)
# Result: [[['1', '2'], ['3']], [['1'], ['3', '4']]]
# Read with itemsets flattened (backward compatible)
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=False)
# Result: [['1', '2', '3'], ['1', '3', '4']]
Itemsets work seamlessly with temporal constraints:
from gsppy import GSP
# Itemsets with timestamps: [(item, timestamp), ...]
transactions = [
[[('Login', 0), ('Home', 0)], [('Product', 5)], [('Checkout', 10)]],
[[('Login', 0)], [('Home', 2), ('Product', 2)], [('Checkout', 15)]],
]
# Find patterns where events in the same itemset occur together
# and subsequent itemsets occur within maxgap time units
gsp = GSP(transactions, maxgap=10)
patterns = gsp.search(min_support=0.5)
See examples/itemset_example.py for comprehensive examples including:
- Market basket analysis with itemsets
- Web clickstream with parallel page views
- Comparison of flat vs. itemset semantics
- Reading and processing SPM format files
✓ Itemsets capture co-occurrence of items at the same time step
✓ Flat sequences are automatically normalized to itemsets internally
✓ Both formats work seamlessly with GSP-Py
✓ Use itemsets when temporal co-occurrence matters in your domain
✓ SPM format supports both flat and itemset representations
GSP-Py supports time-constrained sequential pattern mining with three powerful temporal constraints: mingap, maxgap, and maxspan. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
- mingap: Minimum time gap required between consecutive items in a pattern
- maxgap: Maximum time gap allowed between consecutive items in a pattern
- maxspan: Maximum time span from the first to the last item in a pattern
To use temporal constraints, your transactions must include timestamps as (item, timestamp) tuples:
from gsppy.gsp import GSP
# Transactions with timestamps (e.g., in seconds, hours, days, etc.)
timestamped_transactions = [
[("Login", 0), ("Browse", 2), ("AddToCart", 5), ("Purchase", 7)],
[("Login", 0), ("Browse", 1), ("AddToCart", 15), ("Purchase", 20)],
[("Login", 0), ("Browse", 3), ("AddToCart", 6), ("Purchase", 8)],
]
# Find patterns where consecutive events occur within 10 time units
gsp = GSP(timestamped_transactions, maxgap=10)
patterns = gsp.search(min_support=0.6)
# The pattern ("Browse", "AddToCart", "Purchase") will:
# - Be found in transaction 1: gaps are 3 and 2 (both ≤ 10) ✅
# - NOT be found in transaction 2: gap between Browse→AddToCart is 14 (exceeds maxgap) ❌
# - Be found in transaction 3: gaps are 3 and 2 (both ≤ 10) ✅
# Result: Support = 2/3 = 67% (above threshold of 60%)
The same constraints are available from the command line:
# Find patterns with maximum gap of 5 time units
gsppy --file temporal_data.json --min_support 0.3 --maxgap 5
# Find patterns with minimum gap of 2 time units
gsppy --file temporal_data.json --min_support 0.3 --mingap 2
# Find patterns that complete within 10 time units
gsppy --file temporal_data.json --min_support 0.3 --maxspan 10
# Combine multiple constraints
gsppy --file temporal_data.json --min_support 0.3 --mingap 1 --maxgap 5 --maxspan 10

from gsppy.gsp import GSP
# Medical events with timestamps in days
medical_sequences = [
[("Symptom", 0), ("Diagnosis", 2), ("Treatment", 5), ("Recovery", 15)],
[("Symptom", 0), ("Diagnosis", 1), ("Treatment", 20), ("Recovery", 30)],
[("Symptom", 0), ("Diagnosis", 3), ("Treatment", 6), ("Recovery", 18)],
]
# Find patterns where treatment follows diagnosis within 10 days
gsp = GSP(medical_sequences, maxgap=10)
result = gsp.search(min_support=0.5)
# Pattern ("Diagnosis", "Treatment") found in sequences 1 & 3 only
# (sequence 2 has gap of 19 days, exceeding maxgap)

from gsppy.gsp import GSP
# Customer purchases with timestamps in hours
purchase_sequences = [
[("Browse", 0), ("AddToCart", 0.5), ("Purchase", 1)],
[("Browse", 0), ("AddToCart", 1), ("Purchase", 25)], # Long delay
[("Browse", 0), ("AddToCart", 0.3), ("Purchase", 0.8)],
]
# Find purchase journeys that complete within 2 hours
gsp = GSP(purchase_sequences, maxspan=2)
result = gsp.search(min_support=0.5)
# Full sequence found in 2 out of 3 transactions
# (sequence 2 has span of 25 hours, exceeding maxspan)

from gsppy.gsp import GSP
# Website navigation with timestamps in seconds
navigation_sequences = [
[("Home", 0), ("Search", 5), ("Product", 10), ("Checkout", 15)],
[("Home", 0), ("Search", 3), ("Product", 8), ("Checkout", 180)],
[("Home", 0), ("Search", 4), ("Product", 9), ("Checkout", 14)],
]
# Find navigation patterns with:
# - Minimum 2 seconds between steps (mingap)
# - Maximum 20 seconds between steps (maxgap)
# - Complete within 30 seconds total (maxspan)
gsp = GSP(navigation_sequences, mingap=2, maxgap=20, maxspan=30)
result = gsp.search(min_support=0.5)
- Temporal constraints require timestamped transactions (item-timestamp tuples)
- If temporal constraints are specified but transactions don't have timestamps, a warning is logged and constraints are ignored
- When using temporal constraints, the Python backend is automatically used (accelerated backends don't yet support temporal constraints)
- Timestamps can be in any unit (seconds, minutes, hours, days) as long as they're consistent within your dataset
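The three constraints can be illustrated with a small standalone sketch. It does not use gsppy and is not the library's implementation. It assumes that a gap is the timestamp difference between consecutive matched items, that the span is the difference between the last and first matched items, and that all bounds are inclusive. The names `occurs_within` and `support_count` are hypothetical helpers introduced only for this illustration.

```python
from typing import Optional, Sequence, Tuple

Event = Tuple[str, float]  # (item, timestamp)


def occurs_within(
    sequence: Sequence[Event],
    pattern: Sequence[str],
    mingap: Optional[float] = None,
    maxgap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Return True if `pattern` occurs in order in `sequence` under the constraints."""

    def extend(pat_idx: int, start: int, first_ts, prev_ts) -> bool:
        if pat_idx == len(pattern):
            return True  # every pattern item has been matched
        for pos in range(start, len(sequence)):
            item, ts = sequence[pos]
            if item != pattern[pat_idx]:
                continue
            if prev_ts is not None:
                gap = ts - prev_ts
                if mingap is not None and gap < mingap:
                    continue
                if maxgap is not None and gap > maxgap:
                    continue
                if maxspan is not None and ts - first_ts > maxspan:
                    continue
            anchor = ts if first_ts is None else first_ts
            # Backtrack over later occurrences if this choice leads nowhere.
            if extend(pat_idx + 1, pos + 1, anchor, ts):
                return True
        return False

    return extend(0, 0, None, None)


def support_count(sequences, pattern, **constraints) -> int:
    """Count the sequences that contain `pattern` under the given constraints."""
    return sum(occurs_within(seq, pattern, **constraints) for seq in sequences)


medical_sequences = [
    [("Symptom", 0), ("Diagnosis", 2), ("Treatment", 5), ("Recovery", 15)],
    [("Symptom", 0), ("Diagnosis", 1), ("Treatment", 20), ("Recovery", 30)],
    [("Symptom", 0), ("Diagnosis", 3), ("Treatment", 6), ("Recovery", 18)],
]
count = support_count(medical_sequences, ("Diagnosis", "Treatment"), maxgap=10)
print(f"support = {count}/{len(medical_sequences)}")
```

With the medical data from the example above, only the second sequence is rejected, because its Diagnosis to Treatment gap is 19 days.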
GSP-Py supports flexible candidate pruning strategies that allow you to customize how candidate sequences are filtered during pattern mining. This enables optimization for different dataset characteristics and mining requirements.
The standard GSP pruning based on minimum support threshold:
from gsppy.gsp import GSP
from gsppy.pruning import SupportBasedPruning
# Explicit support-based pruning
pruner = SupportBasedPruning(min_support_fraction=0.3)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
Prunes candidates based on absolute frequency (minimum number of occurrences):
from gsppy.gsp import GSP
from gsppy.pruning import FrequencyBasedPruning
# Require patterns to appear at least 5 times
pruner = FrequencyBasedPruning(min_frequency=5)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.2)
Use case: When you need patterns to occur a minimum absolute number of times, regardless of dataset size.
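The difference between an absolute frequency and a relative support threshold can be shown with a little arithmetic. The sketch below is standalone and does not call gsppy. The function `equivalent_support_fraction` is a hypothetical helper that converts an absolute count into the fraction of the dataset it represents.

```python
def equivalent_support_fraction(min_frequency: int, n_transactions: int) -> float:
    """Fraction of the dataset that an absolute occurrence count corresponds to."""
    return min_frequency / n_transactions


# The same absolute threshold is lenient on a large dataset and strict on a small one.
for n_transactions in (10, 1000):
    fraction = equivalent_support_fraction(5, n_transactions)
    print(f"{n_transactions} transactions: 5 occurrences = {fraction:.1%} support")
```

A fixed count of 5 corresponds to half of a 10-transaction dataset but only 0.5% of a 1000-transaction dataset, which is why the two thresholds behave differently as data grows.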
Optimizes pruning for time-constrained pattern mining by pre-filtering infeasible patterns:
from gsppy.gsp import GSP
from gsppy.pruning import TemporalAwarePruning
# Prune patterns that cannot satisfy temporal constraints
pruner = TemporalAwarePruning(
mingap=1,
maxgap=5,
maxspan=10,
min_support_fraction=0.3
)
gsp = GSP(timestamped_transactions, mingap=1, maxgap=5, maxspan=10, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
Use case: Improves performance for temporal pattern mining by eliminating patterns that cannot satisfy temporal constraints.
Combines multiple pruning strategies for aggressive filtering:
from gsppy.gsp import GSP
from gsppy.pruning import CombinedPruning, SupportBasedPruning, FrequencyBasedPruning
# Apply both support and frequency constraints
strategies = [
SupportBasedPruning(min_support_fraction=0.3),
FrequencyBasedPruning(min_frequency=5)
]
pruner = CombinedPruning(strategies)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
Use case: When you want to combine multiple filtering criteria for more selective pattern discovery.
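One plausible way to read "combined" is that a candidate is pruned as soon as any individual rule rejects it. The standalone sketch below illustrates that reading with plain functions. It is an assumption about the semantics, not a description of how CombinedPruning is implemented, and `support_rule`, `frequency_rule`, and `combine` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple

# A rule receives (candidate, support_count, total_transactions)
# and returns True when the candidate should be pruned.
PruneRule = Callable[[Tuple[str, ...], int, int], bool]


def support_rule(min_fraction: float) -> PruneRule:
    def should_prune(candidate, support_count, total_transactions):
        return support_count / total_transactions < min_fraction

    return should_prune


def frequency_rule(min_frequency: int) -> PruneRule:
    def should_prune(candidate, support_count, total_transactions):
        return support_count < min_frequency

    return should_prune


def combine(rules: Sequence[PruneRule]) -> PruneRule:
    # Assumed semantics: prune if ANY rule votes to prune.
    def should_prune(candidate, support_count, total_transactions):
        return any(rule(candidate, support_count, total_transactions) for rule in rules)

    return should_prune


combined = combine([support_rule(0.3), frequency_rule(5)])
# 4 of 10 transactions: passes the 30% support rule but fails the count of 5.
print(combined(("A", "B"), 4, 10))
```

Under this reading, adding rules can only shrink the set of surviving candidates, which matches the "more selective" use case described above.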
You can create custom pruning strategies by implementing the PruningStrategy interface:
from gsppy.gsp import GSP
from gsppy.pruning import PruningStrategy
from typing import Dict, Optional, Tuple
class MyCustomPruner(PruningStrategy):
    def should_prune(
        self,
        candidate: Tuple[str, ...],
        support_count: int,
        total_transactions: int,
        context: Optional[Dict] = None
    ) -> bool:
        # Custom pruning logic
        # Return True to prune (filter out), False to keep
        pattern_length = len(candidate)
        # Example: Prune very long patterns with low support
        if pattern_length > 5 and support_count < 10:
            return True
        return False
# Use your custom pruner
custom_pruner = MyCustomPruner()
gsp = GSP(transactions, pruning_strategy=custom_pruner)
result = gsp.search(min_support=0.2)
Different pruning strategies have different performance tradeoffs:
| Strategy | Pruning Aggressiveness | Use Case | Performance Impact |
|---|---|---|---|
| SupportBased | Moderate | General-purpose mining | Baseline performance |
| FrequencyBased | High (for large datasets) | Require absolute frequency | Faster on large datasets |
| TemporalAware | High (for temporal data) | Time-constrained patterns | Significant speedup for temporal mining |
| Combined | Very High | Selective pattern discovery | Fastest, but may miss edge cases |
To compare pruning strategies on your dataset:
# Compare all strategies
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all
# Benchmark a specific strategy
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy frequency
# Run multiple rounds for averaging
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all --rounds 3
See benchmarks/bench_pruning.py for the complete benchmarking script.
gsppy ships inline type information (PEP 561) via a bundled py.typed marker. The public API is re-exported from
gsppy directly: import GSP for programmatic use, or reuse the CLI helpers (detect_and_read_file,
read_transactions_from_json, read_transactions_from_csv, and setup_logging) when embedding the tool in
larger applications.
We are actively working to improve GSP-Py. Here are some exciting features planned for future releases:
- Support for Preprocessing and Postprocessing:
- Add hooks to allow users to transform datasets before mining and customize the output results.
Want to contribute or suggest an improvement? Open a discussion or issue!
We welcome contributions from the community! If you'd like to help improve GSP-Py, read our CONTRIBUTING.md guide to get started.
Development dependencies (e.g., testing and linting tools) are handled via uv. To set up and run the main tasks:
uv venv .venv
uv sync --frozen --extra dev
uv pip install -e .
# Run tasks
uv run pytest -n auto
uv run ruff check .
uv run pyright
GSP-Py includes comprehensive test coverage, including property-based fuzzing tests using Hypothesis. These fuzzing tests automatically generate random inputs to verify algorithm invariants and discover edge cases. Run the fuzzing tests with:
uv run pytest tests/test_gsp_fuzzing.py -v
- Fork the repository.
- Create a feature branch: git checkout -b feature/my-feature.
- Commit your changes using Conventional Commits format: git commit -m "feat: add my feature".
- Push to your branch: git push origin feature/my-feature.
- Submit a pull request to the main repository!
Looking for ideas? Check out our Planned Features section.
GSP-Py uses automated release management with Conventional Commits. When commits are merged to main:
- Releases are triggered by: fix: (patch), feat: (minor), perf: (patch), or BREAKING CHANGE: (major)
- No release for: docs:, style:, refactor:, test:, build:, ci:, chore:
- CHANGELOG.md is automatically updated with structured release notes
- Git tags and GitHub releases are created automatically
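The mapping from commit type to release level listed above can be sketched as a small function. This is only an illustration of the listed rules, not the project's actual release tooling. The name `release_bump` is hypothetical, and the regular expression assumes the common `type(scope): description` header shape.

```python
import re
from typing import Optional

# Commit types that trigger a release, as listed above.
_BUMP_BY_TYPE = {"fix": "patch", "perf": "patch", "feat": "minor"}

# Matches headers such as "feat: ..." or "fix(cli): ...".
_HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?: ")


def release_bump(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None when no release is triggered."""
    if "BREAKING CHANGE:" in message:
        return "major"
    match = _HEADER.match(message)
    if match is None:
        return None
    return _BUMP_BY_TYPE.get(match.group("type"))


for msg in ("feat: add my feature", "fix(cli): handle empty file", "docs: update readme"):
    print(f"{msg!r} -> {release_bump(msg)}")
```

Types such as docs: or chore: fall through to None, which corresponds to the "No release" row in the list.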
See Release Management Guide for details on commit message format and release process.
This project is licensed under the terms of the MIT License. For more details, refer to the LICENSE file.
If GSP-Py contributed to your research or project that led to a publication, we kindly ask that you cite it as follows:
@misc{pradolima_gsppy,
author = {Prado Lima, Jackson Antonio do},
title = {{GSP-Py - Generalized Sequence Pattern algorithm in Python}},
month = Dec,
year = 2025,
doi = {10.5281/zenodo.3333987},
url = {https://doi.org/10.5281/zenodo.3333987}
}