PD.Cipher Tokenizer SDK

Welcome to the PD.Cipher Tokenizer SDK documentation. The SDK is a high-performance Rust tokenization library with Python bindings, designed for secure, encrypted machine learning workflows.

Overview

The PD.Cipher Tokenizer SDK is built on a composable pipeline architecture that lets you create custom tokenization pipelines for a variety of data types, including:

  • Text data with multiple languages and scripts
  • Binary data and byte sequences
  • Structured data (JSON, XML)
  • Multimodal inputs (vision, audio)

Key Features

  • High Performance: Built in Rust with parallel processing capabilities
  • Composable Architecture: Mix and match components to create custom pipelines
  • Security First: Designed for encrypted ML workflows with deterministic output
  • Python Bindings: Easy integration with Python ML frameworks
  • Hugging Face Compatible: Works seamlessly with the Hugging Face ecosystem

Architecture

The SDK follows a pipeline architecture inspired by Hugging Face tokenizers:

Input Data → [DataSource] → [Normalizer] → [PreTokenizer] → [Model] → [PostProcessor] → Encoding

Components

  1. DataSource: Handles input ingestion from various sources
  2. Normalizer: Performs text normalization (Unicode normalization, lowercasing, accent removal)
  3. PreTokenizer: Splits input into initial chunks (whitespace, punctuation)
  4. Model: Core tokenization algorithms (BPE, WordPiece, Unigram)
  5. PostProcessor: Adds special tokens, handles padding and truncation
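
For example, a text pipeline is assembled by wiring these components together in the order shown in the diagram above. The following is an illustrative sketch only: the builder-style API and the names CipherTokenizer, Normalizer::nfc, PreTokenizer::whitespace, Model::bpe, and PostProcessor::template are assumptions, not documented signatures; see the examples/ directory for the actual interface.

    // Illustrative sketch of composing a text tokenization pipeline.
    // NOTE: all type and method names below are assumed, not confirmed by the SDK docs.
    use cipher_tokenizer::{CipherTokenizer, Model, Normalizer, PostProcessor, PreTokenizer};

    fn build_text_pipeline() -> CipherTokenizer {
        CipherTokenizer::builder()
            // Normalizer: Unicode normalization, lowercasing, accent stripping.
            .normalizer(Normalizer::nfc().lowercase().strip_accents())
            // PreTokenizer: split on whitespace and punctuation.
            .pre_tokenizer(PreTokenizer::whitespace())
            // Model: core algorithm (BPE here; WordPiece and Unigram are alternatives).
            .model(Model::bpe("vocab.json", "merges.txt"))
            // PostProcessor: add special tokens, handle padding and truncation.
            .post_processor(PostProcessor::template("[CLS] $A [SEP]"))
            .build()
    }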

Getting Started

To get started with the PD.Cipher Tokenizer SDK:

  1. Install the SDK using cargo add cipher-tokenizer
  2. Check out the examples in the examples/ directory
  3. Read the API documentation below
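
Once installed, a typical first program builds a tokenizer and encodes a short string. The sketch below assumes a CipherTokenizer type with default() and encode() methods and an Encoding with tokens() and ids() accessors; these names are illustrative guesses, so check the examples/ directory and the API documentation for the real signatures.

    // Minimal usage sketch (hypothetical API): encode a sentence into token IDs.
    use cipher_tokenizer::CipherTokenizer;

    fn main() {
        // Assumed constructor that loads a default text pipeline.
        let tokenizer = CipherTokenizer::default();

        // Encode returns an Encoding; the accessor names below are illustrative.
        let encoding = tokenizer.encode("Hello, PD.Cipher!");
        println!("tokens: {:?}", encoding.tokens());
        println!("ids:    {:?}", encoding.ids());
    }

Because the tokenizer's output is deterministic, the same input always produces the same token IDs, which is what makes it suitable as the first stage of an encrypted ML workflow.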

Use Cases

The PD.Cipher Tokenizer SDK is ideal for:

  • Secure ML Pipelines: Tokenize data before encryption for privacy-preserving ML (see the sketch after this list)
  • High-Performance NLP: Process large text corpora efficiently
  • Multimodal AI: Handle text, vision, and audio data in unified pipelines
  • Custom Tokenization: Build domain-specific tokenizers for specialized applications
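
To make the first use case concrete: data is tokenized inside the trusted environment first, and only the encrypted token IDs leave it. The sketch below reuses the hypothetical CipherTokenizer API from the earlier examples and pairs it with the third-party aes-gcm crate purely for illustration; nothing in this documentation states that the SDK bundles encryption primitives, and the little-endian serialization step is a choice made for the example.

    // Sketch: tokenize first, then encrypt the token IDs (privacy-preserving pipeline).
    // CipherTokenizer, encode(), and ids() are assumed names; aes-gcm is a separate crate.
    use aes_gcm::{
        aead::{Aead, AeadCore, KeyInit, OsRng},
        Aes256Gcm,
    };
    use cipher_tokenizer::CipherTokenizer;

    fn main() {
        // 1. Tokenize the raw text (deterministic: equal inputs give equal IDs).
        let tokenizer = CipherTokenizer::default();
        let encoding = tokenizer.encode("sensitive patient note");
        let ids = encoding.ids(); // assumed to return an owned Vec<u32>

        // 2. Serialize the token IDs to bytes (little-endian, chosen for this example).
        let plaintext: Vec<u8> = ids.iter().flat_map(|id| id.to_le_bytes()).collect();

        // 3. Encrypt the serialized IDs with AES-256-GCM before storage or transport.
        let key = Aes256Gcm::generate_key(OsRng);
        let cipher = Aes256Gcm::new(&key);
        let nonce = Aes256Gcm::generate_nonce(&mut OsRng); // 96-bit nonce, unique per message
        let ciphertext = cipher
            .encrypt(&nonce, plaintext.as_slice())
            .expect("encryption failed");

        println!("encrypted {} token IDs into {} bytes", ids.len(), ciphertext.len());
    }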

Support

For questions and support: