ccjieba

A C++ Chinese text segmentation library, refactored from and inspired by cppjieba. C++17, zero external dependencies.

Features

  • 5 segmentation algorithms: MixSegment (default), MPSegment, HMMSegment, FullSegment, QuerySegment
  • Keyword extraction: TF-IDF based keyword extraction
  • POS tagging: Part-of-speech tagging for words and sentences
  • User dictionary: Incremental word additions without rebuilding
  • Zero dependencies: Header-only usage possible for most features

Quick Start

See Getting Started for build instructions and basic usage.

Getting Started

Build

cmake -B build && cmake --build build        # static lib + demo
cmake -B build -DBUILD_TESTING=ON            # static lib + demo + tests
cd build && ctest                            # run tests

Basic Usage

Include jieba.hh, construct a Jieba object, load data via istream:

#include <fstream>
#include <jieba.hh>

ccjieba::Jieba jieba;
std::ifstream("data/jieba.dict.utf8") >> jieba.trie_;
std::ifstream("data/hmm_model.utf8") >> jieba.hmm_;
std::ifstream("data/idf.utf8") >> jieba.idf_;
std::ifstream("data/stop_words.utf8") >> jieba.stop_words_;

Data files are located in data/ (configured at compile time via the DATA_ROOT macro):

data/
├── jieba.dict.utf8    # dictionary (required)
├── hmm_model.utf8     # HMM model (required)
├── idf.utf8           # IDF weights (required for keywords)
└── stop_words.utf8    # stop words (required for keywords)

Jieba & Cut

The Jieba class is the main entry point. It holds the dictionary trie, HMM model, IDF table, and stop word set.

Jieba::cut

template <AlgoConcept Algo = MixSegment>
auto cut(std::string_view str, std::optional<size_t> max_word_length = 500)
    -> std::vector<std::string>;

Segment str into words. The template parameter Algo selects the segmentation algorithm (default: MixSegment).

max_word_length limits the maximum word length considered during dictionary matching.

auto words = jieba.cut("我来到北京清华大学");
// → {"我", "来到", "北京", "清华大学"}

jieba.cut<ccjieba::FullSegment>("我来到北京清华大学");
// → {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}

Available algorithms:

  • MixSegment (default): Dictionary MPS + HMM for OOV words
  • MPSegment: Pure dictionary max-probability
  • HMMSegment: Pure HMM Viterbi decoding
  • FullSegment: Enumerate all dictionary matches
  • QuerySegment: FullSegment + substrings of given length, for search recall

Keyword Extraction

Jieba::extract

auto extract(std::string_view str, size_t topN) -> std::vector<Keyword>;

Extract the top-N keywords ranked by TF-IDF weight (descending).

auto keywords = jieba.extract(
    "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。",
    5
);
for (auto &kw : keywords)
    std::cout << kw.word << ": " << kw.weight << "\n";
// CEO: 11.7392
// 升职: 10.8562
// 加薪: 10.6426
// 手扶拖拉机: 10.0089
// 巅峰: 9.49396

Keyword struct

struct Keyword {
    double weight;                  // TF-IDF weight
    std::string word;               // keyword text
    std::vector<std::size_t> offsets; // byte offsets in original string
};

Requires idf.utf8 and stop_words.utf8 to be loaded.

POS Tagging

Jieba::tag_sentence

auto tag_sentence(std::string_view str)
    -> std::vector<std::pair<std::string, std::string_view>>;

Segment a sentence and tag each word with its part-of-speech. Dictionary-matched words use the tag from the dictionary; unmatched words are auto-classified.

for (auto &[word, tag] : jieba.tag_sentence("我是蓝翔技工"))
    std::cout << word << "/" << tag << " ";
// 我/r 是/v 蓝翔/nz 技工/n

Jieba::tag_word

auto tag_word(std::string_view str) -> std::string_view;

Return the POS tag for a single word. Falls back to auto-classification when the word is not in the dictionary:

jieba.tag_word("手扶拖拉机");  // "n"    ← dictionary match
jieba.tag_word("CEO");        // "eng"  ← English
jieba.tag_word("123");        // "m"    ← number
jieba.tag_word("龘龘");       // "x"    ← unknown

MixSegment

The default segmentation algorithm. Combines dictionary-based max-probability segmentation with HMM for out-of-vocabulary (OOV) word handling.

Strategy: First applies MPSegment using the dictionary trie, then handles any remaining unsegmented characters with HMMSegment (Viterbi decoding).

This is the recommended algorithm for general-purpose segmentation.

MPSegment

Pure dictionary-based max-probability segmentation.

Strategy: Builds a DAG of all possible dictionary matches, then finds the maximum probability path using dynamic programming. Word probabilities are derived from dictionary frequencies.

Does not handle OOV words — characters not found in the dictionary are left unsegmented.
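
The DAG-plus-dynamic-programming idea can be sketched in isolation. The function name mp_cut, the byte-position DP, and the log-weights below are illustrative only; ccjieba builds the DAG via its dictionary trie and derives weights from dictionary frequencies:

```cpp
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Toy max-probability segmentation: at each byte position, try every
// dictionary word; dynamic programming keeps the best-scoring path.
// (Illustrative sketch; not ccjieba's actual implementation.)
std::vector<std::string> mp_cut(const std::string &s,
                                const std::unordered_map<std::string, double> &dict) {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    std::vector<double> best(s.size() + 1, neg_inf);
    std::vector<std::size_t> prev(s.size() + 1, 0);
    best[0] = 0.0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (best[i] == neg_inf) continue;  // position not reachable
        for (const auto &[word, logp] : dict)
            if (s.compare(i, word.size(), word) == 0 &&
                best[i] + logp > best[i + word.size()]) {
                best[i + word.size()] = best[i] + logp;
                prev[i + word.size()] = i;
            }
    }
    // Walk the best path back from the end (assumes every position is
    // reachable, i.e. no OOV characters).
    std::vector<std::string> words;
    for (std::size_t i = s.size(); i > 0; i = prev[i])
        words.insert(words.begin(), s.substr(prev[i], i - prev[i]));
    return words;
}

// mp_cut("我来到清华大学",
//        {{"我", -3.0}, {"来到", -5.0}, {"清华", -6.0},
//         {"大学", -5.0}, {"清华大学", -4.0}})
// → {"我", "来到", "清华大学"}
```

The path 我/来到/清华大学 wins (score -12) over 我/来到/清华/大学 (score -19), which is why longer dictionary words with good frequencies tend to be preferred.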

HMMSegment

Pure HMM-based segmentation using Viterbi decoding.

Strategy: Models Chinese text as a Hidden Markov Model with 4 states:

  • B — Beginning of word
  • E — End of word
  • M — Middle of word
  • S — Single-character word

Uses the HMM model file (hmm_model.utf8) for initial state probabilities, transition probabilities, and emission probabilities.
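
The decoding step can be sketched as a generic four-state Viterbi over log-probabilities. This is a toy standalone function, not ccjieba's implementation; in practice the start, transition, and emission values are the ones loaded from hmm_model.utf8:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Toy Viterbi over the four B/E/M/S states (0=B, 1=E, 2=M, 3=S).
// All values are log-probabilities; emit[t][s] is the emission score of
// observation t under state s. (Illustrative sketch only.)
std::vector<int> viterbi(const std::vector<std::array<double, 4>> &emit,
                         const std::array<double, 4> &start,
                         const std::array<std::array<double, 4>, 4> &trans) {
    std::size_t T = emit.size();
    std::vector<std::array<double, 4>> score(T);
    std::vector<std::array<int, 4>> back(T);
    for (int s = 0; s < 4; ++s) score[0][s] = start[s] + emit[0][s];
    for (std::size_t t = 1; t < T; ++t)
        for (int s = 0; s < 4; ++s) {
            score[t][s] = -1e300;
            for (int p = 0; p < 4; ++p) {
                double v = score[t - 1][p] + trans[p][s] + emit[t][s];
                if (v > score[t][s]) { score[t][s] = v; back[t][s] = p; }
            }
        }
    int s = 0;  // best final state
    for (int c = 1; c < 4; ++c)
        if (score[T - 1][c] > score[T - 1][s]) s = c;
    std::vector<int> path(T, s);
    for (std::size_t t = T - 1; t > 0; --t) path[t - 1] = s = back[t][s];
    return path;
}
```

The state sequence maps back to a segmentation: each word spans from a B (or stands alone as an S) through any Ms to the matching E.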

FullSegment

Enumerate all dictionary matches at every position.

Strategy: For each position in the input string, find every dictionary entry that matches starting at that position. Returns all matches, producing a superset of possible segmentations.

Useful for search indexing and recall-oriented applications.
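
The enumeration can be sketched with a toy function over a plain word list (illustrative only; ccjieba walks its trie rather than testing every dictionary entry at every position):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy full segmentation: emit every dictionary word that matches at
// every byte position of the input. (Sketch; not ccjieba's internals.)
std::vector<std::string> full_cut(const std::string &s,
                                  const std::vector<std::string> &dict) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (const auto &w : dict)
            if (s.compare(i, w.size(), w) == 0)
                out.push_back(w);
    return out;
}

// full_cut("清华大学", {"清华", "清华大学", "华大", "大学"})
// → {"清华", "清华大学", "华大", "大学"}
```

Note that overlapping matches ("清华" and "华大") are all kept, which is exactly what makes the output useful for recall-oriented indexing.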

QuerySegment

Extended full segmentation optimized for search queries.

Strategy: Applies FullSegment to find all dictionary matches, then additionally generates substrings of configurable length to maximize search recall.

Controlled by the max_word_length parameter (default 500) passed to Jieba::cut.

Data Formats

Dictionary (jieba.dict.utf8)

Format: word frequency tag, one entry per line.

清华大学 5 ns
来到 3 v

  • frequency: integer, used by MPSegment for probability calculation
  • tag: optional part-of-speech tag
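
One line of this format can be read with plain stream extraction. The sketch below (parse_dict_line and the DictEntry fields are names chosen here, not ccjieba's API) also tolerates the omitted frequency/tag forms allowed in user dictionaries:

```cpp
#include <sstream>
#include <string>

struct DictEntry {
    std::string word;
    long freq = 0;    // 0 when the line omits the frequency
    std::string tag;  // empty when the line omits the tag
};

// Parse one "word [frequency] [tag]" line (illustrative sketch;
// not ccjieba's actual parser).
DictEntry parse_dict_line(const std::string &line) {
    std::istringstream in(line);
    DictEntry e;
    std::string second, third;
    in >> e.word >> second >> third;
    if (!second.empty() &&
        second.find_first_not_of("0123456789") == std::string::npos) {
        e.freq = std::stol(second);  // "清华大学 5 ns"
        e.tag = third;
    } else {
        e.tag = second;              // "云计算 n" (frequency omitted)
    }
    return e;
}
```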

HMM Model (hmm_model.utf8)

Log-probability values. Structure:

  1. Line 1: 4 initial state probabilities (B, E, M, S)
  2. Lines 2-5: 4×4 transition probability matrix
  3. Lines 6-9: Emission probabilities for each state, format char:prob,char:prob,...

-0.2626 -3.14e+100 -3.14e+100 -1.4653
-3.14e+100 -0.5108 -0.9163 -3.14e+100
-0.5897 -3.14e+100 -3.14e+100 -0.8085
-3.14e+100 -0.3334 -1.2604 -3.14e+100
-0.7212 -3.14e+100 -3.14e+100 -0.6659
耀:-10.46,蘄:-11.02
耀:-9.27,蘄:-17.33
耀:-8.48,蘄:-14.37
耀:-11.22,蘄:-10.01

-3.14e+100 represents negative infinity (impossible transitions).
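
An emission line can be split on commas and colons; since the probability values never contain a colon, the first colon in each token separates the UTF-8 character from its log-probability. This is an illustrative parser, not ccjieba's loader:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Parse one "char:prob,char:prob,..." emission line into a map from
// UTF-8 character to log-probability. (Illustrative sketch.)
std::unordered_map<std::string, double> parse_emission(const std::string &line) {
    std::unordered_map<std::string, double> probs;
    std::size_t pos = 0;
    while (pos < line.size()) {
        std::size_t comma = line.find(',', pos);
        if (comma == std::string::npos) comma = line.size();
        std::size_t colon = line.find(':', pos);  // values never contain ':'
        probs[line.substr(pos, colon - pos)] =
            std::stod(line.substr(colon + 1, comma - colon - 1));
        pos = comma + 1;
    }
    return probs;
}
```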

IDF (idf.utf8)

Format: word IDF_value, one entry per line.

来到 10.5
北京 8.2

Used for TF-IDF keyword extraction.
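
The ranking step amounts to term frequency times IDF, sorted descending. The sketch below (ToyKeyword and rank_tfidf are names chosen here) shows only that step; it simply skips words absent from the IDF table and does no stop-word filtering, so it is not a drop-in for Jieba::extract:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct ToyKeyword {
    double weight;
    std::string word;
};

// Rank words by term frequency * IDF and keep the top N.
// (Toy sketch of the ranking step only.)
std::vector<ToyKeyword> rank_tfidf(const std::vector<std::string> &words,
                                   const std::unordered_map<std::string, double> &idf,
                                   std::size_t topN) {
    std::unordered_map<std::string, double> tf;
    for (const auto &w : words) tf[w] += 1.0;
    std::vector<ToyKeyword> ranked;
    for (const auto &[w, f] : tf)
        if (auto it = idf.find(w); it != idf.end())
            ranked.push_back({f * it->second, w});
    std::sort(ranked.begin(), ranked.end(),
              [](const ToyKeyword &a, const ToyKeyword &b) {
                  return a.weight > b.weight;
              });
    if (ranked.size() > topN) ranked.resize(topN);
    return ranked;
}
```

With the sample IDF values above, a word appearing twice ("北京", 2 × 8.2 = 16.4) outranks a rarer word appearing once ("来到", 10.5).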

Stop Words (stop_words.utf8)

One word per line:

的
了
在

Filtered out during keyword extraction.
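
The filtering itself is a simple set-membership test, sketched here with a standalone helper (illustrative; ccjieba applies this inside its extractor):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Drop stop words from a token list. (Illustrative sketch.)
std::vector<std::string> drop_stop_words(
    const std::vector<std::string> &words,
    const std::unordered_set<std::string> &stop) {
    std::vector<std::string> kept;
    for (const auto &w : words)
        if (!stop.count(w)) kept.push_back(w);
    return kept;
}

// drop_stop_words({"我", "的", "书"}, {"的", "了", "在"})
// → {"我", "书"}
```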

User Dictionary

Add custom words without rebuilding the main dictionary.

Loading

std::ifstream("user.dict.utf8") >> jieba.trie_.user();

File Format

One entry per line. Frequency and tag are optional:

云计算 5 n
云计算 n
云计算

Effect

Words added via the user dictionary take priority over dictionary entries during segmentation. This is useful for domain-specific terms, new words, or correcting segmentation errors.

API Reference (Doxygen)