ccjieba
A C++ Chinese text segmentation library: a refactoring inspired by cppjieba. C++17, zero external dependencies.
Features
- 5 segmentation algorithms: MixSegment (default), MPSegment, HMMSegment, FullSegment, QuerySegment
- Keyword extraction: TF-IDF based keyword extraction
- POS tagging: Part-of-speech tagging for words and sentences
- User dictionary: Incremental word additions without rebuilding
- Zero dependencies: Header-only usage possible for most features
Quick Start
See Getting Started for build instructions and basic usage.
Getting Started
Build
cmake -B build && cmake --build build # static lib + demo
cmake -B build -DBUILD_TESTING=ON && cmake --build build # static lib + demo + tests
cd build && ctest # run tests
Basic Usage
Include jieba.hh, construct a Jieba object, and load the data files via std::istream:
#include <fstream>
#include <jieba.hh>
ccjieba::Jieba jieba;
std::ifstream("data/jieba.dict.utf8") >> jieba.trie_;
std::ifstream("data/hmm_model.utf8") >> jieba.hmm_;
std::ifstream("data/idf.utf8") >> jieba.idf_;
std::ifstream("data/stop_words.utf8") >> jieba.stop_words_;
Data files are located in data/ (configured at compile time via the DATA_ROOT macro):
data/
├── jieba.dict.utf8 # dictionary (required)
├── hmm_model.utf8 # HMM model (required)
├── idf.utf8 # IDF weights (required for keywords)
└── stop_words.utf8 # stop words (required for keywords)
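For convenience, the four loads can live in one helper. A minimal sketch, assuming DATA_ROOT expands to a path string literal (its exact definition is set by the build):
#include <fstream>
#include <string>
#include <jieba.hh>

// Hypothetical convenience loader; not part of the ccjieba API.
void load_default_data(ccjieba::Jieba &jieba) {
    const std::string root = DATA_ROOT; // assumed: expands to the data directory path
    std::ifstream(root + "/jieba.dict.utf8") >> jieba.trie_;
    std::ifstream(root + "/hmm_model.utf8") >> jieba.hmm_;
    std::ifstream(root + "/idf.utf8") >> jieba.idf_;
    std::ifstream(root + "/stop_words.utf8") >> jieba.stop_words_;
}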
Jieba & Cut
The Jieba class is the main entry point. It holds the dictionary trie, HMM model, IDF table, and stop word set.
Jieba::cut
template <AlgoConcept Algo = MixSegment>
auto cut(std::string_view str, std::optional<size_t> max_word_length = 500)
-> std::vector<std::string>;
Segment str into words. The template parameter Algo selects the segmentation algorithm (default: MixSegment).
max_word_length caps the word length considered during dictionary matching.
auto words = jieba.cut("我来到北京清华大学");
// → {"我", "来到", "北京", "清华大学"}
jieba.cut<ccjieba::FullSegment>("我来到北京清华大学");
// → {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}
| Algorithm | Description |
|---|---|
| MixSegment (default) | Dictionary max-probability segmentation + HMM for OOV words |
| MPSegment | Pure dictionary max-probability segmentation |
| HMMSegment | Pure HMM Viterbi decoding |
| FullSegment | Enumerate all dictionary matches |
| QuerySegment | FullSegment + substrings of a given length, for search recall |
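Since the algorithm is just a template parameter, the five variants are easy to compare on one input (illustrative; output depends on the loaded data):
auto mix   = jieba.cut<ccjieba::MixSegment>("我来到北京清华大学");
auto mp    = jieba.cut<ccjieba::MPSegment>("我来到北京清华大学");
auto hmm   = jieba.cut<ccjieba::HMMSegment>("我来到北京清华大学");
auto full  = jieba.cut<ccjieba::FullSegment>("我来到北京清华大学");
auto query = jieba.cut<ccjieba::QuerySegment>("我来到北京清华大学");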
Keyword Extraction
Jieba::extract
auto extract(std::string_view str, size_t topN) -> std::vector<Keyword>;
Extract the top-N keywords ranked by TF-IDF weight (descending).
auto keywords = jieba.extract(
"我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。",
5
);
for (auto &kw : keywords)
std::cout << kw.word << ": " << kw.weight << "\n";
// CEO: 11.7392
// 升职: 10.8562
// 加薪: 10.6426
// 手扶拖拉机: 10.0089
// 巅峰: 9.49396
Keyword struct
struct Keyword {
double weight; // TF-IDF weight
std::string word; // keyword text
std::vector<std::size_t> offsets; // byte offsets in original string
};
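The offsets field lets callers map keywords back into the source text. A short sketch using the keywords vector from the example above (offsets are byte positions, per the comment in the struct):
// Print where each keyword occurs in the original UTF-8 string.
for (const auto &kw : keywords)
    for (std::size_t off : kw.offsets)
        std::cout << kw.word << " at byte " << off << "\n";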
Requires idf.utf8 and stop_words.utf8 to be loaded.
POS Tagging
Jieba::tag_sentence
auto tag_sentence(std::string_view str)
-> std::vector<std::pair<std::string, std::string_view>>;
Segment a sentence and tag each word with its part-of-speech. Dictionary-matched words use the tag from the dictionary; unmatched words are auto-classified.
for (auto &[word, tag] : jieba.tag_sentence("我是蓝翔技工"))
std::cout << word << "/" << tag << " ";
// 我/r 是/v 蓝翔/nz 技工/n
Jieba::tag_word
auto tag_word(std::string_view str) -> std::string_view;
Return the POS tag for a single word. Falls back to auto-classification when the word is not in the dictionary:
jieba.tag_word("手扶拖拉机"); // "n" ← dictionary match
jieba.tag_word("CEO"); // "eng" ← English
jieba.tag_word("123"); // "m" ← number
jieba.tag_word("龘龘"); // "x" ← unknown
MixSegment
The default segmentation algorithm. Combines dictionary-based max-probability segmentation with HMM for out-of-vocabulary (OOV) word handling.
Strategy: First applies MPSegment using the dictionary trie, then handles any remaining unsegmented characters with HMMSegment (Viterbi decoding).
This is the recommended algorithm for general-purpose segmentation.
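Upstream jieba's classic demonstration uses "杭研", a word missing from the stock dictionary that the HMM pass recovers; the output below is illustrative and depends on the loaded data:
auto words = jieba.cut("他来到了网易杭研大厦");
// → {"他", "来到", "了", "网易", "杭研", "大厦"} ("杭研" recovered by HMM)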
MPSegment
Pure dictionary-based max-probability segmentation.
Strategy: Builds a DAG of all possible dictionary matches, then finds the maximum probability path using dynamic programming. Word probabilities are derived from dictionary frequencies.
Does not handle OOV words — characters not found in the dictionary are left unsegmented.
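The DAG-plus-dynamic-programming idea fits in a few lines. A self-contained toy sketch, independent of ccjieba's internals (ASCII strings keep the indexing simple; ccjieba operates on decoded Unicode runes):
#include <cmath>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Toy dictionary: word -> frequency (total frequency 15).
    std::unordered_map<std::string, double> dict = {
        {"a", 2}, {"b", 2}, {"c", 1}, {"ab", 4}, {"abc", 6}};
    const double total = 15;
    const std::string s = "abc";
    const size_t n = s.size();
    // best[i] = max log-probability over segmentations of s[0..i)
    std::vector<double> best(n + 1, -1e100);
    std::vector<size_t> prev(n + 1, 0);
    best[0] = 0;
    for (size_t i = 0; i < n; ++i) {
        if (best[i] <= -1e100) continue;   // position unreachable
        for (size_t j = i + 1; j <= n; ++j) {
            auto it = dict.find(s.substr(i, j - i));
            if (it == dict.end()) continue;
            const double cand = best[i] + std::log(it->second / total);
            if (cand > best[j]) { best[j] = cand; prev[j] = i; }
        }
    }
    // Walk back along the max-probability path.
    std::vector<std::string> words;
    for (size_t j = n; j > 0; j = prev[j])
        words.push_back(s.substr(prev[j], j - prev[j]));
    for (auto it = words.rbegin(); it != words.rend(); ++it)
        std::cout << *it << "\n";          // prints "abc": one word wins
}
Every dictionary hit starting at position i is an edge from i to j in the DAG; best[] then selects the highest log-probability path.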
HMMSegment
Pure HMM-based segmentation using Viterbi decoding.
Strategy: Models Chinese text as a Hidden Markov Model with 4 states:
- B — Begin of word
- E — End of word
- M — Middle of word
- S — Single-character word
Uses the HMM model file (hmm_model.utf8) for initial state probabilities, transition probabilities, and emission probabilities.
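A self-contained Viterbi sketch over the four states, using toy log-probabilities rather than the real hmm_model.utf8 values:
#include <array>
#include <iostream>
#include <string>
#include <vector>

int main() {
    enum { B, E, M, S };
    const double NEG = -3.14e+100; // "impossible", as in the model file
    // Toy parameters, NOT the real model values.
    const std::array<double, 4> init = {-0.26, NEG, NEG, -1.47};
    const double trans[4][4] = {
        /* from B */ {NEG, -0.51, -0.92, NEG},
        /* from E */ {-0.59, NEG, NEG, -0.81},
        /* from M */ {NEG, -0.33, -1.26, NEG},
        /* from S */ {-0.72, NEG, NEG, -0.67}};
    // Pretend every character is equally likely in every state.
    auto emit = [](int /*state*/, char /*c*/) { return -1.0; };

    const std::string obs = "abcd"; // stand-in for a run of Chinese characters
    const size_t n = obs.size();
    std::vector<std::array<double, 4>> dp(n);
    std::vector<std::array<int, 4>> back(n);
    for (int s = 0; s < 4; ++s) dp[0][s] = init[s] + emit(s, obs[0]);
    for (size_t t = 1; t < n; ++t)
        for (int s = 0; s < 4; ++s) {
            dp[t][s] = NEG; back[t][s] = B;
            for (int p = 0; p < 4; ++p) {
                const double cand = dp[t - 1][p] + trans[p][s];
                if (cand > dp[t][s]) { dp[t][s] = cand; back[t][s] = p; }
            }
            dp[t][s] += emit(s, obs[t]);
        }
    // A word can only end in state E or S; pick the better and walk back.
    int s = dp[n - 1][E] > dp[n - 1][S] ? E : S;
    std::string tags(n, '?');
    const char name[4] = {'B', 'E', 'M', 'S'};
    for (size_t t = n; t-- > 0;) { tags[t] = name[s]; s = back[t][s]; }
    std::cout << tags << "\n"; // "BEBE": word boundaries fall after each E/S
}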
FullSegment
Enumerate all dictionary matches at every position.
Strategy: For each position in the input string, find every dictionary entry that matches starting at that position. All matches are returned, yielding overlapping words rather than a single partition of the input.
Useful for search indexing and recall-oriented applications.
QuerySegment
Extended full segmentation optimized for search queries.
Strategy: Applies FullSegment to find all dictionary matches, then additionally generates substrings of configurable length to maximize search recall.
Controlled by the max_word_length parameter (default 500) passed to Jieba::cut.
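Usage mirrors the other algorithms; a smaller max_word_length tightens the substring generation (illustrative):
auto terms = jieba.cut<ccjieba::QuerySegment>("我来到北京清华大学", 4);
// full dictionary matches plus extra substrings, bounded by max_word_length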
Data Formats
Dictionary (jieba.dict.utf8)
Format: word frequency tag, one entry per line.
清华大学 5 ns
来到 3 v
- frequency: integer, used by MPSegment for probability calculation
- tag: optional part-of-speech tag
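A dictionary line parses naturally with a string stream. A hypothetical helper sketch, not ccjieba's actual loader:
#include <sstream>
#include <string>

struct DictEntry {
    std::string word;
    long freq = 0;      // stays 0 if the field is absent
    std::string tag;    // stays empty if the field is absent
};

DictEntry parse_dict_line(const std::string &line) {
    DictEntry e;
    std::istringstream in(line);
    in >> e.word >> e.freq >> e.tag; // missing trailing fields fail harmlessly
    return e;
}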
HMM Model (hmm_model.utf8)
Log-probability values. Structure:
- Line 1: 4 initial state probabilities (B, E, M, S)
- Lines 2-5: 4×4 transition probability matrix
- Lines 6-9: Emission probabilities for each state, in the format char:prob,char:prob,...
-0.2626 -3.14e+100 -3.14e+100 -1.4653
-3.14e+100 -0.5108 -0.9163 -3.14e+100
-0.5897 -3.14e+100 -3.14e+100 -0.8085
-3.14e+100 -0.3334 -1.2604 -3.14e+100
-0.7212 -3.14e+100 -3.14e+100 -0.6659
耀:-10.46,蘄:-11.02
耀:-9.27,蘄:-17.33
耀:-8.48,蘄:-14.37
耀:-11.22,蘄:-10.01
-3.14e+100 represents negative infinity (impossible transitions).
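An emission line splits on ',' and then ':'. A hypothetical parser sketch (keys kept as UTF-8 strings; an ASCII ':' byte never occurs inside a multi-byte UTF-8 character, so the split is safe):
#include <sstream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, double> parse_emissions(const std::string &line) {
    std::unordered_map<std::string, double> probs;
    std::istringstream in(line);
    std::string item;
    while (std::getline(in, item, ',')) {
        const auto colon = item.find(':');
        if (colon == std::string::npos) continue;
        probs[item.substr(0, colon)] = std::stod(item.substr(colon + 1));
    }
    return probs;
}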
IDF (idf.utf8)
Format: word IDF_value, one entry per line.
来到 10.5
北京 8.2
Used for TF-IDF keyword extraction.
Stop Words (stop_words.utf8)
One word per line:
的
了
在
Filtered out during keyword extraction.
User Dictionary
Add custom words without rebuilding the main dictionary.
Loading
std::ifstream("user.dict.utf8") >> jieba.trie_.user();
File Format
One entry per line. Frequency and tag are optional:
云计算 5 n
云计算 n
云计算
Effect
Words added via the user dictionary take priority over main dictionary entries during segmentation. This is useful for domain-specific terms, new words, or correcting segmentation errors.
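A before/after illustration, assuming "云计算" appears in user.dict.utf8 but not in the main dictionary:
auto before = jieba.cut("云计算时代");                  // "云计算" may come back split
std::ifstream("user.dict.utf8") >> jieba.trie_.user(); // user dictionary contains "云计算"
auto after = jieba.cut("云计算时代");                   // → {"云计算", "时代"}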