Corpus I/O¶
Build frequency Counters from tokens, text, or count files.
keyflux.io.corpus
¶
Build frequency Counters from tokens, text, or count files.
keyflux is about keyness and rank comparison, not tokenisation. These helpers
cover the simple cases; for real linguistic tokenisation, pre-tokenise (for
example with kenon.Tokenizer) and pass the resulting Counter directly.
counts_from_text(text, *, lowercase=True)
¶
Tokenise a string on word characters, then count.
Uses a simple word-character regular expression — adequate for demos and tests, not a substitute for linguistic tokenisation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Raw text to tokenise and count. |
required |
lowercase
|
bool
|
If True, lowercase before counting. |
True
|
Returns:
| Type | Description |
|---|---|
Counter[str]
|
A Counter mapping each type to its frequency. |
Examples:
Source code in keyflux/io/corpus.py
counts_from_tokens(tokens, *, lowercase=True)
¶
Build a frequency Counter from a token iterable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tokens
|
Iterable[str]
|
Already-tokenised word types. |
required |
lowercase
|
bool
|
If True, lowercase each token before counting. |
True
|
Returns:
| Type | Description |
|---|---|
Counter[str]
|
A Counter mapping each type to its frequency. |
Examples:
Source code in keyflux/io/corpus.py
load_counts(path)
¶
Read a count file into a Counter.
Each non-empty line is either type<TAB>count or a bare type (counted
as one occurrence per line).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str | Path
|
Path to the count file. |
required |
Returns:
| Type | Description |
|---|---|
Counter[str]
|
A Counter built from the file. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If |
ValueError
|
If a count field is present but not an integer. |
Examples:
>>> import tempfile, pathlib
>>> p = pathlib.Path(tempfile.mkdtemp()) / "counts.tsv"
>>> rows = ["climate" + chr(9) + "30", "carbon" + chr(9) + "12"]
>>> _ = p.write_text(chr(10).join(rows))
>>> load_counts(p)
Counter({'climate': 30, 'carbon': 12})