Corpus I/O¶

Build frequency Counters from tokens, text, or count files.

`keyflux.io.corpus` ¶

Build frequency Counters from tokens, text, or count files.

keyflux is about keyness and rank comparison, not tokenisation. These helpers cover the simple cases; for real linguistic tokenisation, pre-tokenise (for example with kenon.Tokenizer) and pass the resulting Counter directly.

`counts_from_text(text, *, lowercase=True)` ¶

Tokenise a string on word characters, then count.

Uses a simple word-character regular expression — adequate for demos and tests, not a substitute for linguistic tokenisation.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw text to tokenise and count.	required
`lowercase`	`bool`	If True, lowercase before counting.	`True`

Returns:

Type	Description
`Counter[str]`	A Counter mapping each type to its frequency.

Examples:

>>> counts_from_text("The cat sat. The dog ran.")["the"]
2

Source code in keyflux/io/corpus.py

def counts_from_text(text: str, *, lowercase: bool = True) -> Counter[str]:
    """Tokenise a string on word characters, then count.

    Uses a simple word-character regular expression — adequate for demos and
    tests, not a substitute for linguistic tokenisation.

    Args:
        text: Raw text to tokenise and count.
        lowercase: If True, lowercase before counting.

    Returns:
        A Counter mapping each type to its frequency.

    Examples:
        >>> counts_from_text("The cat sat. The dog ran.")["the"]
        2
    """
    tokens = _WORD_RE.findall(text.lower() if lowercase else text)
    return Counter(tokens)

`counts_from_tokens(tokens, *, lowercase=True)` ¶

Build a frequency Counter from a token iterable.

Parameters:

Name	Type	Description	Default
`tokens`	`Iterable[str]`	Already-tokenised word types.	required
`lowercase`	`bool`	If True, lowercase each token before counting.	`True`

Returns:

Type	Description
`Counter[str]`	A Counter mapping each type to its frequency.

Examples:

>>> counts_from_tokens(["The", "cat", "the", "CAT"])
Counter({'the': 2, 'cat': 2})

Source code in keyflux/io/corpus.py

def counts_from_tokens(
    tokens: Iterable[str], *, lowercase: bool = True
) -> Counter[str]:
    """Build a frequency Counter from a token iterable.

    Args:
        tokens: Already-tokenised word types.
        lowercase: If True, lowercase each token before counting.

    Returns:
        A Counter mapping each type to its frequency.

    Examples:
        >>> counts_from_tokens(["The", "cat", "the", "CAT"])
        Counter({'the': 2, 'cat': 2})
    """
    if lowercase:
        return Counter(t.lower() for t in tokens)
    return Counter(tokens)

`load_counts(path)` ¶

Read a count file into a Counter.

Each non-empty line is either type<TAB>count or a bare type (counted as one occurrence per line).

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to the count file.	required

Returns:

Type	Description
`Counter[str]`	A Counter built from the file.

Raises:

Type	Description
`FileNotFoundError`	If `path` does not exist.
`ValueError`	If a count field is present but not an integer.

Examples:

>>> import tempfile, pathlib
>>> p = pathlib.Path(tempfile.mkdtemp()) / "counts.tsv"
>>> rows = ["climate" + chr(9) + "30", "carbon" + chr(9) + "12"]
>>> _ = p.write_text(chr(10).join(rows))
>>> load_counts(p)
Counter({'climate': 30, 'carbon': 12})

Source code in keyflux/io/corpus.py

def load_counts(path: str | Path) -> Counter[str]:
    """Read a count file into a Counter.

    Each non-empty line is either ``type<TAB>count`` or a bare ``type`` (counted
    as one occurrence per line).

    Args:
        path: Path to the count file.

    Returns:
        A Counter built from the file.

    Raises:
        FileNotFoundError: If ``path`` does not exist.
        ValueError: If a count field is present but not an integer.

    Examples:
        >>> import tempfile, pathlib
        >>> p = pathlib.Path(tempfile.mkdtemp()) / "counts.tsv"
        >>> rows = ["climate" + chr(9) + "30", "carbon" + chr(9) + "12"]
        >>> _ = p.write_text(chr(10).join(rows))
        >>> load_counts(p)
        Counter({'climate': 30, 'carbon': 12})
    """
    counts: Counter[str] = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) == 1:
            counts[parts[0]] += 1
        else:
            word, raw = parts[0], parts[1]
            try:
                counts[word] += int(raw)
            except ValueError as exc:
                msg = f"Non-integer count {raw!r} for type {word!r}."
                raise ValueError(msg) from exc
    return counts

Corpus I/O¶

keyflux.io.corpus ¶

counts_from_text(text, *, lowercase=True) ¶

counts_from_tokens(tokens, *, lowercase=True) ¶

load_counts(path) ¶

`keyflux.io.corpus` ¶

`counts_from_text(text, *, lowercase=True)` ¶

`counts_from_tokens(tokens, *, lowercase=True)` ¶

`load_counts(path)` ¶