Classify¶

Keyword and lockword categorisation helpers.

`keyflux.keyness.classify` ¶

Keyword and lockword categorisation (Baker 2011; Brezina Ch. 3).

Each compared type is one of three categories:

Positive keyword (+): significantly more frequent in the focus corpus.
Negative keyword (-): significantly more frequent in the reference corpus.
Lockword (0): comparable relative frequency in both corpora.

This module owns the categorisation boundary (direction and band thresholds); the numeric zero-cell flooring lives in :mod:keyflux.keyness.measures.

`Category = Literal['keyword+', 'keyword-', 'lockword', 'other']` `module-attribute` ¶

The bucket a type falls into under :func:classify_row.

`classify_direction(focus_rf, reference_rf)` ¶

Decide keyness polarity from relative frequencies.

Parameters:

Name	Type	Description	Default
`focus_rf`	`float`	Relative frequency of the type in the focus corpus.	required
`reference_rf`	`float`	Relative frequency of the type in the reference corpus.	required

Returns:

Type	Description
`Direction`	`"positive"` if more frequent in the focus corpus, `"negative"` if
`Direction`	more frequent in the reference corpus, `"neutral"` if exactly equal.

Contract

Swapping the two arguments swaps "positive" and "negative" and leaves "neutral" unchanged.

Examples:

>>> classify_direction(0.003, 0.001)
'positive'
>>> classify_direction(0.001, 0.003)
'negative'
>>> classify_direction(0.002, 0.002)
'neutral'

Source code in keyflux/keyness/classify.py

def classify_direction(focus_rf: float, reference_rf: float) -> Direction:
    """Decide keyness polarity from relative frequencies.

    Args:
        focus_rf: Relative frequency of the type in the focus corpus.
        reference_rf: Relative frequency of the type in the reference corpus.

    Returns:
        ``"positive"`` if more frequent in the focus corpus, ``"negative"`` if
        more frequent in the reference corpus, ``"neutral"`` if exactly equal.

    Contract:
        - Swapping the two arguments swaps ``"positive"`` and ``"negative"`` and
          leaves ``"neutral"`` unchanged.

    Examples:
        >>> classify_direction(0.003, 0.001)
        'positive'
        >>> classify_direction(0.001, 0.003)
        'negative'
        >>> classify_direction(0.002, 0.002)
        'neutral'
    """
    if focus_rf > reference_rf:
        return "positive"
    if focus_rf < reference_rf:
        return "negative"
    return "neutral"

`classify_row(row, *, min_significance='p05', lockword_max_abs_log_ratio=0.5)` ¶

Bucket a keyness row into keyword(+/-), lockword, or other.

Parameters:

Name	Type	Description	Default
`row`	`KeynessRow`	A scored keyness row.	required
`min_significance`	`Significance`	The weakest band that counts as a keyword.	`'p05'`
`lockword_max_abs_log_ratio`	`float`	A non-significant type is a lockword only if its absolute log ratio is at or below this (relative frequencies near parity).	`0.5`

Returns:

Type	Description
`Category`	`"keyword+"`, `"keyword-"`, `"lockword"`, or `"other"` (a
`Category`	non-significant type whose frequencies are too far apart to be stable).

Contract

Significant rows are keywords; their sign follows row.direction.
A non-significant row is a lockword when its effect size is small, otherwise "other".
Frequency cutoffs (minimum evidence in both corpora) are applied by :meth:keyflux.keyness.keyness.Keyness.lockwords, not here.

Examples:

>>> from keyflux.keyness.keyness import KeynessRow
>>> kw = KeynessRow("war", 620, 267, 609.1, 265.0, 140.87, 1.2,
...                  "p0001", 140.87, "positive")
>>> classify_row(kw)
'keyword+'
>>> lock = KeynessRow("the", 59901, 58960, 58848.8, 58519.2, 1.5, 0.01,
...                   "ns", 1.5, "positive")
>>> classify_row(lock)
'lockword'

Source code in keyflux/keyness/classify.py

def classify_row(
    row: KeynessRow,
    *,
    min_significance: Significance = "p05",
    lockword_max_abs_log_ratio: float = 0.5,
) -> Category:
    """Bucket a keyness row into keyword(+/-), lockword, or other.

    Args:
        row: A scored keyness row.
        min_significance: The weakest band that counts as a keyword.
        lockword_max_abs_log_ratio: A non-significant type is a lockword only if
            its absolute log ratio is at or below this (relative frequencies near
            parity).

    Returns:
        ``"keyword+"``, ``"keyword-"``, ``"lockword"``, or ``"other"`` (a
        non-significant type whose frequencies are too far apart to be stable).

    Contract:
        - Significant rows are keywords; their sign follows ``row.direction``.
        - A non-significant row is a lockword when its effect size is small,
          otherwise ``"other"``.
        - Frequency cutoffs (minimum evidence in both corpora) are applied by
          :meth:`keyflux.keyness.keyness.Keyness.lockwords`, not here.

    Examples:
        >>> from keyflux.keyness.keyness import KeynessRow
        >>> kw = KeynessRow("war", 620, 267, 609.1, 265.0, 140.87, 1.2,
        ...                  "p0001", 140.87, "positive")
        >>> classify_row(kw)
        'keyword+'
        >>> lock = KeynessRow("the", 59901, 58960, 58848.8, 58519.2, 1.5, 0.01,
        ...                   "ns", 1.5, "positive")
        >>> classify_row(lock)
        'lockword'
    """
    if is_significant(row.significance, min_significance):
        return "keyword+" if row.direction == "positive" else "keyword-"
    if abs(row.effect_size) <= lockword_max_abs_log_ratio:
        return "lockword"
    return "other"

`is_significant(significance, min_significance='p05')` ¶

Whether a significance band reaches at least min_significance.

Parameters:

Name	Type	Description	Default
`significance`	`Significance`	The band to test.	required
`min_significance`	`Significance`	The weakest band that still counts as significant.	`'p05'`

Returns:

Type	Description
`bool`	True if `significance` is at least as strong as `min_significance`.

Contract

"ns" is never significant for any min_significance above it.
Monotone in the band ordering ns < p05 < p01 < p001 < p0001.

Examples:

>>> is_significant("p001")
True
>>> is_significant("ns")
False
>>> is_significant("p05", min_significance="p01")
False

Source code in keyflux/keyness/classify.py

def is_significant(
    significance: Significance, min_significance: Significance = "p05"
) -> bool:
    """Whether a significance band reaches at least ``min_significance``.

    Args:
        significance: The band to test.
        min_significance: The weakest band that still counts as significant.

    Returns:
        True if ``significance`` is at least as strong as ``min_significance``.

    Contract:
        - ``"ns"`` is never significant for any ``min_significance`` above it.
        - Monotone in the band ordering ns < p05 < p01 < p001 < p0001.

    Examples:
        >>> is_significant("p001")
        True
        >>> is_significant("ns")
        False
        >>> is_significant("p05", min_significance="p01")
        False
    """
    return _BAND_ORDER.index(significance) >= _BAND_ORDER.index(min_significance)

`partition(rows, *, min_significance='p05', lockword_max_abs_log_ratio=0.5)` ¶

Group rows by :func:classify_row category.

Parameters:

Name	Type	Description	Default
`rows`	`Sequence[KeynessRow]`	Scored keyness rows.	required
`min_significance`	`Significance`	The weakest band that counts as a keyword.	`'p05'`
`lockword_max_abs_log_ratio`	`float`	Lockword effect-size ceiling.	`0.5`

Returns:

Type	Description
`dict[Category, list[KeynessRow]]`	A dict mapping each category to its rows. Every input row appears in
`dict[Category, list[KeynessRow]]`	exactly one bucket; absent categories map to an empty list.

Contract

The buckets partition the input: disjoint and exhaustive.
All four category keys are always present.

Examples:

>>> from keyflux.keyness.keyness import KeynessRow
>>> rows = [
...     KeynessRow("war", 620, 267, 609.1, 265.0, 140.87, 1.2,
...                "p0001", 140.87, "positive"),
...     KeynessRow("the", 59901, 58960, 58848.8, 58519.2, 1.5, 0.01,
...                "ns", 1.5, "positive"),
... ]
>>> buckets = partition(rows)
>>> len(buckets["keyword+"]), len(buckets["lockword"])
(1, 1)

Source code in keyflux/keyness/classify.py

def partition(
    rows: Sequence[KeynessRow],
    *,
    min_significance: Significance = "p05",
    lockword_max_abs_log_ratio: float = 0.5,
) -> dict[Category, list[KeynessRow]]:
    """Group rows by :func:`classify_row` category.

    Args:
        rows: Scored keyness rows.
        min_significance: The weakest band that counts as a keyword.
        lockword_max_abs_log_ratio: Lockword effect-size ceiling.

    Returns:
        A dict mapping each category to its rows. Every input row appears in
        exactly one bucket; absent categories map to an empty list.

    Contract:
        - The buckets partition the input: disjoint and exhaustive.
        - All four category keys are always present.

    Examples:
        >>> from keyflux.keyness.keyness import KeynessRow
        >>> rows = [
        ...     KeynessRow("war", 620, 267, 609.1, 265.0, 140.87, 1.2,
        ...                "p0001", 140.87, "positive"),
        ...     KeynessRow("the", 59901, 58960, 58848.8, 58519.2, 1.5, 0.01,
        ...                "ns", 1.5, "positive"),
        ... ]
        >>> buckets = partition(rows)
        >>> len(buckets["keyword+"]), len(buckets["lockword"])
        (1, 1)
    """
    buckets: dict[Category, list[KeynessRow]] = defaultdict(list)
    for row in rows:
        category = classify_row(
            row,
            min_significance=min_significance,
            lockword_max_abs_log_ratio=lockword_max_abs_log_ratio,
        )
        buckets[category].append(row)
    for key in ("keyword+", "keyword-", "lockword", "other"):
        buckets.setdefault(key, [])
    return dict(buckets)

Classify¶

keyflux.keyness.classify ¶

Category = Literal['keyword+', 'keyword-', 'lockword', 'other'] module-attribute ¶

classify_direction(focus_rf, reference_rf) ¶

classify_row(row, *, min_significance='p05', lockword_max_abs_log_ratio=0.5) ¶

is_significant(significance, min_significance='p05') ¶

partition(rows, *, min_significance='p05', lockword_max_abs_log_ratio=0.5) ¶

`keyflux.keyness.classify` ¶

`Category = Literal['keyword+', 'keyword-', 'lockword', 'other']` `module-attribute` ¶

`classify_direction(focus_rf, reference_rf)` ¶

`classify_row(row, *, min_significance='p05', lockword_max_abs_log_ratio=0.5)` ¶

`is_significant(significance, min_significance='p05')` ¶

`partition(rows, *, min_significance='p05', lockword_max_abs_log_ratio=0.5)` ¶