Measures
Pure-function keyness association measures. The log-likelihood is the significance backbone; the log ratio is the effect size used to rank keywords.
keyflux.keyness.measures
Keyness association measures as pure functions.
Each measure compares one word type across a focus corpus C and a reference
corpus R. The four observed quantities are the type's frequency in each corpus
(a in C, b in R) and the corpus token totals (n_focus,
n_reference). This is the keyword contingency of Brezina, Statistics in
Corpus Linguistics (2018), Ch. 3, and Dunning (1993) for the log-likelihood.
The log-likelihood (Dunning's G2) is the significance backbone; an effect-size measure (log ratio, Simple Maths, %DIFF) is used to rank keywords. Brezina's caution applies: in large corpora the log-likelihood flags far too many keywords, so it is for significance only and the sorting key should be an effect size.
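The filter-then-rank convention described above can be sketched as follows. This is only an illustration: the words and scores are made up, and the tuple layout is an assumption of this sketch rather than a keyflux data structure.

```python
# Illustrative only: the words and scores below are invented for this sketch.
CRITICAL_P05 = 3.84  # chi-square critical value at 1 d.f. (p < 0.05)

scored = [
    # (word, log_likelihood, log_ratio)
    ("the", 950.0, 0.05),
    ("genome", 140.9, 1.20),
    ("lorem", 2.1, 3.00),
    ("protein", 60.3, 2.10),
]

# Step 1: keep only the types whose log-likelihood reaches the threshold.
significant = [row for row in scored if row[1] >= CRITICAL_P05]

# Step 2: rank the survivors by effect size (log ratio), largest first.
keywords = sorted(significant, key=lambda row: row[2], reverse=True)

print([word for word, _, _ in keywords])  # ['protein', 'genome', 'the']
```

Note how "the" has by far the largest log-likelihood yet ranks last: a very frequent word can be highly significant with a negligible effect size, which is why the sorting key is the log ratio.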
CHI2_CRITICAL = {'p05': 3.84, 'p01': 6.63, 'p001': 10.83, 'p0001': 15.13}
module-attribute
Chi-square critical values at 1 d.f. for the log-likelihood significance bands.
SMP_DEFAULT_K = 100.0
module-attribute
Default Simple Maths constant k (Kilgarriff 2009); also a frequency filter.
ZERO_CELL_FLOOR = 0.5
module-attribute
Count substituted for a zero cell in log ratio / %DIFF so exclusives stay finite.
chi_square(a, b, n_focus, n_reference)
Pearson chi-square (1 d.f.) on the 2x2 keyword contingency table.
Included for contrast and teaching only. As Dunning (1993) shows, chi-square
overestimates the significance of rare events, which is exactly the failure
that `log_likelihood` avoids. Prefer the log-likelihood for significance in
real analyses.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
Returns:
| Type | Description |
|---|---|
float
|
The chi-square statistic, always |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Contract
- The result is non-negative.
- Computed over the full 2x2 table (word present / absent x corpus), so it uses the "other tokens" cells that the log-likelihood short form omits.
Examples:
>>> round(chi_square(620, 267, 1_017_879, 1_007_532), 2)
136.96
>>> chi_square(10, 10, 1000, 1000)
0.0
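For readers who want to see where the "other tokens" cells enter, here is a minimal sketch of Pearson's chi-square over the full 2x2 table. The name `chi_square_sketch` is hypothetical and the code is not the keyflux source; input validation is left out because this page does not spell out the validation rules.

```python
def chi_square_sketch(a, b, n_focus, n_reference):
    """Pearson chi-square over the 2x2 table (word present/absent x corpus)."""
    total = n_focus + n_reference
    present = a + b            # row total: tokens that are this word
    absent = total - present   # row total: all other tokens

    rows = (
        (present, a, b),                          # word present
        (absent, n_focus - a, n_reference - b),   # word absent
    )
    statistic = 0.0
    for row_total, obs_focus, obs_ref in rows:
        for observed, col_total in ((obs_focus, n_focus), (obs_ref, n_reference)):
            expected = row_total * col_total / total
            if expected > 0:  # an empty row contributes nothing
                statistic += (observed - expected) ** 2 / expected
    return statistic


print(round(chi_square_sketch(620, 267, 1_017_879, 1_007_532), 2))
```

Under these assumptions the sketch agrees with the documented examples to two decimal places; the two "absent" cells add only about 0.06 here because the corpora are large relative to the word's frequency.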
expected_counts(a, b, n_focus, n_reference)
Expected focus and reference counts under the shared-rate null hypothesis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
Returns:
| Type | Description |
|---|---|
float
|
|
float
|
rate in both corpora. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Contract
- `E_C + E_R` equals `a + b` (the expected counts redistribute the observed total by corpus size).
- Both returned values are non-negative.
Examples:
>>> e_c, e_r = expected_counts(620, 267, 1_017_879, 1_007_532)
>>> round(e_c, 2), round(e_r, 2)
(445.77, 441.23)
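The shared-rate null hypothesis amounts to splitting the observed total `a + b` in proportion to corpus size. A minimal sketch, assuming that reading (the name `expected_counts_sketch` is hypothetical and validation is omitted):

```python
def expected_counts_sketch(a, b, n_focus, n_reference):
    """Split the observed total a + b between the corpora by token share."""
    total_tokens = n_focus + n_reference
    observed_total = a + b
    e_focus = observed_total * n_focus / total_tokens
    e_reference = observed_total * n_reference / total_tokens
    return e_focus, e_reference


e_c, e_r = expected_counts_sketch(620, 267, 1_017_879, 1_007_532)
print(round(e_c, 2), round(e_r, 2))  # 445.77 441.23
```

Because the two shares sum to one, the expected counts always add back up to the observed total, which is the first item of the contract.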
log_likelihood(a, b, n_focus, n_reference)
Dunning's log-likelihood ratio (G2) for one type — the significance score.
Uses the two-cell keyword short form (Brezina eq. 3.4): the statistic is the
unsigned magnitude only. Direction (whether the type is over- or
under-represented in the focus corpus) is decided separately from relative
frequencies; see `keyflux.keyness.classify.classify_direction`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
Returns:
| Type | Description |
|---|---|
float
|
The log-likelihood statistic, always |
float
|
data: |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Contract
- The result is non-negative (it is zero only when the relative frequencies in C and R are identical).
- A zero-frequency cell contributes nothing (the `x ln x -> 0` limit), so the statistic is finite for exclusives.
- Handles rare events far better than `chi_square` (Dunning 1993).
Examples:
>>> round(log_likelihood(620, 267, 1_017_879, 1_007_532), 2)
140.87
>>> log_likelihood(0, 0, 1000, 1000)
0.0
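The two-cell short form can be written as `2 * (a * ln(a / E_C) + b * ln(b / E_R))`, with `E_C` and `E_R` the expected counts under the shared-rate null. The sketch below assumes that formula; the name `log_likelihood_sketch` is hypothetical, validation is omitted, and the final clamp is an addition of this sketch to absorb floating-point noise.

```python
import math


def log_likelihood_sketch(a, b, n_focus, n_reference):
    """Two-cell log-likelihood: 2 * sum(observed * ln(observed / expected))."""
    total_tokens = n_focus + n_reference
    e_focus = (a + b) * n_focus / total_tokens
    e_reference = (a + b) * n_reference / total_tokens

    def cell(observed, expected):
        # x ln x -> 0 as x -> 0, so a zero cell contributes nothing.
        if observed == 0:
            return 0.0
        return observed * math.log(observed / expected)

    # Clamp tiny negative rounding errors so the result stays non-negative.
    return max(0.0, 2.0 * (cell(a, e_focus) + cell(b, e_reference)))


print(round(log_likelihood_sketch(620, 267, 1_017_879, 1_007_532), 1))  # 140.9
```

An exclusive such as `log_likelihood_sketch(10, 0, 1000, 1000)` stays finite because the zero cell is skipped rather than passed to the logarithm.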
log_ratio(a, b, n_focus, n_reference, floor=ZERO_CELL_FLOOR)
Log ratio (Hardie 2014) — the signed effect size, in log2 units.
The base-2 logarithm of the relative-frequency ratio between C and R. A
value of +1 means the type is twice as frequent (per token) in the focus
corpus; -1 means twice as frequent in the reference corpus. Brezina's
convention is to filter by log-likelihood significance first, then sort
keywords by log ratio.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
| `floor` | `float` | Count substituted for a zero cell so exclusives stay finite rather than diverging to +/- infinity. | `ZERO_CELL_FLOOR` |
Returns:
| Type | Description |
|---|---|
float
|
|
float
|
|
Raises:
| Type | Description |
|---|---|
| `ValueError` | If either corpus total is non-positive. |
Contract
- Sign convention: positive when the type leans to the focus corpus, negative when it leans to the reference corpus.
- Swapping (a, n_focus) with (b, n_reference) negates the result.
- Finite for exclusives (one of `a`, `b` zero) thanks to `floor`.
Examples:
>>> round(log_ratio(620, 267, 1_017_879, 1_007_532), 2)
1.2
>>> round(log_ratio(10, 2, 1000, 1000), 2)
2.32
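A minimal sketch of the computation, assuming the floor replaces whichever of `a` or `b` is zero before the rates are formed. The name `log_ratio_sketch` is hypothetical, and the floor default of 0.5 simply mirrors `ZERO_CELL_FLOOR` as documented above.

```python
import math


def log_ratio_sketch(a, b, n_focus, n_reference, floor=0.5):
    """log2 of the relative-frequency ratio, with zero cells floored."""
    if n_focus <= 0 or n_reference <= 0:
        raise ValueError("corpus totals must be positive")
    # Assumption of this sketch: a zero count is replaced by `floor`.
    a_adj = a if a > 0 else floor
    b_adj = b if b > 0 else floor
    return math.log2((a_adj / n_focus) / (b_adj / n_reference))


print(round(log_ratio_sketch(10, 2, 1000, 1000), 2))   # 2.32
print(round(log_ratio_sketch(2, 10, 1000, 1000), 2))   # -2.32
```

The second call swaps the two corpora and yields the negated value, matching the sign convention in the contract.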
percent_diff(a, b, n_focus, n_reference, floor=ZERO_CELL_FLOOR)
%DIFF (Gabrielatos & Marchi 2012) — percentage change in relative frequency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
| `floor` | `float` | Count substituted for a zero cell so the reference rate is never exactly zero (which would make the percentage undefined). | `ZERO_CELL_FLOOR` |
Returns:
| Type | Description |
|---|---|
float
|
|
float
|
focus rate is double the reference rate. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If either corpus total is non-positive. |
Contract
- Positive when the type leans to the focus corpus, negative otherwise.
- The reference rate is floored, so the result is always finite.
Examples:
>>> round(percent_diff(620, 267, 1_017_879, 1_007_532), 2)
129.85
>>> percent_diff(10, 10, 1000, 1000)
0.0
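%DIFF is the difference between the two relative frequencies expressed as a percentage of the reference rate. The sketch below assumes that only the reference count `b` is floored, which is one reading of the contract; the name `percent_diff_sketch` is hypothetical and the library may treat zero cells differently.

```python
def percent_diff_sketch(a, b, n_focus, n_reference, floor=0.5):
    """Percentage change of the focus rate relative to the reference rate."""
    if n_focus <= 0 or n_reference <= 0:
        raise ValueError("corpus totals must be positive")
    # Assumption of this sketch: only a zero reference count is floored.
    b_adj = b if b > 0 else floor
    rate_focus = a / n_focus
    rate_reference = b_adj / n_reference
    return (rate_focus - rate_reference) / rate_reference * 100.0


print(round(percent_diff_sketch(620, 267, 1_017_879, 1_007_532), 1))  # 129.8
print(percent_diff_sketch(10, 10, 1000, 1000))                        # 0.0
```

For non-zero counts this is the log ratio on a different scale: a log ratio of `x` corresponds to a %DIFF of `(2 ** x - 1) * 100`.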
significance_band(statistic)
Map a log-likelihood / chi-square statistic to its significance band.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `statistic` | `float` | A log-likelihood or chi-square value (1 d.f.). | required |
Returns:
| Type | Description |
|---|---|
Significance
|
One of |
Significance
|
|
Significance
|
statistic reaches. |
Contract
- Monotone: a larger statistic never maps to a weaker band.
- Thresholds are the chi-square critical values at 1 d.f.: 3.84, 6.63, 10.83, 15.13.
Examples:
>>> significance_band(140.87)
'p0001'
>>> significance_band(3.83)
'ns'
>>> significance_band(6.63)
'p01'
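The banding can be sketched as a scan over the critical values in ascending order, keeping the strictest band whose threshold the statistic reaches. This sketch returns plain strings, as the doctest output above suggests; the library's `Significance` type is not reproduced, and `significance_band_sketch` is a hypothetical name.

```python
# Same values as CHI2_CRITICAL above (chi-square, 1 d.f.).
CRITICAL = {"p05": 3.84, "p01": 6.63, "p001": 10.83, "p0001": 15.13}


def significance_band_sketch(statistic):
    """Return the strictest band whose threshold the statistic reaches."""
    band = "ns"  # not significant
    for name, threshold in sorted(CRITICAL.items(), key=lambda item: item[1]):
        if statistic >= threshold:  # thresholds are inclusive
            band = name
    return band


print(significance_band_sketch(140.87))  # p0001
print(significance_band_sketch(3.83))    # ns
print(significance_band_sketch(6.63))    # p01
```

The inclusive comparison is inferred from the third documented example, where a statistic of exactly 6.63 maps to `'p01'`.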
simple_maths(a, b, n_focus, n_reference, k=SMP_DEFAULT_K)
Simple Maths parameter (Kilgarriff 2009) — ratio with a built-in filter.
Compares relative frequencies per million words after adding a constant
k to each, which both avoids division by zero and doubles as a
frequency filter: larger k suppresses low-frequency words.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `int` | Frequency of the type in the focus corpus C. | required |
| `b` | `int` | Frequency of the type in the reference corpus R. | required |
| `n_focus` | `int` | Token total of the focus corpus C. | required |
| `n_reference` | `int` | Token total of the reference corpus R. | required |
| `k` | `float` | Smoothing constant / frequency filter, typically 1, 10, 100, or 1000. | `SMP_DEFAULT_K` |
Returns:
| Type | Description |
|---|---|
float
|
|
float
|
million words. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If either corpus total is non-positive. |
Contract
- Always positive and finite (`k > 0` removes the zero-denominator case), so exclusives need no special handling.
- A value of 1.0 means equal smoothed rates in the two corpora.
Examples:
>>> round(simple_maths(620, 267, 1_017_879, 1_007_532), 2)
1.94
>>> simple_maths(5, 5, 1000, 1000)
1.0
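A minimal sketch of the Simple Maths ratio, assuming both rates are expressed per million tokens before `k` is added. The name `simple_maths_sketch` is hypothetical, and the default of 100.0 mirrors `SMP_DEFAULT_K` as documented above.

```python
def simple_maths_sketch(a, b, n_focus, n_reference, k=100.0):
    """Ratio of per-million frequencies after adding k to each."""
    if n_focus <= 0 or n_reference <= 0:
        raise ValueError("corpus totals must be positive")
    per_million_focus = a * 1_000_000 / n_focus
    per_million_reference = b * 1_000_000 / n_reference
    return (per_million_focus + k) / (per_million_reference + k)


# A rare exclusive (2 hits vs 0 in corpora of one million tokens each):
print(simple_maths_sketch(2, 0, 1_000_000, 1_000_000, k=1))    # 3.0
print(simple_maths_sketch(2, 0, 1_000_000, 1_000_000, k=100))  # 1.02
```

The two calls show the filtering effect: with `k=1` the rare exclusive scores 3.0, while with `k=100` it is pulled down to 1.02, close to the "equal rates" value of 1.0.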