
Unintuitive frequency calculation

Hi Mike,

I implemented a token-pair vocabulary builder in C (to speed up tokenizer training) and tested it by comparing the frequency stats generated by your algorithm with mine. Here's what I found (my stats are in the c-vocabulary-generator (cvocgen) column):

=== Top 10 Common Tokens by Frequency ===
Token               APE Freq     cvocgen Freq    Ratio (APE/cvocgen)
---------------------------------------------------------------------
[C][=C]            *5202105*          421672     12.3369
[C]                *3564290*         3564290      1.0000
[Branch1][C][O]     2043565           407667      5.0128
[C][O]              1413892           590456      2.3946
[C][C]              1214318          1184528      1.0251
[O]                 1172384          1172384      1.0000
[Branch1]           1100707          1100707      1.0000
[=C]                 793134           793134      1.0000
[Ring1]              687263           687263      1.0000
[C@H1]               455998           455998      1.0000

=== Examples of Tokens Only in APETokenizer ===
[Branch1][C]: 1246517

=== Examples of Tokens Only in cvocgen ===
[=Branch1][C]: 271128

What I have trouble with is that your algorithm reports a higher frequency for a merged token than for one of its constituent tokens: [C][=C] comes out at 5,202,105 while [C] alone is only 3,564,290. Since every occurrence of the pair [C][=C] contains an occurrence of [C], I would expect the pair count to be at most the single-token count. This does not make any sense to me at all :(. Can you explain the reasoning behind this approach?
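
For reference, here is a minimal toy sketch of the counting scheme I have in mind (not the actual cvocgen code, just an illustration of my assumption): one pass over each tokenized sequence, counting every token and every adjacent pair once. Under that scheme a pair can never outnumber its constituent tokens.

```c
/* Toy illustration only, not the real cvocgen code: count single tokens
 * and adjacent (overlapping) token pairs in one pass over a sequence.
 * Every occurrence of a pair like [C][=C] contains an occurrence of [C],
 * so with this scheme count(pair) <= count(constituent token). */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 64
#define MAX_KEY     64

static char keys[MAX_ENTRIES][MAX_KEY];
static long counts[MAX_ENTRIES];
static int  n_entries = 0;

/* Increment the count for a key, adding it on first sight. */
static void bump(const char *key)
{
    for (int i = 0; i < n_entries; i++) {
        if (strcmp(keys[i], key) == 0) { counts[i]++; return; }
    }
    if (n_entries < MAX_ENTRIES) {
        strncpy(keys[n_entries], key, MAX_KEY - 1);
        counts[n_entries++] = 1;
    }
}

int main(void)
{
    /* A toy SELFIES-like sequence, already split into tokens. */
    const char *seq[] = { "[C]", "[=C]", "[C]", "[=C]", "[O]" };
    const int   len   = (int)(sizeof(seq) / sizeof(seq[0]));
    char pair[MAX_KEY];

    for (int i = 0; i < len; i++) {
        bump(seq[i]);                                   /* single token   */
        if (i + 1 < len) {
            snprintf(pair, sizeof(pair), "%s%s", seq[i], seq[i + 1]);
            bump(pair);                                 /* adjacent pair  */
        }
    }

    for (int i = 0; i < n_entries; i++)
        printf("%-12s %ld\n", keys[i], counts[i]);
    return 0;
}
```

On this toy input it prints [C]: 2 and [C][=C]: 2, i.e. the pair count matches but never exceeds the single-token count, which is why the APE numbers above surprise me.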

Kind regards, Alex.

PS The C-code is in this repo: https://github.com/cloudcell/cvocgen