Unintuitive frequency calculation
Hi Mike,
I implemented a token-pair vocabulary builder in C (to speed up tokenizer training) and tested it by comparing the frequency stats produced by your algorithm against mine. Here's what I found (my stats are in the c-vocabulary-generator (cvocgen) column):
=== Top 10 Common Tokens by Frequency ===
Token APE Freq cvocgen Freq Ratio (APE/cvocgen)
------------------------------------------------------------
[C][=C] *5202105* 421672 12.3369
[C] *3564290* 3564290 1.0000
[Branch1][C][O] 2043565 407667 5.0128
[C][O] 1413892 590456 2.3946
[C][C] 1214318 1184528 1.0251
[O] 1172384 1172384 1.0000
[Branch1] 1100707 1100707 1.0000
[=C] 793134 793134 1.0000
[Ring1] 687263 687263 1.0000
[C@H1] 455998 455998 1.0000
=== Examples of Tokens Only in APETokenizer ===
[Branch1][C]: 1246517
=== Examples of Tokens Only in cvocgen ===
[=Branch1][C]: 271128
What I have trouble with is that your algo reports a higher frequency for a merged token than for one of its constituent tokens: [C][=C] is counted 5,202,105 times, yet [=C] alone appears only 793,134 times. Since every occurrence of the pair contains one occurrence of each member, the pair's frequency should never exceed either member's. This does not make any sense to me at all :(. Can you explain the reason behind this approach?
Kind regards, Alex.
PS The C-code is in this repo: https://github.com/cloudcell/cvocgen