Quick Search:
Prog. Theor. Phys. Supplement No.194 (2012) pp. 202-209
Zipf's Law and Heaps' Law Can Predict the Size of Potential Words
Yukie Sano,1
Hideki Takayasu2,3 and
Misako Takayasu4
1College of Science and Technology, Nihon University,
Funabashi 274-8501, Japan
2Sony Computer Science Laboratories, 3-14-13 Higashi-Gotanda,
Tokyo 141-0022, Japan
3Meiji Institute for Advanced Study of Mathematical Sciences,
Kawasaki 214-8571, Japan
4Department of Computational Intelligence and Systems Science,
Interdisciplinary Graduate School of Science and Engineering,
Tokyo Institute of Technology, Yokohama 226-8502, Japan
Abstract:
We confirm Zipf's law and Heaps' law using various types of
documents such as
literary works, blogs, and computer programs. Independent of the document type,
the exponents of Zipf' law are estimated to be approximately 1, whereas Heaps'
exponents appear to be dependent on the observation size, and the estimated
values are scattered around 0.5. By definition, randomly shuffled documents
reproduce Zipf's law and Heaps' law.
However, artificially generated documents using the empirically observed Zipf's
law and number of distinct
words do not reproduce Heaps' law. We demonstrate that Heaps' law holds for
artificial documents
in which a certain number of distinct words are added to empirically observed
distinct words.
This suggests that the number of potential distinct words considered in the
creation of a given document can be predicted.
URL :
http://ptp.ipap.jp/link?PTPS/194/202/
DOI : 10.1143/PTPS.194.202
References:
- G. K. Zipf, Human behavior and the principle of least effort (Addison-Wesley, Cambridge, 1949).
-
K. Okuyama, M. Takayasu and H. Takayasu, Physica A 269 (1999), 125[CrossRef].
-
C. Furusawa and K. Kaneko, Phys. Rev. Lett. 90 (2003), 088102[APS].
- M. E. J. Newman, Contemporary Physics 46 (2005), 323.
- A. Gelbukh and G. Sidorov, Lecture Notes in Computer Science 2004 (2001), 332.
- H. Zhang, Information Processing and Management 45 (2009), 477.
- H. S. Heaps, Information Retrieval: Computational and Theoretical Aspects (Academic Press, Orlando, 1978).
- R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval (ACM Press, 1999).
- M. D. Araújo, G. Navarro and N. Ziviani, in Proc. Fourth South American Workshop on String Processing, Carleton University Press International Informatics Series 8 (1997), 2.
- P. Baldi, P. Frasconi and P. Smyth, Modeling the internet and the web: Probabilistic methods and algorithms (Wiley, 2003).
- R. Baeza-Yates and G. Navarro, J. of the American Society for Information Science 51 (2000), 69
- L. Lü, Z.-K. Zhang and T. Zhou, PLoS ONE 5 (2010), e14139.
- D. C. van Leijenhorst and Th. P. van der Weide, Inf. Sci. 170 (2005), 263.
- C. Cattuto, V. Loreto and L. Pietronero, Proc. Natl. Acad. Sci. 104 (2007), 1461.
- C. Cattuto, A. Barrat, A. Baldassarri, G. Schehr and V. Loreto, Proc. Natl. Acad. Sci. 106 (2009), 10511.
- O. Arrhenius, J. Ecol. 9 (1921), 95.
- H. Irie, K. Tokita and H. Habara, Publ. RIMS, Kyoto Univ. 1432 (2005), 116 (in Japanese).
- H. Irie and K. Tokita, q-bio/0609012.
- R. Colwell and J. Coddington, Philos. Trans. R. Soc. London B 345 (1994), 101.
-
M. Arumugam, J. Raes, E. Pelletier, D. Le Paslier, T. Yamada, D. R. Mende, G. R. Fernandes et al., Nature 473 (2011), 174[CrossRef].
- Digital libraries providing public domain books.
Gutenberg Project, http://www.gutenberg.org/
and Aozora bunko, http://www.aozora.gr.jp/
- Extension to the Python programming language.
NumPy (Version1-6-1), http://numpy.scipy.org/
- List of Japanese common new words.
Wikipedia titles, http://ja.wikipedia.org
and Hatena keywords, http://d.hatena.ne.jp/keyword/