Quick Search:
Author: Title/Abstract: Vol./No: Page:

Prog. Theor. Phys. Supplement No.194 (2012) pp. 202-209

[ Full Text PDF : FREE ACCESS (465K) ]

Zipf's Law and Heaps' Law Can Predict the Size of Potential Words

Yukie Sano,1 Hideki Takayasu2,3 and Misako Takayasu4

1College of Science and Technology, Nihon University, Funabashi 274-8501, Japan
2Sony Computer Science Laboratories, 3-14-13 Higashi-Gotanda, Tokyo 141-0022, Japan
3Meiji Institute for Advanced Study of Mathematical Sciences, Kawasaki 214-8571, Japan
4Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama 226-8502, Japan

Abstract:

We confirm Zipf's law and Heaps' law using various types of documents such as literary works, blogs, and computer programs. Independent of the document type, the exponents of Zipf' law are estimated to be approximately 1, whereas Heaps' exponents appear to be dependent on the observation size, and the estimated values are scattered around 0.5. By definition, randomly shuffled documents reproduce Zipf's law and Heaps' law. However, artificially generated documents using the empirically observed Zipf's law and number of distinct words do not reproduce Heaps' law. We demonstrate that Heaps' law holds for artificial documents in which a certain number of distinct words are added to empirically observed distinct words. This suggests that the number of potential distinct words considered in the creation of a given document can be predicted.


URL : http://ptp.ipap.jp/link?PTPS/194/202/
DOI : 10.1143/PTPS.194.202

[ Full Text PDF : FREE ACCESS (465K) ] Citation:


References:

  1. G. K. Zipf, Human behavior and the principle of least effort (Addison-Wesley, Cambridge, 1949).
  2. K. Okuyama, M. Takayasu and H. Takayasu, Physica A 269 (1999), 125[CrossRef].
  3. C. Furusawa and K. Kaneko, Phys. Rev. Lett. 90 (2003), 088102[APS].
  4. M. E. J. Newman, Contemporary Physics 46 (2005), 323.
  5. A. Gelbukh and G. Sidorov, Lecture Notes in Computer Science 2004 (2001), 332.
  6. H. Zhang, Information Processing and Management 45 (2009), 477.
  7. H. S. Heaps, Information Retrieval: Computational and Theoretical Aspects (Academic Press, Orlando, 1978).
  8. R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval (ACM Press, 1999).
  9. M. D. Araújo, G. Navarro and N. Ziviani, in Proc. Fourth South American Workshop on String Processing, Carleton University Press International Informatics Series 8 (1997), 2.
  10. P. Baldi, P. Frasconi and P. Smyth, Modeling the internet and the web: Probabilistic methods and algorithms (Wiley, 2003).
  11. R. Baeza-Yates and G. Navarro, J. of the American Society for Information Science 51 (2000), 69
  12. L. Lü, Z.-K. Zhang and T. Zhou, PLoS ONE 5 (2010), e14139.
  13. D. C. van Leijenhorst and Th. P. van der Weide, Inf. Sci. 170 (2005), 263.
  14. C. Cattuto, V. Loreto and L. Pietronero, Proc. Natl. Acad. Sci. 104 (2007), 1461.
  15. C. Cattuto, A. Barrat, A. Baldassarri, G. Schehr and V. Loreto, Proc. Natl. Acad. Sci. 106 (2009), 10511.
  16. O. Arrhenius, J. Ecol. 9 (1921), 95.
  17. H. Irie, K. Tokita and H. Habara, Publ. RIMS, Kyoto Univ. 1432 (2005), 116 (in Japanese).
  18. H. Irie and K. Tokita, q-bio/0609012.
  19. R. Colwell and J. Coddington, Philos. Trans. R. Soc. London B 345 (1994), 101.
  20. M. Arumugam, J. Raes, E. Pelletier, D. Le Paslier, T. Yamada, D. R. Mende, G. R. Fernandes et al., Nature 473 (2011), 174[CrossRef].
  21. Digital libraries providing public domain books.
    Gutenberg Project, http://www.gutenberg.org/
    and Aozora bunko, http://www.aozora.gr.jp/
  22. Extension to the Python programming language.
    NumPy (Version1-6-1), http://numpy.scipy.org/
  23. List of Japanese common new words.
    Wikipedia titles, http://ja.wikipedia.org
    and Hatena keywords, http://d.hatena.ne.jp/keyword/