Dissertations and Theses @ UNI

Availability

Open Access Thesis

Keywords

Collocation (Linguistics); Natural language processing (Computer science); Academic theses

Abstract

Studies in Natural Language Processing are often dedicated to defining the lexicon and grammar rules of individual languages. It is a common understanding that virtually all statements that a language user encounters are unique constructions that have never been seen before. Because of this, there has been very little modern research regarding the use of repeated language when attempting to provide computers with language processing functionality. It seems reasonable that most statements we use on a daily basis are completely new when they are evaluated in their entirety. However, it is possible that smaller portions of these statements are reused. I propose that there may be collocations, or frequently repeated groups of words, that are reused by language users over time. If people do tend to use frequent collocations in their language, this would provide a new area of research for the advancement of NLP solutions. To explore the frequency of repeated collocations, I developed an application that read portions of English and German novels. This application was then given different portions of these novels and identified the frequency of recurring sentence fragments. I analyzed the results of the tests in order to determine whether collocations exist. After performing multiple tests in both English and German, I found that most excerpts from the given novels contained at least one collocation that could be recognized from the control portion of the novels. After reading the beginning of a novel, the application was able to determine which set of additional excerpts came from the same novel and which set came from a different novel. The application was able to make this distinction based entirely on the existence of collocations.
The results of my experiment show that some language is indeed repeated by language users, and it would be worthwhile to explore any opportunities to use this information in future NLP applications.
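
As a rough illustration of the kind of experiment the abstract describes, the Python sketch below learns repeated word sequences from a control text and counts how many of them recur in a later excerpt. It is not the thesis's implementation, which this record does not describe in detail. The function names, the fragment length `n`, and the `min_count` threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into word tokens; \w is Unicode-aware for str,
    # so German characters such as umlauts are kept.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Every run of n consecutive tokens, as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def learn_collocations(control_text, n=3, min_count=2):
    # Assumption: a "collocation" is any n-word sequence that occurs
    # at least min_count times in the control portion.
    counts = Counter(ngrams(tokenize(control_text), n))
    return {gram for gram, c in counts.items() if c >= min_count}


def count_known_collocations(excerpt, collocations, n=3):
    # Number of n-word sequences in the excerpt already seen as collocations.
    return sum(1 for gram in ngrams(tokenize(excerpt), n) if gram in collocations)
```

Under these assumptions, an excerpt from the same novel would be expected to score higher than an excerpt from a different one, which is the distinction the abstract reports. The values of `n` and `min_count` would need tuning on real texts.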

Year of Submission

2011

Degree Name

Master of Science

Department

Department of Computer Science

First Advisor

Eugene Wallingford

Second Advisor

J. Ben Schafer

Third Advisor

Jack Yates

Comments

If you are the rightful copyright holder of this thesis and wish to have it removed from the Open Access Collection, please submit a request to scholarworks@uni.edu and include clear identification of the work, preferably with URL.

Date Original

2011

Object Description

1 PDF file (47 leaves)

Language

en

File Format

application/pdf
