Dissertations and Theses @ UNI
Availability
Open Access Thesis
Keywords
Collocation (Linguistics); Natural language processing (Computer science); Academic theses
Abstract
Studies in Natural Language Processing are often dedicated to defining the lexicon and grammar rules of individual languages. It is a common understanding that virtually all statements that a language user encounters are unique constructions that have never been seen before. Because of this, there has been very little modern research regarding the use of repeated language when attempting to provide computers with language processing functionality. It seems reasonable that most statements we use on a daily basis are completely new when they are evaluated in their entirety. However, it is possible that smaller portions of these statements are reused. I propose that there may be collocations, or frequently repeated groups of words, that are reused by language users over time. If it can be found that people have a tendency to use frequent collocations in their language, this would provide a new area of research for the advancement of NLP solutions. To explore the frequency of repeated collocations, I developed an application that read portions of English and German novels. This application was then provided different portions of these novels and identified the frequency of reoccurring sentence fragments. I analyzed the results of the tests in order to determine if collocations exist. After performing multiple tests in both English and German, it can be determined that most excerpts from the given novels contained at least one collocation that could be recognized from the control portion of the novels. After reading the beginning of the novel, the application was able to determine which set of additional excerpts comes from the same novel and which set comes from an alternate novel. The application was able to make this distinction based entirely on the existence of collocations.
The results of my experiment show that some language is indeed repeated by language users, and it would be worthwhile to explore any opportunities to use this information in future NLP applications.
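The kind of comparison described in the abstract can be illustrated with a minimal sketch. This is not the thesis's application: the abstract does not state how fragments were tokenized, how long they were, or how matches were scored. The sketch assumes that a "sentence fragment" is a run of three consecutive words and that an excerpt counts as a match if it shares at least one such run with the control text. All function names and the fragment length are hypothetical.

```python
import re
from typing import List, Set, Tuple

Fragment = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Lowercase and keep runs of Unicode letters, so German umlauts and
    # the sharp s are preserved. This tokenization is an assumption.
    return re.findall(r"[^\W\d_]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[Fragment]:
    # Every run of n consecutive words; empty if the text is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_fragments(control: str, excerpt: str, n: int = 3) -> Set[Fragment]:
    # Fragments that occur in both the control text and the excerpt.
    return ngrams(tokenize(control), n) & ngrams(tokenize(excerpt), n)


def fraction_with_match(control: str, excerpts: List[str], n: int = 3) -> float:
    # Share of excerpts containing at least one fragment seen in the control.
    if not excerpts:
        return 0.0
    control_grams = ngrams(tokenize(control), n)
    hits = sum(1 for e in excerpts if control_grams & ngrams(tokenize(e), n))
    return hits / len(excerpts)


control = "It was the best of times, it was the worst of times."
same_novel = ["He said it was the end.", "Nothing alike here."]
print(shared_fragments(control, same_novel[0]))
print(fraction_with_match(control, same_novel))
```

Under these assumptions, a set of excerpts from the same novel as the control text would be expected to yield a higher value from `fraction_with_match` than a set drawn from a different novel, which is one simple way to make the distinction the abstract reports.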
Year of Submission
2011
Degree Name
Master of Science
Department
Department of Computer Science
First Advisor
Eugene Wallingford
Second Advisor
J. Ben Schafer
Third Advisor
Jack Yates
Date Original
2011
Object Description
1 PDF file (47 leaves)
Copyright
©2011 Patrick Anthony Burke
Language
en
File Format
application/pdf
Recommended Citation
Burke, Patrick Anthony, "Examining the Use of Collocations in Natural Language Processing" (2011). Dissertations and Theses @ UNI. 2263.
https://scholarworks.uni.edu/etd/2263
Comments
If you are the rightful copyright holder of this thesis and wish to have it removed from the Open Access Collection, please submit a request to scholarworks@uni.edu and include clear identification of the work, preferably with URL.