Natural language annotation for machine learning

Pustejovsky, J.

Natural language annotation for machine learning / J. Pustejovsky. - Mumbai : Shroff Publishers & Distributors, 2013. - xiv, 324 p.

Includes bibliographical references (p. 306-315) and index.

Machine-generated contents note: The Importance of Language Annotation --
The Layers of Linguistic Description --
What Is Natural Language Processing? --
A Brief History of Corpus Linguistics --
What Is a Corpus? --
Early Use of Corpora --
Corpora Today --
Kinds of Annotation --
Language Data and Machine Learning --
Classification --
Clustering --
Structured Pattern Induction --
The Annotation Development Cycle --
Model the Phenomenon --
Annotate with the Specification --
Train and Test the Algorithms over the Corpus --
Evaluate the Results --
Revise the Model and Algorithms --
Summary --
Defining Your Goal --
The Statement of Purpose --
Refining Your Goal: Informativity Versus Correctness --
Background Research --
Language Resources --
Organizations and Conferences --
NLP Challenges --
Assembling Your Dataset --
The Ideal Corpus: Representative and Balanced --
Collecting Data from the Internet --
Eliciting Data from People --
The Size of Your Corpus --
Existing Corpora --
Distributions Within Corpora --
Summary --
Basic Probability for Corpus Analytics --
Joint Probability Distributions --
Bayes Rule --
Counting Occurrences --
Zipf's Law --
N-grams --
Language Models --
Summary --
Some Example Models and Specs --
Film Genre Classification --
Adding Named Entities --
Semantic Roles --
Adopting (or Not Adopting) Existing Models --
Creating Your Own Model and Specification: Generality Versus Specificity --
Using Existing Models and Specifications --
Using Models Without Specifications --
Different Kinds of Standards --
ISO Standards --
Community-Driven Standards --
Other Standards Affecting Annotation --
Summary --
Metadata Annotation: Document Classification --
Unique Labels: Movie Reviews --
Multiple Labels: Film Genres --
Text Extent Annotation: Named Entities --
Inline Annotation --
Stand-off Annotation by Tokens --
Stand-off Annotation by Character Location --
Linked Extent Annotation: Semantic Roles --
ISO Standards and You --
Summary --
The Infrastructure of an Annotation Project --
Specification Versus Guidelines --
Be Prepared to Revise --
Preparing Your Data for Annotation --
Metadata --
Preprocessed Data --
Splitting Up the Files for Annotation --
Writing the Annotation Guidelines --
Example 1: Single Labels-Movie Reviews --
Example 2: Multiple Labels-Film Genres --
Example 3: Extent Annotations-Named Entities --
Example 4: Link Tags-Semantic Roles --
Annotators --
Choosing an Annotation Environment --
Evaluating the Annotations --
Cohen's Kappa (κ) --
Fleiss's Kappa (κ) --
Interpreting Kappa Coefficients --
Calculating κ in Other Contexts --
Creating the Gold Standard (Adjudication) --
Summary --
What Is Learning? --
Defining Our Learning Task --
Classifier Algorithms --
Decision Tree Learning --
Gender Identification --
Naive Bayes Learning --
Maximum Entropy Classifiers --
Other Classifiers to Know About --
Sequence Induction Algorithms --
Clustering and Unsupervised Learning --
Semi-Supervised Learning --
Matching Annotation to Algorithms --
Testing Your Algorithm --
Evaluating Your Algorithm --
Confusion Matrices --
Calculating Evaluation Scores --
Interpreting Evaluation Scores --
Problems That Can Affect Evaluation --
Dataset Is Too Small --
Algorithm Fits the Development Data Too Well --
Too Much Information in the Annotation --
Final Testing Scores --
Summary --
Revising Your Project --
Corpus Distributions and Content --
Model and Specification --
Annotation --
Training and Testing --
Reporting About Your Work --
About Your Corpus --
About Your Model and Specifications --
About Your Annotation Task and Annotators --
About Your ML Algorithm --
About Your Revisions --
Summary --
The Goal of TimeML --
Related Research --
Building the Corpus --
Model: Preliminary Specifications --
Times --
Signals --
Events --
Links --
Annotation: First Attempts --
Model: The TimeML Specification Used in TimeBank --
Time Expressions --
Events --
Signals --
Links --
Confidence --
Annotation: The Creation of TimeBank --
TimeML Becomes ISO-TimeML --
Modeling the Future: Directions for TimeML --
Narrative Containers --
Expanding TimeML to Other Domains --
Event Structures --
Summary --
The TARSQI Components --
GUTime: Temporal Marker Identification --
EVITA: Event Recognition and Classification --
GUTenLINK --
Slinket --
SputLink --
Machine Learning in the TARSQI Components --
Improvements to the TTK --
Structural Changes --
Improvements to Temporal Entity Recognition: BTime --
Temporal Relation Identification --
Temporal Relation Validation --
Temporal Relation Visualization --
TimeML Challenges: TempEval-2 --
TempEval-2: System Summaries --
Overview of Results --
Future of the TTK --
New Input Formats --
Narrative Containers/Narrative Times --
Medical Documents --
Cross-Document Analysis --
Summary --
Crowdsourcing Annotation --
Amazon's Mechanical Turk --
Games with a Purpose (GWAP) --
User-Generated Content --
Handling Big Data --
Boosting --
Active Learning --
Semi-Supervised Learning --
NLP Online and in the Cloud --
Distributed Computing --
Shared Language Resources --
Shared Language Applications --
And Finally ... --
Appendices.

Create your own natural language training corpus for machine learning. This example-driven book walks you through the annotation cycle, from selecting an annotation task and creating the annotation specification to designing the guidelines, creating a "gold standard" corpus, and then beginning the actual data creation through the annotation process.

9781449306663 (pbk.)
9789351103738


Natural language processing (Computer science)
Corpora (Linguistics)
Machine learning.

006.35 / PUS
