TY - BOOK
AU - Pustejovsky,J.
AU - Stubbs,Amber
TI - Natural language annotation for machine learning
SN - 9781449306663 (pbk.)
U1 - 006.35
PY - 2013///
CY - Mumbai
PB - Shroff Publishers & Distr,
KW - Natural language processing (Computer science)
KW - Corpora (Linguistics)
KW - Machine learning
N1 - Includes bibliographical references (p.306-315) and index; Machine generated contents note: The Importance of Language Annotation -- The Layers of Linguistic Description -- What Is Natural Language Processing? -- A Brief History of Corpus Linguistics -- What Is a Corpus? -- Early Use of Corpora -- Corpora Today -- Kinds of Annotation -- Language Data and Machine Learning -- Classification -- Clustering -- Structured Pattern Induction -- The Annotation Development Cycle -- Model the Phenomenon -- Annotate with the Specification -- Train and Test the Algorithms over the Corpus -- Evaluate the Results -- Revise the Model and Algorithms -- Summary -- Defining Your Goal -- The Statement of Purpose -- Refining Your Goal: Informativity Versus Correctness -- Background Research -- Language Resources -- Organizations and Conferences -- NLP Challenges -- Assembling Your Dataset -- The Ideal Corpus: Representative and Balanced -- Collecting Data from the Internet -- Eliciting Data from People -- The Size of Your Corpus -- Existing Corpora -- Distributions Within Corpora -- Summary -- Basic Probability for Corpus Analytics -- Joint Probability Distributions -- Bayes Rule -- Counting Occurrences -- Zipf's Law -- N-grams -- Language Models -- Summary -- Some Example Models and Specs -- Film Genre Classification -- Adding Named Entities -- Semantic Roles -- Adopting (or Not Adopting) Existing Models -- Creating Your Own Model and Specification: Generality Versus Specificity -- Using Existing Models and Specifications -- Using Models Without Specifications -- Different Kinds of Standards -- ISO Standards -- Community-Driven Standards -- Other Standards Affecting Annotation -- Summary
-- Metadata Annotation: Document Classification -- Unique Labels: Movie Reviews -- Multiple Labels: Film Genres -- Text Extent Annotation: Named Entities -- Inline Annotation -- Stand-off Annotation by Tokens -- Stand-off Annotation by Character Location -- Linked Extent Annotation: Semantic Roles -- ISO Standards and You -- Summary -- The Infrastructure of an Annotation Project -- Specification Versus Guidelines -- Be Prepared to Revise -- Preparing Your Data for Annotation -- Metadata -- Preprocessed Data -- Splitting Up the Files for Annotation -- Writing the Annotation Guidelines -- Example 1: Single Labels-Movie Reviews -- Example 2: Multiple Labels-Film Genres -- Example 3: Extent Annotations-Named Entities -- Example 4: Link Tags-Semantic Roles -- Annotators -- Choosing an Annotation Environment -- Evaluating the Annotations -- Cohen's Kappa (K) -- Fleiss's Kappa (K) -- Interpreting Kappa Coefficients -- Calculating K in Other Contexts -- Creating the Gold Standard (Adjudication) -- Summary -- What Is Learning? 
-- Defining Our Learning Task -- Classifier Algorithms -- Decision Tree Learning -- Gender Identification -- Naive Bayes Learning -- Maximum Entropy Classifiers -- Other Classifiers to Know About -- Sequence Induction Algorithms -- Clustering and Unsupervised Learning -- Semi-Supervised Learning -- Matching Annotation to Algorithms -- Testing Your Algorithm -- Evaluating Your Algorithm -- Confusion Matrices -- Calculating Evaluation Scores -- Interpreting Evaluation Scores -- Problems That Can Affect Evaluation -- Dataset Is Too Small -- Algorithm Fits the Development Data Too Well -- Too Much Information in the Annotation -- Final Testing Scores -- Summary -- Revising Your Project -- Corpus Distributions and Content -- Model and Specification -- Annotation -- Training and Testing -- Reporting About Your Work -- About Your Corpus -- About Your Model and Specifications -- About Your Annotation Task and Annotators -- About Your ML Algorithm -- About Your Revisions -- Summary -- The Goal of TimeML -- Related Research -- Building the Corpus -- Model: Preliminary Specifications -- Times -- Signals -- Events -- Links -- Annotation: First Attempts -- Model: The TimeML Specification Used in TimeBank -- Time Expressions -- Events -- Signals -- Links -- Confidence -- Annotation: The Creation of TimeBank -- TimeML Becomes ISO-TimeML -- Modeling the Future: Directions for TimeML -- Narrative Containers -- Expanding TimeML to Other Domains -- Event Structures -- Summary -- The TARSQI Components -- GUTime: Temporal Marker Identification -- EVITA: Event Recognition and Classification -- GUTenLINK -- Slinket -- SputLink -- Machine Learning in the TARSQI Components -- Improvements to the TTK -- Structural Changes -- Improvements to Temporal Entity Recognition: BTime -- Temporal Relation Identification -- Temporal Relation Validation -- Temporal Relation Visualization -- TimeML Challenges: TempEval-2 -- TempEval-2: System Summaries -- Overview of Results -- Future of the TTK -- New 
Input Formats -- Narrative Containers/Narrative Times -- Medical Documents -- Cross-Document Analysis -- Summary -- Crowdsourcing Annotation -- Amazon's Mechanical Turk -- Games with a Purpose (GWAP) -- User-Generated Content -- Handling Big Data -- Boosting -- Active Learning -- Semi-Supervised Learning -- NLP Online and in the Cloud -- Distributed Computing -- Shared Language Resources -- Shared Language Applications -- And Finally ... -- Appendices
N2 - Create your own natural language training corpus for machine learning. This example-driven book walks you through the annotation cycle, from selecting an annotation task and creating the annotation specification to designing the guidelines, creating a "gold standard" corpus, and then beginning the actual data creation with the annotation process.
ER -