Analysis of Convolution Neural Network-Based Algorithm for Annotation Cost Estimation
Active learning is an important machine learning process in which the learner selectively queries users to label or annotate examples, with the goal of reducing the overall annotation cost. Although most existing convolutional neural network (CNN) work is based on the simple assumption that the annotation cost of each labeling query is the same or fixed, this assumption may not be realistic: in practice, the cost of annotation may vary between data instances. In this work, I study and present various annotation cost-sensitive active learning algorithms, which need to estimate the utility and cost of each query simultaneously.
The goal is to build or merge different machine learning models and reduce the total labeling cost of training the model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) into an efficient architecture that can work on different datasets, thereby reducing the overall labeling/annotation cost in supervised machine learning, and I aim to validate that the proposed method is generally superior to other annotation cost-sensitive algorithms.
Traditional machine learning algorithms use whatever labeled data is available to induce a model. By contrast, an active learning algorithm can select which instances are labeled and added to the training set. A learner typically starts with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data, and queries an oracle (e.g., a human annotator) for their labels. The objective is to reduce the overall annotation cost of training a model. To genuinely reduce the labeling costs required to build an accurate model, the notion of annotation cost must be better understood and incorporated into the active learning process.
Active learning is a machine learning setup that enables machines to strategically “ask questions” of a labeling oracle (Settles, 2010) in order to reduce the cost of labeling. Annotation costs have traditionally been measured by the number of examples, but it has been widely recognized that different examples may require different annotation efforts (Settles et al., 2008).
Vast quantities of unlabeled instances can be easily acquired in many machine learning scenarios, yet high-quality labels are expensive to obtain. For example, in fields such as medicine (Liu, 2004) or biology (King et al., 2004), a massive number of experiments and analyses are needed to label a single instance, while collecting samples is a relatively easy task. Cost-sensitive active learning can be set up in several ways. In (Margineantu, 2005), the labeling cost of every data instance is assumed to be known before querying, while in (Settles et al., 2008), the cost of a data instance can only be known after querying its label. In this work, I concentrate on the latter setup, which closely matches the real-world human annotation scenario. Existing works in this setup (Haertel et al., 2008) must therefore simultaneously estimate the utility and cost of each instance and select instances with high utility and low cost.
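As an illustration of selecting instances with high utility and low cost, the following minimal sketch ranks candidates by the ratio of estimated utility to estimated cost. The ratio criterion, the function name `select_query`, and the toy numbers are assumptions made for this example; the cited works may combine utility and cost differently.

```python
def select_query(candidates):
    """Return the id of the candidate with the highest utility-to-cost ratio.

    candidates: list of (instance_id, estimated_utility, estimated_cost).
    Both estimates are assumed to come from separate models; the ratio is
    an illustrative selection rule, not the one used in the cited works.
    """
    best_id, best_score = None, float("-inf")
    for instance_id, utility, cost in candidates:
        if cost <= 0:
            raise ValueError("estimated cost must be positive")
        score = utility / cost
        if score > best_score:
            best_id, best_score = instance_id, score
    return best_id


# Hypothetical pool: (id, estimated utility, estimated annotation seconds).
pool = [("doc_a", 0.9, 30.0), ("doc_b", 0.6, 10.0), ("doc_c", 0.2, 5.0)]
chosen = select_query(pool)
```

Here `doc_a` has the highest utility but is expensive to annotate, so the ratio rule prefers `doc_b`, which offers more utility per unit of cost.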
The idea of uncertainty sampling (Lewis and Gale, 1994) is to query the label of the data instance about which the classifier is most uncertain. For example, for a support vector machine (SVM), (Tong and Koller, 2001) propose querying the data instance closest to the decision boundary; (Holub et al., 2008) select the data instances to be queried based on the entropy of the label probabilities produced by a probabilistic classifier.
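The entropy-based variant of uncertainty sampling described above can be sketched as follows. The predicted probabilities are made-up values standing in for the output of a probabilistic classifier, and the function names are hypothetical.

```python
import math


def label_entropy(probabilities):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def most_uncertain(predictions):
    """Return the instance whose predicted label distribution has the highest entropy."""
    return max(predictions, key=lambda name: label_entropy(predictions[name]))


# Hypothetical classifier outputs for three unlabeled documents.
predictions = {
    "doc_a": [0.95, 0.03, 0.02],
    "doc_b": [0.40, 0.35, 0.25],
    "doc_c": [0.70, 0.20, 0.10],
}
query = most_uncertain(predictions)
```

The distribution for `doc_b` is the closest to uniform, so it is the instance the learner would send to the oracle.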
In (Kang et al., 2004), the data instances closest to each cluster’s centroid are queried before any other selection criterion is applied; (Huang et al., 2010) measure the representativeness of each data instance from both the cluster structure of the unlabeled data instances and the class assignments of the labeled data; and (Xu et al., 2003) cluster the data instances close to the SVM decision boundary and query the labels of the data instances close to the center of each cluster. In (Nguyen and Smeulders, 2004), clustering is used to estimate the label probabilities of unlabeled data instances, which is the key component in measuring the utility of a data instance.
Various works target annotation cost-sensitive active learning under different problem settings, such as the querying target (Greiner et al., 2002), the number of labelers (Donmez and Carbonell, 2008), the target classification problem (Yan and Huang, 2018), and the applied data domain (Vijayanarasimhan and Grauman, 2011).
To discuss cost-sensitive active learning with unknown costs, the first question to answer is whether the cost of human annotation can be estimated accurately. In (Arora et al., 2009), various unsupervised models are proposed to estimate the annotation cost for corpus datasets, while (Settles et al., 2008) show that the annotation cost can be estimated accurately using a supervised learning model.
2. Architecture and Proposed Procedure
Active learning is a widespread framework that can automatically select the most informative unlabeled examples for annotation. The motivation behind uncertainty sampling here is to find the unlabeled examples closest to the labeled data set (nearest neighbors) and use those neighbors to assign labels. To achieve this, I build a CNN-based document classifier for any input article with an unknown target label and use cosine similarity to find, for each unlabeled document, its most similar documents in the training set. This makes it reasonable to assume that the most similar document carries the same label, which lets the oracle work with a smaller set of inputs.
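The nearest-neighbor step described above can be sketched as follows, assuming each document has already been turned into a fixed-length vector (for example by the CNN). The vectors, labels, and the function name `suggest_label` are hypothetical and only serve to illustrate the cosine similarity lookup.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_label(unlabeled_vector, labeled_docs):
    """Return (label, similarity) of the labeled document most similar to the query.

    labeled_docs: list of (vector, label) pairs.
    """
    best_vector, best_label = max(
        labeled_docs,
        key=lambda item: cosine_similarity(unlabeled_vector, item[0]),
    )
    return best_label, cosine_similarity(unlabeled_vector, best_vector)


# Hypothetical document vectors with known labels.
labeled = [([1.0, 0.0, 0.2], "sports"), ([0.1, 1.0, 0.0], "politics")]
label, score = suggest_label([0.9, 0.1, 0.3], labeled)
```

In practice the similarity score could be compared against a threshold, so that only confident suggestions bypass the oracle; that threshold is a design choice not specified in the text.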
The architecture combines two major components. The first collects and preprocesses the data, defines the similarity measures, and develops the related models. The second captures unlabeled data and uses the different models to perform similarity checks. The output of the system is the set of neighboring documents/articles identified by the most effective models. I evaluate multiple models in this work to improve document similarity and thereby reduce the overall labeling effort. For similarity scores, I use Word2Vec. Based on the vector space model, two word2vec-based similarity measures (“Centroids” and “Word Mover’s Distance (WMD)”) will be studied and compared with the commonly used Latent Semantic Indexing (LSI). The 20 Newsgroups dataset will also be used to compare the document similarity measures.
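To make the “Centroids” measure concrete, the sketch below averages the word vectors of each document and compares the two averages with cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration; a real system would load trained word2vec embeddings instead.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.1, 0.2],
    "vote": [0.1, 0.9, 0.1],
    "election": [0.0, 0.8, 0.2],
}


def centroid(tokens, vectors):
    """Average the vectors of all tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid_similarity(doc_a, doc_b, vectors=TOY_VECTORS):
    """Similarity of two tokenized documents via their word-vector centroids."""
    return cosine(centroid(doc_a, vectors), centroid(doc_b, vectors))


sports_vs_sports = centroid_similarity(["goal", "match"], ["team", "match"])
sports_vs_politics = centroid_similarity(["goal", "match"], ["vote", "election"])
```

With these toy vectors, the two sports documents score much closer to each other than a sports document does to a politics document. WMD differs in that it compares documents word by word rather than through a single averaged vector, which is why it is more expensive to compute.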
Task1: Data Understanding
To conduct the testing, I have to assess the data situation and obtain access to the data; once the data is available, it needs to be explored. I used the ETL data pipeline tool Informatica PowerCenter to build the data warehouse. It was deployed on a virtual machine with the following specifications:
Operating System: Windows Server 2012 R2 Standard
RAM: 32 GB
CPU Cores: 8 cores, 2.40 GHz processor
Kernel Version: 9.3.9600.18821
Task2: Data Preparation
Data preparation is the process of gathering, cleaning, and consolidating data into a single file or data table, primarily for analysis purposes. I used Datawatch Monarch, a self-service data preparation tool. The recommended specifications for using Datawatch Monarch are as follows:
Windows 10 – 8 GB memory
5 GB disk space
2GHz or faster processor
.NET Framework 4.5.2
Microsoft Access Database Engine 2010 version
I chose scikit-learn, a machine learning library for the Python programming language, to implement some models quickly during this project. To get the data ready for machine learning, I have to take some basic steps: missing value imputation, encoding of categorical variables, and, optionally, feature selection if the input dimension is too large. The scikit-learn library requires the following dependencies:
Python (>= 2.7 or >= 3.4)
NumPy (>= 1.8.2)
SciPy (>= 0.13.3)
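The basic preparation steps named above (missing value imputation and encoding of categorical variables) can be illustrated with the library-free sketch below. It is deliberately written in plain Python to show what the steps do; it is not scikit-learn's API, and the function names and sample values are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical columns: document length (with a gap) and newsgroup category.
lengths = impute_mean([120, None, 80])
categories, encoded = one_hot(["sci", "rec", "sci"])
```

The missing length is filled with the mean of the two observed values, and each category becomes its own indicator column in alphabetical order.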
Task4: Evaluation of Results
As part of testing, I compared the three methods (LSI, Centroid, and WMD). First, a local analysis of a single example is done to get a sense of how well the methods work. Then a global analysis is done with a clustering task. A lemmatization step has been applied, and duplicates are removed to make the table readable.
Finally, I compared the overall performance of the methods considered against common clustering approaches such as K-medoids, K-Means, complete linkage, Ward linkage, and DBSCAN.
4. Project Roadmap and Timeline
Developing the architecture explained in the sections above requires the six steps described in the timeline below:
The first step involves reading and analyzing various relevant research papers and documents. This initial part will take around two weeks.
For the next three steps, I have selected several existing algorithms, and I will test each one and record its results. In the second step, I will test and record the results of the LSI algorithm. This part will take a week.
In the third step, I will test and record the results of the Centroid algorithm. This part will take a week.
In the fourth step, I will test and record the results of the WMD algorithm. This part will also take a week.
In the fifth step, the results of the algorithms from the previous steps are compared across various metrics to identify the bottlenecks. This step will take one week.
The final step involves combining the LSI and WMD algorithms and applying various optimization steps to address the issues identified, so that the final algorithm reduces the total labeling/annotation cost. This step will take two weeks.
I am a lead member of technical staff at Salesforce with over 15 years of experience in the software industry. I am responsible for the design, development, and execution of test frameworks and harnesses for unit testing Java-based cloud services. I have also developed Java-based load and performance tools for applications running on large-scale Linux clusters. I have professional-level experience in the following technologies: Java, Python, big data technologies, functional testing, automation, and performance engineering. I lead the Sales Cloud prediction quality team at Salesforce. Prior to Salesforce, I worked at Intuit Inc. as a Staff Engineer.
The existing literature provides well-defined explanations and comparisons of various algorithms for calculating annotation/labeling costs in supervised machine learning. However, it does not include improvements such as combining various models into an architecture that works uniformly on different datasets while taking the behavior and volume of the data into account. In this work, I showed that for long texts, as in the 20 Newsgroups dataset, LSI is the best method, while WMD and the Centroid method both yield better clustering than LSI on the Web snippets dataset. The main focus of future work will be to investigate cost-sensitive active learning strategies that are more robust when given approximate, predicted annotation costs.