Analysis of Convolution Neural Network-Based Algorithm for Annotation Cost Estimation



Abstract 摘要

Active learning is an important machine learning process of selectively querying users to label or annotate examples, with the goal of reducing the overall annotation cost. Although most existing convolutional neural network (CNN) work is based on the simple assumption that the cost of annotation is the same or fixed for every labeling query, this assumption may not be realistic: in practice, the cost of annotation may vary between data instances. In this work, I study and present various annotation cost-sensitive active learning algorithms, which need to estimate the utility and cost of each query simultaneously.

The goal is to build or merge different machine learning models and reduce the total labeling cost required to train the model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) to obtain an efficient architecture that works across different datasets, thus reducing the overall labeling/annotation cost in supervised machine learning, and I validate that the proposed method is generally superior to other annotation cost-sensitive algorithms.


1. Introduction

Traditional machine learning algorithms use whatever labeled data is available to induce a model. By contrast, an active learning algorithm can select which instances are labeled and added to the training set. A learner typically starts with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data, and queries an oracle (e.g., a human annotator) for their labels. The objective is to reduce the overall annotation cost of training a model. The notion of annotation cost must be better understood and incorporated into the active learning process in order to genuinely reduce the labeling cost required to build an accurate model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) to obtain an efficient architecture that works across different datasets, thus reducing the overall labeling/annotation cost in supervised machine learning.


Active learning is a machine learning setup that enables machines to strategically “ask questions” of a labeling oracle (Settles, 2010) in order to reduce the cost of labeling. Annotation cost has traditionally been measured by the number of examples, but it has been widely recognized that different examples may require different annotation effort (Settles et al., 2008).

Vast quantities of unlabeled instances can be acquired easily in many machine learning scenarios, yet high-quality labels are expensive to obtain. For example, in fields such as medicine (Liu, 2004) or biology (King et al., 2004), a massive number of experiments and analyses are needed to label a single instance, while collecting samples is a relatively easy task. There are some variations in how cost-sensitive active learning is set up. In (Margineantu, 2005), it is assumed that the labeling cost of every data instance is known before querying, while in (Settles et al., 2008), the cost of a data instance becomes known only after its label has been queried. In this work, I concentrate on the latter setup, which closely matches the real-world human annotation scenario. Existing works (Haertel et al., 2008) must therefore estimate the utility and cost of each instance simultaneously and select instances with high utility and low cost.
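As a rough illustration of that selection rule (not taken from any of the cited works), the following sketch assumes hypothetical `utility` and `predicted_cost` arrays produced by separate utility and cost estimators, and picks the query with the best utility-per-cost ratio:

```python
import numpy as np

def select_query(utility, predicted_cost):
    """Pick the unlabeled instance with the best utility-per-cost ratio.

    utility        : informativeness scores for the unlabeled instances
    predicted_cost : estimated annotation costs for the same instances
    """
    # Avoid division by zero when a predicted cost is (near) zero.
    ratio = np.asarray(utility) / np.maximum(predicted_cost, 1e-8)
    return int(np.argmax(ratio))

# Toy example: instance 2 is only moderately informative but cheap to label.
utility = [0.9, 0.4, 0.7]
predicted_cost = [10.0, 2.0, 3.0]
print(select_query(utility, predicted_cost))  # -> 2
```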


The idea of uncertainty sampling (Lewis and Gale, 1994) is to query the label of the data instance about which the classifier is most uncertain. For example, for a support vector machine (SVM), (Tong and Koller, 2001) propose querying the data instance closest to the decision boundary, while (Holub et al., 2008) select the data instances to query based on the entropy of the label probabilities produced by a probabilistic classifier.
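A minimal sketch of entropy-based uncertainty sampling, assuming a fitted scikit-learn-style classifier that exposes `predict_proba` and a placeholder `unlabeled_pool` matrix, might look like this:

```python
import numpy as np

def most_uncertain(classifier, unlabeled_pool):
    """Return the index of the pool instance with maximum label entropy."""
    probs = classifier.predict_proba(unlabeled_pool)           # shape (n, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # per-instance entropy
    return int(np.argmax(entropy))
```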

In (Kang et al., 2004), the data instances closest to each cluster centroid are searched before any other selection criterion is applied; (Huang et al., 2010) measure the representativeness of each data instance from both the cluster structure of the unlabeled data and the class assignments of the labeled data; and (Xu et al., 2003) cluster the data instances close to the SVM decision boundary and query the labels of instances close to the center of each cluster. In (Nguyen and Smeulders, 2004), clustering is used to estimate the labeling probabilities of unlabeled data instances, which is the key component in measuring data instance utilities.
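As an illustration of the centroid-based selection idea, a sketch using scikit-learn's KMeans on placeholder unlabeled feature vectors could look as follows (the cluster count is an arbitrary example value, not one prescribed by the cited papers):

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_representatives(unlabeled_vecs, n_clusters=10):
    """Return the index of the unlabeled instance closest to each cluster centroid."""
    unlabeled_vecs = np.asarray(unlabeled_vecs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(unlabeled_vecs)
    representatives = []
    for center in km.cluster_centers_:
        distances = np.linalg.norm(unlabeled_vecs - center, axis=1)
        representatives.append(int(np.argmin(distances)))
    return representatives
```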

There are various works targeting annotation cost-sensitive active learning under different problem settings, such as the querying target (Greiner et al., 2002), the number of labelers (Donmez and Carbonell, 2008), the targeted classification problem (Yan and Huang, 2018), and the applied data domain (Vijayanarasimhan and Grauman, 2011).


In order to discuss cost-sensitive active learning with unknown costs, the first question to be answered is whether the cost of human annotation can be estimated accurately. In (Arora et al., 2009), various unsupervised models are proposed to estimate the annotation cost of corpus datasets, while (Settles et al., 2008) show that annotation cost can be estimated accurately using a supervised learning model.

2. Architecture and Proposed Procedure

Active learning is a widespread framework with the ability to automatically select the most informative unlabeled examples for annotation. The motivation behind uncertainty sampling is to find the unlabeled examples closest to the labeled data set (nearest neighbors) and use them to assign labels. To achieve this, I perform document classification with a CNN for any input article whose target label is unknown, and use cosine similarity to find the most similar documents in the training set as neighbors of the unlabeled document. It is then fair to assume that the closest similar documents can be given the same label, which lets the oracle label a smaller set of inputs.
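The nearest-neighbor lookup described above can be sketched with scikit-learn's cosine similarity; here the document vectors are assumed to come from whatever encoder is in use (for example the CNN's feature layer), and the function and variable names are illustrative only:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def nearest_labeled_neighbors(query_vec, labeled_vecs, labels, k=3):
    """Return the labels and similarities of the k labeled documents
    most similar (by cosine similarity) to the query document vector."""
    sims = cosine_similarity(np.asarray(query_vec).reshape(1, -1), labeled_vecs)[0]
    top_k = np.argsort(sims)[::-1][:k]
    return [labels[i] for i in top_k], sims[top_k]
```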


The architecture combines two major components. The first collects and preprocesses the data, defines the similarity measures, and develops the related models. The second part takes unlabeled data and uses the different models to perform similarity checks. The output of the system is a set of effective models for identifying neighboring documents/articles. In this work I evaluate multiple models for improving document similarity in order to reduce the overall labeling effort. For the similarity score I use Word2Vec. Based on the vector space model, two similarity measures built on word2vec (“Centroids” and “Word Mover’s Distance (WMD)”) will be studied and compared with the commonly used Latent Semantic Indexing (LSI). The 20 Newsgroups dataset will also be used to compare the document similarity measures.
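A hedged sketch of the two word2vec-based measures using gensim is shown below; the pretrained vector name is only an example, and WMD additionally needs an optimal-transport backend (pyemd or POT, depending on the gensim version):

```python
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

wv = api.load("glove-wiki-gigaword-50")   # example pretrained word vectors

def centroid_similarity(tokens_a, tokens_b):
    """'Centroids' measure: cosine similarity between the mean word vectors."""
    a = np.mean([wv[t] for t in tokens_a if t in wv], axis=0)
    b = np.mean([wv[t] for t in tokens_b if t in wv], axis=0)
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

def wmd_distance(tokens_a, tokens_b):
    """Word Mover's Distance between two tokenized documents (lower = more similar)."""
    return wv.wmdistance(tokens_a, tokens_b)

doc1 = "the president greets the press in chicago".split()
doc2 = "obama speaks to the media in illinois".split()
print(centroid_similarity(doc1, doc2), wmd_distance(doc1, doc2))
```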

3. Task Analysis

Task 1: Data Understanding

In order to conduct the testing, I have to assess the data situation, obtain access to the data, and explore it once it is available. I used the data-pipeline ETL tool PowerCenter Informatica to build the data warehouse. It was deployed in a virtual machine with the following specifications:

Operating System: Windows Server 2012 R2 Standard

RAM: 32 GB

CPU Cores: 8-core 2.40 GHz processor

Kernel Version: 9.3.9600.18821


Task 2: Data Preparation

Data preparation is the process of gathering, cleaning, and consolidating data into a single file or data table, primarily for analysis purposes. I used Datawatch Monarch, the industry’s leading solution for self-service data preparation. The recommended specifications for using Datawatch Monarch are as follows:

Windows 10 – 8 GB memory

5 GB disk space

2 GHz or faster processor

Google Chrome

.NET Framework 4.5.2

Microsoft Access Database Engine 2010 version

Microsoft SQL Server

Task 3: Modelling

I chose Scikit-Learn, the Python machine learning library, to implement some models quickly during this project. To get the data ready for machine learning, some basic steps are required: missing-value imputation, encoding of categorical variables, and optionally feature selection if the input dimension is too large (a small preprocessing sketch is given after the dependency list below). The scikit-learn library requires the following dependencies:

Python (>= 2.7 or >= 3.4)

NumPy (>= 1.8.2)

SciPy (>= 0.13.3)
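The preprocessing steps mentioned above could be assembled into a single scikit-learn pipeline roughly as follows; the column names are hypothetical, and feature selection is shown with a simple univariate filter:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

numeric_cols = ["doc_length", "num_tokens"]     # hypothetical numeric features
categorical_cols = ["source", "category"]       # hypothetical categorical features

preprocess = ColumnTransformer([
    # Impute missing numeric values with the median.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Impute, then one-hot encode, the categorical columns.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=10)),    # optional feature selection
    ("clf", LogisticRegression(max_iter=1000)),
])
# model.fit(X_train, y_train)  # X_train: a DataFrame containing the columns above
```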

Task 4: Evaluation of Results

As part of the testing, I compared the three methods (LSI, Centroid, and WMD). First, a local analysis on a single example is done to get a sense of how well the methods work. Then a global analysis is done with a clustering task. A lemmatization step is applied, and duplicates are removed to make the results table readable.

Finally, I compared the overall performance of the methods considered using common clustering algorithms such as K-medoids, K-means, complete linkage, Ward, and DBSCAN.
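A sketch of how such a clustering comparison could be run with scikit-learn is shown below (K-medoids lives in the separate scikit-learn-extra package and is omitted; `X` stands for the document vectors and `y` for the true newsgroup labels), scored here with the adjusted Rand index:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

def compare_clusterers(X, y, n_clusters):
    """Cluster the document vectors X with each method and score the
    resulting partition against the true labels y (adjusted Rand index)."""
    clusterers = {
        "k-means":  KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "complete": AgglomerativeClustering(n_clusters=n_clusters, linkage="complete"),
        "ward":     AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
        "dbscan":   DBSCAN(eps=0.5, min_samples=5),
    }
    return {name: adjusted_rand_score(y, c.fit_predict(X))
            for name, c in clusterers.items()}
```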

4. Project Roadmap and Timeline

Coming up with this distributed architecture, as explained in the sections above, would require the six steps listed in the timeline below:

The first step involves reading and analyzing various relevant research papers and documents. This initial part will take around two weeks.

For the next three steps I have selected several existing algorithms and will test and record results for each of them. Testing and recording the results of the LSI algorithm will take one week.


In this step I will test and record the results of the Centroid algorithm. This part will take one week.

In this step I will test and record the results of the WMD algorithm. This part will also take one week.

In this step, the results of the algorithms from the steps above are compared across various metrics and the bottlenecks are identified; this step will take one week.

The final step involves combining the LSI and WMD algorithms and applying various optimization steps to address the issues identified, so that the final algorithm reduces the total labeling/annotation cost in the field. This step will take two weeks.


5. Credentials

I am a lead member of technical staff at Salesforce with over 15 years of experience in the software industry. I am responsible for the design, development, and execution of test frameworks and harnesses for unit testing of Java-based cloud services. I have also developed Java-based load and performance tools for applications within large-scale Linux clusters. I have professional-level experience in the following technologies: Java, Python, big data technologies, functional testing, automation, and performance engineering. I lead the Sales Cloud prediction quality team at Salesforce. Prior to Salesforce, I worked at Intuit Inc. as a Staff Engineer.


The existing literature provides well-defined explanations and comparisons of various algorithms for estimating annotation/labeling costs in supervised machine learning. However, it does not include improvements such as combining several models into an architecture that can work uniformly across different datasets while accounting for the behavior and volume of the data. I was able to demonstrate that for long texts, corresponding to the 20 Newsgroups dataset, LSI is the best method, while WMD and the Centroid method both yield better clustering than LSI on the Web snippets dataset. The main focus of future work will be to investigate cost-sensitive active learning strategies that are more robust when given approximate, predicted annotation costs.

