Background:

1) Weka: "Weka" has two meanings: a flightless bird, and an open-source machine learning project (Waikato Environment for Knowledge Analysis). The one introduced here is of course the second, friends. The Weka project began in 1992 with support from the New Zealand government and is now well known in the machine learning field. Weka offers a very comprehensive set of machine learning algorithms, covering data preprocessing, classification, regression, clustering, and association rules. Its graphical interface is very convenient for people who do not write programs, and it also provides a "KnowledgeFlow" feature that chains multiple steps into a workflow. In addition, Weka can be run from the command line.

2) R: I will not waste words on R, heh; it is an increasingly popular statistical software package.

3) R and Weka: R has many machine learning functions and packages, but what Weka provides is more comprehensive and more focused, so I sometimes need to use Weka. I used to combine R and Weka like this: prepare the training data in R (e.g. extract features from the data ...); arrange it into the format Weka requires (*.arff); do the machine learning in Weka (e.g. feature selection, classification ...); then compute the required statistics from Weka's predictions (e.g. sensitivity, specificity, MCC ...). Shuttling back and forth between the two programs is quite troublesome. Being lazy, I never learned the Weka command line and only used the graphical interface, so I suffered badly with large data sets and sometimes ran out of memory. In fact, R has a package of interface functions to Weka, RWeka, which makes things much more convenient. Here is what RWeka offers:

RWeka:

1) Data input and output
WOW(): view the parameters of a Weka function.
Weka_control(): set the parameters of a Weka function.
read.arff(): read data in Weka Attribute-Relation File Format (ARFF).
write.arff(): write data in Weka Attribute-Relation File Format (ARFF).

2) Data preprocessing
Normalize(): unsupervised normalization of continuous data.
Discretize(): supervised discretization of continuous numeric data using the MDL (Minimum Description Length) method.

3) Classification and regression
IBk(): k-nearest-neighbor classifier.
LBR(): Lazy Bayesian Rules classifier, a naive Bayes based method.
J48(): C4.5 decision tree algorithm (the decision tree analyzes each attribute completely independently).
LMT(): a combination of a tree and logistic regression models; each leaf node is a logistic regression model, and the accuracy is typically better than that of a decision tree or logistic regression alone.
M5P(): M5 model tree algorithm, a combination of a tree and linear regression models; each leaf node is a linear regression model, so it can be used for regression on continuous data.
DecisionStump(): single-level decision tree algorithm, often used as the base learner in boosting.
SMO(): support vector machines.
AdaBoostM1(): the AdaBoost M1 method. The -W parameter specifies the weak learner algorithm.
Bagging(): builds multiple models from samples of the original data (drawn with replacement).
LogitBoost(): boosting in which the weak learners use a regression method and learn real-valued outputs (additive logistic regression).
MultiBoostAB(): an improved AdaBoost method that can be viewed as a combination of AdaBoost and "wagging".
Stacking(): an ensemble method for combining different base classification algorithms.
LinearRegression(): fits a linear regression model.
Logistic(): fits a logistic regression model.
JRip(): a rule learning method.
M5Rules(): generates decision rules for regression problems using the M5 method.
OneR(): the simple 1-R classifier.
PART(): generates PART decision rules.

4) Clustering
Cobweb(): a model-based method; it assumes a model for each cluster and looks for the data that fit the corresponding model.
Cobweb is not suitable for clustering large databases.
FarthestFirst(): a fast approximate clustering algorithm similar to k-means.
SimpleKMeans(): k-means clustering algorithm.
XMeans(): a modified k-means method that determines the number of clusters automatically.
DBScan(): a density-based clustering method that keeps growing a cluster according to the density of the objects around it. It can find clusters of arbitrary shape in spatial databases with noise. The method defines a cluster as a set of "density-connected" points.

5) Association rules
Apriori(): Apriori is the most influential basic algorithm in the field of association rules. It is a breadth-first algorithm that repeatedly scans the database to obtain the frequent itemsets whose support is greater than the minimum support. It relies on two monotonicity principles of frequent itemsets: every subset of a frequent itemset must be frequent, and every superset of a non-frequent itemset must be non-frequent. With massive data, the time and space costs of Apriori are very high.
Tertius(): the Tertius algorithm.

6) Prediction and evaluation
predict(): predict the class of new data from the classification or clustering results.
table(): compare two factor objects (cross-tabulation).
evaluate_Weka_classifier(): evaluate the performance of a model, e.g. TP Rate, FP Rate, Precision, Recall, F-Measure.
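
The statistics mentioned in this post (sensitivity, specificity, MCC, and the Precision / Recall / F-Measure style figures used to assess a classifier) all come from the same four confusion-matrix counts. As a minimal sketch, and not RWeka's or Weka's own code, the following plain Python shows the arithmetic. The function name binary_stats and the sample labels are made up for illustration.

```python
from math import sqrt


def binary_stats(actual, predicted, positive):
    """Count TP/FP/TN/FN for one class and derive common statistics.

    `positive` is the label treated as the positive class (hypothetical helper).
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Assumption: report 0.0 when a denominator is zero instead of failing.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # also called recall or TP rate
    specificity = ratio(tn, tn + fp)   # equals 1 - FP rate
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = ratio(tp * tn - fp * fn,
                sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": sensitivity, "specificity": specificity,
        "precision": precision, "f_measure": f_measure, "mcc": mcc,
    }


if __name__ == "__main__":
    # Invented labels, only to exercise the function.
    actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
    predicted = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]
    for name, value in binary_stats(actual, predicted, "yes").items():
        print(name, value)
```

For a multi-class problem the same function can be called once per class, treating that class as positive and all others as negative.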

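To make the Apriori description in 5) more concrete, here is a small level-wise sketch in plain Python. It is not Weka's implementation. The function name apriori, the toy shopping baskets, and the choice of an absolute count with an "at least min_support" threshold are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    min_support is an absolute count here (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: one scan of the data to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a frequent itemset must be
            # frequent, so drop candidates that have a non-frequent subset.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # One more scan of the data per level (breadth-first).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    # Invented baskets, only to exercise the function.
    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]
    for itemset, support in sorted(apriori(baskets, 3).items(),
                                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), support)
```

The repeated scan inside the loop is the reason the time and space costs grow quickly on massive data: each level needs another full pass over all transactions.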
December 14, 2010 at 1:25 pm by admin