* Algorithms
    - ID3 (Iterative Dichotomiser 3)
    - C4.5 (successor of ID3)
    - CART (Classification And Regression Tree)
    - CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.[11]
    - MARS: extends decision trees to handle numerical data better.

# Automatic Web Content Extraction by Combination of Learning and Grouping

Created: Nov 09, 2018 6:38 PM

Tags: Paper

# 1. INTRODUCTION

- The main content of a web page is often accompanied by additional, often distracting content such as branding banners, navigation elements, advertisements, and copyright notices.
- Web pages on the World Wide Web are highly heterogeneous.
- Previous work
    - Heuristic
    - Template based Approach
    - TED

# 2. RELATED WORK

- CETR
- CETD
- VIPS

# 3. PROBLEM FORMULATION AND SOLUTION

![](Untitled-c3f13dfd-e5f1-486c-aaf6-4580e50223b5.png)

# 4. FEATURE SELECTION

$$F_x(v_i)=F'_x(v_i)\bigcup\{{\bigcup_{v_j\subseteqq Children(v_i)}F_x(v_j)}\}$$

## 4.1 Position and Area Features

- We consider the left, right, top, bottom, horizontal center and vertical center positions.

$$POS\_LEFT = 1 - |BEST\_LEFT - LEFT|$$

## 4.2 Font Features

$$FONT\_COLOR\_POPULARITY=\sum_i\varphi_{ki} \varphi_{ri}$$

$$FONT\_SIZE=\sum_i{\rho_{ki}(z_i-z_{min}) \over (z_{max}-z_{min})}$$

## 4.3 Text, Tag and Link Features

$$TEXT\_RATIO={A_{text} \over A_{text} +A_{image} + 1}$$

$$TAG\_DENSITY={numTags \over numChars+1}$$

$$LINK\_DENSITY={numLinks \over numTags+1}$$

# 5. LEARNING

# 6. GROUPING AND REFINING

1. Grouping
2. Group Selection
3. Refining

# 7. EXPERIMENTAL EVALUATION

1. Evaluation Data Set and Metrics
2. Comparison with the Baseline Methods
    - LR_A
    - SVM_A
    - LR
    - SVM
    - MSS
3. Parameter Sensitivity Analysis

# 8. CONCLUSIONS
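
As a worked illustration of the three ratio features in Section 4.3 (TEXT_RATIO, TAG_DENSITY, LINK_DENSITY), here is a minimal Python sketch. The `NodeStats` container and its field names are hypothetical and do not come from the paper. It also assumes that $A_{text}$ and $A_{image}$ are the rendered areas covered by text and by images inside a node, which these notes do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node measurements; the field names are assumptions."""
    text_area: float   # A_text: assumed to be the rendered area covered by text
    image_area: float  # A_image: assumed to be the rendered area covered by images
    num_tags: int      # numTags: number of tags inside the node
    num_chars: int     # numChars: number of text characters inside the node
    num_links: int     # numLinks: number of links inside the node


def text_ratio(s: NodeStats) -> float:
    # TEXT_RATIO = A_text / (A_text + A_image + 1)
    return s.text_area / (s.text_area + s.image_area + 1)


def tag_density(s: NodeStats) -> float:
    # TAG_DENSITY = numTags / (numChars + 1)
    return s.num_tags / (s.num_chars + 1)


def link_density(s: NodeStats) -> float:
    # LINK_DENSITY = numLinks / (numTags + 1)
    return s.num_links / (s.num_tags + 1)


# Made-up example values, chosen only so the arithmetic is easy to follow.
example = NodeStats(text_area=300.0, image_area=99.0,
                    num_tags=9, num_chars=99, num_links=4)
print(text_ratio(example))    # 300 / 400
print(tag_density(example))   # 9 / 100
print(link_density(example))  # 4 / 10
```

The `+ 1` in each denominator keeps the features defined for empty nodes: a node with no text, images, tags, or characters yields 0 for all three features instead of a division by zero.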