===== Definition =====
==== Project Overview ====
Epidemics have taken many lives, and at any given moment many diseases are potentially spreading. People are infected through a variety of routes, such as air, insects, animals, and water. However, ordinary people rarely know when, where, and how epidemics start or end, so they remain largely unaware of them. Most people recognize only large, severe epidemics, and only through news articles and government announcements. As a result, lesser-known or small-scale infectious diseases go unnoticed by the public. \\
In this project, we analyze the frequency of illness-related words in news articles, examine how that frequency relates to the occurrence of actual diseases, and use it to identify diseases that have emerged and subsided. \\
==== Problem Statement ====
The purpose is to find the relationship between the frequency of disease-related words in article contents and the presence or absence of actual diseases within the target period (2015–2017). The tasks are as follows: \\
- Collect news articles from NAVER (https://news.naver.com/); articles have been collected since May 20, 2015.
- Analyze the morphemes of every news article on a yearly basis and calculate the average number of times the term 'patient' is used per article. (Why 'patient'? In the morphological analysis of a sample drawn from the full collection, it was the most frequent morpheme other than specific disease names.)
- After analyzing the morphemes of all news articles on a monthly basis, select the months in which the term 'patient' appears more frequently than the average (a sketch of this step follows the list).
- Based on the analyzed morphemes of the selected months, check whether an epidemic actually occurred in those months using data from KOSIS (Korean Statistical Information Service).
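The following is a minimal sketch of the frequency step above, assuming KoNLPy's Okt tokenizer and a hypothetical ''articles'' iterable of (publication datetime, content) pairs taken from the article DB; the names used here are illustrative, not the actual project code. \\
<code python>
# Count per-article occurrences of the morpheme '환자' (patient) and average them per month.
from collections import defaultdict
from konlpy.tag import Okt

okt = Okt()

def monthly_patient_frequency(articles):
    """articles: iterable of (datetime, str) -> {'YYYY-MM': average '환자' count per article}"""
    counts = defaultdict(list)
    for published_at, content in articles:
        morphemes = okt.morphs(content)                    # morphological analysis
        counts[published_at.strftime('%Y-%m')].append(morphemes.count('환자'))
    return {month: sum(c) / len(c) for month, c in counts.items()}

def months_above_average(monthly_avg):
    """Select months whose per-article 'patient' frequency exceeds the overall average."""
    overall = sum(monthly_avg.values()) / len(monthly_avg)
    return [m for m, v in sorted(monthly_avg.items()) if v > overall]
</code>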
===== Analysis =====
==== Data Exploration ====
The collected article data consists of all news articles gathered from various portals since 2015. Using the Python web crawling framework Scrapy, we define the fields of the data to be collected (item.py) and write the code that extracts the information (spider.py). The crawled article data was stored in a personal DB. There are about 500,000 articles in total, and from this data we use the article content and the collection time. The collected article data has the following fields (a minimal item definition sketch follows the field list): \\
  * id : collected article index (Integer)
  * aid : article identifier number (Integer)
  * title : article title (String)
  * content : article content (String)
  * aDate : collection time (TimeStamp)
  * nUrl : article URL on NAVER (String)
  * pUrl : article URL on the original news site (String)
  * nClass : article section (String)
  * press : news agency (String)
  * subclass, pdf, numComment, etc.
{{:artilce_data.jpg|}}
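As a rough illustration of how the fields above map to the Scrapy item definition (item.py), the sketch below uses the same field names; the actual project code may differ. \\
<code python>
# Minimal Scrapy item mirroring the collected article fields listed above.
import scrapy

class NaverArticleItem(scrapy.Item):
    aid = scrapy.Field()       # article identifier number
    title = scrapy.Field()     # article title
    content = scrapy.Field()   # article body text
    aDate = scrapy.Field()     # collection timestamp
    nUrl = scrapy.Field()      # article URL on NAVER
    pUrl = scrapy.Field()      # article URL on the original news site
    nClass = scrapy.Field()    # article section
    press = scrapy.Field()     # news agency
</code>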
==== Data Visualization ====
This graph shows the monthly number of typhoid fever patients in a given year. Similarly, the number of incidents per year/month for Group 1 to 4 infectious diseases can be confirmed. These data serve as the baseline for comparing our analysis results against the actual incidence data. \\
{{:epidemic_data.jpg|}}
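A graph like the one above can be reproduced with a short matplotlib script; the counts below are placeholder values, not the actual KOSIS figures. \\
<code python>
# Plot monthly patient counts for one disease and one year (example data only).
import matplotlib.pyplot as plt

months = list(range(1, 13))
typhoid_counts = [12, 9, 14, 20, 25, 30, 28, 26, 18, 15, 11, 10]  # hypothetical values

plt.bar(months, typhoid_counts)
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Number of patients')
plt.title('Monthly typhoid fever patients (example data)')
plt.show()
</code>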