Shows the differences between the two selected revisions of the document.
etc [2020/04/14 08:25] |
etc [2021/04/13 06:54] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | < | ||
+ | |||
+ | |||
+ | ===== Definition ===== | ||
+ | ==== Project Overview ==== | ||
+ | Epidemics have taken many lives, and many diseases may be spreading at this very moment. People are infected through a variety of routes, such as air, insects, animals, and water. However, because ordinary people do not know when, where, and how epidemics start or disappear, they remain largely unaware of them. Most people learn of large and severe epidemics only through news, articles, and government announcements. As a result, little-known or small-scale infectious diseases go unnoticed by the public. \\ | ||
+ | |||
+ | In this project, we analyze the relationship between the frequency of illness-related words in news articles and the occurrence of actual diseases, in order to identify when actual diseases appeared and disappeared. \\ | ||
+ | |||
+ | ==== Problem Statement ==== | ||
+ | The goal is to find the relationship between the frequency of disease-related words in article contents and the presence or absence of actual diseases within the same date range (2015, 2016, 2017). The tasks are as follows: \\ | ||
+ | |||
+ | - Collect news articles from NAVER. (https:// | ||
+ | - Analyze the morphemes of every news article on a yearly basis and calculate the average frequency of the term ' | ||
+ | - After analyzing the morphemes of all news articles on a monthly basis, select the months in which the term ‘patients’ appears more frequently than the average. | ||
+ | - Based on the analyzed morphemes of the selected months, check whether an epidemic occurred in that period using actual data from KOSIS (Korean Statistical Information Service). | ||
+ | |||
+ | |||
+ | ===== Analysis ===== | ||
+ | ==== Data Exploration ==== | ||
+ | The collected article data consists of all news articles gathered from various portals since 2015. Using the Python web-crawling library ‘Scrapy’, | ||
+ | |||
+ | * id : collected article index (Integer) | ||
+ | * aid : article identification number (Integer) | ||
+ | * title : article title (String) | ||
+ | * content : article content (String) | ||
+ | * aDate : collection time (Timestamp) | ||
+ | * nUrl : article URL on NAVER (String) | ||
+ | * pUrl : article URL on the original news site (String) | ||
+ | * nClass : article section (String) | ||
+ | * press : news company that published the article (String) | ||
+ | * subclass, pdf, numComment, etc. | ||
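
The fields listed above could be modeled as a small record type. The following is only a sketch: the mapping of Integer/String/TimeStamp to Python types is an assumption, and the sample record (including its URLs, section, and press name) is invented for illustration rather than taken from the collected data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    """One collected article; field names follow the list in the text."""
    id: int            # collected article index
    aid: int           # article identification number
    title: str         # article title
    content: str       # article content
    aDate: datetime    # collection time
    nUrl: str          # article URL on NAVER
    pUrl: str          # article URL on the original news site
    nClass: str        # article section
    press: str         # news company


# Hypothetical sample record, invented purely for illustration.
sample = Article(
    id=1,
    aid=1001,
    title="Sample title",
    content="Sample content",
    aDate=datetime(2015, 6, 1, 9, 30),
    nUrl="https://example.com/n/1001",
    pUrl="https://example.com/p/1001",
    nClass="society",
    press="Sample Press",
)

# The year and month of aDate are what a per-year, per-month analysis groups on.
print(sample.aDate.year, sample.aDate.month)
```

Keeping aDate as a real timestamp, not a string, makes the later year and month filtering a simple attribute comparison.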
+ | |||
+ | {{: | ||
+ | |||
+ | ==== Data Visualization ==== | ||
+ | |||
+ | This graph shows the monthly number of typhoid fever patients in a given year. Similarly, the number of cases per year and month for Group 1 to 4 infectious diseases can be confirmed. These data are used as the basis for comparing our analysis results with the actual data. \\ | ||
+ | |||
+ | {{: | ||
+ | |||
+ | < | ||
+ | |||
+ | ===== Methodology ===== | ||
+ | ==== Data Preprocessing ==== | ||
+ | |||
+ | The article collection data is supplied as-is and is treated as 100% accurate, so no preprocessing is required. \\ | ||
+ | |||
+ | ==== Implementation ==== | ||
+ | |||
+ | Analyze all the news articles on a yearly basis, count the frequency of the term ‘patient’, and calculate its average. \\ | ||
+ | |||
+ | < | ||
+ | def get_tag(text, | ||
+ | spliter = Okt() | ||
+ | nouns = spliter.nouns(text) | ||
+ | | ||
+ | count = Counter(nouns) | ||
+ | return_list = [] | ||
+ | | ||
+ | for n, c in count.most_common(ntags): | ||
+ | if (n == " | ||
+ | temp = {' | ||
+ | return_list.append(temp) | ||
+ | return return_list | ||
+ | </ | ||
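
The idea behind get_tag() can be sketched without a morphological analyzer. This is a minimal illustration, not the original function: the helper name count_target_noun is hypothetical, the noun list stands in for the output of a noun extractor, and the English word "patient" stands in for the term the project actually searched for.

```python
from collections import Counter


def count_target_noun(nouns, target, ntags):
    """Return the count of `target` if it is among the `ntags` most
    frequent nouns, otherwise 0."""
    for noun, count in Counter(nouns).most_common(ntags):
        if noun == target:
            return count
    return 0


# Stand-in for the noun list a morphological analyzer would return.
nouns = ["patient", "hospital", "patient", "virus", "patient", "hospital"]

print(count_target_noun(nouns, "patient", ntags=2))
print(count_target_noun(nouns, "virus", ntags=2))
```

Note that a term outside the top `ntags` nouns is reported as 0 even if it occurs, so `ntags` has to be large enough for the target term to be found.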
+ | |||
+ | < | ||
+ | import time | ||
+ | start_time = time.time() | ||
+ | |||
+ | noun_list = [] | ||
+ | count_list = [] | ||
+ | |||
+ | for i in range(1, 13): | ||
+ | article_df_month = article_df.loc[(article_df[' | ||
+ | |||
+ | noun_count = 1000 | ||
+ | tags = get_tag(article_df_month.to_string(), | ||
+ | | ||
+ | for tag in tags: | ||
+ | noun = tag[' | ||
+ | count = tag[' | ||
+ | print(str(i) + " | ||
+ | print(noun, ":", | ||
+ | noun_list.append(str(i)) | ||
+ | count_list.append(count) | ||
+ | </ | ||
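
The monthly loop above can be approximated in plain Python as follows. This sketch assumes each article has already been reduced to a (timestamp, noun list) pair; the function name monthly_term_counts and the sample data are hypothetical. Unlike the loop above, it counts the term directly and does not go through a top-N noun list.

```python
from datetime import datetime


def monthly_term_counts(articles, term, year):
    """Count occurrences of `term` per month for articles dated in `year`.

    `articles` is an iterable of (datetime, list_of_nouns) pairs.
    Months without any match keep a count of 0.
    """
    counts = {month: 0 for month in range(1, 13)}
    for date, nouns in articles:
        if date.year == year:
            counts[date.month] += nouns.count(term)
    return counts


# Hypothetical sample data, invented for illustration only.
articles = [
    (datetime(2015, 5, 20), ["patient", "hospital"]),
    (datetime(2015, 6, 2), ["patient", "patient", "virus"]),
    (datetime(2016, 6, 2), ["patient"]),  # other year, ignored for 2015
]

counts_2015 = monthly_term_counts(articles, "patient", 2015)
print(counts_2015[5], counts_2015[6])
```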
+ | |||
+ | < | ||
+ | patient_df = pd.DataFrame({" | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | |||
+ | total = patient_df.sum(axis = 0)[1] | ||
+ | month_avr = total / 12 | ||
+ | print(" | ||
+ | |||
+ | select_month = patient_df[patient_df[' | ||
+ | |||
+ | </ | ||
+ | |||
+ | Month Average : 2331.6371681415926 \\ | ||
+ | |||
+ | -Months with Higher Frequency than Average- \\ | ||
+ | |||
+ | ^index ^ Month ^ Patient^ | ||
+ | ^0| 5 | 2899 | | ||
+ | ^1| 6 | 10629 | | ||
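
The selection step, keeping only the months whose count exceeds the average, can be sketched as below. The function name months_above_average and the counts are hypothetical and are not the project's figures. The sketch also assumes the average is taken over the months present in the input.

```python
def months_above_average(monthly_counts):
    """Return (average, {month: count}) for months whose count exceeds
    the average over all months in `monthly_counts`."""
    average = sum(monthly_counts.values()) / len(monthly_counts)
    selected = {
        month: count
        for month, count in monthly_counts.items()
        if count > average
    }
    return average, selected


# Hypothetical counts, invented for illustration only.
example_counts = {1: 10, 2: 20, 3: 90}

average, selected = months_above_average(example_counts)
print("Month Average :", average)
print(selected)
```

A single very large month raises the average, so fewer months pass the threshold in a year dominated by one outbreak.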
+ | |||
+ | Select the months in which the term ‘patients’ is used more frequently than average and show a list of morphemes for each of those months. \\ | ||
+ | |||
+ | < | ||
+ | |||
+ | def get_tags(text, | ||
+ | spliter = Okt() | ||
+ | nouns = spliter.nouns(text) | ||
+ | | ||
+ | count = Counter(nouns) | ||
+ | return_list = [] | ||
+ | | ||
+ | for n, c in count.most_common(ntags): | ||
+ | temp = {' | ||
+ | return_list.append(temp) | ||
+ | |||
+ | return return_list | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | for i in range(select_month.shape[0]): | ||
+ | month = select_month[' | ||
+ | | ||
+ | article_df_month = article_df.loc[(article_df[' | ||
+ | | ||
+ | noun_count = 10 | ||
+ | tags = get_tags(article_df_month.to_string(), | ||
+ | |||
+ | print(str(month) + " | ||
+ | for tag in tags: | ||
+ | noun = tag[' | ||
+ | count = tag[' | ||
+ | print(noun, ":", | ||
+ | </ | ||
+ | |||
+ | # month_avr : The average of the frequency of the ' | ||
+ | |||
+ | # select_month : The months in which the frequency of the ‘patient’ morpheme is higher than average \\ | ||
+ | |||
+ | {{: | ||
+ | |||
+ | Based on the analyzed morphemes of the selected months, check whether an epidemic occurred in that period using the actual data from KOSIS. \\ | ||
+ | |||
+ | < | ||
+ | mers_df = pd.read_csv(" | ||
+ | |||
+ | patient_df[' | ||
+ | mers_df[' | ||
+ | merge_2015 = pd.merge(patient_df, | ||
+ | merge_2015.corr() | ||
+ | </ | ||
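
pandas `DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below shows that coefficient in plain Python for two equally long series; the function name pearson and the sample numbers are illustrative and are not the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly series, invented for illustration only.
term_frequency = [1, 2, 3, 4]
occurrences = [2, 4, 6, 8]

print(pearson(term_frequency, occurrences))
```

A value near 1 means the two monthly series rise and fall together, and a value near 0 means there is no linear relationship. With only twelve monthly points per year, a single extreme month can dominate the coefficient.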
+ | |||
+ | ==== Refinement ==== | ||
+ | |||
+ | get_tag() : A function that returns the frequency of the ' | ||
+ | @param1 : Text whose morphemes are analyzed \\ | ||
+ | @param2 : The number of most frequent nouns to extract \\ | ||
+ | |||
+ | get_tags() : A function that returns a list of the n most frequent morphemes so that they can be printed. \\ | ||
+ | @param1 : Text whose morphemes are analyzed \\ | ||
+ | @param2 : The number of most frequent nouns to extract \\ | ||
+ | |||
+ | |||
+ | ===== Result ===== | ||
+ | ==== Model Evaluation and Validation ==== | ||
+ | < | ||
+ | - The frequency of ' | ||
+ | {{: | ||
+ | |||
+ | - Actual patient occurrence data for ' | ||
+ | {{: | ||
+ | |||
+ | - Correlation between ' | ||
+ | |||
+ | ^ ^ ** Month ** ^ ** Patient ** ^ ** Occurrences ** ^ | ||
+ | ^ ** Month **| 1.000000| | ||
+ | ^ ** Patient **| -0.603977 | | ||
+ | ^ ** Occurrences ** | -0.486350 | ** 0.985725 **| 1.000000 | | ||
+ | |||
+ | < | ||
+ | - The frequency of ' | ||
+ | {{: | ||
+ | |||
+ | - Actual patient occurrence data for all infectious diseases \\ | ||
+ | {{: | ||
+ | |||
+ | - Correlation between ' | ||
+ | ^ ^ ** Month ** ^ ** Patient ** ^ ** Occurrences ** ^ | ||
+ | ^ ** Month **| 1.000000| | ||
+ | ^ ** Patient **| 0.696169 | | ||
+ | ^ ** Occurrences ** | 0.507285 | ** -0.043851 **| 1.000000 | | ||
+ | |||
+ | < | ||
+ | - The frequency of ' | ||
+ | {{: | ||
+ | |||
+ | - Actual patient occurrence data for all infectious diseases \\ | ||
+ | {{: | ||
+ | |||
+ | - Correlation between ' | ||
+ | ^ ^ ** Month ** ^ ** Patient ** ^ ** Occurrences ** ^ | ||
+ | ^ ** Month **| 1.000000| | ||
+ | ^ ** Patient **| 0.490746| | ||
+ | ^ ** Occurrences ** | 0.703117| | ||
+ | |||
+ | ==== Justification ==== | ||
+ | - 2015 : News articles written in June showed the most frequent use of the morpheme ‘patient’. This was in fact the same month in which the number of MERS patients worldwide was highest. \\ | ||
+ | In addition, in 2015 the usage frequency of the morpheme ‘patient’ had a very strong positive correlation, 0.985725, with the number of patients who actually suffered from the disease. \\ | ||
+ | |||
+ | - 2016 : The ' | ||
+ | In fact, there was no national disaster caused by a pandemic in 2016. \\ | ||
+ | |||
+ | - 2017 : The frequencies of the " | ||
+ | In 2017, there was no national disaster caused by a contagious disease. Although the frequency of " | ||
+ | |||
+ | ===== Conclusion ===== | ||
+ | ==== Reflection ==== | ||
+ | In 2015 and 2017, there was a strong positive correlation between the frequency of the morpheme " | ||
+ | |||
+ | ==== Improvement ==== | ||
+ | - It was a very weak criterion because there was only a ' | ||
+ | - In addition, we analyzed morphemes using the contents of all articles regardless of field. Therefore, at times when there were many social and political issues, there were very few articles and morphemes related to ' | ||
+ | - In the future, I would like to reconsider the logic for demonstrating that particular morphemes in articles, and other factors, are associated with the actual data. | ||
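
One possible direction for the second point is to restrict the count to selected article sections (the nClass field) before measuring term frequency. The sketch below is hypothetical: the function name count_term_in_sections, the section names, and the sample data are invented, and the project did not use this approach.

```python
def count_term_in_sections(articles, term, sections):
    """Count `term` only in articles whose section is in `sections`.

    `articles` is an iterable of (section, list_of_nouns) pairs.
    """
    wanted = set(sections)
    return sum(
        nouns.count(term)
        for section, nouns in articles
        if section in wanted
    )


# Hypothetical sample data; the section names are invented.
articles_by_section = [
    ("health", ["patient", "virus", "patient"]),
    ("politics", ["election", "patient"]),
    ("society", ["hospital", "patient"]),
]

print(count_term_in_sections(articles_by_section, "patient", ["health", "society"]))
```

Filtering by section would keep months dominated by political coverage from diluting the disease-related signal. The trade-off is that relevant articles filed under other sections would be missed.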