- Overview
- Objective
- Project File Structure
- Initialization
- Understanding the Code
- What Can Be Done Further
- Acknowledgement
- License and Additional Clause
- Contact the Creator of the Project
The project aims to uncover insights into how Chinese society has changed over time by analyzing the frequency of geographic names mentioned in the news program across a decade. The analysis seeks to interpret the importance of these cities in national politics, the economy, and society.
- Describe the clear structure of the project.
- Highlight the importance of tracking contributions and ensuring project integrity.
DigitalHumanity_XinWenLianBo/
│
├── README.md # Project overview and setup instructions
├── LICENSE # The license file
│
├── src/ # Source files
│ └── web_crawler_.py # Main application script
│
├── viz/ # Files for visualization effects
│ ├── *_news.html # TF-IDF results
│ ├── barchartrace_exposure.gif # Exposure results
│ └── ProvinceExposure.html
│
├── data/
│ ├── videos_{yyyymmdd}.csv, ... # Data collected from web_crawler_.py
│ ├── news_document-distribution.csv, ... # Data generated for TF-IDF analysis
│ ├── news_document-top-topic-words.csv, ... # Data generated for TF-IDF analysis
│ └── other documents, ... # Other documents for further studies
│
├── bar_chart_race/ # Dependency package for drawing the .gif
├── XWLB_news.ipynb # Overall project notebook
├── requirements.txt # Project dependencies
└── .gitignore # Specifies intentionally untracked files to ignore
Note
Due to the limit on how many files can be uploaded at a time, some of the .csv files in data/ are hosted externally at data. The contents of viz/ are likewise stored at viz. Files not visible in the Git repository can be found there.
Here are the steps to set up this project locally:
- Reorganize the file structure as necessary.
- Ensure all dependencies are installed (e.g., `pip install -r requirements.txt`).
The code defines a function web_crawler_()
which uses a web scraping approach to collect data. This function iterates over a specified date range, constructing URLs for each day, fetching HTML content, and extracting details from video links found on the daily news pages. The extracted details include video title, description, source, time, and content. The data is temporarily stored in pandas DataFrames and saved as CSV files for each day.
As a reminder, the date range for data collection can be changed by modifying the code according to your needs.
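For illustration, here is a minimal sketch of that crawling loop, assuming `requests` and `BeautifulSoup`. The daily-page URL pattern and the HTML selectors are placeholders rather than the exact ones used in `src/web_crawler_.py`, and the sketch keeps everything from each video page in a single "Detail" column, which the merge step below then splits.

```python
# Minimal sketch of the per-day crawling loop (URL pattern and selectors are assumptions).
from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def web_crawler_(start=date(2013, 1, 1), end=date(2013, 1, 3)):
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        # Hypothetical daily index page listing that day's video links.
        url = f"http://tv.cctv.com/lm/xwlb/day/{stamp}.shtml"
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

        rows = []
        for link in soup.select("a[href]"):  # each video entry on the daily page
            video = BeautifulSoup(requests.get(link["href"], timeout=30).text, "html.parser")
            rows.append({
                "Title": link.get_text(strip=True),
                "Detail": video.get_text(" ", strip=True),  # description/source/time/content
            })

        # One CSV per day, matching the data/videos_{yyyymmdd}.csv naming scheme.
        pd.DataFrame(rows).to_csv(f"data/videos_{stamp}.csv", index=False)
        day += timedelta(days=1)
```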
This section merges the individual daily CSV files into a single DataFrame to make analysis easier. It also covers extracting information from the "Detail" field, which contains structured text describing each video's content; fuzzy matching is used to separate components such as description, source, time, and content into their own columns.
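A rough sketch of the merge-and-parse step is shown below. The notebook's actual approach relies on fuzzy matching; here a plain regular expression stands in for it, and the field labels (简介, 来源, 发布时间) are assumptions about how the "Detail" text is laid out.

```python
# Merge the daily CSVs and split the "Detail" blob into separate columns.
import glob

import pandas as pd

# Concatenate every daily CSV produced by the crawler into one DataFrame.
daily_files = sorted(glob.glob("data/videos_*.csv"))
news = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)

# Placeholder labels; the notebook uses fuzzy matching to cope with messy formatting.
pattern = (
    r"简介[:：]\s*(?P<Description>.*?)\s*"
    r"来源[:：]\s*(?P<Source>.*?)\s*"
    r"发布时间[:：]\s*(?P<Time>\S+)"
)
news = pd.concat([news, news["Detail"].str.extract(pattern)], axis=1)
```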
The analysis would involve time series analysis of text data, possibly using techniques like word frequency analysis, topic modeling (LDA), and geographical data visualization (using tools like pyecharts for mapping).
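As one concrete example of the geographic piece, the sketch below counts province mentions with `jieba` and renders a pyecharts choropleth similar to viz/ProvinceExposure.html. The shortened province list, the reuse of the "Detail" column from the sketch above, and the output file name are illustrative assumptions.

```python
# Count province mentions and draw a province-exposure map with pyecharts.
import glob

import jieba
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Map

# Reload the crawled text (same layout as in the sketch above).
news = pd.concat(
    (pd.read_csv(f) for f in sorted(glob.glob("data/videos_*.csv"))), ignore_index=True
)

# Count how often each province name appears in the tokenized text.
provinces = ["北京", "上海", "广东", "四川", "新疆"]  # shortened list for illustration
tokens = news["Detail"].dropna().map(jieba.lcut)
exposure = [(p, int(tokens.map(lambda words: words.count(p)).sum())) for p in provinces]

# Render an interactive map, akin to viz/ProvinceExposure.html (output name is hypothetical).
(
    Map()
    .add("Exposure", exposure, maptype="china")
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Province exposure in Xinwen Lianbo"),
        visualmap_opts=opts.VisualMapOpts(max_=max(v for _, v in exposure)),
    )
    .render("viz/ProvinceExposure_demo.html")
)
```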
- Graphical visualization analysis of interactions between regions
- Changes in other themes over time
We welcome further contributions.
[1] Thanks to the original author of the WeChat article 哪个城市是中央眼中的心头爱?基于新闻联播文本的大数据分析 ("Which city is the darling in the central government's eyes? A big-data analysis based on Xinwen Lianbo texts"), which made data collection easier and enabled more exciting analysis.
[2] Thanks to Jindi Mo and Xuan Li for explaining the code and building the case studies. Their contributions give this project real impact for researchers in the digital humanities.
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License - see the LICENSE file for details.
Use of this project in any form must also comply with the applicable laws and regulations of the People's Republic of China.