- Overview
- Objective
- Project File Structure
- Initialization
- Understanding the Code
- What Can Be Done Further
- Acknowledgement
- License and Additional Clause
- Contact the Creator of the Project
The project aims to uncover insights into how Chinese society has changed over time by analyzing the frequency of geographic names mentioned in the news program across a decade. The analysis seeks to interpret the importance of these cities in national politics, the economy, and society.
- Describe the clear structure of the project.
- Highlight the importance of tracking contributions and ensuring project integrity.
DigitalHumanity_XinWenLianBo/
│
├── README.md # Project overview and setup instructions
├── LICENSE # The license file
│
├── src/ # Source files
│ └── web_crawler_.py # Main application script
│
├── viz/ # Files for visualization effects
│ ├── *_news.html # TF-IDF results
│ ├── barchartrace_exposure.gif # Exposure results
│ └── ProvinceExposure.html
│
├── data/
│ ├── videos_{yyyymmdd}.csv, ... # Data collected from web_crawler_.py
│ ├── news_document-distribution.csv, ... # Data generated for TF-IDF analysis
│ ├── news_document-top-topic-words.csv, ... # Data generated for TF-IDF analysis
│ └── other documents, ... # Other documents for further studies
│
├── bar_chart_race/ # Dependency package for drawing the .gif
├── XWLB_news.ipynb # Overall project notebook
├── requirements.txt # Project dependencies
└── .gitignore # Specifies intentionally untracked files to ignore
Note
Due to the limit on how many files can be uploaded at a time, some of the .csv files in data/ are hosted externally at data. The contents of viz/ are likewise stored at viz. Files not visible in the Git repository can be found there.
Here are the steps to set up this project locally:
- Reorganize the file structure as necessary.
- Ensure all dependencies are installed (e.g., `pip install -r requirements.txt`).
The code defines a function web_crawler_()
which uses a web scraping approach to collect data. This function iterates over a specified date range, constructing URLs for each day, fetching HTML content, and extracting details from video links found on the daily news pages. The extracted details include video title, description, source, time, and content. The data is temporarily stored in pandas DataFrames and saved as CSV files for each day.
As a reminder, the date range for data collection can be changed by modifying the code according to your needs.
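For illustration, here is a minimal sketch of that crawling loop, assuming `requests` and `BeautifulSoup`. The daily-page URL pattern and the HTML selectors are placeholders rather than the exact ones used in `src/web_crawler_.py`, and the sketch keeps everything from each video page in a single "Detail" column, which the merge step below then splits.

```python
# Minimal sketch of the per-day crawling loop (URL pattern and selectors are assumptions).
from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def web_crawler_(start=date(2013, 1, 1), end=date(2013, 1, 3)):
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        # Hypothetical daily index page listing that day's video links.
        url = f"http://tv.cctv.com/lm/xwlb/day/{stamp}.shtml"
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

        rows = []
        for link in soup.select("a[href]"):  # each video entry on the daily page
            video = BeautifulSoup(requests.get(link["href"], timeout=30).text, "html.parser")
            rows.append({
                "Title": link.get_text(strip=True),
                "Detail": video.get_text(" ", strip=True),  # description/source/time/content
            })

        # One CSV per day, matching the data/videos_{yyyymmdd}.csv naming scheme.
        pd.DataFrame(rows).to_csv(f"data/videos_{stamp}.csv", index=False)
        day += timedelta(days=1)
```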
This section merges the individual daily CSV files into a single DataFrame to make analysis easier. It also covers extracting information from the "Detail" field, which contains structured text describing each video's content; fuzzy matching is used to separate components such as description, source, time, and content into their own columns.
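A rough sketch of the merge-and-parse step is shown below. The notebook's actual approach relies on fuzzy matching; here a plain regular expression stands in for it, and the field labels (简介, 来源, 发布时间) are assumptions about how the "Detail" text is laid out.

```python
# Merge the daily CSVs and split the "Detail" blob into separate columns.
import glob

import pandas as pd

# Concatenate every daily CSV produced by the crawler into one DataFrame.
daily_files = sorted(glob.glob("data/videos_*.csv"))
news = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)

# Placeholder labels; the notebook uses fuzzy matching to cope with messy formatting.
pattern = (
    r"简介[:：]\s*(?P<Description>.*?)\s*"
    r"来源[:：]\s*(?P<Source>.*?)\s*"
    r"发布时间[:：]\s*(?P<Time>\S+)"
)
news = pd.concat([news, news["Detail"].str.extract(pattern)], axis=1)
```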
The analysis would involve time series analysis of text data, possibly using techniques like word frequency analysis, topic modeling (LDA), and geographical data visualization (using tools like pyecharts for mapping).
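As one concrete example of the geographic piece, the sketch below counts province mentions with `jieba` and renders a pyecharts choropleth similar to viz/ProvinceExposure.html. The shortened province list, the reuse of the "Detail" column from the sketch above, and the output file name are illustrative assumptions.

```python
# Count province mentions and draw a province-exposure map with pyecharts.
import glob

import jieba
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Map

# Reload the crawled text (same layout as in the sketch above).
news = pd.concat(
    (pd.read_csv(f) for f in sorted(glob.glob("data/videos_*.csv"))), ignore_index=True
)

# Count how often each province name appears in the tokenized text.
provinces = ["北京", "上海", "广东", "四川", "新疆"]  # shortened list for illustration
tokens = news["Detail"].dropna().map(jieba.lcut)
exposure = [(p, int(tokens.map(lambda words: words.count(p)).sum())) for p in provinces]

# Render an interactive map, akin to viz/ProvinceExposure.html (output name is hypothetical).
(
    Map()
    .add("Exposure", exposure, maptype="china")
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Province exposure in Xinwen Lianbo"),
        visualmap_opts=opts.VisualMapOpts(max_=max(v for _, v in exposure)),
    )
    .render("viz/ProvinceExposure_demo.html")
)
```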
- Graphical visualization analysis of interactions between regions
- Changes in other themes over time
We welcome further contributions.
[1] Thanks to the original author of the WeChat article 哪个城市是中央眼中的心头爱?基于新闻联播文本的大数据分析 ("Which city is the darling in the central government's eyes? A big-data analysis based on Xinwen Lianbo texts"), which made data collection easier and enabled more exciting analysis.
[2] Thanks to Jindi Mo and Xuan Li for explaining the code and building the case studies. Their contributions give this project real impact for researchers in the digital humanities.
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License - see the LICENSE file for details.
Use of this project in any form must also comply with the applicable laws and regulations of the People's Republic of China.