GSoC Road Map & Weekly Status
This project is being done as part of GSoC 2015 by Naveen Madhire
This is a sketch of the Road Map.
Milestones:
- Identify functions that are a priority for re-generating the Spotlight models - [Warm-up/Bonding]
- Find/fix a Wikipedia dump parser (JsonPedia, Cloud9, Bliki, ...) - [Weeks: 1, 2, 3]
Modified and analyzed the JSON-wikipedia code to output paragraphs and links in the output file. This will be useful downstream for calculating token counts.
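As a rough illustration of how the extracted paragraphs feed the token counts downstream, here is a minimal sketch (the object and method names are hypothetical, not from the project code, and the tokenizer is deliberately naive):

```scala
// Minimal sketch (hypothetical, not the project code): count token
// occurrences in one extracted paragraph, the basic building block of
// the token-counts step.
object TokenCountSketch {
  def tokenCounts(paragraph: String): Map[String, Int] =
    paragraph.toLowerCase
      .split("""\W+""")            // naive tokenizer; a real pipeline does more
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (tok, occs) => (tok, occs.length) }

  def main(args: Array[String]): Unit = {
    val counts = tokenCounts("Berlin is the capital of Germany. Berlin is large.")
    println(counts("berlin"))  // 2
  }
}
```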
This week was about finding the right parser, rather than relying on JSON-wikipedia alone for parsing the wiki dumps. I checked different wiki parsers: Sweble, JsonPedia, JSON-wikipedia, and Cloud9.
Below is the comparison table I created after analyzing the different Wikipedia parsers.
Parser | License | Format | Parsing Logic | Working with Spark | Clean Text | Other | Cons | Multi-language
---|---|---|---|---|---|---|---|---
JsonPedia | Y - Apache 2.0 | JSON | | Y - JSON output readable in Spark | Y | Uses Jackson to convert to JSON | No start/end index for the links, just the link information | Claims to handle all languages
JsonWikipedia | N | JSON | Good parsing with templates | Y - JSON output readable in Spark | Need to add one more method to clean the text | Uses GSON to convert to JSON | | Supports only a few languages already; one has to create property files for new languages
Sweble | Y - Apache 2.0 | AST, plain text | Yes | Y | May have to write custom code to convert the AST to plain text | | | Yes
Cloud9 | Y - Apache 2.0 | Plain text | Good parsing logic for getting plain text from the wikitext | May work with Spark | Y | | Have to add a few functions to get the real paragraphs and the associated links from the wikitext | Looks like only Arabic, Chinese, Czech, German, Spanish, Swedish, Turkish
wtf_wikipedia | N | JSON | Only at a high level | N | | | Looks like it doesn't integrate well with other tools |
Next: implement the JSON-wikipedia logic by parsing the XML dump into elements of a Spark RDD.
Plan & Progress
Modified JSON-wikipedia to remove the boilerplate templates from the wiki text. [Json Wikipedia Code](/~https://github.com/naveenmadhire/json-wikipedia-dbspotlight)
Created a wikiparser.scala program to parse the XML dump and produce individual articles in JSON format as elements of a Spark RDD.
[Wikipedia Extractor Code](/~https://github.com/naveenmadhire/wikipedia-stats-extractor)
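The idea behind wikiparser.scala can be sketched as follows. This is a simplified stand-in, not the actual project code: it assumes a dump small enough to read whole, whereas a real implementation would use a streaming XML input format instead of `wholeTextFiles`. All names and the input path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch (not the actual wikiparser.scala): read a small XML
// dump, split it into <page>...</page> fragments, and expose each article
// as one element of a Spark RDD for downstream JSON conversion.
object WikiParserSketch {
  // Extract every <page>...</page> block from the dump text.
  def splitPages(dump: String): Seq[String] =
    """(?s)<page>.*?</page>""".r.findAllIn(dump).toSeq

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wiki-parser-sketch").setMaster("local[*]"))
    val pages = sc.wholeTextFiles("path/to/small-dump.xml")
      .flatMap { case (_, content) => splitPages(content) }
    println(pages.count())
    sc.stop()
  }
}
```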
The JSON output of Json-wikipedia contains a few Unicode escape sequences, for example "\u0027". These need to be converted to regular characters after parsing in Scala.
Made changes to Json-wikipedia to fetch the redirect information from a class instead of individual language-specific property files.
Currently working on fetching the category-link information from a class instead of property files.
Task moved to next week: fix parsing issues with Json-wikipedia.
Learning from previous weeks: over the last three weeks, the main learning was understanding wiki parsing and making changes to suit the needs of the DBpedia Spotlight models.
Plan: fix the remaining parsing issues with Json-wikipedia and write test cases for the parsing logic. I will use a small wiki dataset to test and verify the whole parsing pipeline.
Fixed most of the parsing issues in the Json-wikipedia code. Next: implement DataFrames for reading the parsed RDDs, so the various counts can be calculated later on.
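The DataFrame idea can be sketched as follows, using the Spark 1.x-era `SQLContext` API that was current in 2015. The object name and the toy JSON records are hypothetical, not from the project:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: load parsed per-article JSON strings as a DataFrame so the
// later count jobs can use SQL-style operations on them.
object JsonToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("json-df-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    // Toy stand-ins for the Json-wikipedia output records
    val jsonArticles = sc.parallelize(Seq(
      """{"title": "Berlin", "links": ["Germany"]}""",
      """{"title": "Hamburg", "links": ["Germany", "Elbe"]}"""))
    val df = sqlContext.read.json(jsonArticles)  // infers the schema
    df.printSchema()
    println(df.count())  // 2
    sc.stop()
  }
}
```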
Progress
This week I worked on fixing the Json-wikipedia parsing issues:
- Removed the reference tags from the article text
- Fixed language identifiers
The URI counts logic has been implemented in Scala and Spark here.
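At its core, the URI counts step is a word-count over link targets. A sketch with toy data (the object name and input records are hypothetical, not the actual implementation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the URI-counts logic: count how often each wiki link target
// (URI) occurs across all parsed articles. The input is a toy stand-in
// for the parsed Json-wikipedia output.
object UriCountsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("uri-counts-sketch").setMaster("local[*]"))
    val articleLinks = sc.parallelize(Seq(
      ("Article_A", Seq("Berlin", "Germany", "Berlin")),
      ("Article_B", Seq("Berlin"))))
    val uriCounts = articleLinks
      .flatMap { case (_, uris) => uris }  // all link targets, with repeats
      .map(uri => (uri, 1))
      .reduceByKey(_ + _)                  // classic word-count
    uriCounts.collect().sorted.foreach(println)  // (Berlin,3), (Germany,1)
    sc.stop()
  }
}
```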
Challenges and Learning
- Implementation using Scala and Spark
- Faced a few errors during testing and was able to overcome most of them.
I will spend this week testing with a somewhat larger wiki dump and start working on surface form counts. I will also write a report at the end of the week for the mid-term review.
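The planned surface form counts (and the related pair counts from the milestone list) follow the same pattern as the URI counts, but are keyed on the anchor text of each link. A sketch with toy data (names and records hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: count surface forms (link anchor texts) and (surfaceForm, uri)
// pairs from parsed links. The toy input stands in for the real dump.
object SurfaceFormCountsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sf-counts-sketch").setMaster("local[*]"))
    // (surfaceForm, targetUri) pairs extracted from article links
    val links = sc.parallelize(Seq(
      ("Berlin", "Berlin"),
      ("the German capital", "Berlin"),
      ("Berlin", "Berlin")))
    val sfCounts   = links.map { case (sf, _) => (sf, 1) }.reduceByKey(_ + _)
    val pairCounts = links.map(pair => (pair, 1)).reduceByKey(_ + _)
    sfCounts.collect().foreach(println)    // e.g. (Berlin,2)
    pairCounts.collect().foreach(println)  // e.g. ((Berlin,Berlin),2)
    sc.stop()
  }
}
```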
- Rewrite priority functions - [Weeks: 4, 5, 6]:
- Entity Counts
- Surface Form Counts
- Pair Counts
- Token Counts
- Resolve DBpedia identifiers (i.e., resolve redirects) - [Week: 7]
- Rewrite additional non-priority functions - [Week: 8]
- Update Scripts to generate Spotlight Models (Quickstarter, Dbpedia-Spotlight/bin) - [Weeks: 9]
- SF discount (better automation) [Optional] - [Weeks: 10, 11]
- Generate new models for supported Languages - [Weeks: 12]
- 1-week - May 25th-29th
- 2-week - June 1st-5th
- 3-week - June 8th-12th
- 4-week - June 15th-19th
- 5-week - June 22nd-26th
- 6-week - June 29th-July 3rd
- 7-week - July 6th-10th
- 8-week - July 13th-17th
- 9-week - July 20th-24th
- 10-week - July 27th-31st
- 11-week - August 3rd-7th
- 12-week - August 10th-14th
- 13-week - August 17th-21st
- August 22nd - submit code
- September 25th - Soft deadline