
GSoC Road Map & Weekly Status

Naveen Madhire edited this page Jun 28, 2015 · 16 revisions

GSoC 2015

This project is being done as part of GSoC 2015 by Naveen Madhire.

This is a sketch of the Road Map.

Milestones:

  • Identify the priority functions for regenerating Spotlight models - [Warm up/Bonding]

  • Find/Fix a Wikipedia dump parser (...Jsonpedia, Cloud9, Bliki..) - [Weeks: 1, 2, 3]

Week 1 - 25th May to 29th May

Analyzed and modified the json-wikipedia code so that it writes paragraphs and links to the output file. This will be useful downstream for calculating token counts.

Week 2 - 1st June to 5th June

This week was about finding the right parser instead of relying only on json-wikipedia for parsing the wiki dumps. I checked several wiki parsers: Sweble, JsonPedia, json-wikipedia, and Cloud9.

Below is the comparison table I created after analyzing the different Wikipedia parsers.

| Parser | License | Format | Parsing Logic | Working with Spark | Clean Text | Other | Cons | MultiLanguage |
|---|---|---|---|---|---|---|---|---|
| JsonPedia | Y - Apache 2.0 | JSON | | JSON output is readable in Spark | Y | Uses Jackson to convert to JSON | No start and end index for the links, just the link information | Claims to handle all languages |
| JsonWikipedia | N | JSON | Good parsing with templates | JSON output is readable in Spark | Needs one more method to clean text | Uses GSON to convert to JSON | | Already supports a few languages; property files have to be created to use other languages |
| Sweble | Y - Apache 2.0 | AST, Plain Text | | Yes | Y | | May need custom code to convert the AST to plain text | Yes |
| Cloud9 | Y - Apache 2.0 | Plain Text | Good parsing logic for getting the plain text from the wikitext | May work with Spark | Y | | A few functions have to be added to get the real paragraphs from the wikitext and the associated links | Looks like only Arabic, Chinese, Czech, German, Spanish, Swedish, Turkish |
| wtf_wikipedia | N | JSON | At a high level, looks like it doesn't integrate well with others | N | | | | |

Week 3 - 8th June to 12th June

Goal: implement the json-wikipedia logic by parsing the XML dump into elements of a Spark RDD.

Plan & Progress

Modified json-wikipedia to remove the boilerplate templates from the wikitext. [Json Wikipedia Code](https://github.com/naveenmadhire/json-wikipedia-dbspotlight)

Created a wikiparser.scala program that parses the XML dump and emits each article in JSON format as an element of a Spark RDD.

[Wikipedia Extractor Code](https://github.com/naveenmadhire/wikipedia-stats-extractor)
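As a rough illustration of the "one JSON article per element" idea, here is a minimal sketch in plain Python (not the project's Scala/Spark code). The sample dump, the `pages_as_json` helper, and the `title`/`wikiText` field names are all assumptions made for the example; real dumps also declare an XML namespace that this sketch leaves out.

```python
import io
import json
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia XML dump (hypothetical content).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Berlin</title>
    <revision><text>Berlin is the capital of [[Germany]].</text></revision>
  </page>
  <page>
    <title>Paris</title>
    <revision><text>Paris is the capital of [[France]].</text></revision>
  </page>
</mediawiki>"""


def pages_as_json(stream):
    """Yield one JSON string per <page> element while streaming the dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            record = {
                "title": elem.findtext("title"),
                "wikiText": elem.findtext("revision/text"),
            }
            yield json.dumps(record)
            elem.clear()  # release the children of the finished page


articles = list(pages_as_json(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
```

Each string in `articles` plays the role of one RDD element; in Spark the same per-page function would be applied in parallel instead of in a single loop.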

The JSON output of json-wikipedia contains Unicode escape sequences, for example "\u0027" (an apostrophe). These need to be converted to regular characters after parsing in Scala.
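For illustration only (Python, not the Scala code used in the project): a standard JSON parser decodes such escape sequences when it reads a string, so the conversion may not need custom code. The sample record below is made up.

```python
import json

# A JSON record as it might appear in the parser output; the apostrophe is
# stored as the escape sequence \u0027 (sample text is hypothetical).
raw = '{"text": "Berlin\\u0027s population"}'

decoded = json.loads(raw)
print(decoded["text"])  # prints: Berlin's population
```

Whether the same holds in the Scala pipeline depends on which JSON library reads the output, which is an assumption to verify there.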

Changed json-wikipedia to fetch the redirect information from a class instead of from individual per-language property files.

Currently working on fetching the category link information from a class instead of from property files.

Task moved to next week: fix parsing issues with json-wikipedia.

Week 4 - 15th June to 19th June

Learning from previous weeks

Over the last 3 weeks, the main learning was understanding wiki parsing and changing it to suit the needs of DBpedia Spotlight models.

Plan: fix the parsing issues with json-wikipedia and write test cases for the parsing logic. I will use a small wiki dataset to test and verify the whole parsing logic.

Week 5 - 22nd June to 26th June

Fixed most of the parsing issues in the json-wikipedia code. Next step: use DataFrames to read the parsed RDDs so that the various counts can be calculated later on.

Progress

This week I worked on fixing the json-wikipedia parsing issues:

  1. Removed the reference tags from the article text
  2. Fixed language identifiers
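The first fix above (removing reference tags) can be pictured with a small regex-based sketch in Python. This is not the json-wikipedia implementation; the `strip_refs` helper and the sample text are hypothetical, and a regex like this only covers the common `<ref>...</ref>` and `<ref ... />` forms.

```python
import re

# Matches self-closing <ref ... /> tags and paired <ref ...>...</ref> tags.
REF_TAG = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>",
    re.DOTALL | re.IGNORECASE,
)


def strip_refs(wikitext):
    """Return the wikitext with reference tags and their contents removed."""
    return REF_TAG.sub("", wikitext)


sample = 'Berlin<ref name="pop">Census 2011</ref> is large.<ref name="pop" /> Done.'
cleaned = strip_refs(sample)
```

Here `cleaned` is `"Berlin is large. Done."`; the `\b` after `ref` keeps tags such as `<references/>` untouched.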

The URI counts logic has been implemented in Scala and Spark here.
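A minimal sketch of what a URI count could look like, written in plain Python rather than the Scala/Spark of the actual implementation. It assumes that "URI count" means the number of times each article is the target of a wiki link; the `uri_counts` helper, the link regex, and the sample articles are illustrative only.

```python
import re
from collections import Counter

# Matches [[Target]] and [[Target|label]]; the link target is group 1.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def uri_counts(articles):
    """Count how often each link target occurs across all article texts."""
    counts = Counter()
    for text in articles:
        for match in WIKI_LINK.finditer(text):
            # Normalize "New York" style targets to "New_York".
            target = match.group(1).strip().replace(" ", "_")
            counts[target] += 1
    return counts


articles = [
    "[[Berlin]] is the capital of [[Germany]].",
    "[[Germany|The country]] borders [[France]].",
]
counts = uri_counts(articles)
```

With the two sample articles, `counts["Germany"]` is 2 while `Berlin` and `France` are each counted once.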

Challenges and Learning

  1. Implementing the logic using Scala and Spark
  2. Ran into a few errors during testing and was able to resolve most of them.

Week 6 - 29th June to 3rd July

I will spend this week testing with a slightly bigger wiki dump and will start working on surface form counts. I will also write a report at the end of the week for the mid-term review.
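To make the surface form counts concrete, here is a small Python sketch (not project code). It assumes the input is a list of (surface form, link target) pairs taken from wiki links, and it counts only linked occurrences; the sample pairs are invented.

```python
from collections import Counter

# (surface form, link target) pairs as they might come out of the parser.
links = [
    ("Berlin", "Berlin"),
    ("the German capital", "Berlin"),
    ("Berlin", "Berlin_(band)"),
    ("Berlin", "Berlin"),
]

# How often each exact (surface form, target) combination occurs.
pair_counts = Counter(links)

# How often each surface form is used as link text, whatever the target.
surface_form_counts = Counter(surface_form for surface_form, _ in links)
```

In this sample, "Berlin" appears three times as a surface form, two of which point to the `Berlin` article.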

  • Re-write priority functions - [Weeks: 4, 5, 6]:
    • Entity Counts
    • Surface Form Counts
    • Pair Counts
    • Token Counts
  • Resolve DBpedia identifiers (i.e., resolve redirects) - [Weeks: 7]
  • Re-write additional non-priority functions - [Weeks: 8]
  • Update scripts to generate Spotlight models (Quickstarter, Dbpedia-Spotlight/bin) - [Weeks: 9]
  • SF discount (better automation, optional) - [Weeks: 10, 11]
  • Generate new models for supported Languages - [Weeks: 12]
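The "resolve redirects" milestone in the list above could be sketched as follows, in Python and purely as an illustration. The `resolve_redirect` helper, the `max_hops` limit, and the sample redirect table are assumptions, not the project's design.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirect links until a non-redirect title is reached.

    Stops early on a cycle or after max_hops steps so it always terminates.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Hypothetical redirect table: source title -> target title.
redirects = {
    "Big_Apple": "NYC",
    "NYC": "New_York_City",
    "A": "B",  # A and B redirect to each other (a cycle)
    "B": "A",
}

canonical = resolve_redirect("Big_Apple", redirects)
```

Here `canonical` is `"New_York_City"`. Counts keyed by a redirecting title could then be added to the count of its resolved title.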

Weeks

  • 1-week - May 25th-29th
  • 2-week - June 1st-5th
  • 3-week - June 8th-12th
  • 4-week - June 15th-19th
  • 5-week - June 22nd-26th
  • 6-week - June 29th-July 3rd
  • 7-week - July 6th-10th
  • 8-week - July 13th-17th
  • 9-week - July 20th-24th
  • 10-week - July 27th-31st
  • 11-week - August 3rd-7th
  • 12-week - August 10th-14th
  • 13-week - August 17th-21st
  • August 22nd - submit code
  • September 25th - Soft deadline