GSoC Road Map & Weekly Status
This project is being done as part of GSoC 2015 by Naveen Madhire
This is a sketch of the Road Map.
Milestones:
- Identify functions that are a priority for re-generating the Spotlight models - [Warm-up/Bonding]
- Find/fix a Wikipedia dump parser (JsonPedia, Cloud9, Bliki, ...) - [Weeks: 1, 2, 3]
Modified and analyzed the JSON-wikipedia code to output paragraphs and links in the output file. This will be useful downstream for calculating token counts.
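As a rough illustration of how the extracted paragraphs feed the token counts downstream, here is a minimal sketch (the object and method names are hypothetical, not from the project code, and the tokenizer is deliberately naive):

```scala
// Minimal sketch (hypothetical, not the project code): count token
// occurrences in one extracted paragraph, the basic building block of
// the token-counts step.
object TokenCountSketch {
  def tokenCounts(paragraph: String): Map[String, Int] =
    paragraph.toLowerCase
      .split("""\W+""")            // naive tokenizer; a real pipeline does more
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (tok, occs) => (tok, occs.length) }

  def main(args: Array[String]): Unit = {
    val counts = tokenCounts("Berlin is the capital of Germany. Berlin is large.")
    println(counts("berlin"))  // 2
  }
}
```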
This week was about finding the right parser, rather than relying on JSON-wikipedia alone for parsing the wiki dumps. I checked different wiki parsers: Sweble, JsonPedia, JSON-wikipedia, and Cloud9.
Below is the comparison table I created after analyzing the different Wikipedia parsers.
Parser | License | Format | Parsing Logic | Working with Spark | Clean Text | Other | Cons | Multi-language
---|---|---|---|---|---|---|---|---
JsonPedia | Y - Apache 2.0 | JSON | | Y - JSON output readable in Spark | Y | Uses Jackson to convert to JSON | No start/end index for the links, just the link information | Claims to handle all languages
JsonWikipedia | N | JSON | Good parsing with templates | Y - JSON output readable in Spark | Need to add one more method to clean the text | Uses GSON to convert to JSON | | Supports only a few languages already; one has to create property files for new languages
Sweble | Y - Apache 2.0 | AST, plain text | Yes | Y | May have to write custom code to convert the AST to plain text | | | Yes
Cloud9 | Y - Apache 2.0 | Plain text | Good parsing logic for getting plain text from the wikitext | May work with Spark | Y | | Have to add a few functions to get the real paragraphs and the associated links from the wikitext | Looks like only Arabic, Chinese, Czech, German, Spanish, Swedish, Turkish
wtf_wikipedia | N | JSON | Only at a high level | N | | | Looks like it doesn't integrate well with other tools |
Next: implement the JSON-wikipedia logic by parsing the XML dump into elements of a Spark RDD.
Plan & Progress
Modified JSON-wikipedia to remove the boilerplate templates from the wiki text. [Json Wikipedia Code](/~https://github.com/naveenmadhire/json-wikipedia-dbspotlight)
Created a wikiparser.scala program to parse the XML dump and produce individual articles in JSON format as elements of a Spark RDD.
[Wikipedia Extractor Code](/~https://github.com/naveenmadhire/wikipedia-stats-extractor)
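The idea behind wikiparser.scala can be sketched as follows. This is a simplified stand-in, not the actual project code: it assumes a dump small enough to read whole, whereas a real implementation would use a streaming XML input format instead of `wholeTextFiles`. All names and the input path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch (not the actual wikiparser.scala): read a small XML
// dump, split it into <page>...</page> fragments, and expose each article
// as one element of a Spark RDD for downstream JSON conversion.
object WikiParserSketch {
  // Extract every <page>...</page> block from the dump text.
  def splitPages(dump: String): Seq[String] =
    """(?s)<page>.*?</page>""".r.findAllIn(dump).toSeq

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wiki-parser-sketch").setMaster("local[*]"))
    val pages = sc.wholeTextFiles("path/to/small-dump.xml")
      .flatMap { case (_, content) => splitPages(content) }
    println(pages.count())
    sc.stop()
  }
}
```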
The JSON output of Json-wikipedia contains a few Unicode escape sequences, for example "\u0027". These need to be converted to regular characters after parsing in Scala.
Made changes to Json-wikipedia to fetch the redirect information from a class instead of individual language-specific property files.
Currently working on fetching the category-link information from a class instead of property files.
Task moved to next week: fix parsing issues with Json-wikipedia.
Learning from previous weeks: over the last three weeks, the main learning was understanding wiki parsing and making changes to suit the needs of the DBpedia Spotlight models.
Plan: fix the remaining parsing issues with Json-wikipedia and write test cases for the parsing logic. I will use a small wiki dataset to test and verify the whole parsing pipeline.
Fixed most of the parsing issues in the Json-wikipedia code. Next: implement DataFrames for reading the parsed RDDs, so the various counts can be calculated later on.
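The DataFrame idea can be sketched as follows, using the Spark 1.x-era `SQLContext` API that was current in 2015. The object name and the toy JSON records are hypothetical, not from the project:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: load parsed per-article JSON strings as a DataFrame so the
// later count jobs can use SQL-style operations on them.
object JsonToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("json-df-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    // Toy stand-ins for the Json-wikipedia output records
    val jsonArticles = sc.parallelize(Seq(
      """{"title": "Berlin", "links": ["Germany"]}""",
      """{"title": "Hamburg", "links": ["Germany", "Elbe"]}"""))
    val df = sqlContext.read.json(jsonArticles)  // infers the schema
    df.printSchema()
    println(df.count())  // 2
    sc.stop()
  }
}
```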
Progress
This week I worked on fixing the Json-wikipedia parsing issues:
- Removed the reference tags from the article text
- Fixed language identifiers
The URI counts logic has been implemented in Scala and Spark here.
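At its core, the URI counts step is a word-count over link targets. A sketch with toy data (the object name and input records are hypothetical, not the actual implementation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the URI-counts logic: count how often each wiki link target
// (URI) occurs across all parsed articles. The input is a toy stand-in
// for the parsed Json-wikipedia output.
object UriCountsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("uri-counts-sketch").setMaster("local[*]"))
    val articleLinks = sc.parallelize(Seq(
      ("Article_A", Seq("Berlin", "Germany", "Berlin")),
      ("Article_B", Seq("Berlin"))))
    val uriCounts = articleLinks
      .flatMap { case (_, uris) => uris }  // all link targets, with repeats
      .map(uri => (uri, 1))
      .reduceByKey(_ + _)                  // classic word-count
    uriCounts.collect().sorted.foreach(println)  // (Berlin,3), (Germany,1)
    sc.stop()
  }
}
```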
Challenges and Learning
- Implementation using Scala and Spark
- Faced a few errors during testing and was able to overcome most of them.
I will spend this week testing with a somewhat larger wiki dump and start working on surface form counts. I will also write a report at the end of the week for the mid-term review.
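The planned surface form counts (and the related pair counts from the milestone list) follow the same pattern as the URI counts, but are keyed on the anchor text of each link. A sketch with toy data (names and records hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: count surface forms (link anchor texts) and (surfaceForm, uri)
// pairs from parsed links. The toy input stands in for the real dump.
object SurfaceFormCountsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sf-counts-sketch").setMaster("local[*]"))
    // (surfaceForm, targetUri) pairs extracted from article links
    val links = sc.parallelize(Seq(
      ("Berlin", "Berlin"),
      ("the German capital", "Berlin"),
      ("Berlin", "Berlin")))
    val sfCounts   = links.map { case (sf, _) => (sf, 1) }.reduceByKey(_ + _)
    val pairCounts = links.map(pair => (pair, 1)).reduceByKey(_ + _)
    sfCounts.collect().foreach(println)    // e.g. (Berlin,2)
    pairCounts.collect().foreach(println)  // e.g. ((Berlin,Berlin),2)
    sc.stop()
  }
}
```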
- Rewrite priority functions - [Weeks: 4, 5, 6]:
- Entity Counts
- Surface Form Counts
- Pair Counts
- Token Counts
- Resolve DBpedia identifiers (i.e., resolve redirects) - [Week: 7]
- Rewrite additional non-priority functions - [Week: 8]
- Update Scripts to generate Spotlight Models (Quickstarter, Dbpedia-Spotlight/bin) - [Weeks: 9]
- SF discount (better automation) [Optional] - [Weeks: 10, 11]
- Generate new models for supported Languages - [Weeks: 12]
- 1-week - May 25th-29th
- 2-week - June 1st-5th
- 3-week - June 8th-12th
- 4-week - June 15th-19th
- 5-week - June 22nd-26th
- 6-week - June 29th-July 3rd
- 7-week - July 6th-10th
- 8-week - July 13th-17th
- 9-week - July 20th-24th
- 10-week - July 27th-31st
- 11-week - August 3rd-7th
- 12-week - August 10th-14th
- 13-week - August 17th-21st
- August 22nd - submit code
- September 25th - Soft deadline