Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added language detection #42

Merged
merged 4 commits into from
Sep 19, 2017
Merged

Added language detection #42

merged 4 commits into from
Sep 19, 2017

Conversation

mcarmonaa
Copy link
Contributor

Currently is needed to have enry-java in your local ivi2 repo.

@mcarmonaa mcarmonaa requested review from ajnavarro and bzz September 19, 2017 07:31
@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

Awesome 🎉 I supposes, this addresses #6
Let me try it out locally and get back to you later today.

CI fails on missing enry-java dependency.

sbt.ResolveException: unresolved dependency: tech.sourced#enry-java;1.0: not found

We can not use jitpack for enry-java as the default build would most probably not be a fatjar one, with all the native dependencies. Until src-d/enry#102 is resolved, you could copay and add an existing helper script ./local-install-enry-java.sh to .travis.yml as a workaround.

@mcarmonaa in the meanwhile, could you please double-check that the exposed API is the same as in the design document?

@@ -14,6 +15,12 @@ trait BaseSparkSpec extends BeforeAndAfterAll {
.appName("test").master("local[*]")
.config("spark.driver.host", "localhost")
.getOrCreate()

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's still useful? Right now, logging for tests is muted and configured though log4j.properties

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it doesn't seem so much useful now, I'll remove these lines.

Logger.getLogger("akka").setLevel(Level.OFF)

import Implicits._
ss.registerUDFs()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but is there a reason we want to expose this to clients?
Are there any downsides in this, being always done by Implicits, so user does not need to do that ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the spark-session is an object outside our library, and Implicits allow us augment functionality of some classes performing conversions from those classes to our own defined classes, there's no way to register automatically the udfs just importing Implicits. So is needed to do it by hand before use them.

@erizocosmico
Copy link
Contributor

Maybe we can get enry via jitpack for now. It "should" work after this, I think src-d/enry#98

@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

👍 @erizocosmico I guess it also depends, if jitpack uses sbt assembly or not. Definitely worth a try!

@erizocosmico
Copy link
Contributor

@bzz ah, right, then probably it doesn't work. We might need to look into making the assembly build with the normal packaging task

@mcarmonaa
Copy link
Contributor Author

I performed changes to follow the design document, so now the exported method is classifyLanguages, and I removed the logger lines too.


def classifyLanguages: DataFrame = {
Implicits.checkCols(df, "file_hash")
df.withColumn("lang", ClassifyLanguagesUDF.function(col("is_binary"), col("path"), col("content")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why file_hash column is checked to be present in df, rather then is_binary, path and content that are actually used below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!, fixing in progress


private def getLanguage(isBinary: Boolean, path: String, content: Array[Byte]): Option[String] = isBinary match {
case false => {
val lang = Enry.getLanguage(path, content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a nitpick, but do you think

... = isBinay match {
  case true => // ...
  case false => None
}

has advantages over

... = if (isBinary) {
  // ...
 } else {
  None
 }

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't, I just wanted to use new scala patterns I learnt, but probably is more readable use an if expression here. I'll change it

@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

@mcarmonaa nice job! Few more comments above.

Sorry, must have missed where another comment was addressed - do you see any reason not to make CI pass with .sh script?

@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

@mcarmonaa Regarding now hidden discussion

ss.registerUDFs()

Do you think this approach f31b1e5 would not work for some reason?

I think it may be reasonable to have a convention of UDFs registering themselves, thus removing this "cognitive burden" from the API users.

@mcarmonaa
Copy link
Contributor Author

@bzz about f31b1e5, Is it ok register the udf in the catalog each time you use it?

@codecov
Copy link

codecov bot commented Sep 19, 2017

Codecov Report

Merging #42 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master      #42      +/-   ##
============================================
- Coverage     83.42%   83.42%   -0.01%     
- Complexity       31       32       +1     
============================================
  Files            13       14       +1     
  Lines           362      374      +12     
  Branches         62       65       +3     
============================================
+ Hits            302      312      +10     
  Misses           27       27              
- Partials         33       35       +2
Impacted Files Coverage Δ Complexity Δ
...n/scala/tech/sourced/api/customudf/CustomUDF.scala 85.71% <85.71%> (ø) 1 <1> (?)
src/main/scala/tech/sourced/api/Implicits.scala 81.25% <88.88%> (-0.24%) 0 <0> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c4d85f...adc9e15. Read the comment docs.

@mcarmonaa
Copy link
Contributor Author

mcarmonaa commented Sep 19, 2017

As talking irl with @bzz, by the moment

ss.registerUDFs()

still will be there.

The rest of the feedback has been addressed

@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

@mcarmonaa awesome work! Let's wait for CI to become green CI is green now, LGTM to merge and I'll be happy to rebase bblfsh part in #41 to adhere the same conventions for UDFs.

@bzz
Copy link
Contributor

bzz commented Sep 19, 2017

and 👍 for having nice tests!

@bzz bzz mentioned this pull request Sep 19, 2017
5 tasks
@bzz bzz requested a review from erizocosmico September 19, 2017 12:11
Copy link
Contributor

@erizocosmico erizocosmico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than these two minor things, LGTM, great job! 🎉

build.sbt Outdated

resolvers += "jitpack" at "https://jitpack.io",
resolvers += "Local Ivy repository" at "file://"+Path.userHome.absolutePath+"/.ivy2/repository",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a TODO or FIXME to explain why this is here, just in case we forget to delete it after enry is published





Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is, I'll remove it

@bzz bzz merged commit b2ca969 into src-d:master Sep 19, 2017
@mcarmonaa mcarmonaa deleted the feature/enry-languages branch September 20, 2017 10:44
@bzz bzz mentioned this pull request Sep 21, 2017
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants