-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added language detection #42
Conversation
Awesome 🎉 I supposes, this addresses #6 CI fails on missing enry-java dependency.
We can not use jitpack for enry-java as the default build would most probably not be a fatjar one, with all the native dependencies. Until src-d/enry#102 is resolved, you could copay and add an existing helper script ./local-install-enry-java.sh to @mcarmonaa in the meanwhile, could you please double-check that the exposed API is the same as in the design document? |
@@ -14,6 +15,12 @@ trait BaseSparkSpec extends BeforeAndAfterAll { | |||
.appName("test").master("local[*]") | |||
.config("spark.driver.host", "localhost") | |||
.getOrCreate() | |||
|
|||
Logger.getLogger("org").setLevel(Level.OFF) | |||
Logger.getLogger("akka").setLevel(Level.OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's still useful? Right now, logging for tests is muted and configured though log4j.properties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it doesn't seem so much useful now, I'll remove these lines.
Logger.getLogger("akka").setLevel(Level.OFF) | ||
|
||
import Implicits._ | ||
ss.registerUDFs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but is there a reason we want to expose this to clients?
Are there any downsides in this, being always done by Implicits
, so user does not need to do that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the spark-session is an object outside our library, and Implicits
allow us augment functionality of some classes performing conversions from those classes to our own defined classes, there's no way to register automatically the udfs just importing Implicits
. So is needed to do it by hand before use them.
Maybe we can get enry via jitpack for now. It "should" work after this, I think src-d/enry#98 |
👍 @erizocosmico I guess it also depends, if jitpack uses |
@bzz ah, right, then probably it doesn't work. We might need to look into making the assembly build with the normal packaging task |
I performed changes to follow the design document, so now the exported method is |
|
||
def classifyLanguages: DataFrame = { | ||
Implicits.checkCols(df, "file_hash") | ||
df.withColumn("lang", ClassifyLanguagesUDF.function(col("is_binary"), col("path"), col("content"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why file_hash
column is checked to be present in df, rather then is_binary
, path
and content
that are actually used below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops!, fixing in progress
|
||
private def getLanguage(isBinary: Boolean, path: String, content: Array[Byte]): Option[String] = isBinary match { | ||
case false => { | ||
val lang = Enry.getLanguage(path, content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a nitpick, but do you think
... = isBinay match {
case true => // ...
case false => None
}
has advantages over
... = if (isBinary) {
// ...
} else {
None
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't, I just wanted to use new scala patterns I learnt, but probably is more readable use an if
expression here. I'll change it
@mcarmonaa nice job! Few more comments above. Sorry, must have missed where another comment was addressed - do you see any reason not to make CI pass with .sh script? |
@mcarmonaa Regarding now hidden discussion
Do you think this approach f31b1e5 would not work for some reason? I think it may be reasonable to have a convention of UDFs registering themselves, thus removing this "cognitive burden" from the API users. |
Codecov Report
@@ Coverage Diff @@
## master #42 +/- ##
============================================
- Coverage 83.42% 83.42% -0.01%
- Complexity 31 32 +1
============================================
Files 13 14 +1
Lines 362 374 +12
Branches 62 65 +3
============================================
+ Hits 302 312 +10
Misses 27 27
- Partials 33 35 +2
Continue to review full report at Codecov.
|
As talking irl with @bzz, by the moment
still will be there. The rest of the feedback has been addressed |
@mcarmonaa awesome work! |
and 👍 for having nice tests! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than these two minor things, LGTM, great job! 🎉
build.sbt
Outdated
|
||
resolvers += "jitpack" at "https://jitpack.io", | ||
resolvers += "Local Ivy repository" at "file://"+Path.userHome.absolutePath+"/.ivy2/repository", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a TODO or FIXME to explain why this is here, just in case we forget to delete it after enry is published
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is, I'll remove it
Currently is needed to have enry-java in your local ivi2 repo.