-
Notifications
You must be signed in to change notification settings - Fork 953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement @turf/clusters-dbscan
module
#812
Conversation
@DenisCarriere how can I help?
I believe including the centroids in the output might result actually useful as sometimes you are interested in those points and calculating them while processing the points is faster than doing it afterwards. |
👍 Agreed about including the As for help, when the cluster ID's are being associated, it only matches the first found and then skips all the other ones. What would need to happen is to sort/group the clusters by closest distance (not the first match). |
@DenisCarriere I don't know if this is exactly the issue you were referring to, but I feel there is something not right with the approach we are using for this clustering. The problem I think is that for the |
Yes this is the issue I was referencing, I actually used What if we define the That way the points 2 & 3 would have this as a param This approach might help solve any ambiguous cluster matching. Points that only have 1 cluster would still be an Array except with only 1 number. @stebogit thoughts? |
By the way... Thanks @mourner for another great library 👍 (supercluster) |
@DenisCarriere I think we should better define the expected result of the module. What if we assigned the I have no idea if this would be of any utility though. |
Well how would one go about to create clusters based on a distance? Right now we can create clusters based on No clue what the perfect outcome should be, however I'm sure we can come up with a creative way to make such module. |
removed id attribute from output points;
@@ -25,17 +25,12 @@ module.exports = function (points, maxDistance) { | |||
collectionOf(points, 'Point', 'Input must contain Points'); | |||
|
|||
// Create index | |||
const load = points.features.map(function (point, index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Love it! I think my first approached used the id
fields to do the matching, however I might of dropped that workflow mid way and forgot to remove the .map()
.
@stebogit Good catch!
* points-with-properties: 0.164ms | ||
* points1: 0.087ms | ||
* points2: 0.694ms | ||
* fiji x 1,320,371 ops/sec ±1.72% (80 runs sampled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚗 💨 Zoom Zoom!
@DenisCarriere I've been thinking a lot about this clustering problem and I keep wondering: Should we add a (required) In my research on clustering methods I keep seeing these three as most popular (i.e. I guess mostly used/useful) methods:
|
Thanks @stebogit for that initial research. For me, my initial intent with this distance cluster would most likely reflect Using this approach would probably solve your question about (
@stebogit Are you on the same page if we use the DBSCAN approach for CC: @morganherlocker & @mourner Feel free to shim in anytime if you feel this is going in the wrong direction. |
@stebogit I couldn't help myself to get this done in one night. Check out the new implementation, it's pretty solid and it uses some of the DBSCAN logic (minus not providing noise points - they are simply excluded). Next Steps
Many Points
Points 2
|
👍 👏 @DenisCarriere I'll take a look at this tomorrow. Quick thought, we could call this module |
👍 For 👍 renaming |
The reason why I wouldn't call it We can always improve on this at a later date... |
@DenisCarriere why don't we return this instead: return {
points: FeatureCollection<Points>, // with `clusterId` property
edges: Array<Array<number>>, // edges ids
centroids: Array<Array<number>>, // centroids ids
noise: Array<Array<number>> // noise ids
}; This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type. Edit: |
noise.push(noisePoint); | ||
}); | ||
|
||
return { | ||
points: featureCollection(newPoints), | ||
edeges: featureCollection(edges), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo:
- edeges: featureCollection(edges),
+ edges: featureCollection(edges),
We just recently published that module... We change it to |
This is already deviating A LOT from the main purpose of TurfJS, all of the inputs/outputs should be in GeoJSON. We shouldn't even be pushing out an Array of FeatureCollection (if we don't have too). |
I like simply including the Quick update on the Typescript definition interface Point extends GeoJSON.Feature<GeoJSON.Point> {
properties: {
dbscan?: 'core' | 'edge' | 'noise';
[key: string]: any;
}
}
interface Output {
type: 'FeatureCollection'
features: Point[];
} |
@stebogit Next commit should include a lot of changes, have a review:
CC: @stebogit |
- Update typescript definition - Only output single FeatureCollection - Tag properties 'core' | 'edge' | 'noise' - Change minPoint default to 3
These types of modules are probably better to be abstracted out of TurfJS (only dealing with 2D points) and afterwards TurfJS would wrap it to simplify the GeoJSON integration. As a good example, @mourner's library are mainly all 2D ( |
// handle noise points, if any | ||
// Skip Noise if cluster is already associated | ||
// This might be a slight deviation of DBSCAN (or a bug in the library) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd drop this (since it's not a bug) and maybe add an explanation like
// edges points are tagged by DBSCAN as both 'noise' and 'cluster' as they can "reach" less than 'minPoints' number of points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
point.properties['marker-symbol'] = 'circle-stroked'; | ||
point.properties['marker-size'] = 'medium'; | ||
points.push(point); | ||
featureEach(clustered, function (point) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd change this to:
switch (point.properties.dbscan) {
case 'core':
case 'edge': {
const coreColor = colours[point.properties.cluster];
const edgeColor = chromatism.brightness(-20, colours[point.properties.cluster]).hex;
point.properties['marker-color'] = (point.properties.dbscan === 'core') ? coreColor : edgeColor;
point.properties['marker-size'] = 'small';
points.push(point);
break;
}
case 'noise': {
point.properties['marker-color'] = '#AEAEAE';
point.properties['marker-symbol'] = 'circle-stroked';
point.properties['marker-size'] = 'medium';
points.push(point);
}
}
as edges
are not really a major feature to highlight.
But it's just a finesse 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol Yep! Looks good.
I really like this Switch statement, makes it really easy to control those clustered points.
@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection; | |||
* @param {FeatureCollection<Point>} points to be clustered | |||
* @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only) | |||
* @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers | |||
* @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the | |||
* @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection; | |||
* @param {FeatureCollection<Point>} points to be clustered | |||
* @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only) | |||
* @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers | |||
* @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the | |||
* @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the | |||
* cluster does not meet the minimum amounts of points. | |||
* @returns {Object} an object containing a `points` FeatureCollection, the input points where each Point | |||
* has given a `cluster` property with the cluster number it belongs, a `centroids` FeatureCollection of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be updated with the new dbscan
property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yep! It's defined in the Typescript definition, but yes it needs to be documented
* @example | ||
* var centroids = centroidFromProperty(points, 'cluster'); | ||
*/ | ||
function centroidFromProperty(geojson, property, properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stebogit Added back your centroids points 😄
Would be interesting to know the benchmark results on that centroidFromProperty
method (i'm sure it's really fast... just quickly scans the FeatureCollection once and then applies clusters based on those bins).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmh... in density clustering the centroids are less useful/identifying/important than in k-means, I guess. 🤔
But I might be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see now, this is only in test.js
. Good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! :) only for test.js
This where centroidByProperty
module would be used... we wouldn't apply this directly in the modules, but for visual purposes.
Also this can be applied against Polygons or any Geometry Types, finding the "centroid" of stuff based on properties is quite useful.
Cool! 😃 🎊 🚀 |
29 commits later.. 👍 |
@turf/clusters-distance
module@turf/clusters-dbscan
module
First draft of
@turf/clusters-dbscan
Ref. #811Compared to the
kmeans
cluster this is roughly 5x-50x times faster (thekdbush
index does help a lot -- The performance will most likely slow down as weighted clusters becomes implemented).To-Do
FeatureCollection<Point>
). Using thecluster
property param is already enough to be able to create a centroid/center/centerOfMass.More To-Dos
clusters-dbscan
Examples
points1.geojson
points2.geojson