Fix 'add_cities' to use modern GeoNames data #61
Conversation
The add_cities() method is barely documented under the Pycoast API. I found out from the code that a shape file was expected under db_root_path/CITIES/cities_15000_alternativ.shp. Googling for its name suggested that it probably stems from www.geonames.org. But that site only provides tab-delimited UTF-8 text files. Christian Peters finally found the shape file in his old PyTROLL stuff. I was unable to read it at first as it was encoded latin1 while the shapefile Pycoast code defaults to UTF-8. I was able to convert the encoding from latin1 to UTF-8 with QGIS. While I could use it now, this left quite a big number of ??? in non-ASCII city names. The attribute file cities_15000_alternativ.dbf finally revealed that this was a homebrew file converted from a cities15000.txt with latest updates from 2009. I later found the same cities_15000_alternativ.shp under the test data of the Pycoast git repo, but no tests for 'cities'. I replaced the 60MB shape file with the easily available (and actively updated) 10MB cities5000.txt file that has twice the entries of the shape file (about 100'000 cities with more than 5000 inhabitants).
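To illustrate the format switch described above: the GeoNames cities files are tab-delimited UTF-8 text, and a lookup only needs to split each line and read the name and coordinate columns. This is a minimal sketch, not the Pycoast implementation; the column order (geonameid, name, asciiname, alternatenames, latitude, longitude, ...) is assumed from the GeoNames readme, and `find_city` is a hypothetical helper name.

```python
# Hedged sketch: look up (lon, lat) for a city name in a GeoNames-style
# tab-delimited UTF-8 file such as cities5000.txt. Column order assumed:
# geonameid, name, asciiname, alternatenames, latitude, longitude, ...
import io

def find_city(lines, city_name):
    """Return (lon, lat) for the first line matching city_name, or None."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 5 and fields[1] == city_name:
            return float(fields[5]), float(fields[4])  # (lon, lat)
    return None

# One fabricated sample line standing in for the real file contents.
sample = "2657896\tZürich\tZurich\tZH\t47.36667\t8.55\n"
print(find_city(io.StringIO(sample), "Zürich"))  # → (8.55, 47.36667)
```

Because the file is plain UTF-8 text, non-ASCII names like "Zürich" survive intact, which is exactly what the latin1 shape file could not guarantee.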
I deleted the same 16 parentheses under PIL and 'asterisk' still works.
@lobsiger To answer your question, each PR should be separate and stay separate. It makes it easier to review. For example, is there any way you could split this PR into multiple pull requests? It is difficult at this point to figure out what this pull request is doing. I know it includes the fix we discovered on getting the asterisk to work (that should be its own PR). It also adds new shapes that can be drawn for points (separate PR)? And it fixes something about cities in general? For the extra CITIES file, so you're saying this didn't exist at all before? Looks like in #42 @mraspaud suggested we refactor points and cities so they share some logic. We'll have to look at that at the PCW in a couple weeks.
I did my best. Maybe some core developer finds the time to look at my 3 PRs during the PCW in May 2022.
@djhoese as explained further up in my monologue, this PR fixes 'cities', which still existed in the pycoast code but was almost undocumented and maybe worked somewhere in the past with an archaic homebrew shape file. 'cities' in the overlay dict used with satpy has probably never worked under Python3. We have always used 'points' when we wanted to have cities with symbol and name as overlay. The advantage of 'cities' is that you do not have to bother about (lon, lat) and also easily get special UTF-8 characters whatever keyboard you use. I reimplemented 'cities' as close as possible to the existing 'points'. Today I added some more symbols that can be used in 'points' and 'cities'. And I deleted the 16 parentheses you proposed to make 'asterisk' work again ...
As @mraspaud suggested in #42 it should now be child's play to refactor 'points' and 'cities' as they have close to identical code. The only difference between add_points() and add_cities() is that the former iterates over a points_list = [((lon, lat), 'Description'), ...] while the latter gets a simple cities_list = ['City1', 'City2', ...] and then iterates over the text file cities5000.txt line by line, where it finds (lon, lat) for matching city names. I even have a version of add_cities() that simply generates a points_list from the cities_list as explained above and then just calls the method add_points(). A better approach would probably make the common code a separate function.
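The delegating variant described above can be sketched in a few lines. This is a hedged illustration with hypothetical names (`add_cities_via_points`, the injected `lookup` and `add_points` callables stand in for the real cities5000.txt search and the real Pycoast method), not the actual PR code.

```python
# Hedged sketch: an add_cities() variant that builds a points_list from
# the cities_list and delegates all drawing to add_points().
def add_cities_via_points(add_points, lookup, cities_list, **kwargs):
    """Resolve each city name to (lon, lat) and forward to add_points()."""
    points_list = []
    for city in cities_list:
        lonlat = lookup(city)       # e.g. a search through cities5000.txt
        if lonlat is not None:      # silently skip names that are not found
            points_list.append((lonlat, city))
    return add_points(points_list, **kwargs)

# Usage with stand-in callables instead of the real file and drawer:
coords = {"Bern": (7.45, 46.95), "Basel": (7.59, 47.56)}
collected = []
add_cities_via_points(lambda pts, **kw: collected.extend(pts),
                      coords.get, ["Bern", "Unknown", "Basel"])
print(collected)  # [((7.45, 46.95), 'Bern'), ((7.59, 47.56), 'Basel')]
```

Factoring the shared drawing code into one function, as suggested, would make both public methods thin wrappers like this one.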
Be careful making any further changes to add_points as #56 will also make modifications to it. |
@djhoese I'm aware of #56. It solves the problem that up to now we only had one set of 'points' available from Satpy that uses add_overlay_from_dict(). I was thinking the same way for 'cities' where users might want to have different symbols or colors for 'capitals', 'megacities' and 'cities' as is e.g. the case in MeteorGIS used on LRPT reception stations. But that's probably overkill. We now already have 'points' and 'cities' that can use different symbols. Additional possibilities come from PyDecorate. Merging #56 will give even more choices but might also mean some small changes for your recent 'cache' hashing enhancement (?).
I added the excellent param_key logic of PyCoast PR #56. Windows now works when reading the utf-8 encoded file cities5000.txt. I attach an image (reproduced with both GNU/Linux and Windows 10 Pro) made with two different parameter sets for 'cities' and 'cities1'.
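The Windows fix mentioned above comes down to never relying on the platform default encoding. This hedged sketch (with a temp file standing in for the real cities5000.txt) shows the pattern; it is not the Pycoast code itself.

```python
# Hedged sketch: on Windows the locale default encoding (often cp1252)
# mangles non-ASCII city names, so the cities file must be opened with
# an explicit encoding="utf-8". A temp file stands in for cities5000.txt.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cities5000.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("São Paulo\n")

# Explicit encoding on read, never the platform default:
with open(path, encoding="utf-8") as f:
    name = f.read().strip()
print(name)  # São Paulo
```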
When I go to the very details of the failing CodeFactor test I always see old code that I have changed/simplified meanwhile: Lines 1162 to 1314 in 0c9a236
Is that a bug in GitHub?
CodeFactor is a separate service outside of GitHub that is just able to talk to GitHub to show us the results of analyzing the code. It does seem like the line numbers don't match your current changes, but the methods are still pretty complex. I would not worry about this for now unless you really just want to make your code prettier and easier to read.
It's not just the line numbers: What is shown above is not my latest code. Once CodeFactor even presented the original code as forked. I have made a function for every symbol that I added. So I wonder whether CodeFactor is really testing my latest code?! |
Yes, I meant it was clear it wasn't the latest code because even the line numbers don't match. CodeFactor has been having issues lately. |
What is |
@djhoese red means reduced. It's the same text file as before. The problem was that when you browse to a *.txt text file on GitHub the browser always wants to open it. cities5000r.txt has 2.2MB. The code doesn't care whether it opens a text file as *.txt or *.red :). I added a simple (KISS) download procedure that compares to the installation of GSHHG shape files. As this is something most users will do only once I see no reason for something more complicated. I take it for granted that a version update of PyCoast will rebuild the documentation and also automatically adapt the API with the doc strings we crafted. Ooops, it failed again??!!
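A KISS "download once" procedure like the one described can be reduced to a check-then-fetch helper. This is a hedged sketch with hypothetical names; the fetch step is injected so the logic is shown without a real network call (in practice it could be urllib.request.urlretrieve against the GeoNames download URL).

```python
# Hedged sketch: fetch the cities file only if it is not already present,
# mirroring the one-time GSHHG-style installation described above.
import os
import tempfile

def ensure_city_file(path, fetch):
    """Call fetch(path) only when path does not exist yet; return path."""
    if not os.path.exists(path):
        fetch(path)
    return path

# Usage with a fake fetcher that records how often it is invoked:
calls = []
target = os.path.join(tempfile.mkdtemp(), "cities5000.txt")

def fake_fetch(path):
    calls.append(path)
    open(path, "w").close()  # stand-in for the real download

ensure_city_file(target, fake_fetch)
ensure_city_file(target, fake_fetch)  # second call is a no-op
print(len(calls))  # 1
```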
Let's stick with standard file extensions. Please use |
Could you please put both installation sections into a single installation.rst
file. We should (in the future) include pip and conda installation instructions there.
As this is something most users will do only once I see no reason for something more complicated.
I disagree. These instructions are pretty complicated and I'd like to improve them in the future. There are too many steps that go from downloading, to unzipping, to reading a README, to instructions that assume this procedure won't work so here are more steps on whom to contact when things go wrong. Additionally, the cities file being downloaded is from the test_data directory.
Please make the changes to the installation file necessary, but then let's please leave this PR as it is so we can review it, merge it, and clean things up later.
@djhoese OK I'll do what you asked for. This means I'll have to update the cities2022.zip as well. BTW you don't even have to unzip that as it opens in the Explorer (or mc) and you can copy and paste things out. In my SatPy HOWTO 3.0 I had to describe how to install miniconda and GSHHG as well. PyTROLL/SatPy will never work for the average Windows user who is afraid of opening a black "DOS" window. I had some of those that stopped for good at the word directory ... But I agree that the GitHub test directory is probably not the place to put cities2022.zip as even 150MB of unused shape files can get lost there. @peters77 has tried the install with the reduce_cityfiles.py script only and had no problem producing his own custom-built cities.txt. One thing I'll always insist on is that the user can choose the cities.txt at least between cities15000r.txt and cities5000r.txt and that there is no automatic filtering of cities with whatever filter algorithm. I have played with such things to no avail. They often produced just ugly overlays. The user must have and remain in full control with the names he puts in the 'cities_list'.
This isn't how it works on all systems and we can't assume users are familiar with .zip files...at least I'd like to avoid that in future simplifications of these processes.
In my opinion this should depend on what the user is doing. For basic introduction to the library a jupyter notebook(s) should work just fine although I suppose that if the user is doing a local install then they probably need to at least install the software and start the notebook in the terminal. But this doesn't mean that we can't put everything in one or two helper functions (NOT in this PR).
Is this script in this PR? Do we need it to be?
Sure, but if the library made it easy to download the entire list of cities, would that matter? Or does having all the cities really slow down processing? If so, maybe we can reformat the file after download for easier future parsing. But this too should not be done in this PR. I'm not a huge fan of a .zip file being in the repository at all no matter the size. I would rather have some steps for downloading the files from an official source. I haven't looked too deeply at the GeoNames site so I'll need to do that to see the options. I did just notice they have a webservice which mentions JSON return values. Would be nice if we could do a single request and get a list of most of the cities and store it as a (hopefully small) JSON or CSV file. https://www.geonames.org/export/ws-overview.html Bottom line, let's find something simple that doesn't involve storing too much in this repository and uses GeoNames directly. In future PRs I/we can update the library to do things in a more automated fashion.
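For the webservice idea floated above, a query can be assembled without any repository data at all. This is a hedged sketch only: the searchJSON endpoint and its parameter names are taken from the GeoNames web-service overview linked above and should be verified, a registered username is required, and `geonames_search_url` is a hypothetical helper (no request is actually sent here).

```python
# Hedged sketch: build a GeoNames searchJSON query URL for one city.
# Endpoint and parameter names assumed from the GeoNames ws overview.
from urllib.parse import urlencode

def geonames_search_url(city, username, max_rows=1):
    """Return the search URL; the caller would fetch and parse the JSON."""
    params = {"q": city, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)

print(geonames_search_url("Zürich", "demo"))
```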
@djhoese I made the changes you requested. Regarding your post above I can only say that I fundamentally disagree in more than one point. May I also add that you asked me to reduce the size of the file cities5000.txt what I did from 11MB down to 2.2MB with a script I published in this discussion 3 days ago. Now you want again to download or at least query all they have on geonames.org (not everything is free there). And my last point: PyTROLL/SatPy runs now on a couple of amateur EUMETCast receivers 24/7. Not all these PCs have a permanent connection to the Internet. But they receive 400GB+ with all 3 services per day and do not care about the 2.2MB of a locally stored cities.txt file. All tests (except for CodeFactor) pass again. I ask you to merge this PR #61 now. As you also asked me I will not touch it again. If you take the code as is you will have to at least adapt the link to cities2022.zip in docs/source/installation.rst when this moves into the pytroll/pycoast main branch or wherever you move this archive. |
Could you link to the comment where you posted that script? I assumed it was a file that you had committed, sorry. But I also don't see it in the discussion so GitHub may be hiding/minimizing it from my view.
Regarding the minimization, that was when this was test data. That is, data that was only used for testing. It was not meant to be needed by users except for developers/contributors to the package so that they could run the tests. It is generally frowned upon to store data files (especially binary) in a git repository. I had no idea that shapefiles were being stored in the test_data of this repository until this PR. For small enough test data that doesn't change this makes sense to do it this way. For data that is needed by every user that wants to use this feature, but also doesn't meet the needs or provide the flexibility to see every available data point (the various options available from GeoNames), this doesn't make sense. I'm not saying we need to download or query for every city that they have in their database and I'm not saying this should be stored in the repository. I'm saying that if we're going to go to the trouble of telling the user to download data files from a website (the .zip file in your installation instructions) then we might as well point to the upstream GeoNames data file and have users download the more complete set.
I'm very familiar with the idea of users/servers not having access to the outside internet. In these cases there is a requirement to pre-download assets needed for processing. This is a common requirement and need of the processing done in Satpy, pyspectral, cartopy, etc.
I don't care what space users have on their processing servers. If they want to do the processing which requires all the cities then they will find the space for that information or they'll contact us to tell us they need a better way. I care about the unnecessary download and inclusion of data files in a revision control system (this git repository) and the potential storage of those data files in the packages sent to PyPI. This leads to every CI job having to download a large repository just to run tests. The tests should use a couple of (possibly fake) lines that mimic what the real data files would contain.
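The "couple of fake lines" idea above is cheap to realize: a test fixture only needs to mimic the GeoNames column layout for the fields the code actually reads. This hedged sketch uses fabricated city names and the same assumed column order as the real file; it is not the PR's test code.

```python
# Hedged sketch: a tiny in-memory fake cities file for CI, mimicking the
# GeoNames tab-delimited layout (geonameid, name, asciiname,
# alternatenames, latitude, longitude). All values are fabricated.
import io

FAKE_CITIES = (
    "1\tTestville\tTestville\t\t10.0\t20.0\n"
    "2\tSampletown\tSampletown\t\t-5.5\t30.25\n"
)

rows = [line.split("\t") for line in io.StringIO(FAKE_CITIES)]
print(len(rows))  # 2
```

With a fixture like this the real multi-megabyte download never enters the repository or the CI jobs.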
I will look over this PR again later. I will merge it after I've made the updates you've suggested. At the end of the day I (and other pytroll maintainers) need to be able to justify the decisions made to this code base when people ask about them. We also need to worry about CI bandwidth, download sizes, repository sizes, user support, bug fixes, documentation, and many other things. The decisions and behavior of this package need to work for many people and for many use cases. We do what we can maintaining the library with the time we have, but we have to be careful that we don't make the library harder to use or maintain as that would only be a step backward. Thank you for all of your contributions over the last couple months, your suggestions, and your patience as we find the best way to implement these fixes and features.
@djhoese it seems to me you have thrown away cities2022.zip without even looking at what is in it. It contained my way to give the user a choice and even more insight. The code now still passes the tests, but it will not work for any user as it expects city files rearranged and reduced in size. I have no problem with working again with the big original text files from geonames.org, which are still far smaller than the current non-working 60MB cities_15000_alternativ.dbf. You will have to change back:
to what it was way back in this conversation and also have to replace my 900-byte cities.txt that now still keeps the tests going.
@lobsiger @djhoese I have not read the whole thread, but I think I should contribute a little something: |
@lobsiger I did look at it, but apparently not deep enough. I saw the names of the files and skimmed the README, but didn't look at the script and didn't realize it was modifying the columns in the file. I thought it was only filtering based on population. Sorry for the oversight. I will fix the test and then the code. For longer term maintenance it would be easiest if we didn't have to keep a separate instance of this data in this repository and if we could match the upstream file format/structure. |
@mraspaud the problem in a nutshell: My proposal: |
@lobsiger I am working on this now. No need for further coding work from you for now as I've asked enough of you already.
As you said, I didn't realize that there were these old data files sitting in the repository. When you download/clone a git repository you (by default) get every file in its history whether it was deleted in a later commit or not. This is one reason why I'm hesitant to add more files.
The instructions as I've written them in this PR are only an initial step. This pull request has a lot of work done to it already and I don't want to add all the code that would be necessary for automating any of this download process here.
@djhoese it's my understanding that these 128MB have never been deleted. They show up on GitHub for everybody in the main branch. They were moved there in 2013 by @mraspaud from another place. But they appear to originate from 2009. /~https://github.com/pytroll/pycoast/tree/main/pycoast/tests/test_data/cities There must (or certainly should) be a way to delete deprecated and unused files for good from GitHub.
There is but it requires rewriting commits sadly. |
Then deleting these old files means changing this commit from 2015 (not 2013 as I stated wrongly)? /~https://github.com/pytroll/pycoast/tree/5ec760a92562b929cc76f5442ae39c8e381b30ce Or do you even have to trace that further back, as the cities_15000_alternativ stuff was only moved in 2015? Quite incredible :-( ...
Yes, unfortunately git does not like changing the commit history, because that means potentially losing consistency between clones/forks of the repo...
Ok, so I think this PR is ready for merging and then I can move on to some of the larger scale changes that were discussed here. I will merge this with "Squash and merge" so all commits are combined into one commit and the extra data files are not retained in the git history. I will merge this in a minute or two after I check a few things. |
...man I really want to remove those shapefiles from the history...better not. Thanks @lobsiger for all your work on this and the other PRs. We're really starting to get things in a good place. |
git diff origin/main **/*py | flake8 --diff
This should make 'cities' work again. Parameters are now identical with 'points'. Replaced the homebrew old shape file (still latin1, latest entries from 2009) with the UTF-8 text file cities5000.txt that can easily be downloaded from geonames.org. Added tests (or at least tried to). This PR does not include the fixes from my other two recent Pycoast PRs that are still waiting for review.