Write tests #3

Open
gedankenstuecke opened this issue Dec 15, 2017 · 10 comments

@gedankenstuecke (Owner)

The old problem :P

@troublemagnet (Contributor)

@gedankenstuecke may I work on this?

@gedankenstuecke (Owner, Author)

Absolutely! 👍

@jobliz (Contributor) commented May 16, 2018

@gedankenstuecke I've been thinking about this issue, and I have some doubts about how to test the data loading and analysis process. The create_main_dataframe function needs a Twitter zip archive to be tested, and the functions in analyse_data.py all need the created dataframe as a parameter, so I think it comes down to where a Twitter archive file should be stored. I see two alternatives:

  • The tests could be general enough to pass with any Twitter archive that the developer places in a specific directory. The test data would then be downloaded separately after cloning the repository, but I'm not sure how one could assert specific analysis return values when the archives may differ.

  • The tests could be archive-specific (for example, depending on test_archive.zip). In that case I see three ways to go: the documentation could instruct the user to download the file from a static URL, the file could be committed to the repository, or a new script could download the archive from a static URL once, before the tests are run.

In a past project we placed the test data inside the code repository, but I'm not sure whether that fits the idea behind twitter-analyser, so I wanted to ask before opening a pull request.
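
To make the second alternative concrete, a minimal pytest sketch could look like this (the import path, the archive location, and the assumption that create_main_dataframe takes a path to the zip file are mine, not the project's actual layout):

```python
import os

import pytest

# Import path is an assumption; adjust to wherever create_main_dataframe lives.
from twitter_analyser import create_main_dataframe

ARCHIVE_PATH = os.path.join(os.path.dirname(__file__), "test_archive.zip")


@pytest.fixture(scope="module")
def dataframe():
    """Build the dataframe from the fixed test archive once per test module."""
    return create_main_dataframe(ARCHIVE_PATH)


def test_dataframe_has_tweets(dataframe):
    # With a fixed test_archive.zip, exact tweet counts and specific analysis
    # return values could be asserted here instead of this loose sanity check.
    assert len(dataframe) > 0
```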

@gedankenstuecke (Owner, Author)

@jobliz I think an archive-specific test would be fine with me.

To keep the file size (and test duration) manageable, the data in test_archive.zip could be cut down to about two months' worth of tweets (e.g. December 2008 and January 2009, which also covers the turn of the year).

Does that sound good?
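
For reference, one way to trim an extracted archive down to those two months, assuming the export keeps one tweet file per month under a tweets/ directory (e.g. data/js/tweets/2008_12.js; the real layout may differ):

```python
import glob
import os

# Months to keep: December 2008 and January 2009.
KEEP = {"2008_12", "2009_01"}

# Path and per-month file naming are assumptions about the export layout.
for path in glob.glob("archive/data/js/tweets/*.js"):
    month = os.path.splitext(os.path.basename(path))[0]  # e.g. "2008_12"
    if month not in KEEP:
        os.remove(path)

# Note: payload_details.js still records the original tweet total and would
# need to be updated to match the remaining tweets before re-zipping.
```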

@jobliz (Contributor) commented May 16, 2018

Sure! I'll create a minimized archive and write new tests against it.

@gedankenstuecke (Owner, Author)

Awesome, thanks!

@jobliz (Contributor) commented May 17, 2018

I made a minified archive by deleting JSON files in the /tweets directory and manually setting a new tweet total in payload_details.js, then ran the analysis functions on the resulting dataframe. Unexpectedly, two functions raised exceptions: test_create_timeline raises `NotImplementedError: Not supported for type RangeIndex`, and create_hourly_stats raises `KeyError: 'Weekday'`. This doesn't happen with the full archive (and yes, testing with all tweets takes far too long to be practical). I also tried the same tests with a four-month archive spanning November 2008 through February 2009, but it raised the same exceptions.

Since I'm not entirely sure why this happens and the tests aren't passing yet, I'm holding off on a PR, but the code is in a branch in my fork.

Do you have any idea why a minimized archive might raise exceptions?

@gedankenstuecke (Owner, Author)

I haven't found time to look closely into the data yet, but I suspect that both create_hourly_stats and create_timeline crash because the 2008/2009 data lacks GPS information: the time-zone lookup only works for tweets that have a latitude and longitude, which are also needed to plot points on the map later on.

Looking at my own data, it seems geolocation only became available on Twitter around mid-2011. So I probably just made a bad recommendation on which months to pick for the test set.

This would also explain why it doesn't happen for the full data set, since the full archive does contain GPS coordinates.
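
If that's the cause, a small guard before the time-zone lookup could turn these cryptic exceptions into a clear error. This is just a sketch, and the latitude/longitude column names are assumptions about the dataframe schema:

```python
def geotagged_tweets(dataframe):
    """Return only the tweets that carry GPS coordinates."""
    # 'latitude'/'longitude' are assumed column names, not the actual schema.
    subset = dataframe.dropna(subset=["latitude", "longitude"])
    if subset.empty:
        raise ValueError(
            "archive contains no geolocated tweets; time zone and map "
            "analyses need latitude/longitude"
        )
    return subset
```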

@jobliz (Contributor) commented May 18, 2018

That makes sense. I'll make a minimized archive from months after mid-2011 and see how it goes. If I'm understanding this correctly, the same crashes will also happen when loading full archives from people who haven't turned on location data; when I receive my own archive from Twitter I'll check whether that's the case.
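
In the tests themselves, the location-dependent functions could also be skipped explicitly instead of being left to crash. A sketch reusing the dataframe fixture from the earlier comment (the latitude column name is again an assumption):

```python
import pytest

from analyse_data import create_hourly_stats  # import path may differ


def test_create_hourly_stats(dataframe):
    # Skip rather than crash when the archive has no geolocated tweets.
    if dataframe["latitude"].dropna().empty:
        pytest.skip("archive has no geolocated tweets")
    assert create_hourly_stats(dataframe) is not None
```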

@gedankenstuecke (Owner, Author)

Great, thanks so much!
