Detailed formatcheck.py documentation

echang97 edited this page Jul 31, 2019 · 23 revisions

Functions

add_item

Parameter(s):

  • key - Item to be added or modified
  • value - Unit of measurement to be associated with key
  • dct - The Dictionary that this is being applied to

If key is already in the dictionary, add the value to the set associated with the key

Otherwise, associate the new key with a new set containing only value

[Ex 1.] key = "Gas", value = "mcf" -> { "Gas": {"mcf"} }

[Ex 2.] key = "Geothermal - Electrical Generation", value = "Kilowatt Hours" 
        -> { "Geothermal - Electrical Generation": {"Kilowatt Hours"} }

        key = "Geothermal - Electrical Generation", value = "Thousands of Pounds" 
        -> { "Geothermal - Electrical Generation": {"Kilowatt Hours", "Thousands of Pounds"} }
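The behavior above can be sketched as follows. This is a minimal illustration based only on the description here, not necessarily the exact code in formatcheck.py.

```python
def add_item(key, value, dct):
    """Add value to the set stored under key, creating the set if needed."""
    if key in dct:
        dct[key].add(value)
    else:
        dct[key] = {value}


# Reproduces Ex 2: two units end up in the same set.
units = {}
add_item("Geothermal - Electrical Generation", "Kilowatt Hours", units)
add_item("Geothermal - Electrical Generation", "Thousands of Pounds", units)
print(units)
```

Because the values are sets, adding the same unit twice leaves the dictionary unchanged.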

get_com_pro

Parameter(s):

  • cols - Columns from a Pandas DataFrame

Checks cols for "Commodity" or "Product"

Returns "n/a" if "Commodity" and "Product" are both present or both missing

Otherwise it returns whichever is present

[Ex 1.] cols = ["Commodity"] -> returns "Commodity"
[Ex 2.] cols = ["Product"] -> returns "Product"
[Ex 3.] cols = ["Commodity", "Product"] -> returns "n/a"
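A minimal sketch of the rule as worded above (both present or both missing gives "n/a"). It works on any list-like of column names, and the real function may be written differently.

```python
def get_com_pro(cols):
    """Return "Commodity" or "Product" if exactly one is present, else "n/a"."""
    has_commodity = "Commodity" in cols
    has_product = "Product" in cols
    if has_commodity == has_product:
        # Both present or both missing
        return "n/a"
    return "Commodity" if has_commodity else "Product"


print(get_com_pro(["Calendar Year", "Commodity", "Volume"]))
print(get_com_pro(["Calendar Year", "Product", "Volume"]))
```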

get_data_type

Parameter(s):

  • name - Name of the Excel file

Field(s):

  • lower - name in all lowercase letters
  • prefixes = ["cy","fy","monthly","company","federal","native","production","revenue","distribution"]

Returns a String based on the Excel file given

If any entries from prefixes are found in name, they will be added to the final String

[Ex] name = "federal_production_CY03-18" -> returns "cyfederalproduction_"
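One way to get the result in the example is to walk prefixes in list order and keep every entry found in lower. The trailing underscore is inferred from the example output and the spelling "distribution" is assumed, so treat this as a sketch rather than the actual implementation.

```python
# Assumed spelling of the last entry: "distribution"
PREFIXES = ["cy", "fy", "monthly", "company", "federal",
            "native", "production", "revenue", "distribution"]


def get_data_type(name):
    """Build a type string from the prefixes found in the file name."""
    lower = name.lower()
    # Prefixes are appended in list order, not in the order they appear in name
    found = "".join(prefix for prefix in PREFIXES if prefix in lower)
    # Trailing underscore inferred from the documented example
    return found + "_"


print(get_data_type("federal_production_CY03-18"))  # cyfederalproduction_
```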

split_unit

Parameter(s):

  • string - String to be split

Returns a List of two Strings, split at either the right-most opening parenthesis "(" or the left-most comma ","

[Ex 1] string = "Gas (mcf)" -> ["Gas", "mcf"]
[Ex 2] string = "Geothermal - Electrical Generation, Kilowatt Hours" 
       -> ["Geothermal - Electrical Generation", "Kilowatt Hours"]
[Ex 3] string = "Geothermal - sulfur" -> ["Geothermal - sulfur", ""]
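The three examples can be reproduced with the sketch below. It assumes the parenthesis is checked before the comma and that the closing ")" and surrounding spaces are stripped; the source does not state either detail.

```python
def split_unit(string):
    """Split "Item (unit)" or "Item, unit" into [item, unit]."""
    if "(" in string:
        # Right-most "(" separates the item from its unit
        item, _, unit = string.rpartition("(")
        return [item.strip(), unit.strip().rstrip(")").strip()]
    if "," in string:
        # Left-most "," separates the item from its unit
        item, _, unit = string.partition(",")
        return [item.strip(), unit.strip()]
    # No unit found
    return [string, ""]


print(split_unit("Gas (mcf)"))            # ['Gas', 'mcf']
print(split_unit("Geothermal - sulfur"))  # ['Geothermal - sulfur', '']
```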

Class: Setup

get_header

Parameter(s):

  • file - A Pandas DataFrame

Returns column names as a List

get_unit_dict

Returns a dictionary mapping each item to its units. Calls split_unit and add_item

Product
Salt (tons)
Soda Ash (tons)
Sodium Bi-Carbonate (tons)
Gas (mcf)
Borate Products (tons)

Returns {"Salt" : "tons", 
         "Soda Ash" : "tons", 
         "Sodium Bi-Carbonate" : "tons", 
         "Gas" : "mcf", 
         "Borate Products" : "tons"}

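The get_unit_dict example above can be sketched by combining the two helpers. The standalone signature (an iterable of strings instead of a Setup method) is hypothetical, and the helper bodies are simplified stand-ins. Since add_item stores sets, this sketch yields a set of units per item rather than a plain string.

```python
def split_unit(string):
    """Simplified stand-in: split "Item (unit)" or "Item, unit"."""
    if "(" in string:
        item, _, unit = string.rpartition("(")
        return [item.strip(), unit.strip().rstrip(")").strip()]
    if "," in string:
        item, _, unit = string.partition(",")
        return [item.strip(), unit.strip()]
    return [string, ""]


def add_item(key, value, dct):
    """Add value to the set stored under key."""
    dct.setdefault(key, set()).add(value)


def get_unit_dict(entries):
    """Hypothetical standalone version: build {item: {units}} from entries."""
    units = {}
    for entry in entries:
        item, unit = split_unit(entry)
        add_item(item, unit, units)
    return units


product = ["Salt (tons)", "Soda Ash (tons)", "Sodium Bi-Carbonate (tons)",
           "Gas (mcf)", "Borate Products (tons)"]
unit_dict = get_unit_dict(product)
print(unit_dict["Gas"])  # {'mcf'}
```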
Class: FormatChecker

read_config

Parameter(s):

  • type - Prefix for config file represented by a String

Returns a dictionary based on the JSON file
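A rough sketch of loading such a file with the standard json module. The config_dir argument and the "config.json" suffix are hypothetical, since the page does not say how the file name is built from the prefix. The parameter is called prefix here to avoid shadowing Python's built-in type.

```python
import json
import os
import tempfile


def read_config(prefix, config_dir="config"):
    """Load <config_dir>/<prefix>config.json as a dictionary.

    Both the directory and the file-name suffix are assumptions.
    """
    path = os.path.join(config_dir, prefix + "config.json")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with a throwaway file so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    sample = {"header": ["Calendar Year", "Volume"]}
    sample_path = os.path.join(tmp, "cyfederalproduction_config.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    config = read_config("cyfederalproduction_", config_dir=tmp)

print(config)  # {'header': ['Calendar Year', 'Volume']}
```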

get_w_count

Parameter(s):

  • file - A Pandas DataFrame

Returns a tuple: the number of "W" entries found in Volume, followed by the number of "Withheld" entries found in State

Calendar Year  Land Category  Land Class  State     ...  Product                     Volume
2003           Onshore        Federal     CA        ...  Salt (tons)                 33,622
2003           Onshore        Federal     CA        ...  Soda Ash (tons)             W
2003           Onshore        Federal     CA        ...  Sodium Bi-Carbonate (tons)  W
2003           Onshore        Federal     CA        ...  Gas (mcf)                   4,885.6
2003           Onshore        Federal     Withheld  ...  Borate Products (tons)      31,124

Returns (2,1)
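The count in this example can be reproduced with pandas as sketched below. Exact string matching after trimming whitespace, and returning 0 for a missing column, are assumptions; the real method may differ.

```python
import pandas as pd


def get_w_count(file):
    """Return (number of "W" in Volume, number of "Withheld" in State)."""
    volume_w = 0
    state_w = 0
    if "Volume" in file.columns:
        volume_w = int((file["Volume"].astype(str).str.strip() == "W").sum())
    if "State" in file.columns:
        state_w = int((file["State"].astype(str).str.strip() == "Withheld").sum())
    return (volume_w, state_w)


# Only the two relevant columns from the table above
df = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "Withheld"],
    "Volume": ["33,622", "W", "W", "4,885.6", "31,124"],
})
print(get_w_count(df))  # (2, 1)
```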

check_header

Parameter(s):

  • file - A Pandas DataFrame

Iterates through the default header and checks whether each expected Field Name is present.

Prints a message if a Field Name is missing or in the wrong order

Unexpected Field Names are printed separately.

[Ex] default = ["Month", "Calendar Year", "Land Class", "Land Category", "Commodity", "Volume"]
     columns = ["Moth", "Calendar Year", "Land Category", "Land Class", "Commodity", "Volume"]

-> "Month": Not Present, "Land Category": Unexpected Order, "Land Class": Unexpected Order
   New Cols: Moth
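A minimal sketch of this comparison, taking two plain lists instead of a DataFrame (a hypothetical simplification). "Unexpected Order" is decided here by comparing positions index by index, which is an assumption; the messages are printed in the order of the default header.

```python
def check_header(default, columns):
    """Compare columns against default and report the differences."""
    issues = {}
    for index, field in enumerate(default):
        if field not in columns:
            issues[field] = "Not Present"
        elif index >= len(columns) or columns[index] != field:
            issues[field] = "Unexpected Order"
    # Columns that the default header does not know about
    new_cols = [col for col in columns if col not in default]
    for field, problem in issues.items():
        print(f'"{field}": {problem}')
    if new_cols:
        print("New Cols:", ", ".join(new_cols))
    return issues, new_cols


default = ["Month", "Calendar Year", "Land Class", "Land Category", "Commodity", "Volume"]
columns = ["Moth", "Calendar Year", "Land Category", "Land Class", "Commodity", "Volume"]
issues, new_cols = check_header(default, columns)
```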

check_misc_cols

Parameter(s):

  • file - A Pandas DataFrame

Iterates through non-numerical fields and checks for unexpected entries.

Also checks Calendar Year

check_nan

Parameter(s):

  • file - A Pandas DataFrame

Iterates through specific columns and prints out each cell with missing information
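A sketch of such a scan with pandas. The columns argument is hypothetical, since the page does not list which columns are checked, and the message format is illustrative only.

```python
import pandas as pd


def check_nan(file, columns):
    """Print and return (row, column) for every missing cell in columns."""
    missing = []
    for col in columns:
        if col not in file.columns:
            continue
        # isna() flags both None and NaN
        for row in file.index[file[col].isna()].tolist():
            print(f"Row {row}: missing {col}")
            missing.append((row, col))
    return missing


df = pd.DataFrame({
    "State": ["CA", None, "NM"],
    "Volume": ["33,622", "W", None],
})
missing = check_nan(df, ["State", "Volume"])
```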

check_unit_dict

Parameter(s):

  • file - A Pandas DataFrame

Iterates through the column with expected units. Splits each entry into item and unit, then compares them against the default unit dictionary to determine whether the entry is valid
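The validation step might look like the sketch below. It takes an iterable of strings and a dictionary of item to set of units (both hypothetical simplifications of the DataFrame and the config), and uses a simplified stand-in for split_unit.

```python
def split_unit(string):
    """Simplified stand-in: split "Item (unit)" or "Item, unit"."""
    if "(" in string:
        item, _, unit = string.rpartition("(")
        return [item.strip(), unit.strip().rstrip(")").strip()]
    if "," in string:
        item, _, unit = string.partition(",")
        return [item.strip(), unit.strip()]
    return [string, ""]


def check_unit_dict(entries, default_units):
    """Return the entries whose item or unit is not in default_units."""
    invalid = []
    for entry in entries:
        item, unit = split_unit(entry)
        # Unknown items have no allowed units, so they are always invalid
        if unit not in default_units.get(item, set()):
            print(f"Unexpected entry: {entry}")
            invalid.append(entry)
    return invalid


default_units = {"Gas": {"mcf"}, "Salt": {"tons"}}
invalid = check_unit_dict(["Gas (mcf)", "Salt (lbs)", "Gold (oz)"], default_units)
```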