Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added filter for when row is blank #5

Merged
merged 6 commits into from
Nov 15, 2023
Merged

Added filter for when row is blank #5

merged 6 commits into from
Nov 15, 2023

Conversation

noborus
Copy link
Owner

@noborus noborus commented Nov 15, 2023

Summary by CodeRabbit

  • New Features

    • Introduced a new naming system for columns in the reader to enhance clarity.
    • Implemented a function to generate cell names based on column index for better identification.
  • Enhancements

    • Improved data filtering logic to terminate early if no valid data is found, optimizing performance.
  • Bug Fixes

    • Fixed an issue with default column naming to ensure consistency across the application.
  • Tests

    • Added new test cases to verify the correct functioning of the updated column naming and data filtering features.
    • Updated existing tests to align with the new column naming convention.

If cell is specified, blank columns and blank rows are not allowed.
Column names when there is no header row are now in Excel notation.
Copy link
Contributor

coderabbitai bot commented Nov 15, 2023

Walkthrough

The updates involve enhancing an XLSX reader's functionality by introducing a cellName function to dynamically generate column names based on indices, improving the filterColumns function to better handle valid column data, and adding a columnNum function to determine the number of valid columns. The nameType function has also been updated to utilize the new cellName function for naming columns.

Changes

File(s) Summary
reader.go Introduced cellName function for dynamic column naming, improved filterColumns for better data handling, added columnNum to calculate valid columns, and updated nameType to use cellName.
reader_test.go Added new test case "test3" for XLSXReader and updated expected column names in TestXLSXReader_Names.

🐇 In the code's burrow, beneath the autumn's glow,
Changes abound, as the data columns grow.
With a hop and a skip, the reader's now adept,
Parsing sheets with grace, while the old bugs have leapt. 🍂📊


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with CodeRabbit bot directly around the specific lines of code in the PR, mention @coderabbitai in your review comment
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 1226dc8 and 7b53ac3.
Files ignored due to filter (1)
  • testdata/test3.xlsx
Files selected for processing (2)
  • reader.go (5 hunks)
  • reader_test.go (2 hunks)
Additional comments: 4
reader_test.go (2)
  • 104-121: The new test case "test3" has been added correctly to test the PreReadRow function. It is important to ensure that the testdata/test3.xlsx file contains the expected data to match the test case. The use of InHeader and InJQuery options seems appropriate for the intended test scenario.

  • 147-152: The change in expected column names from {"C1", "C2"} to {"A1", "B1"} in the TestXLSXReader_Names function reflects the new column naming logic introduced in the reader.go file. This change is consistent with the pull request summary and ensures that the test case will validate the new functionality.

reader.go (2)
  • 95-99: The cellName function is used to generate column names based on the index, which is a good approach for dynamic naming. However, ensure that the cellName function is thoroughly tested to handle edge cases, such as very large indices that might result in column names not typically used in spreadsheets (e.g., beyond 'ZZ' for Excel).

  • 184-190: The nameType function now uses the cellName function to generate a unique name if a duplicate is found, which is a good practice to avoid column name collisions. However, ensure that the cellName function generates a name that is guaranteed to be unique in the context of the current sheet.

reader.go Outdated
Comment on lines 109 to 114
validColumns[c] = true
}
c++
if len(data) <= c {
break
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop termination condition seems incorrect. The loop should not break based on the length of data but rather should iterate over all the columns. The current condition could lead to missing columns if cellX is greater than 0.

- if len(data) <= c {
-     break
- }

Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
validColumns[c] = true
}
c++
if len(data) <= c {
break
}
validColumns[c] = true
}
c++

reader.go Outdated
Comment on lines 147 to 168
break
}
}
newBody := make([][]interface{}, 0, len(body))
for _, row := range body {
dst := make([][]interface{}, 0, len(src))
for _, row := range src {
cols := make([]interface{}, count)
valid := false
for i := 0; i < count; i++ {
cols[i] = row[i]
if cols[i] != "" {
valid = true
}
}
if !valid {
break
}
newBody = append(newBody, cols)
dst = append(dst, cols)
}
return newBody
return dst
}

func parseExtend(ext string) (string, string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [138-164]

The filterColumns function now breaks the loop if a row without valid data is encountered. This change assumes that once a blank row is found, all subsequent rows are also blank or invalid, which might not be the case. If the data set can have non-consecutive blank rows, this logic will need to be revised.

- if !valid {
-     break
- }

Recognizes a blank row as the end of the row
(ignoring the first blank row).
Fixed so that cellName correctly returns an Excel-like name.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 7b53ac3 and dba3850.
Files selected for processing (1)
  • reader.go (4 hunks)
Additional comments: 5
reader.go (5)
  • 95-99: The cellName function is used to generate default column names when a name is not provided. This is a good approach to ensure that all columns have names. However, it's important to ensure that the cellName function is robust and handles potential errors gracefully, as previously mentioned in the outdated comments.

  • 122-130: The filterColumns function is used to filter out invalid columns and rows with no valid data. This is a good performance optimization as it avoids processing empty rows. However, the function should be reviewed to ensure that it correctly identifies the start of valid rows (startRow) and breaks out of the loop when a row with no valid data is encountered after the first valid row.

  • 141-165: The filterColumns function filters the source data based on valid columns. The logic seems to be correct, but it's important to ensure that the startRow flag is used correctly to identify when to start appending data and when to break out of the loop. The current logic seems to break out of the loop as soon as it encounters a row with no valid data after the first valid row, which is the intended behavior.

  • 168-180: The columnNum function calculates the number of valid columns. The logic here is to count the columns until a false value is encountered after the first true value. This assumes that valid columns are contiguous from the start, which is a reasonable assumption in most cases. However, if there's a possibility of valid columns being non-contiguous, this logic would need to be revised.

  • 199-205: The nameType function is responsible for generating column names and types. It uses a map to ensure that column names are unique. If a duplicate name is found, it generates a new name using the cellName function. This is a good approach to handle potential duplicate column names.

reader.go Outdated
Comment on lines 133 to 138
func cellName(i int) string {
cn, err := excelize.CoordinatesToCellName(i+1, 1)
if err != nil {
return fmt.Sprintf("C%d", i+1)
}
return cn
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cellName function generates a cell name based on the column index. The error handling here defaults to a generic name format if an error occurs. This could potentially lead to column name collisions. It would be better to propagate the error up to the caller or handle it in a way that avoids collisions.

  if err != nil {
-     return fmt.Sprintf("C%d", i+1)
+     // Handle the error more appropriately, e.g., log it and propagate it up
+     return "", err
  }

Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
func cellName(i int) string {
cn, err := excelize.CoordinatesToCellName(i+1, 1)
if err != nil {
return fmt.Sprintf("C%d", i+1)
}
return cn
func cellName(i int) (string, error) {
cn, err := excelize.CoordinatesToCellName(i+1, 1)
if err != nil {
// Handle the error more appropriately, e.g., log it and propagate it up
return "", err
}
return cn, nil

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noborus Acknowledged. I will update the cellName function to return an error along with the cell name, allowing the caller to handle it appropriately.

func cellName(i int) (string, error) {
    cn, err := excelize.CoordinatesToCellName(i+1, 1)
    if err != nil {
        // Log the error and propagate it up
        return "", err
    }
    return cn, nil
}

This change will require updating all calls to cellName to handle the potential error. Would you like me to proceed with this change and adjust the rest of the code accordingly?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you. I'll do it myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noborus Understood. If you need any further assistance or have more requests in the future, feel free to reach out. Good luck with your implementation!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between dba3850 and 33f27f6.
Files selected for processing (1)
  • reader.go (4 hunks)
Additional comments: 3
reader.go (3)
  • 95-103: The cellName function is used to generate a name for unnamed columns. Ensure that the cellX + i calculation correctly reflects the intended column index, as cellX is decremented earlier in the code if extCell is not empty. This could potentially lead to an off-by-one error.

  • 126-134: The filterColumns function is used to filter out invalid columns. It's important to ensure that the validColumns slice accurately reflects the columns that contain valid data. If the logic for determining valid columns is incorrect, this could result in valid data being omitted or invalid data being included.

  • 145-167: The filterColumns function now breaks out of the loop when a row without valid data is encountered after a row with valid data. This assumes that all rows with valid data are contiguous and that once a row with no valid data is found, no subsequent rows will contain valid data. If this assumption does not hold true for all datasets, this could lead to valid data being skipped. It may be necessary to verify this assumption against the expected data format.

Comment on lines 108 to 116
}
data := make([]interface{}, columnNum)
for c, i := 0, cellX; i < len(row); i++ {
if c >= columnNum {
break
}
data[c] = row[i]
if data[c] != "" {
validColumns[c] = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The increment of c should be inside the loop to ensure it only increments when a cell is processed. The current placement of c++ outside the loop's braces will not increment c as intended. This could lead to incorrect data alignment in the data slice.

  if data[c] != "" {
      validColumns[c] = true
- }
- c++
+     c++
+ }

Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
}
data := make([]interface{}, columnNum)
for c, i := 0, cellX; i < len(row); i++ {
if c >= columnNum {
break
}
data[c] = row[i]
if data[c] != "" {
validColumns[c] = true
}
data := make([]interface{}, columnNum)
for c, i := 0, cellX; i < len(row); i++ {
if c >= columnNum {
break
}
data[c] = row[i]
if data[c] != "" {
validColumns[c] = true
c++
}

reader.go Outdated
Comment on lines 203 to 213
for i := cellX; i < cellX+columnNum; i++ {
if header && len(row) > i && row[i] != "" {
if _, ok := nameMap[row[i]]; ok {
names[c] = fmt.Sprintf("C%d", i+1)
name, err := cellName(cellX + i)
if err != nil {
names[c] = row[i] + "_" + fmt.Sprint(i)
}
names[c] = name
} else {
nameMap[row[i]] = true
names[c] = row[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nameType function now uses the cellName function to generate names for unnamed columns or to handle duplicate names. Ensure that the logic for handling duplicates is correct and that the nameMap is being used effectively to track existing names. Also, there is a potential bug where names[c] could be assigned twice in succession if an error occurs, which would overwrite the first assignment with the second one.

  if _, ok := nameMap[row[i]]; ok {
      name, err := cellName(cellX + i)
      if err != nil {
-         names[c] = row[i] + "_" + fmt.Sprint(i)
+         names[c] = row[i] + "_" + fmt.Sprint(c)
      }
-     names[c] = name
+     else {
+         names[c] = name
+     }
  } else {
      nameMap[row[i]] = true
      names[c] = row[i]
  }

Commitable suggestion

[!IMPORTANT]
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
for i := cellX; i < cellX+columnNum; i++ {
if header && len(row) > i && row[i] != "" {
if _, ok := nameMap[row[i]]; ok {
names[c] = fmt.Sprintf("C%d", i+1)
name, err := cellName(cellX + i)
if err != nil {
names[c] = row[i] + "_" + fmt.Sprint(i)
}
names[c] = name
} else {
nameMap[row[i]] = true
names[c] = row[i]
for i := cellX; i < cellX+columnNum; i++ {
if header && len(row) > i && row[i] != "" {
if _, ok := nameMap[row[i]]; ok {
name, err := cellName(cellX + i)
if err != nil {
names[c] = row[i] + "_" + fmt.Sprint(c)
} else {
names[c] = name
}
} else {
nameMap[row[i]] = true
names[c] = row[i]

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 33f27f6 and 1c89ea3.
Files selected for processing (1)
  • reader.go (4 hunks)

@noborus noborus merged commit 36d824a into main Nov 15, 2023
@noborus noborus deleted the row-filter branch November 15, 2023 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant