Add facilities to analyse lists #7770

Open
rdstern opened this issue Aug 14, 2022 · 0 comments

rdstern commented Aug 14, 2022

In File > New Data Frame > Lists > Words/Literature > Shakespeare/sonnets, the variable list.lines comes into R-Instat as a list.

image

There is a bit of initial work for @lilyclements, and then (assuming she agrees) most of the rest could perhaps be done by @anastasia-mbithe?

a) I have added the request for lists to have a proper data type to issue #7493, because that issue is about adding another type of data to R-Instat. But having a type label called (L) is more urgent, because currently a list column has no label and hence looks numeric.
b) It looks like a complicated character variable, but data_book$convert_column_to_type(data_name="data", col_names="list.lines1", to_type="character") doesn't work: it converts all values to NA. Please could this function be made to work on lists? (This is probably a @lilyclements task?)
c) The menu Prepare > Column: Text mainly uses the stringr package. These functions seem to work directly on lists, but most of the dialogue options don't accept the variable, because it is neither factor nor character. Please allow list variables in these dialogues. Perhaps @anastasia-mbithe could do this? The Split command already allows list variables.
d) Improve the Prepare > Data Reshape > Stack > Unnest option so that it can stack these data line by line. It is an excellent example of multiple response.
0) The unnest function we use, unnest_tokens, is from the tidytext package. There is now a similarly named function, unnest, in tidyr. I think we are still OK using the one we have, but a check by @lilyclements would help.

  1. I assume it will work directly on lists, so allow that type of data, just as discussed in c) above.
  2. With token = "paragraphs" there is a separator argument, paragraph_break, but there is no option for it in the dialogue. So add this option. We could call it Pattern perhaps?
  3. Similarly, when token = "regex" there is a pattern argument.
  4. This needs checking, but I think in each case we could usefully add our regex keyboard as an option for the pattern.
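
As a rough illustration of what b) and d) are asking for, here is a minimal base R sketch. It does not use data_book or tidytext; the toy data frame, its values, and the names lines_chr and data_long are all made up for the example, and it assumes the list column holds one character vector of lines per row.

```r
# Toy stand-in for the sonnets data (hypothetical values):
# 'list.lines1' is a list column, one character vector of lines per row.
data <- data.frame(sonnet = 1:2)
data$list.lines1 <- list(
  c("From fairest creatures we desire increase,",
    "That thereby beauty's rose might never die,"),
  c("When forty winters shall besiege thy brow,")
)

# b) One string per row: collapse each list element on its own,
#    rather than coercing the whole list, so no value becomes NA.
lines_chr <- vapply(data$list.lines1, paste, character(1), collapse = "\n")

# d) One row per line: repeat each row's id once per line, then flatten.
n_lines <- lengths(data$list.lines1)
data_long <- data.frame(
  sonnet = rep(data$sonnet, times = n_lines),
  lines = unlist(data$list.lines1),
  stringsAsFactors = FALSE
)
```

Here data_long has three rows (two for the first sonnet, one for the second). Whether the dialogue should do this through unnest_tokens or a different function is left open; the sketch only shows the shape of the result.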

Find what works for this example? I am almost there below:
image

I used the lines as follows:

```r
# Code generated by the dialog, Stack (Pivot Longer)

list.lines1 <- data_book$get_columns_from_data(data_name="data", col_names="list.lines1")
data <- data_book$get_data_frame(data_name="data")
data_unnest1 <- tidytext::unnest_tokens(input=list.lines1, tbl=data, output="lines", token="paragraphs", paragraph_break="\", \"")
data_book$import_data(data_tables=list(data_unnest1=data_unnest1))

rm(list=c("data_unnest1", "list.lines1", "data"))
```

That splits on the string ", " (double quote, comma, space, double quote), and you can see that it misses the cases where there is either a second space or perhaps a line return in the string. My regex is not good enough. I am not even sure why I need just a single \ as an escape here.
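
A small base R sketch of the two points above, using strsplit rather than tidytext. The text in x is invented for the example. Whether paragraph_break is treated as a fixed string or as a regular expression inside unnest_tokens is not confirmed here; if it is fixed, the regex version would presumably need token = "regex" with pattern instead.

```r
# In R source, \" is just a double quote inside a string literal, so a
# single backslash is enough. This separator is four characters:
# quote, comma, space, quote.
sep_fixed <- "\", \""
nchar(sep_fixed)

# Hypothetical text, with a line break instead of a space after the
# second comma.
x <- "\"line one,\", \"line two,\",\n\"line three\""

# Fixed separator: misses the break that contains the newline.
parts_fixed <- strsplit(x, sep_fixed, fixed = TRUE)[[1]]

# Regular expression: \\s+ accepts one or more spaces, tabs or newlines.
# The doubled backslash is needed because \s is a regex escape, not an
# R string escape.
sep_regex <- "\",\\s+\""
parts_regex <- strsplit(x, sep_regex)[[1]]
```

With this input parts_fixed has two pieces and parts_regex has three, which matches the missed splits described above.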

  1. Add to_lower = FALSE to the command when that option is set to false. The default is TRUE.