
Error: A valid tsibble must have distinct rows identified by key and index. Please use duplicates() to check the duplicated rows. #191

Closed
edgBR opened this issue May 25, 2020 · 6 comments

Comments


edgBR commented May 25, 2020

Dear colleagues,

I am trying to fit an auto.arima model to multiple weekly time series grouped by three columns (machine, signal and instrument):

> df
# A tibble: 1,776,835 x 5
# Groups:   machine, signal, instrument [4,277]
  machine signal snsr_dt    instrument snsr_val_clean
  <chr>    <dbl> <date>          <dbl>          <dbl>
1 xxxxxx      71 2011-03-14          1          63543

Created on 2020-05-25 by the reprex package (v0.3.0)

However, when I try to convert it to a tsibble I get the following error:

> data_tstibble <- as_tsibble(df, key = c(machine, signal, instrument))
Using `snsr_dt` as index variable.
Error: A valid tsibble must have distinct rows identified by key and index.
Please use `duplicates()` to check the duplicated rows.
Run `rlang::last_error()` to see where the error occurred.

Created on 2020-05-25 by the reprex package (v0.3.0)

Every row in that data frame is unique in terms of the combination of machine, signal, instrument and snsr_val. It is possible, however, that for a specific combination of machine, signal and instrument I have duplicated values (the sensor is cumulative, and if it is not measuring, the value stays constant). Is this why I have to check the duplicates?

Is there any workaround for this?

BR
/Edgar

@mitchelloharawild (Member)

> It is possible, however, that for a specific combination of machine, signal and instrument I have duplicated values (the sensor is cumulative, and if it is not measuring, the value stays constant). Is this why I have to check the duplicates?

Even if the measured value remains constant, the time of observation (the index, snsr_dt) should differ. The error is due to a particular series (key combination) in the tsibble having two or more observations during the exact same time period. As ARIMA() requires evenly spaced observations in time, having two observations for the same time point will be problematic.
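
A minimal sketch of this (toy values, not the actual data): two rows that share the same machine/signal/instrument and the same snsr_dt clash on the (key, index) pair, even though the other columns differ.

library(tsibble)
library(tibble)

# Toy data: the first two rows share the key AND the date, so they are
# duplicates in the tsibble sense even though the measured values differ.
toy <- tibble(
  machine        = "machine1",
  signal         = 71,
  instrument     = 1,
  snsr_dt        = as.Date(c("2011-03-14", "2011-03-14", "2011-03-21")),
  snsr_val_clean = c(63543, 63544, 63550)
)

# Fails: "A valid tsibble must have distinct rows identified by key and index."
as_tsibble(toy, key = c(machine, signal, instrument), index = snsr_dt)

# Shows exactly which rows clash on (key, index)
duplicates(toy, key = c(machine, signal, instrument), index = snsr_dt)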

Perhaps the error message can be tweaked to de-emphasise 'duplicated rows' and instead emphasise the temporal nature of the duplicate.


edgBR commented May 25, 2020

Hi @mitchelloharawild,

I am afraid that I do not understand the error message:

> duplicates(df1)
Using `snsr_dt` as index variable.
# A tibble: 1,755,129 x 4
  snsr_dt    machine  signal instrument
  <date>     <chr>     <dbl>      <dbl>
1 2011-03-14 machine1     71          1
2 2011-03-21 machine1     71          1
3 2011-03-28 machine1     71          1

The snsr_dt values seem different to me. Also, the process that generates this data actually drops duplicates at the end, so I am a little bit confused here.

BR
/Edgar


mitchelloharawild commented May 25, 2020

When using the duplicates() function, you should also specify your key and index variables.
duplicates(df, key = c(machine, signal, instrument), index = snsr_dt)
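
For example (a sketch reusing the column names from this thread), the duplicated rows can then be counted per series to see which combinations are affected:

dups <- duplicates(df, key = c(machine, signal, instrument), index = snsr_dt)

nrow(dups)  # total rows that clash on (key, index)
dplyr::count(dups, machine, signal, instrument, sort = TRUE)  # worst offenders first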


edgBR commented May 26, 2020

Dear @mitchelloharawild,

Thank you for the details. In fact I have 62 duplicates. I might have misunderstood the dplyr::distinct() function, but basically I was doing:

df_raw <- dbGetQuery(myDB,
                     "SELECT * FROM target_table
                      WHERE locf_tag='N'
                      AND (EXTRACT(WEEKDAY FROM timestamp)=1)
                      AND (EXTRACT(HOUR FROM timestamp)=1)") %>%
  rename_all(tolower) %>%
  group_by(machine, signal, instrument)

df <- df_raw %>%
  filter(n() >= 52 * 2) %>%  # removing groups with less than 2y of observations
  mutate(snsr_val_clean = if_else(condition = inc_tag == 'Y', true = snsr_val, false = NA_real_)) %>%
  mutate(snsr_val_clean = na.approx(snsr_val_clean, na.rm = FALSE)) %>%
  # tk_augment_timeseries_signature(snsr_dt) %>%
  # select_if(negate(is.factor)) %>%
  # select(-c(diff)) %>%
  select(-c(interp_qlty, base_qlty, inc_tag, locf_tag, timestamp, snsr_val)) %>%
  # select(-c("hour", "minute", "second", "hour12", "wday", "wday.xts", "am.pm")) %>%
  drop_na() %>%
  distinct()

Shouldn't this remove all of the duplicates?

BR
/Edgar

@mitchelloharawild (Member)

How you remove your duplicates is problem specific, although distinct() should help with this. Now that we've confirmed this is not a bug, please ask questions on a community help forum.
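
For instance (a hedged sketch, reusing the column names from the pipeline above): distinct() with no arguments only drops rows that are identical across every remaining column, so two rows with the same key and snsr_dt but different snsr_val_clean both survive. Restricting distinct() to the key and index keeps one row per series per date:

df_dedup <- df %>%
  dplyr::ungroup() %>%
  # keep the first row for each (machine, signal, instrument, snsr_dt)
  dplyr::distinct(machine, signal, instrument, snsr_dt, .keep_all = TRUE)

data_tsibble <- tsibble::as_tsibble(
  df_dedup,
  key   = c(machine, signal, instrument),
  index = snsr_dt
)

Whether to keep the first value, the last, or an aggregate (e.g. the maximum of a cumulative sensor) is the problem-specific part.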

@aloes2512

I get the same error message after downloading a csv file and using pivot_longer() to get a three-column data frame: the first column is the index (monthly data), the second (the key) holds 14 observations for every month, stored as factors, and the third is a value for each index-key pair; all are distinct. Trying to find the duplicates I get the error message:

Error in dublicates(., key = Pattern, index = yr.mnth) : could not find function "dublicates"

Now my confusion is complete.
