
Error: A valid tsibble must have distinct rows identified by key and index. Please use duplicates() to check the duplicated rows. #191

Closed
edgBR opened this issue May 25, 2020 · 6 comments

Comments


edgBR commented May 25, 2020

Dear colleagues,

I am trying to fit an auto.arima model to multiple weekly time series grouped by three columns (machine, signal and instrument):

> df
# A tibble: 1,776,835 x 5
# Groups:   machine, signal, instrument [4,277]
  machine signal snsr_dt    instrument snsr_val_clean
  <chr>    <dbl> <date>          <dbl>          <dbl>
1 xxxxxx      71 2011-03-14          1          63543

Created on 2020-05-25 by the reprex package (v0.3.0)

However, when I try to convert it to a tsibble I get the following error:

> data_tstibble <- as_tsibble(df, key = c(machine, signal, instrument))
Using `snsr_dt` as index variable.
Error: A valid tsibble must have distinct rows identified by key and index.
Please use `duplicates()` to check the duplicated rows.
Run `rlang::last_error()` to see where the error occurred.

Created on 2020-05-25 by the reprex package (v0.3.0)

Every row in that data frame is unique in terms of the combination of machine, signal, instrument and snsr_val. It is possible, however, that for a specific combination of machine, signal and instrument I have duplicated values (the sensor is cumulative, and if it is not measuring, the value stays constant). Is this why I have to check the duplicates?

Is there any workaround for this?

BR
/Edgar

@mitchelloharawild (Member)

> It is possible, however, that for a specific combination of machine, signal and instrument I have duplicated values (the sensor is cumulative, and if it is not measuring, the value stays constant). Is this why I have to check the duplicates?

Even if the measured value remains constant, the time of observation (the index, snsr_dt) should differ. The error is due to a particular series (key combination) in the tsibble having two or more observations during the exact same time period. As ARIMA() requires evenly spaced observations in time, having two observations for the same time point will be problematic.
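
A minimal sketch of this (toy values, not the actual data): two rows that share the same machine/signal/instrument and the same snsr_dt clash on the (key, index) pair, even though the other columns differ.

library(tsibble)
library(tibble)

# Toy data: the first two rows share the key AND the date, so they are
# duplicates in the tsibble sense even though the measured values differ.
toy <- tibble(
  machine        = "machine1",
  signal         = 71,
  instrument     = 1,
  snsr_dt        = as.Date(c("2011-03-14", "2011-03-14", "2011-03-21")),
  snsr_val_clean = c(63543, 63544, 63550)
)

# Fails: "A valid tsibble must have distinct rows identified by key and index."
as_tsibble(toy, key = c(machine, signal, instrument), index = snsr_dt)

# Shows exactly which rows clash on (key, index)
duplicates(toy, key = c(machine, signal, instrument), index = snsr_dt)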

Perhaps the error message can be tweaked to de-emphasise 'duplicated rows' and instead emphasise the temporal nature of the duplicate.


edgBR commented May 25, 2020

Hi @mitchelloharawild,

I am afraid that I do not understand the error message:

> duplicates(df1)
Using `snsr_dt` as index variable.
# A tibble: 1,755,129 x 4
  snsr_dt    machine  signal instrument
  <date>     <chr>     <dbl>      <dbl>
1 2011-03-14 machine1     71          1
2 2011-03-21 machine1     71          1
3 2011-03-28 machine1     71          1

The snsr_dt values seem different to me. Also, the process that generates this data actually drops duplicates at the end, so I am a little bit confused here.

BR
/Edgar


mitchelloharawild commented May 25, 2020

When using the duplicates() function, you should also specify your key and index variables.
duplicates(df, key = c(machine, signal, instrument), index = snsr_dt)
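
For example (a sketch reusing the column names from this thread), the duplicated rows can then be counted per series to see which combinations are affected:

dups <- duplicates(df, key = c(machine, signal, instrument), index = snsr_dt)

nrow(dups)  # total rows that clash on (key, index)
dplyr::count(dups, machine, signal, instrument, sort = TRUE)  # worst offenders first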


edgBR commented May 26, 2020

Dear @mitchelloharawild,

Thank you for the details. In fact I have 62 duplicates. I might have misunderstood the dplyr::distinct() function, but basically I was doing:

df_raw <- dbGetQuery(myDB,
                     "SELECT * FROM target_table
                      WHERE locf_tag='N'
                      AND (EXTRACT(WEEKDAY FROM timestamp)=1)
                      AND (EXTRACT(HOUR FROM timestamp)=1)") %>%
  rename_all(tolower) %>%
  group_by(machine, signal, instrument)

df <- df_raw %>%
  filter(n() >= 52 * 2) %>%  # removing groups with less than 2y of observations
  mutate(snsr_val_clean = if_else(condition = inc_tag == 'Y', true = snsr_val, false = NA_real_)) %>%
  mutate(snsr_val_clean = na.approx(snsr_val_clean, na.rm = FALSE)) %>%
  # tk_augment_timeseries_signature(snsr_dt) %>%
  # select_if(negate(is.factor)) %>%
  # select(-c(diff)) %>%
  select(-c(interp_qlty, base_qlty, inc_tag, locf_tag, timestamp, snsr_val)) %>%
  # select(-c("hour", "minute", "second", "hour12", "wday", "wday.xts", "am.pm")) %>%
  drop_na() %>%
  distinct()

Shouldn't this remove all of the duplicates?

BR
/Edgar

@mitchelloharawild (Member)

How you remove your duplicates is problem specific, although distinct() should help with this. Now that we've confirmed this is not a bug, please ask questions on a community help forum.
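
For instance (a hedged sketch, reusing the column names from the pipeline above): distinct() with no arguments only drops rows that are identical across every remaining column, so two rows with the same key and snsr_dt but different snsr_val_clean both survive. Restricting distinct() to the key and index keeps one row per series per date:

df_dedup <- df %>%
  dplyr::ungroup() %>%
  # keep the first row for each (machine, signal, instrument, snsr_dt)
  dplyr::distinct(machine, signal, instrument, snsr_dt, .keep_all = TRUE)

data_tsibble <- tsibble::as_tsibble(
  df_dedup,
  key   = c(machine, signal, instrument),
  index = snsr_dt
)

Whether to keep the first value, the last, or an aggregate (e.g. the maximum of a cumulative sensor) is the problem-specific part.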

@aloes2512

I get the same error message after downloading a csv file and using pivot_longer() to get a three-column data frame: the first column is the index (monthly data), the second (the key) holds 14 observations for every month, stored as factors, and the third is a value for each index-key pair; all are distinct. Trying to find the duplicates I get the error message:

Error in dublicates(., key = Pattern, index = yr.mnth) : could not find function "dublicates"

Now my confusion is complete.
