This repository has been archived by the owner on Dec 21, 2023. It is now read-only.

TBlob.get_with_shape: new and old shape do not match total elements #2286

Closed
rplom opened this issue Sep 6, 2019 · 5 comments

Comments

@rplom
Contributor

rplom commented Sep 6, 2019

My tensor now has more than 4 billion elements, and MXNet can't handle it. I built a newer branch of MXNet and tried it, but pulling that in required newer versions of numpy and tensorflow. Turi Create more or less works with the new libraries, but some accuracy seems to be lost.

File "/Users/creator/venv/lib/python3.6/site-packages/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(LIB.MXGetLastError()))
mxnet.base.MXNetError: [05:01:12] include/mxnet/./tensor_blob.h:276: Check failed: this->shape.Size() == shape.Size() (4304252928 vs. 9285632) TBlob.get_with_shape: new and old shape do not match total elements

Stack trace returned 8 entries:
[bt] (0) 0 libmxnet.so 0x000000014c7a17cf libmxnet.so + 59343
[bt] (1) 1 libmxnet.so 0x000000014c7a156f libmxnet.so + 58735
[bt] (2) 2 libmxnet.so 0x000000014c7c3249 libmxnet.so + 197193
[bt] (3) 3 libmxnet.so 0x000000014d89f01e MXNDListFree + 1430366
[bt] (4) 4 libmxnet.so 0x000000014d886f95 MXNDListFree + 1331925
[bt] (5) 5 libmxnet.so 0x000000014d7115fd MXNDArraySyncCopyFromCPU + 13
[bt] (6) 6 _ctypes.cpython-36m-darwin.so 0x0000000112d00237 ffi_call_unix64 + 79
[bt] (7) 7 ??? 0x00007ffee14b78c0 0x0 + 140732678240448
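
A quick way to see why those two numbers appear together: the smaller one is exactly what is left of the larger one after it wraps around a 32-bit unsigned integer. The sketch below only does that arithmetic; the helper name `wrap_uint32` is made up for illustration and is not an MXNet function, and the assumption that the size was stored in a `uint32_t` comes from the release notes quoted in this comment.

```python
# Illustrative arithmetic only; wrap_uint32 is a hypothetical helper,
# not part of MXNet or Turi Create.
UINT32_MODULUS = 2 ** 32  # number of distinct values a uint32_t can hold


def wrap_uint32(n):
    """Return the value left after storing n in a 32-bit unsigned integer."""
    return n % UINT32_MODULUS


actual_elements = 4304252928  # left-hand number in the failed check
wrapped = wrap_uint32(actual_elements)

print(wrapped)  # 9285632, the right-hand number in the failed check
print(actual_elements - wrapped == UINT32_MODULUS)  # True
```

The difference between the two reported sizes is exactly 2³², which is consistent with the element count overflowing a 32-bit size somewhere on the way into `MXNDArraySyncCopyFromCPU`.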

Large Tensor Support
Before MXNet 1.5.0, MXNet supported a maximum tensor size of around 4 billion (2³²) elements. This was because uint32_t was used as the default data type for tensor size as well as for variable indexing. You can now enable large tensor support by setting the following build flag to 1: USE_INT64_TENSOR_SIZE = 1 (note that it is set to 0 by default). This enables large-scale training, for example large graph network training using Deep Graph Library.
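
To check whether an installed MXNet build was compiled with that flag, something like the following might work. It is a minimal sketch, assuming MXNet 1.5 or later (where the `mxnet.runtime` module exists) and assuming the runtime feature is reported under the name `INT64_TENSOR_SIZE`; the function name `has_int64_tensor_support` is hypothetical. On older builds, or when MXNet is not installed, it returns `None` rather than guessing.

```python
def has_int64_tensor_support():
    """Return True/False if MXNet reports the feature, None if it cannot be checked."""
    try:
        # mxnet.runtime is assumed to be available only in MXNet 1.5+.
        from mxnet.runtime import Features
    except (ImportError, OSError):
        # MXNet missing, too old, or its native library failed to load.
        return None
    try:
        # Feature name is an assumption based on the build flag's name.
        return bool(Features().is_enabled("INT64_TENSOR_SIZE"))
    except RuntimeError:
        # This build does not know the feature name.
        return None


print(has_int64_tensor_support())
```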

see: apache/mxnet#9207
https://medium.com/apache-mxnet/apache-mxnet-1-5-0-release-is-now-available-4138f5233401

@TobyRoseman
Collaborator

TobyRoseman commented Sep 6, 2019

This is the repository for TuriCreate, not for MXNet. Did you mean to create an issue for MXNet?

I'm going to close this issue. Feel free to clarify how this is a TuriCreate issue and reopen.

@rplom
Contributor Author

rplom commented Sep 6, 2019

No, that is not the problem. TuriCreate requires MXNet 1.1.0, which has the issue, and it is not compatible with the fixed version of MXNet, because that version needs a newer numpy and tensorflow. Can TuriCreate be updated to not require such outdated frameworks?

I don't see why you'd close the issue without even considering the context. TuriCreate is crippled by MXNet and you should change your requirements or else no one will be able to create big models. I can't use the audio classifier past ~60K items before it dies.

@TobyRoseman
Collaborator

@rplom - What OS and Python version are you using? Could you share the Python stack trace and your code?

@rplom
Contributor Author

rplom commented Sep 10, 2019

@TobyRoseman I opened the other ticket about the requirements. I have Python 3.6.5. But the thing is, unless you can build a custom MXNet with the USE_INT64_TENSOR_SIZE flag (which I will do locally), you don't get the 64-bit integer fix. If I build that, I'm on MXNet 1.5, and then it's the classic case of pulling on a dependency thread. Thanks for looking at this. I have something pretty amazing I'm building with Turi and am pushing the boundaries a bit.

@rplom
Contributor Author

rplom commented Sep 10, 2019

I'm happy to share the code privately with you, but I don't want to post it here.
