Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

details::os::path_exists(const std::string&) doesn't handle UTF8 filename correctly under Windows #2977

Closed
liubing opened this issue Jan 13, 2024 · 17 comments · Fixed by #2978

Comments

@liubing
Copy link
Contributor

liubing commented Jan 13, 2024

My application must support both Windows and Linux, so every path and filename inside the app is encoded UTF8 with std::string, and WCHAR is not used (SPDLOG_WCHAR_FILENAMES is not enabled).

But under Windows (code taken from the latest spdlog 1.13.0):

SPDLOG_INLINE bool path_exists(const filename_t &filename) SPDLOG_NOEXCEPT {
#ifdef _WIN32
    #ifdef SPDLOG_WCHAR_FILENAMES
    auto attribs = ::GetFileAttributesW(filename.c_str());
    #else
    auto attribs = ::GetFileAttributesA(filename.c_str());
    #endif
    return attribs != INVALID_FILE_ATTRIBUTES;
#else  // common linux/unix all have the stat system call

GetFileAttributesA() will use CP_ACP (the default system code page) not CP_UTF8, which will fail if the path contains UTF8-encoded non-ascii characters.
My current work around for this issue is:

SPDLOG_INLINE bool path_exists(const filename_t &filename) SPDLOG_NOEXCEPT {
    std::error_code ec;
    return std::filesystem::exists(filename, ec);
}

But I know std::filesystem requires C++17, so it doesn't count as a general solution for spdlog.

Note for the above code to work under Windows, we need to call
setlocale(LC_ALL, ".UTF8")
at the beginning of the app (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support), so that std::filesystem::path will treat the pathname as UTF8 encoded string.

I just open this issue for discussion, still looking for a more general solution. Thanks.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

It turns out removing the #ifdef _WIN32 branch solves the problem perfectly, which leaves:

// Return true if path exists (file or directory)
SPDLOG_INLINE bool path_exists(const filename_t &filename) SPDLOG_NOEXCEPT {
    struct stat buffer;
    return (::stat(filename.c_str(), &buffer) == 0);
}

Because MSVC also support stat() function.

If setlocale(LC_ALL, ".UTF8") is called before doing stat(), the filename will automatically be treated as UTF8 string.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

The character encoding of char* is not specified in C/C++.
Even if stat() can be called on Windows, there is no guarantee that other functions that take char* in UTF-8 encoding as an argument will work correctly.

If you know that the character encoding of char* is UTF-8, it is the developer's responsibility to convert it to the character encoding required by the execution environment.
SPDLOG_WCHAR_FILENAMES is one way to do this.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

path_exits() is the only odd place that calls Windows API directly, and all other places in os-inl.h just call C runtime functions, such as _fsopen(), std::rename(), ::stat(), etc.

On Windows as long as setlocale(LC_ALL, ".UTF8") is called at the beginning of the app code, all these C runtime functions will treat char* as UTF8 strings. These additional layers of indirection via C runtime over Windows API provide a much better way out of the dilemma of choosing SPDLOG_WCHAR_FILENAMES or not.

SPDLOG_WCHAR_FILENAMES is not an option for me. For cross-platform development, this setup (every string is UTF8, avoid std::wstring completely) is becoming more and more prevalent.

If we can remove the only odd place in path_exists(), it would make UTF8 handling under Windows without SPDLOG_WCHAR_FILENAMES much more smoothly.

Thanks.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

In my opinion, this solves the issue completely:

// Return true if path exists (file or directory)
SPDLOG_INLINE bool path_exists(const filename_t &filename) SPDLOG_NOEXCEPT {
#if defined(_WIN32) && defined(SPDLOG_WCHAR_FILENAMES)
    auto attribs = ::GetFileAttributesW(filename.c_str());
    return attribs != INVALID_FILE_ATTRIBUTES;
#else  // common linux/unix and Windows all have the stat system call
    struct stat buffer;
    return (::stat(filename.c_str(), &buffer) == 0);
#endif
}
  1. On Windows with SPDLOG_WCHAR_FILENAMES, this goes to GetFileAttributesW() without any problem.
  2. On Windows with setlocale(LC_ALL, ".UTF8"), the filename will be treated like UTF8 string, which is what we want exactly.
  3. On Windows without setlocale(LC_ALL, ".UTF8"), the filename will be treated with the current code page, which will be the same effect as the old GetFileAttributesA() code path.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

I can find _stat() in MSDN, but not stat().
Is it really supported on Windows OS?

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

Well, I think so. There's an stat.h in my Windows SDK, and it says:

        static __inline int __CRTDECL stat(char const* const _FileName, struct stat* const _Stat)
        {
            _STATIC_ASSERT(sizeof(struct stat) == sizeof(struct _stat64i32));
            return _stat64i32(_FileName, (struct _stat64i32*)_Stat);
        }

Just wrapping around some _statxxx function.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

You can send a pull request, but I am not sure if it will be accepted if you are using an API that is not officially documented.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

How about this one:

// Return true if path exists (file or directory)
SPDLOG_INLINE bool path_exists(const filename_t &filename) SPDLOG_NOEXCEPT {
#ifdef _WIN32
    struct _stat64i32 buffer;
    #ifdef SPDLOG_WCHAR_FILENAMES
    return (::_wstat64i32(filename.c_str(), &buffer) == 0);
    #else
    return (::_stat64i32(filename.c_str(), &buffer) == 0);
    #endif
#else  // common linux/unix all have the stat system call
    struct stat buffer;
    return (::stat(filename.c_str(), &buffer) == 0);
#endif
}

Need your opinion before sending the PR.
Thanks.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

Looking at the MSDN (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/stat-functions?view=msvc-170#time-type-and-file-length-type-variations-of-_stat), the size of the time data and file data differs depending on _statXXX.
Will _stat64i32 (_wstat64i32) work correctly on 32bit Windows?

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

Yes, I think so.

I currently don't have a pure 32bit Windows (I don't think anyone has these days), but I can compile it to x86 mode and run as 32bit process without any problem.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

If it works, then there is no problem.
Please send me a PR.
However, @gabime is the last person to review and merge.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

OK, thank you very much.

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

For your information.

I just remembered that I used std::filesystem::u8path() when I encountered a similar problem in the past.
std::filesystem::path::string() returns a std::string character encoding that can be passed to the OS API. This is used for cross-platform support.

std::string file_path = "/path/to/file";
#ifdef _WIN32
file_path = std::filesystem::u8path(file_path.c_str()).string();
#endif

This could be a workaround if the PR is not merged.

However, std::filesystem::u8path() is deprecated in C++20 because char8_t was added in C++20.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

Thank you, this is very helpful!

What could I use if I happens to compile to C++20?

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

It also works with C++20.
Perhaps std::filesystem::u8path() will be removed in a future C++.

@liubing
Copy link
Contributor Author

liubing commented Jan 13, 2024

Maybe std::filesystem::path(reinterpret_cast<const char8_t*>("utf8/encoded/path")) is the new preferred way to do the same as std::filesystem::u8path()?

@tt4g
Copy link
Contributor

tt4g commented Jan 13, 2024

In C++20, yes.
However, reinterpret_cast<const char8_t*>(const char *) is not safe in C++20.
See the following link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants