New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document that OsString and OsStr are bytes; provide conversions to bytes
#95290
base: master
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
It seems like this only requires that there is some variant of OsStr that is byte-based encoded as a superset of UTF-8, right? It would be possible for OsStr to wrap something like I mostly ask/note this because I want to make sure that this kind of option is explicitly mentioned in the writeup (and potentially rejected, either for complexity or infeasibility if (for example) we can't think of an encoding that will work), so that we can point back to it when thinking about this in the future. |
Fair enough; strictly speaking, the requirement would be that We could, for example, use some encoding that has a second variant for a native |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…o bytes This commit consists almost entirely of documentation, updating the module documentation in `std::ffi` and the type documentation on `OsString`. This only adds two methods: `OsStr::as_bytes` and `OsString::into_vec`, and plumbs those methods down to the existing implementations for both UNIX and Windows.
Note that this crate doesn't rely on the internal representation at all. It decodes the encoded Also, the preferred way to use the crate is to use I think this PR would still be a good idea, since many people rely on the representation anyway. However, I wanted to clarify those points.
One option would be to do the same as os_str_bytes and document limited invariants about the encoding. This would allow many of the same changes that could be made now to the internal representation. |
|
There is one problem if this PR gets merged and starts getting used. Without a wrapper, it becomes very easy to mix WTF-8 and UTF-8 bytes. For searching and concatenation, that would be a problem, since both need special handling of WTF-8 bytes. Also, if storage is permitted, the encoding could never change. |
I have no objections to documenting a lesser guarantee and making the encoding less specified, if that's needed in order to get consensus to make this change. |
rustdoc: Early doc link resolution fixes and refactorings A subset of rust-lang#94857 that shouldn't cause perf regressions, but should fix some issues like https://rust-lang.zulipchat.com/#narrow/stream/266220-rustdoc/topic/ICE.20in.20collect_intra_doc_links.2Ers rust-lang#95290 and improve performance in cases like rust-lang#95694.
rustdoc: Early doc link resolution fixes and refactorings A subset of rust-lang#94857 that shouldn't cause perf regressions, but should fix some issues like https://rust-lang.zulipchat.com/#narrow/stream/266220-rustdoc/topic/ICE.20in.20collect_intra_doc_links.2Ers rust-lang#95290 and improve performance in cases like rust-lang#95694.
The
OsStringandOsStrtypes are wrappers around bytes, and are required by existing stable APIs to be byte-based encodings that are supersets of UTF-8 (since a&strcan become an&OsStrwithout allocation). However, both types are opaque, and do not provide access to raw bytes, primarily because on Windows they use the WTF-8 encoding, the standard for which recommends against exposing it.Nonetheless, many crates do rely on
OsStrandOsStringbeing bytes. We already provide theOsStrExtandOsStringExtextension traits, which get used reasonably widely despite being non-portable. And theos_str_bytescrate has millions of downloads, and tens of thousands of new downloads everyday.This PR is an effort to seek consensus for making the contents of an
OsStrorOsStringavailable as bytes, inspired by recent discussions within the libs-api team.This is a minimal PR to acheive that goal. This PR consists almost entirely of documentation, updating the module documentation in
std::ffiand the type documentation onOsString. This PR only adds two methods:OsStr::as_bytesandOsString::into_vec, and plumbs those methods down to the existing implementations for both UNIX and Windows. Both methods are unstable (and I'll file a tracking issue for them as soon as I take this PR out of draft status).In particular, this PR does not add any new
FromorTryFromorIntoorTryIntoimpls, nor does it add methods to constructOsStrorOsStringfrom bytes (which would require error handling), nor does it attempt to do any further unification of implementations between Windows and UNIX, nor does it extend these types to have more of the functionality of bstr. All of those things could happen in potential future changes, and I'd be happy to make some of those changes myself, but I'd like to focus on the initial consensus for making the change before further work that depends on the change.As one further note of the implications of this change: there has been a proposal to change OsStr/OsString on Windows from WTF-8 to a more complex encoding that can potentially speed up matching for use with the Pattern API. Since that proposal, we've decided to seal the Pattern API and avoid stabilizing it. I would propose that we not switch the encoding of OsStr/OsString, and proceed with exposing bytes using the current encoding. Another option would be to expose bytes but not specify the exact encoding beyond "superset of UTF-8".
r? @ghost