Proposal: Make `format` validate by default #1520
FWIW I am of like mind. (Corvus.JsonSchema does pass all the optional format tests, so that's easy for me to say!)
I agree that users expect that `format` validates.
I also agree users expect `format` to validate.

As described in https://json-schema.org/draft/2019-09/release-notes#format-vocabulary, it was very hard to make implementations interoperable on the validation they should perform and to what degree, and these inconsistencies sometimes ended up being more painful and confusing than not validating at all.

Sounds like if we go in this direction, we should not only make the currently optional tests required, but also extend them a lot more, which might be tricky given that some formats don't even have an "official" test suite of their own (again, like URIs?). Maybe a middle ground would be to make
It was always an annotation (or before annotations were a thing, it just wasn't validated).
I agree having some minimal requirement set for each format is probably warranted.
Another question this raises (as highlighted by @jdesrosiers' #1510 (comment)) is how we want implementations to handle formats they don't understand. Should implementations automatically pass validation for these or fail them? I think an implementation failing validation (or maybe even refusing to process the schema) gives a more expected outcome when compared with an implementation that does support the format. Getting a pass from an implementation that doesn't know the format when one that does fails feels more wrong to me. Thoughts?
Agreed. I think the biggest concern was the inconsistency between implementations. The test suite can address that issue. Maybe it's not perfect, but it can get better over time as we find edge cases. I think the only way making
I have different opinions depending on how we end up defining what asserting on `format` means.
IMO
No, it doesn't. People can still use
Making it always behave either way would definitively remove ambiguity. Making it validate would align with users' expectations. If you're referring to the level of support for each format offered by different implementations, other comments in this issue address that by saying we need more rigorous (and non-optional) testing.
It probably would be a breaking change for many schemas that don't declare the format-assertion vocabulary. On the other hand, it will fix the multitude of schemas that exist in the wild which expect format validation. It's also going to be a burden for many tooling maintainers because a lot of them don't support these formats fully. (I'm one of them.)
JSON Schema very clearly defines the specifications for each format, and they're all language-agnostic (except maybe `regex`).
Not all formats can be well-represented with regex, which is the only validation `pattern` can provide.
@Julian what would it take to get Bowtie to report on formats, maybe even just locally, so we can get some rough numbers?
Indeed, there's a great need to have a "format" keyword that validates. The difficulty in going about this before has been a couple of things:
Even when unknown keywords are ignored, validation keywords typically cause errors when their value is outside the permitted values, so I'd expect an error. Similarly, a validating "format" is not too different from a $ref that has predefined names—you're referencing some external, arbitrary validator, and if you don't know what that is, that's an error.
Can you give some examples? And if a format wouldn't be well-represented with regex, how would that format be validated by implementers?
Could you clarify this? Are you saying that a validating "email" format would reject a number? I can't find such a requirement. In fact, the (optional) test suite verifies that formats ignore value types to which the format doesn't apply. Maybe I'm misunderstanding you.
These generally need parsers. The spec says that it's expected that an implementation will rely on established functionality to perform the validation.
Well, I think this is precisely the point where this concern goes into ambiguity: the same
As for the
So you're right, most of the formats won't reject e.g. numbers. There's still a related problem if I have a schema like

More specifically: if it's possible to break down a keyword into multiple parts, because someone might want one to reject and not the other, then separate keywords are justified. "type" is the most common instance of this.
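As a sketch of the kind of schema where this concern shows up (an illustrative example, not necessarily the one the commenter had in mind): with no `type` constraint, even an asserting `format` would not reject a non-string instance, because formats only apply to the instance types they are defined for.

```jsonc
// schema: a format with no accompanying "type"
{
  "format": "date"
}

// instance: a number; "date" is defined only for strings,
// so even an asserting "format" would let this pass
42
```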
I think everyone has recognized this is a breaking change, but I'm glad this was brought up because we haven't stated explicitly the consequences of that fact. After the stable release, breaking changes won't be allowed, so if we want to make this change, it would have to be before the release.
In addition to Greg's response to this, there are other problems with relying on `pattern`.
I totally disagree on this one, and I think others should express their opinion on this matter. As for me, I will always prefer
@SorinGFS I think we're talking about two different things. If I understand you correctly, you understand

But, that's not the way
This is an incorrect assumption. Different systems handle dates differently. Some (like many spreadsheets) internally represent dates as numbers. .Net uses the `DateTime` type.
This is a valid opinion, but it is not an industry standard or best practice. JSON Schema (long ago) made a decision that the ideal way to represent dates in JSON should be strings per RFC 3339. My guess is that, at the time, JSON Schema was published under IETF, and they wanted to use an IETF standard for date representation. Since then, the majority of the internet (based on my experience, which is admittedly .Net heavy) seems to have decided they like ISO 8601 better. (Happy to be proven wrong.) So if anything, we should be changing the standard we reference.

But that's not what this discussion is about. We're not discussing any format specifically. We're discussing whether the `format` keyword should validate by default.

In the future, please keep discussions on topic. If you have something new to discuss, please open a new discussion. If it's related, add a link.
I'm torn. I am tempted by `stringFormat` and `numberFormat` being the validating keywords, because it also addresses the `type: [string, number]` issue (though personally I don't like the array form of `type` and feel you should be doing composition properly, lazybones 😀). But it doesn't address the existing expectations of `format`. And complying with the existing expectations of `format` requires a breaking change. I think I prefer to avoid the breaking change and introduce new keywords with new semantics. Not least because they address a second issue.
I vote for not introducing a breaking change: keeping `format` as an annotation.

This also potentially gives us the freedom to choose which string types we will standardise in the next version instead of standardising all of them. For example,

On naming, I get
I agree.
So it looks like we have the following choice:
Personally, I don't think that people will know that there's a new keyword, whether the new one validates or annotates, no matter how much we advertise it. (Evidenced by the continued questions around `format` itself.)

Moreover, it's only a breaking change in schemas that don't declare the format-assertion vocabulary.

I'd like to get our users' feelings on this. We need to ask something like what they expect the validation outcome of this to be:

```jsonc
// schema
{
  "type": "string",
  "format": "email"
}

// instance
"definitely not an email"
```

(As I went to put something up on reddit, reddit decided to start having problems.)

The primary problem with leaving `format` as an annotation is that it goes against user expectations. Why not give them what they want?
While many non-standard formats could be replaced by a `pattern`, isn't

```json
{
  "type": "string",
  "format": "semver"
}
```

easier to work with than

```json
{
  "type": "string",
  "pattern": "^(?P<major>0|[1-9]\\d*)\\.(?P<minor>0|[1-9]\\d*)\\.(?P<patch>0|[1-9]\\d*)(?:-(?P<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$"
}
```

?

So my personal preference would be to make `format` validate by default.

My biggest concern then would be about adding new formats to future versions of JSON Schema. We would surely want this to be considered a backward-compatible change, and it would be unless a validator were to have implemented a formerly-custom format differently to the newly-standardised spec. I think that would be an acceptable risk, under the assumption that a new format would not be standardised without first performing due diligence of considering the likelihood of causing consumer breakage. Presumably this would be done by continuing what's largely already done today - facilitating public discussion, soliciting feedback from implementors etc.

Then whenever a new format is being considered for standardisation, in the likely event that implementors generally ignore and/or have consistent rules for validation of the format, then it could be added with no problems. Otherwise, in the event of implementations having inconsistent validation rules for a format, it would need to be determined whether it's worth standardising it anyway and accepting the potential breakage (e.g. if it's accepted that those implementations are performing validation incorrectly due to misunderstanding the format), or else avoiding breakage by choosing a different name for the format.
Absolutely a good callout. This was brought up recently to me by someone else offline while discussing it. No answers; just the concern.

As a first pass at this, I'd say that the spec would require validation of the formats it defines and immediate passing validation for non-spec formats. An implementation must not provide validation for custom formats unless explicitly configured to do so. (I'd want to state what an implementation MUST NOT do rather than state what it MAY do. Implementations are going to do what they want anyway, so it's not useful to explicitly state what they're allowed to do. See also #1509.)

This would provide a standard expected behavior for a compliant implementation "out of the box", but also allow users to specify their own formats.
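To make the proposed "out of the box" split concrete, here is a sketch under the first-pass rules described above (the custom format name `x-ticket-id` is hypothetical):

```jsonc
// spec-defined format: validation required
// schema
{ "type": "string", "format": "email" }
// instance
"not an email"      // fails

// non-spec (custom) format: not validated unless the user explicitly opts in
// schema
{ "type": "string", "format": "x-ticket-id" }
// instance
"anything at all"   // passes by default
```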
100%. This was stated above, but the example provides a lot of clarity. Thanks.
This is a good point. I fully agree.
So I generally agree with this. The only thing I'm concerned about is that

Not necessarily proposing we should, but for example, I'd feel much better making
Currently, the requirement isn't ambiguous. Each format references some other specification which identifies its strict requirements. However, some of them can be a pain to implement fully, and even when using a third-party lib to support the format, there are no guarantees of support. I think the solution here is to:
We actually already do this for regex support (Core 6.4) for `pattern` and `patternProperties`.
(I'd like to rephrase some of this to make it more a requirement on implementations rather than schema authors, but that's #1509.) I think this is a good solution that's easily testable and removes a lot of edge cases from the hard requirements.
It should be handled the same way any unknown value on a keyword is handled, with an error. An unknown $ref is not ignored, so why ignore an unknown format?
There is a problem in that adding a format is not backwards compatible if pre-existing validators ignore the new format: it would create a situation where new validators will reject certain inputs, but the pre-existing validators ignore the format and accept the same input.

If a format is too new, and authoring a schema using it would produce an error in too many validators, then perhaps there should be a way to signal that a few different alternatives are equivalent, e.g. you say
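One shape such an equivalence signal could take is sketched below; the keyword name `formatAlternatives` is hypothetical and the semver pattern is only a loose approximation, so treat this purely as an illustration of the idea rather than a concrete proposal:

```jsonc
// hypothetical keyword: the listed subschemas are declared equivalent,
// and a validator applies whichever alternative it fully supports
{
  "type": "string",
  "formatAlternatives": [
    { "format": "semver" },
    { "pattern": "^\\d+\\.\\d+\\.\\d+(-[0-9A-Za-z.-]+)?(\\+[0-9A-Za-z.-]+)?$" }
  ]
}
```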
@awwright, yeah, that makes sense to me. Considering that, I would update what I said before to "the spec would require validation of the formats it defines and error for non-spec formats." This still leaves the door open for implementations to support custom formats, and it still requires users to explicitly opt in to that behavior, indicating they understand the risk of future-version collision.
This is a great start, but I think it doesn't fully solve the problem. Some "permissive" implementations will stay at the lower bar, some may go very strict, and many might be somewhere in between. At that point, schemas stop being fully interoperable between implementations even within the same dialect. If there were a way for us to better define the exact requirements and not only the lower bar, even for complex ones, I'd be 100% sold on this proposal.
I'm open to suggestions, but I don't think that being 100% exact is possible with formats like

The point of the minimum requirements isn't really for implementations. It's for users to know where that line is. As long as they stay within the "minimum support" bounds, they're guaranteed to be interoperable. (We state it as a requirement of the implementation, but it's an assurance for users.)
I agree JSON Schema formats should include arbitrarily complicated formats, e.g. context-free languages and even Turing-complete languages. However, note that most of the formats specified are regular, and validators shouldn't be required to support formats more complicated than regular. I think the only exceptions are internationalized domain names. Also,
Could we suggest specific regular expressions in the spec, like the ones @awwright is mentioning? I.e., a SHOULD as a lower bar?
I do pass all the optional format stuff, and I've got a suite of "fairly good" regexes that deal 'test-passingly' (I hesitate to say "correctly") with some of the more complex internationalisation requirements, along with some "encoding/decoding" you have to do to check "a bit more properly". I can describe the algorithms and regexes in a way that might be helpful for implementers.
The lower bar needs to be a MUST. Complete support would be the SHOULD.
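As a sketch of what a spec-suggested lower bar might look like (this regex is an illustration, not an agreed-upon floor), a MUST-level baseline for `date` could be a simple shape check, with full RFC 3339 range validation as the SHOULD:

```jsonc
// rough MUST-level floor for "format": "date": checks the YYYY-MM-DD shape only,
// and does not reject impossible dates such as "2023-02-31" (that would be the SHOULD)
{
  "type": "string",
  "pattern": "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"
}
```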
Sorry for being late to the discussion. I want to start by talking about the duality of the `format` keyword.

I agree with the sentiment that a new keyword will go relatively unknown. One solution to that problem is to deprecate `format`. Another option that occurred to me is to use the same

I really like the idea of requiring extension formats that need to validate to be URIs. That eliminates the possibility of us introducing a breaking change by adding new formats. So, we could allow unknown formats to be ignored without a breaking change concern. However, I think there are other reasons not to ignore unknown formats. The biggest reason is consistency. Users should expect consistent results from different validators unless they specifically opt in to something custom. If one implementation understands a format and another doesn't, the two will give different results for the same schema and instance.

Regarding the validation of things like the
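For illustration, an extension format expressed as a URI might look like the following; the URI is a placeholder:

```json
{
  "type": "string",
  "format": "https://example.com/formats/ticket-id"
}
```

Because spec-defined format names are short keywords rather than URIs, a newly standardised format could never collide with an extension named this way.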
@mwadams Can you link me to those? Happy to be the guinea pig on this if you agree, as I don't support
Sounds great. I'd still like to see more testing for these formats on the official test suite. I think most implementers just look at the basic coverage we have for i.e.
I agree - I'm not super attentive to the grammar if I pass the tests (though most of our tests are derived from the relevant RFCs and are quite tricky to pass!). I'll see about pulling something together.
For everyone here, please see the PR I've created for this ☝️
I'm afraid I haven't had time to get to that yet, but

(Follow the references to see how they are used in the various validation functions. They should be fairly readable to C++ devs!)
There's a long and sticky history around `format`. `format` has never required validation. `format` validation has always been the decision of the implementation.

The result of all of this is that implementation support for validation has been spotty at best. Despite the JSON Schema specs referencing very concretely defined formats (by referencing other specs), implementations that do support validation don't all support each format equally. This has been the primary driving force behind keeping `format` as an opt-in validation.

With 2019-09, we decided that it was time to give the option of `format` validation to the schema author. They could enable validation by using a meta-schema which listed the Format Vocabulary with a `true` value, which meant, "`format` validation is required to process this schema."

In 2020-12, we further refined this by offering two separate vocabularies, one that treats the keyword as an annotation and one that treats it as an assertion. The argument was that the behavior of a keyword shouldn't change based on whether the vocabulary was required or not.

However, the fact remains that our users consistently report (via questions in Slack, GitHub, and StackOverflow) that they expect `format` to validate. (The most recent case I can think of was only last week, in .Net's effort to build a short-term solution for schema generation from types.)

This consistency in user expectations leads me to believe that we should officially make `format` an assertion keyword and strictly enforce it by moving the appropriate tests into the required section of the Test Suite.

(Personally, I'm not passing all of the optional `format` tests, so I'll have to do some work to get there or document why they're not supported.)
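As a concrete reference for the vocabulary mechanism described above, a 2020-12 meta-schema that opts into format assertion declares the format-assertion vocabulary with a `true` value, roughly like this (the `$id` is a placeholder, and unrelated vocabularies are omitted for brevity):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/format-asserting-metaschema",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-12/vocab/core": true,
    "https://json-schema.org/draft/2020-12/vocab/applicator": true,
    "https://json-schema.org/draft/2020-12/vocab/validation": true,
    "https://json-schema.org/draft/2020-12/vocab/format-assertion": true
  },
  "$dynamicAnchor": "meta",
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2020-12/meta/core" },
    { "$ref": "https://json-schema.org/draft/2020-12/meta/applicator" },
    { "$ref": "https://json-schema.org/draft/2020-12/meta/validation" },
    { "$ref": "https://json-schema.org/draft/2020-12/meta/format-assertion" }
  ]
}
```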