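# HTML Experiment: Analyzing the HTML Standard with Code
Hello, I'm winter.
In the previous lessons we covered most HTML tags.
However, to keep the focus on what matters, we skipped some kinds of tags. Form tags and table tags, for example, are in my opinion used by only a minority of front-end engineers: in my entire time working on Mobile Taobao I never used a single table tag, and of the form tags I used only input, and only a few times.
So how do we learn about the remaining tags? By consulting the HTML standard, of course.
Reading the standard has a certain barrier to entry, since you need to understand some of its mechanisms. So for this lesson I designed a small experiment: using JavaScript code to extract the information we need from the standard.
## The HTML Standard
We use the WHATWG living standard. Let's first look at how the standard describes a tag, taking the video element as an example. The description contains the following.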
```
Categories:
Flow content.
Phrasing content.
Embedded content.
If the element has a controls attribute: Interactive content.
Palpable content.
Contexts in which this element can be used:
Where embedded content is expected.
Content model:
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
Tag omission in text/html:
Neither tag is omissible.
Content attributes:
Global attributes
src — Address of the resource
crossorigin — How the element handles crossorigin requests
poster — Poster frame to show prior to video playback
preload — Hints how much buffering the media resource will likely need
autoplay — Hint that the media resource can be started automatically when the page is loaded
playsinline — Encourage the user agent to display video content within the element's playback area
loop — Whether to loop the media resource
muted — Whether to mute the media resource by default
controls — Show user agent controls
width — Horizontal dimension
height — Vertical dimension
DOM interface:
[Exposed=Window, HTMLConstructor]
interface HTMLVideoElement : HTMLMediaElement {
[CEReactions] attribute unsigned long width;
[CEReactions] attribute unsigned long height;
readonly attribute unsigned long videoWidth;
readonly attribute unsigned long videoHeight;
[CEReactions] attribute USVString poster;
[CEReactions] attribute boolean playsInline;
};
```
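The description is divided into six parts:
* Categories: the categories the tag belongs to.
* Contexts in which this element can be used: where the tag may appear.
* Content model: what the tag may contain.
* Tag omission in text/html: whether the tags may be omitted.
* Content attributes: the content attributes.
* DOM interface: the element's type interface, defined in WebIDL.
In this lesson we focus on Categories, Contexts in which this element can be used, and Content model. I will walk you through scraping data from the standard to build a small tool that checks whether tag X can be placed inside tag Y.
## Analyzing the HTML Standard from a Code Perspective
The wording of the HTML standard is very rigorous, which makes scraping data from it much easier. First, open the single-page version of the HTML standard:
* [https://html.spec.whatwg.org/](https://html.spec.whatwg.org/)
On this page, run the following code: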
```
Array.prototype.map.call(document.querySelectorAll(".element"), e=>e.innerText);
```
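This gives us the definitions of all elements; at the time of writing there were 107.
Somewhat awkwardly, these texts do not contain the element names, so we have to get them from the id attribute. The final code looks something like this: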
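```JavaScript
var elementDefinations = Array.prototype.map.call(document.querySelectorAll(".element"), e => {
    // The element name only appears in the id of the first nested node, in the form "the-<name>-element:...".
    let match = e.childNodes[0].childNodes[0].id.match(/the-([\s\S]+)-element:/);
    return {
        text: e.innerText,
        name: match ? match[1] : null
    };
});
```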
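To see what the id pattern does in isolation, here is a minimal sketch that runs outside the browser. The sample id strings are assumptions modeled on the `the-<name>-element:` pattern that the code above relies on, and `elementNameFromId` is a hypothetical helper name, not something defined by the standard.

```javascript
// Hypothetical helper: pull the element name out of an id string.
function elementNameFromId(id) {
    const match = /the-([\s\S]+)-element:/.exec(id);
    return match ? match[1] : null;
}

// Assumed sample ids, shaped like the ones the page code reads.
console.log(elementNameFromId("the-video-element:concept-element-categories")); // "video"
console.log(elementNameFromId("some-unrelated-id")); // null
```

Any definition whose id does not follow this pattern ends up with a `null` name, so it is worth checking for `null` entries before building on the result.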
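Next, let's make sense of these texts with code. Looking at them, each one is split into six parts in a very fixed order, so we can use JavaScript regular expressions to separate the six fields.
The goal of this experiment is to compute containment relationships between elements, so we first look at the two fields categories and contentModel.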
```JavaScript
for(let defination of elementDefinations) {
    console.log(defination.name + ":")
    let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
    for(let category of categories) {
        console.log(category);
    }
    /*
    let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
    for(let line of contentModel)
        console.log(line);
    */
}
```
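The same idea extends to all six fields. The sketch below is one possible way to do it, and it scans line by line instead of using one large regular expression. `splitSections` is a hypothetical helper, and the sample text is a made-up definition that only imitates the layout shown earlier.

```javascript
// Headings in the fixed order in which they appear in each element definition.
const HEADINGS = [
    "Categories:",
    "Contexts in which this element can be used:",
    "Content model:",
    "Tag omission in text/html:",
    "Content attributes:",
    "DOM interface:",
];

// Hypothetical helper: split a definition's text into one array of lines per heading.
function splitSections(text) {
    const sections = {};
    let current = null;
    for (const rawLine of text.split("\n")) {
        const line = rawLine.trim();
        if (HEADINGS.includes(line)) {
            current = line.slice(0, -1); // drop the trailing colon
            sections[current] = [];
        } else if (current !== null && line !== "") {
            sections[current].push(line);
        }
    }
    return sections;
}

// Made-up sample text, not copied from the standard.
const sample = [
    "Categories:",
    "Flow content.",
    "Phrasing content.",
    "Contexts in which this element can be used:",
    "Where embedded content is expected.",
    "Content model:",
    "Nothing.",
    "Tag omission in text/html:",
    "No end tag.",
    "Content attributes:",
    "Global attributes",
    "DOM interface:",
    "interface HTMLExampleElement : HTMLElement {};",
].join("\n");

const sections = splitSections(sample);
console.log(sections["Categories"].length);  // 2
console.log(sections["Content model"][0]);   // "Nothing."
```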
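Next, let's process the categories.
The most basic form of a category line is a sentence that states the category directly. We first save these unconditional categories and print out the other descriptions to take a look: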
```JavaScript
for(let defination of elementDefinations) {
    //console.log(defination.name + ":")
    let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
    defination.categories = [];
    for(let category of categories) {
        // Unconditional lines such as "Flow content." are stored; everything else is printed.
        let match = category.match(/^([^ ]+) content\b/);
        if(match)
            defination.categories.push(match[1]);
        else
            console.log(category)
    }
    /*
    let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
    for(let line of contentModel)
        console.log(line);
    */
}
```
The first piece of logic still to be handled here is the lines that carry an If condition.
Let's look at everything that was printed:
```
None.
Sectioning root.
None.
Sectioning root.
None.
Form-associated element.
Listed and submittable form-associated element.
None.
Sectioning root.
None.
If the type attribute is not in the Hidden state: Listed, labelable, submittable, resettable, and autocapitalize-inheriting form-associated element.
If the type attribute is in the Hidden state: Listed, submittable, resettable, and autocapitalize-inheriting form-associated element.
Listed, labelable, submittable, and autocapitalize-inheriting form-associated element.
Listed, labelable, submittable, resettable, and autocapitalize-inheriting form-associated element.
None.
Listed, labelable, submittable, resettable, and autocapitalize-inheriting form-associated element.
Listed, labelable, resettable, and autocapitalize-inheriting form-associated element.
Labelable element.
Sectioning root.
Listed and autocapitalize-inheriting form-associated element.
None.
Sectioning root.
None.
Sectioning root.
Script-supporting element.
```
Several concepts appear here:
* None
* Sectioning root
* Form-associated element
* Labelable element
* Script-supporting element
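To classify elements truly completely, we would have to add regular expressions to the code to parse these rules. I leave this as today's exercise for you to complete.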
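As a starting point for that exercise, here is a rough sketch covering the line shapes listed above. `parseCategoryLine` is a hypothetical helper, and lowercasing the names and recording the If condition as a plain string are design choices made for this sketch rather than anything the standard prescribes.

```javascript
// Hypothetical helper: turn one Categories line into { condition, categories }.
function parseCategoryLine(line) {
    let condition = null;
    let rest = line.trim();

    // "If <condition>: <categories>" -> remember the condition, keep parsing the rest.
    const conditional = /^If (.+?): (.+)$/.exec(rest);
    if (conditional) {
        condition = conditional[1];
        rest = conditional[2];
    }
    rest = rest.replace(/\.$/, "");

    if (rest === "None")
        return { condition, categories: [] };

    // "Flow content" -> ["flow"]
    const content = /^([^ ]+) content$/.exec(rest);
    if (content)
        return { condition, categories: [content[1].toLowerCase()] };

    // "Listed, labelable, and ... form-associated element" -> the base term plus each adjective.
    const formAssociated = /^(.+) form-associated element$/.exec(rest);
    if (formAssociated) {
        const kinds = formAssociated[1]
            .split(/,\s*|\s+and\s+/)
            .map(kind => kind.replace(/^and\s+/, "").trim().toLowerCase())
            .filter(kind => kind !== "");
        return { condition, categories: ["form-associated", ...kinds] };
    }

    // "Sectioning root", "Labelable element", "Script-supporting element", "Form-associated element"
    return { condition, categories: [rest.replace(/ element$/, "").toLowerCase()] };
}

console.log(parseCategoryLine("Sectioning root."));
console.log(parseCategoryLine("Listed and submittable form-associated element."));
console.log(parseCategoryLine("If the element has a controls attribute: Interactive content."));
```

Whether a conditional category should count when checking containment is a separate decision; this sketch only preserves the condition so that the caller can decide.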
Next, let's look at the Content model. As before, we first handle the simplest part, the content models that are expressed as categories:
```JavaScript
for(let defination of elementDefinations) {
    //console.log(defination.name + ":")
    defination.contentModel = [];
    let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
    for(let line of contentModel) {
        // Lines that start with a category, such as "Flow content.", are stored; the rest are printed.
        let match = line.match(/^([^ ]+) content\b/);
        if(match)
            defination.contentModel.push(match[1]);
        else
            console.log(line)
    }
}
```
Good. As before, let's see what is left:
```
A head element followed by a body element.
If the document is an iframe srcdoc document or if title information is available from a higher-level protocol: Zero or more elements of metadata content, of which no more than one is a title element and no more than one is a base element.
Otherwise: One or more elements of metadata content, of which exactly one is a title element and no more than one is a base element.
Text that is not inter-element whitespace.
Nothing.
Text that gives a conformant style sheet.
One or more h1, h2, h3, h4, h5, h6 elements, optionally intermixed with script-supporting elements.
Nothing.
Zero or more li and script-supporting elements.
Either: Zero or more groups each consisting of one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
Or: One or more div elements, optionally intermixed with script-supporting elements.
Either: one figcaption element followed by flow content.
Or: flow content followed by one figcaption element.
Or: flow content.
If the element is a child of a dl element: one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
If the element is not a child of a dl element: flow content.
Transparent, but there must be no interactive content or a element descendants.
See prose.
Text.
If the element has a datetime attribute: Phrasing content.
Otherwise: Text, but must match requirements described in prose below.
Nothing.
Transparent.
Zero or more source elements, followed by one img element, optionally intermixed with script-supporting elements.
Nothing.
Zero or more param elements, then, transparent.
Nothing.
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
Nothing.
Transparent.
Nothing.
In this order: optionally a caption element, followed by zero or more colgroup elements, followed optionally by a thead element, followed by either zero or more tbody elements or one or more tr elements, followed optionally by a tfoot element, optionally intermixed with one or more script-supporting elements.
If the span attribute is present: Nothing.
If the span attribute is absent: Zero or more col and template elements.
Nothing.
Zero or more tr and script-supporting elements.
Zero or more td, th, and script-supporting elements.
Nothing.
Zero or more option, optgroup, and script-supporting elements.
Either: phrasing content.
Or: Zero or more option and script-supporting elements.
Zero or more option and script-supporting elements.
If the element has a label attribute and a value attribute: Nothing.
If the element has a label attribute but no value attribute: Text.
If the element has no label attribute and is not a child of a datalist element: Text that is not inter-element whitespace.
If the element has no label attribute and is a child of a datalist element: Text.
Text.
Optionally a legend element, followed by flow content.
One summary element followed by flow content.
Either: phrasing content.
Or: one element of heading content.
If there is no src attribute, depends on the value of the type attribute, but must match script content restrictions.
If there is a src attribute, the element must be either empty or contain only script documentation that also matches script content restrictions.
When scripting is disabled, in a head element: in any order, zero or more link elements, zero or more style elements, and zero or more meta elements.
When scripting is disabled, not in a head element: transparent, but there must be no noscript element descendants.
Otherwise: text that conforms to the requirements given in the prose.
Nothing (for clarification, see example).
Transparent
Transparent, but with no interactive content descendants except for a elements, img elements with usemap attributes, button elements, input elements whose type attribute are in the Checkbox or Radio Button states, input elements that are buttons, select elements with a multiple attribute or a display size greater than 1, and elements that would not be interactive content except for having the tabindex attribute specified.
```
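This is somewhat complicated, so let's sort it into groups. First we filter out the lines that name a content category (including the If lines that do), along with Nothing, Text, and Transparent.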
```JavaScript
for(let defination of elementDefinations) {
    //console.log(defination.name + ":")
    defination.contentModel = [];
    let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
    for(let line of contentModel) {
        // Not anchored at the start, so "If ...: Phrasing content." is captured as well.
        let match = line.match(/([^ ]+) content\b/);
        if(match)
            defination.contentModel.push(match[1]);
        else if(/Nothing|Transparent/.test(line))
            continue; // no category to record
        else if(/^Text\b/.test(line))
            continue; // plain text content
        else
            console.log(line)
    }
}
```
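To check how individual lines fall through these tests without opening the standard, the sketch below mirrors the three tests from the loop above in a standalone function. `classifyContentModelLine` is a hypothetical name, and the sample lines are taken from the output printed earlier.

```javascript
// Hypothetical helper: classify one Content model line using the same three tests.
function classifyContentModelLine(line) {
    const content = /([^ ]+) content\b/.exec(line);
    if (content)
        return { kind: "category", value: content[1] };
    if (/Nothing|Transparent/.test(line))
        return { kind: "ignored" };
    if (/^Text\b/.test(line))
        return { kind: "ignored" };
    return { kind: "unhandled", value: line };
}

console.log(classifyContentModelLine("Either: phrasing content."));
// kind "category", value "phrasing"
console.log(classifyContentModelLine("If the span attribute is present: Nothing."));
// kind "ignored"
console.log(classifyContentModelLine("Zero or more li and script-supporting elements."));
// kind "unhandled"
console.log(classifyContentModelLine("Transparent, but there must be no interactive content or a element descendants."));
// kind "category", value "interactive"
```

The last call shows a limitation of the unanchored pattern: a line that forbids interactive content is recorded as if it allowed it. The captured names also keep their original capitalization ("Phrasing" in one line, "phrasing" in another), so they likely need normalizing before comparison.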
Now let's run it again:
```
A head element followed by a body element.
One or more h1, h2, h3, h4, h5, h6 elements, optionally intermixed with script-supporting elements.
Zero or more li and script-supporting elements.
Either: Zero or more groups each consisting of one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
Or: One or more div elements, optionally intermixed with script-supporting elements.
If the element is a child of a dl element: one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
See prose.
Otherwise: Text, but must match requirements described in prose below.
Zero or more source elements, followed by one img element, optionally intermixed with script-supporting elements.
Zero or more param elements, then, transparent.
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
In this order: optionally a caption element, followed by zero or more colgroup elements, followed optionally by a thead element, followed by either zero or more tbody elements or one or more tr elements, followed optionally by a tfoot element, optionally intermixed with one or more script-supporting elements.
If the span attribute is absent: Zero or more col and template elements.
Zero or more tr and script-supporting elements.
Zero or more td, th, and script-supporting elements.
Zero or more option, optgroup, and script-supporting elements.
Or: Zero or more option and script-supporting elements.
Zero or more option and script-supporting elements.
If the element has a label attribute but no value attribute: Text.
If the element has no label attribute and is not a child of a datalist element: Text that is not inter-element whitespace.
If the element has no label attribute and is a child of a datalist element: Text.
When scripting is disabled, in a head element: in any order, zero or more link elements, zero or more style elements, and zero or more meta elements.
When scripting is disabled, not in a head element: transparent, but there must be no noscript element descendants.
Otherwise: text that conforms to the requirements given in the prose.
```
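Far fewer lines remain now. Nearly all of them directly describe the allowed elements. If you like, you can parse them further with code, but personally I would just write them out by hand as JSON; after all, there are only about thirty lines of text.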
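For readers who do want to go a step further in code, the simplest remaining lines (a flat list of element names) can be parsed as sketched below. `parseElementList` is a hypothetical helper. It deliberately gives up on anything containing "followed by", "intermixed", or "then", and treating "script-supporting" as a category rather than an element name is an assumption of this sketch.

```javascript
// Hypothetical helper: parse lines like "Zero or more td, th, and script-supporting elements."
// Returns null for anything more complex, which is better written out by hand.
function parseElementList(line) {
    const match = /^(?:Zero|One) or more (.+) elements\.$/.exec(line);
    if (!match || /followed by|intermixed|then/.test(match[1]))
        return null;
    const items = match[1]
        .split(/,\s*|\s+and\s+/)
        .map(item => item.replace(/^and\s+/, "").trim())
        .filter(item => item !== "");
    return {
        names: items.filter(item => item !== "script-supporting"),
        categories: items.filter(item => item === "script-supporting"),
    };
}

console.log(parseElementList("Zero or more li and script-supporting elements."));
// names: ["li"], categories: ["script-supporting"]
console.log(parseElementList("Zero or more td, th, and script-supporting elements."));
// names: ["td", "th"], categories: ["script-supporting"]
console.log(parseElementList("A head element followed by a body element."));
// null
```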
Now that we have contentModel and categories, to check whether one element can be a child of another we just test whether the two sides match. First we build an index:
```JavaScript
var dictionary = Object.create(null);
for(let defination of elementDefinations) {
    dictionary[defination.name] = defination;
}
```
Then we write our check function:
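```JavaScript
// Assumes each contentModel has been normalized (for example by hand, as JSON)
// into the shape { categories: [...], names: [...] }.
function check(parent, child) {
    for(let category of child.categories)
        if(parent.contentModel.categories.includes(category)) // arrays have includes(), not contains()
            return true;
    if(parent.contentModel.names.includes(child.name))
        return true;
    return false;
}
```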
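Here is a small self-contained run of this check. The dictionary entries are hand-written and simplified for illustration, not scraped output, and they assume the content model has already been normalized into an object with `categories` and `names` arrays, using lowercase category names.

```javascript
function check(parent, child) {
    for (const category of child.categories)
        if (parent.contentModel.categories.includes(category))
            return true;
    return parent.contentModel.names.includes(child.name);
}

// Hand-written, simplified sample entries (assumed shape).
const dictionary = Object.create(null);
dictionary["ul"] = { name: "ul", categories: ["flow"],
    contentModel: { categories: ["script-supporting"], names: ["li"] } };
dictionary["li"] = { name: "li", categories: [],
    contentModel: { categories: ["flow"], names: [] } };
dictionary["div"] = { name: "div", categories: ["flow", "palpable"],
    contentModel: { categories: ["flow"], names: [] } };
dictionary["p"] = { name: "p", categories: ["flow", "palpable"],
    contentModel: { categories: ["phrasing"], names: [] } };

console.log(check(dictionary["ul"], dictionary["li"]));  // true
console.log(check(dictionary["ul"], dictionary["p"]));   // false
console.log(check(dictionary["div"], dictionary["p"]));  // true
console.log(check(dictionary["p"], dictionary["div"]));  // false
```

This version ignores conditional rules and Transparent content models, so a `false` result only means that nothing in the simplified data allows the pairing.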
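## Summary
In this lesson we completed a small experiment: using tooling to analyze the text of a Web standard and obtain information about elements.
Through this experiment I hope to convey a way of thinking: code can help us dig a great deal of useful information out of Web standards, and writing that code is itself an opportunity to understand the standard more deeply.
In the previous lessons we divided elements into several groups, but those groups only roughly cover all tags. My goal in designing the course was to explain the knowledge behind the tags rather than the details of every tag; the attributes and details of each individual tag are left for you to work through yourselves.
What this lesson produces is an "absolutely complete list of tags", and it is also a small trick I use for studying and reading the standard. With code we can analyze the standard from different angles and dig out the points that need attention. This is a very good way to learn.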