Commit

feat(gatsby-source-filesystem): Only generate hashes when a file has changed, and add an option for skipping hashing (#37464)

Co-authored-by: LekoArts <lekoarts@gmail.com>
FraserThompson and LekoArts authored Jan 17, 2023
1 parent 949132b commit df58891
Showing 5 changed files with 236 additions and 111 deletions.
76 changes: 47 additions & 29 deletions packages/gatsby-source-filesystem/README.md
@@ -1,35 +1,28 @@
# gatsby-source-filesystem

A Gatsby source plugin for sourcing data into your Gatsby application
from your local filesystem.
A Gatsby source plugin for sourcing data into your Gatsby application from your local filesystem.

The plugin creates `File` nodes from files. The various "transformer"
plugins can transform `File` nodes into various other types of data e.g.
`gatsby-transformer-json` transforms JSON files into JSON data nodes and
`gatsby-transformer-remark` transforms markdown files into `MarkdownRemark`
nodes from which you can query an HTML representation of the markdown.
The plugin creates `File` nodes from files. The various "transformer" plugins can transform `File` nodes into various other types of data e.g. [`gatsby-transformer-json`](https://www.gatsbyjs.com/plugins/gatsby-transformer-json/) transforms JSON files into JSON data nodes and [`gatsby-transformer-remark`](https://www.gatsbyjs.com/plugins/gatsby-transformer-remark/) transforms markdown files into `MarkdownRemark` nodes from which you can query an HTML representation of the markdown.

## Install

`npm install gatsby-source-filesystem`
```shell
npm install gatsby-source-filesystem
```

## How to use

```javascript
// In your gatsby-config.js
You can have multiple instances of this plugin to read source nodes from different locations on your filesystem. Be sure to give each instance a unique `name`.

```js:title=gatsby-config.js
module.exports = {
plugins: [
// You can have multiple instances of this plugin
// to read source nodes from different locations on your
// filesystem.
//
// The following sets up the Jekyll pattern of having a
// "pages" directory for Markdown files and a "data" directory
// for `.json`, `.yaml`, `.csv`.
{
resolve: `gatsby-source-filesystem`,
options: {
// The unique name for each instance
name: `pages`,
// Path to the directory
path: `${__dirname}/src/pages/`,
},
},
@@ -38,7 +31,10 @@ module.exports = {
options: {
name: `data`,
path: `${__dirname}/src/data/`,
ignore: [`**/\.*`], // ignore files starting with a dot
// Ignore files starting with a dot
ignore: [`**/\.*`],
// Use "mtime" and "inode" to fingerprint files (to check if file has changed)
fastHash: true,
},
},
],
@@ -47,9 +43,23 @@ module.exports = {

## Options

In addition to the name and path parameters you may pass an optional `ignore` array of file globs to ignore.
### name

**Required**

A unique name for the `gatsby-source-filesystem` instance. This name will also be a key on the `File` node called `sourceInstanceName`. You can use this e.g. for filtering.
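
As a rough illustration of filtering by instance name, the sketch below checks `sourceInstanceName` on plain objects shaped like `File` nodes. The helper `isFileFromInstance` and the sample nodes are hypothetical and not part of the plugin.

```javascript
// Hypothetical helper (not part of the plugin): true when a node is a
// File node created by the instance configured with the given `name`.
function isFileFromInstance(node, instanceName) {
  return (
    node.internal.type === `File` && node.sourceInstanceName === instanceName
  )
}

// Made-up sample nodes that only mimic the relevant File node fields
const nodes = [
  { base: `index.md`, sourceInstanceName: `pages`, internal: { type: `File` } },
  { base: `authors.json`, sourceInstanceName: `data`, internal: { type: `File` } },
]

// Keep only the files sourced by the instance named "data"
const dataFiles = nodes.filter(node => isFileFromInstance(node, `data`))
console.log(dataFiles.map(node => node.base))
```

The same check could be used inside an `onCreateNode` hook in `gatsby-node.js` to react only to files from one instance.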

### path

**Required**

Path to the folder that should be sourced. Ideally an absolute path.

They will be added to the following default list:
### ignore

**Optional**

Array of file globs to ignore. They will be added to the following default list:

```text
**/*.un~
@@ -62,8 +72,24 @@ They will be added to the following default list:
../**/dist/**
```

### fastHash

**Optional**

By default, `gatsby-source-filesystem` creates an MD5 hash of each file to determine if it has changed between sourcing. However, on sites with many large files this can lead to a significant slowdown. Thus you can enable the `fastHash` setting to use an alternative hashing mechanism.
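
To make the default behaviour concrete, here is a minimal sketch of a content-based MD5 digest using Node's built-in `crypto` module. The helper `md5OfFile` is hypothetical and is not the plugin's implementation; it reads the whole file into memory, which is exactly the kind of cost that adds up on sites with many large files.

```javascript
const crypto = require(`crypto`)
const fs = require(`fs`)
const os = require(`os`)
const path = require(`path`)

// Hypothetical helper (an assumption, not the plugin's code):
// hash the full contents of a file with MD5 and return the hex digest.
function md5OfFile(filePath) {
  return crypto
    .createHash(`md5`)
    .update(fs.readFileSync(filePath))
    .digest(`hex`)
}

// Write a temporary file, hash it, change it, and hash it again
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), `md5-sketch-`))
const tmpFile = path.join(tmpDir, `f`)
fs.writeFileSync(tmpFile, `data`)
const before = md5OfFile(tmpFile)

fs.writeFileSync(tmpFile, `changed data`)
const after = md5OfFile(tmpFile)

// The digest differs once the contents differ
console.log(before !== after)

fs.rmSync(tmpDir, { recursive: true, force: true })
```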

`fastHash` uses the `mtime` and `inode` to fingerprint the files. On a modern OS this can be considered a robust way to determine if a file has changed; on older systems, however, it can be unreliable. Therefore it's not enabled by default.
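
The idea of a metadata-based fingerprint can be sketched with `fs.statSync`. The helper `fastFingerprint` below is hypothetical, and joining `mtimeMs` and `ino` into one string is an assumption made for illustration; the plugin's exact format may differ.

```javascript
const fs = require(`fs`)
const os = require(`os`)
const path = require(`path`)

// Hypothetical helper (an assumption, not the plugin's code): build a
// fingerprint from file metadata only, without reading the contents.
function fastFingerprint(filePath) {
  const stats = fs.statSync(filePath)
  return `${stats.mtimeMs}${stats.ino}`
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), `fast-hash-sketch-`))
const file = path.join(dir, `f`)
fs.writeFileSync(file, `data`)

// Two reads of an untouched file give the same fingerprint
const first = fastFingerprint(file)
const unchanged = fastFingerprint(file)

// Simulate a later modification by moving mtime 10 seconds forward
const later = new Date(Date.now() + 10000)
fs.utimesSync(file, later, later)
const second = fastFingerprint(file)

console.log(first === unchanged, first !== second)

fs.rmSync(dir, { recursive: true, force: true })
```

Because only metadata is read, the cost does not grow with file size. The trade-off is that a change which leaves `mtime` and the inode untouched would go unnoticed.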

### Environment variables

To prevent `processRemoteNode` from being overloaded by concurrent requests, you can adjust the default of `200` concurrent downloads with the `GATSBY_CONCURRENT_DOWNLOAD` environment variable.

If some remote files still fail to download because of a spotty network or slow connection, even after multiple retries and adjusting concurrent downloads, you can adjust the timeout and retry settings with these environment variables:

- `GATSBY_STALL_RETRY_LIMIT`, default: `3`
- `GATSBY_STALL_TIMEOUT`, default: `30000`
- `GATSBY_CONNECTION_TIMEOUT`, default: `30000`
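
The sketch below shows one way such variables could be read with the documented defaults as fallbacks. The helper `readNumericEnv`, the `settings` object, and the sample `env` value are hypothetical; the plugin's own parsing may work differently.

```javascript
// Hypothetical helper (an assumption, not the plugin's code): read a
// numeric variable from an env-like object, falling back to a default.
function readNumericEnv(env, name, fallback) {
  const parsed = parseInt(env[name], 10)
  return Number.isNaN(parsed) ? fallback : parsed
}

// Stand-in for process.env, e.g. after running
// `GATSBY_STALL_TIMEOUT=60000 gatsby build`
const env = { GATSBY_STALL_TIMEOUT: `60000` }

const settings = {
  concurrentDownloads: readNumericEnv(env, `GATSBY_CONCURRENT_DOWNLOAD`, 200),
  stallRetryLimit: readNumericEnv(env, `GATSBY_STALL_RETRY_LIMIT`, 3),
  stallTimeout: readNumericEnv(env, `GATSBY_STALL_TIMEOUT`, 30000),
  connectionTimeout: readNumericEnv(env, `GATSBY_CONNECTION_TIMEOUT`, 30000),
}

console.log(settings)
```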

## How to query

You can query file nodes like the following:
@@ -263,7 +289,7 @@ The `createFileNodeFromBuffer` helper accepts a `Buffer`, caches its contents to

The name of the file can be passed to the `createFileNodeFromBuffer` helper. If no name is given, the content hash will be used to determine the name.

## Example usage
#### Example usage

The following example is adapted from the source of [`gatsby-source-mysql`](/~https://github.com/malcolm-kee/gatsby-source-mysql):

@@ -338,11 +364,3 @@ function createMySqlNodes({ name, __sql, idField, keys }, results, ctx) {

module.exports = createMySqlNodes
```

## Troubleshooting

In case that due to spotty network, or slow connection, some remote files fail to download. Even after multiple retries and adjusting concurrent downloads, you can adjust timeout and retry settings with these environment variables:

- `GATSBY_STALL_RETRY_LIMIT`, default: `3`
- `GATSBY_STALL_TIMEOUT`, default: `30000`
- `GATSBY_CONNECTION_TIMEOUT`, default: `30000`
1 change: 0 additions & 1 deletion packages/gatsby-source-filesystem/package.json
@@ -12,7 +12,6 @@
"file-type": "^16.5.4",
"fs-extra": "^11.1.0",
"gatsby-core-utils": "^4.5.0-next.0",
"md5-file": "^5.0.0",
"mime": "^3.0.0",
"pretty-bytes": "^5.6.0",
"valid-url": "^1.0.9",
236 changes: 161 additions & 75 deletions packages/gatsby-source-filesystem/src/__tests__/create-file-node.js
@@ -5,6 +5,95 @@ const fs = require(`fs-extra`)

const fsStatBak = fs.stat

const createMockCache = (get = jest.fn()) => {
  return {
    get,
    set: jest.fn(),
    directory: __dirname,
  }
}

const createMockCreateNodeId = () => {
  const createNodeId = jest.fn()
  createNodeId.mockReturnValue(`uuid-from-gatsby`)
  return createNodeId
}

// MD5 hash of the file (if the mock below changes this should change)
const fileHash = `8d777f385d3dfec8815d20f7496026dc`

// mtime + inode (if the mock below changes this should change)
const fileFastHash = `123456123456`

function testNode(node, dname, fname, contentDigest) {
  // Sanitize all filenames
  Object.keys(node).forEach(key => {
    if (typeof node[key] === `string`) {
      node[key] = node[key].replace(new RegExp(dname, `g`), `<DIR>`)
      node[key] = node[key].replace(new RegExp(fname, `g`), `<FILE>`)
    }
  })
  Object.keys(node.internal).forEach(key => {
    if (typeof node.internal[key] === `string`) {
      node.internal[key] = node.internal[key].replace(
        new RegExp(dname, `g`),
        `<DIR>`
      )
      node.internal[key] = node.internal[key].replace(
        new RegExp(fname, `g`),
        `<FILE>`
      )
    }
  })

  // Note: this snapshot should update if the mock below is changed
  expect(node).toMatchInlineSnapshot(`
    Object {
      "absolutePath": "<DIR>/f",
      "accessTime": "1970-01-01T00:02:03.456Z",
      "atime": "1970-01-01T00:02:03.456Z",
      "atimeMs": 123456,
      "base": "f",
      "birthTime": "1970-01-01T00:02:03.456Z",
      "birthtime": "1970-01-01T00:02:03.456Z",
      "birthtimeMs": 123456,
      "blksize": 123456,
      "blocks": 123456,
      "changeTime": "1970-01-01T00:02:03.456Z",
      "children": Array [],
      "ctime": "1970-01-01T00:02:03.456Z",
      "ctimeMs": 123456,
      "dev": 123456,
      "dir": "<DIR>",
      "ext": "",
      "extension": "",
      "id": "uuid-from-gatsby",
      "ino": 123456,
      "internal": Object {
        "contentDigest": "${contentDigest}",
        "description": "File \\"<DIR>/f\\"",
        "mediaType": "application/octet-stream",
        "type": "File",
      },
      "mode": 123456,
      "modifiedTime": "1970-01-01T00:02:03.456Z",
      "mtime": "1970-01-01T00:02:03.456Z",
      "mtimeMs": 123456,
      "name": "f",
      "nlink": 123456,
      "parent": null,
      "prettySize": "123 kB",
      "rdev": 123456,
      "relativeDirectory": "<DIR>",
      "relativePath": "<DIR>/f",
      "root": "",
      "size": 123456,
      "sourceInstanceName": "__PROGRAMMATIC__",
      "uid": 123456,
    }
  `)
}

// FIXME: This test needs to not use snapshots because of file differences
// and locations across users and CI systems
describe(`create-file-node`, () => {
@@ -43,93 +132,90 @@ describe(`create-file-node`, () => {
})

it(`creates a file node`, async () => {
const createNodeId = jest.fn()
createNodeId.mockReturnValue(`uuid-from-gatsby`)
const createNodeId = createMockCreateNodeId()

const cache = createMockCache()

return createFileNode(
path.resolve(`${__dirname}/fixtures/file.json`),
createNodeId,
{}
{},
cache
)
})

it(`records the shape of the node`, async () => {
const dname = fs.mkdtempSync(`gatsby-create-file-node-test`).trim()
try {
const fname = path.join(dname, `f`)
console.log(dname, fname)
fs.writeFileSync(fname, `data`)
try {
const createNodeId = jest.fn()
createNodeId.mockReturnValue(`uuid-from-gatsby`)

const node = await createFileNode(fname, createNodeId, {})

// Sanitize all filenames
Object.keys(node).forEach(key => {
if (typeof node[key] === `string`) {
node[key] = node[key].replace(new RegExp(dname, `g`), `<DIR>`)
node[key] = node[key].replace(new RegExp(fname, `g`), `<FILE>`)
}
})
Object.keys(node.internal).forEach(key => {
if (typeof node.internal[key] === `string`) {
node.internal[key] = node.internal[key].replace(
new RegExp(dname, `g`),
`<DIR>`
)
node.internal[key] = node.internal[key].replace(
new RegExp(fname, `g`),
`<FILE>`
)
}
})

// Note: this snapshot should update if the mock above is changed
expect(node).toMatchInlineSnapshot(`
Object {
"absolutePath": "<DIR>/f",
"accessTime": "1970-01-01T00:02:03.456Z",
"atime": "1970-01-01T00:02:03.456Z",
"atimeMs": 123456,
"base": "f",
"birthTime": "1970-01-01T00:02:03.456Z",
"birthtime": "1970-01-01T00:02:03.456Z",
"birthtimeMs": 123456,
"blksize": 123456,
"blocks": 123456,
"changeTime": "1970-01-01T00:02:03.456Z",
"children": Array [],
"ctime": "1970-01-01T00:02:03.456Z",
"ctimeMs": 123456,
"dev": 123456,
"dir": "<DIR>",
"ext": "",
"extension": "",
"id": "uuid-from-gatsby",
"ino": 123456,
"internal": Object {
"contentDigest": "8d777f385d3dfec8815d20f7496026dc",
"description": "File \\"<DIR>/f\\"",
"mediaType": "application/octet-stream",
"type": "File",
},
"mode": 123456,
"modifiedTime": "1970-01-01T00:02:03.456Z",
"mtime": "1970-01-01T00:02:03.456Z",
"mtimeMs": 123456,
"name": "f",
"nlink": 123456,
"parent": null,
"prettySize": "123 kB",
"rdev": 123456,
"relativeDirectory": "<DIR>",
"relativePath": "<DIR>/f",
"root": "",
"size": 123456,
"sourceInstanceName": "__PROGRAMMATIC__",
"uid": 123456,
}
`)
const createNodeId = createMockCreateNodeId()

const emptyCache = {
get: jest.fn(),
set: jest.fn(),
directory: __dirname,
}

const node = await createFileNode(fname, createNodeId, {}, emptyCache)

testNode(node, dname, fname, fileHash)
} finally {
fs.unlinkSync(fname)
}
} finally {
fs.rmdirSync(dname)
}
})

it(`records the shape of the node from cache`, async () => {
const dname = fs.mkdtempSync(`gatsby-create-file-node-test`).trim()
try {
const fname = path.join(dname, `f`)
fs.writeFileSync(fname, `data`)
try {
const createNodeId = createMockCreateNodeId()

const getFromCache = jest.fn()
getFromCache.mockReturnValue(fileHash)
const cache = createMockCache(getFromCache)

const nodeFromCache = await createFileNode(
fname,
createNodeId,
{},
cache
)

testNode(nodeFromCache, dname, fname, fileHash)
} finally {
fs.unlinkSync(fname)
}
} finally {
fs.rmdirSync(dname)
}
})

it(`records the shape of the fast hashed node`, async () => {
const dname = fs.mkdtempSync(`gatsby-create-file-node-test`).trim()
try {
const fname = path.join(dname, `f`)
fs.writeFileSync(fname, `data`)
try {
const createNodeId = createMockCreateNodeId()
const cache = createMockCache()

const nodeFastHash = await createFileNode(
fname,
createNodeId,
{
fastHash: true,
},
cache
)

testNode(nodeFastHash, dname, fname, fileFastHash)
} finally {
fs.unlinkSync(fname)
}