
Build pages from data source #5074

Closed
regisphilibert opened this issue Aug 14, 2018 · 83 comments

Comments

@regisphilibert
Member

regisphilibert commented Aug 14, 2018

Currently Hugo handles internal and external data sources with getJSON/getCSV, which is great for using a data source in a template.

But Hugo cannot take a data set of items and build a page for each of them, plus the related list pages, the way it does from the content directory files.

Here is a fresh start on speccing this important step in the roadmap.

As a user, I can only see the configuration aspect of the task.

I don’t see many configuration-related issues except for the mapping of the keys/values collected from the data source and the obvious external or internal endpoint of the data set. The following are suggestions on how users could manage those configurations, followed by a code block example.

Endpoint/URL/Local file

Depending on the use case there may be a need for one or several URLs/paths.

For many projects, not every page type (post, page, etc…) may be built from the same source. The type could be defined from a data source key or as a source parameter.

I suppose there could be other parameters per source.

Front Matter mapping

Users must be able to map the keys from the data source to Hugo’s commonly used Front Matter Variables (title, permalink, slug, taxonomies, etc…).
Every key not referenced in the mapping configuration could be stored as-is as a user-defined Front Matter param available in the .Params object, but this should not be the default as there may be way too many.

Example:

This is a realtor agency, a branch of a bigger one.

Their pages are built with Hugo's local markdown.

They have an old WordPress site whose 100+ blog posts they did not want to convert to markdown, so they load those blog posts from a local data file on top of Hugo's own local markdown posts.

They use a third-party service to create job posts when they need to fill a new position, but they want to host those job listings on their own site. Their jobs are served by https://ourjobs.com/api/client/george-and-son/jobs.json

The most important part of the website is their realty listings. They add their listings to their parent company's own website, whose API in turn serves them at https://api.mtl-realtors/listings/?branch=george-and-son&status=available

Configuration

title: George and Son (A MTL Realtors Agency)

dataSources:
  - source: data/old_site_posts.json
    contentPath: blog
    mapping: 
      Title: post_title
      Date: post_date
      Type: post_type
      Content: post_content
      Params.location.city: post_meta.city
      Params.location.country: post_meta.country

  - source: https://ourjobs.com/api/client/george-and-son/jobs.json
    contentPath: jobs
    mapping: 
      Title: job_label
      Content: job_description

  - source: https://api.mtl-realtors/listings/?branch=george-and-son&status=available
    contentPath: listings/:Type/
    grabAllFrontMatter: true
    mapping: 
      Type: amenity_kind
      Title: name
      Content: description
      Params.neighbourhood: geo.neighbour
      Params.city: geo.city

Content structure

This results in a content "shadow" structure. Solid-line dirs/files are local, while dashed ones are remote.

content
├── _index.md
├── about.md
├── contact.md
├── blog
│     ├─── happy-halloween.md
│     ├─── merry-christmas.md
│     ├- - nice-summer
│     └- - hello-world
├- -listings
│     ├- - appartment
│     │   ├- - opportunity-studio
│     │   ├- - mile-end-condo
│     │   └- - downtown-tower-1
│     └- - house
│         └- - cottage-green
└- - jobs
      ├- - marketing-director
      └- - accountant-internship
@bep
Member

bep commented Aug 14, 2018

Thanks for starting this discussion. I suspect we have to go some rounds on this to get to where we want.

Yes, we need field mapping. But when I thought about this problem, I imagined something more than a 1:1 mapping between an article with a permalink and some content in Hugo. I have thought about it as content adapters. I think it even helps to think of the current filesystem as a filesystem Hugo content adapter.

So, if this is how it looks on disk:

content
├── _index.md
├── blog
│   └── first-post
│       ├── index.md
│       └── sunset.jpg
└── logo.png

What would the above look like if the data source was JSON or XML? Or even WordPress?

It should, of course, be possible to set the URL "per post" (like it is in content files), but it should also be possible to be part of the content section tree, with permalink config per section, translations, etc. So, when you have one content dir + some other data sources, it ends up as one merged view.
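A minimal sketch of how that merged view might be configured, assuming a hypothetical dataSources key sitting next to Hugo's existing per-section permalinks config (the dataSources part is illustrative only, not an existing option):

permalinks:
  blog: /blog/:year/:slug/
  recipes: /recipes/:slug/

dataSources:
  - source: https://example.com/api/recipes.json
    section: recipes    # hypothetical: items behave as if they lived under content/recipes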

@regisphilibert
Member Author

regisphilibert commented Aug 14, 2018

As most data sources are usually a flat list of items, I suppose building the content directory structure will require some more mapping.

There are the type and section keys to be used, as well as maybe others, which would help position the item in the content structure.
There could also be a url source parameter designed the same way as the global config one, except it would take one of the mapped keys as a pattern (I'll update my example after this):

url: /:Section/:Title/

I suppose there is no way around having many source configuration params/mappings which Hugo may need in order to best adapt the data source to the desired structure. Maybe even some pattern/regex/glob will be needed, like the url suggestion above.

As for the default structure: if there is no configured data source with a type parameter of blog, then Hugo will build it from content, and the rest would be built from data sources (supposing we have a Page Bundle toggle and media mapping). See this real content merged with the data source "phantom" structure:

content
├── _index.md
├── blog
│   └── first-post
│       ├── index.md
│       └── sunset.jpg
├ - - - recipe  (from data source)
│        └- - - first-recipe
│               ├ - - - index                
│               └ - - - cake-frosting.jpg
└── logo.png

@regisphilibert
Member Author

regisphilibert commented Aug 16, 2018

@bep now I understand more fully what you meant (I think). The config needs to tell Hugo how to model the content structure so it can build its pages from that.
In a sense we are not building pages from a data source; we are building a content structure from both local content and remote data sources, which Hugo will interpret and build pages from.

To reflect this, I added a better project example to the description to illustrate both the configuration possibilities and the resulting "content" structure.
It is a project we can add to in order to better spec what this feature should achieve.

@bep
Member

bep commented Nov 2, 2018

@regisphilibert I have been thinking about this, and I think the challenge with all of this isn't the mapping (we can experiment until we get a "working and good looking scheme"), but more the practical workflow -- esp. how to handle state/updates.

  • As an editor, I would love it if my site (including content) was as static as possible at commit time (v1.3.0 of Hugo Times is this).
  • That is, if I, the editor, looked at the Netlify preview on GitHub and pushed merge, I would be sadly disappointed if I then ended up with something completely different.
  • I think this is an often overlooked quality of static sites: Versioned content.

I understand that in a dynamic world with JS APIs etc., the above will not be entirely true, always. But it should be a core requirement whenever possible.

A person in another thread mentioned GatsbyJS's create-source-plugin.

I don't think their approach of emulating the file system is a good match for Hugo, but I'm more curious about how they pull in data.

Ensure local data is synced with its source and 100% accurate. If your source allows you to add an updatedSince query (or something similar) you can store the last time you fetched data using setPluginStatus.

This is me guessing a little, but if I commit my GatsbyJS site with some create-source-plugin sources to GitHub and build on Netlify, those sources will be pulled completely on every build (which I guess is also sub-optimal in the performance department). I suspect setPluginStatus is a local thing and updatedSince is a way to speed up local development.

Given the above assumptions, the Gatsby approach does not meet the "static content" criteria above. I'm not sure how they can assure that the data is "100% accurate", but the important part here is that you have no way of knowing if the source has changed.

So, I was tinkering with:

  1. Adding an sqlite3 database as a "build cache"
  2. Adding a "prepare step" (somehow) that exports the non-file content sources out into a merge-friendly text format (i.e. consistent ordering etc.)

The output of 2) is what we use to build the final site.

There are probably some practical holes in the above. But there are several upsides. sqlite3 has some very interesting features (which could enable more cool stuff), so if you wanted to make that the "master", you could probably edit your data directly in the DB, and you could probably drop the "flat file format" and put your DB into source control ... This is me thinking out loud a little.

@regisphilibert
Member Author

That is, if I, the editor, looked at the Netlify preview on GitHub and pushed merge, I would be sadly disappointed if I then ended up with something completely different

I'm not sure about this, and I apologize in advance if my lack of understanding of the technology/feature at hand biases my view.

I guess most of the use cases for this will involve using Contentful or the WordPress REST API or Firebase to manage your content, and letting Hugo build the site from this remote source plus maybe a few other ones (remote and local).
In this use case, the editor will not see markdown and probably not the Netlify preview or that merge button, only the Contentful or WordPress dashboard, and will create/edit their content from there.
When a new page is published out of the draft zone, the editor will expect it to be visible on the site with little regard to the repo status. On bigger sites where several editors work at the same time, Hugo's build speed will help make sure the website can be "refreshed" often enough to keep up with content editing.

But this does not change the fact that we need caching, and a way to efficiently tell the difference between the cached source and the remote one.

In order to handle the "when" (by this I mean the decision between calling the remote source or using the cached one), I was thinking about a per-source setting indicating at which rate it should be checked.
If the setting is one hour, then Hugo would check the cached source's time and, if it is older than one hour, call the remote. It would then use and cache the remote source only if it differs from the cached one (maybe using a hash to compare cached vs remote?).
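A rough sketch of what such a per-source setting could look like, reusing the sources from the description above (the cacheTTL name is purely hypothetical):

dataSources:
  - source: https://api.mtl-realtors/listings/?branch=george-and-son&status=available
    contentPath: listings
    cacheTTL: 1h    # hypothetical: call the remote at most once per hour, otherwise use the cache
  - source: data/old_site_posts.json
    contentPath: blog
    cacheTTL: 0     # local file: always read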

I'm not sure I understand the process described with sqlite3. Would this mean having a database inside Hugo? 🤔

@bep
Member

bep commented Nov 2, 2018

My talk about "database etc." clutters the discussion. This process cannot be stateless/dumb, was my main point. With 10 remote resources, some of them possibly out of your control, you (or I) would want some kind of control over:

  1. If it should be published.
  2. Possibly also when it should be published.

None of the above allows for a simple "pull and push". So, if you do your builds on a CI server (Netlify), but do your editing on your local PC, that state must be handled somehow so Netlify knows ... what. Note that the answer to 1) and 2) could possibly be to "publish everything, always", if that's your cup of tea.

@regisphilibert
Member Author

regisphilibert commented Nov 2, 2018

Note that the answer to 1) and 2) could possibly be to "publish everything, always", if that's your cup of tea.

Yeah, maybe some people want that or default to it, but offering more control is definitely a must-have I think.

So, if you do your builds on a CI server (Netlify), but do your editing on your local PC, that state must be handled somehow so Netlify knows ...

True, but I didn't really see it as Hugo's business. In my mind, a CI pipeline would have to be put in place above Hugo.
So when the source is edited (using Contentful or another service), the CI is notified and can run something like hugo --fetch-source="contentful".

Or simple cronjobs (I don't know what to call those in the modern JAMstack) could be set up so the website is built every hour with hugo --fetch-source="contentful" and every day with hugo --fetch-source="events,weather".
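For illustration, such a scheduled build could live in the CI configuration rather than in Hugo itself. A hypothetical GitHub Actions sketch (the --fetch-source flag is only the proposal above, not an existing Hugo flag, and Hugo is assumed to be installed on the runner):

# .github/workflows/scheduled-build.yml (illustrative only)
on:
  schedule:
    - cron: '0 * * * *'    # rebuild every hour
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # hypothetical flag from this proposal; only the "contentful" source is re-fetched
      - run: hugo --fetch-source="contentful"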

@bep
Member

bep commented Nov 3, 2018

OK, I'm possibly overthinking it (at least for a v1 of this). But for the above to work at speed and for big sites, you need a proper cache you can depend on. I notice the GatsbyJS WordPress plugin says "this should work for any number of posts", but if you want this to work for your 10K WP blog, you really need to avoid pulling down everything all the time. I will investigate this vs Netlify and CircleCI.

@regisphilibert
Member Author

but if you want this to work for your 10K WP blog, you really need to avoid pulling down everything all the time

Yes. Time is of the essence!
I can't imagine how long Gatsby would take to build a 10K WP blog considering it already takes 18s to build the hello-world starter kit.

And this is precisely why big content projects want to turn to Hugo.

@regisphilibert
Member Author

regisphilibert commented Nov 22, 2018

After spending time playing with the friendly competition and its data source solutions, it becomes apparent that one of the biggest challenges of the current issue (now that Front Matter mapping will be taken care of by #5455) will be how the user can define everything Hugo needs to know in order to:

  1. efficiently connect to a remote or local data source,
  2. retrieve the desired data,
  3. and merge it into its content system (path etc...).

3 will be unique to each project and potentially to each source.
On the other hand, 1 and 2 will be, for the most part, constant for many data sources, like the WordPress API or Contentful.
For example, for a source of type WordPress REST API, Hugo will always use the same set of endpoints plus a few custom ones potentially added by the user.
It will also systematically use the same parameter to fetch paginated items.

We could group the settings of 1 and 2 into one Data Source Type (DST).
Then, in line with Output Formats and Media Types, any newly defined Data Source could use X or Y Data Source Type.

This way any DST could be potentially:

  • Reusable within one project without repeating the same lengthy settings (e.g., 2 different WordPress APIs for one website)
  • Shared among users as setting files.
  • Built-in

Rough example of DataSourceType/DataSources settings:

DataSourceTypes:
  - name: wordpress
    endpoint_base: wp-json/v2/
    endpoints: ['posts', 'page', 'listings']
    pagination: true
    pagination_param: page=:page
    [...]

DataSources:
  - source: https://api.wordpress.blog.com/
    type: wordpress
    contentPath: blog/
    [...]

@bwklein
Contributor

bwklein commented Dec 2, 2018

I wanted to throw this into the discussion because it's a demonstration of how I generated temporary .md files from two merged sets of JSON data (Google Sheets API). These .md files are only generated and used during compilation and are not saved into the repository.

https://www.bryanklein.com/blog/hugo-python-gsheets-oh-my/

This is a fairly simple script, but you can see that I needed to filter the data source and map the 2 source JSON data sets to front matter parameters per page.

@regisphilibert
Member Author

@bwklein Thanks for this very informative input but... this belongs in a "tips and tricks" thread on the Discourse forum, which could mention this issue. Not the other way around :)

PS: It really belongs there; people would love to read it, I'm sure.

@itwars

itwars commented Dec 30, 2018

@regisphilibert I'm looking for exactly the same thing!
Having JSON parts in a page is simple, but generating posts from JSON... Headless CMS to Hugo 👍

@bep bep added the Enhancement label Jan 2, 2019
@bep bep added this to the v0.54 milestone Jan 2, 2019
bep added commits to bep/hugo that referenced this issue Jan 2–3, 2019
@bep bep added this to the v0.119.0 milestone Sep 15, 2023
@marceloverdijk

marceloverdijk commented Sep 28, 2023

Maybe good to link this issue on the Hugo roadmap page?
And link it to #6310 as well.

@bep bep modified the milestones: v0.119.0, v0.120.0 Oct 4, 2023
@alainbuysse

In my mind, the amazing Regis has solved this, and for my company it's not an issue any more. I give the strongest recommendation to Regis's article on the subject on The New Dynamic. It's the "monster spotting" example.

https://www.thenewdynamic.com/article/toward-using-a-headless-cms-with-hugo-part-2-building-from-remote-api/

My team implemented this on 50+ sites. It's no more complicated than Forestry ever was. We can set up a modest site in about a day or less. It's massively flexible, which we value, and provides both a "built" solution in line with the use case of Hugo as well as a strong CMS-agnostic integration path.

We trigger builds either from our CMS on change, using a Netlify build hook, or we have a manual "publish" action which curls the webhook. Many clients seem to prefer this idea of "wait and publish all at once" as opposed to every change being instant.

Anyway. Regis solved it. It's not a problem any more. I promise, just follow his lead. This is the path.

I used this solution too and it worked perfectly!
I am using it together with Directus and Netlify... love it!
Thank you!

@bep bep modified the milestones: v0.120.0, v0.121.0 Oct 31, 2023
@despens

despens commented Nov 6, 2023

Regis' solution with a "nested" Hugo project is quite amazing and saved me lots of time. 🙏 Properly integrating pages generated from data into Hugo would be great, because running two instances of Hugo during development is very error-prone. 😅

@bep bep modified the milestones: v0.121.0, v0.122.0 Dec 6, 2023
@bep bep modified the milestones: v0.122.0, v0.123.0, v0.124.0 Jan 27, 2024
@bep bep modified the milestones: v0.124.0, v0.125.0 Mar 4, 2024
@bep bep modified the milestones: v0.125.0, v0.126.0 Apr 23, 2024
bep added commits to bep/hugo that referenced this issue May 13–14, 2024
@bep bep closed this as completed in e2d66e3 May 14, 2024

github-actions bot commented Jun 5, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2024