# apub2gmi.py

This script takes an archive exported from a Mastodon account, finds the media attachments in it, and uses them to build an archive for a Gemini server.

I use it to update the [Glossatory archives](gemini://gemini.mikelynch.org/glossatory/) from my account [@GLOSSATORY](https://weirder.earth/@GLOSSATORY).

It builds a hierarchy of folders with the structure YYYY/MM/DD and copies
media attachments into the appropriate day's folder.
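
As a rough illustration of the folder layout, here is a minimal sketch of how a day's folder could be derived from a post's timestamp. It assumes the timestamp is an ISO 8601 string such as `2021-03-05T09:30:00Z`. The function name `day_folder` and the output root `gemini_out` are made up for this example and are not taken from the script.

```python
from datetime import datetime
from pathlib import Path


def day_folder(output_root, published):
    """Return output_root/YYYY/MM/DD for an ISO 8601 timestamp string."""
    # Only the date part (the first 10 characters) is needed for the folders.
    date = datetime.strptime(published[:10], "%Y-%m-%d")
    return (
        Path(output_root)
        / f"{date.year:04d}"
        / f"{date.month:02d}"
        / f"{date.day:02d}"
    )


folder = day_folder("gemini_out", "2021-03-05T09:30:00Z")
print(folder.as_posix())
```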

It also adds index.gmi files at each level. Index files at the top and year
levels have links to the next level down. Index files at the month level have
links to all of the attachments for that month.
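
To give an idea of what a month-level index might contain, the sketch below builds gemtext with one link line per attachment. Gemini link lines start with `=>`, followed by the URL and optional link text. The function name `month_index`, the heading format and the sample entry are assumptions for illustration, so the real script's output may look different.

```python
def month_index(title, entries):
    """Build index.gmi text from (relative_path, link_text) pairs."""
    lines = [f"# {title}", ""]
    for path, text in entries:
        # A gemtext link line: "=>", the URL, then the human-readable text.
        lines.append(f"=> {path} {text}")
    return "\n".join(lines) + "\n"


page = month_index("2021-03", [("05/9e2423c3ffd70dd0.jpeg", "BILLET")])
print(page)
```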

It assumes that there is only one media attachment per post. If it finds a post with more than one attachment, it copies only the first and issues a warning.
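
The behaviour described above could look something like the following sketch. It assumes each post is a dict with an `attachment` list, as in ActivityPub JSON. The function name `first_attachment` and the logger name are hypothetical, and the script's own warning message may differ.

```python
import logging

logger = logging.getLogger("apub2gmi_sketch")  # hypothetical logger name


def first_attachment(post):
    """Return the first attachment of a post, or None if it has none."""
    attachments = post.get("attachment", [])
    if not attachments:
        return None
    if len(attachments) > 1:
        # Only the first attachment is used; the rest are skipped.
        logger.warning(
            "Post %s has %d attachments; using the first",
            post.get("id"),
            len(attachments),
        )
    return attachments[0]
```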

The alt-text is used as the title of each image in the month-level index file. If you only want to use part of the alt-text, you can provide a list of regular expressions in the config file; these are matched against the alt-text to pick out the part to use.

## Usage

    python apub2gmi.py --archive PATH_OF_YOUR_ACTIVITYPUB_ARCHIVE/ --output GEMINI_OUTPUT --config CONFIG.json [--text OPTIONAL_COLOPHON_TEXT] [--debug]


## Config

The configuration file is JSON as follows:

	{
		"url_re": "^/YOUR_SERVERS_MEDIA_ATTACHMENT_PATH/(.*)$",
		"title_res": [
			"^(.*?)\\.\\s*(.*)$"
		]
	}


`url_re` should match the URLs of media attachments in the ActivityPub JSON. The pattern depends on your server. Here's an example attachment from the GLOSSATORY archive:


	"attachment":
	[
	    {
	        "type": "Document",
	        "mediaType": "image/jpeg",
	        "url": "/weirderearth/media_attachments/files/105/839/131/582/626/008/original/9e2423c3ffd70dd0.jpeg",
	        "name": "BILLET: an unmarried working person (often used for making tying) The drawing depicts a person seated at a bench tying knots in a long cord.",
	        "blurhash": "U2Ss50M{~qt7-;t7IUt7_3-;RjM{RjD%-;WB",
	        "width": 1280,
	        "height": 1280
	    }
	],
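
As a quick check of a `url_re` pattern, the sketch below matches the example URL above. The prefix `/weirderearth/media_attachments/files/` used in place of `YOUR_SERVERS_MEDIA_ATTACHMENT_PATH` is an assumption based on that URL. The sketch only shows what the `(.*)` group captures and makes no claim about how the script uses the captured value.

```python
import re

# Assumed pattern for the example server; adjust the prefix for your own.
url_re = re.compile(r"^/weirderearth/media_attachments/files/(.*)$")

url = (
    "/weirderearth/media_attachments/files/"
    "105/839/131/582/626/008/original/9e2423c3ffd70dd0.jpeg"
)

match = url_re.match(url)
# The group holds everything after the server-specific prefix.
captured = match.group(1) if match else None
print(captured)
```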


`title_res` is optional. It is a list of Python regular expressions which are matched against the alt-text. The text used for the index page is taken from the `()` group or groups of the first regular expression that matches. If that expression has more than one group, the results are joined with spaces.
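
The matching rule can be sketched roughly as follows. The function name `extract_title` and the sample alt-text are made up for illustration. The use of `re.match` and the fallback to the full alt-text when nothing matches are assumptions, not confirmed behaviour of the script.

```python
import re


def extract_title(alt_text, title_res):
    """Return the groups of the first matching pattern, joined with spaces."""
    for pattern in title_res:
        match = re.match(pattern, alt_text)
        if match:
            # Skip groups that did not take part in the match.
            groups = [g for g in match.groups() if g is not None]
            return " ".join(groups)
    # Assumed fallback: use the whole alt-text if no pattern matches.
    return alt_text


title_res = [r"^(.*?)\.\s*(.*)$"]
print(extract_title("WORD: a definition. A drawing.", title_res))
```

With the pattern from the config example, the first group stops at the first full stop and the second group takes the rest, so the two parts are joined with a single space.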

Comments, questions or issues? Let me know at [@mikelynch@aus.social](https://aus.social/@mikelynch)