I made a new thing: a website for making blackout poetry with over nine million chunks of text extracted from Project Gutenberg. It's here at (LINK https://blackout.tilde.town blackout.tilde.town) .
(IMG https://tilde.town/~vilmibm/blackout.png a screenshot of a blackout poem that reads: the picturesque decay remains an idea of the beautiful)
Ever since (LINK https://tilde.town/~kc ~kc) posted (LINK https://tilde.town/~kc/blackout this page) I've been inspired by blackout poetry. I wanted an interface not only for doing it, but for giving me novel text to work with as well.
I used Project Gutenberg's (LINK https://gutenberg.org/policy/robot_access.html robot access instructions) to get about 12 gigabytes of compressed plaintext English-language books. That translated to about 35,000 books once duplicate encodings were ignored.
(LINK https://git.tilde.town/vilmibm/gutchunk This code) , gutchunk, uncompressed the books and combed through them for what I'm calling "chunks." I was looking for meaty sections of text that would make for good blackout poetry fodder. My approach is fairly naive. I store text in a buffer until I see two newlines, then check if I have enough in the buffer; if I do, I cut a chunk. If I don't, I discard it.
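Here's a rough sketch of that buffer-and-cut idea in Python. It is not the gutchunk code itself, just an illustration of the heuristic as described above. The name `cut_chunks` and the 500 character threshold are made up for the example, and I'm assuming "discard" means throwing away the too-short buffer and starting fresh.

```python
from typing import Iterable, Iterator, List

# Hypothetical threshold: the real minimum chunk size is not stated in this post.
MIN_CHUNK_CHARS = 500


def cut_chunks(lines: Iterable[str], min_chars: int = MIN_CHUNK_CHARS) -> Iterator[str]:
    """Yield paragraph-sized chunks, dropping paragraphs that are too short."""
    buffer: List[str] = []
    for line in lines:
        if line.strip():
            # Non-blank line: keep accumulating text.
            buffer.append(line.strip())
            continue
        # A blank line means two newlines in a row, i.e. a paragraph boundary.
        text = " ".join(buffer)
        buffer = []
        if len(text) >= min_chars:
            yield text
        # Otherwise the short buffer (a heading, a caption, ...) is discarded.
    # End of input: apply the same check to whatever is left over.
    text = " ".join(buffer)
    if len(text) >= min_chars:
        yield text
```

Fed the lines of a book, this skips things like a lone "CHAPTER I" and keeps only the paragraphs that clear the threshold, which is also why short bits of dialogue fall through.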
To my extreme pleasure I ended up with over nine million chunks. This is all sitting in a sqlite3 database on the town and if you're reading this and are also a townie, let me know if you want access to it.
When I was working on (LINK https://github.com/vilmibm/prosaic prosaic) over the years I got a lot of junk from my sloppy parsing of Gutenberg books. I was young and silly and not writing great code then. I was also afflicted with this perverse need to ingest ALL of the text into my cut-up corpora. I got a lot of cruft: chapter headings, tables of contents, captions, and similar. So far I've pulled well over a hundred of my nine million chunks and they all look quite good. My simple heuristic avoided a lot of the noise that I get when running prosaic. Of course, I'm missing some text: short bits of dialogue, for example. This kind of thing would have haunted me in the past, but now knowing that mystery remains in these books feels good. I don't like finding (LINK https://tilde.town/~vilmibm/swamp the bottom of the swamp) .
If you're interested, the code for blackout.tilde.town is also up on (LINK https://git.tilde.town/vilmibm/blackout our gitea) .
There is no way to iterate over the chunks; you get a random one every single page load. Given the size of the ID space, this should mean a very small chance of repeats. I wanted an experience like (LINK https://en.wikipedia.org/wiki/The_Library_of_Babel the Library of Babel) , one of wandering and digging up scraps to scrawl upon.
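For the curious, one way to serve a random chunk per page load from a sqlite3 database might look like the sketch below. This is not the blackout code. The `chunks` table, its `id` and `text` columns, and the `random_chunk` function are all assumptions made for the example.

```python
import random
import sqlite3
from typing import Optional


def random_chunk(conn: sqlite3.Connection) -> Optional[str]:
    """Return the text of one randomly chosen chunk, or None if the table is empty.

    Assumes a hypothetical table: chunks(id INTEGER PRIMARY KEY, text TEXT).
    """
    max_id = conn.execute("SELECT MAX(id) FROM chunks").fetchone()[0]
    if max_id is None:
        return None
    # Pick an ID anywhere in the ID space instead of sorting the whole table.
    chunk_id = random.randint(1, max_id)
    # ">=" with LIMIT 1 tolerates gaps in the IDs (e.g. deleted rows).
    row = conn.execute(
        "SELECT text FROM chunks WHERE id >= ? ORDER BY id LIMIT 1",
        (chunk_id,),
    ).fetchone()
    return row[0]
```

Picking an ID first avoids `ORDER BY RANDOM()`, which would have to shuffle all nine million rows on every request. If the IDs have gaps, a row that follows a gap comes up slightly more often, which seems fine for wandering.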
I'm hosting this decidedly personal project on tilde.town because I felt like it was a nice fit for our community. It's also my house and I can do whatever, though I try not to have that mindset too often.
I may also make an SSH-hosted text-mode version. I haven't decided.
I've already been really pleased with the experience of making poems using the new site and hope you like it, too. Please let me know on (LINK https://tiny.tilde.website/@vilmibm Mastodon) or wherever if you're making stuff with it.