What we learned about managing a Mastodon server in the first year of hcommons.social

A network of blue, green, and white icons

Written by Dimitris Tzouris, Infrastructure Developer for Humanities Commons

In November, we celebrated one year since the launch of hcommons.social, our Mastodon instance. As of today, more than two thousand members have registered, with new ones joining daily.

Mastodon is a free, non-profit and open source social network and microblogging service that can be self-hosted. It is part of the fediverse, a collection of federated services that communicate and interact with each other using the ActivityPub protocol.

The impetus that set hcommons.social in motion was the rapid downturn Twitter took after its change of ownership in late 2022, which led many users to leave the service. It was a turbulent time for social media, with connections breaking and people losing networks that had taken years to build and foster. hcommons.social was born to provide a safe alternative, not just for academics and scholars, but for anyone with an active interest in research and education looking for refuge – a place to begin again. As people flocked to various other social networks, hcommons.social started as a trusted shelter for anybody looking to reconnect with their online peers in a safe space, and down the road it has become so much more. What follows is a look behind the curtain at how hcommons.social came to be and what goes into running and maintaining it.

Not the smoothest rollout

For us, rolling out the new service came with some bumps. We started out using a pre-built version of Mastodon, provided as a one-click install by a cloud computing service called DigitalOcean. Unfortunately, that version of Mastodon was old, and we ran into all kinds of problems when we tried to update it. It turned out that everything the service was running on was outdated, so we had to do a complete rebuild from the ground up. Migrating the database from the original version to a newer one was extremely challenging, but thanks to Steve Ramsey’s support, we got it stabilized. By that point, we had chosen to switch to a Mastodon fork called Hometown, whose author, Darius Kazemi, provided some key help. What made Hometown stand out for us is its ability to restrict publication of a post to just the hcommons.social community: followers registered on other Mastodon instances cannot see such a post, and hcommons.social users cannot repost it on other instances. This provides an additional layer of safety and protection for our users.

The lag

As new members joined daily and the total number of registered and active members kept increasing, people using Mastodon started experiencing slowdowns. With the situation at Twitter becoming more dire week after week, the influx of new users made this a regular challenge. The cause of the lag was the default configuration of a service called Sidekiq. Sidekiq is an open-source service that handles scheduled tasks running in the background. Mastodon uses it to send emails, push updates to other instances, forward replies, refresh trending hashtags, and so on. These tasks are placed in different queues, which can be handled by separate processes based on the type of task. Each process has a number of threads, which makes it possible to run tasks in parallel.

By default, Sidekiq was set up with only one process handling all queues using 25 threads. This meant that all background tasks were handled by the same process, thus creating a bottleneck.
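For readers running their own instance, the stock setup looks roughly like this – a sketch only, since the exact service definition varies by install and Mastodon normally reads the queue list from config/sidekiq.yml rather than taking it on the command line:

```shell
# Default: a single Sidekiq process with 25 threads polling every queue.
# (Illustrative invocation; queue names are Mastodon's standard queues.)
bundle exec sidekiq -c 25 -q default -q push -q ingress -q mailers -q pull -q scheduler
```

With one process behind every queue, a burst of federation traffic can delay unrelated work such as outgoing email.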

A screenshot of the Sidekiq dashboard showing one process listing for hcommons.social

This caused lag when there was increased activity, sometimes resulting in posts being pushed hours late. Users’ feeds were slow to update, showing posts from much earlier.

To deal with that, we reconfigured Sidekiq. The new setup puts the scheduler and mailer queues in their own process and adds three other processes that handle all other queues in parallel. Each process now has 10 threads, which is more than enough for our existing database infrastructure.
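A split like this can be sketched as follows (service files reduced to their launch commands; the queue grouping is illustrative rather than our exact configuration):

```shell
# One process dedicated to scheduling and outgoing mail:
bundle exec sidekiq -c 10 -q scheduler -q mailers

# Three identical worker processes sharing the remaining queues,
# so heavy federation traffic no longer starves everything else:
bundle exec sidekiq -c 10 -q default -q push -q ingress -q pull
bundle exec sidekiq -c 10 -q default -q push -q ingress -q pull
bundle exec sidekiq -c 10 -q default -q push -q ingress -q pull
```

Separating the scheduler and mailer from the bulk queues means a backlog of federation jobs can no longer delay time-sensitive tasks.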

A screenshot of the Sidekiq dashboard showing four process listings for hcommons.social

Which brings us to another major issue.

The database

At first, a 30-GB managed database on DigitalOcean sounded like more than enough storage, along with additional object storage for files. Well, 12 months and 2,050 users later, our database storage had ballooned to 42 GB. The reason for this rapid growth is the way the fediverse works: when people follow accounts on other Mastodon servers, the local instance caches their posts, along with all the attached media. To deal with the increased storage demands, we did two things:

  • Each day, a shell script run by a cron job (a tool that executes commands at specified time intervals) cleans up about 15 GB of old cached media on the server. This recovers some space on the server.
  • We periodically run pg_repack (a PostgreSQL extension for table maintenance) on the database to reduce the size of the tables. This has only helped us regain a limited amount of storage space.
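Both tasks can be sketched as follows – the retention window, schedule, paths, and table list here are illustrative, not our exact script:

```shell
# 1. Nightly cron entry: remove cached remote media older than seven days.
0 3 * * * cd /home/mastodon/live && RAILS_ENV=production bin/tootctl media remove --days 7

# 2. Run occasionally by hand: repack the largest tables to reclaim disk space.
pg_repack --dbname=mastodon_production --table=statuses --table=media_attachments
```

Note that pg_repack rebuilds tables online, without the long exclusive locks a VACUUM FULL would take, which matters on a live instance.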

This is what the disk usage chart looked like after we ran pg_repack on the ten largest tables of our 60-GB database and got disk usage down from 69% to 51%.

A screenshot of a disk usage chart showing where storage bumps were hit within the system

Since then, we have had to bump the storage twice, first to 80 GB and, a few months later, to 100 GB. We are currently using 52% of that database storage space; the statuses table alone has grown to 19 GB. We also had to reduce the database’s temporary file limit in order to contain the data growth.
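On a managed database, that limit is usually adjusted through the provider’s control panel or API; on a self-hosted PostgreSQL server, an equivalent change could be sketched like this (the 5GB value is illustrative, not the limit we actually set):

```shell
# Cap how much temporary file space a single session may consume,
# then reload the configuration (requires superuser on self-hosted Postgres):
psql -d mastodon_production -c "ALTER SYSTEM SET temp_file_limit = '5GB';"
psql -d mastodon_production -c "SELECT pg_reload_conf();"
```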

Apart from all that maintenance and tuning, we’ve been steadily improving hcommons.social by applying the regular Hometown updates. Our latest update was in late September, and we’re super excited for the next one, which will introduce Mastodon 4.2 features, including a revamped experience with full-text search of posts.

All of this work was not easy, but the Mastodon community was there to help along the way. Mastodon, as an alternative to commercial social media platforms, is only viable thanks to its users and supporters. By supporting Mastodon, we have access to the Discord server, where admins and developers can share ideas and help each other. We are thankful to them and we are committed to contributing to the community as we move forward.

What’s next?

Last year, as part of our #GivingTuesday campaign, we asked for funds to support site hosting, site maintenance, and establishing a moderator community for hcommons.social. Over the course of the year, as you can see, a lot of technical work has gone towards supporting our server. So far, moderation has not needed the same amount of attention, as there has not been a large influx of reports even as hcommons.social has continued to grow. Our current moderation process relies on internal review within our small team, and although we regularly receive and respond to reports, to date we have not sought additional support on this front. To be frank, establishing and supporting a moderation team would require more work than it takes to handle things as we do now, and as a small team with big dreams we have to be judicious in where we deploy our resources. However, as our team continues to monitor and assess the number of reports we receive on a regular basis, we will pursue additional moderation options as we see fit. We count on your trust to make these decisions, and if you see signs that our current system isn’t meeting your expectations, we very much hope you’ll let us know. Our DMs are always open.

What are your thoughts on how we’re handling moderation? If you think we should bring in more external voices, let us know! 

Open Infrastructures and the Future of Knowledge Production, part 2

In my last post, I unpacked some of the reasons why open infrastructures matter for the future of knowledge production, and I talked a bit about how Humanities Commons and hcommons.social strive to live out their principles of community governance that truly open infrastructure requires. But I ended on a less cheerleadery note: We aren’t a perfect alternative to the corporate platforms by which we’re surrounded. And this is where we need to dig down into the dirty underside of digital infrastructure. As Deb Chachra points out, the term “infrastructure” literally points to those systems that are hidden, in our walls, under our floors, and buried underground. If we are going to mitigate the inequities created by and sustained through our infrastructures, we have to get busy unearthing those systems and finding ways to build new ones. 

And so: We need to take a hard look at the fact that the infrastructure that Humanities Commons is built upon is AWS, or Amazon Web Services. As you might guess from the name, AWS is part of the Greater Jeff Bezos Empire, and every dollar that we spend to host with them helps to keep that empire running. And run it does! Amazon’s revenue derived from AWS passed $80 billion-with-a-b in 2022, and as of August 2023, AWS hosted 42 percent of the top 100,000 websites, and 25 percent of the top one million (ironically enough including BuiltWith, the site from which these data are made available).

Why has Amazon become such a powerful force in web hosting and cloud computing? Largely because they provide not just servers but a powerful and wide-ranging suite of tools that help folks like us make our platform available, keep it stable and secure, and scale it with enormous flexibility. AWS provides connected equipment and tools that would be more than a full-time job for someone to maintain in-house, it enables redundancy and global reach at speed, and it’s relatively easy to manage.

So… it works for us, just as it works for 42,000 of the top 100,000 websites across the internet. But I’m not happy about it. It’s not just that I hate feeding more money into the Bezos empire every month, but that I know for certain that our values and Bezos’s do not align. And every so often I have to stop and ask myself: how much good does it do for us to build pathways of escape from the extractive clutches of Elsevier and Springer Nature, only to have those pathways deliver us all into the gaping maw of Amazon?

AWS has a stranglehold on web-based platforms of our size, as we’re too complicated for a server kept under the desk, too big for a smaller hosting service, and too small for our own data center. And if you don’t want to deal with the risks and costs involved in owning and operating the metal yourself, there just aren’t many alternatives, and certainly not many good ones.

Our host institution, Michigan State University, like most institutions its size, operates both a large-scale data center through our central IT unit and a high-performance computing center under the aegis of the office of research and innovation. The latter can’t really help us, as it’s focused pretty exclusively on computational uses and not at all on service hosting. And the former comes with a suite of restrictions and regulations in terms of access and security – pretty understandably so, given recent attacks and exploits such as the one that caused our neighbor to the east to disconnect the entire campus from the internet on the first day of classes – but nevertheless restrictions that make it impossible for us to be flexible enough with our work.

In fact, central IT strongly encourages projects like ours to make use of cloud computing, given the complexity of our needs and the risk-averseness of the campus. And we have our pick! AWS, Microsoft’s Azure, and Google Cloud Platform.

I just can’t help but think that it’s a Bad Thing for academic and nonprofit services like ours – services that are working to be open, and public, and values aligned with our communities – to be dependent upon Silicon Valley megacorps for our very presence. We need alternatives. Real alternatives. And I fear that we’re going to have to invent them, because as the example of open access publishing demonstrates, waiting to see what commercial providers come up with is certain to increase our lock-in, and increase the level of resources they extract from our campuses.

So what might it look like if our infrastructure for the future of knowledge production and dissemination was community-led all the way down? What might enable the Commons to leave AWS behind and instead contribute our resources to supporting a truly shared, openly governed, not-for-profit cloud service? Could such a service be collaborative, with all member research institutions and organizations paying into a shared, professionally staffed data center?

King’s College London and Jisc think so – they established the first collaborative research data center in the world nine years ago, precisely in order to help UK institutions achieve economies of scale, to increase energy efficiency, and to reduce costs. Of course, it’s a lot easier to get all the UK institutions of higher education on board with such a centralized initiative, partly because there are fewer of them and partly because they are all centrally funded.

But what if Internet2, for instance, instead of restricting its areas of interest to networking and protocols, and instead of offering to connect member institutions with corporate cloud services, instead provided a real alternative – one that was not just developed for the academic community but that would be governed by that community? What if each member institution or organization agreed to contribute its existing infrastructure, along with its annual maintenance budget, to a shared, distributed, community-owned cloud computing center? Could excess capacity then be offered at reasonable prices to other nonprofit institutions or organizations or projects like mine, in a way that might entice them away from the Silicon Valley megacorps? Would our institutions, our libraries, our publishers, and our many other web-based projects find themselves with better control over their futures?

None of what I’m suggesting here would be easy, and a lot of the questions I’ve just asked fall – at least for the moment – into the realm of the pipe dream. But if we were to be willing to press forward with them, we might find ourselves in a world in which the scholarly communication infrastructures on which we build, develop, design, and publish our work can help us foster rather than hinder social and epistemic justice, can empower communities of practice by centering their needs and their work to meet them, and can enable trustworthy community governance and decision-making in support of truly open, public, shared infrastructures for the future of knowledge production.

Open Infrastructures and the Future of Knowledge Production, part 1

I’ve been thinking a good bit lately about the ways that the future of knowledge production depends upon the openness of the infrastructures that support our work. For a lot of people, the word “infrastructure” triggers a yawn reflex, and not without reason. As Deb Chachra points out in her brilliant new book, How Infrastructure Works, the best thing that infrastructure can do is remain invisible and just work. But as Chachra also argues, the shape of our entire culture is dependent on our infrastructure, and where inequities are part of those systems’ engineering, they constrain the ways that culture can evolve. Infrastructure matters enormously, and the scholarly communication infrastructures on which we build, develop, design, and publish our work have deep implications for our abilities to foster social and epistemic justice in our knowledge production and communication practices, to empower communities of practice and their concerns in the development and dissemination of knowledge, and to enable trustworthy governance and decision-making that is led by the communities that our publications and platforms are intended to serve. Our team is far from alone in thinking about these questions right now. We’re seeing the idea of “open infrastructure” pop up a lot lately, in no small part because folks are recognizing that a commitment to open, public infrastructures is necessary to ensure that scholarly communication can become actually equitable.

What do I mean by “actually equitable”? How might that sense of equity intersect with the aims of the open-access movement? Over the last twenty-plus years that movement has worked to transform scholarly communication, arguing in part that if our work could be read more openly by anyone, it might both have more impact on the world at large and create a more equitable knowledge environment. It’s of course true that open access in its many present flavors has done a lot to make more research available to be read online. But the movement toward open access began as a means of attempting to break the stranglehold that a few extractive corporate publishers have established over the research and publishing process – and in that, it hasn’t succeeded. The last decade in particular has revealed all of the resilience with which capital responds to challenges, as those corporate publishers have in fact become more profitable than ever. Not only have they figured out how to exploit article processing charges in order to make some work published in their journals openly available while continuing to charge libraries for subscriptions to the journals as a whole, but they’ve also developed whole new business plans like the so-called “read and publish” agreements that keep many institutions tied to them, and they’ve developed new platforms and infrastructures like discovery engines and research information management systems that serve to increase corporate lock-in over the work produced on campus.

For all these reasons, the 20th anniversary statement of the Budapest Open Access Initiative took on a slightly different focus, noting that “OA is not an end in itself, but a means to other ends, above all, to the equity, quality, usability, and sustainability of research.” In order to achieve those ends, the statement proposes several key recommendations – and chief among them?

Host OA research on open infrastructure. Host and publish OA texts, data, metadata, code, and other digital research outputs on open, community-controlled infrastructure. Use infrastructure that minimizes the risk of future access restrictions or control by commercial organizations. Where open infrastructure is not yet adequate for current needs, develop it further.

This recommendation recognizes that the control of the infrastructure by profit-seeking entities cements inequities – and this is true even where the large corporate publishers purport to create opportunities for the disadvantaged by offering fee waivers and discounts on their publishing charges. Those discounts only serve to normalize a culture in which it is considered correct for those who produce knowledge to pay corporations to host and circulate it.

What scholarly communication needs today, more than anything, is a broad-based sense of accountability to scholars and fields and institutions rather than shareholders. Hence the call in the 20th anniversary Budapest statement for hosting open access research on open infrastructure: infrastructure that is led by us, and accountable to us.

This is the fundamental orientation and driving purpose of Humanities Commons. Our goal is to provide a non-extractive, community-led and transparently governed alternative to commercial platforms. We also want to encourage our users to rethink the purposes and the dynamics of publishing altogether, in ways that might allow for the development of new, open, collective, equitable processes of creating and sharing knowledge that re-center agency over the ways that scholarly work develops and circulates with the scholars themselves. As a result, we have put in place a participatory governance structure that enables both individual users and our institutional sustaining members to have a voice in the project’s future, and we have developed network policies that emphasize inclusion and openness. We are committed to transparency in our finances, and most importantly to remaining not-for-profit in perpetuity.

We are also working to build and sustain the kinds of new platforms and services that will allow for rich conversations among members of our community and between that community and the rest of the world. A year ago, seeing the handwriting on the wall for the platform formerly known as Twitter (and frankly having suffered through quite a number of unhappy years there before the beginning of the end), we launched hcommons.social, a Hometown-flavored Mastodon instance, in the hopes of providing a collegial, community-oriented space for informal communication among scholars and practitioners everywhere. We currently have more than 2000 users on our instance who are connecting with users throughout the Fediverse, and we support those users through a strong moderation policy and code of conduct. We also work to ensure that new policies and processes are discussed with that community before they’re implemented.

This kind of openness matters enormously, not just to ensure that we’re living up to the values that we’ve established for our projects, but to ensure that there’s a worthwhile future for them. Cory Doctorow has written extensively of late about what he has famously called the “enshittification” of the internet, a process in which value is sucked out of the community and into the pockets of shareholders. Users are left with no control over the platform, or the content they’ve provided to it. And this, he notes in a post on the new corporate platforms seeking to replace Twitter, remains true even if their C-suite is populated by good actors, because they’re still walled gardens.

The problem with walled gardens is partly about their ownership, but largely about their governance. It’s not just that the owners of any particular proprietary network might turn out to be racist, fascist megalomaniacs – it’s that we have no control if and when they do. Choosing open platforms means that we as users have a say in the future of the plots of ground we choose to develop. This is especially true for the kind of work, like knowledge production, that is intended to have a public benefit. It’s incumbent on us to ensure that those gardens aren’t walled, that they don’t just have a gate that management may one day decide to unlock to let select folks in or out. Rather, our gardens must be open from the start, open to connect and cultivate in the ways that we as a community decide.

As Doctorow notes, Mastodon is far from perfect, and as much as I love our own instance, hcommons.social is far from perfect. But we’re doing our best to ensure that we’re running it in the open. And operating in the open, both for the Commons and for hcommons.social, means for us that we are accountable to our users and responsible for safeguarding the openness of their work. Together, those two ideals undergird our commitment to provide alternatives to the many platforms that purport to make scholarly work more accessible but in fact serve as mechanisms of corporate data capture, extracting value from creators and institutions for private rather than public gain.

But, as I note, we aren’t a perfect solution to the problems of corporate control in scholarly communication. More on why in my next post.