Zero-Cost News Monitoring System



Panda reviewing documents (image: ChatGPT)

You won’t believe me if I say that you can build a media monitoring system entirely from freely available resources that keeps running for years.

Part of my hobby is collecting data. I don’t know where I picked up that habit, but I’m quite satisfied when I have a large amount of information in one place. Lately, I’ve been thinking of building a system to collect data automatically. It has some advantages.

There are media researchers who extract useful information from daily news and prepare datasets from it. Human rights watch groups, in particular, often do this work. If they don’t have the technology, they need many people to collect and verify the news. Watching these news items every single day and extracting the useful ones is tedious. Automation would save a lot of time and energy.

Whenever I build a system, I start very small. I have built dozens of projects, and if I had to pay for all of them, I would be broke by now. Therefore, I try to use freely available resources as much as I can. The good thing is that you don’t need to rely on funding, and you don’t need to worry about the future when funding sources dry up: your project will survive on its own. Of course, there is a downside; people tend to assume these projects cost nothing, are unimpressed by their size, and often underestimate their effectiveness. The labor and ideas needed to operate at minimal cost are not actually free. Someone always has to invest out of their own pocket.

Ok, that’s enough introduction. Let’s focus on today’s topic: my media monitoring project. I used simple, freely available tools to operate the whole system.

The idea is this: I want to monitor news websites and Telegram channels. Sometimes I miss the news; sometimes I need a collection of items to mine for data. And I am quite a busy person, so I can’t spare time to watch it every day. Besides, this isn’t my main job.

That’s why I used a good old method called RSS (Really Simple Syndication). This technology has been around since the late 1990s. Generally, you can get news from a website through a stable feed URL. Not all websites support RSS, though. Later, I also found a way to get RSS feeds from Telegram channels.

When I have the source link for an RSS feed, I use Google Sheets to store the news. Every news item delivered through the RSS feed is stored in the sheet with its title, description, date, image, and so on.

Ok, I have the link and a place to store the data. Google Sheets already has a built-in formula (IMPORTFEED) that pulls an RSS feed into the sheet. But the formula overwrites the articles when new ones arrive. I needed a new strategy to fetch all new items and append them as new records in Google Sheets, so I wrote a Google Apps Script to do this, with the help of ChatGPT.
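The append-only strategy can be sketched in plain JavaScript. This is my own illustration, not the author’s actual script: the function name, the item shape, and the use of the item link as the deduplication key are all assumptions.

```javascript
// Sketch of the append-only strategy: formula-style pulls overwrite the
// range, so instead we diff incoming feed items against what is already
// stored and append only the new ones. Names and shapes are illustrative.
function appendNewItems(savedRows, feedItems) {
  // Use the item link as a stable identifier for deduplication.
  const seen = new Set(savedRows.map((row) => row.link));
  const fresh = feedItems.filter((item) => !seen.has(item.link));
  // In Apps Script this would be sheet.appendRow(...) per new item;
  // here we just return the combined record list.
  return savedRows.concat(fresh);
}

// Example: one item is already stored, one is new.
const stored = [{ title: "Old story", link: "https://example.com/a" }];
const incoming = [
  { title: "Old story", link: "https://example.com/a" },
  { title: "New story", link: "https://example.com/b" },
];
const result = appendNewItems(stored, incoming);
console.log(result.length); // 2 — only the new item was appended
```

The key point is that the sheet only ever grows; re-fetching a feed never duplicates records that were already saved.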

After some trial and error, I found a way to do that. There are some minor adjustments in the function: cleaning the article formatting, normalizing the date-time format, skipping feed articles that are already saved, and so on. Each feed carries at most 20 articles. I set up a time-driven trigger in Apps Script so the script runs every hour; I guess that’s more than enough to get articles from the feed. When I monitor lots of feeds, timeouts sometimes happen. That’s a trade-off. I can increase the frequency so the script runs every 30 minutes: the more frequently the script runs, the fewer articles it has to process each run.
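The date normalization step could look something like this. The function name and the choice of ISO 8601 as the target format are my own assumptions; the point is simply to collapse the varied date strings feeds emit into one sortable format.

```javascript
// Sketch of normalizing the varied date strings RSS feeds emit
// (RFC 822 pubDate strings, ISO timestamps, ...) into one format.
// Function name and target format are illustrative.
function normalizeDate(raw) {
  const parsed = new Date(raw);
  // Fall back to the raw string if the date cannot be parsed.
  if (isNaN(parsed.getTime())) return raw;
  // Store as UTC ISO 8601 so rows sort chronologically even as text.
  return parsed.toISOString();
}

console.log(normalizeDate("Mon, 01 Jan 2024 10:00:00 GMT"));
// -> "2024-01-01T10:00:00.000Z"
```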

Alright, we have set up the feed, done some cleaning work, and added a trigger. The next step is managing data volume. Google Sheets has limits; each month, thousands of new records accumulate in the sheet. The spreadsheet gets slower as the volume grows, and at some point it would become unusable and eventually break. I definitely don’t want to go back and fix that every time it happens.

I planned to do auto-archiving. There is only one main spreadsheet that monitors the news. In the first hour of the first day of each month, another scheduled Google Apps Script runs. It creates a new spreadsheet named archive_sheetname_month_year and copies every record from the previous month into it. When everything has been transferred, it cleans up the main spreadsheet.
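The archive naming described above can be sketched as a small helper. The helper name and the exact month/year formatting are my own guesses at the archive_sheetname_month_year pattern; only note that since the job runs in the first hour of the month, the month being archived is the previous one.

```javascript
// Sketch of computing the monthly archive spreadsheet name
// (archive_sheetname_month_year) from the run date. Helper name
// and formatting details are illustrative.
function archiveName(baseName, runDate) {
  // The job runs at the start of the month, so archive the previous one.
  const prev = new Date(runDate.getFullYear(), runDate.getMonth() - 1, 1);
  const month = prev.toLocaleString("en-US", { month: "long" }).toLowerCase();
  return `archive_${baseName}_${month}_${prev.getFullYear()}`;
}

console.log(archiveName("news", new Date(2024, 0, 1)));
// -> "archive_news_december_2023"
```

Note that the year boundary is handled for free: archiving on January 1st correctly lands in the previous year’s December archive.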

That way, the main spreadsheet restarts from zero records each month, with a backup copy of each previous month saved in a separate spreadsheet.

I believe this system is truly zero-cost, with no server or service fees. I used Google’s officially supported services within their free quotas, and the archive sheets can create and maintain themselves for years.

I also created a Telegram bot to reshare the RSS news to a channel, so ordinary readers can enjoy the news while the records are archived in the spreadsheets.
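Resharing to a channel boils down to one call to the Telegram Bot API’s sendMessage method. The request shape below follows the public Bot API; the token, channel name, helper name, and message layout are placeholders of my own, not the author’s bot.

```javascript
// Sketch of building a Telegram Bot API sendMessage request for one
// feed item. Token, channel, and helper names are placeholders.
function buildSendMessage(botToken, channel, item) {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    payload: {
      chat_id: channel, // e.g. "@my_news_channel" for a public channel
      text: `${item.title}\n${item.link}`,
    },
  };
}

// In Apps Script this object would be POSTed with UrlFetchApp.fetch();
// here we just inspect the request that would be sent.
const req = buildSendMessage("123:ABC", "@my_news_channel", {
  title: "New story",
  link: "https://example.com/b",
});
console.log(req.payload.chat_id); // "@my_news_channel"
```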

I designed the workflow, and ChatGPT helped me implement it. Kudos to OpenAI. This kind of efficient, self-sustaining system is small but can run for years without additional effort.

If you enjoyed this, you can buy me a coffee. The source code is shared on GitHub:

https://github.com/nchanko/gsheet_rss_scraper



