
The term "unicorn" is sometimes used to describe people like Zia Ahmad, whose professional versatility makes them a rare find. A back-end developer experienced in web scraping. A data scientist who is also an AI engineer, contributing to the improvement of Google's Gemini. Yet, Zia says that his true passion is teaching and has over 40 published courses to prove it. Zia teaches on a variety of topics, with a primary focus on data science, web scraping, and AI applications.
This eclectic mix of topics will also come together in his presentation, titled "The AI-Scraper Loop: How Machine Learning Improves Web Scraping (and Vice Versa)," which he will give at OxyCon 2025, a web scraping conference organized by Oxylabs.
After a very versatile career journey, you have landed at Turing. Could you tell us more about your current work?
One day, I got a call from Turing saying they wanted me to work on a Google project. That was really exciting because I was already a user of Gemini, and now I would get to work on it myself, which was a great opportunity.
I'm responsible for the aspect of Gemini that deals with data analysis, data science, machine learning, and AI questions. When you feed Gemini an Excel or CSV file and ask it to generate insights, build a machine learning model, or calculate the average of a column, we actively monitor Gemini's responses, analyze what it's doing right or wrong, write reports, and submit them. We also suggest improvements to the back-end engineers based on our findings.
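To give a rough sense of the kind of task being reviewed, here is a hypothetical illustration in Python; the data, column names, and values are invented, and this is not the actual evaluation tooling.

```python
# Hypothetical example of the kind of task a reviewer might check:
# given an uploaded CSV, does the model's answer for a column average hold up?
import pandas as pd

# Sample sales data standing in for a user-uploaded file
df = pd.DataFrame({"region": ["NA", "EU", "APAC"], "revenue": [1200.0, 950.0, 1430.0]})

# The model's generated analysis would be expected to produce something like this
average_revenue = df["revenue"].mean()
print(f"Average revenue: {average_revenue:.2f}")  # compared against the model's reported value
```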
You also teach many online courses. Do you have a signature or dream course?
One dream course I still want to create is on the very topic I'll be speaking about at OxyCon—how we can merge artificial intelligence and web scraping. These are the two areas I've worked in professionally, and there's a sweet spot where these two fields can complement each other beautifully. And I've found that there are very few "bridge courses" that connect two technologies like this.
The topic of web scraping and AI—especially web scraping for AI—is currently quite controversial. There's a lot of tension between content creators, publishers, and AI companies that need data to train models. What's your view on this?
My view is this: if the data is publicly available, there shouldn't be any question about scraping it. Whether it's an article, product info, or public medical data—if it's viewable manually, it should be scrapable.
For example, consider an e-commerce company that lists product details publicly on its website. That information is already out there for anyone to see. Scraping it just makes the process faster and more efficient. If I didn't scrape it, I could hire 100 people to manually go through the site and collect the same information. That's also doable—so why not automate it?
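As a rough illustration of that automation, here is a minimal Python sketch; the URL and CSS selectors are placeholders, and any real scraper should respect the target site's terms of service and robots.txt.

```python
# Minimal sketch: collect publicly listed product names and prices.
# The URL and CSS selectors are placeholders; real sites differ.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for card in soup.select(".product-card"):
    name = card.select_one(".product-name").get_text(strip=True)
    price = card.select_one(".product-price").get_text(strip=True)
    print(name, price)
```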
However, if the data is private—for example, stored in someone's Google Drive—then it absolutely shouldn't be scraped. In fact, even manual access to private data without permission shouldn't be allowed.
That's the boundary I draw. Public data, which anyone can access, should be fair game for scraping. Private data, which is not openly available, requires explicit consent.

Turning to your presentation at OxyCon, could you give us a sneak peek? What are you planning to share with the audience?
AI and web scraping can form a loop. I'll explain how that loop could work, where data scraped from the web helps train AI models, and how those models, in turn, improve scraping. I'll cover both the benefits and the potential downsides of this feedback cycle.
While not everything about it is practically feasible just yet, it's an exciting concept. I'll be discussing what's possible today, what might come in the future, what the blockers are, and how we might overcome them. I also run a small business focused on data labelling and annotation. So I'll talk about how data annotation fits into this loop.
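For readers who want a concrete picture of the loop Zia describes, here is a conceptual Python outline; every function is a stub, so this illustrates the idea rather than a working pipeline.

```python
# Conceptual outline of the AI-scraper feedback loop described above.
# Every function here is a stub; a real pipeline would plug in an actual
# scraper, an annotation step, and a model-training framework.

def scrape(extraction_model=None):
    """Collect pages; a trained model can help locate and extract fields."""
    return ["<html>...</html>"]  # placeholder scraped documents

def annotate(raw_data):
    """Human (or AI-assisted) labelling that prepares training examples."""
    return [{"html": doc, "fields": {}} for doc in raw_data]

def train(labelled_data, previous_model=None):
    """Train or fine-tune an extraction model on the labelled data."""
    return object()  # placeholder model

model = None
for iteration in range(3):
    raw = scrape(extraction_model=model)            # model-assisted scraping
    labelled = annotate(raw)                        # validation by domain experts
    model = train(labelled, previous_model=model)   # better model -> better scraping next round
```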
Why isn't it feasible now? Is it because artificial intelligence is not at that level at the moment?
Yes. No matter how intelligent your AI agent is, it's still going to make mistakes. And if there's no one monitoring the process, you can end up in a vicious loop where those mistakes multiply at an exponential rate. That kind of growth in error is really dangerous.
There needs to be a domain expert involved. For example, if the data is related to medical science, then a medical expert should validate both the input coming from web scraping into AI and the output going back from AI into the scraping system.
When you think about the scale of that—having domain experts label all the data going in and out of AI models—it sounds massive.
Of course. That's one of the biggest challenges in data labelling—you need a lot of people. There has to be validation at both ends—before data goes in and after it comes out.
When I say "domain expert," it doesn't always mean someone with a PhD. For example, if we're working with traffic image data, a regular person can label that. In that context, they are the domain expert.
However, in other cases—like annotating MRI scans or X-rays—we do need medical professionals. That's very expensive. The same applies to financial documents—we need experts to annotate those too, and again, it costs a lot. Wherever real people are involved, the cost goes up.
You mentioned that your own business is related to solving this problem. Could you expand on that?
Yes. I run a data annotation company called Prism Soft. We have contracts with AI companies, especially those working in computer vision. We receive large volumes of image-based or video-based data—sometimes even text or audio—and our job is to annotate it. That means we prepare the data so it's ready to be fed into AI models.
Before we implemented our AI-assisted workflows, everything was done manually. For example, if there's an image with 20 cars, and the client wants bounding boxes drawn around each car, someone would have to draw all 20 boxes by hand. That takes time, and time means money. When you're working with millions of images, the costs skyrocket.
That's exactly what we've worked to solve, and with considerable effort, we were able to automate roughly 60% of the process. I think that's the best we can achieve for now, given current technologies.
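One common way this kind of pre-annotation can be automated is to let a pretrained detector propose bounding boxes that annotators then review and correct. The sketch below uses an off-the-shelf torchvision model and a placeholder file name; it is a generic illustration, not Prism Soft's actual pipeline.

```python
# Generic pre-annotation sketch: a pretrained detector proposes bounding boxes,
# so human annotators review and correct them instead of drawing each one by hand.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()

image = read_image("street_scene.jpg")  # placeholder file name
with torch.no_grad():
    prediction = model([preprocess(image)])[0]

# Keep only confident detections as proposed annotations for human review
for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score >= 0.8:
        print(weights.meta["categories"][label.item()], [round(v, 1) for v in box.tolist()])
```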
However, we're working on increasing that percentage, and the pace of AI development is truly remarkable. If I predict something will happen in five years, it might actually happen in five months. You never know. There are so many factors involved—data, computational power, and engineering capabilities.
The speed of AI development makes people uneasy. Some fear we might soon lose control over AI. What is your opinion on this?
That's actually the most frequently asked question. Whenever I'm on a flight and someone asks what I do, the follow-up is usually, "Do you think the Terminator scenario will come true? Will AI destroy us?"
If you interview 10 AI experts, you'll likely get 10 different opinions on this. But the world is already taking steps to control AI. One important development is something called XAI, or explainable AI. When new models come to market, they're trained not to respond to harmful or inappropriate questions.
When we train a model like Gemini, the first question is always: "Is it safe?" Safety is a top priority. Large, reputable organizations put a lot of focus on this. Smaller players may not, but they often don't have the data or computational power to create models that could be dangerous at scale.
That said—again, a potentially controversial point—we don't know what military or intelligence agencies in powerful countries are doing with AI. If we can build models like ChatGPT or Gemini, then obviously, AI can also be used in ways that raise serious ethical concerns.
Another source of concern is the changing job market. People fear that their jobs will be taken from them.
AI has already transformed many industries. Think about content creation. We used to have teams of ten content writers. Now, one person can do the job by asking AI to write the content and then simply proofread it. AI isn't taking jobs away; it's shifting the job market. And this has happened throughout history.
New technologies create new types of jobs, and those who don't evolve with the changes risk being left behind. So yes, this is just history repeating itself—but at a much faster pace.
I think evolution is the only thing that can keep us relevant. If I sit still and think my current skill set is enough, there's a 99.9% chance I'll be replaced. If not by AI, then by someone more up-to-date with new technologies.
Even AI itself is constantly evolving. It's not the same as it was in 2015. There are tons of new models being developed, and AI is now one of the most active research areas globally. That means new technologies are being introduced daily. If you don't keep up, you will fall behind.
Some say that AI is developing so fast now that even if it stopped evolving for a whole year, we'd still struggle to catch up with everything. Do you think we even have a chance?
That's true. AI is evolving so quickly that even now, I find myself learning about tools and models that were released two years ago. The sheer volume of development is massive—there are only 24 hours in a day, and it's hard to keep up.
I think the key is to stay focused on your area. Whatever your profession is—whether you're a content creator, developer, or analyst—just make a habit of exploring AI tools relevant to your field once a week.
Some people believe, although I don't agree, that AI won't continue evolving at the same pace because it has already consumed most of the high-quality, human-generated data available. The argument goes: now that AI has absorbed the historical data, there's not as much new, meaningful information being produced, especially if most of the new data is AI-generated.
Why do you think that's not the case?
Because every day, we see a new AI model that's better than the one that came out yesterday. Especially in areas like agentic AI—models that can perform daily tasks—we're seeing rapid improvement. If AI can start handling routine tasks for employees, that's a significant leap.
So if we've already fed AI the last 100 years of history, feeding it again doesn't help. Now we're just adding new data one day at a time. And while that's happening slowly, AI is still getting better.
So what's still driving this rapid improvement?
Engineering. Every day, researchers are coming up with new techniques for how to train AI using the same data in smarter ways. Think of it like a classroom: there's one teacher and 40 students, but each student learns differently. Some will score 95%, others 70%, and some maybe 30%. Everyone has a different learning technique.
Same with AI—there's the same data, but with different architectures, training strategies, and optimization techniques, you can extract new value. Some approaches require powerful hardware, such as GPUs and TPUs. Others are more hardware-efficient. The techniques are constantly evolving, which is what drives AI forward.
So the big question now is: how long can we keep coming up with better, more revolutionary models? And my answer is—there are six or seven billion people in the world, and that means there are six or seven billion possibilities.