For our last DeepCrawl webinar, we were joined by Hamlet Batista, CEO of RankSense, who sat down with Jon Myers to talk about automating the generation of quality text for image alt tags and page descriptions. During the webinar with Hamlet, so many brilliant questions were submitted by our audience that there wasn’t time to answer all of them live.
What’s your preferred source for finding new/interesting papers?
In addition to Papers with Code, the Stanford CS224n Class shares really interesting papers at the end of each class. Follow them and their professor on Twitter. There is also a public list of past papers.
What is the minimum viable number of images that makes using Pythia worthwhile compared to manually doing this?
Pythia doesn’t need you to provide training images in order to generate descriptions. You can simply clone one of the demos, say the image captioning one, run through all the cells, and provide your own images at the end.
The captions you get from this approach will be generic, as the models are generally trained on the Microsoft COCO dataset. You’d want to train on your own images and captions in order to personalize the generation so it is more specific to your site or your clients.
I love research stuff and new ideas coming each week, but have a huge problem with generating business value from those solutions. Do you know who (or which organization) might directly benefit from automatic image captioning or text summarization (which are of course not perfect)?
Here is one example of how we adapt this technique to drive value for our ecommerce clients:
We have trained image captioning models on product images and the corresponding five-star review summaries. The review summaries play the role of the captions here, but because they generally express why the user bought the product, they are benefit-driven.
Academic researchers lack the business context to make a powerful connection like this one. But, if you learn these skills you can make these connections and find a lot of business value. The research and academic papers alone won’t cut it for you.
Training a captioning model on your own dataset is the same process as training one on a generic, open dataset like COCO. The best part is that you can use transfer learning to build your model on top of the one trained on COCO.
Have you tried to use MonkeyLearn?
No. There is a lot of value in easy-to-use, built-for-you AI tools.
I don’t use them because I don’t know what they are doing on their end. I don’t know how to improve or adapt their methods or combine them with other ideas/processes.
They limit my ability to find novel solutions to my most painful problems. But, as I said, they have their place and solve many predefined problems.
What’s the best resource you would recommend for a non-coder to learn Python?
This is the best training for SEOs that want to get started on Python for data science – Faster Data Science Education
Do you think Google teams can detect text generation?
Yes. There is already a tool that helps detect generated text.
I think generating text that helps users shouldn’t get you in trouble with Google. Google is one of the companies pushing the hardest to get everyone into AI. Now, if you use text generation to create garbage, useless pages, you will definitely have a problem.
Can this be extended to other languages? I’ve recently started using a deep learning based translator to translate English content to German and the translator is off more often than it’s on point.
Yes. I haven’t tried this in other languages, but it is definitely possible.
The way deep learning for NLP works is you first replace the words with numbers and create a vocabulary.
I typically see vocabularies of 30-50k words. After that step is done, it is all numbers to the algorithms. The predicted numbers need to be converted back to words using the same vocabulary. So, in your case, the words in your vocabulary need to be in the language you expect. This means you need to have the source training examples in that language.
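To make the words-to-numbers step concrete, here is a minimal sketch in plain Python. It is not the tokenizer of any particular library: the function names (build_vocab, encode, decode), the whitespace splitting, and the two German sentences are all assumptions made up for illustration. It only shows that the same vocabulary maps words to numbers and back, and that words missing from the vocabulary cannot be recovered.

```python
from collections import Counter

def build_vocab(sentences, max_size=50000):
    """Map each word to an integer id (hypothetical helper, whitespace tokens only)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Id 0 is reserved for words that are not in the vocabulary.
    id_to_word = ["<unk>"] + [w for w, _ in counts.most_common(max_size - 1)]
    word_to_id = {w: i for i, w in enumerate(id_to_word)}
    return word_to_id, id_to_word

def encode(sentence, word_to_id):
    """Replace every word with its id; unknown words become 0."""
    return [word_to_id.get(w, 0) for w in sentence.lower().split()]

def decode(ids, id_to_word):
    """Convert ids back to words using the same vocabulary."""
    return " ".join(id_to_word[i] for i in ids)

# Illustrative German training sentences (made up for this sketch).
german = ["der hund läuft im park", "die katze schläft im haus"]
word_to_id, id_to_word = build_vocab(german)

ids = encode("der hund schläft im park", word_to_id)
round_trip = decode(ids, id_to_word)

# English words are not in a vocabulary built from German text.
english_ids = encode("the dog", word_to_id)
```

Because the vocabulary was built from German sentences only, the English input maps entirely to the unknown id, which is why the training examples have to be in the language you expect to generate.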
What do you think about GPT-2?
I haven’t spent too much time on GPT-2 yet, primarily because its results, while impressive, don’t have a lot of practical use for me at the moment. But I’m sure that might change in the near future.
Let me quote their main GitHub page to illustrate my main problem.
“The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.”
You mentioned user reviews can be a source for generating a description of the product they are about. If each review is short, say 100 characters, how many reviews do you think we need to generate good descriptions?
If the reviews are short but diverse, where each reviewer focuses on a different aspect of the experience, you could combine them to generate longer summaries. Your system first needs to group the reviews that are similar to get this to work.
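As a rough illustration of the grouping step, the sketch below clusters short reviews greedily by word overlap (Jaccard similarity). This is a stand-in chosen for simplicity, not the method used in the webinar; a production system would more likely compare sentence embeddings. The function names, the 0.3 threshold, and the sample reviews are all hypothetical.

```python
def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_reviews(reviews, threshold=0.3):
    """Greedy grouping: a review joins the first group whose first
    review overlaps with it enough, otherwise it starts a new group."""
    groups = []  # list of (token set of the group's first review, members)
    for review in reviews:
        tokens = set(review.lower().split())
        for representative, members in groups:
            if jaccard(tokens, representative) >= threshold:
                members.append(review)
                break
        else:
            groups.append((tokens, [review]))
    return [members for _, members in groups]

# Made-up sample reviews covering two different aspects.
reviews = [
    "battery life is great",
    "great battery life",
    "shipping was fast",
    "fast shipping",
]
groups = group_reviews(reviews)
```

Each resulting group covers one aspect of the product, so the reviews inside a group can be concatenated and passed to a summarizer to produce one sentence per aspect.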
In terms of SEO, will it help if we generate multiple text summaries from the product pages within the website and put the output on the main category page? Will it boost the category page rankings even though there is some amount of internal duplication?
This is similar to how the Wikipedia example I shared works. So, it should work in theory. You could use abstractive summarization, where the summary contains novel sentences, to avoid any concerns about duplicate content. I don’t know if it would boost the page rankings, though. You’d need to set up an experiment and try.
Get started with DeepCrawl
If you’re interested in learning how DeepCrawl can help by finding missing metadata and crawling images to export for captioning, why not get started with a DeepCrawl account today?